-**The Featurization Service depends on a `Featurization Specification` that contains**:
-- `featurizationName`: A name for the featurization task.
+The Featurization Service depends on a **Featurization Specification**:

-- `projectName`: The associated project name. You can use `project.name` here
+{{< tip >}}
+The descriptions of the specifications on this page indicate the Python data types,
+but you can substitute them as follows for a schema description in terms of JSON:
+
+| Python  | JSON   |
+|:--------|:-------|
+| `dict`  | object |
+| `list`  | array  |
+| `int`   | number |
+| `float` | number |
+| `str`   | string |
+{{< /tip >}}
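The type mapping in the tip above can be checked with Python's standard `json` module. The following is a minimal sketch, not part of the change itself; the keys `collections` and `threshold` are hypothetical and appear only to cover the `list` and `float` rows of the table.

```python
import json

# Hypothetical fragment: one value per row of the Python/JSON table.
spec_fragment = {
    "jobConfiguration": {"batchSize": 32},  # dict -> object, int -> number
    "collections": ["Actor", "Event"],      # list -> array (hypothetical key)
    "threshold": 0.5,                       # float -> number (hypothetical key)
    "featurePrefix": "feat_",               # str -> string
}

# Serialize to JSON text and parse it back into Python objects.
encoded = json.dumps(spec_fragment)
decoded = json.loads(encoded)

# None, True, and False are written as null, true, and false in JSON.
defaults_as_json = json.dumps({"featureSetID": None, "skipLabels": False})
```

The round trip returns an equal dictionary, so a specification written with Python types can be read as a JSON schema description without further conversion.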
+
+- `databaseName` (str): The database name the source data is in.
+
+- `featurizationName` (str): A name for the featurization task.
+
+- `projectName` (str): The associated project name. You can use `project.name` here
 if it was created or retrieved as described above.

-- `graphName`: The associated graph name that exists within the database.
+- `graphName` (str): The associated graph name that exists within the database.

-- `featureSetID` Optional: The ID of an existing Feature Set to re-use. If provided, the `metagraph` dictionary can be ommitted. Defaults to `None`.
+- `featureSetID` (str, _optional_): The ID of an existing Feature Set to re-use. If provided, the `metagraph` dictionary can be omitted. Defaults to `None`.

-- `featurizationConfiguration` Optional: The optional default configuration to be applied
+- `featurizationConfiguration` (dict, _optional_): The optional default configuration to be applied
 across all features. Individual collection feature settings override this option.

-- `featurePrefix`: The prefix to be applied to all individual features generated. Default is `feat_`.
+- `featurePrefix` (str): The prefix to be applied to all individual features generated. Default is `feat_`.

-- `outputName`: Adjust the default feature name. This can be any valid ArangoDB attribute name. Defaults to `x`.
+- `outputName` (str): Adjust the default feature name. This can be any valid ArangoDB attribute name. Defaults to `x`.
+- `disabled` (bool): Whether to disable dimensionality reduction. Default is `false`,
 therefore dimensionality reduction is applied after Featurization by default.
-- `size`: The number of dimensions to reduce the feature length to. Default is `512`.
-
-- `defaultsPerFeatureType`: A dictionary mapping each feature to how missing or mismatched values should be handled. The keys of this dictionary are the features, and the values are sub-dictionaries with the following keys:
-- `missing`: A sub-dictionary detailing how missing values should be handled.
-- `strategy`: The strategy to use for missing values. Options include `REPLACE` or `RAISE`.
-- `replacement`: The value to replace missing values with. Only needed if `strategy` is `REPLACE`.
-- `mismatch`: A sub-dictionary detailing how mismatched values should be handled.
-- `strategy`: The strategy to use for mismatched values. Options include `REPLACE`, `RAISE`, `COERCE_REPLACE`, or `COERCE_RAISE`.
-- `replacement`: The value to replace mismatched values with. Only needed if `strategy` is `REPLACE`, or `COERCE_REPLACE`.
-
-- `jobConfiguration` Optional: A set of configurations that are applied to the job.
-- `batchSize`: The number of documents to process in a single batch. Default is `32`.
-- `runAnalysisChecks`: Whether to run analysis checks, used to perform a high-level analysis of the data quality before proceeding. Default is `true`.
-- `skipLabels`: Skips the featurization process for attributes marked as `label`. Default is `false`.
-- `useFeatureStore`: Enables the use of the Feature Store database, which allows you to store features separately from your Source Database. Default is `false`, therefore features are written to the source graph.
-- `overwriteFSGraph`: Whether to overwrite the Feature Store Graph if features were previously generated. Default is `false`, therefore features are written to an existing Feature Store Graph.s
-- `writeToSourceGraph`: Whether to store the generated features on the Source Graph. Default is `true`.
-
-- `metagraph`: Metadata to represent the node & edge collections of the graph.
-- `vertexCollections`: A dictionary mapping the node collection names to the following values:
-- `features`: A dictionary mapping document properties to the following values:
-- `featureType`: The type of feature. Options include `text`, `category`, `numeric`, or `label`.
-- `featurePrefix`: Identical to global `featurePrefix` but for this collection.
-- `dimensionalityReduction`: Identical to global `dimensionalityReduction` but for this collection.
-- `outputName`: Identical to global `outputName`, but specifically for this collection.
-- `defaultsPerFeatureType`: Identical to global `defaultsPerFeatureType`, but specifically for this collection.
-- `edgeCollections`: A dictionary mapping the edge collection names to an empty dictionary, as edge attributes are not currently supported.
+- `size` (int): The number of dimensions to reduce the feature length to. Default is `512`.
+
+- `defaultsPerFeatureType` (dict): A dictionary mapping each feature type to how missing or mismatched values should be handled. The keys of this dictionary are the feature types, and the values are sub-dictionaries:
+- `text` / `numeric` / `category` / `label`:
+- `missing` (dict): A sub-dictionary detailing how missing values should be handled.
+- `strategy` (str): The strategy to use for missing values. Options include `REPLACE` or `RAISE`.
+- `replacement` (int/float for `numeric`, otherwise str): The value to replace missing values with. Only needed if `strategy` is `REPLACE`.
+- `mismatch` (dict): A sub-dictionary detailing how mismatched values should be handled.
+- `strategy` (str): The strategy to use for mismatched values. Options include `REPLACE`, `RAISE`, `COERCE_REPLACE`, or `COERCE_RAISE`.
+- `replacement` (int/float for `numeric`, otherwise str): The value to replace mismatched values with. Only needed if `strategy` is `REPLACE`, or `COERCE_REPLACE`.
+
+- `jobConfiguration` (dict, _optional_): A set of configurations that are applied to the job.
+- `batchSize` (int): The number of documents to process in a single batch. Default is `32`.
+- `runAnalysisChecks` (bool): Whether to run analysis checks, used to perform a high-level analysis of the data quality before proceeding. Default is `true`.
+- `skipLabels` (bool): Skips the featurization process for attributes marked as `label`. Default is `false`.
+- `useFeatureStore` (bool): Enables the use of the Feature Store database, which allows you to store features separately from your Source Database. Default is `false`, therefore features are written to the source graph.
+- `overwriteFSGraph` (bool): Whether to overwrite the Feature Store Graph if features were previously generated. Default is `false`, therefore features are written to an existing Feature Store Graph.
+- `writeToSourceGraph` (bool): Whether to store the generated features on the Source Graph. Default is `true`.
+
+- `metagraph` (dict): Metadata to represent the node & edge collections of the graph.
+- `vertexCollections` (dict): A dictionary mapping the node collection names to a configuration dictionary:
+- _collection name_ (dict):
+- `features` (dict): A dictionary mapping document properties to the following values:
+- `featureType` (str): The type of feature. Options include `text`, `category`, `numeric`, or `label`.
+- `featurePrefix` (str): Identical to global `featurePrefix` but for this collection.
+- `dimensionalityReduction` (dict): Identical to global `dimensionalityReduction` but for this collection.
+- `outputName` (str): Identical to global `outputName`, but specifically for this collection.
+- `defaultsPerFeatureType` (dict): Identical to global `defaultsPerFeatureType`, but specifically for this collection.
+- `edgeCollections` (dict): A dictionary mapping the edge collection names to an empty dictionary, as edge attributes are not currently supported.
+- _collection name_ (dict): An empty dictionary.
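To make the field list above concrete, here is a hedged sketch of a Featurization Specification as a Python dictionary, together with a small check of the `defaultsPerFeatureType` rules. All values (`my_database`, `my_graph`, `hasActor`, and so on) are hypothetical, the nesting is inferred from the descriptions rather than confirmed by the service, and `check_defaults` is an illustrative helper, not a GraphML API. The check also assumes the listed strategy options are exhaustive, which the text does not guarantee.

```python
import json

MISSING_STRATEGIES = {"REPLACE", "RAISE"}
MISMATCH_STRATEGIES = {"REPLACE", "RAISE", "COERCE_REPLACE", "COERCE_RAISE"}
NEEDS_REPLACEMENT = {"REPLACE", "COERCE_REPLACE"}


def check_defaults(defaults_per_feature_type):
    """Return a list of problems found in a defaultsPerFeatureType dict."""
    problems = []
    for feature_type, handlers in defaults_per_feature_type.items():
        for kind, allowed in (("missing", MISSING_STRATEGIES),
                              ("mismatch", MISMATCH_STRATEGIES)):
            handler = handlers.get(kind)
            if handler is None:
                continue  # nothing configured for this case
            strategy = handler.get("strategy")
            if strategy not in allowed:
                problems.append(f"{feature_type}.{kind}: unknown strategy {strategy!r}")
            elif strategy in NEEDS_REPLACEMENT and "replacement" not in handler:
                problems.append(f"{feature_type}.{kind}: {strategy} needs a replacement")
    return problems


# Hypothetical specification; names and nesting are assumptions.
featurization_spec = {
    "databaseName": "my_database",
    "featurizationName": "my_featurization",
    "projectName": "my_project",
    "graphName": "my_graph",
    "featurizationConfiguration": {
        "featurePrefix": "feat_",
        "outputName": "x",
        "dimensionalityReduction": {"disabled": False, "size": 512},
        "defaultsPerFeatureType": {
            "numeric": {
                "missing": {"strategy": "REPLACE", "replacement": 0},
                "mismatch": {"strategy": "COERCE_REPLACE", "replacement": 0},
            },
            "text": {"missing": {"strategy": "RAISE"}},
        },
    },
    "jobConfiguration": {"batchSize": 32, "runAnalysisChecks": True},
    "metagraph": {
        "vertexCollections": {
            "Actor": {"features": {"name": {"featureType": "text"}}},
        },
        # Edge collections map to empty dictionaries.
        "edgeCollections": {"hasActor": {}},
    },
}

problems = check_defaults(
    featurization_spec["featurizationConfiguration"]["defaultsPerFeatureType"]
)
spec_as_json = json.dumps(featurization_spec, indent=2)
```

Running the check before submitting a job catches a `REPLACE` strategy without a `replacement` value early, which is the one cross-field rule the descriptions state explicitly.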

 The Featurization Specification example is used for the GDELT dataset:
 - It featurizes the `name` attribute of the `Actor`, `Class`, `Country`,
@@ -383,34 +402,37 @@ Training Graph Machine Learning Models with GraphML requires two steps:
 1. Describe which data points should be included in the Training Job.
 2. Pass the Training Specification to the Training Service.

-**The Training Service depends on a `Training Specification` that contains**:
-- `featureSetID`: The feature set ID that was generated during the Featurization Job (if any). It replaces the need to provide the `metagraph`, `databaseName`, and `projectName` fields.
+The Training Service depends on a **Training Specification**:
+
+- `featureSetID` (str): The feature set ID that was generated during the Featurization Job (if any). It replaces the need to provide the `metagraph`, `databaseName`, and `projectName` fields.

-- `databaseName`: The database name the source data is in. Can be omitted if `featureSetID` is provided.
+- `databaseName` (str): The database name the source data is in. Can be omitted if `featureSetID` is provided.

-- `projectName`: The top-level project to which all the experiments will link back. Can be omitted if `featureSetID` is provided.
+- `projectName` (str): The top-level project to which all the experiments will link back. Can be omitted if `featureSetID` is provided.

-- `useFeatureStore`: Boolean for enabling or disabling the use of the feature store. Default is `false`.
+- `useFeatureStore` (bool): Boolean for enabling or disabling the use of the feature store. Default is `false`.

-- `mlSpec`: Describes the desired machine learning task, input features, and
+- `mlSpec` (dict): Describes the desired machine learning task, input features, and
 the attribute label to be predicted.
-- `classification`: Dictionary to describe the Node Classification Task Specification.
-- `targetCollection`: The ArangoDB collection name that contains the prediction label.
-- `inputFeatures`: The name of the feature to be used as input.
-- `labelField`: The name of the attribute to be predicted.
-- `batchSize`: The number of documents to process in a single training batch. Default is `64`.
-- `graphEmbeddings`: Dictionary to describe the Graph Embedding Task Specification.
-- `targetCollection`: The ArangoDB collection used to generate the embeddings.
-- `embeddingSize`: The size of the embedding vector. Default is `128`.
-- `batchSize`: The number of documents to process in a single training batch. Default is `64`.
-- `generateEmbeddings`: Whether to generate embeddings on the training dataset. Default is `false`.
-
-- `metagraph`: Metadata to represent the node & edge collections of the graph. If `featureSetID` is provided, this can be omitted.
-- `graph`: The ArangoDB graph name.
-- `vertexCollections`: A dictionary mapping the collection names to the following values:
-- `x`: The name of the feature to be used as input.
-- `y`: The name of the attribute to be predicted. Can only be specified for one collection.
-- `edgeCollections`: A dictionary mapping the edge collection names to an empty dictionary, as edge features are not currently supported.
+- `classification` (dict): Dictionary to describe the Node Classification Task Specification.
+- `targetCollection` (str): The ArangoDB collection name that contains the prediction label.
+- `inputFeatures` (str): The name of the feature to be used as input.
+- `labelField` (str): The name of the attribute to be predicted.
+- `batchSize` (int): The number of documents to process in a single training batch. Default is `64`.
+- `graphEmbeddings` (dict): Dictionary to describe the Graph Embedding Task Specification.
+- `targetCollection` (str): The ArangoDB collection used to generate the embeddings.
+- `embeddingSize` (int): The size of the embedding vector. Default is `128`.
+- `batchSize` (int): The number of documents to process in a single training batch. Default is `64`.
+- `generateEmbeddings` (bool): Whether to generate embeddings on the training dataset. Default is `false`.
+
+- `metagraph` (dict): Metadata to represent the node & edge collections of the graph. If `featureSetID` is provided, this can be omitted.
+- `graph` (str): The ArangoDB graph name.
+- `vertexCollections` (dict): A dictionary mapping the collection names to a configuration dictionary:
+- _collection name_ (dict):
+- `x` (str): The name of the feature to be used as input.
+- `y` (str): The name of the attribute to be predicted. Can only be specified for one collection.
+- `edgeCollections` (dict): A dictionary mapping the edge collection names to an empty dictionary, as edge features are not currently supported.
+- _collection name_ (dict): An empty dictionary.
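The two ways of describing a training job listed above (by `featureSetID`, or by `databaseName`, `projectName`, and `metagraph`) can be sketched as follows. This is an illustration under assumptions: collection names, IDs, and attribute names are hypothetical, the nesting is inferred from the field descriptions, and both helper functions are made up for this example rather than taken from any GraphML client.

```python
def missing_training_fields(spec):
    """List the fields still required when no featureSetID is given."""
    if spec.get("featureSetID"):
        return []  # the feature set replaces the three fields below
    required = ("databaseName", "projectName", "metagraph")
    return [name for name in required if name not in spec]


def collections_with_label(metagraph):
    """Return the vertex collections that define a `y` label attribute."""
    return [
        name
        for name, config in metagraph.get("vertexCollections", {}).items()
        if "y" in config
    ]


# Hypothetical node classification task.
ml_spec = {
    "classification": {
        "targetCollection": "Event",
        "inputFeatures": "x",
        "labelField": "label_field",
        "batchSize": 64,
    }
}

# Variant 1: re-use the output of a Featurization Job.
spec_from_feature_set = {
    "featureSetID": "hypothetical-feature-set-id",
    "mlSpec": ml_spec,
}

# Variant 2: describe the graph explicitly.
spec_with_metagraph = {
    "databaseName": "my_database",
    "projectName": "my_project",
    "useFeatureStore": False,
    "mlSpec": ml_spec,
    "metagraph": {
        "graph": "my_graph",
        "vertexCollections": {
            "Event": {"x": "x", "y": "label_field"},  # the only collection with `y`
            "Actor": {"x": "x"},
        },
        "edgeCollections": {"hasActor": {}},
    },
}

labelled = collections_with_label(spec_with_metagraph["metagraph"])
```

Because `y` can only be specified for one collection, a result from `collections_with_label` with more than one entry would indicate a specification that needs correcting before it is submitted.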

 A Training Specification allows for concisely defining your training task in a
 single object and then passing that object to the training service using the
 After selecting a model, a Prediction Job can be created. The Prediction Job
 will generate predictions and persist them to the source graph in a new
 collection, or within the source documents.

-**The Prediction Service depends on a `Prediction Specification` that contains**:
-- `projectName`: The top-level project to which all the experiments will link back.
-- `databaseName`: The database name the source data is in.
-- `modelID`: The model ID to use for generating predictions.
-- `featurizeNewDocuments`: Boolean for enabling or disabling the featurization of new documents. Useful if you don't want to re-train the model upon new data. Default is `false`.
-- `featurizeOutdatedDocuments`: Boolean for enabling or disabling the featurization of outdated documents. Outdated documents are those whose features have changed since the last featurization. Default is `false`.
-- `schedule`: A cron expression to schedule the prediction job. The cron syntax is a set of
+The Prediction Service depends on a **Prediction Specification**:
+
+- `projectName` (str): The top-level project to which all the experiments will link back.
+- `databaseName` (str): The database name the source data is in.
+- `modelID` (str): The model ID to use for generating predictions.
+- `featurizeNewDocuments` (bool): Boolean for enabling or disabling the featurization of new documents. Useful if you don't want to re-train the model upon new data. Default is `false`.
+- `featurizeOutdatedDocuments` (bool): Boolean for enabling or disabling the featurization of outdated documents. Outdated documents are those whose features have changed since the last featurization. Default is `false`.
+- `schedule` (str): A cron expression to schedule the prediction job. The cron syntax is a set of
 five fields in a line, indicating when the job should be executed. The format must follow
 the following order: `minute` `hour` `day-of-month` `month` `day-of-week`
 (e.g. `0 0 * * *` for daily predictions at 00:00). Default is `None`.
-- `embeddingsField`: The name of the field to store the generated embeddings. This is only used for Graph Embedding tasks. Default is `None`.
+- `embeddingsField` (str): The name of the field to store the generated embeddings. This is only used for Graph Embedding tasks. Default is `None`.
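As a rough illustration of the Prediction Specification fields above, the sketch below builds a specification dictionary and checks only that `schedule` has the five whitespace-separated fields the description calls for. The project, database, and model identifiers are hypothetical, and `has_five_cron_fields` is an illustrative helper that does not validate the values inside each field.

```python
def has_five_cron_fields(schedule):
    """Check the field count of a cron expression; None means no schedule."""
    if schedule is None:
        return True
    # minute, hour, day-of-month, month, day-of-week
    return len(schedule.split()) == 5


# Hypothetical specification for a daily prediction job at 00:00.
prediction_spec = {
    "projectName": "my_project",
    "databaseName": "my_database",
    "modelID": "hypothetical-model-id",
    "featurizeNewDocuments": True,
    "featurizeOutdatedDocuments": False,
    "schedule": "0 0 * * *",
    "embeddingsField": None,  # only used for Graph Embedding tasks
}

schedule_ok = has_five_cron_fields(prediction_spec["schedule"])
```

A full cron validator would also check ranges such as 0 to 59 for the minute field; counting fields is only a cheap first guard against a truncated expression.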