Commit ec45612

Indicate Python data types for GraphML specifications (#752)
* Indicate Python data types for GraphML specifications
* Add databaseName option for Featurization
* Apply to 3.12
* Fix/add more types
1 parent 871c097 commit ec45612

2 files changed: +184 -142 lines changed

site/content/3.12/data-science/graphml/notebooks-api.md

Lines changed: 92 additions & 71 deletions
@@ -118,54 +118,73 @@ arangoml.projects.list_projects()
118118

119119
**API Documentation: [ArangoML.jobs.featurize](https://arangoml.github.io/arangoml/api.html#agml_api.jobs.v1.api.jobs_api.JobsApi.featurize)**
120120

121-
**The Featurization Service depends on a `Featurization Specification` that contains**:
122-
- `featurizationName`: A name for the featurization task.
121+
The Featurization Service depends on a **Featurization Specification**:
123122

124-
- `projectName`: The associated project name. You can use `project.name` here
123+
{{< tip >}}
124+
The descriptions of the specifications on this page indicate the Python data types,
125+
but you can substitute them as follows for a schema description in terms of JSON:
126+
127+
| Python | JSON |
128+
|:--------|:-------|
129+
| `dict` | object |
130+
| `list` | array |
131+
| `int` | number |
132+
| `float` | number |
133+
| `str` | string |
134+
{{< /tip >}}
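
As a rough illustration of the mapping in the tip above, the following sketch serializes a small Python dictionary with the standard `json` module. The dictionary is a made-up fragment: the key names appear in the specification on this page, but the values and the way the keys are combined here are placeholders only. The sketch also shows two mappings the table does not list, `bool` to `true`/`false` and `None` to `null`.

```python
import json

# Hypothetical fragment: key names come from this page, values are placeholders.
spec_fragment = {
    "featurizationName": "example_featurization",  # str  -> JSON string
    "dimensionalityReduction": {"size": 256},      # dict -> object, int -> number
    "batchSize": 32,                                # int  -> number
    "runAnalysisChecks": True,                      # bool -> true
    "featureSetID": None,                           # None -> null
}

# Serialize to JSON text and parse it back to confirm nothing is lost.
as_json = json.dumps(spec_fragment, indent=2)
round_tripped = json.loads(as_json)

print(as_json)
```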
135+
136+
- `databaseName` (str): The database name the source data is in.
137+
138+
- `featurizationName` (str): A name for the featurization task.
139+
140+
- `projectName` (str): The associated project name. You can use `project.name` here
125141
if it was created or retrieved as described above.
126142

127-
- `graphName`: The associated graph name that exists within the database.
143+
- `graphName` (str): The associated graph name that exists within the database.
128144

129-
- `featureSetID` Optional: The ID of an existing Feature Set to re-use. If provided, the `metagraph` dictionary can be ommitted. Defaults to `None`.
145+
- `featureSetID` (str, _optional_): The ID of an existing Feature Set to re-use. If provided, the `metagraph` dictionary can be omitted. Defaults to `None`.
130146

131-
- `featurizationConfiguration` Optional: The optional default configuration to be applied
147+
- `featurizationConfiguration` (dict, _optional_): The optional default configuration to be applied
132148
across all features. Individual collection feature settings override this option.
133149

134-
- `featurePrefix`: The prefix to be applied to all individual features generated. Default is `feat_`.
150+
- `featurePrefix` (str): The prefix to be applied to all individual features generated. Default is `feat_`.
135151

136-
- `outputName`: Adjust the default feature name. This can be any valid ArangoDB attribute name. Defaults to `x`.
152+
- `outputName` (str): Adjust the default feature name. This can be any valid ArangoDB attribute name. Defaults to `x`.
137153

138-
- `dimensionalityReduction`: Object configuring dimensionality reduction.
139-
- `disabled`: Whether to disable dimensionality reduction. Default is `false`,
154+
- `dimensionalityReduction` (dict): Object configuring dimensionality reduction.
155+
- `disabled` (bool): Whether to disable dimensionality reduction. Default is `false`,
140156
therefore dimensionality reduction is applied after Featurization by default.
141-
- `size`: The number of dimensions to reduce the feature length to. Default is `512`.
142-
143-
- `defaultsPerFeatureType`: A dictionary mapping each feature to how missing or mismatched values should be handled. The keys of this dictionary are the features, and the values are sub-dictionaries with the following keys:
144-
- `missing`: A sub-dictionary detailing how missing values should be handled.
145-
- `strategy`: The strategy to use for missing values. Options include `REPLACE` or `RAISE`.
146-
- `replacement`: The value to replace missing values with. Only needed if `strategy` is `REPLACE`.
147-
- `mismatch`: A sub-dictionary detailing how mismatched values should be handled.
148-
- `strategy`: The strategy to use for mismatched values. Options include `REPLACE`, `RAISE`, `COERCE_REPLACE`, or `COERCE_RAISE`.
149-
- `replacement`: The value to replace mismatched values with. Only needed if `strategy` is `REPLACE`, or `COERCE_REPLACE`.
150-
151-
- `jobConfiguration` Optional: A set of configurations that are applied to the job.
152-
- `batchSize`: The number of documents to process in a single batch. Default is `32`.
153-
- `runAnalysisChecks`: Whether to run analysis checks, used to perform a high-level analysis of the data quality before proceeding. Default is `true`.
154-
- `skipLabels`: Skips the featurization process for attributes marked as `label`. Default is `false`.
155-
- `useFeatureStore`: Enables the use of the Feature Store database, which allows you to store features separately from your Source Database. Default is `false`, therefore features are written to the source graph.
156-
- `overwriteFSGraph`: Whether to overwrite the Feature Store Graph if features were previously generated. Default is `false`, therefore features are written to an existing Feature Store Graph.s
157-
- `writeToSourceGraph`: Whether to store the generated features on the Source Graph. Default is `true`.
158-
159-
- `metagraph`: Metadata to represent the node & edge collections of the graph.
160-
- `vertexCollections`: A dictionary mapping the node collection names to the following values:
161-
- `features`: A dictionary mapping document properties to the following values:
162-
- `featureType`: The type of feature. Options include `text`, `category`, `numeric`, or `label`.
163-
- `config`: Collection-level configuration settings.
164-
- `featurePrefix`: Identical to global `featurePrefix` but for this collection.
165-
- `dimensionalityReduction`: Identical to global `dimensionalityReduction` but for this collection.
166-
- `outputName`: Identical to global `outputName`, but specifically for this collection.
167-
- `defaultsPerFeatureType`: Identical to global `defaultsPerFeatureType`, but specifically for this collection.
168-
- `edgeCollections`: A dictionary mapping the edge collection names to an empty dictionary, as edge attributes are not currently supported.
157+
- `size` (int): The number of dimensions to reduce the feature length to. Default is `512`.
158+
159+
- `defaultsPerFeatureType` (dict): A dictionary mapping each feature type to how missing or mismatched values should be handled. The keys of this dictionary are the feature types, and the values are sub-dictionaries:
160+
- `text` / `numeric` / `category` / `label`:
161+
- `missing` (dict): A sub-dictionary detailing how missing values should be handled.
162+
- `strategy` (str): The strategy to use for missing values. Options include `REPLACE` or `RAISE`.
163+
- `replacement` (int/float for `numeric`, otherwise str): The value to replace missing values with. Only needed if `strategy` is `REPLACE`.
164+
- `mismatch` (dict): A sub-dictionary detailing how mismatched values should be handled.
165+
- `strategy` (str): The strategy to use for mismatched values. Options include `REPLACE`, `RAISE`, `COERCE_REPLACE`, or `COERCE_RAISE`.
166+
- `replacement` (int/float for `numeric`, otherwise str): The value to replace mismatched values with. Only needed if `strategy` is `REPLACE` or `COERCE_REPLACE`.
167+
168+
- `jobConfiguration` (dict, _optional_): A set of configurations that are applied to the job.
169+
- `batchSize` (int): The number of documents to process in a single batch. Default is `32`.
170+
- `runAnalysisChecks` (bool): Whether to run analysis checks, used to perform a high-level analysis of the data quality before proceeding. Default is `true`.
171+
- `skipLabels` (bool): Skips the featurization process for attributes marked as `label`. Default is `false`.
172+
- `useFeatureStore` (bool): Enables the use of the Feature Store database, which allows you to store features separately from your Source Database. Default is `false`, therefore features are written to the source graph.
173+
- `overwriteFSGraph` (bool): Whether to overwrite the Feature Store Graph if features were previously generated. Default is `false`, therefore features are written to an existing Feature Store Graph.
174+
- `writeToSourceGraph` (bool): Whether to store the generated features on the Source Graph. Default is `true`.
175+
176+
- `metagraph` (dict): Metadata to represent the node & edge collections of the graph.
177+
- `vertexCollections` (dict): A dictionary mapping the node collection names to a configuration dictionary:
178+
- _collection name_ (dict):
179+
- `features` (dict): A dictionary mapping document properties to the following values:
180+
- `featureType` (str): The type of feature. Options include `text`, `category`, `numeric`, or `label`.
181+
- `config` (dict): Collection-level configuration settings.
182+
- `featurePrefix` (str): Identical to global `featurePrefix` but for this collection.
183+
- `dimensionalityReduction` (dict): Identical to global `dimensionalityReduction` but for this collection.
184+
- `outputName` (str): Identical to global `outputName`, but specifically for this collection.
185+
- `defaultsPerFeatureType` (dict): Identical to global `defaultsPerFeatureType`, but specifically for this collection.
186+
- `edgeCollections` (dict): A dictionary mapping the edge collection names to an empty dictionary, as edge attributes are not currently supported.
187+
- _collection name_ (dict): An empty dictionary.
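
The fields listed above could be assembled into a plain Python dictionary along these lines. This is a sketch, not the documented example: list indentation is lost in this diff view, so the nesting (in particular placing `featurePrefix`, `outputName`, `dimensionalityReduction`, and `defaultsPerFeatureType` under `featurizationConfiguration`) is an assumption. All names and values are placeholders, except the `Actor` collection and its `name` attribute, which the surrounding text mentions. The helper `strategy_problems` is hypothetical and only checks the strategy names listed above, which the text introduces with "Options include" and may therefore not be exhaustive.

```python
# Strategy names as listed in the specification text above.
MISSING_STRATEGIES = {"REPLACE", "RAISE"}
MISMATCH_STRATEGIES = {"REPLACE", "RAISE", "COERCE_REPLACE", "COERCE_RAISE"}
NEEDS_REPLACEMENT = {"REPLACE", "COERCE_REPLACE"}

# Placeholder specification; the nesting is an assumption (see lead-in).
featurization_spec = {
    "databaseName": "example_db",
    "featurizationName": "example_featurization",
    "projectName": "example_project",
    "graphName": "example_graph",
    "featurizationConfiguration": {
        "featurePrefix": "feat_",
        "outputName": "x",
        "dimensionalityReduction": {"disabled": False, "size": 512},
        "defaultsPerFeatureType": {
            "numeric": {
                "missing": {"strategy": "REPLACE", "replacement": 0},
                "mismatch": {"strategy": "COERCE_REPLACE", "replacement": 0},
            },
            "text": {
                "missing": {"strategy": "REPLACE", "replacement": ""},
                "mismatch": {"strategy": "RAISE"},
            },
        },
    },
    "jobConfiguration": {"batchSize": 32, "runAnalysisChecks": True},
    "metagraph": {
        "vertexCollections": {
            "Actor": {"features": {"name": {"featureType": "text"}}},
        },
        # Hypothetical edge collection name; the value is an empty dictionary.
        "edgeCollections": {"example_edges": {}},
    },
}


def strategy_problems(defaults_per_feature_type):
    """Return human-readable problems found in a defaultsPerFeatureType dict."""
    problems = []
    for feature_type, handlers in defaults_per_feature_type.items():
        for kind, allowed in (
            ("missing", MISSING_STRATEGIES),
            ("mismatch", MISMATCH_STRATEGIES),
        ):
            handler = handlers.get(kind)
            if handler is None:
                continue  # nothing configured for this kind
            strategy = handler.get("strategy")
            if strategy not in allowed:
                problems.append(f"{feature_type}.{kind}: unknown strategy {strategy!r}")
            elif strategy in NEEDS_REPLACEMENT and "replacement" not in handler:
                problems.append(
                    f"{feature_type}.{kind}: strategy {strategy} needs a replacement"
                )
    return problems


defaults = featurization_spec["featurizationConfiguration"]["defaultsPerFeatureType"]
print(strategy_problems(defaults))
```

Checking the dictionary locally before submitting it is a design choice of this sketch; the service may perform its own validation.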
169188

170189
The Featurization Specification example is used for the GDELT dataset:
171190
- It featurizes the `name` attribute of the `Actor`, `Class`, `Country`,
@@ -383,34 +402,37 @@ Training Graph Machine Learning Models with GraphML requires two steps:
383402
1. Describe which data points should be included in the Training Job.
384403
2. Pass the Training Specification to the Training Service.
385404

386-
**The Training Service depends on a `Training Specification` that contains**:
387-
- `featureSetID`: The feature set ID that was generated during the Featurization Job (if any). It replaces the need to provide the `metagraph`, `databaseName`, and `projectName` fields.
405+
The Training Service depends on a **Training Specification**:
406+
407+
- `featureSetID` (str): The feature set ID that was generated during the Featurization Job (if any). It replaces the need to provide the `metagraph`, `databaseName`, and `projectName` fields.
388408

389-
- `databaseName`: The database name the source data is in. Can be omitted if `featureSetID` is provided.
409+
- `databaseName` (str): The database name the source data is in. Can be omitted if `featureSetID` is provided.
390410

391-
- `projectName`: The top-level project to which all the experiments will link back. Can be omitted if `featureSetID` is provided.
411+
- `projectName` (str): The top-level project to which all the experiments will link back. Can be omitted if `featureSetID` is provided.
392412

393-
- `useFeatureStore`: Boolean for enabling or disabling the use of the feature store. Default is `false`.
413+
- `useFeatureStore` (bool): Boolean for enabling or disabling the use of the feature store. Default is `false`.
394414

395-
- `mlSpec`: Describes the desired machine learning task, input features, and
415+
- `mlSpec` (dict): Describes the desired machine learning task, input features, and
396416
the attribute label to be predicted.
397-
- `classification`: Dictionary to describe the Node Classification Task Specification.
398-
- `targetCollection`: The ArangoDB collection name that contains the prediction label.
399-
- `inputFeatures`: The name of the feature to be used as input.
400-
- `labelField`: The name of the attribute to be predicted.
401-
- `batchSize`: The number of documents to process in a single training batch. Default is `64`.
402-
- `graphEmbeddings`: Dictionary to describe the Graph Embedding Task Specification.
403-
- `targetCollection`: The ArangoDB collection used to generate the embeddings.
404-
- `embeddingSize`: The size of the embedding vector. Default is `128`.
405-
- `batchSize`: The number of documents to process in a single training batch. Default is `64`.
406-
- `generateEmbeddings`: Whether to generate embeddings on the training dataset. Default is `false`.
407-
408-
- `metagraph`: Metadata to represent the node & edge collections of the graph. If `featureSetID` is provided, this can be omitted.
409-
- `graph`: The ArangoDB graph name.
410-
- `vertexCollections`: A dictionary mapping the collection names to the following values:
411-
- `x`: The name of the feature to be used as input.
412-
- `y`: The name of the attribute to be predicted. Can only be specified for one collection.
413-
- `edgeCollections`: A dictionary mapping the edge collection names to an empty dictionary, as edge features are not currently supported.
417+
- `classification` (dict): Dictionary to describe the Node Classification Task Specification.
418+
- `targetCollection` (str): The ArangoDB collection name that contains the prediction label.
419+
- `inputFeatures` (str): The name of the feature to be used as input.
420+
- `labelField` (str): The name of the attribute to be predicted.
421+
- `batchSize` (int): The number of documents to process in a single training batch. Default is `64`.
422+
- `graphEmbeddings` (dict): Dictionary to describe the Graph Embedding Task Specification.
423+
- `targetCollection` (str): The ArangoDB collection used to generate the embeddings.
424+
- `embeddingSize` (int): The size of the embedding vector. Default is `128`.
425+
- `batchSize` (int): The number of documents to process in a single training batch. Default is `64`.
426+
- `generateEmbeddings` (bool): Whether to generate embeddings on the training dataset. Default is `false`.
427+
428+
- `metagraph` (dict): Metadata to represent the node & edge collections of the graph. If `featureSetID` is provided, this can be omitted.
429+
- `graph` (str): The ArangoDB graph name.
430+
- `vertexCollections` (dict): A dictionary mapping the collection names to a configuration dictionary:
431+
- _collection name_ (dict):
432+
- `x` (str): The name of the feature to be used as input.
433+
- `y` (str): The name of the attribute to be predicted. Can only be specified for one collection.
434+
- `edgeCollections` (dict): A dictionary mapping the edge collection names to an empty dictionary, as edge features are not currently supported.
435+
- _collection name_ (dict): An empty dictionary.
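
The rule that `featureSetID` replaces `metagraph`, `databaseName`, and `projectName` can be sketched as a small builder for a Node Classification specification. The function `build_classification_spec` and all argument values are hypothetical, and the nesting of `batchSize` inside `classification` is an assumption, because list indentation is lost in this diff view.

```python
def build_classification_spec(
    target_collection,
    input_features,
    label_field,
    feature_set_id=None,
    database_name=None,
    project_name=None,
    metagraph=None,
    batch_size=64,
):
    """Assemble a Training Specification dict for a classification task.

    Hypothetical helper: either feature_set_id is given, or database_name,
    project_name, and metagraph must all be provided.
    """
    spec = {
        "mlSpec": {
            "classification": {
                "targetCollection": target_collection,
                "inputFeatures": input_features,
                "labelField": label_field,
                "batchSize": batch_size,
            }
        }
    }
    if feature_set_id is not None:
        # The feature set ID stands in for the three fields below.
        spec["featureSetID"] = feature_set_id
        return spec

    required = {
        "databaseName": database_name,
        "projectName": project_name,
        "metagraph": metagraph,
    }
    missing = [name for name, value in required.items() if value is None]
    if missing:
        raise ValueError(f"missing fields without featureSetID: {missing}")
    spec.update(required)
    return spec


# Placeholder values throughout.
training_spec = build_classification_spec(
    target_collection="example_collection",
    input_features="x",
    label_field="example_label",
    feature_set_id="example_feature_set_id",
)
print(training_spec)
```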
414436

415437
A Training Specification allows for concisely defining your training task in a
416438
single object and then passing that object to the training service using the
@@ -705,23 +727,22 @@ print(best_model)
705727

706728
**API Documentation: [ArangoML.jobs.predict](https://arangoml.github.io/arangoml/api.html#agml_api.jobs.v1.api.jobs_api.JobsApi.predict)**
707729

708-
Final step!
709-
710730
After selecting a model, a Prediction Job can be created. The Prediction Job
711731
will generate predictions and persist them to the source graph in a new
712732
collection, or within the source documents.
713733

714-
**The Prediction Service depends on a `Prediction Specification` that contains**:
715-
- `projectName`: The top-level project to which all the experiments will link back.
716-
- `databaseName`: The database name the source data is in.
717-
- `modelID`: The model ID to use for generating predictions.
718-
- `featurizeNewDocuments`: Boolean for enabling or disabling the featurization of new documents. Useful if you don't want to re-train the model upon new data. Default is `false`.
719-
- `featurizeOutdatedDocuments`: Boolean for enabling or disabling the featurization of outdated documents. Outdated documents are those whose features have changed since the last featurization. Default is `false`.
720-
- `schedule`: A cron expression to schedule the prediction job. The cron syntax is a set of
734+
The Prediction Service depends on a **Prediction Specification**:
735+
736+
- `projectName` (str): The top-level project to which all the experiments will link back.
737+
- `databaseName` (str): The database name the source data is in.
738+
- `modelID` (str): The model ID to use for generating predictions.
739+
- `featurizeNewDocuments` (bool): Boolean for enabling or disabling the featurization of new documents. Useful if you don't want to re-train the model upon new data. Default is `false`.
740+
- `featurizeOutdatedDocuments` (bool): Boolean for enabling or disabling the featurization of outdated documents. Outdated documents are those whose features have changed since the last featurization. Default is `false`.
741+
- `schedule` (str): A cron expression to schedule the prediction job. The cron syntax is a set of
721742
five fields in a line, indicating when the job should be executed. The format must follow
722743
the following order: `minute` `hour` `day-of-month` `month` `day-of-week`
723744
(e.g. `0 0 * * *` for daily predictions at 00:00). Default is `None`.
724-
- `embeddingsField`: The name of the field to store the generated embeddings. This is only used for Graph Embedding tasks. Default is `None`.
745+
- `embeddingsField` (str): The name of the field to store the generated embeddings. This is only used for Graph Embedding tasks. Default is `None`.
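
As a sketch of the fields above, a Prediction Specification can be written as a flat dictionary. All values are placeholders. The helper `has_five_cron_fields` is hypothetical and only counts whitespace-separated fields; it does not verify that each field holds a valid cron value.

```python
def has_five_cron_fields(expression):
    """Return True if the expression has exactly five whitespace-separated fields."""
    return len(expression.split()) == 5


# Placeholder values; the key names are taken from the list above.
prediction_spec = {
    "projectName": "example_project",
    "databaseName": "example_db",
    "modelID": "example_model_id",
    "featurizeNewDocuments": True,
    "featurizeOutdatedDocuments": True,
    # minute hour day-of-month month day-of-week: daily at 00:00
    "schedule": "0 0 * * *",
}

print(has_five_cron_fields(prediction_spec["schedule"]))
```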
725746

726747
```py
727748
# 1. Define the Prediction Specification
