| copyright | lastupdated | keywords | subcollection |
|---|---|---|---|
| | 2025-04-24 | instructlab, ai, data, generate | instructlab |
{{site.data.keyword.attribute-definition-list}}
{: #data-generate}
Complete the following steps to generate data from your taxonomy.
Data cannot be augmented, curated, or manually uploaded to train the model. Use this task to generate the data. {: restriction}
{: #data-generate-pre}
- Install the `ilab` CLI plug-in. {: cli}
- Prepare your taxonomy.
- Add the taxonomy TAR to your {{site.data.keyword.cos_short}} bucket.
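If your taxonomy is still a directory on disk, the TAR for the last prerequisite can be produced with standard `tar`. This is a minimal sketch under assumptions: the function name `package_taxonomy` and the choice to keep the directory itself as the top-level entry of the archive are illustrative, and this page does not state the archive layout that the service expects, so confirm it against the taxonomy preparation instructions first.

```shell
#!/usr/bin/env bash
# Sketch: package a local taxonomy directory as a gzip-compressed TAR.
# The archive layout is an assumption; confirm what the service expects.

# package_taxonomy SRC_DIR OUT_FILE
package_taxonomy() {
  local src="$1" out="$2"
  if [ ! -d "$src" ]; then
    echo "not a directory: ${src}" >&2
    return 1
  fi
  # -C switches to the parent directory so the archive stores relative paths.
  tar -czf "$out" -C "$(dirname "$src")" "$(basename "$src")"
}
```

For example, `package_taxonomy ./taxonomy taxonomy.tar.gz` writes an archive whose entries start with `taxonomy/`. You can list the contents with `tar -tzf taxonomy.tar.gz` before you upload the file to the bucket.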
{: #data-generate-ui} {: ui}
- In the console, open the {{site.data.keyword.instructlab_short}} service.
- Click {{site.data.keyword.short_name}} Projects > your project > Training data > Generate.
- Enter an alphanumeric name for the training data, select the taxonomy to use, and click Generate. The state is `queued`, then `running`. Wait for the state to be `completed`. When the data is generated, a `synthetic_data` directory with logs for troubleshooting is created in the {{site.data.keyword.cos_short}} bucket.
{: #data-generate-cli} {: cli}
- List your taxonomies and make a note of the taxonomy you want to use.
ibmcloud ilab taxonomy list
{: pre}
Example output.
id                         name       taxonomy_path
669a88c9488ee7b95ce8fe05   test-tax   taxonomy.tar.gz
{: screen}
- Generate data from your taxonomy. Use only alphanumeric characters in the name. Note the ID of the data in the output; you need it in the next step.
ibmcloud ilab data generate [--name NAME] [--taxonomy-id TAXONOMY-ID]
{: pre}
Example command.
ibmcloud ilab data generate --name testdata --taxonomy-id 669a88c9488ee7b95ce8fe05
{: pre}
Example output.
id            66a268c170dcb21150050e8e
name          test-data
state         queued
status
created_at    2024-07-19T15:40:29.000Z
taxonomy_id   669a88c9488ee7b95ce8fe05
{: screen}
- Check the details of your data generation. Include the ID for the data. The state is `queued`, then `running`. Wait for the state to be `completed`.
ibmcloud ilab data get --id DATA_ID
{: pre}
Example `data get` command.
ibmcloud ilab data get --id 66a268c170dcb21150050e8e
{: pre}
Example output.
id            66a268c170dcb21150050e8e
name          test-data
state         running
status        Generating data for taxonomy path compositional_skills->STEM->math->area: 12% 12/100 (total qna processed 1/147)
created_at    2024-07-19T15:40:29.000Z
taxonomy_id   669a88c9488ee7b95ce8fe05
{: screen}
Example `data get` command with the `--output json` option, which includes metrics.
ibmcloud ilab data get --id 66a268c170dcb21150050e8e --output json
{: pre}
Example JSON output.
{ "created_at": "2024-07-19T15:40:29.000Z", "data_metrics": { "samples": { "knowledge": 30, "skills": 70, "total": 100 } }, "id": "66a268c170dcb21150050e8e", "name": "test-data", "state": "completed", "status": "completed", "taxonomy_id": "669a88c9488ee7b95ce8fe05" }
{: screen}
When the state is `completed`, a `synthetic_data` directory with logs for troubleshooting is created in the {{site.data.keyword.cos_short}} bucket.
{: #data-generate-api} {: api}
- List your taxonomies and make a note of the taxonomy you want to use.
Example command.
curl -X 'GET' 'https://us-east.instructlab.ibm.com/v1/taxonomies' -H 'accept: application/json'
{: pre}
Example output.
{ "taxonomies": [ { "id": "202a03c4-dcf1-432a-82b7-abecb2e019f7", "name": "example-taxonomy-name-1", "taxonomy_path_cos": "taxonomies/taxonomy.tar.gz", "created_at": "2024-10-23T02:58:50.000Z" } ] }
{: screen}
- Generate data from your taxonomy. Use only alphanumeric characters in the name. Note the ID of the data in the output; you need it in the next step.
Example command.
curl -X 'POST' 'https://us-east.instructlab.ibm.com/v1/data' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{ "name": "example-data-1", "taxonomy_id": "202a03c4-dcf1-432a-82b7-abecb2e019f7" }'
{: pre}
Example output.
{ "id": "add785e6-a8c3-4f5f-ab89-c506a3f115da", "name": "example-data-1", "state": "", "status": "queued", "created_at": "2024-10-23T02:58:50.000Z", "taxonomy_id": "202a03c4-dcf1-432a-82b7-abecb2e019f7", "data_metrics": { "samples": { "additionalProp1": 1, "additionalProp2": 2, "additionalProp3": 3 } } }
{: screen}
- Check the details of your data generation. Include the ID for the data. The state is `queued`, then `running`. Wait for the state to be `completed`.
Example command.
curl -X 'GET' 'https://us-east.instructlab.ibm.com/v1/data/add785e6-a8c3-4f5f-ab89-c506a3f115da' -H 'accept: application/json'
{: pre}
Example output.
{ "id": "add785e6-a8c3-4f5f-ab89-c506a3f115da", "name": "example-data-1", "state": "", "status": "queued", "created_at": "2024-10-23T02:58:50.000Z", "taxonomy_id": "202a03c4-dcf1-432a-82b7-abecb2e019f7", "data_metrics": { "samples": { "additionalProp1": 1, "additionalProp2": 2, "additionalProp3": 3 } } }
{: screen}
When the state is `completed`, a `synthetic_data` directory with logs for troubleshooting is created in the {{site.data.keyword.cos_short}} bucket.
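The API steps can be chained so that the ID returned by the `POST` request feeds the follow-up `GET` request. The following is a minimal sketch under assumptions: the helper names `json_field`, `generate_data`, and `get_data` are hypothetical, the `sed`-based extraction only handles simple string fields like the ones in the example output, and `BASE_URL` is the endpoint from the example commands. The examples on this page show no authentication header, so the sketch sends none; add whatever your environment requires.

```shell
#!/usr/bin/env bash
# Sketch: start data generation through the API and fetch its details.
# BASE_URL mirrors the endpoint used in the example commands.
BASE_URL="https://us-east.instructlab.ibm.com/v1"

# json_field NAME: read JSON on stdin and print the string value of NAME.
json_field() {
  tr -d '\n' | sed -n 's/.*"'"$1"'"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# generate_data NAME TAXONOMY_ID: request data generation, print the data ID.
generate_data() {
  curl -s -X POST "${BASE_URL}/data" \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d "{ \"name\": \"$1\", \"taxonomy_id\": \"$2\" }" | json_field id
}

# get_data DATA_ID: print the raw JSON details of a data generation.
get_data() {
  curl -s -X GET "${BASE_URL}/data/$1" -H 'accept: application/json'
}
```

For example, `data_id="$(generate_data example-data-1 202a03c4-dcf1-432a-82b7-abecb2e019f7)"` followed by `get_data "$data_id" | json_field status` prints the current status text.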
{: #data-bucket}
After you generate data, your {{site.data.keyword.cos_short}} bucket contains a `synthetic_data` directory with the following files.
Artifacts
: These files contain the samples for each leaf node. They are not used to train the model; they are provided for readability and to check whether a QNA generates the expected number of samples.
Logs
: These files contain the {{site.data.keyword.instructlab_short}} execution logs and system details.
`knowledge_train_msgs.jsonl` and `skills_train_msgs.jsonl`
: These are the Phase 1 and Phase 2 training files and contain samples used for training the model.
To understand why and how your data gets generated, see the SDG FAQs{: external} community doc.
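To sanity-check the training files after you download them from the bucket, you can count their records. The following sketch assumes only that each `.jsonl` file holds one JSON record per line, as the JSON Lines format implies, and that each record corresponds to one sample. The function name `count_samples` and the local file paths are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: count records in downloaded JSONL training files.
# Assumes one JSON record per non-blank line.

# count_samples FILE...: print "<file>: <n> samples" for each file.
count_samples() {
  local f n
  for f in "$@"; do
    if [ -f "$f" ]; then
      # Count lines that contain at least one non-whitespace character.
      n="$(grep -c '[^[:space:]]' "$f")"
      echo "${f}: ${n} samples"
    else
      echo "${f}: not found" >&2
    fi
  done
}
```

For example, run `count_samples knowledge_train_msgs.jsonl skills_train_msgs.jsonl` in the directory where you saved the files.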
{: #next-data}
After you've generated data from your taxonomy, you can begin training your model.