This repository hosts the code used to generate the Funcdex-MT-Function-Calling dataset and, subsequently, to train the Funcdex models.
This codebase can be used to generate ReAct-like Multi-turn Function Calling Datasets. Thinking generation can be turned off!
You only need to provide:
- Tool definitions
- Natural Language Workflow Description (optional, can be inferred from the tool definitions).
This codebase comes with about 14,000 tool definitions covering more than 600 popular applications! See `src/data/tool_definitions.json`.
If your application is not present, you can simply add its tools: either use the utility script we provide to ingest an OpenAPI Spec JSON, or add them manually.
Funcdex-MT-Function-Calling contains tool-definition grounded conversations with the following toolkits:
| Toolkit/Bundle Name | Checkpoint Link |
|---|---|
| ALL TOOLKITS | Funcdex-1.7B |
| gmail | Funcdex-0.6B-gmail |
| googlecalendar | Funcdex-0.6B-googlecalendar |
| googledrive | Funcdex-0.6B-googledrive |
| googledocs | Funcdex-0.6B-googledocs |
| jira | Funcdex-0.6B-jira |
| stripe | Funcdex-0.6B-stripe |
| asana | Funcdex-0.6B-asana |
| calendly | Funcdex-0.6B-calendly |
| whatsapp | Funcdex-0.6B-whatsapp |
| todoist | Funcdex-0.6B-todoist |
| gmail_googlecalendar | Funcdex-0.6B-gmail_googlecalendar |
| googledrive_googledocs | Funcdex-0.6B-googledrive_googledocs |
| jira_gmail | Funcdex-0.6B-jira_gmail |
| googledrive_calendly_googlecalendar | Funcdex-0.6B-googledrive_calendly_googlecalendar |
| whatsapp_todoist | Funcdex-0.6B-whatsapp_todoist |
Our general model, Funcdex-1.7B, is 20% more performant and 2x less expensive than gpt-oss-120b.
| LLM | Function Call Match | Cost ($) |
|---|---|---|
| GPT-OSS-120B (medium) | 0.51 | 9.32 |
| GPT-5 Mini (medium) | 0.58 | 99.71 |
| GPT-5 (minimal) | 0.59 | 205.45 |
| Qwen3-0.6B | 0.59 | 2.83 |
| Qwen3-1.7B | 0.69 | 5.73 |
| Funcdex-0.6B | 0.70 | 0.19 |
| Funcdex-1.7B | 0.81 | 5.64 |
Notes:
- Funcdex-0.6B reports the average performance of the individual Funcdex-0.6B models.
- For cost, we track the number of prompt/completion tokens used to evaluate 300 conversations.
  - e.g. if token cost is input = $1 and output = $10 per million tokens, and evaluation needed 0.5M input and 0.1M output tokens, then the cost is 1 * 0.5 + 10 * 0.1 = $1.5 (see the sketch below).
- Qwen3-0.6B and Qwen3-1.7B evaluation costs are estimated by extrapolating from Llama3.2-3B serverless costs. Other models' costs are sourced from OpenRouter.
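As a concrete illustration of that accounting, here is a small sketch; the prices used are the placeholder values from the example above, not actual provider rates.

```python
def eval_cost_usd(prompt_tokens: int, completion_tokens: int,
                  input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost = (prompt tokens / 1M) * input price + (completion tokens / 1M) * output price."""
    return (prompt_tokens / 1e6) * input_price_per_m + (completion_tokens / 1e6) * output_price_per_m

# Worked example from the note above: $1 / $10 per million tokens,
# 0.5M prompt and 0.1M completion tokens -> $1.50.
print(eval_cost_usd(500_000, 100_000, 1.0, 10.0))
```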
For more details, refer to the blog post.
Fun Fact: The name Funcdex is inspired by Pokedex -> Catch all the Functions!
This section sets up a simple dataset of 3 conversations involving the use of two applications/toolkits - Gmail and Google Drive.
- Install uv then run the following:
```bash
git clone https://github.com/premAI-io/Funcdex-Synthesizer
cd Funcdex-Synthesizer
uv sync
source .venv/bin/activate
```
You have two options:
- Local with `openai/gpt-oss-120b` (either vLLM or llama.cpp). Recommended.
- Any LLM with a cloud provider that supports GBNF, such as Fireworks.ai. Untested!
In either scenario, update your LLM configuration in the config file: `examples/example_config.yaml`.
- Follow the steps here to install vLLM.
- Serve `openai/gpt-oss-120b` using `CUDA_VISIBLE_DEVICES=0 vllm serve openai/gpt-oss-120b`.
Note: The pipeline is CPU-bottlenecked due to context-free grammar compilation, so using multiple GPUs will not speed things up much.
Run `python scripts/run_pipeline.py examples/example_config.yaml` and grab a cup of coffee while it runs.
It should take roughly 5 minutes to finish.
- To generate the full dataset similar to Funcdex-MT-Function-Calling, run `python scripts/run_pipeline.py config.yaml`.
You should see files populated in `outputs/`.
The file you are interested in is `outputs/parsed_conversations.jsonl`.
This can be used directly in TRL or Unsloth to finetune models!
For more information, refer to this page.
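If you want to jump straight into finetuning, a minimal TRL sketch might look like the following. It assumes each JSONL record exposes a chat-style `messages` field (verify this against your own output) and uses an illustrative base model.

```python
# Minimal SFT sketch with TRL -- assumes each record in parsed_conversations.jsonl
# has a chat-style "messages" field; adapt to whatever the file actually contains.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("json", data_files="outputs/parsed_conversations.jsonl", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen3-0.6B",  # illustrative base model, not a recommendation
    train_dataset=dataset,
    args=SFTConfig(output_dir="funcdex-sft", per_device_train_batch_size=1),
)
trainer.train()
```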
You may want to verify that the conversations match your use case.
Enjoy the slop: visualize the conversations with our vibe-coded Streamlit app by running:
```bash
streamlit run examples/utils/streamlit_inspector.py examples/example_config.yaml
```
And open http://localhost:8501 in your browser.
You should see a screen like so:

Tool/API:
- External functions that are executed with the given parameters and return their result to the LLM for further processing.
Toolkit:
- A set of tools that are related in some way (coming from the same service/application). E.g. `UPDATE_CALENDAR_LIST_ENTRY` and `CREATE_EVENT` below are API routes supported by Google Calendar.
Bundle:
- A set of toolkits -> E.g. your workflow involves using both Calendly and Gmail tools.
Let's say this is you:
I get a similar set of questions whose answers are usually present either in my Gmail or my Google Drive.
Can I ask an LLM to search through my Gmail or Google Drive to answer questions?
You infer that Gmail and Google Drive have these APIs that can help solve this problem:
```json
[
{
"toolkit_name": "gmail",
"tool_id": "FETCH_EMAILS",
"tool_description": "Fetches a list of email messages from a gmail account, supporting filtering, pagination, and optional full content retrieval.",
"tool_input_parameters": "{\"ids_only\": {\"type\": \"boolean\", \"required\": false}, \"include_payload\": {\"type\": \"boolean\", \"required\": false}, \"include_spam_trash\": {\"type\": \"boolean\", \"required\": false}, \"label_ids\": {\"type\": \"array\", \"required\": false}, \"max_results\": {\"type\": \"integer\", \"required\": false}, \"page_token\": {\"type\": \"string\", \"required\": false}, \"query\": {\"type\": \"string\", \"required\": false}, \"user_id\": {\"type\": \"string\", \"required\": false}, \"verbose\": {\"type\": \"boolean\", \"required\": false}}",
"tool_response_parameters": "{\"data\": {\"type\": \"object\", \"required\": true}, \"error\": {\"type\": \"string\", \"required\": false}, \"successful\": {\"type\": \"boolean\", \"required\": true}}"
},
{
"toolkit_name": "gmail",
"tool_id": "LIST_THREADS",
"tool_description": "Retrieves a list of email threads from a gmail account, identified by `user id` (email address or 'me'), supporting filtering and pagination.",
"tool_input_parameters": "{\"max_results\": {\"type\": \"integer\", \"required\": false}, \"page_token\": {\"type\": \"string\", \"required\": false}, \"query\": {\"type\": \"string\", \"required\": false}, \"user_id\": {\"type\": \"string\", \"required\": false}, \"verbose\": {\"type\": \"boolean\", \"required\": false}}",
"tool_response_parameters": "{\"data\": {\"type\": \"object\", \"required\": true}, \"error\": {\"type\": \"string\", \"required\": false}, \"successful\": {\"type\": \"boolean\", \"required\": true}}"
},
{
"toolkit_name": "googledrive",
"tool_id": "FIND_FILE",
"tool_description": "Tool to list or search for files and folders in google drive. use when you need to find specific files based on query criteria or list contents of a drive/folder.",
"tool_input_parameters": "{\"corpora\": {\"type\": \"string\", \"required\": false}, \"driveId\": {\"type\": \"string\", \"required\": false}, \"fields\": {\"type\": \"string\", \"required\": false}, \"includeItemsFromAllDrives\": {\"type\": \"boolean\", \"required\": false}, \"orderBy\": {\"type\": \"string\", \"required\": false}, \"pageSize\": {\"type\": \"integer\", \"required\": false}, \"pageToken\": {\"type\": \"string\", \"required\": false}, \"q\": {\"type\": \"string\", \"required\": false}, \"spaces\": {\"type\": \"string\", \"required\": false}, \"supportsAllDrives\": {\"type\": \"boolean\", \"required\": false}}",
"tool_response_parameters": "{\"data\": {\"type\": \"object\", \"required\": true}, \"error\": {\"type\": \"string\", \"required\": false}, \"successful\": {\"type\": \"boolean\", \"required\": true}}"
},
{
"toolkit_name": "googledrive",
"tool_id": "LIST_FILES_AND_FOLDERS",
"tool_description": "Tool to list a user's files and folders in google drive. use this to search or browse for files and folders based on various criteria.",
"tool_input_parameters": "{\"corpora\": {\"type\": \"string\", \"required\": false}, \"driveId\": {\"type\": \"string\", \"required\": false}, \"fields\": {\"type\": \"string\", \"required\": false}, \"folderId\": {\"type\": \"string\", \"required\": false}, \"includeItemsFromAllDrives\": {\"type\": \"boolean\", \"required\": false}, \"includeLabels\": {\"type\": \"string\", \"required\": false}, \"includePermissionsForView\": {\"type\": \"string\", \"required\": false}, \"orderBy\": {\"type\": \"string\", \"required\": false}, \"pageSize\": {\"type\": \"integer\", \"required\": false}, \"pageToken\": {\"type\": \"string\", \"required\": false}, \"q\": {\"type\": \"string\", \"required\": false}, \"spaces\": {\"type\": \"string\", \"required\": false}, \"supportsAllDrives\": {\"type\": \"boolean\", \"required\": false}}",
"tool_response_parameters": "{\"data\": {\"type\": \"object\", \"required\": true}, \"error\": {\"type\": \"string\", \"required\": false}, \"successful\": {\"type\": \"boolean\", \"required\": true}}"
}
]
```
Now, this project will take these APIs as input, and synthesize a dataset that you can use to:
- Evaluate performance of various LLMs.
- Finetune Open LLMs.
Note: We provide a convenience script to convert OpenAPI Spec to the above format. More below.
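To carve out a subset like the one above from the bundled catalog, you can filter `src/data/tool_definitions.json` by `toolkit_name`. The snippet below assumes the file is a JSON array of entries shaped like the example; check the actual structure before relying on it.

```python
import json

# Assumes tool_definitions.json is a JSON array of entries with a "toolkit_name"
# field, as in the example above -- verify the structure before use.
with open("src/data/tool_definitions.json") as f:
    all_tools = json.load(f)

wanted = {"gmail", "googledrive"}
subset = [t for t in all_tools if t.get("toolkit_name") in wanted]
print(f"Selected {len(subset)} tools from {sorted(wanted)}")
```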
Answer these questions:
- Do your workflows involve tools from different toolkits?
  - Yes -> Use bundles.
  - No -> Use toolkits.
- Now edit this file based on your answers.
Listing only the most useful ones:
- `include_assistant_thoughts`: Generate a dataset for training a thinking/non-thinking assistant.
- `total_conversations`: How many conversations to generate.
- Contents of `src/data/wanted_toolkits.json` and `src/data/tool_definitions.json`.
- All of these:
```yaml
data:
  tool_definitions: src/data/tool_definitions.json
  wanted_toolkits: src/data/wanted_toolkits.json

output:
  conversations: outputs/conversations.jsonl
  conversations_with_system_prompts: outputs/conversations_with_system_prompts.jsonl
  scored_conversations: outputs/scored_conversations.jsonl
  accepted_conversations: outputs/accepted_conversations.jsonl
  parsed_conversations: outputs/parsed_conversations.jsonl
```
```yaml
# In config.yaml
llm_api:
  api_key: ${OPENAI_API_KEY:empty} # Uses env var or defaults to "empty"
  base_url: ${LLM_BASE_URL:http://localhost:8000/v1/}
  model: ${LLM_MODEL:openai/openai/gpt-oss-120b}
```
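For reference, `${VAR:default}` placeholders like those above can be resolved with a small regex pass. This is only a sketch of the idea; the repository's actual config loader may behave differently.

```python
import os
import re

# Sketch of ${VAR:default} resolution -- illustrative only; the actual
# config loader in this repo may differ.
_PLACEHOLDER = re.compile(r"\$\{([A-Z0-9_]+):([^}]*)\}")

def resolve_placeholders(text: str) -> str:
    return _PLACEHOLDER.sub(lambda m: os.environ.get(m.group(1), m.group(2)), text)

print(resolve_placeholders("${LLM_BASE_URL:http://localhost:8000/v1/}"))
```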
The pipeline uses a multi-agent process to synthesize conversations. The generation is done in a way that reduces duplication, encourages diversity, and provides a high-quality tool-use signal for finetuning.
- Sample a random combination of tools from the given set.
- If a natural language `workflow_description` is provided, use it to generate similar workflows; otherwise, prompt an LLM to synthesize a user workflow that involves the use of the sampled tools.
- Deduplicate the user scenario -> use an embedding model to discard the scenario if a very similar scenario already exists.
- Prompt an LLM to generate a user's background (company name, position, and other related context). This is used to make the conversation context richer and less "slop".
- Prompt an LLM to generate how the conversation would look if the workflow were carried out by chatting with an LLM assistant.
- Prompt an LLM with all of the generated artefacts, including tool definitions, to generate a conversation in the following format:
```
</user> Hey, can you help me plan a spontaneous weekend getaway for two?
</assistant> I'd love to! To get started, what dates are you thinking of for this weekend, and where will you be traveling from?
<tool>
{
"name": "get_current_location",
"arguments": {}
}
</tool>
<tool_response>
{
"city": "Paris",
"country": "France"
}
</tool_response>
</assistant> And are there any particular vibes you're going for? For example, are you looking for a romantic trip, a culinary adventure, a historical deep-dive, or something else?
</user> Let's do next weekend, so that would be Friday, October 3rd to Sunday, October 5th. We'll be leaving from my current location. We're both big foodies and love history.
```
We call this the Dialog Markup Language (DML).
In practice, just prompting doesn't work well. Even the strongest closed-LLMs (GPT-5-High, Gemini 2.5 Pro, Deepseek v3.2) will generate conversations that don't obey basic conversation flow. Some examples:
- Each `<tool_call>` must have its corresponding `<tool_response>`.
- Each `</user>` message should be followed by either a `</assistant>` message or a `<tool_call>`.
- `<tool_response>` must always be followed by either `</assistant>` or `<tool_call>`.
- etc.
A simple post-generation validation step to reject bad conversations doesn't work -- more than 90% of generations are rejected! To solve this, we use Context-free Grammar-based constrained decoding to force the LLM to generate valid conversations in DML format. The grammar rules are defined here (without assistant thinking) and here. Read more: vLLM Structured Outputs and llama.cpp.
This comes with its own problems, though: constrained decoding is CPU-bound because the logit mask generation happens on the CPU. As a result, the GPU sits idle for the majority of the time, making generation extremely slow.
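To make the flow rules above concrete, here is a toy checker over a simplified sequence of message tags. It only illustrates the kind of post-hoc validation discussed above (the allowed transitions are simplified assumptions); the actual pipeline enforces these constraints with grammar-constrained decoding instead.

```python
# Toy validator for the DML flow rules described above -- a sketch only; the real
# pipeline enforces the constraints with a context-free grammar during decoding.
def is_valid_flow(turns: list[str]) -> bool:
    """`turns` is a sequence of tags, e.g. ["user", "assistant", "tool_call", "tool_response", ...]."""
    allowed_after = {
        "user": {"assistant", "tool_call"},           # user -> assistant reply or tool call
        "tool_call": {"tool_response"},               # every tool call needs its response
        "tool_response": {"assistant", "tool_call"},  # response -> assistant text or another call
        "assistant": {"user", "tool_call"},           # simplified assumption
    }
    return all(nxt in allowed_after.get(prev, set()) for prev, nxt in zip(turns, turns[1:]))

print(is_valid_flow(["user", "assistant", "tool_call", "tool_response", "assistant"]))  # True
print(is_valid_flow(["user", "tool_call", "assistant"]))  # False: missing tool_response
```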
If you have an OpenAPI 3.x specification, you can automatically convert it to the tool definition format:
```bash
python3 examples/utils/parse_openapi_spec.py \
  --input path/to/openapi.json \
  --output path/to/output.json \
  --toolkit-name your_toolkit_name
```
Note: The converter has limitations (nested objects are simplified, response schemas are merged). Review the output before using in production.
- Prompt the LLM with all artefacts to generate a short system prompt for each conversation.
- We recommend a "drop-out"-like setup for training so that the finetuned model doesn't rely too much on the system prompt, but remains steerable.
- We use an LLM-as-Judge approach to score and filter conversations.
- We use a comprehensive rubric and only keep conversations that score 4 or above on the `tool_use_quality` metric (see the sketch below).
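In code, the filtering step boils down to keeping records that meet the score threshold. The sketch below assumes a top-level `tool_use_quality` field in `outputs/scored_conversations.jsonl`; adapt it to the real schema.

```python
import json

# Sketch of the "keep score >= 4" filter -- the field layout of
# scored_conversations.jsonl is an assumption here; adapt to the real schema.
with open("outputs/scored_conversations.jsonl") as src, \
     open("outputs/accepted_conversations.jsonl", "w") as dst:
    for line in src:
        record = json.loads(line)
        if record.get("tool_use_quality", 0) >= 4:  # hypothetical top-level field
            dst.write(json.dumps(record) + "\n")
```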
- If a tool expects file content, the pipeline will produce tool calls containing arguments like `"content": "<base64-encoded-content>"`.
- Deduplication using vector embeddings is not good enough. We've noticed both false positives and false negatives, even with Qwen3-Reranker-8B.
- Pipeline performance is poor. It takes about 15 hours to generate 1700 conversations on 4 H200 GPUs.
- Conversation realism: Although the conversations are better than any public offering, they still don't feel very natural. Idea: use a better user simulator such as UserLM-8b.
- We didn't try optimizing the prompts at all.
- This codebase was refactored for release and has not been tested thoroughly in its current form. If you find a bug, please open an issue and we will address it as soon as possible.
- No support for parallel function calling, only multi-step (sequential) calls.
- There is no rejection of examples apart from the embedding-based dedup -- using n-gram overlap or ROUGE scores to dedup conversations is a possible extension (see the sketch after this list).
- Generated conversations are "happy paths" -> they don't teach the model to recover from errors.
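For the n-gram idea above, a dedup pass could look roughly like the following; the trigram size and overlap threshold are arbitrary illustrative choices.

```python
# Rough sketch of n-gram-overlap dedup, as suggested above -- tokenization,
# n-gram size, and threshold are all illustrative choices, not tuned values.
def ngrams(text: str, n: int = 3) -> set[tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def too_similar(a: str, b: str, threshold: float = 0.5) -> bool:
    na, nb = ngrams(a), ngrams(b)
    if not na or not nb:
        return False
    return len(na & nb) / len(na | nb) >= threshold  # Jaccard overlap of trigrams

kept: list[str] = []
for convo in ["draft an email to the team", "draft an email to the whole team", "list my drive folders"]:
    if not any(too_similar(convo, existing) for existing in kept):
        kept.append(convo)
print(kept)  # the second, near-duplicate conversation is dropped
```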
We build on Composio.dev's list of tool definitions to power this project.
The codebase, dataset, and models are licensed under the MIT License.