diff --git a/authors.yaml b/authors.yaml index ec7476bd74..823483db8d 100644 --- a/authors.yaml +++ b/authors.yaml @@ -2,6 +2,11 @@ # You can optionally customize how your information shows up cookbook.openai.com over here. # If your information is not present here, it will be pulled from your GitHub profile. +b-per: + name: "Benoit Perigaud" + website: "https://www.linkedin.com/in/benoit-perigaud/" + avatar: "https://avatars.githubusercontent.com/u/8754100?v=4" + WJPBProjects: name: "Wulfie Bain" website: "https://www.linkedin.com/in/wulfie-bain/" diff --git a/examples/Trusting_your_data_using_Agents_SDK_and_dbt_MCP_server.ipynb b/examples/Trusting_your_data_using_Agents_SDK_and_dbt_MCP_server.ipynb new file mode 100644 index 0000000000..09e85bf94d --- /dev/null +++ b/examples/Trusting_your_data_using_Agents_SDK_and_dbt_MCP_server.ipynb @@ -0,0 +1,726 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Trusting your data using the Agents SDK and the dbt MCP server\n", + "\n", + "Many AI workflows require the ability to access and query structured data in a database. \n", + "\n", + "[dbt](https://www.getdbt.com/product/what-is-dbt) is the modern standard for data transformation. The dbt framework and platform are used by thousands of teams wanting to scale their data lifecycle leveraging the [Analytics Developer Lifecycle (ADLC)](https://www.getdbt.com/resources/the-analytics-development-lifecycle) . 
With the ADLC, developers can plan, develop, test, document, and deploy trusted data assets including metrics.\n", + "\n", + "You can use OpenAI APIs on top of [the dbt MCP server](https://docs.getdbt.com/docs/dbt-ai/about-mcp) to:\n", + "\n", + "- Discover which assets exist in your dbt project and metadata about them (descriptions, columns etc)\n", + "- Accurately and consistently query your most important metrics using the dbt Semantic Layer\n", + "\n", + "The dbt MCP server is available [both locally hosted and remotely hosted via dbt Platform](https://docs.getdbt.com/docs/dbt-ai/about-mcp). The following demonstration uses the remote MCP server. To follow along with the examples, you can use your existing dbt account if you have one, or if not, you can set one up with a sample jaffle shop dataset following instructions in [this repo](https://github.com/dbt-labs/jaffle-shop)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. Data Discovery with Agents SDK + dbt MCP server\n", + "\n", + "Before we ask an LLM to write SQL, it needs a reliable map of the warehouse. In this section, you'll connect OpenAI's Agents SDK to the **dbt MCP server** so the agent can *browse your trusted dbt assets*—not raw schemas or ad-hoc tables. 
Using dbt Cloud's metadata (models, descriptions, columns, and lineage), the agent can list marts, inspect model details, and trace dependencies to answer questions like \"what should I query to learn about my customers?\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "![Data Discovery Process Flow](../images/trusted_data_process_flow_openai_dbt_mcp.png)\n", + "\n", + "*The diagram above shows how the OpenAI Agents SDK connects to the dbt MCP server for data discovery.*" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To call the remote MCP server from the OpenAI Agents SDK, we need to install the SDK via pip in a Python virtual environment by running the following:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!pip install openai-agents" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We also need to set the following environment variables\n", + "\n", + "| **Variable** | **Description** |\n", + "| --- | --- |\n", + "| `OPENAI_API_KEY` | Your OpenAI API key |\n", + "| `DBT_PROD_ENV_ID` | The environment ID of your Production environment. 
When connected to the dbt platform, you can see it in the URL |\n", + "| `DBT_TOKEN` | A service token or personal access token with access to the dbt environment |\n", + "| `DBT_HOST` | The host of your dbt account if different from `cloud.getdbt.com` |" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The following code allows the OpenAI agents SDK to connect to the remote dbt MCP server and to leverage the following data discovery tools\n", + "\n", + "- `get_mart_models` - Gets all mart models\n", + "- `get_all_models` - Gets all models\n", + "- `get_model_details` - Gets details for a specific model\n", + "- `get_model_parents` - Gets parent nodes of a specific model\n", + "- `get_model_children` - Gets children models of a specific model" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import asyncio\n", + "import os\n", + "\n", + "from agents import Agent, Runner, trace\n", + "from agents.mcp import create_static_tool_filter\n", + "from agents.mcp.server import MCPServerStreamableHttp\n", + "\n", + "async def main():\n", + " prod_environment_id = os.environ.get(\"DBT_PROD_ENV_ID\", os.getenv(\"DBT_ENV_ID\"))\n", + " token = os.environ.get(\"DBT_TOKEN\")\n", + " host = os.environ.get(\"DBT_HOST\", \"cloud.getdbt.com\")\n", + "\n", + " async with MCPServerStreamableHttp(\n", + " name=\"dbt\",\n", + " params={\n", + " \"url\": f\"https://{host}/api/ai/v1/mcp/\",\n", + " \"headers\": {\n", + " \"Authorization\": f\"token {token}\",\n", + " \"x-dbt-prod-environment-id\": prod_environment_id,\n", + " },\n", + " },\n", + " client_session_timeout_seconds=20,\n", + " cache_tools_list=True,\n", + " tool_filter=create_static_tool_filter(\n", + " allowed_tool_names=[\n", + " \"get_mart_models\",\n", + " \"get_all_models\",\n", + " \"get_model_details\",\n", + " \"get_model_parents\",\n", + " \"get_model_children\",\n", + " ],\n", + " ),\n", + " ) as server:\n", + " agent = 
Agent(\n", + " name=\"Assistant\",\n", + " instructions=\"Use the tools to answer the user's questions\",\n", + " mcp_servers=[server],\n", + " model=\"gpt-5\", \n", + " )\n", + " with trace(workflow_name=\"Conversation\"):\n", + " conversation = []\n", + " result = None\n", + " while True:\n", + " if result:\n", + " conversation = result.to_input_list()\n", + " conversation.append({\"role\": \"user\", \"content\": input(\"User > \")})\n", + " result = await Runner.run(agent, conversation)\n", + " print(result.final_output)\n", + "\n", + "if __name__ == \"__main__\":\n", + " try:\n", + " asyncio.run(main())\n", + " except KeyboardInterrupt:\n", + " print(\"\\nExiting.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If we run `python openai_agents_dbt.py` , we can then start a conversation with the LLM and get information about our data.\n", + "\n", + "> User > What are my marts models in my warehouse?\n", + "> \n", + "\n", + "> Answer:\n", + "> \n", + "\n", + "```bash\n", + "1. **Customers**\n", + " - **Unique ID:** model.jaffle_semantic_layer_testing.customers\n", + " - **Description:** Customer overview data mart, offering key details for each unique customer. One row per customer.\n", + "\n", + "2. **Order Items**\n", + " - **Unique ID:** model.jaffle_semantic_layer_testing.order_items\n", + " - **Description:** (No description provided)\n", + "\n", + "3. **Orders**\n", + " - **Unique ID:** model.jaffle_semantic_layer_testing.orders\n", + " - **Description:** Order overview data mart, offering key details for each order including if it's a customer's first order and a food vs. drink item breakdown. 
One row per order.\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This response returns detailed information about the models in our project which are best suited for user consumption (marts).\n", + "\n", + "We can continue the conversation to get more details about specific data assets.\n", + "\n", + "> User > Give me more information about the Customers table\n", + "> \n", + "\n", + "> Answer:\n", + "> \n", + "\n", + "```\n", + "Here's more information about the **Customers** table:\n", + "\n", + "### Description\n", + "Customer overview data mart, offering key details for each unique customer. One row per customer.\n", + "\n", + "### Database Details\n", + "- **Database:** analytics\n", + "- **Schema:** analytics\n", + "\n", + "### Compiled SQL\n", + "The `Customers` model is built with several CTEs that pull from staging tables and aggregate data:\n", + "\n", + "```sql\n", + "with\n", + "customers as (\n", + " select * from analytics.analytics.stg_customers\n", + "),\n", + "orders_mart as (\n", + " select * from analytics.analytics.orders\n", + "),\n", + "order_items_mart as (\n", + " select * from analytics.analytics.order_items\n", + "),\n", + "order_summary as (\n", + " select\n", + " customer_id,\n", + " count(distinct om.order_id) as count_lifetime_orders,\n", + " count(distinct om.order_id) > 1 as is_repeat_buyer,\n", + " min(om.ordered_at) as first_ordered_at,\n", + " max(om.ordered_at) as last_ordered_at,\n", + " sum(oi.subtotal) as lifetime_spend_pretax,\n", + " sum(om.order_total) as lifetime_spend\n", + " from orders_mart om\n", + " left join order_items_mart oi on om.order_id = oi.order_id\n", + " group by 1\n", + "),\n", + "joined as (\n", + " select\n", + " customers.*,\n", + " order_summary.count_lifetime_orders,\n", + " order_summary.first_ordered_at,\n", + " order_summary.last_ordered_at,\n", + " order_summary.lifetime_spend_pretax,\n", + " order_summary.lifetime_spend,\n", + " case\n", + " when 
order_summary.is_repeat_buyer then 'returning'\n", + " else 'new'\n", + " end as customer_type\n", + " from customers\n", + " left join order_summary\n", + " on customers.customer_id = order_summary.customer_id\n", + ")\n", + "select * from joined\n", + "```\n", + "\n", + "### Columns\n", + "- **CUSTOMER_ID** (TEXT): The unique key of the orders mart.\n", + "- **CUSTOMER_NAME** (TEXT): Customers' full name.\n", + "- **COUNT_LIFETIME_ORDERS** (NUMBER): Total number of orders a customer has ever placed.\n", + "- **FIRST_ORDERED_AT** (TIMESTAMP_NTZ): The timestamp when a customer placed their first order.\n", + "- **LAST_ORDERED_AT** (TIMESTAMP_NTZ): The timestamp of a customer's most recent order.\n", + "- **LIFETIME_SPEND_PRETAX** (NUMBER): The sum of all the pre-tax subtotals of every order a customer has placed.\n", + "- **LIFETIME_SPEND** (NUMBER): The sum of all the order totals (including tax) that a customer has ever placed.\n", + "- **CUSTOMER_TYPE** (TEXT): Options are 'new' or 'returning', indicating if a customer has ordered more than once or has only placed their first order to date.\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This answer lists all the columns contained in our table as well as their descriptions. This information can be useful to understand the actual definition of what information is stored in which column.\n", + "\n", + "Finally, we can ask another question to the LLM to understand the lineage of the `orders` table.\n", + "\n", + "> User > What tables do the `orders` table get information from?\n", + "> \n", + "\n", + "> Answer:\n", + "> \n", + "\n", + "```bash\n", + "The **Orders** table gets its information from the following tables:\n", + "\n", + "1. **stg_order_items**\n", + " - **Description:** Individual food and drink items that make up our orders, one row per item.\n", + "\n", + "2. 
**stg_orders**\n", + " - **Description:** Order data with basic cleaning and transformation applied, one row per order.\n", + "\n", + "3. **stg_products**\n", + " - **Description:** Product (food and drink items that can be ordered) data with basic cleaning and transformation applied, one row per product.\n", + "\n", + "4. **stg_supplies**\n", + " - **Description:** List of our supply expenses data with basic cleaning and transformation applied. One row per supply cost, not per supply. As supply costs fluctuate, they receive a new row with a new UUID, resulting in multiple rows per supply_id.\n", + "```\n", + "\n", + "This answer shows the `orders` table actually merges data related to orders, order items, products, and supplies." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. Querying governed metrics with the dbt Semantic Layer\n", + "\n", + "Now that the agent can *find* trusted data, let's help it *answer questions* with consistency. In this section, we connect OpenAI's Agents SDK to the **dbt Semantic Layer via the MCP server**, so natural-language questions resolve to governed metrics—not ad-hoc SQL. The agent discovers available metrics and dimensions, generates a **Semantic Layer query**, and SL deterministically compiles it to valid warehouse SQL. You get the same answer every time, aligned to your definitions, with the option to inspect the compiled SQL for transparency.\n", + "\n", + "A common approach to query data warehouses from interacting with AI is to have the LLM generate SQL based on user inputs in natural language. 
This is usually described as text-to-SQL.\n", + "\n", + "In some cases, the LLM will generate exactly the query required to answer the user's question. In others, the generated SQL is invalid, or it returns data that does not match the user's expectations or the agreed definitions of some metrics.\n", + "\n", + "Several things can cause this:\n", + "\n", + "- the resulting query might join tables that aren't supposed to be joined together\n", + "- the generated SQL query does not comply with the data warehouse's syntax\n", + "- the LLM generates different queries for the same question\n", + "- it is difficult for users to query specific metrics in natural language when those metrics carry business-specific logic. Example: the team's definition of new customers excludes trial customers" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "![Traditional Text-to-SQL Process](../images/trusted_data_process_flow_openai_data_warehouse.png)\n", + "\n", + "*This diagram illustrates the traditional text-to-SQL approach where the LLM directly generates SQL queries for the data warehouse.*" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "While you can improve the text-to-SQL flow by providing additional context to the LLM, there are some circumstances where pairing it with a deterministic system (a Semantic Layer) is preferable." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### An introduction to the dbt Semantic Layer\n", + "\n", + "The dbt Semantic Layer lets you define and query metrics and measures on transformed warehouse data.\n", + "\n", + "To define metrics, users add specific YAML configurations to their dbt project. The metrics will include information about joining data across tables and are version-controlled in Git, like any dbt component. The specification is flexible enough to define exactly how a metric should be calculated. 
For example, you could define a \"number of paid customers\" metric as the count of rows in the table `dim_customers` where the customer didn't cancel during the trial period.\n", + "\n", + "Once metrics are defined and agreed upon, the dbt Semantic Layer can be queried directly, without the need to specify what tables to get data from, or what SQL to write. The SQL is dynamically generated by the Semantic Layer according to the configuration. The metrics are accessible by [calling APIs](https://docs.getdbt.com/docs/use-dbt-semantic-layer/consume-metrics#query-with-apis) or by using [BI tools with built-in integrations with the dbt Semantic Layer](https://docs.getdbt.com/docs/cloud-integrations/avail-sl-integrations). \n", + "\n", + "In the past, the integration with BI tools was the primary use case for the Semantic Layer, but increasingly the Semantic Layer is used in AI workflows.\n", + "\n", + "Below is a short example showing how to define measures in the Semantic Layer." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "```yaml\n", + "semantic_models:\n", + " - name: orders\n", + " defaults:\n", + " agg_time_dimension: order_date\n", + " description: Order fact table. 
This table's grain is one row per order.\n", + " model: ref('fct_orders')\n", + " entities:\n", + " - name: order_id\n", + " type: primary\n", + " - name: customer\n", + " expr: customer_id\n", + " type: foreign\n", + " dimensions:\n", + " - name: order_date\n", + " type: time\n", + " type_params:\n", + " time_granularity: day\n", + " measures: \n", + " - name: order_total\n", + " description: The total amount for each order including taxes.\n", + " agg: sum\n", + " expr: amount\n", + " - name: order_count\n", + " expr: 1\n", + " agg: sum\n", + " - name: customers_with_orders\n", + " description: Distinct count of customers placing orders\n", + " agg: count_distinct\n", + " expr: customer_id\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Integrating the Agents SDK and the dbt Semantic Layer via the dbt MCP server\n", + "\n", + "The Semantic Layer can serve as an intermediary, deterministic step between the agent and the database." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "![Semantic Layer Integration Process](../images/trusted_data_process_flow_openai_dbt_semantic_layer.png)\n", + "\n", + "*This diagram shows the improved approach using the dbt Semantic Layer as an intermediary between OpenAI and the data warehouse, ensuring consistent and governed metrics.*" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To create this flow, leveraging the OpenAI Agents SDK and GPT 5, we need to install the `openai-agents` package" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!pip install openai-agents" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "and set the following environment variables\n", + "\n", + "| **Variable** | **Description** |\n", + "| --- | --- |\n", + "| `OPENAI_API_KEY` | Your OpenAI API key |\n", + "| `DBT_PROD_ENV_ID` | The environment ID of your Production 
environment. When connected to the dbt platform, you can see it in the URL |\n", + "| `DBT_TOKEN` | A service token or personal access token with access to the dbt environment |\n", + "| `DBT_HOST` | The host of your dbt account if different from `cloud.getdbt.com` |" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Copy/paste this code locally in `openai_dbt_semantic_layer.py` and run `python openai_dbt_semantic_layer.py` to interact with OpenAI and get results from the dbt Semantic Layer.\n", + "\n", + "In this example, we connect again to the dbt MCP server but we activate a different set of tools:\n", + "\n", + "- `list_metrics` - Retrieves all defined metrics\n", + "- `get_dimensions` - Gets dimensions associated with specified metrics\n", + "- `get_entities` - Gets entities associated with specified metrics\n", + "- `query_metrics` - Queries metrics with optional grouping, ordering, filtering, and limiting\n", + "- `get_metrics_compiled_sql` - Gets and returns the compiled SQL that would be generated for specified metrics and groupings without executing the query\n", + "\n", + "By inspecting `stream_events()` we capture the tools being called, including which parameters are sent, and print them in the terminal." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import asyncio\n", + "import os\n", + "\n", + "from agents import Agent, Runner, trace\n", + "from agents.mcp import create_static_tool_filter\n", + "from agents.mcp.server import MCPServerStreamableHttp\n", + "from agents.stream_events import RawResponsesStreamEvent, RunItemStreamEvent\n", + "from openai.types.responses import ResponseCompletedEvent, ResponseOutputMessage\n", + "\n", + "def print_tool_call(tool_name, params, color=\"yellow\", show_params=True):\n", + " # Define color codes for different colors\n", + " # we could use a library like colorama but this avoids adding a dependency\n", + " color_codes = {\n", + " \"grey\": \"\\033[37m\",\n", + " \"yellow\": \"\\033[93m\",\n", + " }\n", + " color_code_reset = \"\\033[0m\"\n", + "\n", + " color_code = color_codes.get(color, color_codes[\"yellow\"])\n", + " msg = f\"Calling the tool {tool_name}\"\n", + " if show_params:\n", + " msg += f\" with params {params}\"\n", + " print(f\"{color_code}# {msg}{color_code_reset}\")\n", + "\n", + "def handle_event_printing(event, show_tool_calls=True):\n", + " if type(event) == RunItemStreamEvent and show_tool_calls:\n", + " if event.name == \"tool_called\":\n", + " print_tool_call(\n", + " event.item.raw_item.name,\n", + " event.item.raw_item.arguments,\n", + " color=\"grey\",\n", + " show_params=True,\n", + " )\n", + "\n", + " if type(event) == RawResponsesStreamEvent:\n", + " if type(event.data) == ResponseCompletedEvent:\n", + " for output in event.data.response.output:\n", + " if type(output) == ResponseOutputMessage:\n", + " print(output.content[0].text)\n", + "\n", + "async def main():\n", + " prod_environment_id = os.environ.get(\"DBT_PROD_ENV_ID\", os.getenv(\"DBT_ENV_ID\"))\n", + " token = os.environ.get(\"DBT_TOKEN\")\n", + " host = os.environ.get(\"DBT_HOST\", \"cloud.getdbt.com\")\n", + "\n", + " async with MCPServerStreamableHttp(\n", + " 
name=\"dbt\",\n", + " params={\n", + " \"url\": f\"https://{host}/api/ai/v1/mcp/\",\n", + " \"headers\": {\n", + " \"Authorization\": f\"token {token}\",\n", + " \"x-dbt-prod-environment-id\": prod_environment_id,\n", + " },\n", + " },\n", + " client_session_timeout_seconds=20,\n", + " cache_tools_list=True,\n", + " tool_filter=create_static_tool_filter(\n", + " allowed_tool_names=[\n", + " \"list_metrics\",\n", + " \"get_dimensions\",\n", + " \"get_entities\",\n", + " \"query_metrics\",\n", + " \"get_metrics_compiled_sql\",\n", + " ],\n", + " ),\n", + " ) as server:\n", + " agent = Agent(\n", + " name=\"Assistant\",\n", + " instructions=\"Use the tools to answer the user's questions. Do not invent data or sample data.\",\n", + " mcp_servers=[server],\n", + " model=\"gpt-5\",\n", + " )\n", + " with trace(workflow_name=\"Conversation\"):\n", + " conversation = []\n", + " result = None\n", + " while True:\n", + " if result:\n", + " conversation = result.to_input_list()\n", + " conversation.append({\"role\": \"user\", \"content\": input(\"User > \")})\n", + "\n", + " async for event in Runner.run_streamed(\n", + " agent, conversation\n", + " ).stream_events():\n", + " handle_event_printing(event, show_tool_calls=True)\n", + "\n", + "if __name__ == \"__main__\":\n", + " try:\n", + " asyncio.run(main())\n", + " except KeyboardInterrupt:\n", + " print(\"\\nExiting.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Example user interaction: \n", + "\n", + "> User > What is the revenue per month?\n", + "> \n", + "\n", + "> Answer:\n", + "> \n", + "\n", + "```\n", + "# Calling the tool list_metrics with params {}\n", + "# Calling the tool get_dimensions with params {\"metrics\":[\"revenue\"]}\n", + "# Calling the tool query_metrics with params {\"metrics\":[\"revenue\"],\"group_by\":[{\"name\":\"metric_time\",\"type\":\"time_dimension\",\"grain\":\"MONTH\"}],\"order_by\":[{\"name\":\"metric_time\",\"descending\":true}],\"limit\":12}\n", + "\n", 
+ "Here are the monthly revenues for the last 12 months (latest first):\n", + "\n", + "- 2025-08: 102,379\n", + "- 2025-07: 90,396\n", + "- 2025-06: 93,683\n", + "- 2025-05: 91,388\n", + "- 2025-04: 79,246\n", + "- 2025-03: 70,218\n", + "- 2025-02: 40,906\n", + "- 2025-01: 42,472\n", + "- 2024-12: 33,660\n", + "- 2024-11: 26,338\n", + "- 2024-10: 20,684\n", + "- 2024-09: 17,032\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "> User > What dimensions can we slice the revenue by?\n", + "> \n", + "\n", + "> Answer:\n", + "> \n", + "\n", + "```\n", + "# Calling the tool list_metrics with params {}\n", + "# Calling the tool get_dimensions with params {\"metrics\":[\"revenue\"]}\n", + "# Calling the tool get_entities with params {\"metrics\":[\"revenue\"]}\n", + "\n", + "Here are the dimensions you can slice Revenue by, as exposed in the semantic layer. I've grouped them by general category and noted the type when helpful.\n", + "\n", + "Time-based\n", + "- metric_time (TIME) with granularities: DAY, WEEK, MONTH, QUARTER, YEAR\n", + "- order_id__customer__first_ordered_at (TIME)\n", + "- order_id__customer__last_ordered_at (TIME)\n", + "- order_id__location__opened_at (TIME)\n", + "- order_id__ordered_at (TIME)\n", + "- order_item__ordered_at (TIME)\n", + "\n", + "Customer-related (categorical)\n", + "- order_id__customer__customer_name\n", + "- order_id__customer__customer_type\n", + "- order_id__customer__count_lifetime_orders\n", + "- order_id__customer__lifetime_spend\n", + "\n", + "Location-related\n", + "- order_id__location__location_name\n", + "- order_id__location__tax_rate\n", + "- order_id__location__opened_at (also listed under time-based)\n", + "\n", + "Product-related\n", + "- product__product_name\n", + "- product__product_type\n", + "- product__is_food_item\n", + "- product__is_drink_item\n", + "- product__product_price\n", + "- product__product_description\n", + "\n", + "Order-related\n", + "- 
order_id__order_total_dim\n", + "- order_id__is_food_order\n", + "- order_id__is_drink_order\n", + "- order_id__ordered_at (also listed under time-based)\n", + "\n", + "Other\n", + "- locations (a general/simple dimension representing locations)\n", + "\n", + "Would you like me to run a quick example query slicing Revenue by one of these dimensions (e.g., Revenue by metric_time month, or Revenue by customer_name), so you can see the results? If so, tell me which dimension you want to start with.\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "> User > Between March 2025 and August 2025, give me the revenue by month, by product type and by location name\n", + "> \n", + "\n", + "> Answer:\n", + "> \n", + "\n", + "```\n", + "# Calling the tool list_metrics with params {}\n", + "# Calling the tool get_dimensions with params {\"metrics\":[\"revenue\"]}\n", + "# Calling the tool get_entities with params {\"metrics\":[\"revenue\"]}\n", + "# Calling the tool query_metrics with params {\"metrics\":[\"revenue\"],\"group_by\":[{\"name\":\"metric_time\",\"type\":\"time_dimension\",\"grain\":\"MONTH\"},{\"name\":\"product__product_type\",\"type\":\"dimension\",\"grain\":null},{\"name\":\"order_id__location__location_name\",\"type\":\"dimension\",\"grain\":null}],\"order_by\":[{\"name\":\"metric_time\",\"descending\":false},{\"name\":\"product__product_type\",\"descending\":false},{\"name\":\"order_id__location__location_name\",\"descending\":false}],\"where\":\"{{ TimeDimension('metric_time', 'MONTH') }} >= '2025-03-01' AND {{ TimeDimension('metric_time', 'MONTH') }} < '2025-09-01'\"}\n", + "\n", + "Here's revenue for 2025, sliced by month, product type, and location (Brooklyn and Philadelphia). 
Data is pulled from the Revenue metric grouped by:\n", + "- metric_time (MONTH)\n", + "- product__product_type\n", + "- order_id__location__location_name\n", + "\n", + "August 2025\n", + "- beverage, Brooklyn: 32,053\n", + "- beverage, Philadelphia: 24,703\n", + "- jaffle, Brooklyn: 25,365\n", + "- jaffle, Philadelphia: 20,258\n", + "\n", + "July 2025\n", + "- beverage, Brooklyn: 26,394\n", + "- beverage, Philadelphia: 22,915\n", + "- jaffle, Brooklyn: 21,655\n", + "- jaffle, Philadelphia: 19,432\n", + "\n", + "June 2025\n", + "- beverage, Brooklyn: 25,496\n", + "- beverage, Philadelphia: 25,032\n", + "- jaffle, Brooklyn: 20,639\n", + "- jaffle, Philadelphia: 22,516\n", + "\n", + "May 2025\n", + "- beverage, Brooklyn: 21,876\n", + "- beverage, Philadelphia: 27,374\n", + "- jaffle, Brooklyn: 18,010\n", + "- jaffle, Philadelphia: 24,128\n", + "\n", + "April 2025\n", + "- beverage, Brooklyn: 18,476\n", + "- beverage, Philadelphia: 23,774\n", + "- jaffle, Brooklyn: 15,582\n", + "- jaffle, Philadelphia: 21,414\n", + "\n", + "March 2025\n", + "- beverage, Brooklyn: 13,142\n", + "- beverage, Philadelphia: 25,130\n", + "- jaffle, Brooklyn: 10,760\n", + "- jaffle, Philadelphia: 21,186\n", + "\n", + "Would you like this as a CSV or Excel file for download, or should I format it as a pivot-ready table (e.g., months as rows and a column per product/location combination)?\n", + "```\n", + "\n", + "OpenAI even suggests generating a CSV so that we could leverage this data in other tools." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. Try it yourself\n", + "\n", + "By connecting an OpenAI agent to dbt Cloud through the dbt MCP server, you've seen how to give an LLM both the context it needs to **discover trusted assets** and the guardrails to **query governed metrics** via the Semantic Layer. 
This approach keeps answers consistent with your team's definitions, while still letting you ask questions in plain language.\n", + "\n", + "You can adapt the same pattern to other integration options - local or remote MCP, the Responses API, or future ChatGPT integrations - without changing the core logic. With a small amount of setup, you can start embedding these capabilities into internal tools, notebooks, or automations, making it easier for everyone to explore and trust your organization's data." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.5" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/images/trusted_data_process_flow_openai_data_warehouse.png b/images/trusted_data_process_flow_openai_data_warehouse.png new file mode 100644 index 0000000000..3be28a0a33 Binary files /dev/null and b/images/trusted_data_process_flow_openai_data_warehouse.png differ diff --git a/images/trusted_data_process_flow_openai_dbt_mcp.png b/images/trusted_data_process_flow_openai_dbt_mcp.png new file mode 100644 index 0000000000..9583f266de Binary files /dev/null and b/images/trusted_data_process_flow_openai_dbt_mcp.png differ diff --git a/images/trusted_data_process_flow_openai_dbt_semantic_layer.png b/images/trusted_data_process_flow_openai_dbt_semantic_layer.png new file mode 100644 index 0000000000..cf7f394ff3 Binary files /dev/null and b/images/trusted_data_process_flow_openai_dbt_semantic_layer.png differ diff --git a/registry.yaml b/registry.yaml index 1274794e6c..89bef84e81 100644 --- a/registry.yaml +++ b/registry.yaml @@ -4,6 +4,7 @@ # should build pages for, and indicates metadata such as tags, creation date and # authors for each 
page. + - title: "Fine-tune gpt-oss for better Korean language performance" path: articles/gpt-oss/fine-tune-korean.ipynb date: 2025-08-26 @@ -14,7 +15,17 @@ tags: - gpt-oss - open-models - + +- title: Trusting your data using the Agents SDK and the dbt MCP server + path: examples/Trusting_your_data_using_Agents_SDK_and_dbt_MCP_server.ipynb + date: 2025-08-20 + authors: + - b-per + tags: + - agents-sdk + - gpt-5 + - mcp + - title: Verifying gpt-oss implementations path: articles/gpt-oss/verifying-implementations.md date: 2025-08-11