8 changes: 8 additions & 0 deletions benchmark/Procfile
@@ -0,0 +1,8 @@
# Procfile

# NeMo Guardrails server
gr: poetry run nemoguardrails server --config ../examples/configs/content_safety_local --default-config-id content_safety_local --port 9000

# Guardrails NIMs for inference. PYTHONPATH is set to the project root so absolute imports work
app_llm: PYTHONPATH=.. python mock_llm_server/run_server.py --workers 4 --port 8000 --config-file mock_llm_server/configs/meta-llama-3.3-70b-instruct.env
cs_llm: PYTHONPATH=.. python mock_llm_server/run_server.py --workers 4 --port 8001 --config-file mock_llm_server/configs/nvidia-llama-3.1-nemoguard-8b-content-safety.env
71 changes: 44 additions & 27 deletions nemoguardrails/benchmark/README.md → benchmark/README.md
@@ -14,29 +14,40 @@ All models use the [Mock LLM Server](mock_llm_server), which is a simplified mod
The aim of this benchmark is to detect performance regressions as quickly as running unit tests.

## Quickstart: Running Guardrails with Mock LLMs

To run Guardrails with mocks for both the content-safety and main LLMs, follow the steps below.
They assume you already have a working environment after following the steps in [CONTRIBUTING.md](../CONTRIBUTING.md).
All commands must be run in the `benchmark` directory.

### 1. Set up benchmarking virtual environment

The benchmarking tools have their own dependencies, which are managed using a virtual environment, pip, and the [requirements.txt](requirements.txt) file.
In this section, you'll create a new virtual environment, activate it, and install all the dependencies needed to benchmark Guardrails.

First you'll create the virtual environment and install dependencies.

```shell
# Create a virtual environment under ~/env/benchmark_env and activate it

$ cd benchmark
$ mkdir -p ~/env
$ python -m venv ~/env/benchmark_env
$ source ~/env/benchmark_env/bin/activate

# Install the benchmark dependencies into the virtual environment
(benchmark_env) $ pip install -r requirements.txt
...
Successfully installed fastapi-0.128.0 honcho-2.0.0 httpx-0.28.1 langchain-core-1.2.5 numpy-2.4.0 pydantic-2.12.5 pydantic-core-2.41.5 pydantic-settings-2.12.0 pyyaml-6.0.3 typer-0.21.0 typing-inspection-0.4.2 uuid-utils-0.12.0 uvicorn-0.40.0
(benchmark_env) $
```
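
Optionally, you can confirm that the tools landed in the new environment rather than your system Python. This is just a sanity check using `pip` from the activated `benchmark_env`:

```shell
# Confirm honcho and the mock-server dependencies resolved into benchmark_env
(benchmark_env) $ pip show honcho fastapi uvicorn | grep -E '^(Name|Version|Location)'
```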

### 2. Run Guardrails with Mock LLMs for Content-Safety and Application LLM

Now we can start up the processes that are part of the [Procfile](Procfile).
As the Procfile processes spin up, they log to the console with a prefix. The `system` prefix is used by Honcho, `app_llm` is the Application or Main LLM mock, `cs_llm` is the content-safety mock, and `gr` is the Guardrails service. We'll explore the Procfile in more detail below.
Once the three 'Uvicorn running on ...' messages are printed, you can move to the next step. Note that these messages will likely not appear on consecutive lines.

```shell
# These commands must be run in the benchmark directory after activating the benchmark_env virtual environment

(benchmark_env) $ honcho start
13:40:33 system | gr.1 started (pid=93634)
13:40:33 system | app_llm.1 started (pid=93635)
13:40:33 system | cs_llm.1 started (pid=93636)
@@ -48,34 +59,34 @@ $ poetry run honcho start
13:40:45 gr.1 | INFO: Uvicorn running on http://0.0.0.0:9000 (Press CTRL+C to quit)
```

### 3. Validate services are running correctly

Once Guardrails and the mock servers are up, we'll use the [validate_mocks.sh](scripts/validate_mocks.sh) script to check that everything is working.
This doesn't require the `benchmark_env` virtual environment, since the script only runs `curl` commands.

```shell
# In a new shell, change into the benchmark directory and run these commands.

$ cd benchmark
$ scripts/validate_mocks.sh
Starting LLM endpoint health check...

--- Checking Port: 8000 ---
Checking http://localhost:8000/health ...
HTTP Request: GET http://localhost:8000/health "HTTP/1.1 200 OK"
Health Check PASSED: Status is 'healthy'.
Checking http://localhost:8000/v1/models for 'meta/llama-3.3-70b-instruct'...
HTTP Request: GET http://localhost:8000/v1/models "HTTP/1.1 200 OK"
Model Check PASSED: Found 'meta/llama-3.3-70b-instruct' in model list.
--- Port 8000: ALL CHECKS PASSED ---

--- Checking Port: 8001 ---
Checking http://localhost:8001/health ...
HTTP Request: GET http://localhost:8001/health "HTTP/1.1 200 OK"
Health Check PASSED: Status is 'healthy'.
Checking http://localhost:8001/v1/models for 'nvidia/llama-3.1-nemoguard-8b-content-safety'...
HTTP Request: GET http://localhost:8001/v1/models "HTTP/1.1 200 OK"
Model Check PASSED: Found 'nvidia/llama-3.1-nemoguard-8b-content-safety' in model list.
--- Port 8001: ALL CHECKS PASSED ---

--- Checking Port: 9000 (Rails Config) ---
Checking http://localhost:9000/v1/rails/configs ...
HTTP Request: GET http://localhost:9000/v1/rails/configs "HTTP/1.1 200 OK"
HTTP Status PASSED: Got 200.
Body Check PASSED: Response is an array with at least one entry.
--- Port 9000: ALL CHECKS PASSED ---
@@ -88,10 +99,12 @@ Port 9000 (Rails Config): PASSED
Overall Status: All endpoints are healthy!
```

### 4. Make Guardrails requests

Once the mocks and Guardrails are running and the script passes, we can issue curl requests against the Guardrails `/v1/chat/completions` endpoint to generate a response and test the system end-to-end.

```shell
$ curl -s -X POST http://0.0.0.0:9000/v1/chat/completions \
-H 'Accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
@@ -104,6 +117,7 @@ curl -s -X POST http://0.0.0.0:9000/v1/chat/completions \
],
"stream": false
}' | jq

{
"messages": [
{
@@ -112,7 +126,6 @@ curl -s -X POST http://0.0.0.0:9000/v1/chat/completions \
}
]
}

```

------
@@ -123,20 +136,22 @@ In this section, we'll examine the configuration files used in the quickstart ab

### Procfile

The [Procfile](Procfile) contains all the processes that make up the application.
The Honcho package reads in this file, starts all the processes, and combines their logs on the console.
The `gr` line runs the Guardrails server on port 9000 and sets the default Guardrails configuration as [content_safety_local](../examples/configs/content_safety_local).
The `app_llm` line runs the Application or Main Mock LLM. Guardrails calls this LLM to generate a response to the user's query. This server uses 4 uvicorn workers and runs on port 8000. The configuration file here is a Mock LLM configuration, not a Guardrails configuration.
The `cs_llm` line runs the Content-Safety Mock LLM. This uses 4 uvicorn workers and runs on port 8001.
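
Honcho, like Foreman, can also start a subset of the Procfile processes by name, which is handy when you only want the mocks without the Guardrails server. A small sketch using the process names above:

```shell
# Start only the two Mock LLMs; Guardrails (gr) stays stopped
(benchmark_env) $ honcho start app_llm cs_llm
```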

### Guardrails Configuration

The [Guardrails Configuration](../examples/configs/content_safety_local/config.yml) is used by the Guardrails server.
Under the `models` section, the `main` model is used to generate responses to user queries. The base URL for this model is the `app_llm` Mock LLM from the Procfile, running on port 8000. The `model` field must match the Mock LLM model name.
The `content_safety` model is configured for use in an input and output rail. The `type` field matches the `$model` used in the input and output flows.
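
To confirm which configurations the running Guardrails server has actually loaded, you can query its `/v1/rails/configs` endpoint directly; this is the same check that `validate_mocks.sh` performs against port 9000:

```shell
# List the configurations loaded by the Guardrails server
$ curl -s http://localhost:9000/v1/rails/configs | jq
# Expect a JSON array with at least one entry (the default content_safety_local config)
```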

### Mock LLM Endpoints

The Mock LLM implements a subset of the OpenAI LLM API.
There are two Mock LLM configurations, one for the Mock [main model](mock_llm_server/configs/meta-llama-3.3-70b-instruct.env), and another for the Mock [content-safety](mock_llm_server/configs/nvidia-llama-3.1-nemoguard-8b-content-safety.env) model.
The Mock LLM has the following OpenAI-compatible endpoints:

* `/health`: Returns a JSON object with status set to healthy and timestamp in seconds-since-epoch. For example `{"status":"healthy","timestamp":1762781239}`
@@ -145,6 +160,7 @@ The Mock LLM has the following OpenAI-compatible endpoints:
* `/v1/chat/completions`: Returns an [OpenAI chat completion object](https://platform.openai.com/docs/api-reference/chat/object) using the Mock configuration (see below).
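
For example, with the quickstart services running, you can exercise the main Mock LLM on port 8000 directly. The `/health` response shown is the example from above; the chat request is a minimal sketch of an OpenAI-style payload, and the mock may accept or ignore additional fields:

```shell
# Health check against the main Mock LLM
$ curl -s http://localhost:8000/health
{"status":"healthy","timestamp":1762781239}

# Minimal chat completion served entirely by the mock
$ curl -s -X POST http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta/llama-3.3-70b-instruct",
    "messages": [{"role": "user", "content": "Hello"}]
  }' | jq
```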

### Mock LLM Configuration

Mock LLMs are configured using the `.env` file format. These files are passed to the Mock LLM using the `--config-file` argument.
The Mock LLMs return either a `SAFE_TEXT` or `UNSAFE_TEXT` response to `/v1/completions` or `/v1/chat/completions` inference requests.
The probability of `UNSAFE_TEXT` being returned is given by `UNSAFE_PROBABILITY`.
@@ -155,8 +171,9 @@ The latency of each response is also controllable, and works as follows:
* If the sampled value is greater than `LATENCY_MAX_SECONDS`, it is set to `LATENCY_MAX_SECONDS`.

The full list of configuration fields is shown below:

* `MODEL`: The model name served by the Mock LLM. This will be returned on the `/v1/models` endpoint.
* `UNSAFE_PROBABILITY`: Probability of an unsafe response. This must be in the range [0, 1].
* `UNSAFE_TEXT`: String returned as an unsafe response.
* `SAFE_TEXT`: String returned as a safe response.
* `LATENCY_MIN_SECONDS`: Minimum latency in seconds.
nemoguardrails/benchmark/mock_llm_server/api.py → benchmark/mock_llm_server/api.py
@@ -21,8 +21,8 @@

from fastapi import Depends, FastAPI, HTTPException, Request

from benchmark.mock_llm_server.config import ModelSettings, get_settings
from benchmark.mock_llm_server.models import (
ChatCompletionChoice,
ChatCompletionRequest,
ChatCompletionResponse,
@@ -34,7 +34,7 @@
ModelsResponse,
Usage,
)
from benchmark.mock_llm_server.response_data import (
calculate_tokens,
generate_id,
get_latency_seconds,
nemoguardrails/benchmark/mock_llm_server/response_data.py → benchmark/mock_llm_server/response_data.py
@@ -19,7 +19,7 @@

import numpy as np

from benchmark.mock_llm_server.config import ModelSettings


def generate_id(prefix: str = "chatcmpl") -> str:
@@ -56,7 +56,7 @@ def get_latency_seconds(config: ModelSettings, seed: Optional[int] = None) -> fl
a_min=config.latency_min_seconds,
a_max=config.latency_max_seconds,
)
return float(latency_seconds[0])


def is_unsafe(config: ModelSettings, seed: Optional[int] = None) -> bool:
nemoguardrails/benchmark/mock_llm_server/run_server.py → benchmark/mock_llm_server/run_server.py
@@ -27,7 +27,7 @@

import uvicorn

from benchmark.mock_llm_server.config import CONFIG_FILE_ENV_VAR

# 1. Get a logger instance
log = logging.getLogger(__name__)
@@ -101,7 +101,7 @@ def main(): # pragma: no cover

try:
uvicorn.run(
"nemoguardrails.benchmark.mock_llm_server.api:app",
"benchmark.mock_llm_server.api:app",
host=args.host,
port=args.port,
reload=args.reload,
21 changes: 21 additions & 0 deletions benchmark/requirements.txt
@@ -0,0 +1,21 @@
# Runtime dependencies for benchmark tools
#
# Install with: pip install -r requirements.txt
#
# Note: Version constraints are aligned with the main nemoguardrails package
# where applicable to ensure compatibility.

# --- general dependencies ---
honcho>=2.0.0

# --- mock_llm_server dependencies ---
fastapi>=0.103.0
uvicorn>=0.23
pydantic>=2.0
pydantic-settings>=2.0
numpy>=2.3.2

# --- aiperf dependencies ---
httpx>=0.24.1
typer>=0.8
pyyaml>=6.0