8 changes: 8 additions & 0 deletions benchmark/Procfile
@@ -0,0 +1,8 @@
# Procfile

# NeMo Guardrails server
gr: poetry run nemoguardrails server --config ../examples/configs/content_safety_local --default-config-id content_safety_local --port 9000

# Guardrails NIMs for inference. PYTHONPATH is set to the project root so absolute imports work
app_llm: PYTHONPATH=.. python mock_llm_server/run_server.py --workers 4 --port 8000 --config-file mock_llm_server/configs/meta-llama-3.3-70b-instruct.env
cs_llm: PYTHONPATH=.. python mock_llm_server/run_server.py --workers 4 --port 8001 --config-file mock_llm_server/configs/nvidia-llama-3.1-nemoguard-8b-content-safety.env
71 changes: 44 additions & 27 deletions nemoguardrails/benchmark/README.md → benchmark/README.md
@@ -14,29 +14,40 @@ All models use the [Mock LLM Server](mock_llm_server), which is a simplified mod
The aim of this benchmark is to detect performance regressions as quickly as running unit tests.

## Quickstart: Running Guardrails with Mock LLMs

To run Guardrails with mocks for both the content-safety and main LLMs, follow the steps below.
They assume you already have a working environment after following the steps in [CONTRIBUTING.md](../CONTRIBUTING.md).
All commands must be run in the `benchmark` directory.

### 1. Set up benchmarking virtual environment

The benchmarking tools have their own dependencies, which are managed using a virtual environment, pip, and the [requirements.txt](requirements.txt) file.
In this section, you'll create a new virtual environment, activate it, and install all the dependencies needed to benchmark Guardrails.

First you'll create the virtual environment and install dependencies.

```shell
# Create a virtual environment under ~/env/benchmark_env and activate it

$ cd benchmark
$ mkdir -p ~/env
$ python -m venv ~/env/benchmark_env
$ source ~/env/benchmark_env/bin/activate

# Install the benchmark dependencies into the virtual environment
(benchmark_env) $ pip install -r requirements.txt
...
Successfully installed fastapi-0.128.0 honcho-2.0.0 httpx-0.28.1 langchain-core-1.2.5 numpy-2.4.0 pydantic-2.12.5 pydantic-core-2.41.5 pydantic-settings-2.12.0 pyyaml-6.0.3 typer-0.21.0 typing-inspection-0.4.2 uuid-utils-0.12.0 uvicorn-0.40.0
(benchmark_env) $
```
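
Optionally, you can confirm that the tools landed in the new environment rather than your system Python. This is just a sanity check using `pip` from the activated `benchmark_env`:

```shell
# Confirm honcho and the mock-server dependencies resolved into benchmark_env
(benchmark_env) $ pip show honcho fastapi uvicorn | grep -E '^(Name|Version|Location)'
```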

### 2. Run Guardrails with Mock LLMs for Content-Safety and Application LLM

Now we can start up the processes that are part of the [Procfile](Procfile).
As the Procfile processes spin up, they log to the console with a prefix. The `system` prefix is used by Honcho, `app_llm` is the Application or Main LLM mock, `cs_llm` is the content-safety mock, and `gr` is the Guardrails service. We'll explore the Procfile in more detail below.
Once the three 'Uvicorn running on ...' messages are printed, you can move to the next step. Note that these messages will likely not appear on consecutive lines.

```shell
# These commands must be run in the benchmark directory after activating the benchmark_env virtual environment

(benchmark_env) $ honcho start
13:40:33 system | gr.1 started (pid=93634)
13:40:33 system | app_llm.1 started (pid=93635)
13:40:33 system | cs_llm.1 started (pid=93636)
@@ -48,34 +59,34 @@ $ poetry run honcho start
13:40:45 gr.1 | INFO: Uvicorn running on http://0.0.0.0:9000 (Press CTRL+C to quit)
```

### 3. Validate services are running correctly

Once Guardrails and the mock servers are up, we'll use the [validate_mocks.sh](scripts/validate_mocks.sh) script to check that everything is working.
This doesn't require the `benchmark_env` virtual environment, since the script only runs `curl` commands.

```shell
# In a new shell, change into the benchmark directory and run these commands.

$ cd benchmark
$ scripts/validate_mocks.sh
Starting LLM endpoint health check...

--- Checking Port: 8000 ---
Checking http://localhost:8000/health ...
HTTP Request: GET http://localhost:8000/health "HTTP/1.1 200 OK"
Health Check PASSED: Status is 'healthy'.
Checking http://localhost:8000/v1/models for 'meta/llama-3.3-70b-instruct'...
HTTP Request: GET http://localhost:8000/v1/models "HTTP/1.1 200 OK"
Model Check PASSED: Found 'meta/llama-3.3-70b-instruct' in model list.
--- Port 8000: ALL CHECKS PASSED ---

--- Checking Port: 8001 ---
Checking http://localhost:8001/health ...
HTTP Request: GET http://localhost:8001/health "HTTP/1.1 200 OK"
Health Check PASSED: Status is 'healthy'.
Checking http://localhost:8001/v1/models for 'nvidia/llama-3.1-nemoguard-8b-content-safety'...
HTTP Request: GET http://localhost:8001/v1/models "HTTP/1.1 200 OK"
Model Check PASSED: Found 'nvidia/llama-3.1-nemoguard-8b-content-safety' in model list.
--- Port 8001: ALL CHECKS PASSED ---

--- Checking Port: 9000 (Rails Config) ---
Checking http://localhost:9000/v1/rails/configs ...
HTTP Request: GET http://localhost:9000/v1/rails/configs "HTTP/1.1 200 OK"
HTTP Status PASSED: Got 200.
Body Check PASSED: Response is an array with at least one entry.
--- Port 9000: ALL CHECKS PASSED ---
@@ -88,10 +99,12 @@ Port 9000 (Rails Config): PASSED
Overall Status: All endpoints are healthy!
```

### 4. Make Guardrails requests

Once the mocks and Guardrails are running and the script passes, we can issue curl requests against the Guardrails `/v1/chat/completions` endpoint to generate a response and test the system end-to-end.

```shell
$ curl -s -X POST http://0.0.0.0:9000/v1/chat/completions \
-H 'Accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
@@ -104,6 +117,7 @@ curl -s -X POST http://0.0.0.0:9000/v1/chat/completions \
],
"stream": false
}' | jq

{
"messages": [
{
@@ -112,7 +126,6 @@ curl -s -X POST http://0.0.0.0:9000/v1/chat/completions \
}
]
}

```

------
@@ -123,20 +136,22 @@ In this section, we'll examine the configuration files used in the quickstart ab

### Procfile

The [Procfile](Procfile) contains all the processes that make up the application.
The Honcho package reads in this file, starts all the processes, and combines their logs on the console.
The `gr` line runs the Guardrails server on port 9000 and sets the default Guardrails configuration as [content_safety_local](../examples/configs/content_safety_local).
The `app_llm` line runs the Application or Main Mock LLM. Guardrails calls this LLM to generate a response to the user's query. This server uses 4 uvicorn workers and runs on port 8000. The configuration file here is a Mock LLM configuration, not a Guardrails configuration.
The `cs_llm` line runs the Content-Safety Mock LLM. This uses 4 uvicorn workers and runs on port 8001.
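
Honcho, like Foreman, can also start a subset of the Procfile processes by name, which is handy when you only want the mocks without the Guardrails server. A small sketch using the process names above:

```shell
# Start only the two Mock LLMs; Guardrails (gr) stays stopped
(benchmark_env) $ honcho start app_llm cs_llm
```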

### Guardrails Configuration

The [Guardrails Configuration](../examples/configs/content_safety_local/config.yml) is used by the Guardrails server.
Under the `models` section, the `main` model is used to generate responses to user queries. The base URL for this model is the `app_llm` Mock LLM from the Procfile, running on port 8000. The `model` field must match the Mock LLM model name.
The `content_safety` model is configured for use in an input and output rail. The `type` field matches the `$model` used in the input and output flows.
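
To confirm which configurations the running Guardrails server has actually loaded, you can query its `/v1/rails/configs` endpoint directly; this is the same check that `validate_mocks.sh` performs against port 9000:

```shell
# List the configurations loaded by the Guardrails server
$ curl -s http://localhost:9000/v1/rails/configs | jq
# Expect a JSON array with at least one entry (the default content_safety_local config)
```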

### Mock LLM Endpoints

The Mock LLM implements a subset of the OpenAI LLM API.
There are two Mock LLM configurations, one for the Mock [main model](mock_llm_server/configs/meta-llama-3.3-70b-instruct.env), and another for the Mock [content-safety](mock_llm_server/configs/nvidia-llama-3.1-nemoguard-8b-content-safety.env) model.
The Mock LLM has the following OpenAI-compatible endpoints:

* `/health`: Returns a JSON object with status set to healthy and timestamp in seconds-since-epoch. For example `{"status":"healthy","timestamp":1762781239}`
@@ -145,6 +160,7 @@ The Mock LLM has the following OpenAI-compatible endpoints:
* `/v1/chat/completions`: Returns an [OpenAI chat completion object](https://platform.openai.com/docs/api-reference/chat/object) using the Mock configuration (see below).
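
For example, with the quickstart services running, you can exercise the main Mock LLM on port 8000 directly. The `/health` response shown is the example from above; the chat request is a minimal sketch of an OpenAI-style payload, and the mock may accept or ignore additional fields:

```shell
# Health check against the main Mock LLM
$ curl -s http://localhost:8000/health
{"status":"healthy","timestamp":1762781239}

# Minimal chat completion served entirely by the mock
$ curl -s -X POST http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta/llama-3.3-70b-instruct",
    "messages": [{"role": "user", "content": "Hello"}]
  }' | jq
```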

### Mock LLM Configuration

Mock LLMs are configured using the `.env` file format. These files are passed to the Mock LLM using the `--config-file` argument.
The Mock LLMs return either a `SAFE_TEXT` or `UNSAFE_TEXT` response to `/v1/completions` or `/v1/chat/completions` inference requests.
The probability of `UNSAFE_TEXT` being returned is given by `UNSAFE_PROBABILITY`.
@@ -155,8 +171,9 @@ The latency of each response is also controllable, and works as follows:
* If the sampled value is greater than `LATENCY_MAX_SECONDS`, it is set to `LATENCY_MAX_SECONDS`.

The full list of configuration fields is shown below:

* `MODEL`: The model name served by the Mock LLM. This will be returned on the `/v1/models` endpoint.
* `UNSAFE_PROBABILITY`: Probability of an unsafe response. This must be in the range [0, 1].
* `UNSAFE_TEXT`: String returned as an unsafe response.
* `SAFE_TEXT`: String returned as a safe response.
* `LATENCY_MIN_SECONDS`: Minimum latency in seconds.
nemoguardrails/benchmark/mock_llm_server/api.py → benchmark/mock_llm_server/api.py
@@ -21,8 +21,8 @@

from fastapi import Depends, FastAPI, HTTPException, Request

from benchmark.mock_llm_server.config import ModelSettings, get_settings
from benchmark.mock_llm_server.models import (
ChatCompletionChoice,
ChatCompletionRequest,
ChatCompletionResponse,
@@ -34,7 +34,7 @@
ModelsResponse,
Usage,
)
from benchmark.mock_llm_server.response_data import (
calculate_tokens,
generate_id,
get_latency_seconds,
nemoguardrails/benchmark/mock_llm_server/response_data.py → benchmark/mock_llm_server/response_data.py
@@ -19,7 +19,7 @@

import numpy as np

from benchmark.mock_llm_server.config import ModelSettings


def generate_id(prefix: str = "chatcmpl") -> str:
@@ -56,7 +56,7 @@ def get_latency_seconds(config: ModelSettings, seed: Optional[int] = None) -> fl
a_min=config.latency_min_seconds,
a_max=config.latency_max_seconds,
)
return float(latency_seconds[0])


def is_unsafe(config: ModelSettings, seed: Optional[int] = None) -> bool:
nemoguardrails/benchmark/mock_llm_server/run_server.py → benchmark/mock_llm_server/run_server.py
@@ -27,7 +27,7 @@

import uvicorn

from benchmark.mock_llm_server.config import CONFIG_FILE_ENV_VAR

# 1. Get a logger instance
log = logging.getLogger(__name__)
@@ -101,7 +101,7 @@ def main(): # pragma: no cover

try:
uvicorn.run(
"nemoguardrails.benchmark.mock_llm_server.api:app",
"benchmark.mock_llm_server.api:app",
host=args.host,
port=args.port,
reload=args.reload,
21 changes: 21 additions & 0 deletions benchmark/requirements.txt
@@ -0,0 +1,21 @@
# Runtime dependencies for benchmark tools
#
# Install with: pip install -r requirements.txt
#
# Note: Version constraints are aligned with the main nemoguardrails package
# where applicable to ensure compatibility.

# --- general dependencies ---
honcho>=2.0.0

# --- mock_llm_server dependencies ---
fastapi>=0.103.0
uvicorn>=0.23
pydantic>=2.0
pydantic-settings>=2.0
numpy>=2.3.2

# --- aiperf dependencies ---
httpx>=0.24.1
typer>=0.8
pyyaml>=6.0