From fdf52c0e2ca663871d4a42da4a3a28fcba0bca88 Mon Sep 17 00:00:00 2001 From: tgasser-nv <200644301+tgasser-nv@users.noreply.github.com> Date: Tue, 28 Oct 2025 20:07:00 -0500 Subject: [PATCH 1/8] Initial checkin --- .../advanced/model-memory-cache.md | 71 +++++++++++++++++++ 1 file changed, 71 insertions(+) create mode 100644 docs/user-guides/advanced/model-memory-cache.md diff --git a/docs/user-guides/advanced/model-memory-cache.md b/docs/user-guides/advanced/model-memory-cache.md new file mode 100644 index 000000000..3fba99c23 --- /dev/null +++ b/docs/user-guides/advanced/model-memory-cache.md @@ -0,0 +1,71 @@ +(model-memory-cache)= + +# In-Memory Model Cache + +Guardrails supports an in-memory cache to store user-prompts and the LLM response to them. +This can be applied to any model, using the `Model.cache` field + +## Example Configuration + +Let's walk through an example of adding caching to a Content-Safety Guardrails application. The initial `config.yml` is shown below. + +```yaml +# Content-Safety config.yml (without caching) +models: + - type: main + engine: nim + model: meta/llama-3.3-70b-instruct + + - type: content_safety + engine: nim + model: nvidia/llama-3.1-nemoguard-8b-content-safety + +rails: + input: + flows: + - content safety check input $model=content_safety + output: + flows: + - content safety check output $model=content_safety +``` + +The yaml file below shows the same configuration, but this time with caching enabled on the main LLM and Content-Safety Nemoguard model. +The `cache` section controls the caching. The `meta/llama-3.3-70b-instruct` model has a cache with a maximum size of 1,000 entries, while the `nvidia/llama-3.1-nemoguard-8b-content-safety` has a cache maximum size of 10,000 entries. +Both caches have telemetry reporting enabled. + +```yaml +# Content-Safety config.yml (with caching) +models: + - type: main + engine: nim + model: meta/llama-3.3-70b-instruct + cache: + enabled: true + maxsize: 1000 + stats: + enabled: true + + - type: content_safety + engine: nim + model: nvidia/llama-3.1-nemoguard-8b-content-safety + cache: + enabled: true + maxsize: 10000 + stats: + enabled: true +rails: + input: + flows: + - content safety check input $model=content_safety + output: + flows: + - content safety check output $model=content_safety +``` + + +## Least Frequently Used Cache + + +## Telemetry + +## Horizontal scaling and caching From a1afb4c5248b8cabdc164e6fe7b8d09aca1ee171 Mon Sep 17 00:00:00 2001 From: tgasser-nv <200644301+tgasser-nv@users.noreply.github.com> Date: Wed, 29 Oct 2025 10:04:03 -0500 Subject: [PATCH 2/8] Completed cache doc, some todos to fill in based on local integration testing --- .../advanced/model-memory-cache.md | 65 ++++++++++++++++--- 1 file changed, 57 insertions(+), 8 deletions(-) diff --git a/docs/user-guides/advanced/model-memory-cache.md b/docs/user-guides/advanced/model-memory-cache.md index 3fba99c23..0d22eb461 100644 --- a/docs/user-guides/advanced/model-memory-cache.md +++ b/docs/user-guides/advanced/model-memory-cache.md @@ -2,12 +2,17 @@ # In-Memory Model Cache -Guardrails supports an in-memory cache to store user-prompts and the LLM response to them. -This can be applied to any model, using the `Model.cache` field +Guardrails supports an in-memory cache which avoids making LLM calls for repeated prompts. It stores user-prompts and the corresponding LLM response. Prior to making an LLM call, Guardrails first checks if the prompt matches one already in the cache. 
If the prompt is found in the cache, the stored response is returned from the cache, rather than prompting the LLM. This improves latency. +In-memory caches are supported for the Main LLM, and all Nemoguard models ([Content-Safety](https://build.nvidia.com/nvidia/llama-3_1-nemoguard-8b-content-safety), [Topic-Control](https://build.nvidia.com/nvidia/llama-3_1-nemoguard-8b-topic-control), and [Jailbreak Detection](https://build.nvidia.com/nvidia/nemoguard-jailbreak-detect)). Each model can be configured independently. +The cache uses exact-matching (after removing whitespace) on LLM prompts with a Least-Frequently-Used (LFU) algorithm for cache evictions. +For observability, cache hits and misses are visible in OTEL telemetry, and stored in logs on a configurable cadence. +To get started with caching, an example configuration is shown below. The rest of the page has a deep-dive into how the cache works, telemetry, and considerations when enabling caching in a horizontally-scalable service. ## Example Configuration -Let's walk through an example of adding caching to a Content-Safety Guardrails application. The initial `config.yml` is shown below. +Let's walk through an example of adding caching to a Content-Safety Guardrails application. The initial `config.yml` without caching is shown below. +We are using a [Llama 3.3 70B-Instruct](https://build.nvidia.com/meta/llama-3_3-70b-instruct) main LLM to generate responses, and checking user-input and LLM-response using the [Llama 3.1 Nemoguard 8B Content Safety](https://build.nvidia.com/nvidia/llama-3_1-nemoguard-8b-content-safety) model. +The input rail checks the safety of the user prompt before sending it to the main LLM. The output rail checks both the user input and Main LLM response to make sure the response is safe. ```yaml # Content-Safety config.yml (without caching) @@ -29,9 +34,9 @@ rails: - content safety check output $model=content_safety ``` -The yaml file below shows the same configuration, but this time with caching enabled on the main LLM and Content-Safety Nemoguard model. -The `cache` section controls the caching. The `meta/llama-3.3-70b-instruct` model has a cache with a maximum size of 1,000 entries, while the `nvidia/llama-3.1-nemoguard-8b-content-safety` has a cache maximum size of 10,000 entries. -Both caches have telemetry reporting enabled. +The yaml file below shows the same configuration, with caching enabled on the Main and Content-Safety Nemoguard models. +The Main LLM and Nemoguard Content-Safety caches have maximum sizes of 1,000 and 10,000 respectively. +Both caches are configured to log cache statistics. The Main LLM cache statistics are logged every 60 seconds (or 1 minute), while the Content-Safety cache statistics are logged every 360 seconds (or 5 minutes). ```yaml # Content-Safety config.yml (with caching) @@ -44,6 +49,7 @@ models: maxsize: 1000 stats: enabled: true + log_interval: 60 - type: content_safety engine: nim @@ -53,6 +59,8 @@ models: maxsize: 10000 stats: enabled: true + log_interval: 360 + rails: input: flows: @@ -62,10 +70,51 @@ rails: - content safety check output $model=content_safety ``` +## How does the Cache work? + +When the cache is enabled, prior to each LLM call we first check to see if we sent the same prompt to the same LLM. This uses an exact-match lookup, after removing whitespace. +If there's a cache hit (i.e. the same prompt was sent to the same LLM earlier and the response was stored in the cache), then the response can be returned without calling the LLM. 
+If there's a cache miss (i.e. we don't have a stored LLM response for this prompt in the cache), then the LLM is called as usual. When the response is received, this is stored in the cache. + +For security reasons, user prompts are not stored directly. After removing whitespace, the user-prompt is hashed using SHA256 and then used as a cache key. -## Least Frequently Used Cache +If a new cache record needs to be added and the cache already has `maxsize` entries, the Least-Frequently Used (LFU) algorithm is used to decide which cache record to evict. +The LFU algorithm ensures that the most frequently accessed cache entries remain in the cache, improving the probability of a cache hit. +## Telemetry and logging + +Guardrails supports OTEL telemetry to trace client requests through Guardrails and any calls to LLMs or APIs. The cache operation is reflected in these traces, with cache hits having a far shorter duration and no LLM call and cache misses having an LLM call. This OTEL telemetry is a good fit for operational dashboards. +The cache statistics are also logged on a configurable cadence if `cache.stats.enabled` is set to `true`. Every `log_interval` seconds, the cache statistics are logged with the format below. +The most important metric below is the "Hit Rate", which is the proportion of LLM calls returned from the cache. If this value remains low, the exact-match may not be a good fit for your usecase. +**TODO! Do these reset on every measurement period, or increment forever (rollover concerns?)** + + +``` +# TODO! Replace with measured values +"LFU Cache Statistics - " +"Size: {stats['current_size']}/{stats['maxsize']} | " +"Hits: {stats['hits']} | " +"Misses: {stats['misses']} | " +"Hit Rate: {stats['hit_rate']:.2%} | " +"Evictions: {stats['evictions']} | " +"Puts: {stats['puts']} | " +"Updates: {stats['updates']}" +``` + +These metrics are detailed below: + +* Size: The number of LLM calls stored in the cache. +* Hits: The number of cache hits. +* Misses: The number of cache misses. +* Hit Rate: The proportion of calls returned from the cache. This is a float between 1.0 (all calls returned from cache) and 0.0 (all calls sent to LLM) +* Evictions: Number of cache evictions. +* Puts: Number of new cache records stored. +* Updates: Number of existing cache records updated. -## Telemetry ## Horizontal scaling and caching + +This cache is implemented in-memory on each Guardrails node. When operating as a horizontally-scaled backend-service, there are many Guardrails nodes running behind an API Gateway and load-balancer to distribute traffic and meet availability and performance targets. +The current cache implementation has a separate cache on each node, with no sharing of cache entries between nodes. +Because the load balancer spreads traffic over all Guardrails nodes, requests have to both be stored in cache, with the load balancer directing the same request to the same node. +In practice, frequently-requested user prompts will likely be spread over Guardrails nodes by the load balancer, so the performance impact may ne less significant. 
From 5bb99cedc20a0f27d0acdfedd8f996cd9a4378ad Mon Sep 17 00:00:00 2001 From: tgasser-nv <200644301+tgasser-nv@users.noreply.github.com> Date: Wed, 29 Oct 2025 16:29:14 -0500 Subject: [PATCH 3/8] Update example with content-safety, topic-control, and jailbreak nemoguard NIMs --- .../advanced/model-memory-cache.md | 60 +++++++++++++++---- 1 file changed, 49 insertions(+), 11 deletions(-) diff --git a/docs/user-guides/advanced/model-memory-cache.md b/docs/user-guides/advanced/model-memory-cache.md index 0d22eb461..a1559c37f 100644 --- a/docs/user-guides/advanced/model-memory-cache.md +++ b/docs/user-guides/advanced/model-memory-cache.md @@ -3,7 +3,7 @@ # In-Memory Model Cache Guardrails supports an in-memory cache which avoids making LLM calls for repeated prompts. It stores user-prompts and the corresponding LLM response. Prior to making an LLM call, Guardrails first checks if the prompt matches one already in the cache. If the prompt is found in the cache, the stored response is returned from the cache, rather than prompting the LLM. This improves latency. -In-memory caches are supported for the Main LLM, and all Nemoguard models ([Content-Safety](https://build.nvidia.com/nvidia/llama-3_1-nemoguard-8b-content-safety), [Topic-Control](https://build.nvidia.com/nvidia/llama-3_1-nemoguard-8b-topic-control), and [Jailbreak Detection](https://build.nvidia.com/nvidia/nemoguard-jailbreak-detect)). Each model can be configured independently. +In-memory caches are supported for all Nemoguard models ([Content-Safety](https://build.nvidia.com/nvidia/llama-3_1-nemoguard-8b-content-safety), [Topic-Control](https://build.nvidia.com/nvidia/llama-3_1-nemoguard-8b-topic-control), and [Jailbreak Detection](https://build.nvidia.com/nvidia/nemoguard-jailbreak-detect)). Each model can be configured independently. The cache uses exact-matching (after removing whitespace) on LLM prompts with a Least-Frequently-Used (LFU) algorithm for cache evictions. For observability, cache hits and misses are visible in OTEL telemetry, and stored in logs on a configurable cadence. To get started with caching, an example configuration is shown below. The rest of the page has a deep-dive into how the cache works, telemetry, and considerations when enabling caching in a horizontally-scalable service. @@ -11,8 +11,8 @@ To get started with caching, an example configuration is shown below. The rest o ## Example Configuration Let's walk through an example of adding caching to a Content-Safety Guardrails application. The initial `config.yml` without caching is shown below. -We are using a [Llama 3.3 70B-Instruct](https://build.nvidia.com/meta/llama-3_3-70b-instruct) main LLM to generate responses, and checking user-input and LLM-response using the [Llama 3.1 Nemoguard 8B Content Safety](https://build.nvidia.com/nvidia/llama-3_1-nemoguard-8b-content-safety) model. -The input rail checks the safety of the user prompt before sending it to the main LLM. The output rail checks both the user input and Main LLM response to make sure the response is safe. +We are using a [Llama 3.3 70B-Instruct](https://build.nvidia.com/meta/llama-3_3-70b-instruct) main LLM to generate responses. Inputs are checked by [Content-Safety](https://build.nvidia.com/nvidia/llama-3_1-nemoguard-8b-content-safety), [Topic-Control](https://build.nvidia.com/nvidia/llama-3_1-nemoguard-8b-topic-control) and [Jailbreak detection](https://build.nvidia.com/nvidia/nemoguard-jailbreak-detect) models. 
The LLM response is also checked by the [Content-Safety](https://build.nvidia.com/nvidia/llama-3_1-nemoguard-8b-content-safety) model. +The input rails check the user prompt before sending it to the Main LLM to generate a response. The output rail checks both the user input and Main LLM response to make sure the response is safe. ```yaml # Content-Safety config.yml (without caching) @@ -25,18 +25,33 @@ models: engine: nim model: nvidia/llama-3.1-nemoguard-8b-content-safety + - type: topic_control + engine: nim + model: nvidia/llama-3.1-nemoguard-8b-topic-control + + - type: jailbreak_detection + engine: nim + model: jailbreak_detect + rails: input: flows: + - jailbreak detection model - content safety check input $model=content_safety + - topic safety check input $model=topic_control + output: flows: - content safety check output $model=content_safety + + config: + jailbreak_detection: + nim_base_url: "https://ai.api.nvidia.com" + nim_server_endpoint: "/v1/security/nvidia/nemoguard-jailbreak-detect" + api_key_env_var: NVIDIA_API_KEY ``` -The yaml file below shows the same configuration, with caching enabled on the Main and Content-Safety Nemoguard models. -The Main LLM and Nemoguard Content-Safety caches have maximum sizes of 1,000 and 10,000 respectively. -Both caches are configured to log cache statistics. The Main LLM cache statistics are logged every 60 seconds (or 1 minute), while the Content-Safety cache statistics are logged every 360 seconds (or 5 minutes). +The yaml file below shows the same configuration, with caching enabled on the Content-Safety, Topic-Control, and Jailbreak detection Nemoguard NIMs. All three caches have a size of 10,000 records. The caches log their statistics every 60 seconds. ```yaml # Content-Safety config.yml (with caching) @@ -44,35 +59,58 @@ models: - type: main engine: nim model: meta/llama-3.3-70b-instruct + + - type: content_safety + engine: nim + model: nvidia/llama-3.1-nemoguard-8b-content-safety cache: enabled: true - maxsize: 1000 + maxsize: 10000 stats: enabled: true log_interval: 60 - - type: content_safety + - type: topic_control engine: nim - model: nvidia/llama-3.1-nemoguard-8b-content-safety + model: nvidia/llama-3.1-nemoguard-8b-topic-control + cache: + enabled: true + maxsize: 10000 + stats: + enabled: true + log_interval: 60 + + - type: jailbreak_detection + engine: nim + model: jailbreak_detect cache: enabled: true maxsize: 10000 stats: enabled: true - log_interval: 360 + log_interval: 60 rails: input: flows: + - jailbreak detection model - content safety check input $model=content_safety + - topic safety check input $model=topic_control + output: flows: - content safety check output $model=content_safety + + config: + jailbreak_detection: + nim_base_url: "https://ai.api.nvidia.com" + nim_server_endpoint: "/v1/security/nvidia/nemoguard-jailbreak-detect" + api_key_env_var: NVIDIA_API_KEY ``` ## How does the Cache work? -When the cache is enabled, prior to each LLM call we first check to see if we sent the same prompt to the same LLM. This uses an exact-match lookup, after removing whitespace. +When the cache is enabled, prior to each LLM call we first check to see if we already sent the same prompt to the LLM, and return the response if so. This uses an exact-match lookup, after removing whitespace. If there's a cache hit (i.e. the same prompt was sent to the same LLM earlier and the response was stored in the cache), then the response can be returned without calling the LLM. If there's a cache miss (i.e. 
we don't have a stored LLM response for this prompt in the cache), then the LLM is called as usual. When the response is received, this is stored in the cache. From 35177fa2e4958aad71c1debcd9869e29767d4061 Mon Sep 17 00:00:00 2001 From: tgasser-nv <200644301+tgasser-nv@users.noreply.github.com> Date: Wed, 29 Oct 2025 16:53:08 -0500 Subject: [PATCH 4/8] Add memory-caching to the table-of-contents --- docs/user-guides/advanced/index.rst | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/user-guides/advanced/index.rst b/docs/user-guides/advanced/index.rst index a6c221188..fffb846e2 100644 --- a/docs/user-guides/advanced/index.rst +++ b/docs/user-guides/advanced/index.rst @@ -22,3 +22,4 @@ Advanced nemoguard-contentsafety-deployment nemoguard-topiccontrol-deployment safeguarding-ai-virtual-assistant-blueprint + model-memory-cache From 5475a327833415577eb0418ce74620291913b267 Mon Sep 17 00:00:00 2001 From: tgasser-nv <200644301+tgasser-nv@users.noreply.github.com> Date: Thu, 30 Oct 2025 11:39:40 -0500 Subject: [PATCH 5/8] Cleaned up last TODOs --- .../advanced/model-memory-cache.md | 19 +++++++++---------- 1 file changed, 9 insertions(+), 10 deletions(-) diff --git a/docs/user-guides/advanced/model-memory-cache.md b/docs/user-guides/advanced/model-memory-cache.md index a1559c37f..7825377ce 100644 --- a/docs/user-guides/advanced/model-memory-cache.md +++ b/docs/user-guides/advanced/model-memory-cache.md @@ -1,6 +1,6 @@ (model-memory-cache)= -# In-Memory Model Cache +# Memory Model Cache Guardrails supports an in-memory cache which avoids making LLM calls for repeated prompts. It stores user-prompts and the corresponding LLM response. Prior to making an LLM call, Guardrails first checks if the prompt matches one already in the cache. If the prompt is found in the cache, the stored response is returned from the cache, rather than prompting the LLM. This improves latency. In-memory caches are supported for all Nemoguard models ([Content-Safety](https://build.nvidia.com/nvidia/llama-3_1-nemoguard-8b-content-safety), [Topic-Control](https://build.nvidia.com/nvidia/llama-3_1-nemoguard-8b-topic-control), and [Jailbreak Detection](https://build.nvidia.com/nvidia/nemoguard-jailbreak-detect)). Each model can be configured independently. @@ -124,19 +124,18 @@ The LFU algorithm ensures that the most frequently accessed cache entries remain Guardrails supports OTEL telemetry to trace client requests through Guardrails and any calls to LLMs or APIs. The cache operation is reflected in these traces, with cache hits having a far shorter duration and no LLM call and cache misses having an LLM call. This OTEL telemetry is a good fit for operational dashboards. The cache statistics are also logged on a configurable cadence if `cache.stats.enabled` is set to `true`. Every `log_interval` seconds, the cache statistics are logged with the format below. The most important metric below is the "Hit Rate", which is the proportion of LLM calls returned from the cache. If this value remains low, the exact-match may not be a good fit for your usecase. -**TODO! Do these reset on every measurement period, or increment forever (rollover concerns?)** +These statistics accumulate for the time Guardrails is running. ``` -# TODO! 
Replace with measured values "LFU Cache Statistics - " -"Size: {stats['current_size']}/{stats['maxsize']} | " -"Hits: {stats['hits']} | " -"Misses: {stats['misses']} | " -"Hit Rate: {stats['hit_rate']:.2%} | " -"Evictions: {stats['evictions']} | " -"Puts: {stats['puts']} | " -"Updates: {stats['updates']}" +"Size: 0.23453 | " +"Hits: 20 | " +"Misses: 3 | " +"Hit Rate: 87% | " +"Evictions: 0 | " +"Puts: 20 | " +"Updates: 0" ``` These metrics are detailed below: From ba6178d77f918b6aa142ffbbad36dc2666fd3f76 Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Thu, 30 Oct 2025 17:00:36 -0700 Subject: [PATCH 6/8] Apply suggestion from @greptile-apps[bot] Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: Miyoung Choi --- docs/user-guides/advanced/model-memory-cache.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/user-guides/advanced/model-memory-cache.md b/docs/user-guides/advanced/model-memory-cache.md index 7825377ce..91a2828df 100644 --- a/docs/user-guides/advanced/model-memory-cache.md +++ b/docs/user-guides/advanced/model-memory-cache.md @@ -154,4 +154,4 @@ These metrics are detailed below: This cache is implemented in-memory on each Guardrails node. When operating as a horizontally-scaled backend-service, there are many Guardrails nodes running behind an API Gateway and load-balancer to distribute traffic and meet availability and performance targets. The current cache implementation has a separate cache on each node, with no sharing of cache entries between nodes. Because the load balancer spreads traffic over all Guardrails nodes, requests have to both be stored in cache, with the load balancer directing the same request to the same node. -In practice, frequently-requested user prompts will likely be spread over Guardrails nodes by the load balancer, so the performance impact may ne less significant. +In practice, frequently-requested user prompts will likely be spread over Guardrails nodes by the load balancer, so the performance impact may be less significant. 
From 3d2bcfc2597b29d4a6aa27afe5251e19df2fef4a Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Fri, 31 Oct 2025 06:52:47 -0700 Subject: [PATCH 7/8] edit (#1486) Edits to the memory-cache docs --- docs/index.md | 1 + docs/user-guides/advanced/index.rst | 1 - .../advanced/model-memory-cache.md | 99 ++++++++++++------- 3 files changed, 65 insertions(+), 36 deletions(-) diff --git a/docs/index.md b/docs/index.md index 16196f2d3..5dfeda581 100644 --- a/docs/index.md +++ b/docs/index.md @@ -68,6 +68,7 @@ user-guides/advanced/nemoguard-jailbreakdetect-deployment user-guides/advanced/kv-cache-reuse user-guides/advanced/safeguarding-ai-virtual-assistant-blueprint user-guides/advanced/tools-integration +user-guides/advanced/model-memory-cache ``` ```{toctree} diff --git a/docs/user-guides/advanced/index.rst b/docs/user-guides/advanced/index.rst index fffb846e2..a6c221188 100644 --- a/docs/user-guides/advanced/index.rst +++ b/docs/user-guides/advanced/index.rst @@ -22,4 +22,3 @@ Advanced nemoguard-contentsafety-deployment nemoguard-topiccontrol-deployment safeguarding-ai-virtual-assistant-blueprint - model-memory-cache diff --git a/docs/user-guides/advanced/model-memory-cache.md b/docs/user-guides/advanced/model-memory-cache.md index 91a2828df..5cfa6d127 100644 --- a/docs/user-guides/advanced/model-memory-cache.md +++ b/docs/user-guides/advanced/model-memory-cache.md @@ -2,20 +2,29 @@ # Memory Model Cache -Guardrails supports an in-memory cache which avoids making LLM calls for repeated prompts. It stores user-prompts and the corresponding LLM response. Prior to making an LLM call, Guardrails first checks if the prompt matches one already in the cache. If the prompt is found in the cache, the stored response is returned from the cache, rather than prompting the LLM. This improves latency. -In-memory caches are supported for all Nemoguard models ([Content-Safety](https://build.nvidia.com/nvidia/llama-3_1-nemoguard-8b-content-safety), [Topic-Control](https://build.nvidia.com/nvidia/llama-3_1-nemoguard-8b-topic-control), and [Jailbreak Detection](https://build.nvidia.com/nvidia/nemoguard-jailbreak-detect)). Each model can be configured independently. -The cache uses exact-matching (after removing whitespace) on LLM prompts with a Least-Frequently-Used (LFU) algorithm for cache evictions. -For observability, cache hits and misses are visible in OTEL telemetry, and stored in logs on a configurable cadence. -To get started with caching, an example configuration is shown below. The rest of the page has a deep-dive into how the cache works, telemetry, and considerations when enabling caching in a horizontally-scalable service. +Guardrails supports an in-memory cache that avoids making LLM calls for repeated prompts. The cache stores user prompts and their corresponding LLM responses. Prior to making an LLM call, Guardrails checks if the prompt already exists in the cache. If found, the stored response is returned instead of calling the LLM, improving latency. + +In-memory caches are supported for all Nemoguard models: [Content-Safety](https://build.nvidia.com/nvidia/llama-3_1-nemoguard-8b-content-safety), [Topic-Control](https://build.nvidia.com/nvidia/llama-3_1-nemoguard-8b-topic-control), and [Jailbreak Detection](https://build.nvidia.com/nvidia/nemoguard-jailbreak-detect). Each model can be configured independently. + +The cache uses exact matching (after removing whitespace) on LLM prompts with a Least-Frequently-Used (LFU) algorithm for cache evictions. 
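+
+The sketch below is illustrative only and is not the Guardrails implementation; you do not need to write any code to enable caching, which is configured entirely in `config.yml`. It assumes whitespace is simply stripped before the prompt is hashed with SHA-256 (hashing is described under "How the Cache Works" below), and the helper and class names are invented for this example. It shows why only prompts that match exactly after whitespace removal produce cache hits, and how an LFU policy picks which entry to evict.
+
+```python
+import hashlib
+from collections import Counter
+
+
+def cache_key(prompt: str) -> str:
+    # Illustrative key derivation: strip whitespace, then hash the prompt
+    # so the raw text is never stored directly.
+    normalized = "".join(prompt.split())
+    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
+
+
+class ToyLFUCache:
+    """A toy exact-match cache with LFU eviction (illustration only)."""
+
+    def __init__(self, maxsize: int) -> None:
+        self.maxsize = maxsize
+        self.responses = {}         # cache key -> stored LLM response
+        self.frequency = Counter()  # cache key -> access count
+
+    def get(self, prompt: str):
+        key = cache_key(prompt)
+        if key in self.responses:
+            self.frequency[key] += 1   # cache hit: entry becomes "hotter"
+            return self.responses[key]
+        return None                    # cache miss: the caller calls the LLM
+
+    def put(self, prompt: str, response: str) -> None:
+        key = cache_key(prompt)
+        if key not in self.responses and len(self.responses) >= self.maxsize:
+            # Evict the least-frequently-used entry to stay within maxsize.
+            coldest = min(self.frequency, key=self.frequency.get)
+            del self.responses[coldest]
+            del self.frequency[coldest]
+        self.responses[key] = response
+        self.frequency[key] += 1
+
+
+# Whitespace-only differences map to the same key (a hit); any change in
+# wording maps to a different key (a miss).
+assert cache_key("Is this safe?") == cache_key("  Is this   safe? ")
+assert cache_key("Is this safe?") != cache_key("Is this really safe?")
+```
+
+Because matching is exact, rephrased or reworded prompts are cache misses; the hit rate reported in the cache statistics is highest when the same prompts are repeated word for word.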
+ +For observability, cache hits and misses are visible in OpenTelemetry (OTEL) telemetry and stored in logs on a configurable cadence. + +To get started with caching, refer to the example configurations below. The rest of this page provides a deep dive into how the cache works, telemetry, and considerations when enabling caching in a horizontally scalable service. + +--- ## Example Configuration -Let's walk through an example of adding caching to a Content-Safety Guardrails application. The initial `config.yml` without caching is shown below. -We are using a [Llama 3.3 70B-Instruct](https://build.nvidia.com/meta/llama-3_3-70b-instruct) main LLM to generate responses. Inputs are checked by [Content-Safety](https://build.nvidia.com/nvidia/llama-3_1-nemoguard-8b-content-safety), [Topic-Control](https://build.nvidia.com/nvidia/llama-3_1-nemoguard-8b-topic-control) and [Jailbreak detection](https://build.nvidia.com/nvidia/nemoguard-jailbreak-detect) models. The LLM response is also checked by the [Content-Safety](https://build.nvidia.com/nvidia/llama-3_1-nemoguard-8b-content-safety) model. -The input rails check the user prompt before sending it to the Main LLM to generate a response. The output rail checks both the user input and Main LLM response to make sure the response is safe. +The following example configurations show how to add caching to a Content-Safety Guardrails application. +The examples use a [Llama 3.3 70B-Instruct](https://build.nvidia.com/meta/llama-3_3-70b-instruct) as the main LLM to generate responses. Inputs are checked by the [Content-Safety](https://build.nvidia.com/nvidia/llama-3_1-nemoguard-8b-content-safety), [Topic-Control](https://build.nvidia.com/nvidia/llama-3_1-nemoguard-8b-topic-control), and [Jailbreak Detection](https://build.nvidia.com/nvidia/nemoguard-jailbreak-detect) models. The LLM response is also checked by the Content-Safety model. +The input rails check the user prompt before sending it to the main LLM to generate a response. The output rail checks both the user input and main LLM response to ensure the response is safe. + +### Without Caching + +The following `config.yml` file shows the initial configuration without caching. ```yaml -# Content-Safety config.yml (without caching) models: - type: main engine: nim @@ -51,10 +60,12 @@ rails: api_key_env_var: NVIDIA_API_KEY ``` -The yaml file below shows the same configuration, with caching enabled on the Content-Safety, Topic-Control, and Jailbreak detection Nemoguard NIMs. All three caches have a size of 10,000 records. The caches log their statistics every 60 seconds. +### With Caching + +The following configuration file shows the same configuration with caching enabled on the Content-Safety, Topic-Control, and Jailbreak Detection Nemoguard NIM microservices. +All three caches have a size of 10,000 records and log their statistics every 60 seconds. ```yaml -# Content-Safety config.yml (with caching) models: - type: main engine: nim @@ -108,26 +119,39 @@ rails: api_key_env_var: NVIDIA_API_KEY ``` -## How does the Cache work? +--- + +## How the Cache Works + +When the cache is enabled, Guardrails checks whether a prompt was already sent to the LLM before making each call. This uses an exact-match lookup after removing whitespace. -When the cache is enabled, prior to each LLM call we first check to see if we already sent the same prompt to the LLM, and return the response if so. This uses an exact-match lookup, after removing whitespace. -If there's a cache hit (i.e. 
the same prompt was sent to the same LLM earlier and the response was stored in the cache), then the response can be returned without calling the LLM. -If there's a cache miss (i.e. we don't have a stored LLM response for this prompt in the cache), then the LLM is called as usual. When the response is received, this is stored in the cache. +If there is a cache hit (that is, the same prompt was sent to the same LLM earlier and the response was stored in the cache), the response is returned without calling the LLM. -For security reasons, user prompts are not stored directly. After removing whitespace, the user-prompt is hashed using SHA256 and then used as a cache key. +If there is a cache miss (that is, there is no stored LLM response for this prompt in the cache), the LLM is called as usual. When the response is received, it is stored in the cache. + +For security reasons, user prompts are not stored directly. After removing whitespace, the user prompt is hashed using SHA256 and then used as a cache key. If a new cache record needs to be added and the cache already has `maxsize` entries, the Least-Frequently Used (LFU) algorithm is used to decide which cache record to evict. The LFU algorithm ensures that the most frequently accessed cache entries remain in the cache, improving the probability of a cache hit. -## Telemetry and logging +--- -Guardrails supports OTEL telemetry to trace client requests through Guardrails and any calls to LLMs or APIs. The cache operation is reflected in these traces, with cache hits having a far shorter duration and no LLM call and cache misses having an LLM call. This OTEL telemetry is a good fit for operational dashboards. -The cache statistics are also logged on a configurable cadence if `cache.stats.enabled` is set to `true`. Every `log_interval` seconds, the cache statistics are logged with the format below. -The most important metric below is the "Hit Rate", which is the proportion of LLM calls returned from the cache. If this value remains low, the exact-match may not be a good fit for your usecase. -These statistics accumulate for the time Guardrails is running. +## Telemetry and Logging +Guardrails supports OTEL telemetry to trace client requests through Guardrails and any calls to LLMs or APIs. The cache operation is reflected in these traces: -``` +- **Cache hits** have a far shorter duration with no LLM call +- **Cache misses** include an LLM call + +This OTEL telemetry is suited for operational dashboards. + +The cache statistics are also logged on a configurable cadence if `cache.stats.enabled` is set to `true`. Every `log_interval` seconds, the cache statistics are logged with the format shown below. + +The most important metric is the *Hit Rate*, which represents the proportion of LLM calls returned from the cache. If this value remains low, the exact-match approach might not be a good fit for your use case. + +These statistics accumulate while Guardrails is running. + +```text "LFU Cache Statistics - " "Size: 0.23453 | " "Hits: 20 | " @@ -138,20 +162,25 @@ These statistics accumulate for the time Guardrails is running. "Updates: 0" ``` -These metrics are detailed below: +The following list describes the metrics included in the cache statistics: + +- **Size**: The number of LLM calls stored in the cache. +- **Hits**: The number of cache hits. +- **Misses**: The number of cache misses. +- **Hit Rate**: The proportion of calls returned from the cache. 
This is a value between 0.0 (all calls sent to the LLM) and 1.0 (all calls returned from the cache), logged as a percentage; in the log example above, 20 hits and 3 misses give a hit rate of 20/23, or about 87%.
+- **Evictions**: The number of cache evictions.
+- **Puts**: The number of new cache records stored.
+- **Updates**: The number of existing cache records updated.
+
+---
+
+## Horizontal Scaling and Caching
 
-* Size: The number of LLM calls stored in the cache.
-* Hits: The number of cache hits.
-* Misses: The number of cache misses.
-* Hit Rate: The proportion of calls returned from the cache. This is a float between 1.0 (all calls returned from cache) and 0.0 (all calls sent to LLM)
-* Evictions: Number of cache evictions.
-* Puts: Number of new cache records stored.
-* Updates: Number of existing cache records updated.
+This cache is implemented in-memory on each Guardrails node. When operating as a horizontally-scaled backend service, multiple Guardrails nodes run behind an API Gateway and load balancer to distribute traffic and meet availability and performance targets.
 
-## Horizontal scaling and caching
+The current cache implementation maintains a separate cache on each node without sharing cache entries between nodes. For a cache hit to occur, the following conditions must be met:
 
-This cache is implemented in-memory on each Guardrails node. When operating as a horizontally-scaled backend-service, there are many Guardrails nodes running behind an API Gateway and load-balancer to distribute traffic and meet availability and performance targets.
-The current cache implementation has a separate cache on each node, with no sharing of cache entries between nodes.
-Because the load balancer spreads traffic over all Guardrails nodes, requests have to both be stored in cache, with the load balancer directing the same request to the same node.
-In practice, frequently-requested user prompts will likely be spread over Guardrails nodes by the load balancer, so the performance impact may be less significant.
+1. The same prompt must have been sent previously and its response stored in a node's cache.
+2. The load balancer must direct the subsequent request to the same node.
+
+In practice, the load balancer spreads traffic across all Guardrails nodes, distributing frequently-requested user prompts across multiple nodes. This reduces cache hit rates in horizontally-scaled deployments compared to single-node deployments.

From 001f84a7222c8d3a1514aaa326fafd0495b00361 Mon Sep 17 00:00:00 2001
From: tgasser-nv <200644301+tgasser-nv@users.noreply.github.com>
Date: Fri, 31 Oct 2025 09:04:48 -0500
Subject: [PATCH 8/8] Final updates to example log format

---
 docs/user-guides/advanced/model-memory-cache.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/user-guides/advanced/model-memory-cache.md b/docs/user-guides/advanced/model-memory-cache.md
index 5cfa6d127..4701bc32f 100644
--- a/docs/user-guides/advanced/model-memory-cache.md
+++ b/docs/user-guides/advanced/model-memory-cache.md
@@ -153,13 +153,13 @@ These statistics accumulate while Guardrails is running.
 
 ```text
 "LFU Cache Statistics - "
-"Size: 0.23453 | "
+"Size: 23/10000 | "
 "Hits: 20 | "
 "Misses: 3 | "
 "Hit Rate: 87% | "
 "Evictions: 0 | "
-"Puts: 20 | "
-"Updates: 0"
+"Puts: 21 | "
+"Updates: 4"