
Conversation

@narengogi
Collaborator

No description provided.

@matter-code-review
Contributor

Code Quality · new feature

Summary By MatterAI

🔄 What Changed

Added support for Anthropic-style cache token usage tracking in Google Vertex AI's chat completion responses. The transform extracts cache_creation_input_tokens and cache_read_input_tokens from the provider's usage data and includes them conditionally in both sync and streaming responses. It also computes a new total_tokens field that aggregates all token types.
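
A minimal sketch of what this mapping might look like. Field names follow Anthropic's usage schema as described above; the helper name and exact shape are assumptions, not the actual diff in chatComplete.ts:

```ts
// Sketch only: the helper name and return shape are illustrative assumptions.
interface VertexAnthropicUsage {
  input_tokens: number;
  output_tokens: number;
  cache_creation_input_tokens?: number;
  cache_read_input_tokens?: number;
}

function toOpenAIUsage(usage: VertexAnthropicUsage) {
  const cacheCreation = usage.cache_creation_input_tokens;
  const cacheRead = usage.cache_read_input_tokens;

  return {
    prompt_tokens: usage.input_tokens,
    completion_tokens: usage.output_tokens,
    // total_tokens aggregates regular and cache-related token counts.
    total_tokens:
      usage.input_tokens +
      usage.output_tokens +
      (cacheCreation ?? 0) +
      (cacheRead ?? 0),
    // Cache fields are forwarded only when the provider reported them.
    ...(cacheCreation !== undefined && {
      cache_creation_input_tokens: cacheCreation,
    }),
    ...(cacheRead !== undefined && {
      cache_read_input_tokens: cacheRead,
    }),
  };
}
```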

🔍 Impact of the Change

Enables accurate cost and performance monitoring for cached LLM calls in Vertex AI by exposing cache-specific token metrics. This improves observability for users leveraging model caching and aligns the response format with Anthropic's usage schema.

📁 Total Files Changed

| File | ChangeLog |
| --- | --- |
| src/providers/google-vertex-ai/chatComplete.ts (cache mapping) | Added parsing and conditional inclusion of cache token fields in the usage object for both full and streaming responses |

🧪 Test Added/Recommended

Recommended

  • Unit test for the usage object when cache_creation_input_tokens is present but cache_read_input_tokens is absent (and vice versa) — a rough test sketch follows this list
  • Integration test validating total_tokens calculation with mixed cache/non-cache scenarios
  • Schema validation test ensuring backward compatibility when cache fields are missing
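
A rough shape for the first two recommendations, assuming a Jest-style test runner and the hypothetical toOpenAIUsage() helper sketched earlier (neither is confirmed by the diff):

```ts
// Sketch only: assumes a Jest-style runner and the hypothetical
// toOpenAIUsage() helper from the earlier sketch.
describe('Vertex AI cache usage mapping', () => {
  it('includes cache_read_input_tokens when only cache reads are reported', () => {
    const usage = toOpenAIUsage({
      input_tokens: 10,
      output_tokens: 5,
      cache_read_input_tokens: 20,
    });
    expect(usage.cache_read_input_tokens).toBe(20);
    expect(usage.cache_creation_input_tokens).toBeUndefined();
    expect(usage.total_tokens).toBe(35);
  });

  it('omits cache fields and keeps totals stable when caching is unused', () => {
    const usage = toOpenAIUsage({ input_tokens: 10, output_tokens: 5 });
    expect(usage.cache_creation_input_tokens).toBeUndefined();
    expect(usage.total_tokens).toBe(15);
  });
});
```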

🔒 Security Vulnerabilities

N/A

⏳ Estimated code review effort

LOW (~7 minutes)

Tip

Quality Recommendations

  1. Add explicit type definitions for cache token fields to prevent runtime errors (a sketch follows after this list)

  2. Include zero-value handling in tests to ensure consistent serialization

  3. Add telemetry logging when cache tokens are detected to monitor cache hit rates
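
One possible shape for the type definitions suggested in item 1. The interface names here are illustrative, not the repository's actual types:

```ts
// Sketch only: illustrative names, not the repository's actual interfaces.
interface AnthropicCacheUsage {
  cache_creation_input_tokens?: number;
  cache_read_input_tokens?: number;
}

interface ChatCompletionUsage extends AnthropicCacheUsage {
  prompt_tokens: number;
  completion_tokens: number;
  total_tokens: number;
}
```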

♫ Tanka Poem

Cache tokens now seen, 🌱
Hidden costs come to light—
Total tokens counted, 💡
Efficiency blooms in code, 🌿
Smart reuse, less load, more speed. 🚀

Sequence Diagram

```mermaid
sequenceDiagram
    participant V as Vertex AI
    participant C as chatComplete()
    participant R as Response Handler
    participant S as Stream Processor
    participant Client

    V->>C: response.usage
    C->>C: Extract input_tokens, output_tokens
    C->>C: Destructure cache_creation_input_tokens, cache_read_input_tokens
    C->>C: Compute total_tokens = input + output + cache_creation + cache_read
    C->>C: Set shouldSendCacheUsage flag

    alt Cache Usage Present
        C->>R: Include cache fields in usage
        R-->>C: Enhanced usage object
    else No Cache Data
        C->>R: Omit cache fields
    end

    S->>S: Parse stream chunk
    S->>S: Initialize streamState.usage with prompt_tokens
    S->>S: Conditionally add cache tokens to usage
    S->>S: Compute totalTokens on delta
    S-->>Client: Stream data with total_tokens
```
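
The streaming half of the diagram could look roughly like the sketch below: the prompt and cache token counts are captured once into stream state, and the final delta chunk emits a usage object with total_tokens. The chunk shape, state shape, and function name are assumptions for illustration:

```ts
// Sketch only: chunk/state shapes and names are assumptions, not the actual
// stream transformer in chatComplete.ts.
interface StreamUsageState {
  prompt_tokens: number;
  cache_creation_input_tokens?: number;
  cache_read_input_tokens?: number;
}

function accumulateStreamUsage(
  chunk: {
    message?: { usage?: VertexAnthropicUsage };
    usage?: { output_tokens?: number };
  },
  streamState: { usage?: StreamUsageState }
) {
  // message_start-style chunk: record prompt and cache token counts once.
  if (chunk.message?.usage) {
    const u = chunk.message.usage;
    streamState.usage = {
      prompt_tokens: u.input_tokens,
      ...(u.cache_creation_input_tokens !== undefined && {
        cache_creation_input_tokens: u.cache_creation_input_tokens,
      }),
      ...(u.cache_read_input_tokens !== undefined && {
        cache_read_input_tokens: u.cache_read_input_tokens,
      }),
    };
  }

  // message_delta-style chunk: emit final usage with total_tokens.
  if (chunk.usage?.output_tokens !== undefined && streamState.usage) {
    const s = streamState.usage;
    return {
      ...s,
      completion_tokens: chunk.usage.output_tokens,
      total_tokens:
        s.prompt_tokens +
        chunk.usage.output_tokens +
        (s.cache_creation_input_tokens ?? 0) +
        (s.cache_read_input_tokens ?? 0),
    };
  }
  return undefined;
}
```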

@matter-code-review
Contributor

✅ Reviewed the changes: cache usage token mapping implementation for Google Vertex AI responses

@narengogi requested a review from VisargD on October 28, 2025 at 11:00
