
Conversation

@UmeshpJadhav
Contributor

@UmeshpJadhav UmeshpJadhav commented Dec 26, 2025

feat: implement document chunking and embedding utilities and integrate into core DocumentRetriever

PR Checklist

Please check if your PR fulfills the following requirements:

Bugs / Features

What is the current behavior?

There are no built-in utilities for document chunking, embedding, or ingestion in the core framework, making it difficult to implement RAG workflows.

What is the new behavior?

This PR introduces a new @voltagent/documents package and integrates it with @voltagent/core.

  • New Package: packages/documents
  • RecursiveCharacterTextSplitter: Smart text chunking with overlap.
  • OpenAIEmbeddingModel: Wrapper for OpenAI embeddings.
  • DocumentProcessor: Utility to split and embed text.
  • Core Integration:
    - DocumentRetriever: New abstract class in core that adds ingest() capabilities.

fixes #6
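The recursive-splitting idea behind RecursiveCharacterTextSplitter can be illustrated with a simplified, self-contained sketch (this is not the package's actual implementation; overlap handling is omitted for brevity, and the default separator list is an assumption based on common conventions):

```typescript
// Simplified illustration of recursive character splitting.
// Separators are tried in order; when a piece is still longer than
// chunkSize, the next (finer) separator is applied to it, falling
// back to hard character cuts at the end.
function recursiveSplit(
  text: string,
  chunkSize: number,
  separators: string[] = ["\n\n", "\n", " ", ""],
): string[] {
  if (text.length === 0) return [];
  if (text.length <= chunkSize) return [text];
  const [sep, ...rest] = separators;
  if (sep === undefined || sep === "") {
    // Character-level fallback: hard-cut into chunkSize slices.
    const chunks: string[] = [];
    for (let i = 0; i < text.length; i += chunkSize) {
      chunks.push(text.slice(i, i + chunkSize));
    }
    return chunks;
  }
  const chunks: string[] = [];
  for (const piece of text.split(sep)) {
    if (piece.length === 0) continue;
    if (piece.length <= chunkSize) {
      chunks.push(piece);
    } else {
      chunks.push(...recursiveSplit(piece, chunkSize, rest));
    }
  }
  return chunks;
}
```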

Notes for reviewers

  • The core build may have pre-existing environment issues, but the new test-integration script verifies that the types and exports work correctly.
  • A new dependency on openai was added to packages/documents.

Summary by cubic

Adds document chunking and embeddings via a new @voltagent/documents package, and integrates ingestion into core DocumentRetriever to enable RAG workflows. Fixes #6.

  • New Features

    • New @voltagent/documents package with RecursiveCharacterTextSplitter, OpenAIEmbeddingModel, and DocumentProcessor.
    • Core retriever adds ingest(), default retrieve that embeds queries, empty-array safety, and abstract hooks: upsertDocuments() and queryVectors(). Also exports ProcessedDocument type.
    • README and tests for documents utilities.
  • Migration

    • Implement upsertDocuments() and queryVectors() in your retriever to store and search vectors.
    • Set OPENAI_API_KEY for embeddings; optionally configure model and chunk sizes (chunkOverlap must be less than chunkSize).
    • Ingest with processor.process(text, metadata) or retriever.ingest(text, metadata).
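The two abstract hooks in the migration steps above could be backed by something as simple as an in-memory vector store. The sketch below is hypothetical (the `ProcessedDocument` shape and method signatures are assumptions based on this PR description, and `InMemoryVectorStore` is not part of the package):

```typescript
// Hypothetical in-memory implementation of upsertDocuments/queryVectors,
// using cosine similarity for ranking. Illustrative only.
interface ProcessedDocument {
  content: string;
  embedding: number[];
  metadata?: Record<string, unknown>;
}

function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

class InMemoryVectorStore {
  private docs: ProcessedDocument[] = [];

  async upsertDocuments(docs: ProcessedDocument[]): Promise<void> {
    this.docs.push(...docs);
  }

  // Return the k stored documents most similar to the query vector.
  async queryVectors(query: number[], k = 4): Promise<ProcessedDocument[]> {
    return [...this.docs]
      .sort((x, y) => cosine(y.embedding, query) - cosine(x.embedding, query))
      .slice(0, k);
  }
}
```

A real retriever would implement these hooks against an actual vector database; the in-memory version is useful mainly for tests.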

Written for commit 64e2cff. Summary will update on new commits.

Summary by CodeRabbit

  • New Features

    • New document processing package with automatic text chunking and segmentation
    • Built-in support for document embeddings powered by AI models
    • Integrated document retrieval and semantic search in core framework
  • Documentation

    • Added comprehensive RAG (Retrieval-Augmented Generation) documentation with implementation examples


@changeset-bot

changeset-bot bot commented Dec 26, 2025

🦋 Changeset detected

Latest commit: 64e2cff

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 2 packages
Name Type
@voltagent/documents Minor
@voltagent/core Minor


Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment


5 issues found across 17 files

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="packages/documents/package.json">

<violation number="1" location="packages/documents/package.json:35">
P1: Version mismatch: vitest `^1.0.0` conflicts with project-level `^3.2.4`. In a monorepo with syncpack, dependency versions should be consistent across packages to avoid compatibility issues.</violation>

<violation number="2" location="packages/documents/package.json:36">
P2: Version mismatch: @types/node `^20.0.0` conflicts with project-level `^24.2.1`. Consider aligning with the monorepo's standard version for consistent type definitions.</violation>
</file>

<file name="packages/documents/src/DocumentProcessor.ts">

<violation number="1" location="packages/documents/src/DocumentProcessor.ts:23">
P2: Missing validation that `embeddings.length` matches `chunks.length`. If the embedding model returns fewer embeddings than expected, `embeddings[index]` could be `undefined`, causing silent data corruption in the returned `ProcessedDocument[]`. Consider adding a length validation check after fetching embeddings.</violation>
</file>
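The length-validation suggestion above can be sketched as a standalone helper (names mirror the PR's DocumentProcessor, but this is illustrative, not the package's code):

```typescript
// Sketch of the suggested guard: verify the embedder returned exactly
// one embedding per chunk before zipping them together, so a short
// response fails loudly instead of silently producing undefined entries.
interface Embedded {
  content: string;
  embedding: number[];
}

async function embedChunks(
  chunks: string[],
  embedDocuments: (docs: string[]) => Promise<number[][]>,
): Promise<Embedded[]> {
  const embeddings = await embedDocuments(chunks);
  if (!Array.isArray(embeddings) || embeddings.length !== chunks.length) {
    throw new Error(
      `Embedding count mismatch: got ${embeddings?.length ?? 0}, expected ${chunks.length}`,
    );
  }
  return chunks.map((content, i) => ({ content, embedding: embeddings[i] }));
}
```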

<file name="packages/documents/src/text-splitters/TextSplitter.ts">

<violation number="1" location="packages/documents/src/text-splitters/TextSplitter.ts:14">
P2: Missing validation for positive values. The validation checks that `chunkOverlap < chunkSize`, but doesn't ensure `chunkSize > 0` and `chunkOverlap >= 0`. This allows invalid configurations like zero or negative values to pass.</violation>
</file>
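The fuller validation the comment asks for might look like this (a hedged sketch; `SafeSplitterParams` is a hypothetical stand-in for the splitter's constructor logic):

```typescript
// Sketch of complete parameter validation for a text splitter:
// reject non-positive chunkSize and negative chunkOverlap, not just
// chunkOverlap >= chunkSize.
class SafeSplitterParams {
  readonly chunkSize: number;
  readonly chunkOverlap: number;

  constructor(chunkSize = 1000, chunkOverlap = 200) {
    if (!Number.isInteger(chunkSize) || chunkSize <= 0) {
      throw new Error("chunkSize must be a positive integer");
    }
    if (!Number.isInteger(chunkOverlap) || chunkOverlap < 0) {
      throw new Error("chunkOverlap must be a non-negative integer");
    }
    if (chunkOverlap >= chunkSize) {
      throw new Error("chunkOverlap must be less than chunkSize");
    }
    this.chunkSize = chunkSize;
    this.chunkOverlap = chunkOverlap;
  }
}
```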

<file name="packages/core/src/retriever/document-retriever.ts">

<violation number="1" location="packages/core/src/retriever/document-retriever.ts:56">
P1: Accessing `input[input.length - 1]` will throw a TypeError if `input` is an empty array. Add a guard to handle this edge case.</violation>
</file>
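The empty-array guard flagged above can be sketched as a small extraction helper (the `Message` shape here is a simplified stand-in for the core `BaseMessage` type):

```typescript
// Sketch of the suggested guard: extract a text query from either a
// plain string or a (possibly empty) message array without throwing
// on input[input.length - 1] for empty arrays.
interface Message {
  content: unknown;
}

function lastTextQuery(input: string | Message[]): string | undefined {
  if (typeof input === "string") return input;
  if (input.length === 0) return undefined; // guard: no messages to query
  const content = input[input.length - 1].content;
  return typeof content === "string" ? content : undefined;
}
```

A caller would treat `undefined` as "nothing to embed" and skip the vector query.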

Since this is your first cubic review, here's how it works:

  • cubic automatically reviews your code and comments on bugs and improvements
  • Teach cubic by replying to its comments. cubic learns from your replies and gets better over time
  • Ask questions if you need clarification on any suggestion

Reply to cubic to teach it or ask questions. Tag @cubic-dev-ai to re-run a review.

Member

@omeraplak omeraplak left a comment


Hey, thanks, the PR looks great!
I actually started working on the packages/rag package a few weeks ago. Do you think we should include this package as well?

Also, would you like to add it to the docs here?
https://voltagent.dev/docs/rag/overview/

Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment


2 issues found across 1 file (changes from recent commits).

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="website/docs/rag/overview.md">

<violation number="1" location="website/docs/rag/overview.md:257">
P1: Incorrect API usage in documentation example. `RecursiveChunker` constructor accepts an optional `Tokenizer`, not options. Options like `maxTokens` and `overlapTokens` should be passed to the `chunk()` method. Also, `chunk()` is synchronous, not async.</violation>

<violation number="2" location="website/docs/rag/overview.md:264">
P1: Incorrect API usage in documentation example. `MarkdownChunker` constructor accepts an optional `Tokenizer`, not options. Options like `maxTokens` should be passed to the `chunk()` method. Also, `chunk()` is synchronous, not async.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

UmeshpJadhav and others added 2 commits December 30, 2025 18:09
Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
@UmeshpJadhav
Contributor Author

@omeraplak
Thanks! I've updated the documentation in docs/rag/overview to include a new section on "Advanced Chunking" that highlights the @voltagent/rag package and its capabilities.

I've also resolved the recent merge conflicts and fixed the linting issues. The PR should be good to go now!

@UmeshpJadhav UmeshpJadhav reopened this Jan 13, 2026
@coderabbitai
Contributor

coderabbitai bot commented Jan 13, 2026

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Walkthrough

This pull request introduces a new @voltagent/documents package containing text chunking and embedding utilities, including RecursiveCharacterTextSplitter for document splitting and OpenAIEmbeddingModel for vector embeddings, orchestrated via DocumentProcessor. It integrates these utilities into core via a new DocumentRetriever class supporting document ingestion and vector-based retrieval workflows.

Changes

Cohort / File(s) Summary
Documents Package — Text Splitting
packages/documents/src/text-splitters/TextSplitter.ts, packages/documents/src/text-splitters/RecursiveCharacterTextSplitter.ts, packages/documents/src/text-splitters/RecursiveCharacterTextSplitter.test.ts
Introduces abstract TextSplitter base class with validation (chunkSize, chunkOverlap) and concrete RecursiveCharacterTextSplitter with multi-separator fallback recursion. Tests verify character splitting, separator-based chunking, and recursive behavior.
Documents Package — Embedding Models
packages/documents/src/embeddings/EmbeddingModel.ts, packages/documents/src/embeddings/OpenAIEmbeddingModel.ts
Defines EmbeddingModel interface (embedQuery, embedDocuments). Implements OpenAIEmbeddingModel with batch processing (default 512 max batch size), per-batch API calls, and result re-ordering to preserve document sequence.
Documents Package — Document Processing
packages/documents/src/DocumentProcessor.ts, packages/documents/src/DocumentProcessor.test.ts
Adds ProcessedDocument interface and DocumentProcessor class orchestrating text splitting and embedding. Augments chunk metadata with chunkIndex and chunkCount. Test validates splits, embeddings, and metadata attachment.
Documents Package — Configuration & Exports
packages/documents/package.json, packages/documents/tsconfig.json, packages/documents/tsup.config.ts, packages/documents/src/index.ts, packages/documents/README.md
Establishes new package with ESM/CJS exports, TypeScript/tsup build configs. README includes installation and usage examples for text splitting, embeddings, and document processing workflows.
Core Integration
packages/core/package.json, packages/core/src/retriever/document-retriever.ts, packages/core/src/retriever/index.ts
Adds @voltagent/documents dependency to core. Introduces DocumentRetriever abstract class extending BaseRetriever, providing ingest(text, metadata) to process and persist documents, and retrieve(input, options) to embed queries and search vectors (default top-k=4).
Lint & Build Configuration
biome.json, packages/core/src/agent/subagent/index.ts, packages/server-core/src/websocket/setup.ts
Updates Biome ignore list with archive and website directories. Adds biome-ignore directives to handoffTask and setupWebSocketUpgrade to suppress complexity warnings on legacy code.
Documentation
website/docs/rag/overview.md
Adds "Advanced Chunking" section documenting @voltagent/rag package with installation, available chunkers, and example usage (content appears duplicated in patch).
Changeset
.changeset/document-utils-integration.md
Documents feature: bumps @voltagent/documents and @voltagent/core to minor versions with note "feat: implement document chunking and embedding utilities and integrate into core."

Sequence Diagram(s)

sequenceDiagram
    participant App as Application
    participant DR as DocumentRetriever
    participant DP as DocumentProcessor
    participant TS as TextSplitter
    participant EM as EmbeddingModel
    participant VS as Vector Store

    App->>DR: ingest(text, metadata)
    DR->>DP: process(text, metadata)
    DP->>TS: splitText(text)
    TS-->>DP: chunks[]
    DP->>EM: embedDocuments(chunks)
    EM-->>DP: embeddings[][]
    DP-->>DR: ProcessedDocument[]
    DR->>VS: upsertDocuments(ProcessedDocument[])
    VS-->>DR: stored

    App->>DR: retrieve(query)
    DR->>EM: embedQuery(query)
    EM-->>DR: embedding[]
    DR->>VS: queryVectors(embedding, k=4)
    VS-->>DR: ProcessedDocument[]
    DR-->>App: joined result text

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • PR #936: Refactors subagent handoff and streaming logic in the same codebase where biome-ignore was added to suppress complexity warnings, indicating potential overlapping concerns around cognitive complexity management.

Poem

🐰 Chunks and vectors, oh what joy!
With embeddings from OpenAI,
Documents split with care so fine,
Vector searches by design.
The RAG dream comes to life,
No more retrieval strife!

🚥 Pre-merge checks | ✅ 3 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
Check name Status Explanation Resolution
Out of Scope Changes check ⚠️ Warning The PR contains out-of-scope documentation additions: website/docs/rag/overview.md includes a duplicated "Advanced Chunking" section describing @voltagent/rag package, which is unrelated to the stated objectives of implementing document utilities. Remove the duplicated content from website/docs/rag/overview.md or clarify if @voltagent/rag package changes are part of this PR's scope.
Title check ❓ Inconclusive The title "Feat/document utils" is vague and generic, using non-descriptive terminology that doesn't clearly convey the specific changes made. Use a more descriptive title such as "Feat: add document chunking and embedding utilities package" to better communicate the primary change.
✅ Passed checks (3 passed)
Check name Status Explanation
Linked Issues check ✅ Passed The PR successfully implements all MVP acceptance criteria from issue #6: RecursiveCharacterTextSplitter for chunking, OpenAIEmbeddingModel for embeddings, DocumentProcessor orchestration, and integration into core DocumentRetriever with proper documentation and tests.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Description check ✅ Passed The PR description comprehensively covers all required template sections with detailed information about features, current/new behavior, and includes checklist items.


✨ Finishing touches
  • 📝 Generate docstrings

Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🤖 Fix all issues with AI agents
In @packages/core/src/retriever/document-retriever.ts:
- Around line 59-63: The code assumes the selected BaseMessage content is a
string and casts it before calling processor.embedder.embedQuery; instead
validate the content type from input (and the last BaseMessage) before
embedding: ensure textQuery is a string, handle undefined or non-string content
by extracting a text field if present (e.g., content.text), falling back to a
safe serialization like JSON.stringify(content) or returning/logging an error
and not calling embedQuery; replace the direct cast on textQuery and guard the
call to this.processor.embedder.embedQuery accordingly to avoid runtime crashes.
- Line 66: The code uses a cast to any to read options.k; instead extend the
RetrieveOptions type (or create a new VectorRetrieveOptions that extends
RetrieveOptions) to include an optional k?: number, update the method signature
to accept that typed options, and replace the cast line with a typed access
(e.g., destructure or read options.k with a default of 4) so type safety is
preserved for the k parameter in DocumentRetriever/document-retrieval logic.

In @packages/documents/package.json:
- Around line 29-31: The package.json currently pins the "openai" dependency to
"^4.20.0"; update that dependency to a supported 6.x release (e.g., "^6.16.0")
in the "dependencies" entry for "openai", then run your package manager
(npm/yarn/pnpm) to install and update lockfiles; after upgrading, run tests and
fix any breaking API changes in code that uses the OpenAI SDK (search for
imports/usages of "openai" and update client construction and method names per
the 6.x migration guide).

In @packages/documents/src/DocumentProcessor.ts:
- Around line 21-34: In process, guard against embedder.embedDocuments returning
fewer items than chunks: after const embeddings = await
this.embedder.embedDocuments(chunks); check that embeddings is an array and
embeddings.length === chunks.length (or at least >= chunks.length); if not,
either throw a clear error or fill missing entries with a safe fallback (e.g.,
null vector or empty embedding) and log the mismatch via the class logger;
ensure the returned ProcessedDocument objects use validated/fallback embeddings
so embeddings[index] cannot be undefined.

In @packages/documents/src/embeddings/OpenAIEmbeddingModel.ts:
- Around line 23-29: The embedQuery method lacks a safety check for an empty API
response; after calling this.client.embeddings.create in embedQuery, verify that
response && Array.isArray(response.data) && response.data.length > 0 and that
response.data[0].embedding exists before returning it; if the check fails,
either throw a descriptive error (e.g., "Empty embeddings response from OpenAI")
or return a sensible default (empty array) and log the incident so callers don't
get a runtime exception when accessing response.data[0].embedding.
- Line 19: The default embedding model in OpenAIEmbeddingModel is outdated;
update the default assignment for this.model (in the OpenAIEmbeddingModel
constructor/initialization) from "text-embedding-ada-002" to
"text-embedding-3-large" (or "text-embedding-3-small" if you prefer cost
efficiency) so new instances use the current best-practice embedding by default
while preserving params override behavior.
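The typed-options fix suggested for document-retriever.ts above can be sketched as follows (`RetrieveOptions` here is a minimal stand-in for the core type, not its real definition):

```typescript
// Sketch of replacing the `as any` cast with a typed options object:
// extend the retrieve options with an optional k and read it with a
// default, preserving type safety.
interface RetrieveOptions {
  signal?: AbortSignal;
}

interface VectorRetrieveOptions extends RetrieveOptions {
  k?: number;
}

function resolveTopK(options?: VectorRetrieveOptions): number {
  return options?.k ?? 4; // default top-k per the PR description
}
```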
🧹 Nitpick comments (8)
packages/documents/tsconfig.json (1)

4-4: Consider removing DOM libs for a Node.js-only package.

This package handles document processing and embeddings, which are server-side operations. The dom and dom.iterable libs are typically unnecessary and could be removed to keep the type environment focused on Node.js.

Suggested change
-    "lib": ["dom", "dom.iterable", "esnext"],
+    "lib": ["esnext"],
packages/documents/package.json (1)

1-37: Add missing package metadata for consistency with other packages.

The package is missing license, repository, keywords, and author fields that are present in other packages like @voltagent/core. Consider adding these for consistency and proper npm publishing.

Suggested additions
 {
     "name": "@voltagent/documents",
     "version": "0.0.1",
     "description": "Document processing and embedding utilities for VoltAgent",
+    "license": "MIT",
+    "repository": {
+        "type": "git",
+        "url": "https://github.com/VoltAgent/voltagent.git",
+        "directory": "packages/documents"
+    },
     "main": "dist/index.js",
packages/documents/src/text-splitters/TextSplitter.ts (1)

6-25: Consider making properties readonly.

The chunkSize and chunkOverlap properties are set once in the constructor and validated. Making them readonly would prevent accidental mutation and better express the design intent.

♻️ Suggested change
 export abstract class TextSplitter {
-  chunkSize: number;
-  chunkOverlap: number;
+  readonly chunkSize: number;
+  readonly chunkOverlap: number;
packages/documents/src/text-splitters/RecursiveCharacterTextSplitter.ts (1)

19-90: Consider handling empty input text.

The recursive splitting logic is well-implemented. However, when text is an empty string, "".split(separator) returns [""], causing splitText("") to return [""] rather than an empty array. This may be unexpected for callers expecting no chunks from empty input.

♻️ Optional early return for empty input
   private _splitText(text: string, separators: string[]): string[] {
+    if (text.length === 0) {
+      return [];
+    }
+
     const finalChunks: string[] = [];
packages/documents/src/RecursiveCharacterTextSplitter.test.ts (1)

4-75: Good test coverage for core functionality; consider adding edge case tests.

The three tests effectively cover character splitting, separator-based splitting, and recursive behavior. The inline comments documenting the expected logic are helpful for understanding.

Consider adding tests for:

  • Empty string input
  • Text smaller than chunkSize
  • Non-zero chunkOverlap behavior
  • Validation errors (e.g., chunkSize <= 0)
packages/documents/src/embeddings/OpenAIEmbeddingModel.ts (1)

31-48: Consider handling empty input array.

If documents is an empty array, this returns [] silently, which is acceptable. However, consider adding an early return for clarity and to avoid unnecessary iterations.

Also, there's no validation for maxBatchSize being a positive integer—a zero or negative value would cause an infinite loop or unexpected behavior.

♻️ Proposed improvements
+  constructor(params?: OpenAIEmbeddingModelParams) {
+    this.client = new OpenAI({
+      apiKey: params?.apiKey ?? process.env.OPENAI_API_KEY,
+    });
+    this.model = params?.model ?? "text-embedding-ada-002";
+    const batchSize = params?.maxBatchSize ?? 512;
+    if (batchSize <= 0) {
+      throw new Error("maxBatchSize must be a positive integer");
+    }
+    this.maxBatchSize = batchSize;
+  }

 async embedDocuments(documents: string[]): Promise<number[][]> {
+  if (documents.length === 0) {
+    return [];
+  }
   const embeddings: number[][] = [];
   // ... rest of implementation
 }
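The batching pattern this nitpick discusses can be isolated into a small, testable helper (a sketch, not the package's code; the validation mirrors the suggestion above):

```typescript
// Sketch of batch splitting for embedding requests: validate
// maxBatchSize, return early on empty input, then slice the documents
// into batches of at most maxBatchSize while preserving order.
function toBatches<T>(items: T[], maxBatchSize: number): T[][] {
  if (!Number.isInteger(maxBatchSize) || maxBatchSize <= 0) {
    throw new Error("maxBatchSize must be a positive integer");
  }
  if (items.length === 0) return []; // early return for empty input
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += maxBatchSize) {
    batches.push(items.slice(i, i + maxBatchSize));
  }
  return batches;
}
```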
packages/documents/src/DocumentProcessor.ts (1)

12-14: Consider marking fields as readonly.

Since splitter and embedder are set in the constructor and likely shouldn't be reassigned afterward, consider making them readonly for clarity and immutability.

♻️ Suggested change
 export class DocumentProcessor {
-  splitter: TextSplitter;
-  embedder: EmbeddingModel;
+  readonly splitter: TextSplitter;
+  readonly embedder: EmbeddingModel;
packages/documents/src/DocumentProcessor.test.ts (1)

21-43: LGTM! Consider additional edge-case tests.

The test validates the core processing flow correctly. For more comprehensive coverage, consider adding tests for:

  • Empty string input
  • Single chunk (no delimiter)
  • No metadata provided (undefined)
  • Empty array from splitter
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 969db3a and 64e2cff.

⛔ Files ignored due to path filters (1)
  • pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml
📒 Files selected for processing (20)
  • .changeset/document-utils-integration.md
  • biome.json
  • packages/core/package.json
  • packages/core/src/agent/subagent/index.ts
  • packages/core/src/retriever/document-retriever.ts
  • packages/core/src/retriever/index.ts
  • packages/documents/README.md
  • packages/documents/package.json
  • packages/documents/src/DocumentProcessor.test.ts
  • packages/documents/src/DocumentProcessor.ts
  • packages/documents/src/RecursiveCharacterTextSplitter.test.ts
  • packages/documents/src/embeddings/EmbeddingModel.ts
  • packages/documents/src/embeddings/OpenAIEmbeddingModel.ts
  • packages/documents/src/index.ts
  • packages/documents/src/text-splitters/RecursiveCharacterTextSplitter.ts
  • packages/documents/src/text-splitters/TextSplitter.ts
  • packages/documents/tsconfig.json
  • packages/documents/tsup.config.ts
  • packages/server-core/src/websocket/setup.ts
  • website/docs/rag/overview.md
🧰 Additional context used
📓 Path-based instructions (2)
**/*.ts

📄 CodeRabbit inference engine (AGENTS.md)

**/*.ts: Maintain type safety in TypeScript-first codebase
Never use JSON.stringify; use the safeStringify function instead, imported from @voltagent/internal

Files:

  • packages/documents/src/index.ts
  • packages/documents/src/embeddings/EmbeddingModel.ts
  • packages/core/src/agent/subagent/index.ts
  • packages/documents/src/text-splitters/TextSplitter.ts
  • packages/documents/src/RecursiveCharacterTextSplitter.test.ts
  • packages/server-core/src/websocket/setup.ts
  • packages/documents/src/text-splitters/RecursiveCharacterTextSplitter.ts
  • packages/documents/tsup.config.ts
  • packages/core/src/retriever/index.ts
  • packages/documents/src/DocumentProcessor.test.ts
  • packages/documents/src/embeddings/OpenAIEmbeddingModel.ts
  • packages/core/src/retriever/document-retriever.ts
  • packages/documents/src/DocumentProcessor.ts
**/*.test.ts

📄 CodeRabbit inference engine (AGENTS.md)

Test your changes - ensure all tests pass before committing

Files:

  • packages/documents/src/RecursiveCharacterTextSplitter.test.ts
  • packages/documents/src/DocumentProcessor.test.ts
🧠 Learnings (3)
📚 Learning: 2026-01-07T05:09:23.216Z
Learnt from: CR
Repo: VoltAgent/voltagent PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-07T05:09:23.216Z
Learning: Applies to **/*.ts : Maintain type safety in TypeScript-first codebase

Applied to files:

  • packages/core/package.json
  • packages/documents/tsconfig.json
  • packages/documents/tsup.config.ts
📚 Learning: 2026-01-07T05:09:23.217Z
Learnt from: CR
Repo: VoltAgent/voltagent PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-07T05:09:23.217Z
Learning: Follow the monorepo structure - changes may impact multiple packages

Applied to files:

  • packages/documents/package.json
📚 Learning: 2026-01-07T05:09:23.217Z
Learnt from: CR
Repo: VoltAgent/voltagent PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-07T05:09:23.217Z
Learning: Applies to **/*.test.ts : Test your changes - ensure all tests pass before committing

Applied to files:

  • packages/documents/tsconfig.json
  • packages/documents/src/RecursiveCharacterTextSplitter.test.ts
🧬 Code graph analysis (4)
packages/documents/src/RecursiveCharacterTextSplitter.test.ts (1)
packages/documents/src/text-splitters/RecursiveCharacterTextSplitter.ts (1)
  • RecursiveCharacterTextSplitter (7-91)
packages/documents/src/text-splitters/RecursiveCharacterTextSplitter.ts (1)
packages/documents/src/text-splitters/TextSplitter.ts (1)
  • TextSplitterParams (1-4)
packages/documents/src/DocumentProcessor.test.ts (2)
packages/documents/src/embeddings/EmbeddingModel.ts (1)
  • EmbeddingModel (1-4)
packages/documents/src/DocumentProcessor.ts (1)
  • DocumentProcessor (12-35)
packages/documents/src/embeddings/OpenAIEmbeddingModel.ts (2)
packages/documents/src/embeddings/EmbeddingModel.ts (1)
  • EmbeddingModel (1-4)
packages/documents/src/DocumentProcessor.ts (1)
  • process (21-34)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
  • GitHub Check: Test internal
  • GitHub Check: Test logger
  • GitHub Check: Test create-voltagent-app
  • GitHub Check: Test postgres
  • GitHub Check: Test docs-mcp
  • GitHub Check: Test server-core
  • GitHub Check: Test core
  • GitHub Check: Test logger
  • GitHub Check: Test libsql
  • GitHub Check: Test cli
  • GitHub Check: Test voice
  • GitHub Check: Build (Node 22)
  • GitHub Check: Test supabase
  • GitHub Check: Build (Node 24)
  • GitHub Check: Build (Node 20)
🔇 Additional comments (19)
website/docs/rag/overview.md (1)

242-280: The documented APIs are accurate for @voltagent/rag, not @voltagent/documents.

The documentation section (lines 242-280) correctly describes the existing @voltagent/rag package with accurate class names (RecursiveChunker, MarkdownChunker, etc.), method signatures (.chunk()), and parameters (maxTokens, overlapTokens). All referenced chunkers exist in the RAG package.

However, this creates a critical mismatch: the PR adds @voltagent/documents which provides RecursiveCharacterTextSplitter (with .splitText() method and chunkSize/chunkOverlap parameters) along with OpenAIEmbeddingModel and DocumentProcessor. The documentation section does not document any of these.

Clarify the documentation intent:

  • If this section is new and meant to document the @voltagent/documents package being added, it needs to describe that package's API, not RAG's.
  • If this section pre-existed and documents RAG (separate from this PR), clarify whether @voltagent/documents features need their own documentation section.
packages/server-core/src/websocket/setup.ts (1)

232-233: Lint suppression is acceptable for existing complex handler.

The biome-ignore directive appropriately acknowledges the cognitive complexity of this WebSocket upgrade handler without changing its behavior. Consider adding a TODO or tracking issue if there are plans to refactor this handler in the future.

biome.json (1)

64-66: Verify the website directory exclusion is intentional.

Adding archive to the ignore list is reasonable. However, excluding the entire website directory is broad—ensure this is intentional and that the website has its own linting/formatting configuration, or that it doesn't contain TypeScript/JavaScript that should be linted.

packages/documents/src/embeddings/EmbeddingModel.ts (1)

1-4: Clean interface design following embedding model conventions.

The EmbeddingModel interface is well-structured with appropriate async return types. This abstraction will allow easy swapping of embedding providers beyond OpenAI.

packages/core/package.json (1)

19-19: Workspace dependency correctly links the new documents package.

Using workspace:* is the appropriate pattern for internal monorepo dependencies, ensuring core always uses the local documents package version during development and publishing.

.changeset/document-utils-integration.md (1)

1-6: Changeset correctly captures the feature addition.

The minor version bump for both packages is appropriate—@voltagent/documents is a new package with new functionality, and @voltagent/core gains new capabilities through the integration.

packages/documents/tsconfig.json (1)

10-11: No action needed. The rootDir configuration is correct for this setup.

The tsconfig.json's rootDir and declaration settings are used only for type-checking (via tsc --noEmit). Declaration generation during the build is handled independently by tsup with dts: true, which generates declarations directly from the entry point (src/index.ts) and outputs them to dist/index.d.ts and dist/index.d.mts, as correctly specified in package.json's types exports. The configuration is already correct.

Likely an incorrect or invalid review comment.

packages/core/src/agent/subagent/index.ts (1)

319-320: LGTM!

The lint suppression directive is appropriate for this existing complex method. The handoffTask method handles multiple sub-agent configuration types and orchestrates complex handoff logic, justifying the complexity allowance.

packages/documents/tsup.config.ts (1)

1-19: LGTM!

The tsup configuration is well-structured for the new documents package. Dual CJS/ESM output, TypeScript declarations, and the shared external plugin are appropriately configured.

packages/documents/src/text-splitters/TextSplitter.ts (2)

1-4: LGTM!

Clean interface definition with appropriate optional parameters.


29-36: LGTM!

The createDocuments method correctly accumulates chunks from multiple input texts. Sequential processing is appropriate here to maintain predictable ordering.

packages/documents/src/text-splitters/RecursiveCharacterTextSplitter.ts (2)

1-13: LGTM!

Clean extension of TextSplitter with sensible default separators that progress from paragraph to character-level splitting.


15-17: LGTM!

Clean delegation pattern for the public async interface.

packages/documents/src/RecursiveCharacterTextSplitter.test.ts (1)

1-3: LGTM!

Correct imports from vitest and the implementation module.

packages/documents/README.md (1)

1-71: LGTM!

The README provides clear documentation with helpful code examples for each component. The examples demonstrate key configuration options and typical usage patterns.

packages/documents/src/index.ts (1)

1-5: LGTM!

The barrel exports cleanly expose the public API surface: text splitters, embedding models, and the document processor with its types.

packages/core/src/retriever/index.ts (1)

6-9: LGTM!

The new exports are properly structured. Re-exporting ProcessedDocument from @voltagent/documents and exposing the new DocumentRetriever class with its options type maintains a clean public API surface for the module.

packages/core/src/retriever/document-retriever.ts (2)

20-35: LGTM!

The constructor pattern with optional processor and fallback to default is appropriate. The ingest method cleanly delegates to the processor and abstract storage method with proper debug logging.


41-49: LGTM!

The abstract method contracts are well-defined, providing clear expectations for implementations to handle vector storage and retrieval.
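
A minimal in-memory implementation of the two hooks might look like the sketch below. The store and the cosine-similarity ranking are illustrative stand-ins, not VoltAgent APIs; only the method names mirror the contract described above.

```typescript
// Hypothetical in-memory backing store for the two abstract hooks.
interface StoredDoc {
  text: string;
  embedding: number[];
}

// Plain cosine similarity; the || 1 guards against zero-norm vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

class InMemoryRetriever {
  private store: StoredDoc[] = [];

  // Corresponds to the abstract upsertDocuments() hook.
  async upsertDocuments(docs: StoredDoc[]): Promise<void> {
    this.store.push(...docs);
  }

  // Corresponds to the abstract queryVectors() hook: top-k by cosine similarity.
  async queryVectors(queryVector: number[], k: number): Promise<string[]> {
    return [...this.store]
      .sort((x, y) => cosine(queryVector, y.embedding) - cosine(queryVector, x.embedding))
      .slice(0, k)
      .map((d) => d.text);
  }
}
```

A real implementation would delegate both hooks to a vector database client, but this shape is enough to exercise `ingest()` and the default `retrieve()` flow in tests.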

Comment on lines +59 to +63
const textQuery = typeof input === "string" ? input : input[input.length - 1].content;

// We assume the processor's embedder has an embedQuery method.
// Since DocumentProcessor exposes 'embedder', we can use it.
const queryVector = await this.processor.embedder.embedQuery(textQuery as string);
⚠️ Potential issue | 🟡 Minor

Validate content type before embedding.

The content property of BaseMessage may not always be a string (e.g., structured content in multi-modal messages). The cast on line 63 could mask runtime issues if content is an object or undefined.

Suggested defensive handling
-    const textQuery = typeof input === "string" ? input : input[input.length - 1].content;
-
-    // We assume the processor's embedder has an embedQuery method.
-    // Since DocumentProcessor exposes 'embedder', we can use it.
-    const queryVector = await this.processor.embedder.embedQuery(textQuery as string);
+    let textQuery: string;
+    if (typeof input === "string") {
+      textQuery = input;
+    } else {
+      const lastContent = input[input.length - 1].content;
+      textQuery = typeof lastContent === "string" ? lastContent : String(lastContent);
+    }
+
+    const queryVector = await this.processor.embedder.embedQuery(textQuery);

const queryVector = await this.processor.embedder.embedQuery(textQuery as string);

// Default top-k to 4, can be customizable via options
const k = (options as any).k ?? 4;
⚠️ Potential issue | 🟠 Major

Avoid any cast; extend RetrieveOptions with proper typing.

Using (options as any).k bypasses type safety. Consider extending RetrieveOptions or creating a dedicated options type for vector retrieval.

Proposed fix

Define proper types for the k parameter:

+export interface DocumentRetrieveOptions extends RetrieveOptions {
+  /**
+   * Number of top results to return from vector search.
+   * @default 4
+   */
+  k?: number;
+}
+
 export abstract class DocumentRetriever extends BaseRetriever {
   // ...
-  async retrieve(input: string | BaseMessage[], options: RetrieveOptions = {}): Promise<string> {
+  async retrieve(input: string | BaseMessage[], options: DocumentRetrieveOptions = {}): Promise<string> {
     // ...
-    const k = (options as any).k ?? 4;
+    const k = options.k ?? 4;

Comment on lines +29 to +31
"dependencies": {
"openai": "^4.20.0"
},
⚠️ Potential issue | 🟠 Major


Update OpenAI SDK to a supported version.

The package depends on openai@^4.20.0, which is two major versions behind the latest release (6.16.0). While the currently resolved version has no known security vulnerabilities, the caret constraint pins the dependency to the 4.x line and blocks newer major versions. Consider upgrading to a recent 6.x release to pick up API improvements and continued maintenance.


Comment on lines +21 to +34
async process(text: string, metadata?: Record<string, any>): Promise<ProcessedDocument[]> {
const chunks = await this.splitter.splitText(text);
const embeddings = await this.embedder.embedDocuments(chunks);

return chunks.map((chunk, index) => ({
text: chunk,
embedding: embeddings[index],
metadata: {
...metadata,
chunkIndex: index,
chunkCount: chunks.length,
},
}));
}
⚠️ Potential issue | 🟡 Minor

Potential undefined embedding if array lengths mismatch.

If embedder.embedDocuments(chunks) returns fewer embeddings than chunks (due to an API error or implementation bug), embeddings[index] will be undefined, leading to corrupted ProcessedDocument objects.

🛡️ Proposed defensive check
 async process(text: string, metadata?: Record<string, any>): Promise<ProcessedDocument[]> {
   const chunks = await this.splitter.splitText(text);
   const embeddings = await this.embedder.embedDocuments(chunks);

+  if (embeddings.length !== chunks.length) {
+    throw new Error(
+      `Embedding count mismatch: expected ${chunks.length}, got ${embeddings.length}`
+    );
+  }
+
   return chunks.map((chunk, index) => ({
     text: chunk,
     embedding: embeddings[index],

this.client = new OpenAI({
apiKey: params?.apiKey ?? process.env.OPENAI_API_KEY,
});
this.model = params?.model ?? "text-embedding-ada-002";
⚠️ Potential issue | 🟠 Major


Update the default embedding model to text-embedding-3-large.

The default model text-embedding-ada-002 is outdated. OpenAI released the text-embedding-3 family on January 25, 2024, with text-embedding-3-large being the highest-performing embedding model. Consider updating the default to "text-embedding-3-large" or "text-embedding-3-small" if cost efficiency is a priority. Since the model is configurable via params, applications can override this, but the default should reflect current best practices.
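
The params-with-default pattern under discussion can be sketched as below; the class and parameter names are hypothetical, not the package's actual API. Note also that changing the default changes output dimensionality (text-embedding-ada-002 and text-embedding-3-small produce 1536-dimensional vectors, text-embedding-3-large produces 3072), so any existing vector store must be re-indexed when the model changes.

```typescript
// Illustrative sketch of the constructor-default pattern; names are
// hypothetical stand-ins for the package's actual types.
interface EmbeddingParams {
  apiKey?: string;
  model?: string;
}

class EmbeddingConfig {
  readonly model: string;

  constructor(params?: EmbeddingParams) {
    // An explicit override wins; otherwise fall back to the recommended default.
    this.model = params?.model ?? "text-embedding-3-small";
  }
}
```

With this shape, updating the fallback string is a one-line change and callers that pass `model` explicitly are unaffected.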


Comment on lines +23 to +29
async embedQuery(text: string): Promise<number[]> {
const response = await this.client.embeddings.create({
model: this.model,
input: text.replace(/\n/g, " "),
});
return response.data[0].embedding;
}
⚠️ Potential issue | 🟡 Minor

Missing safety check for empty API response.

If the OpenAI API returns an empty data array (e.g., due to an API issue or edge case), accessing response.data[0].embedding will throw a runtime error.

🛡️ Proposed defensive check
 async embedQuery(text: string): Promise<number[]> {
   const response = await this.client.embeddings.create({
     model: this.model,
     input: text.replace(/\n/g, " "),
   });
+  if (!response.data.length) {
+    throw new Error("OpenAI returned no embeddings");
+  }
   return response.data[0].embedding;
 }

Development

Successfully merging this pull request may close these issues.

Document Chunking & Embedding Utilities
