
Conversation

@UmeshpJadhav
Contributor

@UmeshpJadhav UmeshpJadhav commented Dec 26, 2025

feat: implement document chunking and embedding utilities and integrate into core DocumentRetriever

PR Checklist

Please check if your PR fulfills the following requirements:

Bugs / Features

What is the current behavior?

There are no built-in utilities for document chunking, embedding, or ingestion in the core framework, making it difficult to implement RAG workflows.

What is the new behavior?

This PR introduces a new @voltagent/documents package and integrates it with @voltagent/core.

  • New Package: packages/documents
  • RecursiveCharacterTextSplitter: Smart text chunking with overlap.
  • OpenAIEmbeddingModel: Wrapper for OpenAI embeddings.
  • DocumentProcessor: Utility to split and embed text.
  • Core Integration:
    - DocumentRetriever: New abstract class in core that adds ingest() capabilities.

fixes #6
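The recursive-splitting idea behind RecursiveCharacterTextSplitter can be illustrated with a simplified, self-contained sketch (this is not the package's actual implementation; overlap handling is omitted for brevity, and the default separator list is an assumption based on common conventions):

```typescript
// Simplified illustration of recursive character splitting.
// Separators are tried in order; when a piece is still longer than
// chunkSize, the next (finer) separator is applied to it, falling
// back to hard character cuts at the end.
function recursiveSplit(
  text: string,
  chunkSize: number,
  separators: string[] = ["\n\n", "\n", " ", ""],
): string[] {
  if (text.length === 0) return [];
  if (text.length <= chunkSize) return [text];
  const [sep, ...rest] = separators;
  if (sep === undefined || sep === "") {
    // Character-level fallback: hard-cut into chunkSize slices.
    const chunks: string[] = [];
    for (let i = 0; i < text.length; i += chunkSize) {
      chunks.push(text.slice(i, i + chunkSize));
    }
    return chunks;
  }
  const chunks: string[] = [];
  for (const piece of text.split(sep)) {
    if (piece.length === 0) continue;
    if (piece.length <= chunkSize) {
      chunks.push(piece);
    } else {
      chunks.push(...recursiveSplit(piece, chunkSize, rest));
    }
  }
  return chunks;
}
```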

Notes for reviewers

  • The core build may have pre-existing environment issues, but the new test-integration script verifies that the types and exports work correctly.
  • A new dependency on openai was added to packages/documents.

Summary by cubic

Adds document chunking and embeddings via a new @voltagent/documents package, and integrates ingestion into core DocumentRetriever to enable RAG workflows. Fixes #6.

  • New Features

    • New @voltagent/documents package with RecursiveCharacterTextSplitter, OpenAIEmbeddingModel, and DocumentProcessor.
    • Core retriever adds ingest(), default retrieve that embeds queries, empty-array safety, and abstract hooks: upsertDocuments() and queryVectors(). Also exports ProcessedDocument type.
    • README and tests for documents utilities.
  • Migration

    • Implement upsertDocuments() and queryVectors() in your retriever to store and search vectors.
    • Set OPENAI_API_KEY for embeddings; optionally configure model and chunk sizes (chunkOverlap must be less than chunkSize).
    • Ingest with processor.process(text, metadata) or retriever.ingest(text, metadata).
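The two abstract hooks in the migration steps above could be backed by something as simple as an in-memory vector store. The sketch below is hypothetical (the `ProcessedDocument` shape and method signatures are assumptions based on this PR description, and `InMemoryVectorStore` is not part of the package):

```typescript
// Hypothetical in-memory implementation of upsertDocuments/queryVectors,
// using cosine similarity for ranking. Illustrative only.
interface ProcessedDocument {
  content: string;
  embedding: number[];
  metadata?: Record<string, unknown>;
}

function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

class InMemoryVectorStore {
  private docs: ProcessedDocument[] = [];

  async upsertDocuments(docs: ProcessedDocument[]): Promise<void> {
    this.docs.push(...docs);
  }

  // Return the k stored documents most similar to the query vector.
  async queryVectors(query: number[], k = 4): Promise<ProcessedDocument[]> {
    return [...this.docs]
      .sort((x, y) => cosine(y.embedding, query) - cosine(x.embedding, query))
      .slice(0, k);
  }
}
```

A real retriever would implement these hooks against an actual vector database; the in-memory version is useful mainly for tests.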

Written for commit 64e2cff. Summary will update on new commits.

Summary by CodeRabbit

  • New Features

    • New document processing package with automatic text chunking and segmentation
    • Built-in support for document embeddings powered by AI models
    • Integrated document retrieval and semantic search in core framework
  • Documentation

    • Added comprehensive RAG (Retrieval-Augmented Generation) documentation with implementation examples


@changeset-bot

changeset-bot bot commented Dec 26, 2025

🦋 Changeset detected

Latest commit: 64e2cff

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 2 packages
Name Type
@voltagent/documents Minor
@voltagent/core Minor


Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment


5 issues found across 17 files

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="packages/documents/package.json">

<violation number="1" location="packages/documents/package.json:35">
P1: Version mismatch: vitest `^1.0.0` conflicts with project-level `^3.2.4`. In a monorepo with syncpack, dependency versions should be consistent across packages to avoid compatibility issues.</violation>

<violation number="2" location="packages/documents/package.json:36">
P2: Version mismatch: @types/node `^20.0.0` conflicts with project-level `^24.2.1`. Consider aligning with the monorepo's standard version for consistent type definitions.</violation>
</file>

<file name="packages/documents/src/DocumentProcessor.ts">

<violation number="1" location="packages/documents/src/DocumentProcessor.ts:23">
P2: Missing validation that `embeddings.length` matches `chunks.length`. If the embedding model returns fewer embeddings than expected, `embeddings[index]` could be `undefined`, causing silent data corruption in the returned `ProcessedDocument[]`. Consider adding a length validation check after fetching embeddings.</violation>
</file>
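The length-validation suggestion above can be sketched as a standalone helper (names mirror the PR's DocumentProcessor, but this is illustrative, not the package's code):

```typescript
// Sketch of the suggested guard: verify the embedder returned exactly
// one embedding per chunk before zipping them together, so a short
// response fails loudly instead of silently producing undefined entries.
interface Embedded {
  content: string;
  embedding: number[];
}

async function embedChunks(
  chunks: string[],
  embedDocuments: (docs: string[]) => Promise<number[][]>,
): Promise<Embedded[]> {
  const embeddings = await embedDocuments(chunks);
  if (!Array.isArray(embeddings) || embeddings.length !== chunks.length) {
    throw new Error(
      `Embedding count mismatch: got ${embeddings?.length ?? 0}, expected ${chunks.length}`,
    );
  }
  return chunks.map((content, i) => ({ content, embedding: embeddings[i] }));
}
```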

<file name="packages/documents/src/text-splitters/TextSplitter.ts">

<violation number="1" location="packages/documents/src/text-splitters/TextSplitter.ts:14">
P2: Missing validation for positive values. The validation checks that `chunkOverlap < chunkSize`, but doesn't ensure `chunkSize > 0` and `chunkOverlap >= 0`. This allows invalid configurations like zero or negative values to pass.</violation>
</file>
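The fuller validation the comment asks for might look like this (a hedged sketch; `SafeSplitterParams` is a hypothetical stand-in for the splitter's constructor logic):

```typescript
// Sketch of complete parameter validation for a text splitter:
// reject non-positive chunkSize and negative chunkOverlap, not just
// chunkOverlap >= chunkSize.
class SafeSplitterParams {
  readonly chunkSize: number;
  readonly chunkOverlap: number;

  constructor(chunkSize = 1000, chunkOverlap = 200) {
    if (!Number.isInteger(chunkSize) || chunkSize <= 0) {
      throw new Error("chunkSize must be a positive integer");
    }
    if (!Number.isInteger(chunkOverlap) || chunkOverlap < 0) {
      throw new Error("chunkOverlap must be a non-negative integer");
    }
    if (chunkOverlap >= chunkSize) {
      throw new Error("chunkOverlap must be less than chunkSize");
    }
    this.chunkSize = chunkSize;
    this.chunkOverlap = chunkOverlap;
  }
}
```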

<file name="packages/core/src/retriever/document-retriever.ts">

<violation number="1" location="packages/core/src/retriever/document-retriever.ts:56">
P1: Accessing `input[input.length - 1]` will throw a TypeError if `input` is an empty array. Add a guard to handle this edge case.</violation>
</file>
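The empty-array guard flagged above can be sketched as a small extraction helper (the `Message` shape here is a simplified stand-in for the core `BaseMessage` type):

```typescript
// Sketch of the suggested guard: extract a text query from either a
// plain string or a (possibly empty) message array without throwing
// on input[input.length - 1] for empty arrays.
interface Message {
  content: unknown;
}

function lastTextQuery(input: string | Message[]): string | undefined {
  if (typeof input === "string") return input;
  if (input.length === 0) return undefined; // guard: no messages to query
  const content = input[input.length - 1].content;
  return typeof content === "string" ? content : undefined;
}
```

A caller would treat `undefined` as "nothing to embed" and skip the vector query.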

Since this is your first cubic review, here's how it works:

  • cubic automatically reviews your code and comments on bugs and improvements
  • Teach cubic by replying to its comments. cubic learns from your replies and gets better over time
  • Ask questions if you need clarification on any suggestion

Reply to cubic to teach it or ask questions. Tag @cubic-dev-ai to re-run a review.

Member

@omeraplak omeraplak left a comment


Hey, thanks, the PR looks great!
I actually started working on the packages/rag package a few weeks ago. Do you think we should include this package as well?

Also, would you like to add it to the docs here?
https://voltagent.dev/docs/rag/overview/

Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment


2 issues found across 1 file (changes from recent commits).

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="website/docs/rag/overview.md">

<violation number="1" location="website/docs/rag/overview.md:257">
P1: Incorrect API usage in documentation example. `RecursiveChunker` constructor accepts an optional `Tokenizer`, not options. Options like `maxTokens` and `overlapTokens` should be passed to the `chunk()` method. Also, `chunk()` is synchronous, not async.</violation>

<violation number="2" location="website/docs/rag/overview.md:264">
P1: Incorrect API usage in documentation example. `MarkdownChunker` constructor accepts an optional `Tokenizer`, not options. Options like `maxTokens` should be passed to the `chunk()` method. Also, `chunk()` is synchronous, not async.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

UmeshpJadhav and others added 2 commits December 30, 2025 18:09
Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
@UmeshpJadhav
Contributor Author

@omeraplak
Thanks! I've updated the documentation in docs/rag/overview to include a new section on "Advanced Chunking" that highlights the @voltagent/rag package and its capabilities.

I've also resolved the recent merge conflicts and fixed the linting issues. The PR should be good to go now!

@UmeshpJadhav UmeshpJadhav reopened this Jan 13, 2026
@coderabbitai
Contributor

coderabbitai bot commented Jan 13, 2026

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Walkthrough

This pull request introduces a new @voltagent/documents package containing text chunking and embedding utilities, including RecursiveCharacterTextSplitter for document splitting and OpenAIEmbeddingModel for vector embeddings, orchestrated via DocumentProcessor. It integrates these utilities into core via a new DocumentRetriever class supporting document ingestion and vector-based retrieval workflows.

Changes

Cohort / File(s) Summary
Documents Package — Text Splitting
packages/documents/src/text-splitters/TextSplitter.ts, packages/documents/src/text-splitters/RecursiveCharacterTextSplitter.ts, packages/documents/src/text-splitters/RecursiveCharacterTextSplitter.test.ts
Introduces abstract TextSplitter base class with validation (chunkSize, chunkOverlap) and concrete RecursiveCharacterTextSplitter with multi-separator fallback recursion. Tests verify character splitting, separator-based chunking, and recursive behavior.
Documents Package — Embedding Models
packages/documents/src/embeddings/EmbeddingModel.ts, packages/documents/src/embeddings/OpenAIEmbeddingModel.ts
Defines EmbeddingModel interface (embedQuery, embedDocuments). Implements OpenAIEmbeddingModel with batch processing (default 512 max batch size), per-batch API calls, and result re-ordering to preserve document sequence.
Documents Package — Document Processing
packages/documents/src/DocumentProcessor.ts, packages/documents/src/DocumentProcessor.test.ts
Adds ProcessedDocument interface and DocumentProcessor class orchestrating text splitting and embedding. Augments chunk metadata with chunkIndex and chunkCount. Test validates splits, embeddings, and metadata attachment.
Documents Package — Configuration & Exports
packages/documents/package.json, packages/documents/tsconfig.json, packages/documents/tsup.config.ts, packages/documents/src/index.ts, packages/documents/README.md
Establishes new package with ESM/CJS exports, TypeScript/tsup build configs. README includes installation and usage examples for text splitting, embeddings, and document processing workflows.
Core Integration
packages/core/package.json, packages/core/src/retriever/document-retriever.ts, packages/core/src/retriever/index.ts
Adds @voltagent/documents dependency to core. Introduces DocumentRetriever abstract class extending BaseRetriever, providing ingest(text, metadata) to process and persist documents, and retrieve(input, options) to embed queries and search vectors (default top-k=4).
Lint & Build Configuration
biome.json, packages/core/src/agent/subagent/index.ts, packages/server-core/src/websocket/setup.ts
Updates Biome ignore list with archive and website directories. Adds biome-ignore directives to handoffTask and setupWebSocketUpgrade to suppress complexity warnings on legacy code.
Documentation
website/docs/rag/overview.md
Adds "Advanced Chunking" section documenting @voltagent/rag package with installation, available chunkers, and example usage (content appears duplicated in patch).
Changeset
.changeset/document-utils-integration.md
Documents feature: bumps @voltagent/documents and @voltagent/core to minor versions with note "feat: implement document chunking and embedding utilities and integrate into core."

Sequence Diagram(s)

sequenceDiagram
    participant App as Application
    participant DR as DocumentRetriever
    participant DP as DocumentProcessor
    participant TS as TextSplitter
    participant EM as EmbeddingModel
    participant VS as Vector Store

    App->>DR: ingest(text, metadata)
    DR->>DP: process(text, metadata)
    DP->>TS: splitText(text)
    TS-->>DP: chunks[]
    DP->>EM: embedDocuments(chunks)
    EM-->>DP: embeddings[][]
    DP-->>DR: ProcessedDocument[]
    DR->>VS: upsertDocuments(ProcessedDocument[])
    VS-->>DR: stored

    App->>DR: retrieve(query)
    DR->>EM: embedQuery(query)
    EM-->>DR: embedding[]
    DR->>VS: queryVectors(embedding, k=4)
    VS-->>DR: ProcessedDocument[]
    DR-->>App: joined result text

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • PR #936: Refactors subagent handoff and streaming logic in the same codebase where biome-ignore was added to suppress complexity warnings, indicating potential overlapping concerns around cognitive complexity management.

Poem

🐰 Chunks and vectors, oh what joy!
With embeddings from OpenAI,
Documents split with care so fine,
Vector searches by design.
The RAG dream comes to life,
No more retrieval strife!

🚥 Pre-merge checks | ✅ 3 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
Check name Status Explanation Resolution
Out of Scope Changes check ⚠️ Warning The PR contains out-of-scope documentation additions: website/docs/rag/overview.md includes a duplicated "Advanced Chunking" section describing @voltagent/rag package, which is unrelated to the stated objectives of implementing document utilities. Remove the duplicated content from website/docs/rag/overview.md or clarify if @voltagent/rag package changes are part of this PR's scope.
Title check ❓ Inconclusive The title "Feat/document utils" is vague and generic, using non-descriptive terminology that doesn't clearly convey the specific changes made. Use a more descriptive title such as "Feat: add document chunking and embedding utilities package" to better communicate the primary change.
✅ Passed checks (3 passed)
Check name Status Explanation
Linked Issues check ✅ Passed The PR successfully implements all MVP acceptance criteria from issue #6: RecursiveCharacterTextSplitter for chunking, OpenAIEmbeddingModel for embeddings, DocumentProcessor orchestration, and integration into core DocumentRetriever with proper documentation and tests.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Description check ✅ Passed The PR description comprehensively covers all required template sections with detailed information about features, current/new behavior, and includes checklist items.


✨ Finishing touches
  • 📝 Generate docstrings

Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🤖 Fix all issues with AI agents
In @packages/core/src/retriever/document-retriever.ts:
- Around line 59-63: The code assumes the selected BaseMessage content is a
string and casts it before calling processor.embedder.embedQuery; instead
validate the content type from input (and the last BaseMessage) before
embedding: ensure textQuery is a string, handle undefined or non-string content
by extracting a text field if present (e.g., content.text), falling back to a
safe serialization like JSON.stringify(content) or returning/logging an error
and not calling embedQuery; replace the direct cast on textQuery and guard the
call to this.processor.embedder.embedQuery accordingly to avoid runtime crashes.
- Line 66: The code uses a cast to any to read options.k; instead extend the
RetrieveOptions type (or create a new VectorRetrieveOptions that extends
RetrieveOptions) to include an optional k?: number, update the method signature
to accept that typed options, and replace the cast line with a typed access
(e.g., destructure or read options.k with a default of 4) so type safety is
preserved for the k parameter in DocumentRetriever/document-retrieval logic.

In @packages/documents/package.json:
- Around line 29-31: The package.json currently pins the "openai" dependency to
"^4.20.0"; update that dependency to a supported 6.x release (e.g., "^6.16.0")
in the "dependencies" entry for "openai", then run your package manager
(npm/yarn/pnpm) to install and update lockfiles; after upgrading, run tests and
fix any breaking API changes in code that uses the OpenAI SDK (search for
imports/usages of "openai" and update client construction and method names per
the 6.x migration guide).

In @packages/documents/src/DocumentProcessor.ts:
- Around line 21-34: In process, guard against embedder.embedDocuments returning
fewer items than chunks: after const embeddings = await
this.embedder.embedDocuments(chunks); check that embeddings is an array and
embeddings.length === chunks.length (or at least >= chunks.length); if not,
either throw a clear error or fill missing entries with a safe fallback (e.g.,
null vector or empty embedding) and log the mismatch via the class logger;
ensure the returned ProcessedDocument objects use validated/fallback embeddings
so embeddings[index] cannot be undefined.

In @packages/documents/src/embeddings/OpenAIEmbeddingModel.ts:
- Around line 23-29: The embedQuery method lacks a safety check for an empty API
response; after calling this.client.embeddings.create in embedQuery, verify that
response && Array.isArray(response.data) && response.data.length > 0 and that
response.data[0].embedding exists before returning it; if the check fails,
either throw a descriptive error (e.g., "Empty embeddings response from OpenAI")
or return a sensible default (empty array) and log the incident so callers don't
get a runtime exception when accessing response.data[0].embedding.
- Line 19: The default embedding model in OpenAIEmbeddingModel is outdated;
update the default assignment for this.model (in the OpenAIEmbeddingModel
constructor/initialization) from "text-embedding-ada-002" to
"text-embedding-3-large" (or "text-embedding-3-small" if you prefer cost
efficiency) so new instances use the current best-practice embedding by default
while preserving params override behavior.
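The typed-options fix suggested for document-retriever.ts above can be sketched as follows (`RetrieveOptions` here is a minimal stand-in for the core type, not its real definition):

```typescript
// Sketch of replacing the `as any` cast with a typed options object:
// extend the retrieve options with an optional k and read it with a
// default, preserving type safety.
interface RetrieveOptions {
  signal?: AbortSignal;
}

interface VectorRetrieveOptions extends RetrieveOptions {
  k?: number;
}

function resolveTopK(options?: VectorRetrieveOptions): number {
  return options?.k ?? 4; // default top-k per the PR description
}
```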
🧹 Nitpick comments (8)
packages/documents/tsconfig.json (1)

4-4: Consider removing DOM libs for a Node.js-only package.

This package handles document processing and embeddings, which are server-side operations. The dom and dom.iterable libs are typically unnecessary and could be removed to keep the type environment focused on Node.js.

Suggested change
-    "lib": ["dom", "dom.iterable", "esnext"],
+    "lib": ["esnext"],
packages/documents/package.json (1)

1-37: Add missing package metadata for consistency with other packages.

The package is missing license, repository, keywords, and author fields that are present in other packages like @voltagent/core. Consider adding these for consistency and proper npm publishing.

Suggested additions
 {
     "name": "@voltagent/documents",
     "version": "0.0.1",
     "description": "Document processing and embedding utilities for VoltAgent",
+    "license": "MIT",
+    "repository": {
+        "type": "git",
+        "url": "https://github.com/VoltAgent/voltagent.git",
+        "directory": "packages/documents"
+    },
     "main": "dist/index.js",
packages/documents/src/text-splitters/TextSplitter.ts (1)

6-25: Consider making properties readonly.

The chunkSize and chunkOverlap properties are set once in the constructor and validated. Making them readonly would prevent accidental mutation and better express the design intent.

♻️ Suggested change
 export abstract class TextSplitter {
-  chunkSize: number;
-  chunkOverlap: number;
+  readonly chunkSize: number;
+  readonly chunkOverlap: number;
packages/documents/src/text-splitters/RecursiveCharacterTextSplitter.ts (1)

19-90: Consider handling empty input text.

The recursive splitting logic is well-implemented. However, when text is an empty string, "".split(separator) returns [""], causing splitText("") to return [""] rather than an empty array. This may be unexpected for callers expecting no chunks from empty input.

♻️ Optional early return for empty input
   private _splitText(text: string, separators: string[]): string[] {
+    if (text.length === 0) {
+      return [];
+    }
+
     const finalChunks: string[] = [];
packages/documents/src/RecursiveCharacterTextSplitter.test.ts (1)

4-75: Good test coverage for core functionality; consider adding edge case tests.

The three tests effectively cover character splitting, separator-based splitting, and recursive behavior. The inline comments documenting the expected logic are helpful for understanding.

Consider adding tests for:

  • Empty string input
  • Text smaller than chunkSize
  • Non-zero chunkOverlap behavior
  • Validation errors (e.g., chunkSize <= 0)
packages/documents/src/embeddings/OpenAIEmbeddingModel.ts (1)

31-48: Consider handling empty input array.

If documents is an empty array, this returns [] silently, which is acceptable. However, consider adding an early return for clarity and to avoid unnecessary iterations.

Also, there's no validation for maxBatchSize being a positive integer—a zero or negative value would cause an infinite loop or unexpected behavior.

♻️ Proposed improvements
+  constructor(params?: OpenAIEmbeddingModelParams) {
+    this.client = new OpenAI({
+      apiKey: params?.apiKey ?? process.env.OPENAI_API_KEY,
+    });
+    this.model = params?.model ?? "text-embedding-ada-002";
+    const batchSize = params?.maxBatchSize ?? 512;
+    if (batchSize <= 0) {
+      throw new Error("maxBatchSize must be a positive integer");
+    }
+    this.maxBatchSize = batchSize;
+  }

 async embedDocuments(documents: string[]): Promise<number[][]> {
+  if (documents.length === 0) {
+    return [];
+  }
   const embeddings: number[][] = [];
   // ... rest of implementation
 }
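The batching pattern this nitpick discusses can be isolated into a small, testable helper (a sketch, not the package's code; the validation mirrors the suggestion above):

```typescript
// Sketch of batch splitting for embedding requests: validate
// maxBatchSize, return early on empty input, then slice the documents
// into batches of at most maxBatchSize while preserving order.
function toBatches<T>(items: T[], maxBatchSize: number): T[][] {
  if (!Number.isInteger(maxBatchSize) || maxBatchSize <= 0) {
    throw new Error("maxBatchSize must be a positive integer");
  }
  if (items.length === 0) return []; // early return for empty input
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += maxBatchSize) {
    batches.push(items.slice(i, i + maxBatchSize));
  }
  return batches;
}
```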
packages/documents/src/DocumentProcessor.ts (1)

12-14: Consider marking fields as readonly.

Since splitter and embedder are set in the constructor and likely shouldn't be reassigned afterward, consider making them readonly for clarity and immutability.

♻️ Suggested change
 export class DocumentProcessor {
-  splitter: TextSplitter;
-  embedder: EmbeddingModel;
+  readonly splitter: TextSplitter;
+  readonly embedder: EmbeddingModel;
packages/documents/src/DocumentProcessor.test.ts (1)

21-43: LGTM! Consider additional edge-case tests.

The test validates the core processing flow correctly. For more comprehensive coverage, consider adding tests for:

  • Empty string input
  • Single chunk (no delimiter)
  • No metadata provided (undefined)
  • Empty array from splitter
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 969db3a and 64e2cff.

⛔ Files ignored due to path filters (1)
  • pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml
📒 Files selected for processing (20)
  • .changeset/document-utils-integration.md
  • biome.json
  • packages/core/package.json
  • packages/core/src/agent/subagent/index.ts
  • packages/core/src/retriever/document-retriever.ts
  • packages/core/src/retriever/index.ts
  • packages/documents/README.md
  • packages/documents/package.json
  • packages/documents/src/DocumentProcessor.test.ts
  • packages/documents/src/DocumentProcessor.ts
  • packages/documents/src/RecursiveCharacterTextSplitter.test.ts
  • packages/documents/src/embeddings/EmbeddingModel.ts
  • packages/documents/src/embeddings/OpenAIEmbeddingModel.ts
  • packages/documents/src/index.ts
  • packages/documents/src/text-splitters/RecursiveCharacterTextSplitter.ts
  • packages/documents/src/text-splitters/TextSplitter.ts
  • packages/documents/tsconfig.json
  • packages/documents/tsup.config.ts
  • packages/server-core/src/websocket/setup.ts
  • website/docs/rag/overview.md
🧰 Additional context used
📓 Path-based instructions (2)
**/*.ts

📄 CodeRabbit inference engine (AGENTS.md)

**/*.ts: Maintain type safety in TypeScript-first codebase
Never use JSON.stringify; use the safeStringify function instead, imported from @voltagent/internal

Files:

  • packages/documents/src/index.ts
  • packages/documents/src/embeddings/EmbeddingModel.ts
  • packages/core/src/agent/subagent/index.ts
  • packages/documents/src/text-splitters/TextSplitter.ts
  • packages/documents/src/RecursiveCharacterTextSplitter.test.ts
  • packages/server-core/src/websocket/setup.ts
  • packages/documents/src/text-splitters/RecursiveCharacterTextSplitter.ts
  • packages/documents/tsup.config.ts
  • packages/core/src/retriever/index.ts
  • packages/documents/src/DocumentProcessor.test.ts
  • packages/documents/src/embeddings/OpenAIEmbeddingModel.ts
  • packages/core/src/retriever/document-retriever.ts
  • packages/documents/src/DocumentProcessor.ts
**/*.test.ts

📄 CodeRabbit inference engine (AGENTS.md)

Test your changes - ensure all tests pass before committing

Files:

  • packages/documents/src/RecursiveCharacterTextSplitter.test.ts
  • packages/documents/src/DocumentProcessor.test.ts
🧠 Learnings (3)
📚 Learning: 2026-01-07T05:09:23.216Z
Learnt from: CR
Repo: VoltAgent/voltagent PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-07T05:09:23.216Z
Learning: Applies to **/*.ts : Maintain type safety in TypeScript-first codebase

Applied to files:

  • packages/core/package.json
  • packages/documents/tsconfig.json
  • packages/documents/tsup.config.ts
📚 Learning: 2026-01-07T05:09:23.217Z
Learnt from: CR
Repo: VoltAgent/voltagent PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-07T05:09:23.217Z
Learning: Follow the monorepo structure - changes may impact multiple packages

Applied to files:

  • packages/documents/package.json
📚 Learning: 2026-01-07T05:09:23.217Z
Learnt from: CR
Repo: VoltAgent/voltagent PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-07T05:09:23.217Z
Learning: Applies to **/*.test.ts : Test your changes - ensure all tests pass before committing

Applied to files:

  • packages/documents/tsconfig.json
  • packages/documents/src/RecursiveCharacterTextSplitter.test.ts
🧬 Code graph analysis (4)
packages/documents/src/RecursiveCharacterTextSplitter.test.ts (1)
packages/documents/src/text-splitters/RecursiveCharacterTextSplitter.ts (1)
  • RecursiveCharacterTextSplitter (7-91)
packages/documents/src/text-splitters/RecursiveCharacterTextSplitter.ts (1)
packages/documents/src/text-splitters/TextSplitter.ts (1)
  • TextSplitterParams (1-4)
packages/documents/src/DocumentProcessor.test.ts (2)
packages/documents/src/embeddings/EmbeddingModel.ts (1)
  • EmbeddingModel (1-4)
packages/documents/src/DocumentProcessor.ts (1)
  • DocumentProcessor (12-35)
packages/documents/src/embeddings/OpenAIEmbeddingModel.ts (2)
packages/documents/src/embeddings/EmbeddingModel.ts (1)
  • EmbeddingModel (1-4)
packages/documents/src/DocumentProcessor.ts (1)
  • process (21-34)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
  • GitHub Check: Test internal
  • GitHub Check: Test logger
  • GitHub Check: Test create-voltagent-app
  • GitHub Check: Test postgres
  • GitHub Check: Test docs-mcp
  • GitHub Check: Test server-core
  • GitHub Check: Test core
  • GitHub Check: Test logger
  • GitHub Check: Test libsql
  • GitHub Check: Test cli
  • GitHub Check: Test voice
  • GitHub Check: Build (Node 22)
  • GitHub Check: Test supabase
  • GitHub Check: Build (Node 24)
  • GitHub Check: Build (Node 20)
🔇 Additional comments (19)
website/docs/rag/overview.md (1)

242-280: The documented APIs are accurate for @voltagent/rag, not @voltagent/documents.

The documentation section (lines 242-280) correctly describes the existing @voltagent/rag package with accurate class names (RecursiveChunker, MarkdownChunker, etc.), method signatures (.chunk()), and parameters (maxTokens, overlapTokens). All referenced chunkers exist in the RAG package.

However, this creates a critical mismatch: the PR adds @voltagent/documents which provides RecursiveCharacterTextSplitter (with .splitText() method and chunkSize/chunkOverlap parameters) along with OpenAIEmbeddingModel and DocumentProcessor. The documentation section does not document any of these.

Clarify the documentation intent:

  • If this section is new and meant to document the @voltagent/documents package being added, it needs to describe that package's API, not RAG's.
  • If this section pre-existed and documents RAG (separate from this PR), clarify whether @voltagent/documents features need their own documentation section.
packages/server-core/src/websocket/setup.ts (1)

232-233: Lint suppression is acceptable for existing complex handler.

The biome-ignore directive appropriately acknowledges the cognitive complexity of this WebSocket upgrade handler without changing its behavior. Consider adding a TODO or tracking issue if there are plans to refactor this handler in the future.

biome.json (1)

64-66: Verify the website directory exclusion is intentional.

Adding archive to the ignore list is reasonable. However, excluding the entire website directory is broad—ensure this is intentional and that the website has its own linting/formatting configuration, or that it doesn't contain TypeScript/JavaScript that should be linted.

packages/documents/src/embeddings/EmbeddingModel.ts (1)

1-4: Clean interface design following embedding model conventions.

The EmbeddingModel interface is well-structured with appropriate async return types. This abstraction will allow easy swapping of embedding providers beyond OpenAI.

packages/core/package.json (1)

19-19: Workspace dependency correctly links the new documents package.

Using workspace:* is the appropriate pattern for internal monorepo dependencies, ensuring core always uses the local documents package version during development and publishing.

.changeset/document-utils-integration.md (1)

1-6: Changeset correctly captures the feature addition.

The minor version bump for both packages is appropriate—@voltagent/documents is a new package with new functionality, and @voltagent/core gains new capabilities through the integration.

packages/documents/tsconfig.json (1)

10-11: No action needed. The rootDir configuration is correct for this setup.

The tsconfig.json's rootDir and declaration settings are used only for type-checking (via tsc --noEmit). Declaration generation during the build is handled independently by tsup with dts: true, which generates declarations directly from the entry point (src/index.ts) and outputs them to dist/index.d.ts and dist/index.d.mts, as correctly specified in package.json's types exports. The configuration is already correct.

Likely an incorrect or invalid review comment.

packages/core/src/agent/subagent/index.ts (1)

319-320: LGTM!

The lint suppression directive is appropriate for this existing complex method. The handoffTask method handles multiple sub-agent configuration types and orchestrates complex handoff logic, justifying the complexity allowance.

packages/documents/tsup.config.ts (1)

1-19: LGTM!

The tsup configuration is well-structured for the new documents package. Dual CJS/ESM output, TypeScript declarations, and the shared external plugin are appropriately configured.

packages/documents/src/text-splitters/TextSplitter.ts (2)

1-4: LGTM!

Clean interface definition with appropriate optional parameters.


29-36: LGTM!

The createDocuments method correctly accumulates chunks from multiple input texts. Sequential processing is appropriate here to maintain predictable ordering.

packages/documents/src/text-splitters/RecursiveCharacterTextSplitter.ts (2)

1-13: LGTM!

Clean extension of TextSplitter with sensible default separators that progress from paragraph to character-level splitting.


15-17: LGTM!

Clean delegation pattern for the public async interface.

packages/documents/src/RecursiveCharacterTextSplitter.test.ts (1)

1-3: LGTM!

Correct imports from vitest and the implementation module.

packages/documents/README.md (1)

1-71: LGTM!

The README provides clear documentation with helpful code examples for each component. The examples demonstrate key configuration options and typical usage patterns.

packages/documents/src/index.ts (1)

1-5: LGTM!

The barrel exports cleanly expose the public API surface: text splitters, embedding models, and the document processor with its types.

packages/core/src/retriever/index.ts (1)

6-9: LGTM!

The new exports are properly structured. Re-exporting ProcessedDocument from @voltagent/documents and exposing the new DocumentRetriever class with its options type maintains a clean public API surface for the module.

packages/core/src/retriever/document-retriever.ts (2)

20-35: LGTM!

The constructor pattern with optional processor and fallback to default is appropriate. The ingest method cleanly delegates to the processor and abstract storage method with proper debug logging.


41-49: LGTM!

The abstract method contracts are well-defined, providing clear expectations for implementations to handle vector storage and retrieval.
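
A minimal in-memory implementation of the two hooks might look like the sketch below. The store and the cosine-similarity ranking are illustrative stand-ins, not VoltAgent APIs; only the method names mirror the contract described above.

```typescript
// Hypothetical in-memory backing store for the two abstract hooks.
interface StoredDoc {
  text: string;
  embedding: number[];
}

// Plain cosine similarity; the || 1 guards against zero-norm vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

class InMemoryRetriever {
  private store: StoredDoc[] = [];

  // Corresponds to the abstract upsertDocuments() hook.
  async upsertDocuments(docs: StoredDoc[]): Promise<void> {
    this.store.push(...docs);
  }

  // Corresponds to the abstract queryVectors() hook: top-k by cosine similarity.
  async queryVectors(queryVector: number[], k: number): Promise<string[]> {
    return [...this.store]
      .sort((x, y) => cosine(queryVector, y.embedding) - cosine(queryVector, x.embedding))
      .slice(0, k)
      .map((d) => d.text);
  }
}
```

A real implementation would delegate both hooks to a vector database client, but this shape is enough to exercise `ingest()` and the default `retrieve()` flow in tests.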

Comment on lines +59 to +63
const textQuery = typeof input === "string" ? input : input[input.length - 1].content;

// We assume the processor's embedder has an embedQuery method.
// Since DocumentProcessor exposes 'embedder', we can use it.
const queryVector = await this.processor.embedder.embedQuery(textQuery as string);
⚠️ Potential issue | 🟡 Minor

Validate content type before embedding.

The content property of BaseMessage may not always be a string (e.g., structured content in multi-modal messages). The cast on line 63 could mask runtime issues if content is an object or undefined.

Suggested defensive handling
-    const textQuery = typeof input === "string" ? input : input[input.length - 1].content;
-
-    // We assume the processor's embedder has an embedQuery method.
-    // Since DocumentProcessor exposes 'embedder', we can use it.
-    const queryVector = await this.processor.embedder.embedQuery(textQuery as string);
+    let textQuery: string;
+    if (typeof input === "string") {
+      textQuery = input;
+    } else {
+      const lastContent = input[input.length - 1].content;
+      textQuery = typeof lastContent === "string" ? lastContent : String(lastContent);
+    }
+
+    const queryVector = await this.processor.embedder.embedQuery(textQuery);

const queryVector = await this.processor.embedder.embedQuery(textQuery as string);

// Default top-k to 4, can be customizable via options
const k = (options as any).k ?? 4;
⚠️ Potential issue | 🟠 Major

Avoid any cast; extend RetrieveOptions with proper typing.

Using (options as any).k bypasses type safety. Consider extending RetrieveOptions or creating a dedicated options type for vector retrieval.

Proposed fix

Define proper types for the k parameter:

+export interface DocumentRetrieveOptions extends RetrieveOptions {
+  /**
+   * Number of top results to return from vector search.
+   * @default 4
+   */
+  k?: number;
+}
+
 export abstract class DocumentRetriever extends BaseRetriever {
   // ...
-  async retrieve(input: string | BaseMessage[], options: RetrieveOptions = {}): Promise<string> {
+  async retrieve(input: string | BaseMessage[], options: DocumentRetrieveOptions = {}): Promise<string> {
     // ...
-    const k = (options as any).k ?? 4;
+    const k = options.k ?? 4;

Comment on lines +29 to +31
"dependencies": {
"openai": "^4.20.0"
},
⚠️ Potential issue | 🟠 Major


Update OpenAI SDK to a supported version.

The package depends on openai@^4.20.0, which is two major versions behind the latest release (6.16.0). While the currently resolved version has no known security vulnerabilities, the caret constraint pins the dependency to the 4.x line and blocks newer major versions. Consider upgrading to a recent 6.x release to pick up API improvements and continued maintenance.


Comment on lines +21 to +34
async process(text: string, metadata?: Record<string, any>): Promise<ProcessedDocument[]> {
const chunks = await this.splitter.splitText(text);
const embeddings = await this.embedder.embedDocuments(chunks);

return chunks.map((chunk, index) => ({
text: chunk,
embedding: embeddings[index],
metadata: {
...metadata,
chunkIndex: index,
chunkCount: chunks.length,
},
}));
}
⚠️ Potential issue | 🟡 Minor

Potential undefined embedding if array lengths mismatch.

If embedder.embedDocuments(chunks) returns fewer embeddings than chunks (due to an API error or implementation bug), embeddings[index] will be undefined, leading to corrupted ProcessedDocument objects.

🛡️ Proposed defensive check
 async process(text: string, metadata?: Record<string, any>): Promise<ProcessedDocument[]> {
   const chunks = await this.splitter.splitText(text);
   const embeddings = await this.embedder.embedDocuments(chunks);

+  if (embeddings.length !== chunks.length) {
+    throw new Error(
+      `Embedding count mismatch: expected ${chunks.length}, got ${embeddings.length}`
+    );
+  }
+
   return chunks.map((chunk, index) => ({
     text: chunk,
     embedding: embeddings[index],

this.client = new OpenAI({
apiKey: params?.apiKey ?? process.env.OPENAI_API_KEY,
});
this.model = params?.model ?? "text-embedding-ada-002";
⚠️ Potential issue | 🟠 Major


Update the default embedding model to text-embedding-3-large.

The default model text-embedding-ada-002 is outdated. OpenAI released the text-embedding-3 family on January 25, 2024, with text-embedding-3-large being the highest-performing embedding model. Consider updating the default to "text-embedding-3-large" or "text-embedding-3-small" if cost efficiency is a priority. Since the model is configurable via params, applications can override this, but the default should reflect current best practices.
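
The params-with-default pattern under discussion can be sketched as below; the class and parameter names are hypothetical, not the package's actual API. Note also that changing the default changes output dimensionality (text-embedding-ada-002 and text-embedding-3-small produce 1536-dimensional vectors, text-embedding-3-large produces 3072), so any existing vector store must be re-indexed when the model changes.

```typescript
// Illustrative sketch of the constructor-default pattern; names are
// hypothetical stand-ins for the package's actual types.
interface EmbeddingParams {
  apiKey?: string;
  model?: string;
}

class EmbeddingConfig {
  readonly model: string;

  constructor(params?: EmbeddingParams) {
    // An explicit override wins; otherwise fall back to the recommended default.
    this.model = params?.model ?? "text-embedding-3-small";
  }
}
```

With this shape, updating the fallback string is a one-line change and callers that pass `model` explicitly are unaffected.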


Comment on lines +23 to +29
async embedQuery(text: string): Promise<number[]> {
const response = await this.client.embeddings.create({
model: this.model,
input: text.replace(/\n/g, " "),
});
return response.data[0].embedding;
}
⚠️ Potential issue | 🟡 Minor

Missing safety check for empty API response.

If the OpenAI API returns an empty data array (e.g., due to an API issue or edge case), accessing response.data[0].embedding will throw a runtime error.

🛡️ Proposed defensive check
 async embedQuery(text: string): Promise<number[]> {
   const response = await this.client.embeddings.create({
     model: this.model,
     input: text.replace(/\n/g, " "),
   });
+  if (!response.data.length) {
+    throw new Error("OpenAI returned no embeddings");
+  }
   return response.data[0].embedding;
 }

Development

Successfully merging this pull request may close these issues.

Document Chunking & Embedding Utilities
