Feat/document utils #893
base: main
Conversation
🦋 Changeset detected. Latest commit: 64e2cff. The changes in this PR will be included in the next version bump. This PR includes changesets to release 2 packages.
5 issues found across 17 files
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="packages/documents/package.json">
<violation number="1" location="packages/documents/package.json:35">
P1: Version mismatch: vitest `^1.0.0` conflicts with project-level `^3.2.4`. In a monorepo with syncpack, dependency versions should be consistent across packages to avoid compatibility issues.</violation>
<violation number="2" location="packages/documents/package.json:36">
P2: Version mismatch: @types/node `^20.0.0` conflicts with project-level `^24.2.1`. Consider aligning with the monorepo's standard version for consistent type definitions.</violation>
</file>
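For reference, a minimal sketch of the aligned devDependencies block (versions taken from the two version-mismatch findings above; surrounding fields omitted):

```json
{
  "devDependencies": {
    "vitest": "^3.2.4",
    "@types/node": "^24.2.1"
  }
}
```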
<file name="packages/documents/src/DocumentProcessor.ts">
<violation number="1" location="packages/documents/src/DocumentProcessor.ts:23">
P2: Missing validation that `embeddings.length` matches `chunks.length`. If the embedding model returns fewer embeddings than expected, `embeddings[index]` could be `undefined`, causing silent data corruption in the returned `ProcessedDocument[]`. Consider adding a length validation check after fetching embeddings.</violation>
</file>
<file name="packages/documents/src/text-splitters/TextSplitter.ts">
<violation number="1" location="packages/documents/src/text-splitters/TextSplitter.ts:14">
P2: Missing validation for positive values. The validation checks that `chunkOverlap < chunkSize`, but doesn't ensure `chunkSize > 0` and `chunkOverlap >= 0`. This allows invalid configurations like zero or negative values to pass.</violation>
</file>
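As a concrete illustration of the missing checks, here is a minimal sketch of constructor validation in TypeScript. The class and parameter names mirror the ones in the finding, but the defaults and exact shape are assumptions, not the package's actual source:

```typescript
// Illustrative sketch only: defaults and structure are assumed, not taken
// from the real packages/documents source.
interface TextSplitterParams {
  chunkSize?: number;
  chunkOverlap?: number;
}

abstract class TextSplitter {
  readonly chunkSize: number;
  readonly chunkOverlap: number;

  constructor(params?: TextSplitterParams) {
    this.chunkSize = params?.chunkSize ?? 1000;
    this.chunkOverlap = params?.chunkOverlap ?? 200;

    // Reject non-positive sizes and negative overlap before the existing
    // overlap-vs-size comparison.
    if (this.chunkSize <= 0) {
      throw new Error(`chunkSize must be > 0, got ${this.chunkSize}`);
    }
    if (this.chunkOverlap < 0) {
      throw new Error(`chunkOverlap must be >= 0, got ${this.chunkOverlap}`);
    }
    if (this.chunkOverlap >= this.chunkSize) {
      throw new Error("chunkOverlap must be smaller than chunkSize");
    }
  }

  abstract splitText(text: string): Promise<string[]>;
}
```

With these guards, a zero or negative `chunkSize` fails fast at construction time instead of producing empty or nonsensical chunks later.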
<file name="packages/core/src/retriever/document-retriever.ts">
<violation number="1" location="packages/core/src/retriever/document-retriever.ts:56">
P1: Accessing `input[input.length - 1]` will throw a TypeError if `input` is an empty array. Add a guard to handle this edge case.</violation>
</file>
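A minimal sketch of the suggested guard, using an illustrative `BaseMessage` shape (the real type lives in `@voltagent/core`); returning `null` for an empty input is one option, throwing a descriptive error is another:

```typescript
// Illustrative message shape; the real BaseMessage type is defined in core.
interface BaseMessage {
  role: string;
  content: unknown;
}

// Returns the query text, or null when there is nothing to embed.
function extractQueryText(input: string | BaseMessage[]): string | null {
  if (typeof input === "string") {
    return input;
  }
  if (input.length === 0) {
    // Guard: input[input.length - 1] on an empty array is undefined,
    // and accessing .content on it would throw a TypeError.
    return null;
  }
  const content = input[input.length - 1].content;
  return typeof content === "string" ? content : null;
}
```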
Since this is your first cubic review, here's how it works:
- cubic automatically reviews your code and comments on bugs and improvements
- Teach cubic by replying to its comments. cubic learns from your replies and gets better over time
- Ask questions if you need clarification on any suggestion
Reply to cubic to teach it or ask questions. Tag @cubic-dev-ai to re-run a review.
- Handle empty array in DocumentRetriever.retrieve
- Validate chunkSize and chunkOverlap in TextSplitter
omeraplak left a comment
Hey, thanks, the PR looks great!
I actually started working on the packages/rag package a few weeks ago. Do you think we should include this package as well?
Also, would you like to add it to the docs here?
https://voltagent.dev/docs/rag/overview/
2 issues found across 1 file (changes from recent commits).
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="website/docs/rag/overview.md">
<violation number="1" location="website/docs/rag/overview.md:257">
P1: Incorrect API usage in documentation example. `RecursiveChunker` constructor accepts an optional `Tokenizer`, not options. Options like `maxTokens` and `overlapTokens` should be passed to the `chunk()` method. Also, `chunk()` is synchronous, not async.</violation>
<violation number="2" location="website/docs/rag/overview.md:264">
P1: Incorrect API usage in documentation example. `MarkdownChunker` constructor accepts an optional `Tokenizer`, not options. Options like `maxTokens` should be passed to the `chunk()` method. Also, `chunk()` is synchronous, not async.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
@omeraplak I've also resolved the recent merge conflicts and fixed the linting issues. The PR should be good to go now!
Force-pushed from 112194c to 969db3a
Note: CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

This pull request introduces a new @voltagent/documents package.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant App as Application
    participant DR as DocumentRetriever
    participant DP as DocumentProcessor
    participant TS as TextSplitter
    participant EM as EmbeddingModel
    participant VS as Vector Store
    App->>DR: ingest(text, metadata)
    DR->>DP: process(text, metadata)
    DP->>TS: splitText(text)
    TS-->>DP: chunks[]
    DP->>EM: embedDocuments(chunks)
    EM-->>DP: embeddings[][]
    DP-->>DR: ProcessedDocument[]
    DR->>VS: upsertDocuments(ProcessedDocument[])
    VS-->>DR: stored
    App->>DR: retrieve(query)
    DR->>EM: embedQuery(query)
    EM-->>DR: embedding[]
    DR->>VS: queryVectors(embedding, k=4)
    VS-->>DR: ProcessedDocument[]
    DR-->>App: joined result text
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks: ❌ Failed checks (1 warning, 1 inconclusive) | ✅ Passed checks (3 passed)
Actionable comments posted: 6
🤖 Fix all issues with AI agents
In @packages/core/src/retriever/document-retriever.ts:
- Around line 59-63: The code assumes the selected BaseMessage content is a
string and casts it before calling processor.embedder.embedQuery; instead
validate the content type from input (and the last BaseMessage) before
embedding: ensure textQuery is a string, handle undefined or non-string content
by extracting a text field if present (e.g., content.text), falling back to a
safe serialization like JSON.stringify(content) or returning/logging an error
and not calling embedQuery; replace the direct cast on textQuery and guard the
call to this.processor.embedder.embedQuery accordingly to avoid runtime crashes.
- Line 66: The code uses a cast to any to read options.k; instead extend the
RetrieveOptions type (or create a new VectorRetrieveOptions that extends
RetrieveOptions) to include an optional k?: number, update the method signature
to accept that typed options, and replace the cast line with a typed access
(e.g., destructure or read options.k with a default of 4) so type safety is
preserved for the k parameter in DocumentRetriever/document-retrieval logic.
In @packages/documents/package.json:
- Around line 29-31: The package.json currently pins the "openai" dependency to
"^4.20.0"; update that dependency to a supported 6.x release (e.g., "^6.16.0")
in the "dependencies" entry for "openai", then run your package manager
(npm/yarn/pnpm) to install and update lockfiles; after upgrading, run tests and
fix any breaking API changes in code that uses the OpenAI SDK (search for
imports/usages of "openai" and update client construction and method names per
the 6.x migration guide).
In @packages/documents/src/DocumentProcessor.ts:
- Around line 21-34: In process, guard against embedder.embedDocuments returning
fewer items than chunks: after const embeddings = await
this.embedder.embedDocuments(chunks); check that embeddings is an array and
embeddings.length === chunks.length (or at least >= chunks.length); if not,
either throw a clear error or fill missing entries with a safe fallback (e.g.,
null vector or empty embedding) and log the mismatch via the class logger;
ensure the returned ProcessedDocument objects use validated/fallback embeddings
so embeddings[index] cannot be undefined.
In @packages/documents/src/embeddings/OpenAIEmbeddingModel.ts:
- Around line 23-29: The embedQuery method lacks a safety check for an empty API
response; after calling this.client.embeddings.create in embedQuery, verify that
response && Array.isArray(response.data) && response.data.length > 0 and that
response.data[0].embedding exists before returning it; if the check fails,
either throw a descriptive error (e.g., "Empty embeddings response from OpenAI")
or return a sensible default (empty array) and log the incident so callers don't
get a runtime exception when accessing response.data[0].embedding.
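The response check described above can be sketched as a small helper. The response shape below is a simplified stand-in for the OpenAI SDK's embeddings response, and the helper is illustrative rather than the package's actual code:

```typescript
// Simplified stand-in for the relevant slice of the OpenAI embeddings
// response; the real SDK type carries more fields.
interface EmbeddingsResponse {
  data?: Array<{ embedding: number[] }>;
}

// Validate the response before indexing into it, so callers get a clear
// error instead of a TypeError on response.data[0].embedding.
function firstEmbeddingOrThrow(response: EmbeddingsResponse): number[] {
  if (!response || !Array.isArray(response.data) || response.data.length === 0) {
    throw new Error("Empty embeddings response from OpenAI");
  }
  const embedding = response.data[0].embedding;
  if (!embedding) {
    throw new Error("Embeddings response entry is missing its vector");
  }
  return embedding;
}
```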
- Line 19: The default embedding model in OpenAIEmbeddingModel is outdated;
update the default assignment for this.model (in the OpenAIEmbeddingModel
constructor/initialization) from "text-embedding-ada-002" to
"text-embedding-3-large" (or "text-embedding-3-small" if you prefer cost
efficiency) so new instances use the current best-practice embedding by default
while preserving params override behavior.
🧹 Nitpick comments (8)
packages/documents/tsconfig.json (1)

4-4: Consider removing DOM libs for a Node.js-only package. This package handles document processing and embeddings, which are server-side operations. The `dom` and `dom.iterable` libs are typically unnecessary and could be removed to keep the type environment focused on Node.js.

Suggested change
- "lib": ["dom", "dom.iterable", "esnext"], + "lib": ["esnext"],packages/documents/package.json (1)
1-37: Add missing package metadata for consistency with other packages. The package is missing `license`, `repository`, `keywords`, and `author` fields that are present in other packages like `@voltagent/core`. Consider adding these for consistency and proper npm publishing.

Suggested additions

```diff
 {
   "name": "@voltagent/documents",
   "version": "0.0.1",
   "description": "Document processing and embedding utilities for VoltAgent",
+  "license": "MIT",
+  "repository": {
+    "type": "git",
+    "url": "https://github.com/VoltAgent/voltagent.git",
+    "directory": "packages/documents"
+  },
   "main": "dist/index.js",
```

packages/documents/src/text-splitters/TextSplitter.ts (1)
6-25: Consider making properties readonly. The `chunkSize` and `chunkOverlap` properties are set once in the constructor and validated. Making them `readonly` would prevent accidental mutation and better express the design intent.

♻️ Suggested change

```diff
 export abstract class TextSplitter {
-  chunkSize: number;
-  chunkOverlap: number;
+  readonly chunkSize: number;
+  readonly chunkOverlap: number;
```

packages/documents/src/text-splitters/RecursiveCharacterTextSplitter.ts (1)
19-90: Consider handling empty input text. The recursive splitting logic is well-implemented. However, when `text` is an empty string, `"".split(separator)` returns `[""]`, causing `splitText("")` to return `[""]` rather than an empty array. This may be unexpected for callers expecting no chunks from empty input.

♻️ Optional early return for empty input

```diff
 private _splitText(text: string, separators: string[]): string[] {
+  if (text.length === 0) {
+    return [];
+  }
+
   const finalChunks: string[] = [];
```

packages/documents/src/RecursiveCharacterTextSplitter.test.ts (1)
4-75: Good test coverage for core functionality; consider adding edge case tests. The three tests effectively cover character splitting, separator-based splitting, and recursive behavior. The inline comments documenting the expected logic are helpful for understanding.

Consider adding tests for:

- Empty string input
- Text smaller than `chunkSize`
- Non-zero `chunkOverlap` behavior
- Validation errors (e.g., `chunkSize <= 0`)

packages/documents/src/embeddings/OpenAIEmbeddingModel.ts (1)
31-48: Consider handling empty input array. If `documents` is an empty array, this returns `[]` silently, which is acceptable. However, consider adding an early return for clarity and to avoid unnecessary iterations.

Also, there's no validation for `maxBatchSize` being a positive integer; a zero or negative value would cause an infinite loop or unexpected behavior.

♻️ Proposed improvements

```diff
+  constructor(params?: OpenAIEmbeddingModelParams) {
+    this.client = new OpenAI({
+      apiKey: params?.apiKey ?? process.env.OPENAI_API_KEY,
+    });
+    this.model = params?.model ?? "text-embedding-ada-002";
+    const batchSize = params?.maxBatchSize ?? 512;
+    if (batchSize <= 0) {
+      throw new Error("maxBatchSize must be a positive integer");
+    }
+    this.maxBatchSize = batchSize;
+  }

   async embedDocuments(documents: string[]): Promise<number[][]> {
+    if (documents.length === 0) {
+      return [];
+    }
     const embeddings: number[][] = [];
     // ... rest of implementation
   }
```

packages/documents/src/DocumentProcessor.ts (1)
12-14: Consider marking fields as `readonly`. Since `splitter` and `embedder` are set in the constructor and likely shouldn't be reassigned afterward, consider making them `readonly` for clarity and immutability.

♻️ Suggested change

```diff
 export class DocumentProcessor {
-  splitter: TextSplitter;
-  embedder: EmbeddingModel;
+  readonly splitter: TextSplitter;
+  readonly embedder: EmbeddingModel;
```

packages/documents/src/DocumentProcessor.test.ts (1)
21-43: LGTM! Consider additional edge-case tests. The test validates the core processing flow correctly. For more comprehensive coverage, consider adding tests for:

- Empty string input
- Single chunk (no delimiter)
- No metadata provided (`undefined`)
- Empty array from splitter
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
- `pnpm-lock.yaml` is excluded by `!**/pnpm-lock.yaml`
📒 Files selected for processing (20)
- .changeset/document-utils-integration.md
- biome.json
- packages/core/package.json
- packages/core/src/agent/subagent/index.ts
- packages/core/src/retriever/document-retriever.ts
- packages/core/src/retriever/index.ts
- packages/documents/README.md
- packages/documents/package.json
- packages/documents/src/DocumentProcessor.test.ts
- packages/documents/src/DocumentProcessor.ts
- packages/documents/src/RecursiveCharacterTextSplitter.test.ts
- packages/documents/src/embeddings/EmbeddingModel.ts
- packages/documents/src/embeddings/OpenAIEmbeddingModel.ts
- packages/documents/src/index.ts
- packages/documents/src/text-splitters/RecursiveCharacterTextSplitter.ts
- packages/documents/src/text-splitters/TextSplitter.ts
- packages/documents/tsconfig.json
- packages/documents/tsup.config.ts
- packages/server-core/src/websocket/setup.ts
- website/docs/rag/overview.md
🧰 Additional context used
📓 Path-based instructions (2)
**/*.ts
📄 CodeRabbit inference engine (AGENTS.md)
- Maintain type safety in TypeScript-first codebase
- Never use JSON.stringify; use the `safeStringify` function instead, imported from `@voltagent/internal`

Files:

- packages/documents/src/index.ts
- packages/documents/src/embeddings/EmbeddingModel.ts
- packages/core/src/agent/subagent/index.ts
- packages/documents/src/text-splitters/TextSplitter.ts
- packages/documents/src/RecursiveCharacterTextSplitter.test.ts
- packages/server-core/src/websocket/setup.ts
- packages/documents/src/text-splitters/RecursiveCharacterTextSplitter.ts
- packages/documents/tsup.config.ts
- packages/core/src/retriever/index.ts
- packages/documents/src/DocumentProcessor.test.ts
- packages/documents/src/embeddings/OpenAIEmbeddingModel.ts
- packages/core/src/retriever/document-retriever.ts
- packages/documents/src/DocumentProcessor.ts
**/*.test.ts
📄 CodeRabbit inference engine (AGENTS.md)
Test your changes - ensure all tests pass before committing
Files:
- packages/documents/src/RecursiveCharacterTextSplitter.test.ts
- packages/documents/src/DocumentProcessor.test.ts
🧠 Learnings (3)
📚 Learning: 2026-01-07T05:09:23.216Z
Learnt from: CR
Repo: VoltAgent/voltagent PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-07T05:09:23.216Z
Learning: Applies to **/*.ts : Maintain type safety in TypeScript-first codebase
Applied to files:
- packages/core/package.json
- packages/documents/tsconfig.json
- packages/documents/tsup.config.ts
📚 Learning: 2026-01-07T05:09:23.217Z
Learnt from: CR
Repo: VoltAgent/voltagent PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-07T05:09:23.217Z
Learning: Follow the monorepo structure - changes may impact multiple packages
Applied to files:
packages/documents/package.json
📚 Learning: 2026-01-07T05:09:23.217Z
Learnt from: CR
Repo: VoltAgent/voltagent PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-07T05:09:23.217Z
Learning: Applies to **/*.test.ts : Test your changes - ensure all tests pass before committing
Applied to files:
- packages/documents/tsconfig.json
- packages/documents/src/RecursiveCharacterTextSplitter.test.ts
🧬 Code graph analysis (4)
packages/documents/src/RecursiveCharacterTextSplitter.test.ts (1)
- packages/documents/src/text-splitters/RecursiveCharacterTextSplitter.ts (1): `RecursiveCharacterTextSplitter` (7-91)

packages/documents/src/text-splitters/RecursiveCharacterTextSplitter.ts (1)
- packages/documents/src/text-splitters/TextSplitter.ts (1): `TextSplitterParams` (1-4)

packages/documents/src/DocumentProcessor.test.ts (2)
- packages/documents/src/embeddings/EmbeddingModel.ts (1): `EmbeddingModel` (1-4)
- packages/documents/src/DocumentProcessor.ts (1): `DocumentProcessor` (12-35)

packages/documents/src/embeddings/OpenAIEmbeddingModel.ts (2)
- packages/documents/src/embeddings/EmbeddingModel.ts (1): `EmbeddingModel` (1-4)
- packages/documents/src/DocumentProcessor.ts (1): `process` (21-34)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
- GitHub Check: Test internal
- GitHub Check: Test logger
- GitHub Check: Test create-voltagent-app
- GitHub Check: Test postgres
- GitHub Check: Test docs-mcp
- GitHub Check: Test server-core
- GitHub Check: Test core
- GitHub Check: Test logger
- GitHub Check: Test libsql
- GitHub Check: Test cli
- GitHub Check: Test voice
- GitHub Check: Build (Node 22)
- GitHub Check: Test supabase
- GitHub Check: Build (Node 24)
- GitHub Check: Build (Node 20)
🔇 Additional comments (19)
website/docs/rag/overview.md (1)

242-280: The documented APIs are accurate for @voltagent/rag, not @voltagent/documents. The documentation section (lines 242-280) correctly describes the existing `@voltagent/rag` package with accurate class names (`RecursiveChunker`, `MarkdownChunker`, etc.), method signatures (`.chunk()`), and parameters (`maxTokens`, `overlapTokens`). All referenced chunkers exist in the RAG package.

However, this creates a critical mismatch: the PR adds `@voltagent/documents`, which provides `RecursiveCharacterTextSplitter` (with a `.splitText()` method and `chunkSize`/`chunkOverlap` parameters) along with `OpenAIEmbeddingModel` and `DocumentProcessor`. The documentation section does not document any of these.

Clarify the documentation intent:

- If this section is new and meant to document the `@voltagent/documents` package being added, it needs to describe that package's API, not RAG's.
- If this section pre-existed and documents RAG (separate from this PR), clarify whether `@voltagent/documents` features need their own documentation section.

packages/server-core/src/websocket/setup.ts (1)
232-233: Lint suppression is acceptable for existing complex handler. The `biome-ignore` directive appropriately acknowledges the cognitive complexity of this WebSocket upgrade handler without changing its behavior. Consider adding a TODO or tracking issue if there are plans to refactor this handler in the future.

biome.json (1)
64-66: Verify the `website` directory exclusion is intentional. Adding `archive` to the ignore list is reasonable. However, excluding the entire `website` directory is broad; ensure this is intentional and that the website has its own linting/formatting configuration, or that it doesn't contain TypeScript/JavaScript that should be linted.

packages/documents/src/embeddings/EmbeddingModel.ts (1)
1-4: Clean interface design following embedding model conventions. The `EmbeddingModel` interface is well-structured with appropriate async return types. This abstraction will allow easy swapping of embedding providers beyond OpenAI.

packages/core/package.json (1)
19-19: Workspace dependency correctly links the new documents package. Using `workspace:*` is the appropriate pattern for internal monorepo dependencies, ensuring core always uses the local documents package version during development and publishing.

.changeset/document-utils-integration.md (1)
1-6: Changeset correctly captures the feature addition. The minor version bump for both packages is appropriate: `@voltagent/documents` is a new package with new functionality, and `@voltagent/core` gains new capabilities through the integration.

packages/documents/tsconfig.json (1)
10-11: No action needed. The rootDir configuration is correct for this setup. The tsconfig.json's `rootDir` and `declaration` settings are used only for type-checking (via `tsc --noEmit`). Declaration generation during the build is handled independently by tsup with `dts: true`, which generates declarations directly from the entry point (src/index.ts) and outputs them to `dist/index.d.ts` and `dist/index.d.mts`, as correctly specified in package.json's `types` exports. The configuration is already correct.

Likely an incorrect or invalid review comment.
packages/core/src/agent/subagent/index.ts (1)
319-320: LGTM! The lint suppression directive is appropriate for this existing complex method. The `handoffTask` method handles multiple sub-agent configuration types and orchestrates complex handoff logic, justifying the complexity allowance.
1-19: LGTM!The tsup configuration is well-structured for the new documents package. Dual CJS/ESM output, TypeScript declarations, and the shared external plugin are appropriately configured.
packages/documents/src/text-splitters/TextSplitter.ts (2)
1-4: LGTM!Clean interface definition with appropriate optional parameters.
1-4: LGTM! Clean interface definition with appropriate optional parameters.
createDocumentsmethod correctly accumulates chunks from multiple input texts. Sequential processing is appropriate here to maintain predictable ordering.packages/documents/src/text-splitters/RecursiveCharacterTextSplitter.ts (2)
1-13: LGTM!Clean extension of
TextSplitterwith sensible default separators that progress from paragraph to character-level splitting.
15-17: LGTM!Clean delegation pattern for the public async interface.
packages/documents/src/RecursiveCharacterTextSplitter.test.ts (1)
1-3: LGTM! Correct imports from vitest and the implementation module.
packages/documents/README.md (1)
1-71: LGTM! The README provides clear documentation with helpful code examples for each component. The examples demonstrate key configuration options and typical usage patterns.
packages/documents/src/index.ts (1)
1-5: LGTM! The barrel exports cleanly expose the public API surface: text splitters, embedding models, and the document processor with its types.
packages/core/src/retriever/index.ts (1)
6-9: LGTM! The new exports are properly structured. Re-exporting `ProcessedDocument` from `@voltagent/documents` and exposing the new `DocumentRetriever` class with its options type maintains a clean public API surface for the module.
20-35: LGTM! The constructor pattern with optional processor and fallback to default is appropriate. The `ingest` method cleanly delegates to the processor and abstract storage method with proper debug logging.
41-49: LGTM! The abstract method contracts are well-defined, providing clear expectations for implementations to handle vector storage and retrieval.
```ts
const textQuery = typeof input === "string" ? input : input[input.length - 1].content;

// We assume the processor's embedder has an embedQuery method.
// Since DocumentProcessor exposes 'embedder', we can use it.
const queryVector = await this.processor.embedder.embedQuery(textQuery as string);
```
Validate content type before embedding.
The content property of BaseMessage may not always be a string (e.g., structured content in multi-modal messages). The cast on line 63 could mask runtime issues if content is an object or undefined.
Suggested defensive handling

```diff
-    const textQuery = typeof input === "string" ? input : input[input.length - 1].content;
-
-    // We assume the processor's embedder has an embedQuery method.
-    // Since DocumentProcessor exposes 'embedder', we can use it.
-    const queryVector = await this.processor.embedder.embedQuery(textQuery as string);
+    let textQuery: string;
+    if (typeof input === "string") {
+      textQuery = input;
+    } else {
+      const lastContent = input[input.length - 1].content;
+      textQuery = typeof lastContent === "string" ? lastContent : String(lastContent);
+    }
+
+    const queryVector = await this.processor.embedder.embedQuery(textQuery);
```
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| const textQuery = typeof input === "string" ? input : input[input.length - 1].content; | |
| // We assume the processor's embedder has an embedQuery method. | |
| // Since DocumentProcessor exposes 'embedder', we can use it. | |
| const queryVector = await this.processor.embedder.embedQuery(textQuery as string); | |
| let textQuery: string; | |
| if (typeof input === "string") { | |
| textQuery = input; | |
| } else { | |
| const lastContent = input[input.length - 1].content; | |
| textQuery = typeof lastContent === "string" ? lastContent : String(lastContent); | |
| } | |
| const queryVector = await this.processor.embedder.embedQuery(textQuery); |
🤖 Prompt for AI Agents
In @packages/core/src/retriever/document-retriever.ts around lines 59 - 63, The
code assumes the selected BaseMessage content is a string and casts it before
calling processor.embedder.embedQuery; instead validate the content type from
input (and the last BaseMessage) before embedding: ensure textQuery is a string,
handle undefined or non-string content by extracting a text field if present
(e.g., content.text), falling back to a safe serialization like
JSON.stringify(content) or returning/logging an error and not calling
embedQuery; replace the direct cast on textQuery and guard the call to
this.processor.embedder.embedQuery accordingly to avoid runtime crashes.
```ts
const queryVector = await this.processor.embedder.embedQuery(textQuery as string);

// Default top-k to 4, can be customizable via options
const k = (options as any).k ?? 4;
```
Avoid the `any` cast; extend `RetrieveOptions` with proper typing.

Using `(options as any).k` bypasses type safety. Consider extending `RetrieveOptions` or creating a dedicated options type for vector retrieval.
Proposed fix

Define proper types for the k parameter:

```diff
+export interface DocumentRetrieveOptions extends RetrieveOptions {
+  /**
+   * Number of top results to return from vector search.
+   * @default 4
+   */
+  k?: number;
+}
+
 export abstract class DocumentRetriever extends BaseRetriever {
   // ...
-  async retrieve(input: string | BaseMessage[], options: RetrieveOptions = {}): Promise<string> {
+  async retrieve(input: string | BaseMessage[], options: DocumentRetrieveOptions = {}): Promise<string> {
     // ...
-    const k = (options as any).k ?? 4;
+    const k = options.k ?? 4;
```

🤖 Prompt for AI Agents
In @packages/core/src/retriever/document-retriever.ts at line 66, The code uses
a cast to any to read options.k; instead extend the RetrieveOptions type (or
create a new VectorRetrieveOptions that extends RetrieveOptions) to include an
optional k?: number, update the method signature to accept that typed options,
and replace the cast line with a typed access (e.g., destructure or read
options.k with a default of 4) so type safety is preserved for the k parameter
in DocumentRetriever/document-retrieval logic.
| "dependencies": { | ||
| "openai": "^4.20.0" | ||
| }, |
🧩 Analysis chain
🏁 Script executed:

```shell
# Check latest openai version and release info
curl -s https://registry.npmjs.org/openai | jq -r '.["dist-tags"] | to_entries[] | "\(.key): \(.value)"'
```

Repository: VoltAgent/voltagent
Length of output: 137
🌐 Web query:
openai npm package latest version 2025 security vulnerabilities
💡 Result:
Summary (as of Jan 13, 2026):
- Latest official "openai" npm package (registry): see the package page on npm for the current published version. No direct vulnerabilities are listed for the package in Snyk's database. [1][2]
- Snyk security scan: "No known security issues" for the latest openai release (Snyk package page). [1]
- Official npm registry: package/version history and tags are on npmjs (check the npm package page to confirm the exact version number you have/plan to install). [2]
Recommended actions:
- Run npm audit and/or Snyk/Dependabot against your project before installing/upgrading.
- Inspect your dependency tree for transitive vulnerabilities (audit covers those).
- If you need, I can check the exact latest version number right now and scan advisories (CVE/GitHub/GitLab) for any security reports — shall I do that?
Sources:
[1] Snyk — openai package security page.
[2] npmjs — openai package page and version listings.
Update OpenAI SDK to a supported version.
The package depends on openai@^4.20.0, which is 2 major versions behind the latest release (6.16.0). While the current version has no known security vulnerabilities, the caret constraint prevents adoption of newer versions. Consider upgrading to a recent version within the 6.x line to benefit from API improvements and continued maintenance.
🤖 Prompt for AI Agents
In @packages/documents/package.json around lines 29 - 31, The package.json
currently pins the "openai" dependency to "^4.20.0"; update that dependency to a
supported 6.x release (e.g., "^6.16.0") in the "dependencies" entry for
"openai", then run your package manager (npm/yarn/pnpm) to install and update
lockfiles; after upgrading, run tests and fix any breaking API changes in code
that uses the OpenAI SDK (search for imports/usages of "openai" and update
client construction and method names per the 6.x migration guide).
```ts
async process(text: string, metadata?: Record<string, any>): Promise<ProcessedDocument[]> {
  const chunks = await this.splitter.splitText(text);
  const embeddings = await this.embedder.embedDocuments(chunks);

  return chunks.map((chunk, index) => ({
    text: chunk,
    embedding: embeddings[index],
    metadata: {
      ...metadata,
      chunkIndex: index,
      chunkCount: chunks.length,
    },
  }));
}
```
Potential undefined embedding if array lengths mismatch.
If embedder.embedDocuments(chunks) returns fewer embeddings than chunks (due to an API error or implementation bug), embeddings[index] will be undefined, leading to corrupted ProcessedDocument objects.
🛡️ Proposed defensive check
```diff
 async process(text: string, metadata?: Record<string, any>): Promise<ProcessedDocument[]> {
   const chunks = await this.splitter.splitText(text);
   const embeddings = await this.embedder.embedDocuments(chunks);
+  if (embeddings.length !== chunks.length) {
+    throw new Error(
+      `Embedding count mismatch: expected ${chunks.length}, got ${embeddings.length}`
+    );
+  }
+
   return chunks.map((chunk, index) => ({
     text: chunk,
     embedding: embeddings[index],
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Suggested change:

```ts
async process(text: string, metadata?: Record<string, any>): Promise<ProcessedDocument[]> {
  const chunks = await this.splitter.splitText(text);
  const embeddings = await this.embedder.embedDocuments(chunks);

  if (embeddings.length !== chunks.length) {
    throw new Error(
      `Embedding count mismatch: expected ${chunks.length}, got ${embeddings.length}`
    );
  }

  return chunks.map((chunk, index) => ({
    text: chunk,
    embedding: embeddings[index],
    metadata: {
      ...metadata,
      chunkIndex: index,
      chunkCount: chunks.length,
    },
  }));
}
```
🤖 Prompt for AI Agents
In @packages/documents/src/DocumentProcessor.ts around lines 21 - 34, In
process, guard against embedder.embedDocuments returning fewer items than
chunks: after const embeddings = await this.embedder.embedDocuments(chunks);
check that embeddings is an array and embeddings.length === chunks.length (or at
least >= chunks.length); if not, either throw a clear error or fill missing
entries with a safe fallback (e.g., null vector or empty embedding) and log the
mismatch via the class logger; ensure the returned ProcessedDocument objects use
validated/fallback embeddings so embeddings[index] cannot be undefined.
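The prompt above describes the guard in the abstract; the sketch below wires the length check into a self-contained version of the pipeline. The `splitter` and `embedder` stubs are hypothetical stand-ins for the package's real implementations (the embedder deliberately returns one embedding too few so the guard fires), not the actual `@voltagent/documents` code.

```typescript
// Self-contained sketch of the guarded pipeline; stubs are hypothetical.
interface ProcessedDocument {
  text: string;
  embedding: number[];
  metadata: Record<string, any>;
}

const splitter = {
  async splitText(text: string): Promise<string[]> {
    // Toy splitter: break on blank lines.
    return text.split(/\n\n+/).filter(Boolean);
  },
};

const embedder = {
  async embedDocuments(chunks: string[]): Promise<number[][]> {
    // Buggy on purpose: returns one embedding fewer than requested.
    return chunks.slice(0, -1).map(() => [0.1, 0.2, 0.3]);
  },
};

async function processDocument(
  text: string,
  metadata?: Record<string, any>,
): Promise<ProcessedDocument[]> {
  const chunks = await splitter.splitText(text);
  const embeddings = await embedder.embedDocuments(chunks);

  // The review's guard: fail loudly instead of emitting documents
  // whose embedding is undefined.
  if (embeddings.length !== chunks.length) {
    throw new Error(
      `Embedding count mismatch: expected ${chunks.length}, got ${embeddings.length}`,
    );
  }

  return chunks.map((chunk, index) => ({
    text: chunk,
    embedding: embeddings[index],
    metadata: { ...metadata, chunkIndex: index, chunkCount: chunks.length },
  }));
}

// → logs "Embedding count mismatch: expected 2, got 1"
processDocument("first chunk\n\nsecond chunk").catch((err) =>
  console.log(err.message),
);
```

With the guard in place the failure surfaces at ingestion time rather than later as a `TypeError` deep inside a vector store.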
```ts
this.client = new OpenAI({
  apiKey: params?.apiKey ?? process.env.OPENAI_API_KEY,
});
this.model = params?.model ?? "text-embedding-ada-002";
```
💡 Result:
As of January 13, 2026 the latest OpenAI embedding models are the text-embedding-3 family: text-embedding-3-large (most capable, up to 3072 dims) and text-embedding-3-small (cost‑efficient). They were announced January 25, 2024; text-embedding-3-large is OpenAI’s highest‑performing embedding model. [1] (openai.com)
Sources:
[1] OpenAI announcement — “New embedding models and API updates” (Jan 25, 2024). (openai.com)
[2] text-embedding-3-large docs (OpenAI). (platform.openai.com)
[3] text-embedding-3-small docs (OpenAI). (platform.openai.com)
Citations:
- 1: https://openai.com/index/new-embedding-models-and-api-updates/?utm_source=openai
- 2: https://openai.com/index/new-embedding-models-and-api-updates/?utm_source=openai
- 3: https://platform.openai.com/docs/models/text-embedding-3-large?utm_source=openai
- 4: https://platform.openai.com/docs/models/text-embedding-3-small?utm_source=openai
Update the default embedding model to text-embedding-3-large.
The default model text-embedding-ada-002 is outdated. OpenAI released the text-embedding-3 family on January 25, 2024, with text-embedding-3-large being the highest-performing embedding model. Consider updating the default to "text-embedding-3-large" or "text-embedding-3-small" if cost efficiency is a priority. Since the model is configurable via params, applications can override this, but the default should reflect current best practices.
🤖 Prompt for AI Agents
In @packages/documents/src/embeddings/OpenAIEmbeddingModel.ts at line 19, The
default embedding model in OpenAIEmbeddingModel is outdated; update the default
assignment for this.model (in the OpenAIEmbeddingModel
constructor/initialization) from "text-embedding-ada-002" to
"text-embedding-3-large" (or "text-embedding-3-small" if you prefer cost
efficiency) so new instances use the current best-practice embedding by default
while preserving params override behavior.
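Because the model is a constructor param, callers can already opt in to the newer family today. Below is a minimal sketch of the configurable-default pattern with the updated default; `FakeOpenAI` is a stub standing in for the real `openai` client so the snippet is self-contained, and the param shape mirrors the constructor snippet above.

```typescript
// Sketch only: FakeOpenAI stands in for the real `openai` SDK client.
class FakeOpenAI {
  constructor(_opts: { apiKey?: string }) {}
}

interface EmbeddingParams {
  apiKey?: string;
  model?: string;
}

class OpenAIEmbeddingModel {
  private client: FakeOpenAI;
  readonly model: string;

  constructor(params?: EmbeddingParams) {
    // Real code falls back to process.env.OPENAI_API_KEY here.
    this.client = new FakeOpenAI({ apiKey: params?.apiKey });
    // Updated default per the review; still overridable via params.model.
    this.model = params?.model ?? "text-embedding-3-large";
  }
}

const cheap = new OpenAIEmbeddingModel({ model: "text-embedding-3-small" });
const standard = new OpenAIEmbeddingModel();
console.log(cheap.model, standard.model);
```

Applications that prioritize cost can pass `"text-embedding-3-small"` explicitly while everyone else picks up the stronger default.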
```ts
async embedQuery(text: string): Promise<number[]> {
  const response = await this.client.embeddings.create({
    model: this.model,
    input: text.replace(/\n/g, " "),
  });
  return response.data[0].embedding;
}
```
Missing safety check for empty API response.
If the OpenAI API returns an empty data array (e.g., due to an API issue or edge case), accessing response.data[0].embedding will throw a runtime error.
🛡️ Proposed defensive check
```diff
 async embedQuery(text: string): Promise<number[]> {
   const response = await this.client.embeddings.create({
     model: this.model,
     input: text.replace(/\n/g, " "),
   });
+  if (!response.data.length) {
+    throw new Error("OpenAI returned no embeddings");
+  }
   return response.data[0].embedding;
 }
```
🤖 Prompt for AI Agents
In @packages/documents/src/embeddings/OpenAIEmbeddingModel.ts around lines 23 -
29, The embedQuery method lacks a safety check for an empty API response; after
calling this.client.embeddings.create in embedQuery, verify that response &&
Array.isArray(response.data) && response.data.length > 0 and that
response.data[0].embedding exists before returning it; if the check fails,
either throw a descriptive error (e.g., "Empty embeddings response from OpenAI")
or return a sensible default (empty array) and log the incident so callers don't
get a runtime exception when accessing response.data[0].embedding.
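A self-contained sketch of what the guarded `embedQuery` could look like: the `stubClient` below only mimics the shape of the OpenAI embeddings response (and returns an empty `data` array on purpose) so the check can be exercised without network access; it is not the real SDK client.

```typescript
// Stub mimicking the shape of the OpenAI embeddings response.
type EmbeddingsResponse = { data: { embedding: number[] }[] };

const stubClient = {
  embeddings: {
    // Simulates the edge case the review flags: an empty data array.
    async create(_req: { model: string; input: string }): Promise<EmbeddingsResponse> {
      return { data: [] };
    },
  },
};

async function embedQuery(text: string): Promise<number[]> {
  const response = await stubClient.embeddings.create({
    model: "text-embedding-3-small",
    input: text.replace(/\n/g, " "),
  });
  if (!response.data.length) {
    // Fail with a clear message instead of a TypeError on response.data[0].
    throw new Error("Empty embeddings response from OpenAI");
  }
  return response.data[0].embedding;
}

// → logs "Empty embeddings response from OpenAI"
embedQuery("hello\nworld").catch((err) => console.log(err.message));
```

Callers get a descriptive, catchable error instead of `Cannot read properties of undefined`.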
feat: implement document chunking and embedding utilities and integrate into core DocumentRetriever
PR Checklist
Please check if your PR fulfills the following requirements:
Bugs / Features
What is the current behavior?
There are no built-in utilities for document chunking, embedding, or ingestion in the core framework, making it difficult to implement RAG workflows.
What is the new behavior?
This PR introduces a new `@voltagent/documents` package and integrates it with `@voltagent/core`.
- `DocumentRetriever`: new abstract class in core that adds `ingest()` capabilities.

Fixes #6
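The exact `ingest()` signature isn't spelled out in this thread, so the following is a rough, hypothetical sketch of how an abstract `DocumentRetriever` with `ingest()` might compose; every name apart from `DocumentRetriever` and `ingest()` is an assumption, and the in-memory subclass uses naive substring matching rather than vector similarity to stay self-contained.

```typescript
// Hypothetical sketch only: the real @voltagent/core API may differ.
interface ProcessedDocument {
  text: string;
  embedding: number[];
  metadata: Record<string, any>;
}

abstract class DocumentRetriever {
  // Subclasses persist processed chunks in their store of choice.
  abstract ingest(docs: ProcessedDocument[]): Promise<void>;
  abstract retrieve(query: string): Promise<string[]>;
}

// Toy in-memory subclass to show the shape of the contract.
class InMemoryRetriever extends DocumentRetriever {
  private docs: ProcessedDocument[] = [];

  async ingest(docs: ProcessedDocument[]): Promise<void> {
    this.docs.push(...docs);
  }

  async retrieve(query: string): Promise<string[]> {
    // Substring match instead of cosine similarity, for brevity.
    return this.docs.filter((d) => d.text.includes(query)).map((d) => d.text);
  }
}

const retriever = new InMemoryRetriever();
retriever
  .ingest([{ text: "voltagent docs", embedding: [0, 1], metadata: {} }])
  .then(() => retriever.retrieve("voltagent"))
  .then((hits) => console.log(hits.length)); // → 1
```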
Notes for reviewers
- The `test-integration` script verifies that the types and exports work correctly.
- `openai` was added as a dependency to `packages/documents`.
Summary by cubic
Adds document chunking and embeddings via a new @voltagent/documents package, and integrates ingestion into core DocumentRetriever to enable RAG workflows. Fixes #6.
New Features
Migration
Written for commit 64e2cff. Summary will update on new commits.
Summary by CodeRabbit
New Features
Documentation