High-performance document intelligence for Node.js and TypeScript, powered by Rust.
Extract text, tables, images, and metadata from 56 file formats including PDF, DOCX, PPTX, XLSX, images, and more.
Recommended for Node.js and Bun - Native NAPI-RS bindings provide the best performance (2-3x faster than WASM).
For browser, Deno, or Cloudflare Workers, use @kreuzberg/wasm instead.
Version 4.0.0 Release Candidate: This is a pre-release version. We invite you to test the library and report any issues you encounter.
- 56 File Formats: PDF, DOCX, PPTX, XLSX, images, HTML, Markdown, XML, JSON, and more
- OCR Support: Built-in Tesseract, EasyOCR, and PaddleOCR backends for scanned documents
- Table Extraction: Advanced table detection and structured data extraction
- Native Performance: 2-3x faster than WASM; 10-50x faster than pure JavaScript
- Zero-Copy Operations: Direct system calls and minimal data copying
- Type-Safe: Full TypeScript definitions for all methods, configurations, and return types
- Async/Sync APIs: Both asynchronous and synchronous extraction methods
- Batch Processing: Process multiple documents in parallel with optimized concurrency
- Language Detection: Automatic language detection for extracted text
- Text Chunking: Split long documents into manageable chunks for LLM processing
- Caching: Built-in result caching for faster repeated extractions
- Zero Configuration: Works out of the box with sensible defaults
Choose @kreuzberg/node if you're building with:
- Node.js 18+ - Native bindings provide direct access to system resources
- Bun - Full compatibility with Bun's Node.js API
- Performance-critical applications - Processing large document batches or real-time extraction
- Server-side extraction - APIs, microservices, document processing pipelines
| Aspect | @kreuzberg/node | @kreuzberg/wasm |
|---|---|---|
| Performance | 2-3x faster (native) | Standard baseline |
| Environment | Node.js, Bun | Browser, Deno, Workers, Node.js |
| Bundle Size | 10-15 MB (prebuilt binary) | 2-4 MB (WASM module) |
| System Access | Direct system calls | Sandboxed via WASM |
| Best For | Server-side, batch processing | Client-side, edge computing |
Use @kreuzberg/wasm for browser applications, Cloudflare Workers, Deno, or when you need a smaller bundle size.
- Node.js 18 or higher
- Native bindings are prebuilt for:
  - macOS (x64, arm64)
  - Linux (x64, arm64, armv7)
  - Windows (x64, arm64)
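The platform list above can be checked programmatically before deploying. The following is a minimal sketch, not part of @kreuzberg/node: `hasPrebuiltBinary` is a hypothetical helper, and it assumes that "armv7" corresponds to the value `arm` that Node.js reports in `process.arch`.

```typescript
// Prebuilt targets as listed above, keyed by the values Node.js reports in
// process.platform (darwin, linux, win32) and process.arch.
// Assumption: "armv7" maps to process.arch === "arm".
const PREBUILT_TARGETS = new Map<string, readonly string[]>([
  ["darwin", ["x64", "arm64"]],
  ["linux", ["x64", "arm64", "arm"]],
  ["win32", ["x64", "arm64"]],
]);

// Hypothetical helper: returns true when the platform/arch pair is in the list.
function hasPrebuiltBinary(platform: string, arch: string): boolean {
  return PREBUILT_TARGETS.get(platform)?.includes(arch) ?? false;
}
```

In a Node.js or Bun process this would typically be called as `hasPrebuiltBinary(process.platform, process.arch)`.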
- Tesseract: For OCR functionality
  - macOS: brew install tesseract
  - Ubuntu: sudo apt-get install tesseract-ocr
  - Windows: Download from GitHub
- LibreOffice: For legacy MS Office formats (.doc, .ppt)
  - macOS: brew install libreoffice
  - Ubuntu: sudo apt-get install libreoffice
- Pandoc: For advanced document conversion
  - macOS: brew install pandoc
  - Ubuntu: sudo apt-get install pandoc
npm install @kreuzberg/node

Or with pnpm:

pnpm add @kreuzberg/node

Or with yarn:

yarn add @kreuzberg/node

The package includes prebuilt native binaries for major platforms. No additional build steps are required.
import { extractFileSync } from '@kreuzberg/node';

// Synchronous extraction
const result = extractFileSync('document.pdf');
console.log(result.content);
console.log(result.metadata);

import { extractFile } from '@kreuzberg/node';
// Asynchronous extraction
const result = await extractFile('document.pdf');
console.log(result.content);
console.log(result.tables);

import {
  extractFile,
  type ExtractionConfig,
  type ExtractionResult
} from '@kreuzberg/node';
const config: ExtractionConfig = {
  useCache: true,
  enableQualityProcessing: true
};

const result: ExtractionResult = await extractFile('invoice.pdf', config);

// Type-safe access to all properties
console.log(result.content);
console.log(result.mimeType);
console.log(result.metadata);

if (result.tables) {
  for (const table of result.tables) {
    console.log(table.markdown);
  }
}

import { extractFile, type ExtractionConfig, type OcrConfig } from '@kreuzberg/node';
const config: ExtractionConfig = {
  ocr: {
    backend: 'tesseract',
    language: 'eng',
    tesseractConfig: {
      enableTableDetection: true,
      psm: 6,
      minConfidence: 50.0
    }
  } as OcrConfig
};

const result = await extractFile('scanned.pdf', config);
console.log(result.content);

import { extractFile, type PdfConfig } from '@kreuzberg/node';
const config = {
  pdfOptions: {
    passwords: ['password1', 'password2'],
    extractImages: true,
    extractMetadata: true
  } as PdfConfig
};

const result = await extractFile('protected.pdf', config);

import { extractFile } from '@kreuzberg/node';
const result = await extractFile('financial-report.pdf');

if (result.tables) {
  for (const table of result.tables) {
    console.log('Table as Markdown:');
    console.log(table.markdown);

    console.log('Table cells:');
    console.log(JSON.stringify(table.cells, null, 2));
  }
}

import { extractFile, type ChunkingConfig } from '@kreuzberg/node';
const config = {
  chunking: {
    maxChars: 1000,
    maxOverlap: 200
  } as ChunkingConfig
};

const result = await extractFile('long-document.pdf', config);

if (result.chunks) {
  for (const chunk of result.chunks) {
    console.log(`Chunk ${chunk.index}: ${chunk.text.substring(0, 100)}...`);
  }
}

import { extractFile, type LanguageDetectionConfig } from '@kreuzberg/node';
const config = {
  languageDetection: {
    enabled: true,
    minConfidence: 0.8,
    detectMultiple: false
  } as LanguageDetectionConfig
};

const result = await extractFile('multilingual.pdf', config);

if (result.language) {
  console.log(`Detected language: ${result.language.code}`);
  console.log(`Confidence: ${result.language.confidence}`);
}

import { extractFile, type ImageExtractionConfig } from '@kreuzberg/node';
import { writeFile } from 'fs/promises';
const config = {
  images: {
    extractImages: true,
    targetDpi: 300,
    maxImageDimension: 4096,
    autoAdjustDpi: true
  } as ImageExtractionConfig
};

const result = await extractFile('document-with-images.pdf', config);

if (result.images) {
  for (let i = 0; i < result.images.length; i++) {
    const image = result.images[i];
    await writeFile(`image-${i}.${image.format}`, Buffer.from(image.data));
  }
}

import {
  extractFile,
  type ExtractionConfig,
  type OcrConfig,
  type ChunkingConfig,
  type ImageExtractionConfig,
  type PdfConfig,
  type TokenReductionConfig,
  type LanguageDetectionConfig
} from '@kreuzberg/node';
const config: ExtractionConfig = {
  useCache: true,
  enableQualityProcessing: true,
  forceOcr: false,
  maxConcurrentExtractions: 8,

  ocr: {
    backend: 'tesseract',
    language: 'eng',
    preprocessing: true,
    tesseractConfig: {
      enableTableDetection: true,
      psm: 6,
      oem: 3,
      minConfidence: 50.0
    }
  } as OcrConfig,

  chunking: {
    maxChars: 1000,
    maxOverlap: 200
  } as ChunkingConfig,

  images: {
    extractImages: true,
    targetDpi: 300,
    maxImageDimension: 4096,
    autoAdjustDpi: true
  } as ImageExtractionConfig,

  pdfOptions: {
    extractImages: true,
    passwords: [],
    extractMetadata: true
  } as PdfConfig,

  tokenReduction: {
    mode: 'moderate',
    preserveImportantWords: true
  } as TokenReductionConfig,

  languageDetection: {
    enabled: true,
    minConfidence: 0.8,
    detectMultiple: false
  } as LanguageDetectionConfig
};

const result = await extractFile('document.pdf', config);

import { extractBytes } from '@kreuzberg/node';
import { readFile } from 'fs/promises';
const buffer = await readFile('document.pdf');
const result = await extractBytes(buffer, 'application/pdf');
console.log(result.content);

import { batchExtractFiles } from '@kreuzberg/node';
const files = [
  'document1.pdf',
  'document2.docx',
  'document3.xlsx'
];

const results = await batchExtractFiles(files);

for (const result of results) {
  console.log(`${result.mimeType}: ${result.content.length} characters`);
}

import { batchExtractFiles } from '@kreuzberg/node';
const config = {
  maxConcurrentExtractions: 4 // Process 4 files at a time
};

const files = Array.from({ length: 20 }, (_, i) => `file-${i}.pdf`);
const results = await batchExtractFiles(files, config);

console.log(`Processed ${results.length} files`);

import { extractFile } from '@kreuzberg/node';
const result = await extractFile('document.pdf');

if (result.metadata) {
  console.log('Title:', result.metadata.title);
  console.log('Author:', result.metadata.author);
  console.log('Creation Date:', result.metadata.creationDate);
  console.log('Page Count:', result.metadata.pageCount);
  console.log('Word Count:', result.metadata.wordCount);
}

import { extractFile, type TokenReductionConfig } from '@kreuzberg/node';
const config = {
  tokenReduction: {
    mode: 'aggressive', // Options: 'light', 'moderate', 'aggressive'
    preserveImportantWords: true
  } as TokenReductionConfig
};

const result = await extractFile('long-document.pdf', config);

// result.content holds the token-reduced text, sized for an LLM context window
console.log(`Reduced content length: ${result.content.length}`);

import {
  extractFile,
  KreuzbergError,
  ValidationError,
  ParsingError,
  OCRError,
  MissingDependencyError
} from '@kreuzberg/node';
try {
  const result = await extractFile('document.pdf');
  console.log(result.content);
} catch (error) {
  if (error instanceof ValidationError) {
    console.error('Invalid configuration or input:', error.message);
  } else if (error instanceof ParsingError) {
    console.error('Failed to parse document:', error.message);
  } else if (error instanceof OCRError) {
    console.error('OCR processing failed:', error.message);
  } else if (error instanceof MissingDependencyError) {
    console.error(`Missing dependency: ${error.dependency}`);
    console.error('Installation instructions:', error.message);
  } else if (error instanceof KreuzbergError) {
    console.error('Kreuzberg error:', error.message);
  } else {
    throw error;
  }
}

Asynchronously extract content from a file.
Synchronously extract content from a file.
Asynchronously extract content from a buffer.
Synchronously extract content from a buffer.
Asynchronously extract content from multiple files in parallel.
Synchronously extract content from multiple files.
Main result object containing:
- content: string - Extracted text content
- mimeType: string - MIME type of the document
- metadata?: Metadata - Document metadata
- tables?: Table[] - Extracted tables
- images?: ImageData[] - Extracted images
- chunks?: Chunk[] - Text chunks (if chunking enabled)
- language?: LanguageInfo - Detected language (if enabled)
Configuration object for extraction:
- useCache?: boolean - Enable result caching
- enableQualityProcessing?: boolean - Enable text quality improvements
- forceOcr?: boolean - Force OCR even for text-based PDFs
- maxConcurrentExtractions?: number - Max parallel extractions
- ocr?: OcrConfig - OCR settings
- chunking?: ChunkingConfig - Text chunking settings
- images?: ImageExtractionConfig - Image extraction settings
- pdfOptions?: PdfConfig - PDF-specific options
- tokenReduction?: TokenReductionConfig - Token reduction settings
- languageDetection?: LanguageDetectionConfig - Language detection settings
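As a rough mental model for the chunking settings, the `maxChars` and `maxOverlap` values used in the earlier examples can be pictured as a sliding window over the extracted text. The sketch below is illustrative only: `chunkText` is a hypothetical function, and Kreuzberg's actual chunker may split on different boundaries.

```typescript
// Illustrative only: a naive sliding-window chunker, not Kreuzberg's implementation.
// Each chunk holds at most maxChars characters and repeats the last
// maxOverlap characters of the previous chunk.
function chunkText(text: string, maxChars: number, maxOverlap: number): string[] {
  if (maxChars <= 0) {
    throw new RangeError("maxChars must be positive");
  }
  if (maxOverlap < 0 || maxOverlap >= maxChars) {
    throw new RangeError("maxOverlap must be in [0, maxChars)");
  }

  const chunks: string[] = [];
  const step = maxChars - maxOverlap; // how far the window advances each time

  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + maxChars));
    if (start + maxChars >= text.length) {
      break; // this window already reached the end of the text
    }
  }
  return chunks;
}
```

With `maxChars: 1000` and `maxOverlap: 200`, each window in this model advances by 800 characters, so consecutive chunks share 200 characters of context.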
OCR configuration:
- backend: string - OCR backend ('tesseract', 'easyocr', 'paddleocr')
- language: string - Language code (e.g., 'eng', 'fra', 'deu')
- preprocessing?: boolean - Enable image preprocessing
- tesseractConfig?: TesseractConfig - Tesseract-specific options
Extracted table structure:
- markdown: string - Table in Markdown format
- cells: TableCell[][] - 2D array of table cells
- rowCount: number - Number of rows
- columnCount: number - Number of columns
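Because `cells` is a 2D array, it converts naturally to other tabular formats. The sketch below turns such an array into CSV. `cellsToCsv` is a hypothetical helper, and it assumes each cell can be rendered with `String()`, since the exact shape of `TableCell` is not described here.

```typescript
// Hypothetical helper, not part of @kreuzberg/node.
// Assumption: each cell can be rendered with String().
function cellsToCsv(cells: unknown[][]): string {
  const escapeField = (value: unknown): string => {
    const text = String(value ?? "");
    // Quote fields that contain commas, quotes, or line breaks (RFC 4180 style),
    // doubling any embedded quotes.
    return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
  };
  return cells.map((row) => row.map(escapeField).join(",")).join("\r\n");
}
```

Typical use would be `cellsToCsv(table.cells)` for each entry in `result.tables`.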
All Kreuzberg exceptions extend the base KreuzbergError class:
- KreuzbergError - Base error class for all Kreuzberg errors
- ValidationError - Invalid configuration, missing required fields, or invalid input
- ParsingError - Document parsing failure or corrupted file
- OCRError - OCR processing failure
- MissingDependencyError - Missing optional system dependency (includes installation instructions)
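Since every error extends KreuzbergError, the order of `instanceof` checks matters: a check against the base class matches all of them, so it has to come last. The sketch below uses local stand-in classes that mirror the hierarchy so it runs without the package. In real code the classes would be imported from '@kreuzberg/node', and `classify` is a hypothetical helper.

```typescript
// Local stand-ins mirroring the hierarchy above (assumption: the real classes
// behave like ordinary Error subclasses).
class KreuzbergError extends Error {
  constructor(message?: string) {
    super(message);
    // Keeps instanceof working when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}
class ValidationError extends KreuzbergError {}
class ParsingError extends KreuzbergError {}
class OCRError extends KreuzbergError {}
class MissingDependencyError extends KreuzbergError {}

type ErrorKind =
  | "validation"
  | "parsing"
  | "ocr"
  | "missing-dependency"
  | "kreuzberg"
  | "unknown";

// Hypothetical helper: specific subclasses first, base class last.
function classify(error: unknown): ErrorKind {
  if (error instanceof ValidationError) return "validation";
  if (error instanceof ParsingError) return "parsing";
  if (error instanceof OCRError) return "ocr";
  if (error instanceof MissingDependencyError) return "missing-dependency";
  if (error instanceof KreuzbergError) return "kreuzberg"; // catches any other subclass
  return "unknown";
}
```

A label like this can then drive logging or retry decisions in one place instead of repeating the `instanceof` chain at every call site.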
| Category | Formats |
|---|---|
| Documents | PDF, DOCX, DOC, PPTX, PPT, XLSX, XLS, ODT, ODP, ODS, RTF |
| Images | PNG, JPEG, JPG, WEBP, BMP, TIFF, GIF |
| Web | HTML, XHTML, XML |
| Text | TXT, MD, CSV, TSV, JSON, YAML, TOML |
| Email | EML, MSG |
| Archives | ZIP, TAR, 7Z |
| Other | And 30+ more formats |
Kreuzberg is built with a native Rust core, providing significant performance improvements over pure JavaScript solutions:
- 10-50x faster text extraction compared to pure Node.js libraries
- Native multithreading for batch processing
- Optimized memory usage with streaming for large files
- Zero-copy operations where possible
- Efficient caching to avoid redundant processing
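To make the idea of bounded parallelism concrete, here is a generic sketch of running an async worker over a list with at most `limit` calls in flight. It only illustrates the kind of bound a setting such as `maxConcurrentExtractions` expresses. `mapWithLimit` is a hypothetical helper, and Kreuzberg's native scheduler is not implemented this way in JavaScript.

```typescript
// Hypothetical helper: runs `worker` over `items` with at most `limit`
// calls in flight, preserving input order in the results.
async function mapWithLimit<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }

  const results = new Array<R>(items.length);
  let next = 0; // index of the next unclaimed item

  // Each lane repeatedly claims the next item until none are left.
  const runLane = async (): Promise<void> => {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  };

  const laneCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: laneCount }, () => runLane()));
  return results;
}
```

For file extraction, `batchExtractFiles` already handles this internally; a helper like this is mainly useful for surrounding work such as uploading or indexing the results.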
Processing 100 mixed documents (PDF, DOCX, XLSX):
| Library | Time | Memory |
|---|---|---|
| Kreuzberg | 2.3s | 145 MB |
| pdf-parse + mammoth | 23.1s | 890 MB |
| textract | 45.2s | 1.2 GB |
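The speedup implied by the table can be worked out directly from the timings. The snippet below only restates the numbers above and divides each time by the Kreuzberg baseline.

```typescript
interface BenchmarkRow {
  library: string;
  seconds: number;
}

// Timings copied from the table above.
const rows: BenchmarkRow[] = [
  { library: "Kreuzberg", seconds: 2.3 },
  { library: "pdf-parse + mammoth", seconds: 23.1 },
  { library: "textract", seconds: 45.2 },
];

const baselineSeconds = rows[0].seconds;

// Speedup of Kreuzberg relative to each alternative, rounded to one decimal.
const speedups = rows.slice(1).map((row) => ({
  library: row.library,
  speedup: Number((row.seconds / baselineSeconds).toFixed(1)),
}));
// speedups: pdf-parse + mammoth is about 10.0x slower, textract about 19.7x
```

Both ratios fall inside the 10-50x range quoted for pure JavaScript libraries.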
If you encounter errors about missing native modules:
npm rebuild @kreuzberg/node

Ensure Tesseract is installed and available in PATH:

tesseract --version

If Tesseract is not found:
- macOS: brew install tesseract
- Ubuntu: sudo apt-get install tesseract-ocr
- Windows: Download from tesseract-ocr/tesseract
For very large PDFs, use chunking to reduce memory usage:
import { extractFile } from '@kreuzberg/node';

const config = {
  chunking: { maxChars: 1000 }
};

const result = await extractFile('large.pdf', config);

Make sure you're using:
- Node.js 18 or higher
- TypeScript 5.0 or higher
The package includes built-in type definitions.
For maximum performance when processing many files:
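import { batchExtractFiles } from '@kreuzberg/node';

// Use batch processing instead of sequential calls
// (`files` is an array of document paths)
const results = await batchExtractFiles(files, {
  maxConcurrentExtractions: 8 // Tune based on CPU cores
});

import { extractFile } from '@kreuzberg/node';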
const result = await extractFile('invoice.pdf');

// Access tables for line items
if (result.tables && result.tables.length > 0) {
  const lineItems = result.tables[0];
  console.log(lineItems.markdown);
}

// Access metadata for invoice details
if (result.metadata) {
  console.log('Invoice Date:', result.metadata.creationDate);
}

import { extractFile } from '@kreuzberg/node';
const config = {
  forceOcr: true,
  ocr: {
    backend: 'tesseract',
    language: 'eng',
    preprocessing: true
  }
};

const result = await extractFile('scanned-contract.pdf', config);
console.log(result.content);

import { batchExtractFiles } from '@kreuzberg/node';
import { glob } from 'glob';

// Find all documents
const files = await glob('documents/**/*.{pdf,docx,xlsx}');

// Extract in batches
const results = await batchExtractFiles(files, {
  maxConcurrentExtractions: 8,
  enableQualityProcessing: true
});

// Build search index
const searchIndex = results.map((result, i) => ({
  path: files[i],
  content: result.content,
  metadata: result.metadata
}));

console.log(`Indexed ${searchIndex.length} documents`);

For comprehensive documentation, visit https://kreuzberg.dev
We welcome contributions! Please see our Contributing Guide for details.
MIT