ilyashusterman/doc-to-readable

doc-to-readable

Universal document-to-markdown and section splitter for HTML, URLs, and PDFs.

Features

  • Cross-platform: Works in both Node.js and browser environments (PDF support is best in Node.js)
  • Convert HTML, URLs, or PDFs to Markdown
  • Split Markdown into logical sections by headers
  • High Performance: Sub-second processing for documents up to about 100KB
  • Handles large files: Accepts inputs up to a built-in 2MB limit

Installation

npm install doc-to-readable

Usage

Convert to Markdown

import { docToMarkdown } from 'doc-to-readable';

// From HTML string
const md = await docToMarkdown('<h1>Hello</h1><p>World</p>', { type: 'html' });

// From URL
const mdFromUrl = await docToMarkdown('https://example.com', { type: 'url' });

// From Markdown (returns as-is)
const mdFromMarkdown = await docToMarkdown('# Title\nContent', { type: 'markdown' });
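
Because docToMarkdown needs an explicit type, callers that receive mixed input may want to pick it automatically. The helper below is a minimal sketch of one way to do that; guessInputType and InputType are hypothetical names, not part of doc-to-readable, and the heuristics are assumptions rather than anything the library does.

```typescript
// Hypothetical helper: not exported by doc-to-readable.
type InputType = 'url' | 'html' | 'markdown';

function guessInputType(input: string): InputType {
  const trimmed = input.trim();
  // A single http(s) URL with no whitespace is treated as a URL.
  if (/^https?:\/\/\S+$/i.test(trimmed)) return 'url';
  // Anything containing something that looks like a tag is treated as HTML.
  if (/<\/?[a-z][^>]*>/i.test(trimmed)) return 'html';
  // Everything else falls back to Markdown.
  return 'markdown';
}
```

The result can be passed as the type option, for example docToMarkdown(input, { type: guessInputType(input) }). Markdown that embeds inline HTML will be classified as 'html' by this heuristic, so pass the type explicitly when the source is known.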

Split into Sections

import { splitReadableDocs } from 'doc-to-readable';

// From Markdown
const sections = await splitReadableDocs('# Title\n\nContent here\n\n## Subtitle\n\nMore content');
// sections: [{ title: 'Title', content: 'Content here' }, { title: 'Subtitle', content: 'More content' }]

// From HTML
const html = '<h1>Title</h1><p>Content</p><h2>Subtitle</h2><p>More</p>';
const htmlSections = await splitReadableDocs(html, { type: 'html' });

// From URL
const urlSections = await splitReadableDocs('https://example.com', { type: 'url' });
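
To make the output shape concrete, here is a minimal sketch of header-based splitting. It is not the library's implementation: splitByHeaders and Section are hypothetical names, and the rule that any line starting with one to six # characters followed by whitespace opens a new section is an assumption.

```typescript
// Hypothetical sketch: illustrates the { title, content } shape only.
interface Section {
  title: string | null;
  content: string;
}

function splitByHeaders(markdown: string): Section[] {
  const sections: Section[] = [];
  let title: string | null = null;
  let body: string[] = [];

  // Push the section collected so far, skipping an empty untitled preamble.
  const flush = (): void => {
    const content = body.join('\n').trim();
    if (title !== null || content.length > 0) {
      sections.push({ title: title, content: content });
    }
  };

  for (const line of markdown.split('\n')) {
    // ATX-style header: 1 to 6 '#' characters, whitespace, then text.
    if (/^#{1,6}\s+\S/.test(line)) {
      flush();
      title = line.replace(/^#{1,6}\s+/, '').trim();
      body = [];
    } else {
      body.push(line);
    }
  }
  flush();
  return sections;
}
```

Text that appears before the first header ends up in a section whose title is null, which is one plausible reading of the string | null title type. The sketch does not skip fenced code blocks, so a # comment inside a code sample would be treated as a header.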

PDF Support

  • For PDF files, convert to HTML first with the included pdfToHtmlFromBuffer helper, then use docToMarkdown or splitReadableDocs with { type: 'html' }.

API

  • docToMarkdown(input: string, options: { type: 'url' | 'html' | 'markdown' }): Promise<string>
    • If type is 'markdown', returns input as-is.
    • If type is unsupported, throws a Not Implemented error.
  • splitReadableDocs(input: string, options?: { type?: 'markdown' | 'url' | 'html' }): Promise<Array<{ title: string | null, content: string }>>
    • If type is omitted or 'markdown', splits input as markdown.
    • If type is 'html' or 'url', converts to markdown first, then splits.
  • pdfToHtmlFromBuffer(buffer: ArrayBuffer): Promise<string> - Convert PDF buffer to HTML

PDF Buffer to HTML

import { pdfToHtmlFromBuffer, docToMarkdown } from 'doc-to-readable';

// Convert PDF buffer to HTML
const pdfBuffer = await fetch('document.pdf').then(res => res.arrayBuffer());
const html = await pdfToHtmlFromBuffer(pdfBuffer);

// Then convert to markdown
const md = await docToMarkdown(html, { type: 'html' });

Performance

The library is optimized for high performance across different file sizes. Here are benchmark results from our test suite:

Processing Speed

File Size   docToMarkdown   splitReadableDocs   Memory Usage
1KB         265ms           0ms                 33MB RSS
10KB        43ms            0ms                 2MB RSS
100KB       237ms           1ms                 23MB RSS
1000KB      2.7s            4ms                 259MB RSS
2MB         6.3s            N/A                 934MB RSS
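
To see how docToMarkdown time scales with input size, the figures in the table above can be turned into throughput. The snippet is a small worked calculation using only those published numbers; it assumes 2MB means 2048KB, and the names BenchmarkRow and throughputKBps are illustrative.

```typescript
// Worked arithmetic on the benchmark table; not part of the library.
interface BenchmarkRow {
  sizeKB: number;
  seconds: number;
}

const rows: BenchmarkRow[] = [
  { sizeKB: 1, seconds: 0.265 },
  { sizeKB: 10, seconds: 0.043 },
  { sizeKB: 100, seconds: 0.237 },
  { sizeKB: 1000, seconds: 2.7 },
  { sizeKB: 2048, seconds: 6.3 }, // assumption: 2MB = 2048KB
];

// Kilobytes converted per second for one row.
function throughputKBps(row: BenchmarkRow): number {
  return row.sizeKB / row.seconds;
}

const throughputs: number[] = rows.map((r) => Math.round(throughputKBps(r)));
// throughputs is [4, 233, 422, 370, 325]
```

From 100KB upward the throughput stays between about 325 and 422 KB/s, so time grows roughly in proportion to size in that range. The 1KB row is slower than the 10KB row, which suggests a fixed startup cost in that run rather than size-dependent work.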

Key Performance Features

  • Fast splitting: splitReadableDocs took 0-4ms in the benchmarks above
  • Near-linear scaling: From 100KB upward, processing time grows roughly in proportion to file size
  • Memory usage: RSS grows with document size, reaching about 934MB for a 2MB input
  • Size limits: Built-in 2MB limit prevents memory issues
  • Real-time ready: Sub-second processing for documents up to 100KB
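
Given the 2MB limit, it can be useful to check input size before calling the library so oversized documents fail fast. The following is a minimal sketch under stated assumptions: the limit is taken to be 2 * 1024 * 1024 UTF-8 bytes, which may not match how the library measures size, and utf8ByteLength and isWithinSizeLimit are hypothetical helpers.

```typescript
// Assumption: the limit is counted in UTF-8 bytes and equals 2 MiB.
const ASSUMED_LIMIT_BYTES = 2 * 1024 * 1024;

// Count the bytes a string would occupy when encoded as UTF-8.
function utf8ByteLength(text: string): number {
  let bytes = 0;
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    if (code < 0x80) {
      bytes += 1;
    } else if (code < 0x800) {
      bytes += 2;
    } else if (code >= 0xd800 && code <= 0xdbff && i + 1 < text.length) {
      const next = text.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) {
        bytes += 4; // valid surrogate pair: one 4-byte code point
        i++;
      } else {
        bytes += 3; // lone surrogate, encoded as U+FFFD
      }
    } else {
      bytes += 3;
    }
  }
  return bytes;
}

// Hypothetical pre-check to run before handing input to the library.
function isWithinSizeLimit(text: string, limitBytes: number = ASSUMED_LIMIT_BYTES): boolean {
  return utf8ByteLength(text) <= limitBytes;
}
```

Counting bytes rather than characters matters for non-ASCII text, where text.length can understate the encoded size by a factor of up to three.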

Performance Benchmarks

The library includes comprehensive benchmark tests that validate performance across:

  • Small documents (1-10KB): Sub-second processing
  • Medium documents (100KB): ~250ms processing
  • Large documents (1MB): ~3 seconds processing
  • Very large documents (2MB): ~6 seconds processing
  • Edge cases: Many sections, long paragraphs, oversized files

Run benchmarks with:

npm run test:benchmark

License

MIT

Patch update: API and types for splitReadableDocs and docToMarkdown improved for clarity and flexibility.