Universal document-to-Markdown converter and section splitter for HTML, URLs, and PDFs.
- Cross-platform: Works in both Node.js and browser environments (PDF support is best in Node.js)
- Convert HTML, URLs, or PDFs to Markdown
- Split Markdown into logical sections by headers
- High Performance: Sub-second processing for most documents
- Memory Efficient: Optimized for large files up to 2MB
npm install doc-to-readable
import { docToMarkdown } from 'doc-to-readable';
// From HTML string
const md = await docToMarkdown('<h1>Hello</h1><p>World</p>', { type: 'html' });
// From URL
const mdFromUrl = await docToMarkdown('https://example.com', { type: 'url' });
// From Markdown (returns as-is)
const mdFromMarkdown = await docToMarkdown('# Title\nContent', { type: 'markdown' });
import { splitReadableDocs } from 'doc-to-readable';
// From Markdown
const sections = await splitReadableDocs('# Title\n\nContent here\n\n## Subtitle\n\nMore content');
// sections: [{ title: 'Title', content: 'Content here' }, { title: 'Subtitle', content: 'More content' }]
// From HTML
const html = '<h1>Title</h1><p>Content</p><h2>Subtitle</h2><p>More</p>';
const htmlSections = await splitReadableDocs(html, { type: 'html' });
// From URL
const urlSections = await splitReadableDocs('https://example.com', { type: 'url' });
- For PDF files, convert to HTML first using the included helpers, then use `docToMarkdown` or `splitReadableDocs` with `{ type: 'html' }`.
docToMarkdown(input: string, options: { type: 'url' | 'html' | 'markdown' }): Promise<string>
- If `type` is `'markdown'`, returns input as-is.
- If `type` is unsupported, throws a Not Implemented error.
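Because `docToMarkdown` requires an explicit `type`, callers that receive mixed input may want to guess it first. The following is a minimal sketch of one such heuristic; `inferInputType` is a hypothetical helper, not part of the library, and its rules (an `http(s)` prefix means URL, a leading tag means HTML, anything else is Markdown) are assumptions that will misclassify some inputs, such as Markdown that begins with inline HTML.

```javascript
// Hypothetical helper, not part of doc-to-readable.
// Guesses one of the types listed in the docToMarkdown signature.
function inferInputType(input) {
  const trimmed = input.trim();
  // A single whitespace-free token starting with http:// or https://
  if (/^https?:\/\/\S+$/i.test(trimmed)) return 'url';
  // Starts with a doctype or an opening tag such as <h1> or <div class="x">
  if (/^<(!doctype|[a-z][\w-]*)[\s>\/]/i.test(trimmed)) return 'html';
  // Fallback assumption: treat everything else as Markdown
  return 'markdown';
}

console.log(inferInputType('https://example.com'));
console.log(inferInputType('<h1>Hello</h1><p>World</p>'));
console.log(inferInputType('# Title\nContent'));
```

The returned string can then be passed as the `type` option, for example `docToMarkdown(input, { type: inferInputType(input) })`.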
splitReadableDocs(input: string, options?: { type?: 'markdown' | 'url' | 'html' }): Promise<Array<{ title: string | null, content: string }>>
- If `type` is omitted or `'markdown'`, splits input as markdown.
- If `type` is `'html'` or `'url'`, converts to markdown first, then splits.
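The documented return shape, an array of `{ title, content }` objects where `title` may be `null`, lends itself to simple post-processing. As a rough sketch, the hypothetical helper `buildOutline` below (not part of the library) turns such an array into a numbered outline with word counts. The sample data is hand-written in the documented shape rather than produced by a real call.

```javascript
// Hypothetical helper, not part of doc-to-readable: summarizes sections
// shaped like the documented return type of splitReadableDocs.
function buildOutline(sections) {
  return sections.map((section, index) => {
    // title may be null according to the documented return type
    const title = section.title ?? '(untitled)';
    // Count whitespace-separated words, ignoring empty strings
    const words = section.content.split(/\s+/).filter(Boolean).length;
    return `${index + 1}. ${title} (${words} words)`;
  });
}

// Hand-written sample in the documented { title, content } shape
const sample = [
  { title: 'Title', content: 'Content here' },
  { title: 'Subtitle', content: 'More content' },
  { title: null, content: '' },
];

console.log(buildOutline(sample).join('\n'));
```

In real use, the array resolved by `splitReadableDocs` would take the place of `sample`.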
pdfToHtmlFromBuffer(buffer: ArrayBuffer): Promise<string>
- Convert PDF buffer to HTML
import { docToMarkdown, pdfToHtmlFromBuffer } from 'doc-to-readable';
// Convert PDF buffer to HTML
const pdfBuffer = await fetch('document.pdf').then(res => res.arrayBuffer());
const html = await pdfToHtmlFromBuffer(pdfBuffer);
// Then convert to markdown
const md = await docToMarkdown(html, { type: 'html' });
The library is optimized for high performance across different file sizes. Here are benchmark results from our test suite:
File Size | docToMarkdown | splitReadableDocs | Memory Usage |
---|---|---|---|
1KB | 265ms | 0ms | 33MB RSS |
10KB | 43ms | 0ms | 2MB RSS |
100KB | 237ms | 1ms | 23MB RSS |
1000KB | 2.7s | 4ms | 259MB RSS |
2MB | 6.3s | N/A | 934MB RSS |
- Ultra-fast splitting: `splitReadableDocs` took 4ms or less for documents up to 1000KB in the benchmarks above
- Linear scaling: Processing time scales roughly linearly with file size from 100KB upward
- Memory efficient: Optimized memory usage for large documents
- Size limits: Built-in 2MB limit prevents memory issues
- Real-time ready: Sub-second processing for documents up to 100KB
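Given the 2MB limit, it can be useful to check input size before starting a conversion. The sketch below is one way to do that. It assumes the limit means 2 × 1024 × 1024 UTF-8 bytes, which the text above does not specify, and both `utf8ByteLength` and `exceedsAssumedLimit` are hypothetical helpers rather than library functions.

```javascript
// Assumption: "2MB" is interpreted as 2 * 1024 * 1024 UTF-8 bytes.
// The library may measure its limit differently.
const ASSUMED_LIMIT_BYTES = 2 * 1024 * 1024;

// TextEncoder is a global in both modern browsers and Node.js
function utf8ByteLength(text) {
  return new TextEncoder().encode(text).length;
}

// Hypothetical pre-check to run before handing input to the library
function exceedsAssumedLimit(text, limitBytes = ASSUMED_LIMIT_BYTES) {
  return utf8ByteLength(text) > limitBytes;
}

const small = '<p>' + 'a'.repeat(1000) + '</p>';
const large = 'a'.repeat(ASSUMED_LIMIT_BYTES + 1);

console.log(exceedsAssumedLimit(small)); // false
console.log(exceedsAssumedLimit(large)); // true
```

Measuring bytes rather than `text.length` matters for non-ASCII content, where a single character can occupy several bytes.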
The library includes comprehensive benchmark tests that validate performance across:
- Small documents (1-10KB): Sub-second processing
- Medium documents (100KB): ~250ms processing
- Large documents (1MB): ~3 seconds processing
- Very large documents (2MB): ~6 seconds processing
- Edge cases: Many sections, long paragraphs, oversized files
Run benchmarks with:
npm run test:benchmark
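To get a rough timing for your own documents outside the bundled benchmark suite, a small wrapper around `performance.now()` is usually enough. This is a sketch only: `timeAsync` is a hypothetical helper, and `fakeConvert` is a stand-in workload so the snippet runs without the library installed.

```javascript
// Hypothetical helper: times any async function with the high-resolution clock.
async function timeAsync(fn) {
  const start = performance.now();
  const result = await fn();
  return { result, ms: performance.now() - start };
}

// Stand-in workload; in real use, replace it with something like
// () => docToMarkdown(html, { type: 'html' })
const fakeConvert = async () => '# Title';

timeAsync(fakeConvert).then(({ result, ms }) => {
  console.log(`converted ${result.length} chars in ${ms.toFixed(1)}ms`);
});
```

A single run like this includes warm-up costs, so repeating the call several times and discarding the first result gives a steadier figure.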
- @mozilla/readability: Extracts main article content from HTML.
- turndown: Converts HTML to Markdown.
- turndown-plugin-gfm: GitHub Flavored Markdown support for Turndown.
- remark: Markdown processing (used for splitting and parsing).
- dompurify: Sanitizes HTML input.
- jsdom: Emulates browser DOM in Node.js for HTML parsing.
- pdf.js: PDF to HTML conversion.
MIT
Patch update: API and types for splitReadableDocs and docToMarkdown improved for clarity and flexibility.