doc-to-readable

Universal document-to-Markdown converter and section splitter for HTML, URLs, and PDFs.

Features

Cross-platform: Works in both Node.js and browser environments (PDF support is best in Node.js)
Convert HTML, URLs, or PDFs to Markdown
Split Markdown into logical sections by headers
High Performance: Sub-second processing for documents up to 100KB
Large files: Handles input up to a built-in 2MB limit

Installation

npm install doc-to-readable

Usage

Convert to Markdown

import { docToMarkdown } from 'doc-to-readable';

// From HTML string
const md = await docToMarkdown('<h1>Hello</h1><p>World</p>', { type: 'html' });

// From URL
const mdFromUrl = await docToMarkdown('https://example.com', { type: 'url' });

// From Markdown (returns as-is)
const mdFromMarkdown = await docToMarkdown('# Title\nContent', { type: 'markdown' });

Split into Sections

import { splitReadableDocs } from 'doc-to-readable';

// From Markdown
const sections = await splitReadableDocs('# Title\n\nContent here\n\n## Subtitle\n\nMore content');
// sections: [{ title: 'Title', content: 'Content here' }, { title: 'Subtitle', content: 'More content' }]

// From HTML
const html = '<h1>Title</h1><p>Content</p><h2>Subtitle</h2><p>More</p>';
const htmlSections = await splitReadableDocs(html, { type: 'html' });

// From URL
const urlSections = await splitReadableDocs('https://example.com', { type: 'url' });
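
To make the idea of splitting by headers concrete, here is a minimal sketch of one way it could be done with plain string handling. It is not the library's implementation (the library relies on remark), and splitByHeadings is a hypothetical helper name used only for illustration. It handles ATX-style "#" headings and skips heading-like lines inside fenced code blocks.

```javascript
// Illustrative sketch only; splitByHeadings is a hypothetical helper and is
// not part of doc-to-readable, which uses remark for Markdown processing.
function splitByHeadings(markdown) {
  const sections = [];
  let current = null;  // section currently being accumulated
  let inFence = false; // true while inside a fenced code block

  const finish = ({ title, lines }) => ({ title, content: lines.join('\n').trim() });

  for (const line of markdown.split('\n')) {
    // Toggle on fence markers so "# comment" lines inside code are not headings.
    if (/^\s*(`{3}|~{3})/.test(line)) inFence = !inFence;

    const match = inFence ? null : /^#{1,6}\s+(.+?)\s*$/.exec(line);
    if (match) {
      if (current) sections.push(finish(current));
      current = { title: match[1], lines: [] };
    } else {
      if (!current) current = { title: null, lines: [] }; // text before any heading
      current.lines.push(line);
    }
  }
  if (current) sections.push(finish(current));

  // Drop an empty untitled preamble (for example, leading blank lines).
  return sections.filter((s) => s.title !== null || s.content !== '');
}

const demo = splitByHeadings('# Title\n\nContent here\n\n## Subtitle\n\nMore content');
console.log(demo);
```

Text that appears before the first heading ends up in a section whose title is null, which mirrors the `title: string | null` shape documented in the API section.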

PDF Support

For PDF files, convert to HTML first with pdfToHtmlFromBuffer, then pass the result to docToMarkdown or splitReadableDocs with { type: 'html' }.

API

docToMarkdown(input: string, options: { type: 'url' | 'html' | 'markdown' }): Promise<string>
- If type is 'markdown', returns the input as-is.
- If type is unsupported, throws a Not Implemented error.

splitReadableDocs(input: string, options?: { type?: 'markdown' | 'url' | 'html' }): Promise<Array<{ title: string | null, content: string }>>
- If type is omitted or 'markdown', splits the input as Markdown.
- If type is 'html' or 'url', converts the input to Markdown first, then splits it.

pdfToHtmlFromBuffer(buffer: ArrayBuffer): Promise<string>
- Converts a PDF buffer to HTML.
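
The type rules for docToMarkdown can be pictured as a small dispatch step. The sketch below is an assumption about the general shape, not the library's source: toMarkdown and the injected converters object are hypothetical names, and the real conversion work (fetching, Readability, Turndown) is replaced by stand-in functions.

```javascript
// Hypothetical sketch of the documented type dispatch. The converters are
// injected stand-ins, not doc-to-readable internals.
async function toMarkdown(input, options, converters) {
  const type = options.type;
  if (type === 'markdown') return input; // documented: returned as-is

  const known = Object.prototype.hasOwnProperty.call(converters, type);
  if (!known || typeof converters[type] !== 'function') {
    // documented: unsupported types raise a Not Implemented error
    throw new Error('Not Implemented: unsupported type "' + type + '"');
  }
  return converters[type](input);
}

// Stand-in converter used only to exercise the dispatch.
const fakeConverters = { html: async (html) => 'converted:' + html };

toMarkdown('# Title', { type: 'markdown' }, fakeConverters).then((md) => console.log(md));
```

In practice this means callers should wrap docToMarkdown in try/catch when the type comes from user input.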

PDF Buffer to HTML

import { pdfToHtmlFromBuffer, docToMarkdown } from 'doc-to-readable';

// Convert PDF buffer to HTML
const pdfBuffer = await fetch('document.pdf').then(res => res.arrayBuffer());
const html = await pdfToHtmlFromBuffer(pdfBuffer);

// Then convert to markdown
const md = await docToMarkdown(html, { type: 'html' });

Performance

The library is optimized for high performance across different file sizes. Here are benchmark results from our test suite:

Processing Speed

File Size	docToMarkdown	splitReadableDocs	Memory Usage
1KB	265ms	0ms	33MB RSS
10KB	43ms	0ms	2MB RSS
100KB	237ms	1ms	23MB RSS
1000KB	2.7s	4ms	259MB RSS
2MB	6.3s	N/A	934MB RSS

Key Performance Features

Ultra-fast splitting: splitReadableDocs takes 4ms or less for documents up to 1000KB
Near-linear scaling: docToMarkdown processing time grows roughly in proportion to file size from 100KB upward
Memory usage: Grows with document size, reaching 259MB RSS at 1000KB and 934MB RSS at 2MB
Size limits: Built-in 2MB limit prevents memory issues
Real-time ready: Sub-second processing for documents up to 100KB
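
If you want to reject oversized input before calling the library, a pre-check along these lines is one option. This is a sketch under assumptions: the README does not say how the built-in limit is measured or enforced, so MAX_BYTES (2 MiB of UTF-8) and assertWithinLimit are hypothetical names chosen for illustration.

```javascript
// Assumption: "2MB" is treated here as 2 * 1024 * 1024 bytes of UTF-8 text.
// The library's own limit may be measured differently.
const MAX_BYTES = 2 * 1024 * 1024;

// Hypothetical helper: returns the byte length, or throws if it is too large.
function assertWithinLimit(input, maxBytes = MAX_BYTES) {
  const bytes = new TextEncoder().encode(input).length;
  if (bytes > maxBytes) {
    throw new RangeError('Input is ' + bytes + ' bytes; limit is ' + maxBytes);
  }
  return bytes;
}

console.log(assertWithinLimit('<h1>Hello</h1>')); // 14
```

Measuring bytes rather than string length matters for non-ASCII text, where one character can take several bytes.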

Performance Benchmarks

The library includes comprehensive benchmark tests that validate performance across:

Small documents (1-10KB): Sub-second processing
Medium documents (100KB): ~250ms processing
Large documents (1MB): ~3 seconds processing
Very large documents (2MB): ~6 seconds processing
Edge cases: Many sections, long paragraphs, oversized files

Run benchmarks with:

npm run test:benchmark
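
For a quick ad hoc measurement outside the test suite, a small timing wrapper can be enough. The sketch below is not the project's benchmark code; measure is a hypothetical helper, and the workload is a stand-in that you would replace with a call such as () => docToMarkdown(html, { type: 'html' }). Note that RSS is a process-wide figure, so it may not match the table above exactly.

```javascript
// Hypothetical helper: times an async function and reports process RSS in MB.
async function measure(label, fn) {
  const start = performance.now();
  const result = await fn();
  const ms = performance.now() - start;

  // process.memoryUsage is Node.js only; report null in browsers.
  const hasProcess = typeof process !== 'undefined' && typeof process.memoryUsage === 'function';
  const rssMb = hasProcess ? process.memoryUsage().rss / (1024 * 1024) : null;

  return { label, ms, rssMb, result };
}

// Stand-in workload; swap in a real doc-to-readable call to benchmark it.
measure('stand-in 100KB', async () => 'x'.repeat(100 * 1024).split('x').length)
  .then(({ label, ms, rssMb }) => {
    const memory = rssMb === null ? 'n/a' : rssMb.toFixed(0) + 'MB RSS';
    console.log(label + ': ' + ms.toFixed(1) + 'ms, ' + memory);
  });
```

Running each case several times and discarding the first run usually gives steadier numbers than a single measurement.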

Main Dependencies

@mozilla/readability: Extracts main article content from HTML.
turndown: Converts HTML to Markdown.
turndown-plugin-gfm: GitHub Flavored Markdown support for Turndown.
remark: Markdown processing (used for splitting and parsing).
dompurify: Sanitizes HTML input.
jsdom: Emulates browser DOM in Node.js for HTML parsing.
pdf.js: PDF to HTML conversion.


License

MIT

Patch update: API and types for splitReadableDocs and docToMarkdown improved for clarity and flexibility.


ilyashusterman/doc-to-readable
