Note: This tool was AI-generated. It applies best-effort heuristics that won't work perfectly on every PDF. Review the intermediate Markdown before converting. Use at your own risk.
Convert PDFs to Kindle-optimized EPUBs.

    brew install poppler pandoc uv

Python 3.11+ is also required (ships with macOS). If `pdftotext` or `pandoc`
are missing, the script will offer to install them via Homebrew.
For PDFs with custom font encodings (garbled text output), Tesseract OCR is used as an automatic fallback:

    brew install tesseract

Install the Python dependencies (includes `epubcheck` for EPUB validation):

    uv sync

Usage:

    chmod +x pdf2kindle.sh
    ./pdf2kindle.sh [options] input.pdf [output.epub]

Options:

| Flag | Description |
|---|---|
| `--title TEXT` | Book title (default: filename) |
| `--author TEXT` | Author name |
| `--no-pause` | Skip the manual review step |
| `--keep-md` | Keep the intermediate Markdown file |
| `--layout` | Use spatial layout mode (for single-column PDFs) |
| `--ocr` | Force OCR extraction via Tesseract |

Examples:

    # Basic conversion with manual review step
    ./pdf2kindle.sh report.pdf

    # Set metadata, keep the markdown, custom output name
    ./pdf2kindle.sh --title "My Report" --author "Jane Doe" --keep-md report.pdf out.epub

    # Fully automated (skip review)
    ./pdf2kindle.sh --no-pause --title "Quick Read" paper.pdf

    # Force OCR for a scanned or garbled PDF
    ./pdf2kindle.sh --ocr --title "Scanned Doc" scan.pdf

The pipeline runs in five stages:

- `pdftotext` extracts text in reading order (handles multi-column layouts)
- `extract.py` cleans up the output: strips repeated headers/footers, removes page numbers and TOC lines, collapses blank lines, detects likely headings, rejoins split paragraphs, dehyphenates broken words
- You review and edit the Markdown (the part machines can't reliably automate)
- `pandoc` converts to EPUB with a Kindle-optimized stylesheet and table of contents
- `qa_epub.py` runs deterministic validation checks against the final EPUB and reports any issues before you transfer to Kindle

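
The header/footer stripping mentioned in the cleanup stage can be pictured roughly as follows. This is a minimal sketch, not the code in `extract.py`: the function name `strip_repeated_lines` and the `min_fraction` threshold are assumptions made for illustration. It relies only on the fact that `pdftotext` separates pages with a form-feed character.

```python
from collections import Counter


def strip_repeated_lines(text: str, min_fraction: float = 0.5) -> str:
    """Drop lines that recur on at least `min_fraction` of the pages.

    Hypothetical helper: the threshold is an illustrative assumption,
    not a value taken from extract.py.
    """
    pages = text.split("\f")  # pdftotext emits a form feed between pages
    counts = Counter()
    for page in pages:
        # Count each distinct non-empty line at most once per page.
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    threshold = max(2, int(len(pages) * min_fraction))
    repeated = {line for line, n in counts.items() if n >= threshold}
    cleaned = [
        "\n".join(line for line in page.splitlines() if line.strip() not in repeated)
        for page in pages
    ]
    return "\f".join(cleaned)


sample = "ACME Report\nFirst page body.\fACME Report\nSecond page body.\fACME Report\nThird."
print(strip_repeated_lines(sample))
```

A rule like this can also remove legitimately repeated body lines, which is one reason the manual review step exists.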
If the extracted text looks garbled (common with PDFs that use custom font
encodings), the script automatically falls back to Tesseract OCR. You can also
force OCR with --ocr.
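
One plausible way to decide that extracted text "looks garbled" is to measure how much of it is ordinary letters and digits. The sketch below is an assumption about how such a check could work, not the script's actual test: `looks_garbled` and the `min_alnum_ratio` value are hypothetical.

```python
def looks_garbled(text: str, min_alnum_ratio: float = 0.6) -> bool:
    """Return True when too few non-space characters are letters or digits.

    Hypothetical heuristic; the 0.6 cut-off is an illustrative assumption.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return True  # nothing extracted at all: treat as needing OCR
    readable = sum(c.isalnum() for c in chars)
    return readable / len(chars) < min_alnum_ratio


print(looks_garbled("The quick brown fox jumps over the lazy dog."))
print(looks_garbled("\x01\x02#$%^&*()@@!!~~"))
```

A ratio test like this misses garbling that maps glyphs to the wrong letters, so forcing `--ocr` remains useful when the text is readable character-wise but still nonsense.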
Output files are written to the current working directory.
`pdf2kindle.sh` is the main conversion script. It orchestrates extraction → review → EPUB build → validation. See the Usage section for options.
`extract.py` is the standalone text extraction and cleanup script. It is called by `pdf2kindle.sh` but
can be run directly:

    python3 extract.py [--layout] [--ocr] [-t TITLE] [-a AUTHOR] [-o output.md] input.pdf

Heuristics applied: soft-hyphen removal, dehyphenation, repeated line (header/footer) detection, page-number stripping, TOC dot-leader removal, heading detection (numbered headings + ALL CAPS), paragraph rejoining.
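
Several of these heuristics can be approximated with a few regular expressions. The following is a rough sketch under stated assumptions, not the contents of `extract.py`: the patterns, the 80-character limit, and the names `clean_lines` and `is_heading` are all hypothetical.

```python
import re

SOFT_HYPHEN = "\u00ad"
# Assumed patterns, chosen for illustration only.
PAGE_NUMBER = re.compile(r"^\s*(?:page\s+)?\d{1,4}\s*$", re.IGNORECASE)
TOC_LEADER = re.compile(r"\.{4,}\s*\d+\s*$")
NUMBERED_HEADING = re.compile(r"^\d+(?:\.\d+)*\.?\s+\S")


def clean_lines(text: str) -> list[str]:
    """Remove soft hyphens, rejoin hyphenated words, drop page numbers and TOC lines."""
    text = text.replace(SOFT_HYPHEN, "")
    # Dehyphenate: "exam-\nple" becomes "example".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    kept = []
    for line in text.splitlines():
        if PAGE_NUMBER.match(line) or TOC_LEADER.search(line):
            continue
        kept.append(line.rstrip())
    return kept


def is_heading(line: str) -> bool:
    """Guess whether a line is a heading: numbered, or short and ALL CAPS."""
    stripped = line.strip()
    if not stripped or len(stripped) > 80:
        return False
    if NUMBERED_HEADING.match(stripped):
        return True
    letters = [c for c in stripped if c.isalpha()]
    return len(letters) >= 3 and all(c.isupper() for c in letters)


raw = "Intro ........ 3\n1 OVERVIEW\nThis is an exam-\nple of text.\n12\nCONCLUSION"
for line in clean_lines(raw):
    print(("# " if is_heading(line) else "") + line)
```

Both rules have obvious false positives: the dehyphenation step also joins genuine compounds such as "well-\nknown", and the numbered-heading pattern matches any sentence that starts with a number. Cases like these are what the manual review is for.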
`build_hybrid_markdown.py` is an advanced builder for visual-heavy PDFs (annual reports, data-rich documents). It produces a hybrid Markdown file that combines reflowable text with embedded, pre-rendered page images for pages that are primarily charts, maps, or tables.

    python3 build_hybrid_markdown.py \
      --title "My Report" --author "Jane Doe" \
      --image-dir images --image-prefix page \
      [--section PAGE:TITLE ...] \
      [--skip-pages RANGE ...] \
      [--toc-pages RANGE ...] \
      -o output.md input.pdf

| Option | Description |
|---|---|
| `--image-dir DIR` | Relative path (used in Markdown) to pre-rendered page images |
| `--image-prefix PREFIX` | Filename prefix for page images (default: `page`) |
| `--section PAGE:TITLE` | Insert a `# TITLE` heading before the given page number |
| `--skip-pages RANGE` | Pages or ranges to omit entirely (e.g. `1-3,7`) |
| `--toc-pages RANGE` | Pages where TOC dot-leader lines should be dropped |

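
To make the `RANGE` and `PAGE:TITLE` formats concrete, here is a small sketch of how such values could be parsed. The helper names `parse_page_ranges` and `parse_section` are hypothetical, and the real script may validate its arguments differently.

```python
def parse_page_ranges(spec: str) -> set[int]:
    """Expand a spec such as "1-3,7" into a set of page numbers."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid page range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return pages


def parse_section(spec: str) -> tuple[int, str]:
    """Split "PAGE:TITLE" on the first colon so titles may contain colons."""
    page_text, title = spec.split(":", 1)
    return int(page_text), title.strip()


print(sorted(parse_page_ranges("1-3,7")))
print(parse_section("5:Introduction"))
```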
Pages are classified automatically as visual-heavy based on alpha/digit
ratios, line-length distribution, and chart/map markers. Visual-heavy pages are
replaced by embedded <img> tags pointing to pre-rendered JPEGs; text pages
are reflowed normally.
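
As a rough illustration of this kind of classification, the sketch below flags a page when its text is mostly digits and symbols or its lines are very short. It is a simplified assumption, not the script's classifier: the name `is_visual_heavy` and both thresholds are hypothetical, and the chart/map marker check is omitted.

```python
from statistics import median


def is_visual_heavy(page_text: str,
                    max_alpha_ratio: float = 0.55,
                    max_median_line_len: int = 25) -> bool:
    """Guess whether a page is mostly a chart, map, or table.

    Hypothetical thresholds, chosen for illustration only.
    """
    lines = [line.strip() for line in page_text.splitlines() if line.strip()]
    if not lines:
        return True  # no extractable text: likely a full-page image
    chars = [c for line in lines for c in line if not c.isspace()]
    alpha_ratio = sum(c.isalpha() for c in chars) / len(chars)
    median_len = median(len(line) for line in lines)
    return alpha_ratio < max_alpha_ratio or median_len < max_median_line_len


prose = (
    "This paragraph describes the survey methodology in full sentences.\n"
    "It continues with another reasonably long line of running text."
)
chart = "0\n10\n20\n30\n2019\n2020\n2021\n45.2%"
print(is_visual_heavy(prose), is_visual_heavy(chart))
```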
Typical workflow for visual PDFs:

    # 1. Pre-render all pages to JPEG (requires pdftoppm from poppler)
    mkdir -p images
    pdftoppm -jpeg -r 150 report.pdf images/page

    # 2. Build hybrid Markdown
    python3 build_hybrid_markdown.py \
      --title "Annual Report 2024" --author "ASER" \
      --image-dir images --image-prefix page \
      --section 5:"Introduction" --section 12:"Results" \
      --toc-pages 2-4 --skip-pages 1 \
      -o report.md report.pdf

    # 3. Convert to EPUB
    pandoc report.md -o report.epub \
      --css=kindle.css --split-level=1 --toc --toc-depth=3 \
      --metadata title="Annual Report 2024" \
      --metadata creator="ASER" --metadata lang="en"

`qa_epub.py` is the deterministic EPUB quality-assurance checker. It is run automatically by
`pdf2kindle.sh` after every build, or you can invoke it directly:

    uv run python qa_epub.py output.epub [--source-md output.md]

Checks performed:

| Check | Description |
|---|---|
| EPUBCheck validation | W3C schema conformance via the epubcheck Python package |
| Archive integrity | ZIP validity, mimetype, META-INF/container.xml present |
| Package/manifest | Spine itemrefs resolve to manifest items |
| Navigation | Nav document exists, is parseable, contains links |
| Internal links | All href targets and fragment anchors resolve |
| Images | All <img src> targets exist in the archive |
| Stylesheets | CSS files are linked and present in the archive |
| Placeholder text | Fallback marker strings are not left in final output |
| Split URLs | URLs broken across lines are flagged |
Output follows the CONVERSION_QA_CHECKLIST.md format: failed items only,
with evidence, impact, and suggested fix. Exit code 0 = all clear;
1 = issues found.
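
The archive-integrity check is the easiest one to picture. The sketch below uses only the standard-library `zipfile` module and is not the code in `qa_epub.py`: `check_archive` and `exit_code` are hypothetical names, and the real checker reports evidence, impact, and a suggested fix rather than plain strings.

```python
import io
import zipfile


def check_archive(epub_source) -> list[str]:
    """Return a list of archive-level problems; an empty list means all clear."""
    try:
        archive = zipfile.ZipFile(epub_source)
    except zipfile.BadZipFile:
        return ["not a valid ZIP archive"]
    issues = []
    with archive:
        names = archive.namelist()
        # The EPUB container format requires `mimetype` as the first entry.
        if not names or names[0] != "mimetype":
            issues.append("mimetype must be the first entry in the archive")
        elif archive.read("mimetype").strip() != b"application/epub+zip":
            issues.append("mimetype content is not application/epub+zip")
        if "META-INF/container.xml" not in names:
            issues.append("META-INF/container.xml is missing")
    return issues


def exit_code(issues: list[str]) -> int:
    """Mirror the documented convention: 0 = all clear, 1 = issues found."""
    return 1 if issues else 0


# Build a tiny in-memory archive to exercise the check.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as z:
    z.writestr("mimetype", "application/epub+zip")
    z.writestr("META-INF/container.xml", "<container/>")
demo.seek(0)
print(exit_code(check_archive(demo)))
```

`check_archive` accepts either a file path or a file-like object, so the same function works on a real `output.epub`.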
`kindle.css` is the stylesheet embedded in every generated EPUB. It is optimised for e-ink readability:
font-size, line-height, and margin tuning, plus `.visual-page` rules that
constrain preserved page images to fit Kindle screen widths.
PDFs vary widely. After the script pauses, open the .md file and check:

- Headings: promote/demote `##`/`###` as needed
- Paragraphs: fix incorrectly joined or split lines
- Callout boxes: wrap in `>` blockquote syntax
- Lists: numbered/bulleted lists may need reformatting
- Artifacts: remove garbled characters or stray symbols

CONVERSION_QA_CHECKLIST.md is the mandatory go/no-go gate used by the agent
for every conversion. It covers preflight, extraction sanity, structural
quality, artifact cleanup, navigation, metadata, technical validity, and
reading-quality spot-checks. qa_epub.py automates the deterministic subset;
the structural and reading-quality sections require human review.