Docling silently produces 0-byte markdown for a single-page PDF #3419

@billdenney

Affects: Docling 2.92.0 (Docling Core 2.74.1, IBM Models 3.13.2,
Parse 5.10.1) on Python 3.10, Linux, NVIDIA CUDA.

TL;DR

Running Docling on a small (1.2 KB) single-page PDF that contains an
FDA electronic-signature manifestation page produces a 0-byte
markdown file, while Docling logs "Processed 1 docs, of which 0 failed"
and exits with return code 0. The success log is misleading: no
markdown content is actually written.

pdftotext and pikepdf both render the page's text fine, so the
input is well-formed.

Source

The bug was triggered by page 46 of an FDA Drug Approval review
package fetched from:

https://www.accessdata.fda.gov/drugsatfda_docs/nda/2007/016608s098,020712s029,021710_ClinRev.pdf

(FDA NDA 016608 / 020712 / 021710 Clinical Review, 46 pages,
1,014,458 bytes)

Page 46 is the standard FDA "electronic-signature manifestation"
page that closes many review documents.

We extracted page 46 alone via pikepdf (preserving original bytes,
no re-encoding) so the reproducer is minimal:

import pikepdf
with pikepdf.open("016608s098,020712s029,021710_ClinRev.pdf") as src, \
     pikepdf.new() as out:
    out.pages.append(src.pages[45])  # page 46, 0-indexed
    out.save("input.pdf")

The resulting input.pdf (attached, 1,228 bytes) is the reproducer.

Reproducer

Easiest: pip-installed Docling

pip install docling==2.92.0
mkdir -p in out
cp input.pdf in/

docling in/ --to md --output out --device cuda --no-abort-on-error
# OR for the same flags our production worker uses:
docling in/ --to md --output out --device cuda --num-threads 4 \
    --no-abort-on-error --image-export-mode placeholder

ls -la out/
# Expected: a non-empty .md file
# Actual:   a 0-byte .md file

The bug reproduces with or without --no-abort-on-error and with or
without --image-export-mode placeholder. The minimal flag set that
triggers it is --device cuda; CPU-only runs may behave differently
(we have not tested CPU-only since our production deployment is GPU).

Docker (minimal stock image, no custom build)

# Build an ad-hoc image with docling 2.92.0 from pip, then run it.
docker build -t docling-bug-test - <<'DOCKERFILE'
FROM nvidia/cuda:12.4.0-runtime-ubuntu22.04
RUN apt-get update -qq && apt-get install -y -qq python3-pip \
    && rm -rf /var/lib/apt/lists/*
RUN pip3 install --no-cache-dir docling==2.92.0
ENTRYPOINT ["docling"]
DOCKERFILE

mkdir -p in out
cp input.pdf in/

docker run --rm --gpus all -v "$(pwd):/data" docling-bug-test \
    /data/in --to md --output /data/out \
    --device cuda --num-threads 4 --no-abort-on-error \
    --image-export-mode placeholder

ls -la out/

This recipe deliberately uses a stock CUDA image and pip install
docling==2.92.0, with no project-specific Dockerfile or pre-cached
models, so any reader can reproduce on a GPU host with Docker and the
NVIDIA Container Toolkit. The first run takes ~2 minutes to build the
image; subsequent runs reuse the cached image and skip the build.

(The bug was originally observed inside our own pre-built image
docling-gpu:latest — a CUDA 12.4 base with Docling 2.92.0 and all
its models pre-cached for cold-start speed — but the bug is in
Docling itself, not in our packaging. The pip-based reproducer above
runs the same Docling 2.92.0 from upstream.)

Observed behavior

Verbose output (-vv):

2026-05-08 12:01:21 INFO  docling.pipeline.base_pipeline: Processing document <sha>.pdf
2026-05-08 12:01:21 DEBUG docling.pipeline.standard_pdf_pipeline: PIPELINE_PROFILING Stage preprocess: ... duration=0.049s
2026-05-08 12:01:21 DEBUG docling.pipeline.standard_pdf_pipeline: PIPELINE_PROFILING Stage ocr:        ... duration=0.124s
2026-05-08 12:01:23 DEBUG docling.pipeline.standard_pdf_pipeline: PIPELINE_PROFILING Stage layout:     ... duration=1.175s
2026-05-08 12:01:23 DEBUG docling.pipeline.standard_pdf_pipeline: PIPELINE_PROFILING Stage table:      ... duration=0.000s
2026-05-08 12:01:23 DEBUG docling.pipeline.standard_pdf_pipeline: PIPELINE_PROFILING Stage assemble:   ... duration=0.000s
2026-05-08 12:01:23 INFO  docling.document_converter: Finished converting document <sha>.pdf in 3.88 sec.
2026-05-08 12:01:23 INFO  docling.cli.main: writing Markdown output to /data/out/<sha>.md
2026-05-08 12:01:23 INFO  docling.cli.main: Processed 1 docs, of which 0 failed
2026-05-08 12:01:23 INFO  docling.cli.main: All documents were converted in 3.89 seconds.

$ ls -l /data/out/<sha>.md
-rw-r--r-- 1 bill bill 0 May  8 12:21 <sha>.md

The full pipeline ran (preprocess → ocr → layout → table → assemble),
no exception was raised, the output file was created — but it's
0 bytes.
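The mismatch between the summary line and the file on disk can be checked mechanically. The following is a minimal sketch, assuming the summary line keeps the exact wording shown in the log above; `parse_summary` and `summary_is_misleading` are hypothetical helper names, not part of Docling.

```python
import re

# Matches the CLI summary line quoted in the log above.
# Assumption: the wording "Processed N docs, of which M failed" is stable.
SUMMARY_RE = re.compile(r"Processed (\d+) docs, of which (\d+) failed")


def parse_summary(log_text):
    """Return (processed, failed) from a Docling CLI log, or None if absent."""
    match = SUMMARY_RE.search(log_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def summary_is_misleading(log_text, empty_outputs):
    """True when the log reports zero failures but some outputs are empty.

    empty_outputs is the number of 0-byte output files, counted by the caller.
    """
    summary = parse_summary(log_text)
    return summary is not None and summary[1] == 0 and empty_outputs > 0


log_line = (
    "2026-05-08 12:01:23 INFO  docling.cli.main: "
    "Processed 1 docs, of which 0 failed"
)
print(parse_summary(log_line))
print(summary_is_misleading(log_line, empty_outputs=1))
```

For the run above, the summary reports 1 processed and 0 failed while one output file is empty, so the check flags the log as misleading.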

Expected behavior

Either:

  1. Docling produces non-empty markdown for this page (the input
    has extractable content — pdftotext returns a clean record
    below), OR
  2. Docling surfaces an error if it cannot render the page — the
    silent 0-byte success is the worst-of-both-worlds failure mode (no
    error to act on, but no usable output either, and a downstream
    pipeline trusting the "0 failed" signal will silently ingest empty
    content).
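Until one of these is in place, a downstream pipeline could guard against the silent failure itself. The following is a rough sketch of such a guard, not a Docling feature: `find_empty_markdown` and `check_outputs` are hypothetical names, and treating whitespace-only files as empty is an assumption beyond the 0-byte case reported here.

```python
from pathlib import Path


def find_empty_markdown(out_dir):
    """Return sorted .md files under out_dir that are empty or whitespace-only."""
    empty = []
    for path in sorted(Path(out_dir).rglob("*.md")):
        if path.stat().st_size == 0:
            # The exact case from this report: a 0-byte file.
            empty.append(path)
        elif not path.read_text(encoding="utf-8", errors="replace").strip():
            # Assumption: whitespace-only output is equally unusable.
            empty.append(path)
    return empty


def check_outputs(out_dir):
    """Return an exit code: 1 if any markdown output is empty, else 0."""
    empty = find_empty_markdown(out_dir)
    for path in empty:
        print(f"empty markdown output: {path}")
    return 1 if empty else 0
```

Calling `sys.exit(check_outputs("out"))` after the docling command would turn the silent 0-byte result into a non-zero exit status that the pipeline can act on.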

pdftotext output for the same PDF (proves the input is renderable):

--------------------------------------------------------------------------------------------------------------------This is a representation of an electronic record that was signed electronically and
this page is the manifestation of the electronic signature.
--------------------------------------------------------------------------------------------------------------------/s/
--------------------Ronald Farkas
10/11/2007 05:28:47 PM
MEDICAL OFFICER

John Feeney
10/26/2007 10:34:59 PM
MEDICAL OFFICER
Concur

What's in doc17189_p46/

File             Description
input.pdf        The minimal 1,228-byte 1-page PDF reproducer. Also re-derivable from the FDA URL above by extracting page 46.
command.sh       Self-contained reproducer script. Works on any GPU host with Docker + NVIDIA toolkit; builds a stock docling image and runs the failing case. Exits non-zero with a clear message if the bug reproduces (.md is 0 bytes).
info.json        Machine-readable provenance: source FDA URL, source doc, observed behavior, expected behavior.
stderr.log       Full stderr of a verbose-mode (-vv) Docling run on the input.
stdout.log       Stdout (empty, since Docling logs to stderr).
empty_output.md  The 0-byte .md Docling produced.

Environment where originally observed

  • Image: docling-gpu:latest (custom local build — CUDA 12.4 base
    with Docling 2.92.0 and all its models pre-cached, equivalent to
    the pip-based reproducer above)
  • Host: 31 GiB system RAM, 8 GiB GPU (NVIDIA), AMD Ryzen Threadripper PRO 3955WX
  • Docker memory limit: --memory 12g --memory-swap 12g (well under
    what Docling needs for full-document conversions; not a memory
    issue — bug also reproduces with no memory cap)
  • Docling CLI flags: --device cuda --num-threads 4 --no-abort-on-error --image-export-mode placeholder

docling-reprex-doc17189_p46.tar.gz
