Docling silently produces 0-byte markdown for a single-page PDF

**Affects**: Docling 2.92.0 (Docling Core 2.74.1, IBM Models 3.13.2,
Parse 5.10.1) on Python 3.10, Linux, NVIDIA CUDA.

## TL;DR

Running Docling on a small (1.2 KB) single-page PDF that contains an
FDA electronic-signature manifestation page produces a **0-byte
markdown file** while Docling logs `"Processed 1 docs, of which 0
failed"` and exits with returncode `0`. The success log is misleading:
no markdown content is actually written.

`pdftotext` and `pikepdf` both render the page's text fine, so the
input is well-formed.

## Source

The bug was triggered by **page 46** of an FDA Drug Approval review
package fetched from:

> <https://www.accessdata.fda.gov/drugsatfda_docs/nda/2007/016608s098,020712s029,021710_ClinRev.pdf>
>
> (FDA NDA 016608 / 020712 / 021710 Clinical Review, 46 pages,
> 1,014,458 bytes)

Page 46 is the standard FDA "electronic-signature manifestation"
page that closes many review documents.

We extracted page 46 alone via `pikepdf` (preserving original bytes,
no re-encoding) so the reproducer is minimal:

```python
import pikepdf
with pikepdf.open("016608s098,020712s029,021710_ClinRev.pdf") as src, \
     pikepdf.new() as out:
    out.pages.append(src.pages[45])  # page 46, 0-indexed
    out.save("input.pdf")
```

The resulting `input.pdf` (attached, 1,228 bytes) is the reproducer.

## Reproducer

### Easiest: pip-installed Docling

```bash
pip install docling==2.92.0
mkdir -p in out
cp input.pdf in/

docling in/ --to md --output out --device cuda --no-abort-on-error
# OR for the same flags our production worker uses:
docling in/ --to md --output out --device cuda --num-threads 4 \
    --no-abort-on-error --image-export-mode placeholder

ls -la out/
# Expected: a non-empty .md file
# Actual:   a 0-byte .md file
```

The bug reproduces with or without `--no-abort-on-error` and with or
without `--image-export-mode placeholder`. The minimal flag set that
triggers it is `--device cuda`; CPU-only runs may behave differently
(we have not tested CPU-only since our production deployment is GPU).

### Docker (minimal stock image, no custom build)

```bash
# One-liner: build an ad-hoc image with docling 2.92.0 from pip and run it.
docker build -t docling-bug-test - <<'DOCKERFILE'
FROM nvidia/cuda:12.4.0-runtime-ubuntu22.04
RUN apt-get update -qq && apt-get install -y -qq python3-pip \
    && rm -rf /var/lib/apt/lists/*
RUN pip3 install --no-cache-dir docling==2.92.0
ENTRYPOINT ["docling"]
DOCKERFILE

mkdir -p in out
cp input.pdf in/

docker run --rm --gpus all -v "$(pwd):/data" docling-bug-test \
    /data/in --to md --output /data/out \
    --device cuda --num-threads 4 --no-abort-on-error \
    --image-export-mode placeholder

ls -la out/
```

This recipe deliberately uses a stock CUDA image and `pip install
docling==2.92.0` — no project-specific Dockerfile or pre-cached models,
so any reader can reproduce on a GPU host with Docker + nvidia-toolkit.
First run takes ~2 minutes to build; subsequent runs are instant.

(The bug was originally observed inside our own pre-built image
`docling-gpu:latest` — a CUDA 12.4 base with Docling 2.92.0 and all
its models pre-cached for cold-start speed — but the bug is in
Docling itself, not in our packaging. The pip-based reproducer above
runs the same Docling 2.92.0 from upstream.)

## Observed behavior

Verbose output (`-vv`):

```
2026-05-08 12:01:21 INFO  docling.pipeline.base_pipeline: Processing document <sha>.pdf
2026-05-08 12:01:21 DEBUG docling.pipeline.standard_pdf_pipeline: PIPELINE_PROFILING Stage preprocess: ... duration=0.049s
2026-05-08 12:01:21 DEBUG docling.pipeline.standard_pdf_pipeline: PIPELINE_PROFILING Stage ocr:        ... duration=0.124s
2026-05-08 12:01:23 DEBUG docling.pipeline.standard_pdf_pipeline: PIPELINE_PROFILING Stage layout:     ... duration=1.175s
2026-05-08 12:01:23 DEBUG docling.pipeline.standard_pdf_pipeline: PIPELINE_PROFILING Stage table:      ... duration=0.000s
2026-05-08 12:01:23 DEBUG docling.pipeline.standard_pdf_pipeline: PIPELINE_PROFILING Stage assemble:   ... duration=0.000s
2026-05-08 12:01:23 INFO  docling.document_converter: Finished converting document <sha>.pdf in 3.88 sec.
2026-05-08 12:01:23 INFO  docling.cli.main: writing Markdown output to /data/out/<sha>.md
2026-05-08 12:01:23 INFO  docling.cli.main: Processed 1 docs, of which 0 failed
2026-05-08 12:01:23 INFO  docling.cli.main: All documents were converted in 3.89 seconds.

$ ls -l /data/out/<sha>.md
-rw-r--r-- 1 bill bill 0 May  8 12:21 <sha>.md
```

The full pipeline ran (preprocess → ocr → layout → table → assemble),
no exception was raised, the output file was created — but it's
**0 bytes**.

## Expected behavior

Either:
1. **Docling produces non-empty markdown** for this page (the input
   has extractable content — `pdftotext` returns a clean record
   below), OR
2. **Docling surfaces an error** if it cannot render the page — the
   silent 0-byte success is the worst-of-both-worlds failure mode (no
   error to act on, but no usable output either, and a downstream
   pipeline trusting the "0 failed" signal will silently ingest empty
   content).

`pdftotext` output for the same PDF (proves the input is renderable):

```
--------------------------------------------------------------------------------------------------------------------This is a representation of an electronic record that was signed electronically and
this page is the manifestation of the electronic signature.
--------------------------------------------------------------------------------------------------------------------/s/
--------------------Ronald Farkas
10/11/2007 05:28:47 PM
MEDICAL OFFICER

John Feeney
10/26/2007 10:34:59 PM
MEDICAL OFFICER
Concur
```

## What's in `doc17189_p46/`

| File | Description |
|---|---|
| `input.pdf` | The minimal 1,228-byte 1-page PDF reproducer. *Also re-derivable from the FDA URL above by extracting page 46.* |
| `command.sh` | Self-contained reproducer script — works on any GPU host with Docker + NVIDIA toolkit; builds a stock docling image and runs the failing case. Exits non-zero with a clear message if the bug reproduces (.md is 0 bytes). |
| `info.json` | Machine-readable provenance: source FDA URL, source doc, observed behavior, expected behavior |
| `stderr.log` | Full stderr of a verbose-mode (`-vv`) Docling run on the input |
| `stdout.log` | Stdout (empty — Docling logs to stderr) |
| `empty_output.md` | The 0-byte `.md` Docling produced |

## Environment where originally observed

- Image: `docling-gpu:latest` (custom local build — CUDA 12.4 base
  with Docling 2.92.0 and all its models pre-cached, equivalent to
  the pip-based reproducer above)
- Host: 31 GiB system RAM, 8 GiB GPU (NVIDIA), AMD Ryzen Threadripper PRO 3955WX
- Docker memory limit: `--memory 12g --memory-swap 12g` (well under
  what Docling needs for full-document conversions; not a memory
  issue — bug also reproduces with no memory cap)
- Docling CLI flags: `--device cuda --num-threads 4 --no-abort-on-error
  --image-export-mode placeholder`

[docling-reprex-doc17189_p46.tar.gz](https://github.com/user-attachments/files/27558829/docling-reprex-doc17189_p46.tar.gz)

File	Description
`input.pdf`	The minimal 1,228-byte 1-page PDF reproducer. Also re-derivable from the FDA URL above by extracting page 46.
`command.sh`	Self-contained reproducer script — works on any GPU host with Docker + NVIDIA toolkit; builds a stock docling image and runs the failing case. Exits non-zero with a clear message if the bug reproduces (.md is 0 bytes).
`info.json`	Machine-readable provenance: source FDA URL, source doc, observed behavior, expected behavior
`stderr.log`	Full stderr of a verbose-mode (`-vv`) Docling run on the input
`stdout.log`	Stdout (empty — Docling logs to stderr)
`empty_output.md`	The 0-byte `.md` Docling produced

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Docling silently produces 0-byte markdown for a single-page PDF #3419

TL;DR

Source

Reproducer

Easiest: pip-installed Docling

Docker (minimal stock image, no custom build)

Observed behavior

Expected behavior

What's in `doc17189_p46/`

Environment where originally observed

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Docling silently produces 0-byte markdown for a single-page PDF #3419

Description

TL;DR

Source

Reproducer

Easiest: pip-installed Docling

Docker (minimal stock image, no custom build)

Observed behavior

Expected behavior

What's in doc17189_p46/

Environment where originally observed

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

What's in `doc17189_p46/`