Commit c3d8b1d

Default to tesseract for OCR (faster than ocrmypdf)
1 parent 0146964 commit c3d8b1d

10 files changed, +244 -123 lines changed

README.md

+13 -12
@@ -40,6 +40,16 @@ The above results are with marker and nougat setup so they each take ~3GB of VRA
 
 See [below](#benchmarks) for detailed speed and accuracy benchmarks, and instructions on how to run your own benchmarks.
 
+# Limitations
+
+PDF is a tricky format, so marker will not always work perfectly. Here are some known limitations that are on the roadmap to address:
+
+- Marker will convert fewer equations to latex than nougat. This is because it has to first detect equations, then convert them without hallucation.
+- Whitespace and indentations are not always respected.
+- Not all lines/spans will be joined properly.
+- Only languages similar to English (Spanish, French, German, Russian, etc) are supported. Languages with different character sets (Chinese, Japanese, Korean, etc) are not.
+- This works best on digital PDFs that won't require a lot of OCR. It's optimized for speed, and limited OCR is used to fix errors.
+
 # Installation
 
 This has been tested on Mac and Linux (Ubuntu and Debian). You'll need python 3.9+ and [poetry](https://python-poetry.org/docs/#installing-with-the-official-installer).
@@ -82,8 +92,9 @@ First, some configuration:
 - Set your torch device in the `local.env` file. For example, `TORCH_DEVICE=cuda` or `TORCH_DEVICE=mps`. `cpu` is the default.
 - If using GPU, set `INFERENCE_RAM` to your GPU VRAM (per GPU). For example, if you have 16 GB of VRAM, set `INFERENCE_RAM=16`.
 - Depending on your document types, marker's average memory usage per task can vary slightly. You can configure `VRAM_PER_TASK` to adjust this if you notice tasks failing with GPU out of memory errors.
-- By default, the final editor model is off. Turn it on with `ENABLE_EDITOR_MODEL`.
-- Inspect the settings in `marker/settings.py`. You can override any settings in the `local.env` file, or by setting environment variables.
+- Inspect the other settings in `marker/settings.py`. You can override any settings in the `local.env` file, or by setting environment variables.
+- By default, the final editor model is off. Turn it on with `ENABLE_EDITOR_MODEL`.
+- By default, marker will use ocrmypdf for OCR, which is slower than base tesseract, but higher quality. You can change this with the `OCR_ENGINE` setting.
 
 ## Convert a single file
 
@@ -178,16 +189,6 @@ This will benchmark marker against other text extraction methods. It sets up ba
 
 Omit `--nougat` to exclude nougat from the benchmark. I don't recommend running nougat on CPU, since it is very slow.
 
-# Limitations
-
-PDF is a tricky format, so marker will not always work perfectly. Here are some known limitations that are on the roadmap to address:
-
-- Marker will convert fewer equations to latex than nougat. This is because it has to first detect equations, then convert them without hallucation.
-- Whitespace and indentations are not always respected.
-- Not all lines/spans will be joined properly.
-- Only languages similar to English (Spanish, French, German, Russian, etc) are supported. Languages with different character sets (Chinese, Japanese, Korean, etc) are not.
-- This works best on digital PDFs that won't require a lot of OCR. It's optimized for speed, and limited OCR is used to fix errors.
-
 # Commercial usage
 
 Due to the licensing of the underlying models like layoutlmv3 and nougat, this is only suitable for noncommercial usage.
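
The configuration bullets in the README diff name several settings that can be overridden in `local.env`. As a rough sketch of what such a file could contain, the Python helper below renders overrides as `KEY=VALUE` lines, the form the README itself shows (`TORCH_DEVICE=cuda`, `INFERENCE_RAM=16`). `render_local_env` is a hypothetical helper and not part of marker, and the `tesseract` value for `OCR_ENGINE` is an assumption based on the commit title; `marker/settings.py` is the place to check which values are accepted.

```python
from typing import Mapping, Union


def render_local_env(overrides: Mapping[str, Union[str, int]]) -> str:
    """Render settings as KEY=VALUE lines (hypothetical helper, not marker's API)."""
    return "".join(f"{key}={value}\n" for key, value in overrides.items())


# Setting names come from the README; the OCR_ENGINE value is an assumption.
example = render_local_env({
    "TORCH_DEVICE": "cuda",
    "INFERENCE_RAM": 16,
    "OCR_ENGINE": "tesseract",
})
print(example, end="")
```

The printed text could be saved as `local.env`, or the same names could be exported as environment variables, which the README says also override the defaults.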

convert.py

+2 -2
@@ -1,6 +1,6 @@
 import argparse
 import os
-from typing import Dict
+from typing import Dict, Optional
 
 import ray
 from tqdm import tqdm
@@ -17,7 +17,7 @@
 
 
 @ray.remote(num_cpus=settings.RAY_CORES_PER_WORKER, num_gpus=.05 if settings.CUDA else 0)
-def process_single_pdf(fname: str, out_folder: str, model_refs, metadata: Dict | None=None, min_length: int | None = None):
+def process_single_pdf(fname: str, out_folder: str, model_refs, metadata: Optional[Dict] = None, min_length: Optional[int] = None):
     out_filename = fname.rsplit(".", 1)[0] + ".md"
     out_filename = os.path.join(out_folder, os.path.basename(out_filename))
     out_meta_filename = out_filename.rsplit(".", 1)[0] + "_meta.json"
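
The signature change in this hunk swaps `Dict | None` for `Optional[Dict]`. The commit does not say why, but a likely reason is that the `X | None` annotation syntax is only evaluated successfully on Python 3.10+, while the README targets Python 3.9+. The sketch below is a simplified stand-in for `process_single_pdf`: `plan_outputs` is a hypothetical name, it omits `model_refs` and the Ray decorator, and it only reproduces the three path-derivation lines visible in the hunk.

```python
import os
from typing import Dict, Optional, Tuple


def plan_outputs(
    fname: str,
    out_folder: str,
    metadata: Optional[Dict] = None,   # accepted only to mirror the new signature
    min_length: Optional[int] = None,  # Optional[...] also parses on Python 3.9
) -> Tuple[str, str]:
    """Return the markdown and metadata output paths for one input file."""
    # Replace the final extension with .md, then move the name into out_folder.
    out_filename = fname.rsplit(".", 1)[0] + ".md"
    out_filename = os.path.join(out_folder, os.path.basename(out_filename))
    # The metadata file sits next to the markdown file with a _meta.json suffix.
    out_meta_filename = out_filename.rsplit(".", 1)[0] + "_meta.json"
    return out_filename, out_meta_filename


md_path, meta_path = plan_outputs("/data/in/paper.v2.pdf", "out")
print(md_path)
print(meta_path)
```

Because `rsplit(".", 1)` splits only on the last dot, a name such as `paper.v2.pdf` keeps its inner dot in both output names.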
