README.md (+13 -12)
@@ -40,6 +40,16 @@ The above results are with marker and nougat setup so they each take ~3GB of VRA
 
 See [below](#benchmarks) for detailed speed and accuracy benchmarks, and instructions on how to run your own benchmarks.
 
+# Limitations
+
+PDF is a tricky format, so marker will not always work perfectly. Here are some known limitations that are on the roadmap to address:
+
+- Marker will convert fewer equations to LaTeX than nougat. This is because it has to first detect equations, then convert them without hallucination.
+- Whitespace and indentations are not always respected.
+- Not all lines/spans will be joined properly.
+- Only languages similar to English (Spanish, French, German, Russian, etc.) are supported. Languages with different character sets (Chinese, Japanese, Korean, etc.) are not.
+- This works best on digital PDFs that won't require a lot of OCR. It's optimized for speed, and limited OCR is used to fix errors.
+
 # Installation
 
 This has been tested on Mac and Linux (Ubuntu and Debian). You'll need Python 3.9+ and [poetry](https://python-poetry.org/docs/#installing-with-the-official-installer).
@@ -82,8 +92,9 @@ First, some configuration:
 - Set your torch device in the `local.env` file. For example, `TORCH_DEVICE=cuda` or `TORCH_DEVICE=mps`. `cpu` is the default.
 - If using GPU, set `INFERENCE_RAM` to your GPU VRAM (per GPU). For example, if you have 16 GB of VRAM, set `INFERENCE_RAM=16`.
 - Depending on your document types, marker's average memory usage per task can vary slightly. You can configure `VRAM_PER_TASK` to adjust this if you notice tasks failing with GPU out of memory errors.
-- By default, the final editor model is off. Turn it on with `ENABLE_EDITOR_MODEL`.
-- Inspect the settings in `marker/settings.py`. You can override any settings in the `local.env` file, or by setting environment variables.
+- Inspect the other settings in `marker/settings.py`. You can override any settings in the `local.env` file, or by setting environment variables.
+- By default, the final editor model is off. Turn it on with `ENABLE_EDITOR_MODEL`.
+- By default, marker will use ocrmypdf for OCR, which is slower than base tesseract, but higher quality. You can change this with the `OCR_ENGINE` setting.
 
 ## Convert a single file
 
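
Taken together, the configuration bullets in this hunk could translate into a `local.env` along these lines. This is a minimal sketch, not a file from the repository: the setting names come from the README text, but every value shown (including the `true` spelling for the boolean and the commented-out `VRAM_PER_TASK` number) is an assumption to check against `marker/settings.py`.

```shell
# local.env (hypothetical example values)

# Torch device: cuda or mps; the README states cpu is the default.
TORCH_DEVICE=cuda

# GPU VRAM in GB, per GPU (the README's example uses a 16 GB card).
INFERENCE_RAM=16

# Assumed value, shown only for illustration; the README suggests adjusting
# this if tasks fail with GPU out-of-memory errors.
# VRAM_PER_TASK=4

# The final editor model is off by default; the boolean spelling is an assumption.
ENABLE_EDITOR_MODEL=true

# ocrmypdf is the stated default OCR engine; other accepted values are not
# listed in this diff, so none are shown here.
OCR_ENGINE=ocrmypdf
```

Per the README text, the same names can instead be set as environment variables, which also override the defaults in `marker/settings.py`.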
@@ -178,16 +189,6 @@ This will benchmark marker against other text extraction methods. It sets up ba
 
 Omit `--nougat` to exclude nougat from the benchmark. I don't recommend running nougat on CPU, since it is very slow.
 
-# Limitations
-
-PDF is a tricky format, so marker will not always work perfectly. Here are some known limitations that are on the roadmap to address:
-
-- Marker will convert fewer equations to LaTeX than nougat. This is because it has to first detect equations, then convert them without hallucination.
-- Whitespace and indentations are not always respected.
-- Not all lines/spans will be joined properly.
-- Only languages similar to English (Spanish, French, German, Russian, etc.) are supported. Languages with different character sets (Chinese, Japanese, Korean, etc.) are not.
-- This works best on digital PDFs that won't require a lot of OCR. It's optimized for speed, and limited OCR is used to fix errors.
-
 # Commercial usage
 
 Due to the licensing of the underlying models like layoutlmv3 and nougat, this is only suitable for noncommercial usage.