Commit 13fe745

committed Nov 29, 2023
Update benchmarks
1 parent 88f20eb commit 13fe745

6 files changed, +55 -19 lines changed

‎.gitignore

-1
@@ -4,7 +4,6 @@ local.env
44
experiments
55
test_data
66
training
7-
benchmark_data
87
wandb
98

109
# Byte-compiled / optimized / DLL files

‎README.md

+26-14
@@ -51,14 +51,14 @@ First, clone the repo:
5151
- Install ghostscript > 9.55 by following [these instructions](https://ghostscript.readthedocs.io/en/latest/Install.html) or running `install/ghostscript_install.sh`.
5252
- Install other requirements with `cat install/apt-requirements.txt | xargs sudo apt-get install -y`
5353
- Set the tesseract data folder path
54-
- Find the tesseract data folder `tessdata` with `find / -name tessdata`. Make sure to use the one corresponding to the right tesseract version if you have multiple!
54+
- Find the tesseract data folder `tessdata` with `find / -name tessdata`. Make sure to use the one corresponding to the latest tesseract version if you have multiple!
5555
- Create a `local.env` file in the root `marker` folder with `TESSDATA_PREFIX=/path/to/tessdata` inside it
5656
- Install python requirements
5757
- `poetry install`
5858
- `poetry shell` to activate your poetry venv
5959
- Update pytorch as needed since poetry doesn't play nicely with it
6060
- GPU only: run `pip install torch` to install other torch dependencies.
61-
- CPU only: Uninstall torch, then follow the [CPU install](https://pytorch.org/) instructions.
61+
- CPU only: Uninstall torch, then follow the [CPU install](https://pytorch.org/get-started/locally/) instructions.
6262

6363
## Mac
6464

@@ -126,7 +126,7 @@ METADATA_FILE=../pdf_meta.json NUM_DEVICES=4 NUM_WORKERS=35 bash chunk_convert.s
126126

127127
# Benchmarks
128128

129-
Benchmarking PDF extraction quality is hard. I've created a test set by finding books and scientific papers that have a pdf version and a latex source. I converted the latex to text, and compared the reference to the output of text extraction methods.
129+
Benchmarking PDF extraction quality is hard. I've created a test set by finding books and scientific papers that have a pdf version and a latex source. I converted the latex to text, and compare the reference to the output of text extraction methods.
130130

131131
Benchmarks show that marker is 10x faster than nougat, and more accurate outside arXiv (nougat was trained on arXiv data).
132132

@@ -142,21 +142,21 @@ nougat 0.614548 810.756
142142

143143
**Accuracy**
144144

145-
First 4 are non-arXiv books, last 3 are arXiv papers.
145+
First 3 are non-arXiv books, last 3 are arXiv papers.
146146

147-
Method      thinkos.pdf    thinkdsp.pdf    thinkpython.pdf    paip.pdf    switch_trans.pdf    crowd.pdf    multicolcnn.pdf
148-
--------  -------------  --------------  -----------------  ----------  ------------------  -----------  -----------------
149-
naive          0.366817        0.412014           0.468147    0.735464            0.244739      0.14489          0.0890217
150-
marker         0.753291        0.787938           0.779262    0.679189            0.478387     0.446068           0.533737
151-
nougat         0.638434        0.632723           0.637626    0.462495            0.690028     0.540994           0.699539
147+
Method      thinkos.pdf    thinkdsp.pdf    thinkpython.pdf    switch_trans.pdf    crowd.pdf    multicolcnn.pdf
148+
--------  -------------  --------------  -----------------  ------------------  -----------  -----------------
149+
naive          0.366817        0.412014           0.468147            0.244739      0.14489          0.0890217
150+
marker         0.753291        0.787938           0.779262            0.478387     0.446068           0.533737
151+
nougat         0.638434        0.632723           0.637626            0.690028     0.540994           0.699539
152152

153-
Peak GPU memory usage during the benchmark is `3.3GB` for nougat, and `2.7GB` for marker. Benchmarks were run on an A6000.
153+
Peak GPU memory usage during the benchmark is `3.3GB` for nougat, and `3.1GB` for marker. Benchmarks were run on an A6000.
154154

155155
## Running your own benchmarks
156156

157-
You can benchmark the performance of marker on your machine.
157+
You can benchmark the performance of marker on your machine. First, download the benchmark data [here](https://drive.google.com/file/d/1WiN4K2-jQfwyQMe4wSSurbpz3hxo2fG9/view?usp=drive_link) and unzip.
158158

159-
Run `benchmark.py` like this:
159+
Then run `benchmark.py` like this:
160160

161161
```
162162
python benchmark.py benchmark_data/pdfs benchmark_data/references report.json --nougat
@@ -168,7 +168,17 @@ Omit `--nougat` to exclude nougat from the benchmark. I don't recommend running
168168

169169
# Commercial usage
170170

171-
Due to the licensing of the underlying models like layoutlmv3 and nougat, this is only suitable for noncommercial usage. I'm building a version that can be used commercially. If you would like to get early access, email me at marker@vikas.sh.
171+
Due to the licensing of the underlying models like layoutlmv3 and nougat, this is only suitable for noncommercial usage.
172+
173+
I'm building a version that can be used commercially, by stripping out the dependencies below. If you would like to get early access, email me at marker@vikas.sh.
174+
175+
Here are the non-commercial/restrictive dependencies:
176+
177+
- LayoutLMv3: CC BY-NC-SA 4.0 . [Source](https://huggingface.co/microsoft/layoutlmv3-base)
178+
- Nougat: CC-BY-NC . [Source](https://github.com/facebookresearch/nougat)
179+
- PyMuPDF - GPL . [Source](https://pymupdf.readthedocs.io/en/latest/about.html#license-and-copyright)
180+
181+
Other dependencies/datasets are openly licensed (doclaynet, byt5), or used in a way that is compatible with commercial usage (ghostscript).
172182

173183
# Thanks
174184

@@ -177,4 +187,6 @@ This work would not have been possible without amazing open source models and da
177187
- Nougat from Meta
178188
- Layoutlmv3 from Microsoft
179189
- DocLayNet from IBM
180-
- ByT5 from Google
190+
- ByT5 from Google
191+
192+
Thank you to the authors of these models and datasets for making them available to the community.

‎benchmark.py

+2-1
@@ -40,7 +40,8 @@ def nougat_prediction(pdf_filename, batch_size=1):
4040
parser.add_argument("reference_folder", help="Reference folder with reference markdown files")
4141
parser.add_argument("out_file", help="Output filename")
4242
parser.add_argument("--nougat", action="store_true", help="Run nougat and compare", default=False)
43-
parser.add_argument("--nougat_batch_size", type=int, default=2, help="Batch size to use for nougat when making predictions.")
43+
# Nougat batch size 1 uses about as much VRAM as default marker settings
44+
parser.add_argument("--nougat_batch_size", type=int, default=1, help="Batch size to use for nougat when making predictions.")
4445
parser.add_argument("--marker_parallel_factor", type=int, default=1, help="How much to multiply default parallel OCR workers and model batch sizes by.")
4546
parser.add_argument("--md_out_path", type=str, default=None, help="Output path for generated markdown files")
4647
args = parser.parse_args()
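
The flags in this hunk can be exercised in isolation. What follows is a minimal sketch, not the full `benchmark.py`: the name of the first positional argument (`in_folder` here) is an assumption, since the hunk does not show it, and the help strings are abbreviated. The remaining arguments follow the diff, including the new `--nougat_batch_size` default of 1.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Rebuild only the arguments visible in the diff (plus one assumed positional)."""
    parser = argparse.ArgumentParser(description="Sketch of the benchmark.py CLI")
    # Hypothetical name: the first positional argument is not visible in the hunk.
    parser.add_argument("in_folder", help="Input PDF folder (assumed name)")
    parser.add_argument("reference_folder", help="Reference folder with reference markdown files")
    parser.add_argument("out_file", help="Output filename")
    parser.add_argument("--nougat", action="store_true", default=False, help="Run nougat and compare")
    # This commit lowers the default from 2 to 1.
    parser.add_argument("--nougat_batch_size", type=int, default=1,
                        help="Batch size to use for nougat when making predictions.")
    return parser


# Same argument order as the README invocation.
args = build_parser().parse_args(
    ["benchmark_data/pdfs", "benchmark_data/references", "report.json", "--nougat"]
)
print(args.nougat, args.nougat_batch_size)
```

Passing `--nougat_batch_size 4` on the command line would still override the new default.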

‎benchmark_data/.gitignore

+3
@@ -0,0 +1,3 @@
1+
latex
2+
pdfs
3+
references

‎benchmark_data/latex_to_md.sh

+21
@@ -0,0 +1,21 @@
1+
#!/bin/bash
2+
3+
# List all .tex files in the latex folder
4+
FILES=$(find latex -name "*.tex")
5+
6+
for f in $FILES
7+
do
8+
echo "Processing $f file..."
9+
base_name=$(basename "$f" .tex)
10+
out_file="references/${base_name}.md"
11+
12+
pandoc --wrap=none --no-highlight --strip-comments=true -s "$f" -t plain -o "$out_file"
13+
# Replace non-breaking spaces
14+
sed -i .bak 's/ / /g' "$out_file"
15+
sed -i .bak 's/ / /g' "$out_file"
16+
sed -i .bak 's/ / /g' "$out_file"
17+
sed -i .bak 's/ / /g' "$out_file"
18+
# Remove .bak file
19+
rm "$out_file.bak"
20+
done
21+

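
The four `sed` substitutions in this script look identical in this view, most likely because the distinct Unicode space characters they target were flattened to plain spaces when the page was rendered. As a hedged illustration of the same normalization step, here is a small Python sketch. The code points it lists (`U+00A0`, `U+2007`, `U+2009`, `U+202F`) are assumptions, not values recovered from the script, and `normalize_spaces` is a hypothetical helper name.

```python
# Assumed set of space variants; the script's actual targets are not recoverable here.
SPACE_VARIANTS = {
    "\u00a0": " ",  # no-break space
    "\u2007": " ",  # figure space
    "\u2009": " ",  # thin space
    "\u202f": " ",  # narrow no-break space
}

# Translation table mapping each variant to a plain ASCII space.
_TABLE = str.maketrans(SPACE_VARIANTS)


def normalize_spaces(text: str) -> str:
    """Replace the assumed Unicode space variants with ASCII spaces."""
    return text.translate(_TABLE)


print(normalize_spaces("Figure\u00a01 shows 10\u2009ms"))
```

A single translation table does the work of the four separate passes and leaves no `.bak` file to clean up.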
‎marker/settings.py

+3-3
@@ -11,7 +11,7 @@ class Settings(BaseSettings):
1111
# General
1212
TORCH_DEVICE: str = "cpu"
1313
INFERENCE_RAM: int = 40 # How much VRAM each GPU has (in GB).
14-
VRAM_PER_TASK: float = 2.5 # How much VRAM to allocate per task (in GB)
14+
VRAM_PER_TASK: float = 2.5 # How much VRAM to allocate per task (in GB). Peak marker VRAM usage is around 3GB, but avg across workers is lower.
1515
DEBUG: bool = False # Enable debug logging
1616
DEFAULT_LANG: str = "English" # Default language we assume files to be in, should be one of the keys in TESSERACT_LANGUAGES
1717

@@ -57,7 +57,7 @@ class Settings(BaseSettings):
5757
"\par\par\par", "## Chapter", "Fig.", "particle", "[REPEATS]", "[TRUNCATED]", "### "]
5858
NOUGAT_DPI: int = 96 # DPI to render images at, matches default settings for nougat
5959
NOUGAT_MODEL_NAME: str = "0.1.0-small" # Name of the model to use
60-
NOUGAT_BATCH_SIZE: int = 4 if TORCH_DEVICE == "cuda" else 1 # Batch size for nougat, don't batch on cpu
60+
NOUGAT_BATCH_SIZE: int = 6 if TORCH_DEVICE == "cuda" else 1 # Batch size for nougat, don't batch on cpu
6161

6262
# Layout model
6363
BAD_SPAN_TYPES: List[str] = ["Caption", "Footnote", "Page-footer", "Page-header", "Picture"]
@@ -73,7 +73,7 @@ class Settings(BaseSettings):
7373

7474
# Final editing model
7575
EDITOR_BATCH_SIZE: int = 4
76-
EDITOR_MAX_LENGTH: int = 1024
76+
EDITOR_MAX_LENGTH: int = 2048
7777
EDITOR_MODEL_NAME: str = "vikp/pdf_postprocessor"
7878
ENABLE_EDITOR_MODEL: bool = False # The editor model can create false positives
7979

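
One detail of the `NOUGAT_BATCH_SIZE` line is worth illustrating: in Python, a conditional default written in a class body is evaluated once, when the class is defined, against the default `TORCH_DEVICE` of `"cpu"`. The sketch below uses a plain class rather than pydantic's `BaseSettings`, so the `SketchSettings` class, its constructor, and the `nougat_batch_size` property are all hypothetical. Whether marker overrides the value elsewhere is not shown in this diff.

```python
class SketchSettings:
    """Plain-Python stand-in for the Settings class (hypothetical, no pydantic)."""

    TORCH_DEVICE: str = "cpu"
    # Evaluated once at class-definition time, when TORCH_DEVICE is still "cpu".
    NOUGAT_BATCH_SIZE: int = 6 if TORCH_DEVICE == "cuda" else 1

    def __init__(self, **overrides):
        # Mimic per-instance overrides, e.g. values read from an env file.
        for name, value in overrides.items():
            setattr(self, name, value)

    @property
    def nougat_batch_size(self) -> int:
        # Evaluated on access, so it follows the overridden device.
        return 6 if self.TORCH_DEVICE == "cuda" else 1


settings = SketchSettings(TORCH_DEVICE="cuda")
print(settings.NOUGAT_BATCH_SIZE, settings.nougat_batch_size)
```

If the batch size is meant to follow an overridden device, computing it at access time (as the property does) or setting `NOUGAT_BATCH_SIZE` explicitly would be two options.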