Update benchmarks

VikParuchuri · VikParuchuri · commit 13fe745d4bc3 · 2023-11-29T12:51:10.000-08:00
diff --git a/.gitignore b/.gitignore
@@ -4,7 +4,6 @@ local.env
 experiments
 test_data
 training
-benchmark_data
 wandb
 
 # Byte-compiled / optimized / DLL files
diff --git a/README.md b/README.md
@@ -51,14 +51,14 @@ First, clone the repo:
   - Install ghostscript > 9.55 by following [these instructions](https://ghostscript.readthedocs.io/en/latest/Install.html) or running `install/ghostscript_install.sh`.
   - Install other requirements with `cat install/apt-requirements.txt | xargs sudo apt-get install -y`
 - Set the tesseract data folder path
-  - Find the tesseract data folder `tessdata` with `find / -name tessdata`.  Make sure to use the one corresponding to the right tesseract version if you have multiple!
+  - Find the tesseract data folder `tessdata` with `find / -name tessdata`.  Make sure to use the one corresponding to the latest tesseract version if you have multiple!
   - Create a `local.env` file in the root `marker` folder with `TESSDATA_PREFIX=/path/to/tessdata` inside it
 - Install python requirements
   - `poetry install`
   - `poetry shell` to activate your poetry venv
 - Update pytorch as needed since poetry doesn't play nicely with it
   - GPU only: run `pip install torch` to install other torch dependencies.
-  - CPU only: Uninstall torch, then follow the [CPU install](https://pytorch.org/) instructions.
+  - CPU only: Uninstall torch, then follow the [CPU install](https://pytorch.org/get-started/locally/) instructions.
 
 ## Mac
 
@@ -126,7 +126,7 @@ METADATA_FILE=../pdf_meta.json NUM_DEVICES=4 NUM_WORKERS=35 bash chunk_convert.s
 
 # Benchmarks
 
-Benchmarking PDF extraction quality is hard.  I've created a test set by finding books and scientific papers that have a pdf version and a latex source.  I converted the latex to text, and compared the reference to the output of text extraction methods.
+Benchmarking PDF extraction quality is hard.  I've created a test set by finding books and scientific papers that have a PDF version and a LaTeX source.  I converted the LaTeX to text, and I compare that reference to the output of text extraction methods.
 
 Benchmarks show that marker is 10x faster than nougat, and more accurate outside arXiv (nougat was trained on arXiv data).
 
@@ -142,21 +142,21 @@ nougat           0.614548      810.756
 
 **Accuracy**
 
-First 4 are non-arXiv books, last 3 are arXiv papers.
+First 3 are non-arXiv books, last 3 are arXiv papers.
 
-Method      thinkos.pdf    thinkdsp.pdf    thinkpython.pdf    paip.pdf    switch_trans.pdf    crowd.pdf    multicolcnn.pdf
---------  -------------  --------------  -----------------  ----------  ------------------  -----------  -----------------
-naive          0.366817        0.412014           0.468147    0.735464             0.244739     0.14489           0.0890217
-marker         0.753291        0.787938           0.779262    0.679189             0.478387     0.446068          0.533737
-nougat         0.638434        0.632723           0.637626    0.462495             0.690028     0.540994          0.699539
+Method      thinkos.pdf    thinkdsp.pdf    thinkpython.pdf    switch_trans.pdf    crowd.pdf    multicolcnn.pdf
+--------  -------------  --------------  -----------------  ------------------  -----------  -----------------
+naive          0.366817        0.412014           0.468147            0.244739     0.14489           0.0890217
+marker         0.753291        0.787938           0.779262            0.478387     0.446068          0.533737
+nougat         0.638434        0.632723           0.637626            0.690028     0.540994          0.699539
 
-Peak GPU memory usage during the benchmark is `3.3GB` for nougat, and `2.7GB` for marker.  Benchmarks were run on an A6000.
+Peak GPU memory usage during the benchmark is `3.3GB` for nougat, and `3.1GB` for marker.  Benchmarks were run on an A6000.
 
 ## Running your own benchmarks
 
-You can benchmark the performance of marker on your machine.
+You can benchmark the performance of marker on your machine.  First, download the benchmark data [here](https://drive.google.com/file/d/1WiN4K2-jQfwyQMe4wSSurbpz3hxo2fG9/view?usp=drive_link) and unzip.
 
-Run `benchmark.py` like this:
+Then run `benchmark.py` like this:
 
 ```
 python benchmark.py benchmark_data/pdfs benchmark_data/references report.json --nougat
@@ -168,7 +168,17 @@ Omit `--nougat` to exclude nougat from the benchmark.  I don't recommend running
 
 # Commercial usage
 
-Due to the licensing of the underlying models like layoutlmv3 and nougat, this is only suitable for noncommercial usage.  I'm building a version that can be used commercially. If you would like to get early access, email me at marker@vikas.sh.
+Due to the licensing of the underlying models like layoutlmv3 and nougat, this is only suitable for noncommercial usage.  
+
+I'm building a version that can be used commercially, by stripping out the dependencies below. If you would like to get early access, email me at marker@vikas.sh.
+
+Here are the non-commercial/restrictive dependencies:
+
+- LayoutLMv3: CC BY-NC-SA 4.0.  [Source](https://huggingface.co/microsoft/layoutlmv3-base)
+- Nougat: CC-BY-NC.  [Source](https://github.com/facebookresearch/nougat)
+- PyMuPDF: GPL.  [Source](https://pymupdf.readthedocs.io/en/latest/about.html#license-and-copyright)
+
+Other dependencies/datasets are openly licensed (doclaynet, byt5), or used in a way that is compatible with commercial usage (ghostscript).
 
 # Thanks
 
@@ -177,4 +187,6 @@ This work would not have been possible without amazing open source models and da
 - Nougat from Meta
 - Layoutlmv3 from Microsoft
 - DocLayNet from IBM
-- ByT5 from Google
+- ByT5 from Google
+
+Thank you to the authors of these models and datasets for making them available to the community.
diff --git a/benchmark.py b/benchmark.py
@@ -40,7 +40,8 @@ def nougat_prediction(pdf_filename, batch_size=1):
     parser.add_argument("reference_folder", help="Reference folder with reference markdown files")
     parser.add_argument("out_file", help="Output filename")
     parser.add_argument("--nougat", action="store_true", help="Run nougat and compare", default=False)
-    parser.add_argument("--nougat_batch_size", type=int, default=2, help="Batch size to use for nougat when making predictions.")
+    # Nougat batch size 1 uses about as much VRAM as default marker settings
+    parser.add_argument("--nougat_batch_size", type=int, default=1, help="Batch size to use for nougat when making predictions.")
     parser.add_argument("--marker_parallel_factor", type=int, default=1, help="How much to multiply default parallel OCR workers and model batch sizes by.")
     parser.add_argument("--md_out_path", type=str, default=None, help="Output path for generated markdown files")
     args = parser.parse_args()
diff --git a/benchmark_data/.gitignore b/benchmark_data/.gitignore
@@ -0,0 +1,3 @@
+latex
+pdfs
+references
diff --git a/benchmark_data/latex_to_md.sh b/benchmark_data/latex_to_md.sh
@@ -0,0 +1,21 @@
+#!/bin/bash
+
+# List all .tex files in the latex folder
+FILES=$(find latex -name "*.tex")
+
+for f in $FILES
+do
+  echo "Processing $f file..."
+  base_name=$(basename "$f" .tex)
+  out_file="references/${base_name}.md"
+
+  pandoc --wrap=none --no-highlight --strip-comments=true -s "$f" -t plain -o "$out_file"
+  # Replace non-breaking spaces
+  sed -i .bak 's/ / /g' "$out_file"
+  sed -i .bak 's/ / /g' "$out_file"
+  sed -i .bak 's/ / /g' "$out_file"
+  sed -i .bak 's/ / /g' "$out_file"
+  # Remove .bak file
+  rm "$out_file.bak"
+done
+
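
The `sed` calls in `latex_to_md.sh` replace space-like characters in the pandoc output with plain spaces. A minimal Python sketch of the same cleanup step follows. The four code points in `SPACE_LIKE` are assumptions chosen for illustration; the characters the script actually targets are not recoverable from the diff as rendered. `normalize_spaces` is a hypothetical helper, not part of marker.

```python
# Hypothetical mapping: which space-like characters to replace is an assumption.
SPACE_LIKE = {
    "\u00a0": " ",  # NO-BREAK SPACE
    "\u202f": " ",  # NARROW NO-BREAK SPACE
    "\u2007": " ",  # FIGURE SPACE
    "\u2009": " ",  # THIN SPACE
}

# Build the translation table once so repeated calls stay cheap.
_SPACE_TABLE = str.maketrans(SPACE_LIKE)


def normalize_spaces(text: str) -> str:
    """Replace each mapped space-like character with an ASCII space."""
    return text.translate(_SPACE_TABLE)
```

Doing the replacement in one pass avoids the `.bak` files that in-place `sed` leaves behind, and it behaves the same on macOS and Linux.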
diff --git a/marker/settings.py b/marker/settings.py
@@ -11,7 +11,7 @@ class Settings(BaseSettings):
     # General
     TORCH_DEVICE: str = "cpu"
     INFERENCE_RAM: int = 40 # How much VRAM each GPU has (in GB).
-    VRAM_PER_TASK: float = 2.5 # How much VRAM to allocate per task (in GB)
+    VRAM_PER_TASK: float = 2.5 # How much VRAM to allocate per task (in GB).  Peak marker VRAM usage is around 3GB, but avg across workers is lower.
     DEBUG: bool = False # Enable debug logging
     DEFAULT_LANG: str = "English" # Default language we assume files to be in, should be one of the keys in TESSERACT_LANGUAGES
 
@@ -57,7 +57,7 @@ class Settings(BaseSettings):
                                   "\par\par\par", "## Chapter", "Fig.", "particle", "[REPEATS]", "[TRUNCATED]", "### "]
     NOUGAT_DPI: int = 96 # DPI to render images at, matches default settings for nougat
     NOUGAT_MODEL_NAME: str = "0.1.0-small" # Name of the model to use
-    NOUGAT_BATCH_SIZE: int = 4 if TORCH_DEVICE == "cuda" else 1 # Batch size for nougat, don't batch on cpu
+    NOUGAT_BATCH_SIZE: int = 6 if TORCH_DEVICE == "cuda" else 1 # Batch size for nougat, don't batch on cpu
 
     # Layout model
     BAD_SPAN_TYPES: List[str] = ["Caption", "Footnote", "Page-footer", "Page-header", "Picture"]
@@ -73,7 +73,7 @@ class Settings(BaseSettings):
 
     # Final editing model
     EDITOR_BATCH_SIZE: int = 4
-    EDITOR_MAX_LENGTH: int = 1024
+    EDITOR_MAX_LENGTH: int = 2048
     EDITOR_MODEL_NAME: str = "vikp/pdf_postprocessor"
     ENABLE_EDITOR_MODEL: bool = False # The editor model can create false positives
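
The `INFERENCE_RAM` and `VRAM_PER_TASK` comments in `marker/settings.py` suggest a simple per-GPU budget. The sketch below assumes the number of concurrent tasks is the GPU's VRAM divided by the per-task allocation, rounded down. This diff does not show how marker actually derives its worker count, so `tasks_per_gpu` is a hypothetical helper for illustration only.

```python
import math


def tasks_per_gpu(inference_ram_gb: float, vram_per_task_gb: float) -> int:
    """Estimate how many tasks fit on one GPU (assumed formula, not marker's code)."""
    if vram_per_task_gb <= 0:
        raise ValueError("vram_per_task_gb must be positive")
    # Round down so the combined allocation never exceeds the VRAM budget,
    # but always allow at least one task.
    return max(1, math.floor(inference_ram_gb / vram_per_task_gb))
```

With the defaults in this file (`INFERENCE_RAM = 40`, `VRAM_PER_TASK = 2.5`) the sketch gives 16 tasks. Budgeting for the roughly 3GB peak instead gives 13, which illustrates the trade-off the new comment describes: the setting tracks average usage across workers, not the peak.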