README.md
+26 -14
@@ -51,14 +51,14 @@ First, clone the repo:
 - Install ghostscript > 9.55 by following [these instructions](https://ghostscript.readthedocs.io/en/latest/Install.html) or running `install/ghostscript_install.sh`.
 - Install other requirements with `cat install/apt-requirements.txt | xargs sudo apt-get install -y`
 - Set the tesseract data folder path
-- Find the tesseract data folder `tessdata` with `find / -name tessdata`. Make sure to use the one corresponding to the right tesseract version if you have multiple!
+- Find the tesseract data folder `tessdata` with `find / -name tessdata`. Make sure to use the one corresponding to the latest tesseract version if you have multiple!
 - Create a `local.env` file in the root `marker` folder with `TESSDATA_PREFIX=/path/to/tessdata` inside it
 - Install python requirements
 - `poetry install`
 - `poetry shell` to activate your poetry venv
 - Update pytorch as needed since poetry doesn't play nicely with it
 - GPU only: run `pip install torch` to install other torch dependencies.
-- CPU only: Uninstall torch, then follow the [CPU install](https://pytorch.org/) instructions.
+- CPU only: Uninstall torch, then follow the [CPU install](https://pytorch.org/get-started/locally/) instructions.
-Benchmarking PDF extraction quality is hard. I've created a test set by finding books and scientific papers that have a pdf version and a latex source. I converted the latex to text, and compared the reference to the output of text extraction methods.
+Benchmarking PDF extraction quality is hard. I've created a test set by finding books and scientific papers that have a pdf version and a latex source. I converted the latex to text, and compare the reference to the output of text extraction methods.

 Benchmarks show that marker is 10x faster than nougat, and more accurate outside arXiv (nougat was trained on arXiv data).

@@ -142,21 +142,21 @@ nougat 0.614548 810.756

 **Accuracy**

-First 4 are non-arXiv books, last 3 are arXiv papers.
+First 3 are non-arXiv books, last 3 are arXiv papers.
-Peak GPU memory usage during the benchmark is `3.3GB` for nougat, and `2.7GB` for marker. Benchmarks were run on an A6000.
+Peak GPU memory usage during the benchmark is `3.3GB` for nougat, and `3.1GB` for marker. Benchmarks were run on an A6000.

 ## Running your own benchmarks

-You can benchmark the performance of marker on your machine.
+You can benchmark the performance of marker on your machine. First, download the benchmark data [here](https://drive.google.com/file/d/1WiN4K2-jQfwyQMe4wSSurbpz3hxo2fG9/view?usp=drive_link) and unzip.
@@ -168,7 +168,17 @@ Omit `--nougat` to exclude nougat from the benchmark. I don't recommend running

 # Commercial usage

-Due to the licensing of the underlying models like layoutlmv3 and nougat, this is only suitable for noncommercial usage. I'm building a version that can be used commercially. If you would like to get early access, email me at marker@vikas.sh.
+Due to the licensing of the underlying models like layoutlmv3 and nougat, this is only suitable for noncommercial usage.
+
+I'm building a version that can be used commercially, by stripping out the dependencies below. If you would like to get early access, email me at marker@vikas.sh.
+
+Here are the non-commercial/restrictive dependencies:
+
+- LayoutLMv3: CC BY-NC-SA 4.0 . [Source](https://huggingface.co/microsoft/layoutlmv3-base)
 parser.add_argument("--nougat", action="store_true", help="Run nougat and compare", default=False)
-parser.add_argument("--nougat_batch_size", type=int, default=2, help="Batch size to use for nougat when making predictions.")
+# Nougat batch size 1 uses about as much VRAM as default marker settings
+parser.add_argument("--nougat_batch_size", type=int, default=1, help="Batch size to use for nougat when making predictions.")
 parser.add_argument("--marker_parallel_factor", type=int, default=1, help="How much to multiply default parallel OCR workers and model batch sizes by.")
 parser.add_argument("--md_out_path", type=str, default=None, help="Output path for generated markdown files")
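
To sanity-check the new default outside the benchmark script, here is a minimal standalone sketch that rebuilds only the four arguments visible in this hunk. The `build_parser` wrapper and the parser description are hypothetical names introduced for illustration; the rest of the script is not shown in the diff and is not reproduced here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical wrapper: the diff shows only the add_argument calls,
    # not how the real script constructs its parser.
    parser = argparse.ArgumentParser(description="Sketch of the benchmark arguments")
    parser.add_argument("--nougat", action="store_true", help="Run nougat and compare", default=False)
    # Nougat batch size 1 uses about as much VRAM as default marker settings
    parser.add_argument("--nougat_batch_size", type=int, default=1, help="Batch size to use for nougat when making predictions.")
    parser.add_argument("--marker_parallel_factor", type=int, default=1, help="How much to multiply default parallel OCR workers and model batch sizes by.")
    parser.add_argument("--md_out_path", type=str, default=None, help="Output path for generated markdown files")
    return parser


if __name__ == "__main__":
    # With only --nougat passed, the batch size falls back to the new default of 1.
    args = build_parser().parse_args(["--nougat"])
    print(args.nougat, args.nougat_batch_size)  # prints: True 1
```

Passing `--nougat_batch_size 2` explicitly would restore the previous behaviour if more VRAM is available.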