README.md (+40 -34)
@@ -1,43 +1,42 @@
 # Marker

-Marker converts PDF, EPUB, and MOBI to Markdown. It is 10x faster than nougat, works across many types of documents, and minimizes the risk of hallucinations significantly.
-
-Features:
+Marker converts PDF, EPUB, and MOBI to markdown. It's 10x faster than nougat, more accurate on most documents, and has near-zero hallucination risk.

 - Support for a range of PDF documents (optimized for books and scientific papers)
-- Support for 1 and 2 column layouts
-- Removal of headers/footers/other artifacts
-- Latex conversion for most equations
-- Proper code block and table formatting
+- Removes headers/footers/other artifacts
+- Converts most equations to latex
+- Formats code blocks and tables
 - Support for multiple languages (although most testing is done in English). See `settings.py` for a list of supported languages.
 - Works on GPU, CPU, or MPS

 ## How it works

-Marker is a pipeline of steps and deep learning models:
+Marker is a pipeline of deep learning models:

-- Loop through each document page, and:
-- OCR the page if text cannot be detected
-- Detect page layout
-- Format blocks properly based on layout
-- Combine text from all pages
-- Postprocess extracted text
+- Extract text, OCR if necessary (heuristics, tesseract)
+- Clean and format each block (heuristics, [nougat](https://huggingface.co/facebook/nougat-base))
+- Combine blocks and postprocess complete text (heuristics, [pdf_postprocessor](https://huggingface.co/vikp/pdf_postprocessor))

-Marker minimizes the use of autoregressive models, which reduces the risk of hallucinations to close to zero, and improves speed. The only parts of a document that are passed through an LLM forward pass are equation blocks.
+Relying on autoregressive forward passes to generate text is slow and prone to hallucination/repetition. From the nougat paper `We observed [repetition] in 1.5% of pages in the test set, but the frequency increases for out-of-domain documents.` In my anecdotal testing, repetitions happen on 5%+ of out-of-domain (non-arXiv) pages. Nougat is an amazing model that is part of marker, it's just not a general-purpose converter.

-## Limitations
+Marker is 10x faster and more accurate by only passing equation blocks through an LLM forward pass.

-PDF is a tricky format, so marker will not always work perfectly. Here are some known limitations that are on the roadmap to address:
+## Examples

-- Marker will convert fewer equations to latex that nougat. This is because it has to first detect equations, then convert them without hallucation.
-- Marker is much faster than autoregressive methods like nougat or kosmos, but much slower than just extracting text directly from the pdf with no processing.
-- Whitespace and indentations are not always respected.
-- Images and most charts will be removed, since text can't be extracted effectively.
-- Only languages similar to English (Spanish, French, German, Russian, etc) are supported. Languages with different character sets (Chinese, Japanese, Korean, etc) are not.
+|[Switch Transformers](https://arxiv.org/pdf/2101.03961.pdf)| arXiv paper |[View file](https://github.com/VikParuchuri/marker/blob/master/examples/marker/switch_transformers.md)|[View](https://github.com/VikParuchuri/marker/blob/master/examples/nougat/switch_transformers.md)|
+|[Multi-column CNN](https://arxiv.org/pdf/1804.07821.pdf)| arXiv paper |[View file](https://github.com/VikParuchuri/marker/blob/master/examples/marker/multicolcnn.md)|[View file](https://github.com/VikParuchuri/marker/blob/master/examples/nougat/multicolcnn.md)|
+
+
+See [below](#benchmarks) for speed and accuracy benchmarks.

 # Installation

-This has been tested on Mac and Linux (Ubuntu and Debian). You will need python 3.9+ and [poetry](https://python-poetry.org/docs/#installing-with-the-official-installer).
+This has been tested on Mac and Linux (Ubuntu and Debian). You'll need python 3.9+ and [poetry](https://python-poetry.org/docs/#installing-with-the-official-installer).

 First, clone the repo:

@@ -47,22 +46,22 @@ First, clone the repo:
 ## Linux

 - Install system requirements
-- Optional: Install tesseract 5 by following [these instructions](https://notesalexp.org/tesseract-ocr/html/) or running `install/tesseract_5_install.sh`.
-- Install ghostscript > 9.55 by following [these instructions](https://ghostscript.readthedocs.io/en/latest/Install.html) or running `install/ghostscript_install.sh`.
-- Install other requirements with `cat install/apt-requirements.txt | xargs sudo apt-get install -y`
+- Optional: Install tesseract 5 by following [these instructions](https://notesalexp.org/tesseract-ocr/html/) or running `scripts/install/tesseract_5_install.sh`.
+- Install ghostscript > 9.55 by following [these instructions](https://ghostscript.readthedocs.io/en/latest/Install.html) or running `scripts/install/ghostscript_install.sh`.
+- Install other requirements with `cat scripts/install/apt-requirements.txt | xargs sudo apt-get install -y`
 - Set the tesseract data folder path
-- Find the tesseract data folder `tessdata` with `find / -name tessdata`. Make sure to use the one corresponding to the latest tesseract version if you have multiple!
+- Find the tesseract data folder `tessdata` with `find / -name tessdata`. Make sure to use the one corresponding to the latest tesseract version if you have multiple.
 - Create a `local.env` file in the root `marker` folder with `TESSDATA_PREFIX=/path/to/tessdata` inside it
 - Install python requirements
 - `poetry install`
 - `poetry shell` to activate your poetry venv
-- Update pytorch as needed since poetry doesn't play nicely with it
+- Update pytorch since poetry doesn't play nicely with it
 - GPU only: run `pip install torch` to install other torch dependencies.
 - CPU only: Uninstall torch, then follow the [CPU install](https://pytorch.org/get-started/locally/) instructions.

 ## Mac

-- Install system requirements from `install/brew-requirements.txt`
+- Install system requirements from `scripts/install/brew-requirements.txt`
 - Set the tesseract data folder path
 - Find the tesseract data folder `tessdata` with `brew list tesseract`
 - Create a `local.env` file in the root `marker` folder with `TESSDATA_PREFIX=/path/to/tessdata` inside it
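
The `local.env` step in the install instructions above can be scripted once the `tessdata` folder has been located. The following is a minimal sketch, not part of marker: the helper name `write_local_env` and its parameters are hypothetical, and it only writes the single `TESSDATA_PREFIX=...` line that the README describes.

```python
from pathlib import Path


def write_local_env(repo_root: str, tessdata_dir: str) -> Path:
    """Write a local.env file containing TESSDATA_PREFIX=<tessdata_dir>.

    Hypothetical helper for illustration: repo_root is the root `marker`
    folder, tessdata_dir is the path found with `find / -name tessdata`
    (Linux) or `brew list tesseract` (Mac).
    """
    env_path = Path(repo_root) / "local.env"
    # Overwrites any existing local.env with the single setting.
    env_path.write_text(f"TESSDATA_PREFIX={tessdata_dir}\n", encoding="utf-8")
    return env_path
```

Calling `write_local_env(".", "/path/to/tessdata")` from the repo root would produce the same file as creating it by hand.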
-Benchmarking PDF extraction quality is hard. I've created a test set by finding books and scientific papers that have a pdf version and a latex source. I converted the latex to text, and compare the reference to the output of text extraction methods.
+Benchmarking PDF extraction quality is hard. I've created a test set by finding books and scientific papers that have a pdf version and a latex source. I convert the latex to text, and compare the reference to the output of text extraction methods.

 Benchmarks show that marker is 10x faster than nougat, and more accurate outside arXiv (nougat was trained on arXiv data).

 **Speed**

-The books are several hundred pages long (paip is almost 1000 pages).
-
 Method      Average Score    Time per doc
 --------  ---------------  --------------
 naive            0.351585        0.328931
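
The comparison described in this hunk (reference text derived from the latex source versus the output of an extraction method) can be sketched as below. This is an illustration under stated assumptions, not the benchmark's actual scoring code: the diff does not say which similarity metric is used, so `difflib.SequenceMatcher` from the Python standard library stands in, and `normalize` and `score_extraction` are hypothetical names.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Collapse whitespace runs so layout differences don't dominate the score."""
    return " ".join(text.split())


def score_extraction(reference: str, extracted: str) -> float:
    """Return a similarity in [0, 1] between the reference and extracted text.

    Assumption: a character-level SequenceMatcher ratio is a stand-in for
    whatever metric the real benchmark computes.
    """
    ref, out = normalize(reference), normalize(extracted)
    return SequenceMatcher(None, ref, out, autojunk=False).ratio()


reference = "Attention is all you need."
# Same words with different whitespace: identical after normalization.
print(score_extraction(reference, "Attention  is all\nyou need."))
# Truncated extraction: scores below the identical case.
print(score_extraction(reference, "Attention is all"))
```

Averaging such a score over every document in the test set would give a per-method number comparable in spirit to the "Average Score" column, though the values would not match the table.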
@@ -159,13 +156,22 @@ You can benchmark the performance of marker on your machine. First, download th
 This will benchmark marker against other text extraction methods. It sets up batch sizes for nougat and marker to use a similar amount of GPU RAM for each.

 Omit `--nougat` to exclude nougat from the benchmark. I don't recommend running nougat on CPU, since it is very slow.

+# Limitations
+
+PDF is a tricky format, so marker will not always work perfectly. Here are some known limitations that are on the roadmap to address:
+
+- Marker will convert fewer equations to latex than nougat. This is because it has to first detect equations, then convert them without hallucination.
+- Whitespace and indentations are not always respected.
+- Not all lines/spans will be joined properly.
+- Only languages similar to English (Spanish, French, German, Russian, etc) are supported. Languages with different character sets (Chinese, Japanese, Korean, etc) are not.
+
 # Commercial usage

 Due to the licensing of the underlying models like layoutlmv3 and nougat, this is only suitable for noncommercial usage.
@@ -189,4 +195,4 @@ This work would not have been possible without amazing open source models and da
 - DocLayNet from IBM
 - ByT5 from Google

-Thank you to the authors of these models and datasets for making them available to the community.
+Thank you to the authors of these models and datasets for making them available to the community!