Commit 0817d87

committed
Add examples, update readme
1 parent 13fe745 commit 0817d87

19 files changed: +17533 -42 lines changed

README.md

+40-34
@@ -1,43 +1,42 @@
 # Marker

-Marker converts PDF, EPUB, and MOBI to Markdown. It is 10x faster than nougat, works across many types of documents, and minimizes the risk of hallucinations significantly.
-
-Features:
+Marker converts PDF, EPUB, and MOBI to markdown. It's 10x faster than nougat, more accurate on most documents, and has near-zero hallucination risk.

 - Support for a range of PDF documents (optimized for books and scientific papers)
-- Support for 1 and 2 column layouts
-- Removal of headers/footers/other artifacts
-- Latex conversion for most equations
-- Proper code block and table formatting
+- Removes headers/footers/other artifacts
+- Converts most equations to latex
+- Formats code blocks and tables
 - Support for multiple languages (although most testing is done in English). See `settings.py` for a list of supported languages.
 - Works on GPU, CPU, or MPS

 ## How it works

-Marker is a pipeline of steps and deep learning models:
+Marker is a pipeline of deep learning models:

-- Loop through each document page, and:
-- OCR the page if text cannot be detected
-- Detect page layout
-- Format blocks properly based on layout
-- Combine text from all pages
-- Postprocess extracted text
+- Extract text, OCR if necessary (heuristics, tesseract)
+- Detect page layout ([layout segmenter](https://huggingface.co/vikp/layout_segmenter), [column detector](https://huggingface.co/vikp/column_detector))
+- Clean and format each block (heuristics, [nougat](https://huggingface.co/facebook/nougat-base))
+- Combine blocks and postprocess complete text (heuristics, [pdf_postprocessor](https://huggingface.co/vikp/pdf_postprocessor))

-Marker minimizes the use of autoregressive models, which reduces the risk of hallucinations to close to zero, and improves speed. The only parts of a document that are passed through an LLM forward pass are equation blocks.
+Relying on autoregressive forward passes to generate text is slow and prone to hallucination/repetition. From the nougat paper `We observed [repetition] in 1.5% of pages in the test set, but the frequency increases for out-of-domain documents.` In my anecdotal testing, repetitions happen on 5%+ of out-of-domain (non-arXiv) pages. Nougat is an amazing model that is part of marker, it's just not a general-purpose converter.

-## Limitations
+Marker is 10x faster and more accurate by only passing equation blocks through an LLM forward pass.

-PDF is a tricky format, so marker will not always work perfectly. Here are some known limitations that are on the roadmap to address:
+## Examples

-- Marker will convert fewer equations to latex that nougat. This is because it has to first detect equations, then convert them without hallucation.
-- Marker is much faster than autoregressive methods like nougat or kosmos, but much slower than just extracting text directly from the pdf with no processing.
-- Whitespace and indentations are not always respected.
-- Images and most charts will be removed, since text can't be extracted effectively.
-- Only languages similar to English (Spanish, French, German, Russian, etc) are supported. Languages with different character sets (Chinese, Japanese, Korean, etc) are not.
+| PDF | Type | Marker | Nougat |
+|-----------------------------------------------------------------------|-------------|--------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------|
+| [Think Python](https://greenteapress.com/thinkpython/thinkpython.pdf) | Textbook | [View file](https://github.com/VikParuchuri/marker/blob/master/examples/marker/thinkpython.md) | [View file](https://github.com/VikParuchuri/marker/blob/master/examples/nougat/thinkpython.md) |
+| [Think OS](https://greenteapress.com/thinkos/thinkos.pdf) | Textbook | [View file](https://github.com/VikParuchuri/marker/blob/master/examples/marker/thinkos.md) | [View file](https://github.com/VikParuchuri/marker/blob/master/examples/nougat/thinkos.md) |
+| [Switch Transformers](https://arxiv.org/pdf/2101.03961.pdf) | arXiv paper | [View file](https://github.com/VikParuchuri/marker/blob/master/examples/marker/switch_transformers.md) | [View](https://github.com/VikParuchuri/marker/blob/master/examples/nougat/switch_transformers.md) |
+| [Multi-column CNN](https://arxiv.org/pdf/1804.07821.pdf) | arXiv paper | [View file](https://github.com/VikParuchuri/marker/blob/master/examples/marker/multicolcnn.md) | [View file](https://github.com/VikParuchuri/marker/blob/master/examples/nougat/multicolcnn.md) |
+
+
+See [below](#benchmarks) for speed and accuracy benchmarks.

 # Installation

-This has been tested on Mac and Linux (Ubuntu and Debian). You will need python 3.9+ and [poetry](https://python-poetry.org/docs/#installing-with-the-official-installer).
+This has been tested on Mac and Linux (Ubuntu and Debian). You'll need python 3.9+ and [poetry](https://python-poetry.org/docs/#installing-with-the-official-installer).

 First, clone the repo:

@@ -47,22 +46,22 @@ First, clone the repo:
 ## Linux

 - Install system requirements
-- Optional: Install tesseract 5 by following [these instructions](https://notesalexp.org/tesseract-ocr/html/) or running `install/tesseract_5_install.sh`.
-- Install ghostscript > 9.55 by following [these instructions](https://ghostscript.readthedocs.io/en/latest/Install.html) or running `install/ghostscript_install.sh`.
-- Install other requirements with `cat install/apt-requirements.txt | xargs sudo apt-get install -y`
+- Optional: Install tesseract 5 by following [these instructions](https://notesalexp.org/tesseract-ocr/html/) or running `scripts/install/tesseract_5_install.sh`.
+- Install ghostscript > 9.55 by following [these instructions](https://ghostscript.readthedocs.io/en/latest/Install.html) or running `scripts/install/ghostscript_install.sh`.
+- Install other requirements with `cat scripts/install/apt-requirements.txt | xargs sudo apt-get install -y`
 - Set the tesseract data folder path
-- Find the tesseract data folder `tessdata` with `find / -name tessdata`. Make sure to use the one corresponding to the latest tesseract version if you have multiple!
+- Find the tesseract data folder `tessdata` with `find / -name tessdata`. Make sure to use the one corresponding to the latest tesseract version if you have multiple.
 - Create a `local.env` file in the root `marker` folder with `TESSDATA_PREFIX=/path/to/tessdata` inside it
 - Install python requirements
 - `poetry install`
 - `poetry shell` to activate your poetry venv
-- Update pytorch as needed since poetry doesn't play nicely with it
+- Update pytorch since poetry doesn't play nicely with it
 - GPU only: run `pip install torch` to install other torch dependencies.
 - CPU only: Uninstall torch, then follow the [CPU install](https://pytorch.org/get-started/locally/) instructions.

 ## Mac

-- Install system requirements from `install/brew-requirements.txt`
+- Install system requirements from `scripts/install/brew-requirements.txt`
 - Set the tesseract data folder path
 - Find the tesseract data folder `tessdata` with `brew list tesseract`
 - Create a `local.env` file in the root `marker` folder with `TESSDATA_PREFIX=/path/to/tessdata` inside it
@@ -126,14 +125,12 @@ METADATA_FILE=../pdf_meta.json NUM_DEVICES=4 NUM_WORKERS=35 bash chunk_convert.s

 # Benchmarks

-Benchmarking PDF extraction quality is hard. I've created a test set by finding books and scientific papers that have a pdf version and a latex source. I converted the latex to text, and compare the reference to the output of text extraction methods.
+Benchmarking PDF extraction quality is hard. I've created a test set by finding books and scientific papers that have a pdf version and a latex source. I convert the latex to text, and compare the reference to the output of text extraction methods.

 Benchmarks show that marker is 10x faster than nougat, and more accurate outside arXiv (nougat was trained on arXiv data).

 **Speed**

-The books are several hundred pages long (paip is almost 1000 pages).
-
 Method Average Score Time per doc
 -------- --------------- --------------
 naive 0.351585 0.328931
@@ -159,13 +156,22 @@ You can benchmark the performance of marker on your machine. First, download th
 Then run `benchmark.py` like this:

 ```
-python benchmark.py benchmark_data/pdfs benchmark_data/references report.json --nougat
+python benchmark.py data/pdfs data/references report.json --nougat
 ```

 This will benchmark marker against other text extraction methods. It sets up batch sizes for nougat and marker to use a similar amount of GPU RAM for each.

 Omit `--nougat` to exclude nougat from the benchmark. I don't recommend running nougat on CPU, since it is very slow.

+# Limitations
+
+PDF is a tricky format, so marker will not always work perfectly. Here are some known limitations that are on the roadmap to address:
+
+- Marker will convert fewer equations to latex than nougat. This is because it has to first detect equations, then convert them without hallucation.
+- Whitespace and indentations are not always respected.
+- Not all lines/spans will be joined properly.
+- Only languages similar to English (Spanish, French, German, Russian, etc) are supported. Languages with different character sets (Chinese, Japanese, Korean, etc) are not.
+
 # Commercial usage

 Due to the licensing of the underlying models like layoutlmv3 and nougat, this is only suitable for noncommercial usage.
@@ -189,4 +195,4 @@ This work would not have been possible without amazing open source models and da
 - DocLayNet from IBM
 - ByT5 from Google

-Thank you to the authors of these models and datasets for making them available to the community.
+Thank you to the authors of these models and datasets for making them available to the community!
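
The `report.json` written by the benchmark command in the README diff above can be inspected with a short script. The following is a minimal sketch, assuming the report layout that this commit's `benchmark.py` change writes (a `files` mapping per method holding `time`, `score`, and `pages`, alongside `avg_score`, `time_per_page`, and `time_per_doc`). The helper name `summarize_report` and the sample values are hypothetical and are not part of the repository.

```python
import json


def summarize_report(report):
    """Return {method: {fname: seconds_per_page}} from a benchmark report dict.

    Assumes the layout written by this commit's benchmark.py:
    report[method]["files"][fname] = {"time": ..., "score": ..., "pages": ...}
    """
    summary = {}
    for method, data in report.items():
        per_file = {}
        for fname, stats in data["files"].items():
            pages = stats["pages"]
            # Guard against empty documents to avoid dividing by zero.
            per_file[fname] = stats["time"] / pages if pages else float("nan")
        summary[method] = per_file
    return summary


# Hypothetical sample values; a real report would be loaded from report.json
# with json.load.
sample_report = {
    "marker": {
        "files": {"book.pdf": {"time": 50.0, "score": 0.5, "pages": 100}},
        "avg_score": 0.5,
        "time_per_page": 0.5,
        "time_per_doc": 50.0,
    }
}

if __name__ == "__main__":
    print(json.dumps(summarize_report(sample_report), indent=4))
```

Per-file seconds per page makes it easier to see whether a long book dominates the `time_per_doc` average.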

benchmark.py

+20-7
@@ -55,8 +55,8 @@ def nougat_prediction(pdf_filename, batch_size=1):
 scores = defaultdict(dict)
 benchmark_files = os.listdir(args.in_folder)
 benchmark_files = [b for b in benchmark_files if b.endswith(".pdf")]
-times = defaultdict(int)
-total_pages = 0
+times = defaultdict(dict)
+pages = defaultdict(int)

 for fname in tqdm(benchmark_files):
 md_filename = fname.rsplit(".", 1)[0] + ".md"
@@ -67,7 +67,7 @@ def nougat_prediction(pdf_filename, batch_size=1):

 pdf_filename = os.path.join(args.in_folder, fname)
 doc = pymupdf.open(pdf_filename)
-total_pages += len(doc)
+pages[fname] = len(doc)

 for method in methods:
 start = time.time()
@@ -80,7 +80,7 @@ def nougat_prediction(pdf_filename, batch_size=1):
 else:
 raise ValueError(f"Unknown method {method}")

-times[method] += time.time() - start
+times[method][fname] = time.time() - start

 score = score_text(full_text, reference)
 scores[method][fname] = score
@@ -90,14 +90,26 @@ def nougat_prediction(pdf_filename, batch_size=1):
 with open(os.path.join(args.md_out_path, md_out_filename), "w+") as f:
 f.write(full_text)

+total_pages = sum(pages.values())
 with open(args.out_file, "w+") as f:
 write_data = defaultdict(dict)
 for method in methods:
+total_time = sum(times[method].values())
+file_stats = {
+fname:
+{
+"time": times[method][fname],
+"score": scores[method][fname],
+"pages": pages[fname]
+}
+
+for fname in benchmark_files
+}
 write_data[method] = {
+"files": file_stats,
 "avg_score": sum(scores[method].values()) / len(scores[method]),
-"scores": scores[method],
-"time_per_page": times[method] / total_pages,
-"time_per_doc": times[method] / len(scores[method])
+"time_per_page": total_time / total_pages,
+"time_per_doc": total_time / len(scores[method])
 }

 json.dump(write_data, f, indent=4)
@@ -113,3 +125,4 @@ def nougat_prediction(pdf_filename, batch_size=1):
 print("")
 print("Scores by file")
 print(tabulate(score_table, headers=["Method", *score_headers]))
+
File renamed without changes.
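
The `benchmark.py` change above replaces one accumulated time per method with a per-file record, then derives the totals from it. The sketch below reproduces that aggregation as a standalone function, assuming the same `times[method][fname]`, `scores[method][fname]`, and `pages[fname]` dictionaries. The function name `build_report` and the sample data are hypothetical. The filter that keeps only files with both a time and a score is an added guard that the diff does not contain, and the sketch assumes each method has at least one such file.

```python
from collections import defaultdict


def build_report(methods, times, scores, pages):
    """Aggregate per-file timings and scores into the report layout from the diff."""
    total_pages = sum(pages.values())
    report = defaultdict(dict)
    for method in methods:
        # Added guard (not in the diff): keep only files that have both a
        # recorded time and a recorded score for this method.
        fnames = [f for f in times[method] if f in scores[method]]
        total_time = sum(times[method][f] for f in fnames)
        report[method] = {
            "files": {
                f: {
                    "time": times[method][f],
                    "score": scores[method][f],
                    "pages": pages[f],
                }
                for f in fnames
            },
            "avg_score": sum(scores[method][f] for f in fnames) / len(fnames),
            "time_per_page": total_time / total_pages,
            "time_per_doc": total_time / len(fnames),
        }
    return dict(report)


# Hypothetical sample data for illustration only.
sample_times = {"naive": {"a.pdf": 1.0, "b.pdf": 3.0}}
sample_scores = {"naive": {"a.pdf": 0.25, "b.pdf": 0.75}}
sample_pages = {"a.pdf": 10, "b.pdf": 30}

if __name__ == "__main__":
    result = build_report(["naive"], sample_times, sample_scores, sample_pages)
    print(result["naive"]["time_per_doc"])
```

Keeping the per-file time next to the page count allows per-document comparisons that a single accumulated total per method could not provide.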
