* [162](https://github.com/pymupdf/RAG/issues/162) - to_markdown isn't outputting all the pages but get_text is.
* [173](https://github.com/pymupdf/RAG/issues/173) - First column of table is repeated before the actual table.
* [187](https://github.com/pymupdf/RAG/issues/187) - Unsolicited Text Particles
* [188](https://github.com/pymupdf/RAG/issues/188) - Takes lot of time to convert into markdown.
* [191](https://github.com/pymupdf/RAG/issues/191) - Extraction of text stops in the middle while working fine with PyMuPDF.
* [212](https://github.com/pymupdf/RAG/issues/212) - In pymupdf4llm, if a page has multiple images, only 1 image per-page is extracted.
* [213](https://github.com/pymupdf/RAG/issues/213) - Many ���� after converting when using pymupdf4llm
* [215](https://github.com/pymupdf/RAG/issues/215) - Spending too much time on identifying text bboxes
* [218](https://github.com/pymupdf/RAG/issues/218) - IndexError in get_raw_lines when processing PDFs with formulas
* [225](https://github.com/pymupdf/RAG/issues/225) - Text with background missing from output.
* [229](https://github.com/pymupdf/RAG/issues/229) - Duplicated Table Content on pymuPDF4LLM.

### Other Changes:
* Added **_new parameter_** `filename`: (str), optional. Sets or overrides the filename used for saved images. Useful when the document is opened from memory.
* Added **_new parameter_** `use_glyphs`: (bool), optional. If `True`, use the glyph number of a character (if possible) when the font has no back-translation to the original Unicode value. The default is `False`, which causes � symbols to be rendered in these cases.
* Added **_strike-out support_**: We now detect and render ~~struck-out text~~.
* Improved **_background color_** detection: We have introduced a simple detection mechanism. If a page shows an identical color in all four corners, we assume this to be the background color. Text and vector graphics with this color are ignored as invisible.
* Improved **_invisible text detection_**: Text with an alpha value of 0 is now ignored.
* Improved **_fake-bold_** detection: Text mimicking bold appearance is now treated like standard bold text in most cases.
* Header handling changes:
    - Detection is now based on the **_largest font size_** of the line.
    - Uniform rendering: All spans of a header line are now rendered with the same appearance.
* Changed handling of parameter `graphics_limit`: Previously, a page was ignored completely if its vector graphics count exceeded the limit. Now, vector graphics are ignored only if their count **_outside table bounding boxes_** is too large. This should suppress only the vector graphics on the page, while keeping images, text and table content extractable.
* Changed the `margins` default to 0. The previous default, `(0, 50, 0, 50)`, ignored 50 points at the top and bottom of each page, which caused confusion in too many cases.
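
To illustrate the background color and `graphics_limit` changes described above, here is a minimal sketch in plain Python. It is not the library's implementation: the function names, the pixel grid, the rectangle tuples and the exact comparison against the limit are all assumptions made for illustration.

```python
from typing import Optional, Sequence, Tuple

Color = Tuple[int, int, int]              # RGB triple, 0-255 per channel
Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def detect_background(pixels: Sequence[Sequence[Color]]) -> Optional[Color]:
    """Return the assumed background color, or None if the corners differ.

    `pixels` is a hypothetical row-major grid of RGB values for a rendered page.
    """
    corners = {
        pixels[0][0], pixels[0][-1],    # top-left, top-right
        pixels[-1][0], pixels[-1][-1],  # bottom-left, bottom-right
    }
    # Only an identical color in all four corners counts as the background.
    return next(iter(corners)) if len(corners) == 1 else None


def inside(inner: Rect, outer: Rect) -> bool:
    """True if rectangle `inner` lies completely within `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def ignore_graphics(graphics: Sequence[Rect], tables: Sequence[Rect],
                    graphics_limit: Optional[int]) -> bool:
    """Decide whether to drop a page's vector graphics (assumed semantics).

    Only graphics outside every table rectangle count toward the limit.
    Text, images and tables are unaffected by this decision.
    """
    if graphics_limit is None:
        return False
    outside = [g for g in graphics if not any(inside(g, t) for t in tables)]
    return len(outside) > graphics_limit
```

In this sketch, a page with many vector graphics inside a table keeps its graphics, because only the graphics outside table rectangles count toward the limit.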