Commit 47b68a3

Merge pull request #236 from pymupdf/Version-0.0.19
Version 0.0.19
2 parents d4d68b0 + 2ec62ae commit 47b68a3

6 files changed, +214 -104 lines changed

CHANGES.md

Lines changed: 44 additions & 0 deletions
@@ -1,6 +1,50 @@
 # Change Log
 
 
+## Changes in version 0.0.19
+
+### Fixes:
+The following list includes fixes made in version 0.0.18 already.
+
+* [158](https://github.com/pymupdf/RAG/issues/158) - Very long titles when converting to markdown.
+* [155](https://github.com/pymupdf/RAG/issues/155) - Inconsistent image extraction from image-only PDFs
+* [161](https://github.com/pymupdf/RAG/issues/161) - force_text param ignored.
+* [162](https://github.com/pymupdf/RAG/issues/162) - to_markdown isn't outputting all the pages but get_text is.
+* [173](https://github.com/pymupdf/RAG/issues/173) - First column of table is repeated before the actual table.
+* [187](https://github.com/pymupdf/RAG/issues/187) - Unsolicited Text Particles
+* [188](https://github.com/pymupdf/RAG/issues/188) - Takes lot of time to convert into markdown.
+* [191](https://github.com/pymupdf/RAG/issues/191) - Extraction of text stops in the middle while working fine with PyMuPDF.
+* [212](https://github.com/pymupdf/RAG/issues/212) - In pymupdf4llm, if a page has multiple images, only 1 image per-page is extracted.
+* [213](https://github.com/pymupdf/RAG/issues/213) - Many ���� after converting when using pymupdf4llm
+* [215](https://github.com/pymupdf/RAG/issues/215) - Spending too much time on identifying text bboxes
+* [218](https://github.com/pymupdf/RAG/issues/218) - IndexError in get_raw_lines when processing PDFs with formulas
+* [225](https://github.com/pymupdf/RAG/issues/225) - Text with background missing from output.
+* [229](https://github.com/pymupdf/RAG/issues/229) - Duplicated Table Content on pymuPDF4LLM.
+
+
+### Other Changes:
+
+* Added **_new parameter_** `filename`: (str), optional. Overwrites or sets the filename for saved images. Useful when the document is opened from memory.
+
+* Added **_new parameter_** `use_glyphs`: (bool), optional. Request to use the glyph number (if possible) of a character if the font has no back-translation to the original Unicode value. The default is `False` which causes � symbols to be rendered in these cases.
+
+* Added **_strike-out support_**: We now detect and render ~~striked-out text.~~
+
+* Improved **_background color_** detection: We have introduced a simple background color detection mechanism: If a page shows an identical color in all four corners, we assume this to be the background color. Text and vector graphics with this color will be ignored as invisible.
+
+* Improved **_invisible text detection_**: Text with an alpha value of 0 is now ignored.
+
+* Improved **_fake-bold_** detection: Text mimicking bold appearance is now treated like standard bold text in most cases.
+
+* Header handling changes:
+    - Detection now happens based on the **_largest font size_** of the line.
+    - Uniformly rendered: All spans of a header line will now be rendered with the same appearance.
+
+* Changed handling of parameter `graphics_limit`: We previously ignored a page completely if the vector graphics count exceeded the limit. We now only ignore vector graphics if their count **_outside table boundary boxes_** is too large. This should only suppress vector graphics on the page, while keeping images, text and table content extractable.
+
+* Changed the `margins` default to 0. The previous default `(0, 50, 0, 50)` ignored 50 points at the top and bottom of pages. This has turned out to cause confusion in too many cases.
+
+
 ## Changes in version 0.0.17
 
 ### Fixes:
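
The background color rule described in the 0.0.19 notes (identical color in all four page corners means background) can be illustrated with a small sketch. This is not the library's implementation: the pixel grid, the `guess_background` and `is_invisible` names, and the RGB tuple representation are all assumptions made for illustration.

```python
from typing import Optional, Sequence, Tuple

RGB = Tuple[int, int, int]


def guess_background(pixels: Sequence[Sequence[RGB]]) -> Optional[RGB]:
    """Return the color shared by all four corners of a pixel grid, or None.

    `pixels` is a hypothetical row-major grid of RGB tuples standing in for
    a rendered page; it is not a PyMuPDF type.
    """
    if not pixels or not pixels[0]:
        return None
    corners = {
        pixels[0][0],    # top-left
        pixels[0][-1],   # top-right
        pixels[-1][0],   # bottom-left
        pixels[-1][-1],  # bottom-right
    }
    # Only a single distinct corner color is accepted as the background.
    return corners.pop() if len(corners) == 1 else None


def is_invisible(color: RGB, background: Optional[RGB]) -> bool:
    """Treat text or vector graphics drawn in the background color as invisible."""
    return background is not None and color == background
```

If the corners disagree, the sketch reports no background at all, so nothing is filtered out on that basis.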

pdf4llm/setup.py

Lines changed: 8 additions & 2 deletions
@@ -13,11 +13,11 @@
     "Programming Language :: Python :: 3",
     "Topic :: Utilities",
 ]
-requires = ["pymupdf4llm>=0.0.18"]
+requires = ["pymupdf4llm>=0.0.19"]
 
 setuptools.setup(
     name="pdf4llm",
-    version="0.0.18",
+    version="0.0.19",
     author="Artifex",
     author_email="[email protected]",
     description="PyMuPDF Utilities for LLM/RAG",
@@ -32,4 +32,10 @@
     package_data={
         "pdf4llm": ["LICENSE"],
     },
+    project_urls={
+        "Documentation": "https://pymupdf.readthedocs.io/",
+        "Source": "https://github.com/pymupdf/RAG/tree/main/pdf4llm/pdf4llm",
+        "Tracker": "https://github.com/pymupdf/RAG/issues",
+        "Changelog": "https://github.com/pymupdf/RAG/blob/main/CHANGES.md",
+    },
 )

pymupdf4llm/pymupdf4llm/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 from .helpers.pymupdf_rag import IdentifyHeaders, to_markdown
 
-__version__ = "0.0.18"
+__version__ = "0.0.19"
 version = __version__
 version_tuple = tuple(map(int, version.split(".")))
 
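
A brief illustration of why a `version_tuple` is convenient next to the version string: tuples of integers compare numerically, while version strings compare character by character. The snippet repeats the expression from `__init__.py` on a local variable; the comparison values are made up for the example.

```python
version = "0.0.19"
version_tuple = tuple(map(int, version.split(".")))  # same expression as in __init__.py

# Integer tuples compare component by component, so 19 > 9 as expected.
tuple_says_newer = version_tuple > (0, 0, 9)

# Plain strings compare lexicographically: "1" sorts before "9",
# so the string comparison gives the opposite answer.
string_says_newer = version > "0.0.9"
```

Note that `int()` only accepts purely numeric components, so this expression would raise `ValueError` for a version string with a suffix such as `rc1`.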

pymupdf4llm/pymupdf4llm/helpers/get_text_lines.py

Lines changed: 15 additions & 13 deletions
@@ -74,11 +74,15 @@ def sanitize_spans(line):
             s0 = line[i - 1]
             s1 = line[i]
             # "delta" depends on the font size. Spans will be joined if
-            # no more than 10% of the font size separates them.
+            # no more than 10% of the font size separates them and important
+            # attributes are the same.
             delta = s1["size"] * 0.1
-            if s0["bbox"].x1 + delta < s1["bbox"].x0:
-                continue  # all good: no joining neded
-
+            if s0["bbox"].x1 + delta < s1["bbox"].x0 or (
+                s0["flags"],
+                s0["char_flags"],
+                s0["size"],
+            ) != (s1["flags"], s1["char_flags"], s1["size"]):
+                continue  # no joining
             # We need to join bbox and text of two consecutive spans
             # On occasion, spans may also be duplicated.
             if s0["text"] != s1["text"] or s0["bbox"] != s1["bbox"]:
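
The tightened join condition in this hunk can be read as a standalone predicate. The sketch below is an illustration only: spans are plain dicts, `"bbox"` is an `(x0, y0, x1, y1)` tuple instead of the Rect objects used in the diff, and the name `should_join` is hypothetical.

```python
from typing import Any, Dict

Span = Dict[str, Any]


def should_join(s0: Span, s1: Span) -> bool:
    """Decide whether two horizontally adjacent spans may be merged.

    s0 is the left span, s1 the right one. They are merged only if the gap
    between them is at most 10% of the font size AND flags, char_flags and
    size are identical.
    """
    delta = s1["size"] * 0.1  # allowed gap: 10% of the font size
    gap_too_wide = s0["bbox"][2] + delta < s1["bbox"][0]
    attrs_differ = (s0["flags"], s0["char_flags"], s0["size"]) != (
        s1["flags"],
        s1["char_flags"],
        s1["size"],
    )
    return not (gap_too_wide or attrs_differ)
```

Before this change only the gap was checked, so two close spans with different styling (for example one bold and one regular) could be merged into a single span.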
@@ -108,11 +112,14 @@ def sanitize_spans(line):
                     continue
                 if is_white(s["text"]):  # ignore white text
                     continue
+                if s["alpha"] == 0:  # ignore invisible text
+                    continue
                 if s["flags"] & 1 == 1:  # if a superscript, modify bbox
                     # with that of the preceding or following span
                     i = 1 if sno == 0 else sno - 1
-                    neighbor = line["spans"][i]
-                    sbbox.y1 = neighbor["bbox"][3]
+                    if len(line["spans"]) > i:
+                        neighbor = line["spans"][i]
+                        sbbox.y1 = neighbor["bbox"][3]
                     s["text"] = f"[{s['text']}]"
                 s["bbox"] = sbbox  # update with the Rect version
                 # include line/block numbers to facilitate separator insertion
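
The new length guard matters for a line that consists of a single superscript span: there `sno == 0`, so `i` becomes 1, and the unguarded `line["spans"][1]` would raise `IndexError`. A minimal sketch of the index selection, with a hypothetical helper name `neighbor_index` that does not exist in the library:

```python
from typing import Optional


def neighbor_index(sno: int, span_count: int) -> Optional[int]:
    """Index of the span whose bbox a superscript should borrow, or None.

    Mirrors the diff: the first span in a line looks at the following span,
    any other span looks at the preceding one. None means that no such
    neighbor exists and the bbox should be left unchanged.
    """
    i = 1 if sno == 0 else sno - 1
    return i if span_count > i else None
```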
@@ -132,10 +139,7 @@ def sanitize_spans(line):
         sbbox = s["bbox"]  # this bbox
         sbbox0 = line[-1]["bbox"]  # previous bbox
         # if any of top or bottom coordinates are close enough, join...
-        if (
-            abs(sbbox.y1 - sbbox0.y1) <= y_delta
-            or abs(sbbox.y0 - sbbox0.y0) <= y_delta
-        ):
+        if abs(sbbox.y1 - sbbox0.y1) <= y_delta or abs(sbbox.y0 - sbbox0.y0) <= y_delta:
             line.append(s)  # append to this line
             lrect |= sbbox  # extend line rectangle
             continue
@@ -156,9 +160,7 @@ def sanitize_spans(line):
     return nlines
 
 
-def get_text_lines(
-    page, *, textpage=None, clip=None, sep="\t", tolerance=3, ocr=False
-):
+def get_text_lines(page, *, textpage=None, clip=None, sep="\t", tolerance=3, ocr=False):
     """Extract text by line keeping natural reading sequence.
 
     Notes:
