-
Notifications
You must be signed in to change notification settings - Fork 82
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Preprocessing of header blocks #45
Comments
Hello @jojennin, thanks for your detailed report and kind words. If this worked in v2.2 and not now I had to break something :( I will quickly go through git bisect if I manage to find the commit. Thank you for the HTML to test this. |
I just created a test for version in main branch and it seems it is working. Maybe I don't understand the issue here? Can you please create a test where you can demonstrate that the current version fails? from unittest.mock import ANY
def test_heading_should_be_processed():
html = """<html>
<head>
<title>Eddie Brunner - Wikipedia</title>
</head>
<body>
<h1>Eddie Brunner</h1>
<p>
<b>Eduard "Eddie" Brunner</b> (July 19, 1912 - July 18, 1960, Zurich).
</p>
<p>
Eddie Brunner was a Swiss jazz reedist and bandleader.
Brunner learned to play clarinet, piano, and tenor and alto saxophone,
and began playing professionally in the early 1930s with Rene Dumont,
Jack and Louis de Vries and Marek Weber. In 1936 he moved to Paris and
recorded under his own name as well as with the Goldene Sieben and Louis Bacon;
he moved back to Switzerland once the
<a href="/wiki/World_War_II" title="World War II">war</a>
had begun. He joined Teddy Stauffer's band, and in 1941 took over leadership of
the group until 1947, when it dissolved. He led a new six-piece ensemble in 1948,
and recorded for radio and television broadcasts in the 1950s.
</p>
</body>
</html>"""
paragraphs = justext.justext(html, justext.get_stoplist("English"))
assert [p.text for p in paragraphs] == [
"Eddie Brunner",
'Eduard "Eddie" Brunner (July 19, 1912 - July 18, 1960, Zurich).',
ANY,
] |
Thank you @miso-belica for your response. The issue is not in the retrieval of paragraphs, rather in the classification (specifically in the context-aware classification). Because the import justext
import justext.core as jt
html = """<html>
<head>
<title>Eddie Brunner - Wikipedia</title>
</head>
<body>
<h1>Eddie Brunner</h1>
<p>
<b>Eduard "Eddie" Brunner</b> (July 19, 1912 - July 18, 1960, Zurich).
</p>
<p>
Eddie Brunner was a Swiss jazz reedist and bandleader.
Brunner learned to play clarinet, piano, and tenor and alto saxophone,
and began playing professionally in the early 1930s with Rene Dumont,
Jack and Louis de Vries and Marek Weber. In 1936 he moved to Paris and
recorded under his own name as well as with the Goldene Sieben and Louis Bacon;
he moved back to Switzerland once the
<a href="/wiki/World_War_II" title="World War II">war</a>
had begun. He joined Teddy Stauffer's band, and in 1941 took over leadership of
the group until 1947, when it dissolved. He led a new six-piece ensemble in 1948,
and recorded for radio and television broadcasts in the 1950s.
</p>
</body>
</html>
"""
def justext_v3(html, stoplist):
paragraphs = justext.justext(html, stoplist)
return paragraphs
def justext_v220_copyclass(html, stoplist):
dom = jt.html_to_dom(html)
dom = jt.preprocessor(dom)
paragraphs = jt.ParagraphMaker.make_paragraphs(dom)
jt.classify_paragraphs(paragraphs, stoplist)
# Copy the class_type attribute
# as is done in v2.2.0 (Line 296 of core.py in v2.2.0)
# Commenting out this loop will result in the same classification
# as the current version
for paragraph in paragraphs:
paragraph.class_type = paragraph.cf_class
jt.revise_paragraph_classification(paragraphs)
return paragraphs
def main():
stoplist = justext.get_stoplist("English")
paragraphs = justext_v3(html, stoplist)
paragraphs_v220_copyclass = justext_v220_copyclass(html, stoplist)
for p3, p2 in zip(paragraphs, paragraphs_v220_copyclass):
if p3.class_type != p2.class_type:
print(p3.text, p3.class_type)
print(p2.text, p2.class_type)
if __name__ == "__main__":
main() Running this script results in the following:
If you comment out the loop that copies the The difference is due to the classification of short header blocks. If the |
Thanks sharing and maintaining the repository. I find the code very readable and great for extracting text from HTML.
It seems that the current version (3.0.0) does not process header blocks as is described in the documentation:
The code for this appears to be implemented on line 314 of core.py and works by searching for good blocks that come after the short header block by looking at the
class_type
attribute. However, at this point in the code, theclass_type
of each paragraph has only been initialized and not copied fromcf_class
, so when searching for a good block withinmax_heading_distance
characters, a good block will never be found, asclass_type
has not been copied for subsequent paragraphs.Here is a simple dummy HTML file that I have created to provide an example of how a short block that comes between a short header and a good block will be classified as "bad". However, had the header block been labeled as "neargood" from the preprocessing step, the short block would have been labeled as "good".
Is this preprocessing of the header blocks as described above intended? If so, then I think the documentation should be updated to reflect this change. Otherwise, it seems a simple loop over the paragraphs copying
cf_class
toclass_type
(as was done in version 2.2.0) would suffice to fix the issue.The text was updated successfully, but these errors were encountered: