@mdbraber mdbraber commented Jan 30, 2022

  • Improve fixing spaces when seeing similar consecutive characters
  • Add argument to force fixing spaces
  • Strip possible newlines from end result

result += first_page[p]
t += 1

# if the current character is the same as the previous
Owner

Why is there a need for this?

Author

@mdbraber mdbraber Feb 1, 2022

Suppose I'm looking for the title "Een scheiding" and the page contains the string "Armoede\nEen scheiding". The code iterates over the page characters and builds "e E" as a temporary result (replacing the newline with a presumed space), but when it hits the following 'e' it decides this is not the title we're looking for ("eee" != "een"). That conclusion is wrong: we can still be on track to find the title, but we have to shift the window one character to the right and decide again, which is what this change does (by not doing t += 1). Maybe I'm overlooking something, but this solved the use case for me (see the trouw.nl PDF I sent separately as a test).
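
To make the window-shift idea concrete, here is a minimal standalone sketch. It is not the code from this PR: the function name `find_title` and the brute-force restart strategy are assumptions chosen for illustration. The real change keeps a single pass and simply skips `t += 1`. The sketch ignores case and treats any run of whitespace, including a newline, as one presumed space.

```python
from typing import Optional


def find_title(page: str, title: str) -> Optional[str]:
    """Return the text in `page` that matches `title`, ignoring case and
    whitespace differences, or None if there is no match.

    Hypothetical helper for illustration; not the PR's implementation.
    """
    # Compare against the title with all whitespace removed.
    target = "".join(title.split()).lower()
    if not target:
        return None

    start = 0
    while start < len(page):
        t = 0          # index into target
        p = start      # index into page
        result = ""
        while p < len(page) and t < len(target):
            ch = page[p]
            if ch.isspace():
                # Newlines and spaces become a single presumed space.
                if result and not result.endswith(" "):
                    result += " "
                p += 1
                continue
            if ch.lower() != target[t]:
                break  # mismatch: give up on this window only
            result += ch
            t += 1
            p += 1
        if t == len(target):
            # Strip possible stray whitespace from the end result.
            return result.strip()
        # Shift the window one character to the right and try again,
        # instead of discarding the characters that follow.
        start += 1
    return None


print(find_title("Armoede\nEen scheiding", "Een scheiding"))  # Een scheiding
```

With the example from this comment, the window starting at the final "e" of "Armoede" builds "e E" and then fails on the next "e", but the window starting one position later matches the full title.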
