I have two versions of the same book:
- an EPUB version
- an HOCR version created by tesseract from scanned images (TIFF files), which I want to convert to a searchable PDF file (page images with a transparent text layer); a rough sketch of that step follows this list
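for context, a minimal sketch of the "page image plus transparent text layer" step. this is an assumption-heavy sketch, not a finished tool: it assumes reportlab and Pillow are installed, one image plus one hOCR file per page, that the hOCR parses as well-formed XML, and the filenames are hypothetical:

```python
# Sketch: build one searchable PDF page by drawing the scanned image and
# overlaying the hOCR words as invisible text (PDF text render mode 3).
import re
from xml.etree import ElementTree as ET

from PIL import Image
from reportlab.pdfgen import canvas

def hocr_words(hocr_path):
    """Yield (text, x0, y0, x1, y1) for every ocrx_word element in an hOCR file."""
    for elem in ET.parse(hocr_path).iter():
        if elem.get("class") == "ocrx_word":
            m = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", elem.get("title", ""))
            text = "".join(elem.itertext()).strip()
            if m and text:
                x0, y0, x1, y1 = map(int, m.groups())
                yield text, x0, y0, x1, y1

def page_to_pdf(image_path, hocr_path, pdf_path):
    width, height = Image.open(image_path).size        # pixels used directly as PDF points
    c = canvas.Canvas(pdf_path, pagesize=(width, height))
    c.drawImage(image_path, 0, 0, width=width, height=height)
    for text, x0, y0, x1, y1 in hocr_words(hocr_path):
        t = c.beginText()
        t.setTextRenderMode(3)                          # invisible but selectable/searchable
        t.setFont("Helvetica", max(y1 - y0, 1))
        t.setTextOrigin(x0, height - y1)                # hOCR origin is top-left, PDF is bottom-left
        t.textOut(text)
        c.drawText(t)
    c.showPage()
    c.save()

# hypothetical filenames:
# page_to_pdf("page-0001.tif", "page-0001.hocr", "page-0001.pdf")
```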
problem: tesseract makes many mistakes when recognizing text
bad solution: manually proofread the HOCR files
wanted solution: automatically fix the almost-correct text in the HOCR files using the correct text from the EPUB file, i.e. automatic proofreading of HOCR files against a known expected text
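the wanted solution needs the EPUB text as the reference, and an EPUB is just a ZIP archive of XHTML files, so extracting the plain text is straightforward. a minimal standard-library sketch (filename is hypothetical; reading the spine order from the OPF file would be more correct than archive order):

```python
# Sketch: pull plain text out of an EPUB (a ZIP archive of XHTML documents).
import zipfile
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        self.chunks.append(data)

def epub_text(epub_path):
    parts = []
    with zipfile.ZipFile(epub_path) as z:
        # Simplification: XHTML files are taken in archive order, not spine order.
        for name in z.namelist():
            if name.endswith((".xhtml", ".html", ".htm")):
                parser = _TextExtractor()
                parser.feed(z.read(name).decode("utf-8", errors="replace"))
                parts.append("".join(parser.chunks))
    return "\n".join(parts)

# hypothetical filename:
# reference_text = epub_text("book.epub")
```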
this would also require alignment of similar texts (sequence alignment), a problem I have already encountered (and somewhat solved) in my translate-richtext project, where I use a character-level diff to align two similar texts:
git diff --word-diff=color --word-diff-regex=. --no-index \
  $(readlink -f translation.joined.txt) \
  $(readlink -f translation.splitted.txt) |
  sed -E $'s/\e\[32m[^\e]*\e\[m//g; s/\e\[[0-9;:]*[a-zA-Z]//g' |
  tail -n +6 >translation.aligned.txt

other possible solutions: passim and text-pair
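the same character-level alignment can also be done directly in Python with difflib, which avoids parsing git's colored output. a minimal sketch that aligns the OCR text against the reference text and collects proposed corrections (function and variable names are hypothetical):

```python
# Sketch: character-level alignment of OCR text against the reference text
# using difflib.SequenceMatcher (standard library).
import difflib

def proposed_corrections(ocr_text, reference_text):
    """Return a list of (ocr_slice, reference_slice, ocr_span) replacements."""
    sm = difflib.SequenceMatcher(None, ocr_text, reference_text, autojunk=False)
    corrections = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag != "equal":
            corrections.append((ocr_text[i1:i2], reference_text[j1:j2], (i1, i2)))
    return corrections

# Example: "rn" misread as "m", "l" misread as "1"
for bad, good, span in proposed_corrections("the modem wor1d", "the modern world"):
    print(f"{span}: {bad!r} -> {good!r}")
```

note that difflib can get slow on whole-book inputs (worst case quadratic), so aligning page by page or chapter by chapter would probably be needed; passim and text-pair address the same alignment problem at larger scale.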
the alignment of similar texts can produce new mistakes, so it should be easy to manually inspect and fix the alignments (semi-automatic solution)
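one possible shape for the semi-automatic part: print every proposed correction with a bit of context and ask for confirmation. this sketch builds on the hypothetical proposed_corrections() above; writing the accepted corrections back into the hOCR word elements is the remaining step:

```python
# Sketch: interactive review of proposed corrections (semi-automatic proofreading).
def review(ocr_text, corrections, context=20):
    accepted = []
    for bad, good, (i1, i2) in corrections:
        before = ocr_text[max(0, i1 - context):i1]
        after = ocr_text[i2:i2 + context]
        print(f"...{before}[{bad} -> {good}]{after}...")
        if input("apply? [y/N] ").strip().lower() == "y":
            accepted.append((bad, good, (i1, i2)))
    return accepted
```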
the solution should be implemented as a Python script, to make it easy to customize
cross-posted to Reddit and Stack Exchange