I have two versions of the same book:
- an EPUB version
- an HOCR version created by tesseract from scanned images (TIFF files), which I want to convert to a searchable PDF file (page images with a transparent text layer); a rough sketch of that step follows this list
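for context, a minimal sketch of the "page image plus transparent text layer" step. this is an assumption-heavy sketch, not a finished tool: it assumes reportlab and Pillow are installed, one image plus one hOCR file per page, that the hOCR parses as well-formed XML, and the filenames are hypothetical:

```python
# Sketch: build one searchable PDF page by drawing the scanned image and
# overlaying the hOCR words as invisible text (PDF text render mode 3).
import re
from xml.etree import ElementTree as ET

from PIL import Image
from reportlab.pdfgen import canvas

def hocr_words(hocr_path):
    """Yield (text, x0, y0, x1, y1) for every ocrx_word element in an hOCR file."""
    for elem in ET.parse(hocr_path).iter():
        if elem.get("class") == "ocrx_word":
            m = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", elem.get("title", ""))
            text = "".join(elem.itertext()).strip()
            if m and text:
                x0, y0, x1, y1 = map(int, m.groups())
                yield text, x0, y0, x1, y1

def page_to_pdf(image_path, hocr_path, pdf_path):
    width, height = Image.open(image_path).size        # pixels used directly as PDF points
    c = canvas.Canvas(pdf_path, pagesize=(width, height))
    c.drawImage(image_path, 0, 0, width=width, height=height)
    for text, x0, y0, x1, y1 in hocr_words(hocr_path):
        t = c.beginText()
        t.setTextRenderMode(3)                          # invisible but selectable/searchable
        t.setFont("Helvetica", max(y1 - y0, 1))
        t.setTextOrigin(x0, height - y1)                # hOCR origin is top-left, PDF is bottom-left
        t.textOut(text)
        c.drawText(t)
    c.showPage()
    c.save()

# hypothetical filenames:
# page_to_pdf("page-0001.tif", "page-0001.hocr", "page-0001.pdf")
```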
problem: tesseract makes many mistakes when recognizing text
bad solution: manually proofread the HOCR files
wanted solution: automatically fix the almost-correct text in the HOCR files using the correct text from the EPUB file, i.e. automatic proofreading of HOCR files against a known expected text
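the wanted solution needs the EPUB text as the reference, and an EPUB is just a ZIP archive of XHTML files, so extracting the plain text is straightforward. a minimal standard-library sketch (filename is hypothetical; reading the spine order from the OPF file would be more correct than archive order):

```python
# Sketch: pull plain text out of an EPUB (a ZIP archive of XHTML documents).
import zipfile
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        self.chunks.append(data)

def epub_text(epub_path):
    parts = []
    with zipfile.ZipFile(epub_path) as z:
        # Simplification: XHTML files are taken in archive order, not spine order.
        for name in z.namelist():
            if name.endswith((".xhtml", ".html", ".htm")):
                parser = _TextExtractor()
                parser.feed(z.read(name).decode("utf-8", errors="replace"))
                parts.append("".join(parser.chunks))
    return "\n".join(parts)

# hypothetical filename:
# reference_text = epub_text("book.epub")
```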
this would also require alignment of similar texts (sequence alignment), a problem I have already encountered (and somewhat solved) in my translate-richtext project, where I use a character-level diff to align two similar texts:
git diff --word-diff=color --word-diff-regex=. --no-index \
  $(readlink -f translation.joined.txt) \
  $(readlink -f translation.splitted.txt) |
  sed -E $'s/\e\[32m[^\e]*\e\[m//g; s/\e\[[0-9;:]*[a-zA-Z]//g' |
  tail -n +6 >translation.aligned.txt

other possible solutions: passim and text-pair
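the same character-level alignment can also be done directly in Python with difflib, which avoids parsing git's colored output. a minimal sketch that aligns the OCR text against the reference text and collects proposed corrections (function and variable names are hypothetical):

```python
# Sketch: character-level alignment of OCR text against the reference text
# using difflib.SequenceMatcher (standard library).
import difflib

def proposed_corrections(ocr_text, reference_text):
    """Return a list of (ocr_slice, reference_slice, ocr_span) replacements."""
    sm = difflib.SequenceMatcher(None, ocr_text, reference_text, autojunk=False)
    corrections = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag != "equal":
            corrections.append((ocr_text[i1:i2], reference_text[j1:j2], (i1, i2)))
    return corrections

# Example: "rn" misread as "m", "l" misread as "1"
for bad, good, span in proposed_corrections("the modem wor1d", "the modern world"):
    print(f"{span}: {bad!r} -> {good!r}")
```

note that difflib can get slow on whole-book inputs (worst case quadratic), so aligning page by page or chapter by chapter would probably be needed; passim and text-pair address the same alignment problem at larger scale.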
the alignment of similar texts can produce new mistakes, so it should be easy to manually inspect and fix the alignments (semi-automatic solution)
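one possible shape for the semi-automatic part: print every proposed correction with a bit of context and ask for confirmation. this sketch builds on the hypothetical proposed_corrections() above; writing the accepted corrections back into the hOCR word elements is the remaining step:

```python
# Sketch: interactive review of proposed corrections (semi-automatic proofreading).
def review(ocr_text, corrections, context=20):
    accepted = []
    for bad, good, (i1, i2) in corrections:
        before = ocr_text[max(0, i1 - context):i1]
        after = ocr_text[i2:i2 + context]
        print(f"...{before}[{bad} -> {good}]{after}...")
        if input("apply? [y/N] ").strip().lower() == "y":
            accepted.append((bad, good, (i1, i2)))
    return accepted
```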
the solution should be implemented as a Python script, to make it easy to customize
cross-posted to Reddit and Stack Exchange