GitHub - mikemccabe/analyze_ocr: Parse OCR result files for pagenos, tables of contents, etc.

fonts
.gitignore
README
analyze_ocr.php
analyze_ocr.py
color.py
diff_match_patch.py
extract_sorted.py
find_header_footer.py
find_pagenos.py
font.py
iabook.py
interval.py
make_toc.py
rnums.py
tuples.py
visualize.py
windowed_iterator.py


Some code for analyzing OCR'd documents.  It's currently pretty
specific to Internet Archive OCR'd books, but it may be generalizable.

Entry point: analyze_ocr.py - run this against an Internet Archive scanned book.

Functionality: find headers/footers, page numbers, tables of contents.
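
To give a feel for the page-number problem, here is a minimal sketch of one possible approach. It is not taken from find_pagenos.py. It assumes each scanned leaf has already been reduced to a list of OCR'd word strings, and the function name guess_page_offset and the sample data are hypothetical. Each numeric word votes for an offset between its printed value and the leaf index; the most popular offset is taken as the likely page numbering.

```python
from collections import Counter


def guess_page_offset(pages):
    """Guess the offset between printed page numbers and leaf indexes.

    pages: list of lists of word strings, one list per scanned leaf
    (an assumed input shape, not the repository's data model).
    Returns the most common (printed_number - leaf_index), or None
    if no numeric words were seen.
    """
    votes = Counter()
    for leaf_index, words in enumerate(pages):
        # Use a set so a number repeated on one leaf only votes once.
        for word in set(words):
            if word.isascii() and word.isdigit():
                votes[int(word) - leaf_index] += 1
    if not votes:
        return None
    offset, _count = votes.most_common(1)[0]
    return offset


# Hypothetical OCR output: a stray year on leaf 1, then real page numbers.
pages = [
    ["Title"],
    ["Preface", "1850"],
    ["1", "Chapter", "One"],
    ["text", "2"],
    ["3", "more", "text"],
]

offset = guess_page_offset(pages)
print(offset)
```

With the offset in hand, the printed page number for leaf i would be i + offset. A real book needs more than this (roman numerals in front matter, multiple numbering runs, OCR misreads), so treat this as an illustration only.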