Skip to content

Tesseract inserting additional alternative characters #1465

@jghare

Description

@jghare

Environment

  • Tesseract Version: <3.x stable and 4.0 alpha/beta> for English language text (using Fast and Best trained data) Command line

  • Platform: <Windows, version 64-bit and linux (Ubuntu/centos)-->

Current Behavior:

All versions of tesseract mentioned above tend to insert additional alternative characters (probably) whenever its not very confident. For example - if theres a "#" in the image file it often spits out "#H" or "A#" or even "AH"... Thats 2 characters for 1. Another example: If theres a "$" in the image then it gives "S$" or "$s" etc.. happens very often for other characters like 0,O,!,%,^ etc etc...
My application is very sensitive to length of the string hence an extra character throws many things off.
I am currently a command-line user and may later use it in Java whenever a wrapper for 4.0 becomes available.

Expected Behavior:

Expect tesseract to give out only one character for each character in the image. I should be able to control this behaviour using command line parameters (assuming there isn't one yet..). I have looked into the parameters but there are hundreds and mostly non-self-explanatory. Hence raising this as an issue. Also is it possible to get a "Character-level" HOCR output - current one is at word level granularity.

Suggested Fix:

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions