Duplicate Characters in Output Stream #2738

@woodjohndavid

Please refer to the following link:

#2635

This concerns changes made to lstm_choices_mode.

Unless I misunderstand what these options are supposed to do, there appears to be a bug or an oversight. Please refer to this user group thread:

https://groups.google.com/forum/#!topic/tesseract-ocr/5tC6appoUgE

There seems to be no way to prevent the LSTM engine from including duplicate characters in the generated text and/or hOCR output. The thread above shows a clear example of this.

Surely there must be some way to force Tesseract to include only the highest confidence level choice of character when there are multiple possibilities.
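
To make the expectation concrete, here is a minimal Python sketch of the behavior I am hoping for. It is only an illustration: the data layout (a list of candidate characters with confidences for each character position) and the name `keep_best_choices` are made up for this example and are not Tesseract's API or its actual output format.

```python
from typing import List, Tuple

# Hypothetical structure: one list of (character, confidence) candidates
# per character position. This is NOT Tesseract's real output format.
Candidates = List[Tuple[str, float]]


def keep_best_choices(positions: List[Candidates]) -> str:
    """Build text from the highest-confidence candidate at each position."""
    best_chars = []
    for candidates in positions:
        if not candidates:
            continue  # nothing recognized at this position
        char, _confidence = max(candidates, key=lambda c: c[1])
        best_chars.append(char)
    return "".join(best_chars)


# Made-up candidates for a three-character word.
example = [
    [("c", 97.1)],
    [("a", 88.0), ("o", 10.5)],
    [("t", 95.3), ("l", 3.2)],
]
print(keep_best_choices(example))
```

In other words, when several candidates exist for one position, I would expect only the top one to reach the text or hOCR output, rather than all of them appearing as duplicates.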

Also, apologies if this is posted in the wrong place or duplicates an earlier post. I am new to Tesseract and still learning the ropes.

Thanks.
