bertsky (Contributor) commented May 7, 2025

To be used in conjunction with tesseract-ocr/tesseract#4420 – which hopefully will get merged in some form, ultimately.

There may be more functions that need exception conversion.

Example result:

15:36:15.156 ERROR ocrd.processor.base - Failure on page PHYS_0067: ELIST_ITERATOR::forward:Error:List would have returned a nullptr data pointer
Traceback (most recent call last):
  File "/data/ocr-d/ocrd_all/core/src/ocrd/processor/base.py", line 710, in process_workspace_handle_page_task
    task.result()
  File "/data/ocr-d/ocrd_all/core/src/ocrd/processor/base.py", line 124, in result
    return self.fn(*self.args, **self.kwargs)
  File "/data/ocr-d/ocrd_all/core/src/ocrd/processor/base.py", line 1157, in _page_worker
    _page_worker_processor.process_page_file(*input_files)
  File "/data/ocr-d/ocrd_all/core/src/ocrd/processor/base.py", line 809, in process_page_file
    result = self.process_page_pcgts(*input_pcgts, page_id=page_id)
  File "/data/ocr-d/ocrd_all/ocrd_tesserocr/ocrd_tesserocr/recognize.py", line 512, in process_page_pcgts
    self._process_existing_regions(regions, page_image, page_coords, pcgts.mapping)
  File "/data/ocr-d/ocrd_all/ocrd_tesserocr/ocrd_tesserocr/recognize.py", line 995, in _process_existing_regions
    self._process_existing_lines(textlines, region_image, region_coords, mapping)
  File "/data/ocr-d/ocrd_all/ocrd_tesserocr/ocrd_tesserocr/recognize.py", line 1059, in _process_existing_lines
    self._process_existing_words(words, line_image, line_coords, mapping)
  File "/data/ocr-d/ocrd_all/ocrd_tesserocr/ocrd_tesserocr/recognize.py", line 1095, in _process_existing_words
    word_conf = self.tessapi.AllWordConfidences()
  File "tesserocr.pyx", line 2386, in tesserocr.PyTessBaseAPI.AllWordConfidences
RuntimeError: ELIST_ITERATOR::forward:Error:List would have returned a nullptr data pointer

That is, an assertion failed in libtesseract, which (with the Tesseract PR above applied) throws a C++ exception instead of a hard abort(). Cython can therefore convert it into a Python exception, which we can then catch and act on.

sirfz (Owner) commented May 8, 2025

This won't break older versions right?

bertsky (Contributor, Author) commented May 8, 2025

> This won't break older versions right?

I cannot imagine how. Adding `except +` only makes Cython convert C++ exceptions into Python exceptions (instead of leaving them to be handled by the C++ runtime). If libtesseract does not throw any, nothing should change. (And the cost of sacrificing the const specifier should also be negligible.)

Sorry for not creating a minimal PR in the first place, btw. Do you want me to rebase onto current master?

sirfz (Owner) commented May 8, 2025

Yes please rebase

bertsky force-pushed the convert-more-exceptions branch from 8de15c0 to 29cc0f3 on May 8, 2025, 08:20

sirfz (Owner) commented Oct 8, 2025

Hello @bertsky, is this good to move forward? Your original post implied that this change depends on a Tesseract issue (or is it not a hard dependency?).

bertsky (Contributor, Author) commented Oct 8, 2025

Hi @sirfz, yes, I can see no reason not to include this in tesserocr. (Like I said, it's a soft dependency – it only acts if libtesseract throws C++ exceptions, which so far it never did.) However, it will not be much use until the PR in Tesseract gets merged. Unfortunately, the discussion there seems to be stalled.

sirfz merged commit 609dfc3 into sirfz:master on Oct 8, 2025
6 checks passed
bmwiedemann pushed a commit to bmwiedemann/openSUSE that referenced this pull request Oct 10, 2025
https://build.opensuse.org/request/show/1310399
by user mia + dimstar_suse
- Drop unpin-cython.patch
  but make version range explicit in BuildRequires
- Update to 2.9.0
  * fix: update test, make it pass
    gh#sirfz/tesserocr#364
  * Improve stub file
    gh#sirfz/tesserocr#371
  * Convert more exceptions
    gh#sirfz/tesserocr#365
  * Drop python2 support
    gh#sirfz/tesserocr#377
  * All words error handling
    gh#sirfz/tesserocr#378