
Significant performance drop when running Tesseract OCR within a container #11617

Closed
lpetrus-tempest opened this issue Apr 8, 2025 · 3 comments
Labels
type: bug Something isn't working

lpetrus-tempest commented Apr 8, 2025

Description

We are experiencing a significant performance drop when running Tesseract OCR within a container using the gVisor "runsc" runtime.

When Tesseract is run with the standard Docker "runc" runtime, the timing is as follows:

real    0m 3.83s
user    0m 10.35s
sys     0m 0.20s

However, the same Tesseract run with the gVisor "runsc" runtime is roughly 5x slower:

real    0m 19.60s
user    1m 4.21s
sys     0m 0.46s

We have found that the run time under the "runsc" runtime can be influenced by capping the parallelism of OpenMP (a library used by Tesseract) with the OMP_THREAD_LIMIT environment variable:

OMP_THREAD_LIMIT=1:

real    0m 7.44s
user    0m 6.38s
sys     0m 0.71s

OMP_THREAD_LIMIT=2:

real    0m 6.83s
user    0m 10.56s
sys     0m 0.43s

OMP_THREAD_LIMIT=3:

real    0m 6.86s
user    0m 14.91s
sys     0m 0.38s

OMP_THREAD_LIMIT=4:

real    0m 19.60s
user    1m 3.79s
sys     0m 0.50s

Under "runsc", Tesseract appears to run fastest when OpenMP is limited to 2 threads. However, resource utilization is then suboptimal: the improvement in wall-clock time comes at the cost of considerably more total CPU time (6.38s of user time with 1 thread vs. 10.56s with 2 threads). The machine has 4 CPU threads.

The expected behaviour would be for gVisor to have only a minimal impact on the run time and resource utilization of Tesseract OCR.
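
For reference, a minimal sketch of how the per-limit timings above can be collected, assuming a shell inside the container from the reproduction steps below with test-0.png present:

# Sweep OMP_THREAD_LIMIT and time the same OCR job at each setting;
# OCR output is discarded so only the timing output is printed.
for n in 1 2 3 4; do
  echo "== OMP_THREAD_LIMIT=$n =="
  OMP_THREAD_LIMIT=$n time tesseract test-0.png - -l eng > /dev/null
done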

Steps to reproduce

  1. Configure Docker to use the "runsc" runtime (a configuration sketch follows these steps).
  2. Build a container image with Tesseract OCR installed, using the following Dockerfile:
FROM alpine:latest

RUN apk add --no-cache qpdf tesseract-ocr tesseract-ocr-data-slk tesseract-ocr-data-eng

ARG user=app
ARG group=app
ARG uid=1000
ARG gid=1000
RUN addgroup -g ${gid} ${group}
RUN adduser -G ${group} -u ${uid} -s /bin/sh -h /app ${user} -D

WORKDIR /app

# Switch to user
USER ${uid}:${gid}

ENTRYPOINT ["/bin/sh"]
  3. Execute Tesseract OCR within the container using the following command line:
OMP_THREAD_LIMIT=4 time tesseract test-0.png - -l eng

Note: test-0.png can be any screenshot containing English text.
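
For step 1, a minimal sketch of registering the "runsc" runtime with Docker and running the reproduction. The runsc binary path and the image tag "tesseract-bench" are assumptions (gVisor's "runsc install" helper can write a similar runtime entry):

# Register runsc as a Docker runtime and restart the daemon. If a daemon.json
# already exists, merge this stanza into it instead of overwriting it.
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "runtimes": {
    "runsc": {
      "path": "/usr/local/bin/runsc"
    }
  }
}
EOF
sudo systemctl restart docker

# Build the image from the Dockerfile above and run the reproduction command,
# bind-mounting the test screenshot into the container's working directory.
docker build -t tesseract-bench .
docker run --rm --runtime=runsc -v "$PWD/test-0.png:/app/test-0.png:ro" \
  tesseract-bench -c 'OMP_THREAD_LIMIT=4 time tesseract test-0.png - -l eng > /dev/null'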

runsc version

runsc version release-20250319.0
spec: 1.1.0-rc.1

docker version (if using docker)

Client: Docker Engine - Community
 Version:           27.5.1
 API version:       1.47
 Go version:        go1.22.11
 Git commit:        9f9e405
 Built:             Wed Jan 22 13:41:48 2025
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          27.5.1
  API version:      1.47 (minimum version 1.24)
  Go version:       go1.22.11
  Git commit:       4c9b3b0
  Built:            Wed Jan 22 13:41:48 2025
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.7.25
  GitCommit:        bcc810d6b9066471b0b6fa75f557a15a1cbf31bb
 runc:
  Version:          1.2.4
  GitCommit:        v1.2.4-0-g6c52b3f
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

uname

Linux docker 6.8.0-55-generic #57-Ubuntu SMP PREEMPT_DYNAMIC Wed Feb 12 23:42:21 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux

kubectl (if using Kubernetes)

repo state (if built from source)

No response

runsc debug logs (if available)

lpetrus-tempest added the "type: bug" label on Apr 8, 2025
konstantin-s-bogom (Member) commented

This looks like the same issue as #11431, assuming you have left the --platform flag as default or set it to systrap.

To confirm, see if you can reproduce this using --platform=kvm.
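
One way to do that is to add a second runtime entry to /etc/docker/daemon.json that passes --platform=kvm as a runtime argument, then restart the daemon; the entry name "runsc-kvm" and the binary path below are illustrative:

{
  "runtimes": {
    "runsc": { "path": "/usr/local/bin/runsc" },
    "runsc-kvm": {
      "path": "/usr/local/bin/runsc",
      "runtimeArgs": ["--platform=kvm"]
    }
  }
}

Then run the container with --runtime=runsc-kvm (the KVM platform needs access to /dev/kvm on the host).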

lpetrus-tempest (Author) commented

We did not change the platform flag. The issue also manifests with the kvm and systrap platforms; timings for both (and for no gVisor) are below.

No gVisor runtime:

OMP_THREAD_LIMIT not set:

real    0m 3.79s
user    0m 10.33s
sys     0m 0.17s

OMP_THREAD_LIMIT=4

real    0m 3.75s
user    0m 10.22s
sys     0m 0.17s

OMP_THREAD_LIMIT=3

real    0m 4.01s
user    0m 8.92s
sys     0m 0.14s

OMP_THREAD_LIMIT=2

real    0m 3.97s
user    0m 6.32s
sys     0m 0.16s

OMP_THREAD_LIMIT=1

real    0m 4.56s
user    0m 4.41s
sys     0m 0.13s

gVisor with the kvm platform:

OMP_THREAD_LIMIT not set:

real    0m 5.31s
user    0m 13.04s
sys     0m 0.15s

OMP_THREAD_LIMIT=4

real    0m 5.62s
user    0m 13.75s
sys     0m 0.12s

OMP_THREAD_LIMIT=3

real    0m 5.51s
user    0m 10.79s
sys     0m 0.19s

OMP_THREAD_LIMIT=2

real    0m 5.65s
user    0m 8.38s
sys     0m 0.16s

OMP_THREAD_LIMIT=1

real    0m 6.41s
user    0m 6.04s
sys     0m 0.19s

gVisor with the systrap platform:

OMP_THREAD_LIMIT not set:

real    0m 19.74s
user    1m 4.82s
sys     0m 0.47s

OMP_THREAD_LIMIT=4

real    0m 19.41s
user    1m 3.69s
sys     0m 0.46s

OMP_THREAD_LIMIT=3

real    0m 5.94s
user    0m 12.01s
sys     0m 0.56s

OMP_THREAD_LIMIT=2

real    0m 6.63s
user    0m 10.15s
sys     0m 0.53s

OMP_THREAD_LIMIT=1

real    0m 7.18s
user    0m 6.17s
sys     0m 0.67s

konstantin-s-bogom (Member) commented

It looks like the KVM platform is behaving as expected, so this is indeed the same issue as #11431. The fact that with KVM OMP_THREAD_LIMIT=4 is slightly slower than 3 is not necessarily a bug: the gVisor sentry may be doing work in the background while user threads are executing, so with 4 OpenMP threads more than 4 threads end up sharing the 4 CPU cores, which loses efficiency.
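
As a rough illustration of that point, capping OpenMP one thread below the CPU count leaves headroom for the sentry's background work, which is consistent with the KVM numbers above where OMP_THREAD_LIMIT=3 is slightly faster than 4 (the image name and mount are the same assumptions as in the reproduction sketch):

# Cap OpenMP one thread below the CPU count visible inside the sandbox,
# leaving a hardware thread free for the sentry's background work.
docker run --rm --runtime=runsc -v "$PWD/test-0.png:/app/test-0.png:ro" \
  tesseract-bench -c 'OMP_THREAD_LIMIT=$(( $(nproc) - 1 )) time tesseract test-0.png - -l eng > /dev/null'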

Please chime in and help out in #11431; I've gone into more detail there about what I believe the issue is related to.
