
Conversation

elfkuzco (Collaborator) commented Nov 1, 2025

Rationale

This PR adds support for measuring resources used by the scraper. For the CPU stats, it uses an Exponentially Weighted Moving Average (EWMA) to track the percentage of CPU used, along with the maximum CPU percentage observed.
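For illustration, a minimal sketch of such an EWMA update (the smoothing factor and function names are assumptions for this example, not necessarily what the PR uses):

```python
# Sketch only: each new CPU sample nudges the running average, and the peak
# is tracked separately. alpha=0.25 is an assumed smoothing factor.
def update_cpu_stats(
    current_percent: float,
    ewma_percent: float | None,
    max_percent: float,
    alpha: float = 0.25,
) -> tuple[float, float]:
    """Return updated (ewma, max) CPU percentages after a new sample."""
    if ewma_percent is None:
        ewma_percent = current_percent  # first sample seeds the average
    else:
        ewma_percent = alpha * current_percent + (1 - alpha) * ewma_percent
    return ewma_percent, max(max_percent, current_percent)
```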

Also, max disk usage is computed by summing the sizes of the files in the scraper's mount directory.
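A minimal sketch of that mount-size computation, assuming a plain directory walk (names are illustrative, not necessarily the PR's helpers):

```python
from pathlib import Path


def mount_disk_usage(mount_dir: Path) -> int:
    """Sum the sizes (in bytes) of regular files under the mount directory."""
    return sum(p.stat().st_size for p in mount_dir.rglob("*") if p.is_file())
```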
[Screenshots: CPU and disk usage stats as shown in the UI]

Changes

  • add functions to compute CPU and Disk stats
  • show stats in UI

This closes #1423

elfkuzco (Collaborator, Author) commented Nov 1, 2025

Implementation of disk usage is yet to be added as it would rely on the approval of docker/docker-py#3370

elfkuzco self-assigned this on Nov 2, 2025
benoit74 (Collaborator) commented Nov 3, 2025

As discussed on Slack, I propose we wait a few days for the https://github.com/docker/docker-py/ maintainers to give us an answer.

If they don't reply soon enough, I propose two possible plans:

  • consider only the mounts' size in the disk usage stat for the time being, since this is in general the "core" of disk usage, at least for big ZIMs, which are the primary concern
  • reimplement our own very limited Docker SDK for only the operations we need; this could make sense because the Python Docker SDK seems to receive little attention from Docker, plus we only use a few methods and could plug directly into the REST API, just like the SDK does, for those few methods (see the sketch after this list)
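For the second option, a rough sketch of what talking to the Engine REST API directly over the local socket could look like (the /containers/{id}/stats endpoint is part of the documented Engine API; the class and function names are illustrative, and API version negotiation and error handling are omitted):

```python
import http.client
import json
import socket


class DockerSocketConnection(http.client.HTTPConnection):
    """HTTP connection over the local Docker socket instead of TCP."""

    def __init__(self, socket_path: str = "/var/run/docker.sock"):
        super().__init__("localhost")
        self.socket_path = socket_path

    def connect(self):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.connect(self.socket_path)
        self.sock = sock


def get_container_stats(container_id: str) -> dict:
    """Fetch a single (non-streaming) stats sample for a container."""
    conn = DockerSocketConnection()
    conn.request("GET", f"/containers/{container_id}/stats?stream=false")
    return json.loads(conn.getresponse().read())
```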

benoit74 (Collaborator) commented

@elfkuzco looks like I was right to be concerned about not getting any feedback on your upstream PR.

Please advise which plan B (among the two I've proposed, or one you propose yourself) makes more sense to you, so that we can move on and have the CPU measure and at least a first estimate of disk used.

elfkuzco (Collaborator, Author) commented

> consider only mounts size in disk usage stat for the time being, since this is in general the "core" of disk usage, at least for big ZIMs which are primary concern

This would be simpler to implement. Plus, given that almost everything is written to the mount point, I don't know if there's going to be any real metric obtained from the writable layer; possibly .pyc or __pycache__ files, but those really shouldn't be big enough to matter, right?

benoit74 (Collaborator) commented

Let's go for this alternative: consider only the mount point for the time being + open an issue about the fact that we might want to better track disk usage. The goal would be not only to capture the writable layer (which in general is supposed to be small, but this is not the case on all scrapers, not even speaking about bugs) but also the image size itself (which then gives a slight overestimation of disk usage, since the image is shared across tasks).
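For that follow-up issue: if I recall the Engine API docs correctly, GET /containers/{id}/json with size=1 reports SizeRw (writable layer) and SizeRootFs (writable layer plus image). A rough sketch, reusing the hypothetical DockerSocketConnection from the sketch above:

```python
def get_container_sizes(container_id: str) -> tuple[int, int]:
    """Return (writable layer size, total rootfs size) in bytes."""
    conn = DockerSocketConnection()
    conn.request("GET", f"/containers/{container_id}/json?size=1")
    details = json.loads(conn.getresponse().read())
    return details.get("SizeRw", 0), details.get("SizeRootFs", 0)
```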

elfkuzco (Collaborator, Author) commented

Updated PR description.

codecov bot commented Nov 14, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 83.39%. Comparing base (921dc27) to head (b3a9a34).
⚠️ Report is 4 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1491      +/-   ##
==========================================
+ Coverage   83.38%   83.39%   +0.01%     
==========================================
  Files          91       91              
  Lines        4399     4403       +4     
  Branches      470      470              
==========================================
+ Hits         3668     3672       +4     
  Misses        606      606              
  Partials      125      125              

☔ View full report in Codecov by Sentry.

benoit74 (Collaborator) left a comment

Small remarks, plus I need more data to confirm that the average we are computing is really close to the average CPU consumption. Since we have tasks which might run for hours, I doubt an EWMA with alpha 0.25, updated every minute, will really represent something close to the average. I might be wrong; at least I need to be convinced 😄
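To illustrate the concern with a toy example (illustrative only, not the PR's code): for a task that runs at 90% CPU for its first hour and 10% for its second, a per-minute EWMA with alpha 0.25 ends up tracking the recent load rather than the overall average.

```python
samples = [90.0] * 60 + [10.0] * 60  # one CPU % sample per minute, two hours
alpha = 0.25

ewma = samples[0]
for sample in samples[1:]:
    ewma = alpha * sample + (1 - alpha) * ewma

true_average = sum(samples) / len(samples)
print(f"EWMA: {ewma:.1f}%  true average: {true_average:.1f}%")
# EWMA ends essentially at 10%, while the true average is 50%.
```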

elfkuzco requested a review from benoit74 on November 17, 2025
benoit74 (Collaborator) left a comment

LGTM

elfkuzco merged commit 191e80c into main on Nov 18, 2025
10 checks passed
elfkuzco deleted the measure-resources branch on November 18, 2025

Linked issue: Measure and report all tasks resources usage (#1423)