Add CodSpeed CI with end-to-end primer benchmarks (flask + black) and a cold start test using initial imports#3079
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #3079 +/- ##
=======================================
Coverage 93.60% 93.60%
=======================================
Files 92 92
Lines 11364 11364
=======================================
Hits 10637 10637
Misses 727 727
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
Replaces the single Flask benchmark with three primer projects of progressively increasing size, all pinned to the SHAs used by pylint's nightly primer run: - flask (small, ~50 .py files in src/flask) — kept, with both parse-only (isolates rebuilder cost) and full pylint-shaped traversal - black (medium, ~250 files across src/black + src/blackd + src/blib2to3) — classic AST handling, deep class hierarchies - pandas/core (large) — heavy brain_dataclasses / brain_numpy / brain_namedtuple_enum usage; stresses the brain plugins more than any other target The microbenchmarks landed in #3079 show >40% StdDev on CodSpeed CI because they complete in microseconds. The end-to-end benches each take seconds to minutes, so per-run noise becomes a small fraction of the measurement. The fixture uses 'git clone --no-checkout --filter=blob:none --no-tags --single-branch' plus 'sparse-checkout' so the pandas clone doesn't fetch the whole monorepo — just the pandas/core subtree. Refactor: project metadata moved into a _Project NamedTuple keyed by short name, so adding more primer targets is a one-block addition. Local smoke run (one iteration each, no --codspeed): 4 passed in 2:05. Refs #2014. Stacked on #3079.
Adds ~5-10 µs of busywork (sum(range(1000))) at the very top of every NodeNG.infer() call. Since infer() is called O(1M) times during a pylint-shaped traversal, this adds several seconds to each end-to-end benchmark. The microbenchmarks should also show a clear regression. Purpose: verify that CodSpeed CI reports the perf delta vs the baseline (the merged-ahead state of #3079) and renders it correctly on the PR. This is an experiment, not a real change. The branch will be discarded after the comparison is observed. Refs #2014 (the profile-first feasibility study).
9986de3 to
28184b1
Compare
Adds ~5-10 µs of busywork (sum(range(1000))) at the very top of every NodeNG.infer() call. Since infer() is called O(1M) times during a pylint-shaped traversal, this adds several seconds to each end-to-end benchmark. The microbenchmarks should also show a clear regression. Purpose: verify that CodSpeed CI reports the perf delta vs the baseline (the merged-ahead state of #3079) and renders it correctly on the PR. This is an experiment, not a real change. The branch will be discarded after the comparison is observed. Refs #2014 (the profile-first feasibility study).
Congrats! CodSpeed is installed 🎉
You will start to see performance impacts in the reports once the benchmarks are run from your default branch.
|
28184b1 to
f8520dd
Compare
Adds ~5-10 µs of busywork (sum(range(1000))) at the very top of every NodeNG.infer() call. Since infer() is called O(1M) times during a pylint-shaped traversal, this adds several seconds to each end-to-end benchmark. The microbenchmarks should also show a clear regression. Purpose: verify that CodSpeed CI reports the perf delta vs the baseline (the merged-ahead state of #3079) and renders it correctly on the PR. This is an experiment, not a real change. The branch will be discarded after the comparison is observed. Refs #2014 (the profile-first feasibility study).
2d391a6 to
2ef6d50
Compare
Adds ~5-10 µs of busywork (sum(range(1000))) at the very top of every NodeNG.infer() call. Since infer() is called O(1M) times during a pylint-shaped traversal, this adds several seconds to each end-to-end benchmark. The microbenchmarks should also show a clear regression. Purpose: verify that CodSpeed CI reports the perf delta vs the baseline (the merged-ahead state of #3079) and renders it correctly on the PR. This is an experiment, not a real change. The branch will be discarded after the comparison is observed. Refs #2014 (the profile-first feasibility study).
… MERGE Eagerly imports nine heavyweight stdlib modules (pydoc, multiprocessing, xmlrpc.server, xml.dom.minidom, unittest.mock, email.mime.multipart, wsgiref.simple_server, http.server, concurrent.futures) at the top of astroid/__init__.py. Local measurement: `import astroid` goes from ~76 ms to ~198 ms (+120 ms per cold start). The cold-lint bench (test_bench_endtoend_cold_lint) shells out `python -m pylint` per iteration and re-pays import cost every run, so it should be the loudest reporter; the in-process benches import once at module load and will only see this on the first iteration. Purpose: verify that CodSpeed's walltime workflow (cold-lint) reports the startup delta vs the baseline (#3079) and that it shows up distinctly from the in-process `infer()` regression added in 19f5c5b. This is an experiment, not a real change. The branch will be discarded after the comparison is observed. Refs #2014 (the profile-first feasibility study).
cdce8p
left a comment
There was a problem hiding this comment.
Left some comments. Similar to the pylint primer, this isn't really my area of expertise, so might be good if someone else could take a look at it as well.
| permissions: | ||
| contents: read | ||
| id-token: write |
There was a problem hiding this comment.
Is id-token necessary for CodSpeedHQ/action? If that's the case, could you move it to the job itself?
| - name: Set up Python 3.13 | ||
| id: python | ||
| uses: actions/setup-python@v6.2.0 | ||
| with: | ||
| python-version: "3.13" | ||
| check-latest: true |
There was a problem hiding this comment.
Any particular reason to use 3.13 over 3.14? Saw that we recently switch the default in pylint, guess we'll do the same for astroid as well soon.
| # astroid with the version under test. pylint is invoked as a | ||
| # subprocess by the cold-start benchmark. | ||
| pip install pylint pytest-codspeed | ||
| pip install -e . |
There was a problem hiding this comment.
| pip install -e . | |
| pip install . |
I believe editable installs might even be unnecessary here. Yes, we use them for the other workflows as well, however the astroid files aren't modified in any way after the install.
|
|
||
| permissions: | ||
| contents: read | ||
| id-token: write |
There was a problem hiding this comment.
Similar to earlier, move this to the job itself.
| - name: Set up Python 3.13 | ||
| id: python | ||
| uses: actions/setup-python@v6.2.0 | ||
| with: | ||
| python-version: "3.13" | ||
| check-latest: true |
| run: | | ||
| python -m pip install -U pip | ||
| pip install pytest-codspeed | ||
| pip install -e . |
There was a problem hiding this comment.
| pip install -e . | |
| pip install . |
| is set in ``pyproject.toml``, even a warning would abort collection). | ||
| """ | ||
|
|
||
| from __future__ import annotations |
| import importlib.util | ||
|
|
||
| if importlib.util.find_spec("pytest_codspeed") is None: | ||
| collect_ignore_glob = ["test_bench_*.py"] |
There was a problem hiding this comment.
Guess that will work, though in other test files we usually use a try ... except ImportError guard for it.
try:
import pytest_codespeed
except ImportError:
collect_ignore_glob = ["test_bench_*.py"]| - ``test_bench_endtoend_cold_lint`` — shells out | ||
| ``python -m pylint <minimal module>`` per iteration. ~98 % of wall | ||
| time is startup (Python import, pylint init, ``import astroid``, | ||
| brain-plugin registration); only ~2 % is actual linting work on the | ||
| one-line target. This captures cold-start cost that the in-process | ||
| benches below miss: within a single pytest session ``astroid`` is | ||
| imported once at module load and stays in ``sys.modules``, so | ||
| optimizations like deferring brain-plugin registration are invisible | ||
| to those benches. |
There was a problem hiding this comment.
I worry that we might end up optimizing the wrong thing. Yes, startup time is important as well, however with lazy imports begin available in 3.15 there seems to be a focus across the ecosystem to use them everywhere. IMO we should be cautious with that when the time comes, especially for modules which get imported anyway.
That's a bit different for brains, etc. which aren't actually needed for some / most projects. Though this might only be a small fraction of the total import time.
There was a problem hiding this comment.
We did some lazy import in https://github.com/pylint-dev/astroid/pull/3062/changes, I think those one are useful, and would have been detected because the logging is very seldom used (only if there's C level warning during parsing).
|
Thank you for the review Marc ! Appreciate it. |
…t, bench scope) Review feedback from cdce8p on #3079: - Move 'id-token: write' from workflow-level permissions down to the jobs.benchmarks.permissions block in both CodSpeed workflows so the token is only minted for the job that actually uploads to CodSpeed. - Bump setup-python to 3.14 (matches pylint's new default; astroid will follow). - Drop the editable install in both workflows: nothing modifies the source tree after install, so 'pip install .' is enough. Updated the inline comment in the walltime workflow accordingly. - conftest.py: switch from importlib.util.find_spec to the try/except ImportError idiom used elsewhere in the test tree, and drop the unused 'from __future__ import annotations' (no annotations in this file). - test_bench_endtoend.py: add a scope note on the cold-start bench. The point is to *protect* targeted lazy imports for rare-path modules (brain plugins not used by every project; debug-only stdlib like pprint / logging deferred in #3062). Those function- local / TYPE_CHECKING imports work on every supported Python and do not depend on PEP 810 lazy imports landing in 3.15. The bench is explicitly *not* an argument for lazifying modules astroid imports unconditionally.
DanielNoord
left a comment
There was a problem hiding this comment.
I don't have much experience with CodSpeed, but most of it looks fine to me I guess?
| # For details: https://github.com/pylint-dev/astroid/blob/main/LICENSE | ||
| # Copyright (c) https://github.com/pylint-dev/astroid/blob/main/CONTRIBUTORS.txt | ||
|
|
||
| """End-to-end benchmarks: cold import + parse + walk + infer real projects. |
There was a problem hiding this comment.
A lot of this docstring will get outdated as soon as we start changing stuff. Is it really necessary to document this in this much detail?
| benchmark(_pylint_one_file, minimal_module) | ||
|
|
||
|
|
||
| # -- Flask: small, parse + walk_infer (parse isolates rebuilder cost). -- |
There was a problem hiding this comment.
In line comments like this also tend to get out of date
| # -- Flask: small, parse + walk_infer (parse isolates rebuilder cost). -- | ||
|
|
||
|
|
||
| def test_bench_endtoend_parse_flask(benchmark, flask_files: list[Path]) -> None: |
There was a problem hiding this comment.
No types for benchmark?
| timeout-minutes: 15 | ||
| permissions: | ||
| contents: read | ||
| id-token: write |
There was a problem hiding this comment.
Why is this necessary?
| runs-on: ubuntu-latest | ||
| permissions: | ||
| contents: read | ||
| id-token: write |
There was a problem hiding this comment.
Why is this necessary?
There was a problem hiding this comment.
Might not be in a public repo, I removed, let's see how it goes
There was a problem hiding this comment.
Look like it worked. I fixed your other remarks in be68e14
Three benchmarks based on pylint's primer corpus, with SHAs pinned to match what pylint sees nightly: - test_bench_endtoend_parse_flask (rebuilder hot path, ~50 .py) - test_bench_endtoend_walk_infer_flask (pylint-shaped traversal) - test_bench_endtoend_walk_infer_black (medium scale, ~250 .py) The benchmarks call astroid's API directly (ast_from_file + nodes_of_class + safe_infer on Call/Attribute/Name) so they mirror what a pylint checker does without depending on pylint. The fixture uses 'git clone --no-checkout --filter=blob:none --no-tags --single-branch' plus 'sparse-checkout' so only the declared source subdirs are fetched. Workflow uses 'mode: simulation' so the CodSpeed dashboard provides per-function attribution; 'CodSpeedHQ/action' is allowlisted in repo Actions settings. conftest.py uses 'collect_ignore_glob' to skip the directory cleanly when pytest-codspeed is not installed locally. Drops the 19 microbenchmarks the CodSpeed wizard generated: at microsecond scale they showed >40 % StdDev on CI, too noisy for regression detection, and their coverage is subsumed by the end-to-end benches. Local smoke (one iter each, no --codspeed): 3 passed in 11.6s. Supersedes #3022. Refs #2014.
…t, bench scope) Review feedback from cdce8p on #3079: - Move 'id-token: write' from workflow-level permissions down to the jobs.benchmarks.permissions block in both CodSpeed workflows so the token is only minted for the job that actually uploads to CodSpeed. - Bump setup-python to 3.14 (matches pylint's new default; astroid will follow). - Drop the editable install in both workflows: nothing modifies the source tree after install, so 'pip install .' is enough. Updated the inline comment in the walltime workflow accordingly. - conftest.py: switch from importlib.util.find_spec to the try/except ImportError idiom used elsewhere in the test tree, and drop the unused 'from __future__ import annotations' (no annotations in this file). - test_bench_endtoend.py: add a scope note on the cold-start bench. The point is to *protect* targeted lazy imports for rare-path modules (brain plugins not used by every project; debug-only stdlib like pprint / logging deferred in #3062). Those function- local / TYPE_CHECKING imports work on every supported Python and do not depend on PEP 810 lazy imports landing in 3.15. The bench is explicitly *not* an argument for lazifying modules astroid imports unconditionally.
Per Daniel's review on #3079: - Trim the module docstring to what stays true (what the file benches + primer-corpus link); the removed prose (startup percentages, PEP 810, StdDev figures) would rot as the benches change. - Drop the `# -- section --` divider comments; the test names already say what each group does. - Annotate the `benchmark` fixture as `BenchmarkFixture`, imported under TYPE_CHECKING since pytest-codspeed is an optional dependency. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
astroid is a public repo, so CodSpeed can upload results without authentication; the id-token: write permission (for OIDC auth) is not required. Removing it also lets the redundant job-level permissions block go, since the top-level `contents: read` already covers the job. Try without it per Daniel's "why is this necessary?" review on #3079; if uploads fail, re-add `id-token: write` at the job level. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
49bb76b to
2fb0a52
Compare
Per Daniel's review on #3079: - Trim the module docstring to what stays true (what the file benches + primer-corpus link); the removed prose (startup percentages, PEP 810, StdDev figures) would rot as the benches change. - Drop the `# -- section --` divider comments; the test names already say what each group does. - Annotate the `benchmark` fixture as `BenchmarkFixture`, imported under TYPE_CHECKING since pytest-codspeed is an optional dependency.
astroid is a public repo, so CodSpeed can upload results without authentication; the id-token: write permission (for OIDC auth) is not required. Removing it also lets the redundant job-level permissions block go, since the top-level `contents: read` already covers the job. Try without it per Daniel's "why is this necessary?" review on #3079; if uploads fail, re-add `id-token: write` at the job level.
2fb0a52 to
7e482f9
Compare
Adds CodSpeed CI to astroid with three end-to-end benchmarks for black and flask, alongside a cold start test to exercice the import time which is important because we often lint single file in parallel inside pre-commit.
Supersedes #3022.
Result can be seen in #3080 (but also on currently opened performance branches : #3069, #3046, #3048). For the lazy loading of brains, the result is not visible because the threshold is at 2% change (already close to noise). We won't see performance change in brains as we don't parse lib specific code in this benchmark.