Skip to content

Commit f346082

Browse files
authored
[CI] diff driven test selection (#8077)
TLDR: Analyze PR's diff and filter out tests that aren't exercising what has changed, potentially cutting down runtime and expense by 95-99% most of the time. Detailed: Deepspeed's CI takes forever - most of the time burning $$ and wastes dev time for no reason as most changes require just a few tests to run. HF Transformers has a system to select which tests to run based on the diff of the PR - Sylvain Gugger wrote it many years ago since that repo has now probably thousands of tests. Deepspeed's CI isn't too bad but can easily take hours. So I asked Claude Opus 4.8 to replicate the system for Deepspeed. Please have a look. It looks super complicated, so I'm not sure how easy it'd be to maintain/operate unless we always use AI to continue maintaining it. I asked Claude to leave a detailed state file so that it or another model could pick it up where it left. I started with just the slowest costliest workload `.github/workflows/modal-torch-latest.yml` to see if it works well. If you're happy we can replicate it to the rest of the workloads. CC: @loadams, @tjruwase - please tag others if you think they would be helpful to discuss this. --------- Signed-off-by: Stas Bekman <stas@stason.org>
1 parent a163245 commit f346082

8 files changed

Lines changed: 1663 additions & 24 deletions

File tree

Lines changed: 306 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,306 @@
1+
# Diff-based CI test selection
2+
3+
This document explains the diff-driven test-selection system that lets some GPU CI
4+
workflows run only the tests a PR could affect, instead of the whole suite, while
5+
still satisfying Required status checks.
6+
7+
It currently drives **`modal-torch-latest`** (which runs `tests/unit/v1/` on
8+
[modal.com](https://modal.com) GPUs), but is built to drive more workflows from
9+
one config — see [Adding a workflow](#adding-a-new-workflow).
10+
11+
- [TL;DR](#tldr)
12+
- [Why](#why)
13+
- [Moving parts](#moving-parts)
14+
- [How a decision is made](#how-a-decision-is-made)
15+
- [Driving it (as a contributor)](#driving-it-as-a-contributor)
16+
- [Changing it (as a maintainer)](#changing-it-as-a-maintainer)
17+
- [Security model](#security-model)
18+
- [Failure modes & guarantees](#failure-modes--guarantees)
19+
- [FAQ / troubleshooting](#faq--troubleshooting)
20+
21+
22+
## TL;DR
23+
24+
- On a PR, `ci/tests_fetcher.py` diffs your branch against the base branch, traces
25+
the import graph from your changed files to the impacted tests, and writes the
26+
list to `ci/.test_selection/test_list.txt`.
27+
- The workflow runs pytest on just that list. `push` to `master` and manual runs
28+
always run everything.
29+
- It is **fail-safe**: anything it can't reason about safely → run the *full* suite.
30+
It never silently runs *fewer* tests than reality.
31+
- Preview locally:
32+
```bash
33+
python ci/tests_fetcher.py --base origin/master # what would CI run?
34+
python ci/tests_fetcher.py --base origin/master --explain # ...and why?
35+
```
36+
- Force a full run: put `[test all]` (or `[no filter]`) in a commit message.
37+
38+
39+
## Why
40+
41+
`modal-torch-latest` is a **Required** check, so it must report a status on every
42+
PR — which means we can't use GitHub's path filters (`on.<event>.paths`) to skip
43+
it, because a skipped Required job blocks merges. Instead, the job always runs but
44+
*selects* what to test: a docs-only PR resolves to "no impacted tests" and exits
45+
fast (still green), while a PR touching a leaf module runs only the tests that
46+
import it.
47+
48+
The design is a small, self-contained take on HuggingFace `transformers`'
49+
`utils/tests_fetcher.py`.
50+
51+
52+
## Moving parts
53+
54+
| File | Role |
55+
| --- | --- |
56+
| `.github/workflows/modal-torch-latest.yml` | The workflow: a `collect-tests` job (runs the fetcher) gating a `deploy` job (runs pytest on modal). |
57+
| `ci/tests_fetcher.py` | The selector. AST-parses the repo, builds an import graph, decides `all` / `subset` / `none`, writes the test-list file, emits a job summary. |
58+
| `ci/test_tests_fetcher.py` | Self-tests for the selector (pure stdlib; run in `collect-tests`). |
59+
| `ci/torch_latest.py` | The modal runner. Reads the test-list file and feeds it to pytest. |
60+
| `ci/.test_selection/test_list.txt` | The hand-off artifact (one pytest target per line). Git-ignored. |
61+
62+
### Job flow
63+
64+
```
65+
pull_request_target / push / workflow_dispatch
66+
67+
┌─────────────┴─────────────┐
68+
│ collect-tests │ (no secrets; AST-only)
69+
│ checkout PR head │
70+
│ restore ci/ from base │
71+
│ self-test the fetcher │
72+
│ run tests_fetcher.py │
73+
│ → mode (all|subset|none) │
74+
│ → upload test_list.txt │
75+
└─────────────┬──────────────┘
76+
│ needs:
77+
78+
┌────────────────────────────┐
79+
│ deploy │ (has modal/HF secrets)
80+
│ if mode != none (or manual │
81+
│ / collector failed) │
82+
│ checkout PR head │
83+
│ restore ci/ from base │
84+
│ download test_list.txt │
85+
│ modal run ci.torch_latest │
86+
└────────────────────────────┘
87+
```
88+
89+
`mode` controls `deploy`:
90+
91+
- **`none`**`deploy` is skipped. The Required status is still satisfied (a
92+
skipped dependent job counts as success here).
93+
- **`subset`**`deploy` runs pytest on exactly the impacted files.
94+
- **`all`**`deploy` runs the whole scope (`tests/unit/v1`).
95+
96+
97+
## How a decision is made
98+
99+
`TestSelector.select()` in `ci/tests_fetcher.py` runs these checks in order; the
100+
first that matches wins:
101+
102+
1. **No base ref** (push / manual) → `all`.
103+
2. **Base ref unresolvable**`all`.
104+
3. **No merge-base** with the base (e.g. shallow clone, unrelated history) → `all`.
105+
A diff here would be wrong, so we never narrow on it.
106+
4. **Commit message tag** `[test all]` / `[no filter]` anywhere on the branch → `all`.
107+
5. **A changed file matches a run-all glob** (`COMMON_RUN_ALL_GLOBS` +
108+
the workflow's `extra_run_all_globs`) → `all`. These are files too central or
109+
too dynamic to narrow safely: CI scripts, build system, `csrc/`, `op_builder/`,
110+
`accelerator/`, shared fixtures (`tests/unit/common.py`, `tests/conftest.py`,
111+
`pytest.ini`), and core runtime hubs (`deepspeed/__init__.py`,
112+
`deepspeed/runtime/engine.py`, `deepspeed/comm/**`, `deepspeed/accelerator/**`, …).
113+
6. **A deleted module is still imported** by a surviving file (a dangling import
114+
the graph can't follow) → `all`. A *clean* deletion (importers removed/updated
115+
in the same PR) does **not** trigger this.
116+
7. Otherwise, **narrow via the import graph** (below). If nothing is impacted →
117+
`none`; if everything is → `all`; else → `subset`.
118+
119+
### The import graph
120+
121+
- Nodes are Python files under the package roots: `deepspeed/**` and the `unit`
122+
test-helper package at `tests/unit/**`.
123+
- An edge `A → B` means "B imports A". The selector walks **backwards** from each
124+
changed file to every test that (transitively) imports it.
125+
- **Opaque hub modules** (`OPAQUE_MODULES`: `deepspeed`, `deepspeed.comm`,
126+
`deepspeed.accelerator`): almost every test imports `unit.common`, which imports
127+
these. Their `__init__.py` files eagerly pull in huge subtrees, so if treated as
128+
normal nodes *any* `deepspeed/**` change would fan out to the whole suite. We
129+
therefore don't expand their `__init__` imports; instead, changes to the hubs
130+
themselves are caught by the run-all globs in step 5.
131+
- **`conftest.py`** changes select every test under that conftest's directory.
132+
- **New test files** are selected directly (they have no importers yet).
133+
134+
### Dynamic edges (`DYNAMIC_EDGES`)
135+
136+
Some dependencies are wired at runtime (monkey-patching, plugin/registry lookup,
137+
JIT-loaded ops, `deepspeed.initialize()`-time `replace_module` injection), so a
138+
test can depend on code it never `import`s. `DYNAMIC_EDGES` is a curated map of
139+
`changed-file glob → extra test-path globs` that patches these blind spots. It is
140+
additive on top of the static graph (and is only consulted if step 5 didn't
141+
already short-circuit to `all`).
142+
143+
144+
## Driving it (as a contributor)
145+
146+
### Preview what CI will run
147+
148+
The fetcher is pure stdlib — no DeepSpeed/torch install needed, just `git`:
149+
150+
```bash
151+
# Selection for your branch vs. the base branch
152+
python ci/tests_fetcher.py --base origin/master
153+
cat ci/.test_selection/test_list.txt
154+
155+
# Explain *why* each test was selected (prints import chains)
156+
python ci/tests_fetcher.py --base origin/master --explain
157+
```
158+
159+
`--explain` output looks like:
160+
161+
```
162+
deepspeed/shared.py impacts:
163+
tests/unit/v1/test_shared.py <- deepspeed/shared.py
164+
tests/unit/v1/moe/test_moe.py <- tests/unit/v1/moe/test_moe.py <- deepspeed/shared.py
165+
```
166+
167+
### Escape hatches
168+
169+
- **Force the full suite for a push:** include `[test all]` (or `[no filter]`)
170+
anywhere in a commit message on the branch.
171+
- **Touch an infra file:** any change to a run-all glob runs everything.
172+
- **Found a missed test?** It's likely a runtime/dynamic dependency the static
173+
graph can't see — add a `DYNAMIC_EDGES` entry (see below) and/or report it.
174+
175+
176+
## Changing it (as a maintainer)
177+
178+
All knobs live in `ci/tests_fetcher.py`. After any change, run the self-tests:
179+
180+
```bash
181+
python ci/test_tests_fetcher.py # standalone
182+
# or: pytest ci/test_tests_fetcher.py
183+
```
184+
185+
### Add a run-all trigger
186+
187+
Add a glob to `TestSelector.COMMON_RUN_ALL_GLOBS` (shared across workflows) or to
188+
a specific workflow's `extra_run_all_globs`. Globs match repo-root-relative POSIX
189+
paths; `dir/**` matches everything under `dir/`.
190+
191+
### Add a dynamic edge
192+
193+
Add to `TestSelector.DYNAMIC_EDGES`:
194+
195+
```python
196+
DYNAMIC_EDGES = {
197+
# changed-file glob : test-path globs to pull in when it changes
198+
"deepspeed/module_inject/**": ("tests/unit/v1/moe/**",),
199+
}
200+
```
201+
202+
Keep entries conservative: too broad just wastes GPU time; too narrow misses
203+
coverage. Prefer fixing it here over widening a run-all glob when only a slice of
204+
tests is truly affected.
205+
206+
### Mark a module opaque
207+
208+
If a new universal hub starts fanning every change out to the whole suite, add it
209+
to `OPAQUE_MODULES` and (usually) add the hub file itself to the run-all globs so
210+
real changes to it still run everything.
211+
212+
### Add a new workflow
213+
214+
The engine is config-driven (`WORKFLOWS` registry of `WorkflowConfig`). To drive
215+
another workflow:
216+
217+
1. Add an entry:
218+
```python
219+
WORKFLOWS["my-workflow"] = WorkflowConfig(
220+
name="my-workflow",
221+
test_scopes=("tests/unit/v2",), # dirs this workflow runs
222+
extra_run_all_globs=(".github/workflows/my-workflow.yml",),
223+
)
224+
```
225+
2. In that workflow's YAML, mirror `modal-torch-latest.yml`'s `collect-tests` job
226+
and call the fetcher with `--workflow my-workflow`.
227+
3. Point the runner at `ci/.test_selection/test_list.txt`.
228+
4. Add coverage to `ci/test_tests_fetcher.py` if the scope/behavior differs.
229+
230+
### Add a self-test
231+
232+
`ci/test_tests_fetcher.py` builds throwaway git repos (`TmpRepo`) from a synthetic
233+
`BASELINE` tree and asserts the resulting `mode`/`tests`. Add a `test_*` function
234+
for any new behavior; it runs both standalone and under pytest, and executes in
235+
the `collect-tests` CI job, so a broken selector is caught before it mis-picks
236+
tests.
237+
238+
239+
## Security model
240+
241+
The workflow triggers on **`pull_request_target`**, so it runs in the base repo's
242+
context and the `deploy` job has the modal/HF **secrets** in scope. To keep PRs
243+
from abusing that:
244+
245+
- `collect-tests` holds **no secrets** and only **AST-parses** (never executes)
246+
the PR tree.
247+
- **Both jobs restore `ci/` from the base branch** before using it:
248+
- `collect-tests` runs the **base** selector + self-tests. This job decides
249+
whether the Required `deploy` runs, so it must not trust the PR's own
250+
`ci/tests_fetcher.py` — otherwise a PR could rewrite it to emit `mode=none`
251+
and skip CI while still going green. The diff is computed from git history, so
252+
a PR's `ci/` changes still appear in the diff and (via the base selector's
253+
`ci/**` run-all glob) force a full run.
254+
- `deploy` runs the **base** orchestration (which drives `modal run` with the
255+
secrets), so a PR can't repoint it at the secrets to exfiltrate them.
256+
In both jobs the PR's `deepspeed/` + `tests/` are what gets parsed/tested.
257+
- The `pull_request_target` trigger types are `review_requested`,
258+
`ready_for_review`, and `synchronize`. Because `synchronize` re-runs on every
259+
push to an open PR (not just on a maintainer action), the maintainer review is a
260+
mitigation, **not** an absolute barrier — the base-`ci/` restore above is the
261+
primary protection for the secrets.
262+
263+
> **Consequence:** changes to `ci/*` (including `tests_fetcher.py` itself) take
264+
> effect under `pull_request_target` only after they're **merged**. Validate
265+
> `ci/*` changes via a `pull_request`-triggered run or the `modal` CLI locally.
266+
>
267+
> **Bootstrap:** when the base branch has no selector yet (the PR that introduces
268+
> it), the restored base `ci/` won't contain `tests_fetcher.py`; `collect-tests`
269+
> detects this and falls back to running the full suite.
270+
271+
272+
## Failure modes & guarantees
273+
274+
The selector is built to **fail safe — to `all`, never to `none`**:
275+
276+
- Missing/unresolvable base, no merge-base, a parse error on a file, or **any
277+
unexpected exception** in the selector → it falls back to the full suite (the
278+
top-level handler in `main()` logs the traceback and sets `mode=all`).
279+
- If `collect-tests` fails entirely, `deploy` still runs (and the workflow has an
280+
explicit "fail if selection failed" guard) so a broken collector can't let a PR
281+
pass without testing.
282+
- The only way to run *fewer* tests is a clean, well-understood narrow decision;
283+
every uncertain case widens to everything.
284+
285+
Every run writes a summary to the GitHub job summary (mode, reason, and the
286+
selected files) so the decision is auditable from the Actions UI.
287+
288+
289+
## FAQ / troubleshooting
290+
291+
**My PR shows `mode=none` but I changed code.**
292+
Either the change is non-Python / out of the workflow's scope, or your edits
293+
aren't committed (the fetcher diffs *committed* history, `merge-base..HEAD`).
294+
295+
**A relevant test wasn't selected.**
296+
Likely a runtime/dynamic dependency. Confirm with `--explain`, then add a
297+
`DYNAMIC_EDGES` entry. As a stop-gap, push with `[test all]`.
298+
299+
**Everything runs even for a tiny change.**
300+
You touched a run-all glob (CI/build/shared fixture/core runtime), or a hub module
301+
fanned out. Check the job summary's `reason`. If a hub over-fans, consider
302+
`OPAQUE_MODULES`.
303+
304+
**It ran the full suite and the summary says "shallow clone?" / "no merge-base".**
305+
The runner didn't have enough history to compute the diff. `collect-tests` uses
306+
`fetch-depth: 0`; if you changed that, restore full history for the base.

0 commit comments

Comments
 (0)