
Improve CI reliability and add S3 cache for speed #3647


Open · wants to merge 8 commits into main from the ci-instance-range branch

Conversation

Member

@teor2345 teor2345 commented Jul 22, 2025

Caching Changes

This PR activates the runs-on magic cache, which caches for around 10 days on AWS S3.

As part of that change, it also adds extra caches for:

  • S3 sccache on Linux: Rust, C, C++, and CUDA build products (see the sketch after this section)
    • this is a shared cache across jobs
    • saves 4-12+ minutes per Linux job (average 7 minutes)
  • target on macOS: most Rust build products, maybe some C/C++/CUDA ones?
    • this is a per-job and per-platform cache, but S3 caching is free so we should be fine here
    • saves 7+ minutes per macOS clippy job, but is slower than a rebuild for tests
    • Windows caching is too slow, and Linux already has sccache
  • tools like Protoc:
    • Protoc downloads sometimes fail, so it's unreliable
    • but caching it is slower than re-downloading on Windows

And fixes the caches for:

  • source deps: add required registry tracking files
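
For illustration, the Linux sccache wiring looks roughly like this. This is a minimal sketch, not the PR's exact configuration: the bucket name, region, and installer step are assumptions.

# Sketch: route rustc through sccache backed by an S3 bucket on Linux jobs.
# The bucket, region, and action version are illustrative placeholders.
env:
  RUSTC_WRAPPER: sccache          # compile Rust through sccache
  SCCACHE_BUCKET: ci-build-cache  # hypothetical S3 bucket name
  SCCACHE_REGION: us-east-1       # hypothetical region
steps:
  - name: Install sccache
    uses: mozilla-actions/sccache-action@v0.0.4
    if: runner.os == 'Linux'
  - name: Show sccache statistics
    run: sccache --show-stats
    if: runner.os == 'Linux'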

Spot Instance Fixes

Spot instances are disabled in the merge queue and release-related workflows. This will only have an impact if we decide to re-enable them in runs-on.yml.

This is based on runs-on maintainer advice:
runs-on/runs-on#338 (comment)
runs-on/runs-on#337 (comment)
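
For reference, disabling spot instances for a job is done through the runs-on job label; this line matches the diff reviewed later in this conversation:

# Sketch: the spot=false part of the runs-on label forces an on-demand instance.
runs-on: runs-on=${{ github.run_id }}-${{ github.run_attempt }}/runner=self-hosted-ubuntu-22.04-x86-64/spot=false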

Instance changes

This PR expands the runs-on CI instances to include "r7" instances (memory optimised). Instances will be chosen based on lowest cost from the entire range provided. This is based on advice from the runs-on maintainer:
runs-on/runs-on#338 (comment)

This is low risk because our costs are already low. That also means there's not much need for spot instances, so I left them off in this PR.

It also removes some redundant CPU and RAM keys, because the provided numbers are already treated as a range:
https://runs-on.com/configuration/job-labels/#cpu
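
As a rough sketch of what the expanded instance range might look like in .github/runs-on.yml (the key names and values here are assumptions for illustration; see the runs-on configuration docs for the exact schema and this PR's actual values):

# Sketch only: widen the allowed instance families so runs-on can pick the cheapest match.
runners:
  self-hosted-ubuntu-22.04-x86-64:
    cpu: 8                       # treated as a minimum/range, so an explicit range key is redundant
    family: ["c7", "m7", "r7"]   # compute-, general-purpose-, and memory-optimised families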

Workflow Cleanups

cargo fmt does not require much CPU or RAM, so its job is changed to a free GitHub runner. It also doesn't need any dependencies, so those are removed.
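
A sketch of the slimmed-down fmt job (the checkout step and runner image are assumptions; the cargo fmt invocation matches the diff shown later in this review):

# Sketch: cargo fmt on a free GitHub-hosted runner, with no dependency installation steps.
cargo-fmt:
  runs-on: ubuntu-22.04
  steps:
    - uses: actions/checkout@v4
    - name: cargo fmt
      run: cargo fmt --all -- --check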

Code contributor checklist:

@teor2345 teor2345 self-assigned this Jul 22, 2025
@teor2345 teor2345 added improvement it is already working, but can be better devops labels Jul 22, 2025


@teor2345 teor2345 requested a review from nazar-pc as a code owner July 23, 2025 01:45
@teor2345 teor2345 changed the title Expand instance range in runs-on.yml Improve CI reliability and add S3 cache for speed Jul 23, 2025
@teor2345 teor2345 added the bug Something isn't working label Jul 23, 2025
@teor2345 teor2345 force-pushed the ci-instance-range branch 3 times, most recently from f53b052 to 7422965 Compare July 23, 2025 03:26
@teor2345 teor2345 marked this pull request as draft July 23, 2025 04:30

Member

@nazar-pc nazar-pc left a comment

The caches are not matched exactly, but rather can use fallback keys (specified in restore-keys). This configuration means that when a second and third toolchain are used, they will all end up being stored, slowing CI more and more over time. This is even worse with target, whose build artifacts are unlikely to be useful upon restart as crates and especially dependencies change. Caching target is generally not recommended, not to mention it is absolutely HUGE.

Not only that, the cache is spectacularly slow on Windows, to the point that using it may end up being slower than not using it.

Overall the intention of this PR is good, but I believe it'll cause more issues than it actually helps. The majority of important and useful things were cached already.
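
For context, the fallback behaviour described here is the actions/cache restore-keys prefix match, roughly:

# Illustrative keys only: a restore-keys prefix lets a job start from an older entry,
# then save a new, larger entry under the exact key, so the cache compounds over time.
- uses: actions/cache@v4
  with:
    path: ~/.cargo
    key: cargo-${{ runner.os }}-${{ hashFiles('**/Cargo.lock') }}
    restore-keys: |
      cargo-${{ runner.os }}-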

@teor2345 teor2345 force-pushed the ci-instance-range branch 2 times, most recently from fd7586a to 10c078e Compare July 23, 2025 20:51
@teor2345
Member Author

The caches are not matched exactly, but rather can use fallback keys (specified in restore-keys). This configuration means that when a second and third toolchain are used, they will all end up being stored, slowing CI more and more over time.

Thanks for the reminder. I've reviewed all the cache settings, and scoped them to the compiler version, any dependency version, or both. That should limit cache growth, along with some other changes you might not be aware of:

This is even worse with target, whose build artifacts are unlikely to be useful upon restart as crates and especially dependencies change. Caching target is generally not recommended, not to mention it is absolutely HUGE.

I'm testing which of the remaining caches are actually useful. S3 caching is free and (potentially) fast, so it's worth trying now that we're not using GitHub's caches.

Not only that, the cache is spectacularly slow on Windows, to the point that using it may end up being slower than not using it.

I've disabled almost all the caches on Windows, including the existing source deps one, because you're right, they're slower than the actual downloads.

@teor2345 teor2345 force-pushed the ci-instance-range branch from 10c078e to f137aac Compare July 23, 2025 21:24
@nazar-pc
Member

Thanks for the reminder. I've reviewed all the cache settings, and scoped them to the compiler version, any dependency version, or both. That should limit cache growth, along with some other changes you might not be aware of:

Removing restore keys means that any small dependency or feature change leaves you completely without a cache. Using restore keys (which is recommended) means it first downloads an older cache, then, if something it needs is missing (like a newer toolchain), downloads that and stores a newer version containing both toolchains, compounding the size over time.

Neither is ideal, which is why storing things like build artifacts is generally a bad idea. Tools like nextest were already cached, so I don't think there is a need to cache ~/.cargo/bin, for example.

Also note that CARGO_INCREMENTAL: 0 means faster builds from scratch, but also complete rebuilds on changes, making caching of things like target less effective, though I see you're not caching it anymore.
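
(For reference, that setting is just a workflow-level environment variable, e.g.:)

# Sketch: with incremental compilation disabled, clean builds are faster,
# but any code change triggers a fuller rebuild, so cached target/ dirs help less.
env:
  CARGO_INCREMENTAL: 0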

BTW, the ROCm cache, for example, was placed right before its installation so that both can be seen at the same time and are more likely to be removed together if/when they are no longer needed (for example, if the plotter starts using Vulkan, ROCm would no longer be needed in CI). Moving it around means more scrolling back and forth during maintenance, and I'm not sure having the logs of different caches together is more useful than grouping things that are closely related.

@teor2345 teor2345 force-pushed the ci-instance-range branch from f137aac to b33fd8d Compare July 28, 2025 02:41
@teor2345
Member Author

Thanks for the reminder. I've reviewed all the cache settings, and scoped them to the compiler version, any dependency version, or both. That should limit cache growth, along with some other changes you might not be aware of:

Removing restore keys means that any small dependency or feature change leaves you completely without a cache. Using restore keys (which is recommended) means it first downloads an older cache, then, if something it needs is missing (like a newer toolchain), downloads that and stores a newer version containing both toolchains, compounding the size over time.

Neither is ideal, which is why storing things like build artifacts is generally a bad idea. Tools like nextest were already cached, so I don't think there is a need to cache ~/.cargo/bin, for example.

I've removed the compiler binaries cache, and set the other caches so they'll be reset when the workspace Cargo.toml changes (and no restore keys). That Cargo.toml only contains direct dependency versions, so it changes every 1-3 weeks. This seems like a good compromise, because:

  • Cargo.lock patch versions or internal feature/dependency changes will use the old cache, causing limited cache growth
  • the cache will reset when we explicitly upgrade a dependency

The reset will only impact the first run of the first PR with each Cargo.toml change, so it's unlikely to slow things down much.
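
As a sketch of that keying scheme (the path and key text are illustrative, not the PR's exact values):

# Sketch: exact-match key tied to the workspace Cargo.toml, with no restore-keys,
# so the cache only resets when direct dependency versions change.
- uses: actions/cache@v4
  with:
    path: ~/.cargo/registry
    key: cargo-deps-${{ runner.os }}-${{ hashFiles('Cargo.toml') }}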

The S3 sccache is a clear win for Linux (5+ minutes quicker overall), and the other remaining caches are very fast (a few seconds).

I'm seeing whether a macOS compiled-deps cache is useful; once I know, I'll bring this PR out of draft.

BTW, the ROCm cache, for example, was placed right before its installation so that both can be seen at the same time and are more likely to be removed together if/when they are no longer needed (for example, if the plotter starts using Vulkan, ROCm would no longer be needed in CI). Moving it around means more scrolling back and forth during maintenance, and I'm not sure having the logs of different caches together is more useful than grouping things that are closely related.

Done!

@teor2345 teor2345 force-pushed the ci-instance-range branch from b33fd8d to 44102f4 Compare July 28, 2025 02:57
@teor2345 teor2345 force-pushed the ci-instance-range branch from 44102f4 to 0887d22 Compare July 28, 2025 05:38
@teor2345 teor2345 requested a review from clostao July 28, 2025 10:14
@teor2345 teor2345 force-pushed the ci-instance-range branch from 70b2934 to cd7399e Compare July 29, 2025 01:28
@teor2345 teor2345 marked this pull request as ready for review July 29, 2025 01:29
@teor2345
Member Author

teor2345 commented Jul 29, 2025

This is ready for review now. I've made sure each added cache actually improves performance, and each one is reset when it becomes less useful (when direct dependencies change, around every 5-20 days).

Edit: This PR changes CI job names, so we'll need to update our branch protection rules for it to merge.

@teor2345 teor2345 enabled auto-merge July 29, 2025 01:31
Member

@nazar-pc nazar-pc left a comment

Makes sense to me overall

@@ -14,7 +14,7 @@ on:
jobs:
chains-spec:
runs-on: ${{ fromJson(github.repository_owner == 'autonomys' &&
'"runs-on=${{ github.run_id }}/runner=self-hosted-ubuntu-22.04-x86-64"' || '"ubuntu-22.04"') }}
'"runs-on=${{ github.run_id }}-${{ github.run_attempt }}/runner=self-hosted-ubuntu-22.04-x86-64/spot=false"' || '"ubuntu-22.04"') }}
Member

@nazar-pc nazar-pc Jul 30, 2025

Why is github.run_attempt used here?

Member Author

If we just use the run-id, then different attempts can be scheduled on the same runner. This disables the spot instance interruption protection on job re-runs:
runs-on/runs-on#337 (comment)

Probably isn't strictly needed in this file with spot=false, but I did a search and replace for consistency (and to avoid other similar issues).

Member

Such an obscure behavior. I really consider this to be a bug.

- name: cargo fmt
run: cargo fmt --all -- --check

cargo-clippy:
name: cargo-clippy (${{ strategy.job-index == 0 && 'Linux' || (strategy.job-index == 1 && 'Windows' || 'macOS') }})
Member

Convenient, but I'd argue it is helpful to have the exact OS name and version. There are no tests for aarch64 Ubuntu/Windows, but it'd be nice to have at least aarch64 Ubuntu here, in which case "Linux" would be duplicated.

The solution to shifting job names for branch protection rules could be to have a "summary" job that depends on all others. Here is an example:
https://github.com/nazar-pc/abundance/blob/b0eb4ede65e73c8e8d5d2b03414e45dd43526d78/.github/workflows/rust.yml#L381-L400

Note that it uses the job names as written in the YAML file and waits for all of them in the case of a job matrix. Then you won't need to customize the names here at all, and it'll be nicer for long-term maintenance.
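
A minimal sketch of that approach (the job names here are illustrative; the linked workflow is the real example):

# Sketch: one job for branch protection to require; it fails if any needed job failed or was cancelled.
required-jobs:
  name: Required jobs
  needs: [cargo-fmt, cargo-clippy, cargo-test]
  if: always()
  runs-on: ubuntu-22.04
  steps:
    - name: Check that all required jobs succeeded
      run: |
        if ${{ contains(needs.*.result, 'failure') || contains(needs.*.result, 'cancelled') }}; then
          echo "A required job failed or was cancelled"
          exit 1
        fi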

Member Author

I'd like to go with the OS name and maybe the architecture; we can add the OS version if we ever run multiple OS versions in CI.

Comment on lines +156 to +157
~/.cargo/.crates2.json
~/.cargo/.global-cache
Member

Why are these two files important?

Member Author

Thanks for checking this!
Actually, I think those files (and parts of the current registry and git) aren't needed.

In the docs, the paths to cache for crate sources are:

  • registry/index
  • registry/cache
  • git/db

https://doc.rust-lang.org/cargo/guide/cargo-home.html#caching-the-cargo-home-in-ci

Caching registry/src and git/checkouts massively increases the number of files, and more than doubles the size. I'll re-check Windows after this change, because it might turn out to be faster with many fewer files.

.global-cache is a tiny marker file that stops backup software descending into the directory, but I'm pretty sure it gets re-created by cargo anyway.
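
A sketch of the trimmed-down cache step based on those documented paths (the key is illustrative):

# Sketch: cache only the registry index, compressed crate archives, and bare git clones;
# cargo re-extracts registry/src and git/checkouts from these after the cache is restored.
- uses: actions/cache@v4
  with:
    path: |
      ~/.cargo/registry/index/
      ~/.cargo/registry/cache/
      ~/.cargo/git/db/
    key: cargo-home-${{ runner.os }}-${{ hashFiles('Cargo.toml') }}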

Member

Those git checkouts are exactly the files that are used to build crates; that is why we cache them. Yes, they can be relatively large, especially as we upgrade dependencies and the caches grow over time.

Member Author

Yes, but we only need to cache the compressed crate archives and bare git repository clones. Then once the cache is restored, cargo will uncompress/checkout the sources from those archives.

We definitely don't need to cache both the archives and the sources; that's more than double the size, and many more files.

uses: actions/cache@1bd1e32a3bdc45362d1e726936510720a7c30a57 # v4.2.0
id: tool-cache
with:
path: '~/**/_tool'
Member

What about specifying a full path rather than **/_tool glob?

Member Author

The paths are different on macOS and Linux, so this glob is the easiest way to write them both.

Member

I see, annoying 😕

@@ -89,6 +88,17 @@ jobs:
run: brew install libtool
if: runner.os == 'macOS'

# We cache protoc because it sometimes fails to download, but cache is too slow on Windows.
Member

Consider creating an upstream issue to add built-in cache to arduino/setup-protoc action
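
In the meantime, the caching approach referred to in the workflow comment above amounts to roughly this (the cached path matches the tool-cache glob from this PR; the key and the setup-protoc inputs are assumptions):

# Hedged sketch: cache the tool directory protoc is installed into, and skip the
# cache on Windows, where restoring it is slower than re-downloading protoc.
- uses: actions/cache@v4
  if: runner.os != 'Windows'
  with:
    path: '~/**/_tool'
    key: protoc-${{ runner.os }}
- name: Install Protoc
  uses: arduino/setup-protoc@v3
  with:
    repo-token: ${{ secrets.GITHUB_TOKEN }}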

Member Author

@teor2345 teor2345 left a comment

Thanks for the review!

I agree that the crate cache needs to be cut down, and I found even more ways we can reduce its size based on the docs.

I also think a final "required jobs" job would be much more maintainable. Then we'd only need to switch the branch protection rules once.
