
Improve CI reliability and add S3 cache for speed #3647


Open · wants to merge 8 commits into main from the ci-instance-range branch

Conversation

Member

@teor2345 teor2345 commented Jul 22, 2025

Caching Changes

This PR activates the runs-on magic cache, which caches for around 10 days on AWS S3.

As part of that change, it also adds extra caches for:

  • S3 sccache on Linux: Rust, C, C++, and CUDA build products (see the sketch after this section)
    • this is a shared cache across jobs
    • saves 4-12+ minutes per Linux job (average 7 minutes)
  • target on macOS: most Rust build products, maybe some C/C++/CUDA ones?
    • this is a per-job and per-platform cache, but S3 caching is free so we should be fine here
    • saves 7+ minutes per macOS clippy job, but is slower than a rebuild for tests
    • Windows caching is too slow, and Linux already has sccache
  • tools like Protoc:
    • Protoc downloads sometimes fail, so it's unreliable
    • but caching it is slower than re-downloading on Windows

And fixes the caches for:

  • source deps: add required registry tracking files
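
For illustration, the Linux sccache wiring looks roughly like this. This is a minimal sketch, not the PR's exact configuration: the bucket name, region, and installer step are assumptions.

# Sketch: route rustc through sccache backed by an S3 bucket on Linux jobs.
# The bucket, region, and action version are illustrative placeholders.
env:
  RUSTC_WRAPPER: sccache          # compile Rust through sccache
  SCCACHE_BUCKET: ci-build-cache  # hypothetical S3 bucket name
  SCCACHE_REGION: us-east-1       # hypothetical region
steps:
  - name: Install sccache
    uses: mozilla-actions/sccache-action@v0.0.4
    if: runner.os == 'Linux'
  - name: Show sccache statistics
    run: sccache --show-stats
    if: runner.os == 'Linux'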

Spot Instance Fixes

Spot instances are disabled in the merge queue and release-related workflows. This will only have an impact if we decide to re-enable them in runs-on.yml.

This is based on runs-on maintainer advice:
runs-on/runs-on#338 (comment)
runs-on/runs-on#337 (comment)
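
For reference, disabling spot instances for a job is done through the runs-on job label; this line matches the diff reviewed later in this conversation:

# Sketch: the spot=false part of the runs-on label forces an on-demand instance.
runs-on: runs-on=${{ github.run_id }}-${{ github.run_attempt }}/runner=self-hosted-ubuntu-22.04-x86-64/spot=false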

Instance changes

This PR expands the runs-on CI instances to include "r7" instances (memory optimised). Instances will be chosen based on lowest cost from the entire range provided. This is based on advice from the runs-on maintainer:
runs-on/runs-on#338 (comment)

This is low risk because our costs are already low. That also means there's not much need for spot instances, so I left them off in this PR.

It also removes some redundant CPU and RAM keys, because the provided numbers are already treated as a range:
https://runs-on.com/configuration/job-labels/#cpu
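
As a rough sketch of what the expanded instance range might look like in .github/runs-on.yml (the key names and values here are assumptions for illustration; see the runs-on configuration docs for the exact schema and this PR's actual values):

# Sketch only: widen the allowed instance families so runs-on can pick the cheapest match.
runners:
  self-hosted-ubuntu-22.04-x86-64:
    cpu: 8                       # treated as a minimum/range, so an explicit range key is redundant
    family: ["c7", "m7", "r7"]   # compute-, general-purpose-, and memory-optimised families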

Workflow Cleanups

cargo fmt does not require much CPU or RAM, so its job is changed to a free GitHub runner. It also doesn't need any dependencies, so those are removed.
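
A sketch of the slimmed-down fmt job (the checkout step and runner image are assumptions; the cargo fmt invocation matches the diff shown later in this review):

# Sketch: cargo fmt on a free GitHub-hosted runner, with no dependency installation steps.
cargo-fmt:
  runs-on: ubuntu-22.04
  steps:
    - uses: actions/checkout@v4
    - name: cargo fmt
      run: cargo fmt --all -- --check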

Code contributor checklist:

@teor2345 teor2345 self-assigned this Jul 22, 2025
@teor2345 teor2345 added improvement it is already working, but can be better devops labels Jul 22, 2025


@teor2345 teor2345 requested a review from nazar-pc as a code owner July 23, 2025 01:45
@teor2345 teor2345 changed the title Expand instance range in runs-on.yml Improve CI reliability and add S3 cache for speed Jul 23, 2025
@teor2345 teor2345 added the bug Something isn't working label Jul 23, 2025
@teor2345 teor2345 force-pushed the ci-instance-range branch 3 times, most recently from f53b052 to 7422965 Compare July 23, 2025 03:26
@teor2345 teor2345 marked this pull request as draft July 23, 2025 04:30

Member

@nazar-pc nazar-pc left a comment

The caches are not matched exactly, but rather can use fallback keys (specified in restore-keys). This configuration means that when a second and third toolchain are used, they will all end up being stored, slowing CI more and more over time. This is even worse with target, whose build artifacts are unlikely to be useful upon restart as crates and especially dependencies change. Caching target is generally not recommended, not to mention it is absolutely HUGE.

Not only that, the cache is spectacularly slow on Windows, to the point that using it may end up being slower than not using it.

Overall the intention of this PR is good, but I believe it'll cause more issues than it actually helps. The majority of important and useful things were cached already.
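
For context, the fallback behaviour described here is the actions/cache restore-keys prefix match, roughly:

# Illustrative keys only: a restore-keys prefix lets a job start from an older entry,
# then save a new, larger entry under the exact key, so the cache compounds over time.
- uses: actions/cache@v4
  with:
    path: ~/.cargo
    key: cargo-${{ runner.os }}-${{ hashFiles('**/Cargo.lock') }}
    restore-keys: |
      cargo-${{ runner.os }}-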

@teor2345 teor2345 force-pushed the ci-instance-range branch 2 times, most recently from fd7586a to 10c078e Compare July 23, 2025 20:51
@teor2345
Member Author

The caches are not matched exactly, but rather can use fallback keys (specified in restore-keys). This configuration means that when a second and third toolchain are used, they will all end up being stored, slowing CI more and more over time.

Thanks for the reminder. I've reviewed all the cache settings, and scoped them to the compiler version, any dependency version, or both. That should limit cache growth, along with some other changes you might not be aware of:

This is even worse with target, whose build artifacts are unlikely to be useful upon restart as crates and especially dependencies change. Caching target is generally not recommended, not to mention it is absolutely HUGE.

I'm testing which of the remaining caches are actually useful. S3 caching is free and (potentially) fast, so it's worth trying now that we're not using GitHub's caches.

Not only that, the cache is spectacularly slow on Windows, to the point that using it may end up being slower than not using it.

I've disabled almost all the caches on Windows, including the existing source deps one, because you're right, they're slower than the actual downloads.

@teor2345 teor2345 force-pushed the ci-instance-range branch from 10c078e to f137aac Compare July 23, 2025 21:24
@nazar-pc
Member

Thanks for the reminder. I've reviewed all the cache settings, and scoped them to the compiler version, any dependency version, or both. That should limit cache growth, along with some other changes you might not be aware of:

Removing restore keys means that any small dependency or feature change leaves you completely without a cache. Using restore keys (which is recommended) means it first downloads an older cache, then, if something it needs is missing (like a newer toolchain), downloads that and stores a newer version containing both toolchains, compounding the size over time.

Neither is ideal, which is why storing things like build artifacts is generally a bad idea. Tools like nextest were already cached, so I don't think there is a need to cache ~/.cargo/bin, for example.

Also note that CARGO_INCREMENTAL: 0 means faster builds from scratch, but also complete rebuilds on changes, making caching of things like target less effective, though I see you're not caching it anymore.
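
(For reference, that setting is just a workflow-level environment variable, e.g.:)

# Sketch: with incremental compilation disabled, clean builds are faster,
# but any code change triggers a fuller rebuild, so cached target/ dirs help less.
env:
  CARGO_INCREMENTAL: 0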

BTW, the ROCm cache, for example, was placed right before its installation so that both can be seen at the same time and are more likely to be removed together if/when they are no longer needed (for example, if the plotter starts using Vulkan, ROCm would no longer be needed in CI). Moving it around means more scrolling back and forth during maintenance, and I'm not sure having the logs of different caches together is more useful than grouping things that are closely related.

@teor2345 teor2345 force-pushed the ci-instance-range branch from f137aac to b33fd8d Compare July 28, 2025 02:41
@teor2345
Member Author

Thanks for the reminder. I've reviewed all the cache settings, and scoped them to the compiler version, any dependency version, or both. That should limit cache growth, along with some other changes you might not be aware of:

Removing restore keys means that any small dependency or feature change leaves you completely without a cache. Using restore keys (which is recommended) means it first downloads an older cache, then, if something it needs is missing (like a newer toolchain), downloads that and stores a newer version containing both toolchains, compounding the size over time.

Neither is ideal, which is why storing things like build artifacts is generally a bad idea. Tools like nextest were already cached, so I don't think there is a need to cache ~/.cargo/bin, for example.

I've removed the compiler binaries cache, and set the other caches so they'll be reset when the workspace Cargo.toml changes (and no restore keys). That Cargo.toml only contains direct dependency versions, so it changes every 1-3 weeks. This seems like a good compromise, because:

  • Cargo.lock patch versions or internal feature/dependency changes will use the old cache, causing limited cache growth
  • the cache will reset when we explicitly upgrade a dependency

The reset will only impact the first run of the first PR with each Cargo.toml change, so it's unlikely to slow things down much.
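
As a sketch of that keying scheme (the path and key text are illustrative, not the PR's exact values):

# Sketch: exact-match key tied to the workspace Cargo.toml, with no restore-keys,
# so the cache only resets when direct dependency versions change.
- uses: actions/cache@v4
  with:
    path: ~/.cargo/registry
    key: cargo-deps-${{ runner.os }}-${{ hashFiles('Cargo.toml') }}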

The S3 sccache is a clear win for Linux (5+ minutes quicker overall), and the other remaining caches are very fast (a few seconds).

I'm seeing whether a macOS compiled-deps cache is useful; once I know, I'll bring this PR out of draft.

BTW, the ROCm cache, for example, was placed right before its installation so that both can be seen at the same time and are more likely to be removed together if/when they are no longer needed (for example, if the plotter starts using Vulkan, ROCm would no longer be needed in CI). Moving it around means more scrolling back and forth during maintenance, and I'm not sure having the logs of different caches together is more useful than grouping things that are closely related.

Done!

@teor2345 teor2345 force-pushed the ci-instance-range branch from b33fd8d to 44102f4 Compare July 28, 2025 02:57
@teor2345 teor2345 force-pushed the ci-instance-range branch from 44102f4 to 0887d22 Compare July 28, 2025 05:38
@teor2345 teor2345 requested a review from clostao July 28, 2025 10:14
@teor2345 teor2345 force-pushed the ci-instance-range branch from 70b2934 to cd7399e Compare July 29, 2025 01:28
@teor2345 teor2345 marked this pull request as ready for review July 29, 2025 01:29
@teor2345
Member Author

teor2345 commented Jul 29, 2025

This is ready for review now. I've made sure each added cache actually improves performance, and each one is reset when it becomes less useful (when direct dependencies change, around every 5-20 days).

Edit: This PR changes CI job names, so we'll need to update our branch protection rules for it to merge.

@teor2345 teor2345 enabled auto-merge July 29, 2025 01:31
Member

@nazar-pc nazar-pc left a comment

Makes sense to me overall

@@ -14,7 +14,7 @@ on:
jobs:
chains-spec:
runs-on: ${{ fromJson(github.repository_owner == 'autonomys' &&
'"runs-on=${{ github.run_id }}/runner=self-hosted-ubuntu-22.04-x86-64"' || '"ubuntu-22.04"') }}
'"runs-on=${{ github.run_id }}-${{ github.run_attempt }}/runner=self-hosted-ubuntu-22.04-x86-64/spot=false"' || '"ubuntu-22.04"') }}
Member

@nazar-pc nazar-pc Jul 30, 2025

Why is github.run_attempt used here?

Member Author

If we just use the run-id, then different attempts can be scheduled on the same runner. This disables the spot instance interruption protection on job re-runs:
runs-on/runs-on#337 (comment)

Probably isn't strictly needed in this file with spot=false, but I did a search and replace for consistency (and to avoid other similar issues).

Member

Such an obscure behavior. I really consider this to be a bug.

- name: cargo fmt
run: cargo fmt --all -- --check

cargo-clippy:
name: cargo-clippy (${{ strategy.job-index == 0 && 'Linux' || (strategy.job-index == 1 && 'Windows' || 'macOS') }})
Member

Convenient, but I'd argue it is helpful to have the exact OS name and version. There are no tests for aarch64 Ubuntu/Windows, but it'd be nice to have at least aarch64 Ubuntu here, in which case "Linux" would be duplicated.

The solution to shifting job names for branch protection rules could be to have a "summary" job that depends on all others. Here is an example:
https://github.com/nazar-pc/abundance/blob/b0eb4ede65e73c8e8d5d2b03414e45dd43526d78/.github/workflows/rust.yml#L381-L400

Note that it uses the job names as written in the YAML file and waits for all of them in the case of a job matrix. Then you won't need to customize the names here at all, and it'll be nicer for long-term maintenance.
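
A minimal sketch of that approach (the job names here are illustrative; the linked workflow is the real example):

# Sketch: one job for branch protection to require; it fails if any needed job failed or was cancelled.
required-jobs:
  name: Required jobs
  needs: [cargo-fmt, cargo-clippy, cargo-test]
  if: always()
  runs-on: ubuntu-22.04
  steps:
    - name: Check that all required jobs succeeded
      run: |
        if ${{ contains(needs.*.result, 'failure') || contains(needs.*.result, 'cancelled') }}; then
          echo "A required job failed or was cancelled"
          exit 1
        fi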

Member Author

I'd like to go with the OS name and maybe the architecture; we can add the OS version if we ever run multiple OS versions in CI.

Comment on lines +156 to +157
~/.cargo/.crates2.json
~/.cargo/.global-cache
Member

Why are these two files important?

Member Author

Thanks for checking this!
Actually, I think those files (and parts of the current registry and git) aren't needed.

In the docs, the paths to cache for crate sources are:

  • registry/index
  • registry/cache
  • git/db

https://doc.rust-lang.org/cargo/guide/cargo-home.html#caching-the-cargo-home-in-ci

Caching registry/src and git/checkouts massively increases the number of files, and more than doubles the size. I'll re-check Windows after this change, because it might turn out to be faster with many fewer files.

.global-cache is a tiny marker file that stops backup software descending into the directory, but I'm pretty sure it gets re-created by cargo anyway.
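
A sketch of the trimmed-down cache step based on those documented paths (the key is illustrative):

# Sketch: cache only the registry index, compressed crate archives, and bare git clones;
# cargo re-extracts registry/src and git/checkouts from these after the cache is restored.
- uses: actions/cache@v4
  with:
    path: |
      ~/.cargo/registry/index/
      ~/.cargo/registry/cache/
      ~/.cargo/git/db/
    key: cargo-home-${{ runner.os }}-${{ hashFiles('Cargo.toml') }}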

Member

Those git checkouts are exactly the files that are used to build crates; that is why we cache them. Yes, they can be relatively large, especially as we upgrade dependencies and the caches grow over time.

Member Author

Yes, but we only need to cache the compressed crate archives and bare git repository clones. Then once the cache is restored, cargo will uncompress/checkout the sources from those archives.

We definitely don't need to cache both the archives and the sources; that's more than double the size, and many more files.

uses: actions/cache@1bd1e32a3bdc45362d1e726936510720a7c30a57 # v4.2.0
id: tool-cache
with:
path: '~/**/_tool'
Member

What about specifying a full path rather than **/_tool glob?

Member Author

The paths are different on macOS and Linux, so this glob is the easiest way to write them both.

Member

I see, annoying 😕

@@ -89,6 +88,17 @@ jobs:
run: brew install libtool
if: runner.os == 'macOS'

# We cache protoc because it sometimes fails to download, but cache is too slow on Windows.
Member

Consider creating an upstream issue to add built-in cache to arduino/setup-protoc action
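
In the meantime, the caching approach referred to in the workflow comment above amounts to roughly this (the cached path matches the tool-cache glob from this PR; the key and the setup-protoc inputs are assumptions):

# Hedged sketch: cache the tool directory protoc is installed into, and skip the
# cache on Windows, where restoring it is slower than re-downloading protoc.
- uses: actions/cache@v4
  if: runner.os != 'Windows'
  with:
    path: '~/**/_tool'
    key: protoc-${{ runner.os }}
- name: Install Protoc
  uses: arduino/setup-protoc@v3
  with:
    repo-token: ${{ secrets.GITHUB_TOKEN }}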

Member Author

@teor2345 teor2345 left a comment

Thanks for the review!

I agree that the crate cache needs to be cut down, and I found even more ways we can reduce its size based on the docs.

I also think a final "required jobs" job would be much more maintainable. Then we'd only need to switch the branch protection rules once.
