-
Notifications
You must be signed in to change notification settings - Fork 256
Improve CI reliability and add S3 cache for speed #3647
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
This comment was marked as off-topic.
This comment was marked as off-topic.
f53b052
to
7422965
Compare
This comment was marked as resolved.
This comment was marked as resolved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The caches are not matched exactly, but rather can use fallback keys (specified in restore-keys
). This configuration means when second and third toolchain is used, they will all end up being stored, slowing CI over time more and more. This is even worse with target
, whose build artifacts are unlikely to be useful upon restart as crates and especially dependencies change. Caching target
is generally not recommended, not to mention it is absolutely HUGE.
Not only that, cache is spectacularly slow on Windows, to the point that using it may end up being slower than not using.
Overall the intention of this PR is good, but I believe it'll cause more issues that actually help. Majority of important and useful things were cached already.
fd7586a
to
10c078e
Compare
Thanks for the reminder. I've reviewed all the cache settings, and scoped them to the compiler version, any dependency version, or both. That should limit cache growth, along with some other changes you might not be aware of:
I'm testing which of the remaining caches are actually useful. S3 cache is free and (potentially) fast, so it's worth trying now we're not using GitHub's caches.
I've disabled almost all the caches on Windows, including the existing source deps one, because you're right, they're slower than the actual downloads. |
10c078e
to
f137aac
Compare
Removing restore keys means any small dependency or feature change and you're completely without cache. Using restore keys (which is recommended) means it first downloads older cache, then if something it needs is missing (like a newer toolchain), it'll download that and then store a newer version with both toolchains, compounding the size over time. Neither is ideal, which is why storing things like build artifacts is generally a bad idea. Tools like Also note that BTW the cache for ROCm for example was right before its installation to see both at the same time and more likely to remove both if/when they are not needed (for example, if plotter starts using Vulkan, ROCm would no longer be needed in CI). Moving it around means you need to scroll back and forth more during maintenance and I'm not sure having logs of different caches together is more useful than grouping things that are closely related. |
f137aac
to
b33fd8d
Compare
I've removed the compiler binaries cache, and set the other caches so they'll be reset when the workspace
The reset will only impact the first run of the first PR with each The S3 I'm seeing if a macOS compiled deps cache is useful, once I know I'll bring this PR out of draft.
Done! |
b33fd8d
to
44102f4
Compare
44102f4
to
0887d22
Compare
70b2934
to
cd7399e
Compare
This is ready for review now, I've made sure each added cache is actually improving performance, and they are reset when they become less useful (when direct dependencies change, around every 5-20 days). Edit: This PR changes CI job names, so we'll need to update our branch protection rules for it to merge. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense to me overall
@@ -14,7 +14,7 @@ on: | |||
jobs: | |||
chains-spec: | |||
runs-on: ${{ fromJson(github.repository_owner == 'autonomys' && | |||
'"runs-on=${{ github.run_id }}/runner=self-hosted-ubuntu-22.04-x86-64"' || '"ubuntu-22.04"') }} | |||
'"runs-on=${{ github.run_id }}-${{ github.run_attempt }}/runner=self-hosted-ubuntu-22.04-x86-64/spot=false"' || '"ubuntu-22.04"') }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is github.run_attempt
used in here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we just use the run-id, then different attempts can be scheduled on the same runner. This disables the spot instance interruption protection on job re-runs:
runs-on/runs-on#337 (comment)
Probably isn't strictly needed in this file with spot=false
, but I did a search and replace for consistency (and to avoid other similar issues).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Such an obscure behavior. I really consider this to be a bug.
- name: cargo fmt | ||
run: cargo fmt --all -- --check | ||
|
||
cargo-clippy: | ||
name: cargo-clippy (${{ strategy.job-index == 0 && 'Linux' || (strategy.job-index == 1 && 'Windows' || 'macOS') }}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Convenient, but I'd argue it is helpful to have exact OS name and version. There are no tests for aarch64 Ubuntu/Windows, but it'd be nice to have at least Ubuntu aarc64 here, in which case "Linux" will be duplicated.
The solution to shifting job names for branch protection rules could be to have a "summary" job that depends on all others. Here is an example:
https://github.com/nazar-pc/abundance/blob/b0eb4ede65e73c8e8d5d2b03414e45dd43526d78/.github/workflows/rust.yml#L381-L400
Note that it uses job names as in yaml file and waits for all of them in case of job matrix. The you'll not need to customize the names here at all and it'll be nicer for long-term maintenance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to go with OS name and maybe architecture, we can add OS version if we ever run multiple OS versions in CI.
~/.cargo/.crates2.json | ||
~/.cargo/.global-cache |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are these two files important?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for checking this!
Actually, I think those files (and parts of the current registry
and git
) aren't needed.
In the docs, the paths to cache for crate sources are:
- registry/index
- registry/cache
- git/db
https://doc.rust-lang.org/cargo/guide/cargo-home.html#caching-the-cargo-home-in-ci
Caching registry/src and git/checkouts massively increases the number of files, and more than doubles the size. I'll re-check Windows after this change, because it might turn out to be faster with many fewer files.
.global-cache
is a tiny marker file that stops backup software descending into the directory, but I'm pretty sure it gets re-created by cargo anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those git checkouts are exactly the files that are used to build crates, that is why we cache them. Yes, they could be relatively large, especially as we upgrade dependencies over time and caches grow in size over time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but we only need to cache the compressed crate archives and bare git repository clones. Then once the cache is restored, cargo will uncompress/checkout the sources from those archives.
We definitely don't need to cache both the archives and the sources, that's more than double the size, and many more files.
uses: actions/cache@1bd1e32a3bdc45362d1e726936510720a7c30a57 # v4.2.0 | ||
id: tool-cache | ||
with: | ||
path: '~/**/_tool' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about specifying a full path rather than **/_tool
glob?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The paths are different on macOS and Linux, so this glob is the easiest way to write them both.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, annoying 😕
@@ -89,6 +88,17 @@ jobs: | |||
run: brew install libtool | |||
if: runner.os == 'macOS' | |||
|
|||
# We cache protoc because it sometimes fails to download, but cache is too slow on Windows. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider creating an upstream issue to add built-in cache to arduino/setup-protoc
action
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review!
I agree that the crate cache needs to be cut down, and I found even more ways we can reduce its size based on the docs.
I also think a final "required jobs" job would be much more maintainable. Then we'd only need to switch the branch protection rules once.
@@ -14,7 +14,7 @@ on: | |||
jobs: | |||
chains-spec: | |||
runs-on: ${{ fromJson(github.repository_owner == 'autonomys' && | |||
'"runs-on=${{ github.run_id }}/runner=self-hosted-ubuntu-22.04-x86-64"' || '"ubuntu-22.04"') }} | |||
'"runs-on=${{ github.run_id }}-${{ github.run_attempt }}/runner=self-hosted-ubuntu-22.04-x86-64/spot=false"' || '"ubuntu-22.04"') }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we just use the run-id, then different attempts can be scheduled on the same runner. This disables the spot instance interruption protection on job re-runs:
runs-on/runs-on#337 (comment)
Probably isn't strictly needed in this file with spot=false
, but I did a search and replace for consistency (and to avoid other similar issues).
uses: actions/cache@1bd1e32a3bdc45362d1e726936510720a7c30a57 # v4.2.0 | ||
id: tool-cache | ||
with: | ||
path: '~/**/_tool' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The paths are different on macOS and Linux, so this glob is the easiest way to write them both.
~/.cargo/.crates2.json | ||
~/.cargo/.global-cache |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for checking this!
Actually, I think those files (and parts of the current registry
and git
) aren't needed.
In the docs, the paths to cache for crate sources are:
- registry/index
- registry/cache
- git/db
https://doc.rust-lang.org/cargo/guide/cargo-home.html#caching-the-cargo-home-in-ci
Caching registry/src and git/checkouts massively increases the number of files, and more than doubles the size. I'll re-check Windows after this change, because it might turn out to be faster with many fewer files.
.global-cache
is a tiny marker file that stops backup software descending into the directory, but I'm pretty sure it gets re-created by cargo anyway.
- name: cargo fmt | ||
run: cargo fmt --all -- --check | ||
|
||
cargo-clippy: | ||
name: cargo-clippy (${{ strategy.job-index == 0 && 'Linux' || (strategy.job-index == 1 && 'Windows' || 'macOS') }}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to go with OS name and maybe architecture, we can add OS version if we ever run multiple OS versions in CI.
Caching Changes
This PR activates the runs-on magic cache, which caches for around 10 days on AWS S3.
As part of that change, it also adds extra caches for:
And fixes the caches for:
Spot Instance Fixes
Spot instances are disabled in the merge queue and release-related workflows. This will only have an impact if we decide to re-enable them in
runs-on.yml
.This is based on runs-on maintainer advice:
runs-on/runs-on#338 (comment)
runs-on/runs-on#337 (comment)
Instance changes
This PR expands the runs-on CI instances to include "r7" instances (memory optimised). Instances will be chosen based on lowest cost from the entire range provided. This is based on advice from the runs-on maintainer:
runs-on/runs-on#338 (comment)
This is low risk, because our costs are already low. Because of our low costs, there's not much need to use spot instances, so I left them off in this PR.
It also removes some redundant CPU and RAM keys, because the provided numbers are already treated as a range:
https://runs-on.com/configuration/job-labels/#cpu
Workflow Cleanups
cargo fmt
does not require much CPU or RAM, so its job is changed to a free GitHub runner. It also doesn't need any dependencies, so those are removed.Code contributor checklist: