Optimizing a performance testing workflow by reusing minified R package versions between CI runs
Autocomment-atime-results is a GitHub Action (GHA) that tests the time/memory performance of incoming changes introduced via Pull Requests (PRs) to the GitHub repository of an R package. (For more information on the GHA, feel free to check what I wrote for The-Raft.)
Currently, it is used extensively in `data.table`, a well-known and highly performant package for data manipulation and analysis. This action utilizes the `atime` package to execute various tests, each of which is run across different specified versions of `data.table`, to help identify potential performance regressions and improvements relative to one another.
Currently, the GHA rebuilds and reinstalls these numerous versions of `data.table` for each job run, which consumes a considerable amount of time and CI resources. By optimizing this workflow along the lines of the main idea posted in #6528 (namely, caching the versions that are the same for every PR, or avoiding the reinstallation of the historical `data.table` versions), substantial speedups in job runs can be achieved. Time savings would be proportional to the number of reusable versions, and the more lightweight these historical versions are in size, the more of them we can fit within the free-tier usage limits.
Additionally, it would be favourable to have the GHA support PRs originating from forks.
- Caching CI Artifacts: Several GitHub projects leverage artifact caching to avoid redundant computations. For example:
  - Actions Upload/Download Artifacts store and retrieve build artifacts, reducing setup time.
  - GitHub Actions Cache helps reuse dependencies like R packages and compiled binaries across CI runs (a cache sketch follows this list).
- Package Optimizations: The R ecosystem has ways to minimize CI overhead, such as:
  - Using the `--no-manual` and `--no-build-vignettes` flags during `R CMD build` (and flags like `--no-docs` during `R CMD INSTALL`) to skip non-essential build steps (see the sketch below).
  - Excluding documentation and tests during package building for deployment-focused workflows.
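For instance, a minimal caching sketch (the library path and cache key here are illustrative assumptions, not taken from the existing workflow):

```yaml
# Restores previously installed historical versions when the key matches,
# and saves them automatically at the end of a successful job.
- name: Cache historical data.table libraries
  uses: actions/cache@v4
  with:
    path: ~/historical-libs          # assumed install location
    key: data.table-historical-v1    # bump the suffix to invalidate
```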
The main goal is to incorporate these two fundamental strategies to reduce CI time by reusing simplified historical versions of `data.table`.
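As a concrete illustration of the flag-based strategy (a sketch only; the tarball name borrows from the 'Easy' test below, and the exact flag set would need tuning):

```yaml
# Build the source package without the PDF manual or vignettes,
# then install it without docs; both steps shrink the result considerably.
- name: Minimal build and install
  run: |
    R CMD build --no-manual --no-build-vignettes data.table
    R CMD INSTALL --no-docs data.table_1.16.99.tar.gz
```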
- Supporting all forks: Handling workflows for forked repositories often involves using `pull_request_target` events or selectively disabling steps that rely on sensitive data. Following best practices outlined in similar workflows is an option. For example:
  - bencher.dev provides a discussion on adapting actions to support forks.
The optimization primarily consists of two phases:
- Artifact Caching & Retrieval:
  - Utilize upload-artifact to save precompiled versions of historical `data.table` releases after a CI run.
  - Leverage download-artifact to retrieve them at the beginning of subsequent CI runs, avoiding recompilation of versions that are constant across PRs.
- Minimized Package Installation:
  - Modify the `R CMD INSTALL` process to exclude unnecessary components such as documentation, vignettes, tests, and other large directories in order to significantly reduce package size and installation time. The "minified" build process will (see the sketch after this list):
    - Remove unnecessary files (e.g. `NEWS.md`) and subdirectories (such as `inst`, `man`, `tests`, `vignettes`, etc.)
    - Use build flags (like `--no-manual` and `--no-build-vignettes` for `R CMD build`, or `--no-docs` for `R CMD INSTALL`)
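A minimal sketch of such a minification step (paths mirror the list above; whether each removal is safe must be verified against the performance tests):

```yaml
# Strip non-essential files from an extracted source tree, then install it.
- name: Minify and install a historical version
  run: |
    tar -xzf data.table_1.16.99.tar.gz
    rm -rf data.table/inst data.table/man data.table/tests data.table/vignettes
    rm -f data.table/NEWS.md
    R CMD INSTALL --no-docs data.table
```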
Integration with the GHA
Make modifications to this workflow to check for the availability of the precompiled artifacts: if they are available, download and install them directly; if not, rebuild minified versions and upload them as artifacts for future use. Simply put, one needs to:
- Update the step which installs the different R package versions via `atime` to include the artifact download logic with appropriate branching (which involves changes in `atime` code as well).
- Modify the 'Upload results' step to save newly compiled historical versions.
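One way this branching could look (a sketch under assumptions: artifact names encode the version, and `build-minified.sh` is a hypothetical helper; note that `actions/download-artifact@v4` only sees the current run's artifacts unless given a `run-id` and token):

```yaml
- name: Try to download precompiled 1.15.0
  id: fetch
  continue-on-error: true
  uses: actions/download-artifact@v4
  with:
    name: data.table-1.15.0-minified
    path: historical-libs/

# 'outcome' (not 'conclusion') is 'failure' when a continue-on-error step fails
- name: Rebuild minified version if the artifact was missing
  if: steps.fetch.outcome == 'failure'
  run: ./build-minified.sh 1.15.0   # hypothetical helper script

- name: Upload the fresh build for future runs
  if: steps.fetch.outcome == 'failure'
  uses: actions/upload-artifact@v4
  with:
    name: data.table-1.15.0-minified
    path: historical-libs/
```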
Finally, ensure that the current set of tests with historical references uses cached versions of `data.table`, and estimate the total version count this optimization can support (mileage will vary based on how 'minified' each version is in terms of size), keeping in mind the usage limits for the free tier of GitHub Actions (500 MB of storage and 2,000 minutes/month at present).
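As a purely illustrative calculation (actual sizes must be measured): if each minified build compressed to roughly 5 MB, the 500 MB storage quota could hold on the order of 100 historical versions.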
PRs from forks
Adapt the workflow to securely handle PRs from forks. For example: making use of `pull_request_target` events to allow access to repository secrets for testing while avoiding exposure to untrusted code, disabling or modifying artifact upload steps to avoid conflicts or unauthorized access, and clearly documenting how contributors can trigger/debug workflows on forked repositories.
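A minimal sketch of the trigger side (job and step names are illustrative; the key point is that `pull_request_target` runs the base branch's workflow with secrets available, so the fork's code must be checked out explicitly and never executed by steps that can read those secrets):

```yaml
on:
  pull_request_target:   # runs in the base repository's context, with secrets
    types: [opened, synchronize]

jobs:
  performance-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          # check out the fork's code explicitly; treat it as untrusted
          ref: ${{ github.event.pull_request.head.sha }}
```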
It should yield immediately visible benefits in terms of conserving CI runtime for `data.table` on GitHub-hosted runners. Subsequently, this enables developers and contributors to iterate more quickly, given the faster feedback loops on PRs. And by enabling support for PRs from forks, it will allow collaboration across a broader community of contributors (not restricted to project members).
Contributors, please contact the mentors below after completing at least one of the tests below.
- EVALUATING MENTOR: Anirban Chetia [email protected]
- Other mentors: Toby Hocking [email protected]
Contributors, please do one or more of the following tests before contacting the mentors above.
- Easy - Given a tarball (e.g. `data.table_1.16.99.tar.gz`) of the `data.table` package, write a script that:
  - Extracts it, then removes subdirectories and files that are not required for performance testing.
  - Recreates a minimal version containing only the essential files for installation.
  - Installs that package using `R CMD INSTALL`.
- Medium - Write a YAML snippet that:
  - Checks if a precompiled artifact for a specific historical version of `data.table` exists.
  - If one is found, downloads and installs it; otherwise builds the version from source, minifies it, and uploads it as an artifact for future runs. (Name the artifacts based on version numbers to keep them distinct.)
  - Implements a fallback mechanism in the workflow to build the missing version from source in case artifact retrieval fails.
- Hard - Modify my GitHub Action to:
- Incorporate the steps you took to create the workflow for the 'Medium' test.
- Log the time taken to use the precompiled artifacts (it would be nice to numerically compare the time with this feature versus without it, i.e. when building from source) and the disk space utilized by the downloaded artifacts. Additionally, if you can think of any other useful metrics, feel free to include them with your rationale.
- Simulate some CI runs to observe the caching in effect and document gains.
- Make the workflow work with PRs from any contributor's fork. (Optional)
- Bonus points - Beyond minimizing packages and caching artifacts, identify one additional optimization that could further improve the performance of this workflow and share a working proof of concept. Make sure to provide links to the configured workflow and PRs demonstrating results.
Contributors, please post a link to your test results here.
- EXAMPLE CONTRIBUTOR 1 NAME, LINK TO GITHUB PROFILE, LINK TO TEST RESULTS.
- Sagnik Mandal, https://github.com/criticic, EASY, MEDIUM
- Priyanshu Yadav, https://github.com/tech0priyanshu, https://github.com/tech0priyanshu/r-package-ci-optimization