Optimizing a performance testing workflow by reusing minified R package versions between CI runs

Background

Autocomment-atime-results is a GitHub Action (GHA) that tests the time and memory performance of changes introduced via Pull Requests (PRs) to the GitHub repository of an R package. (For more information on the GHA, feel free to check what I wrote for The-Raft.)

Currently, it is used extensively in data.table, a well-known and highly performant package for data manipulation and analysis.

This action utilizes the atime package to execute various tests, each of which is run across different specified versions of data.table in order to help identify potential performance regressions and improvements relative to one another.

Currently, the GHA rebuilds and reinstalls these numerous versions of data.table for each job run, which consumes a considerable amount of time and CI resources. By optimizing this workflow along the lines of the main idea posted in #6528 (which is to cache the versions that are the same for every PR, or in other words to avoid reinstalling the historical data.table versions), substantial speedups in job runs can be achieved. Time savings would be proportional to the number of reusable versions, and the more lightweight these historical versions are in size, the more we can fit within the usage limits of the free tier.

Additionally, it would be favourable to have the GHA support PRs originating from forks.

Related work

  • Caching CI Artifacts: Several GitHub projects leverage artifact caching to avoid redundant computations across runs.

  • Package Optimizations: The R ecosystem has ways to minimize CI overhead, such as:

    • Using flags such as --no-build-vignettes and --no-manual with R CMD build, and --no-docs/--no-help with R CMD INSTALL, to skip non-essential build steps.
    • Excluding documentation and tests during package building for deployment-focused workflows.

The main goal is to incorporate these two fundamental strategies to reduce CI time by reusing simplified historical versions of data.table.

  • Supporting all forks: Handling workflows for forked repositories often involves using pull_request_target events or selectively disabling steps that rely on sensitive data. Following best practices outlined in similar workflows is an option. For example:
    • bencher.dev provides discussion on adapting actions to support forks.

Details of your coding project

The optimization primarily consists of two phases:

  1. Artifact Caching & Retrieval:

    • Utilize upload-artifact to save precompiled versions of historical data.table releases after a CI run.
    • Leverage download-artifact to retrieve them at the beginning of subsequent CI runs, avoiding recompilation of versions that are constant across PRs.
  2. Minimized Package Installation:

    • Modify the R CMD INSTALL process to exclude unnecessary components such as documentation, vignettes, tests, and other large directories, significantly reducing package size and installation time. The "minified" build process will (see the sketch after this list):
      • Remove unnecessary files (e.g. NEWS.md) and subdirectories (such as inst, man, tests, vignettes, etc.)
      • Use build/installation flags (like --no-build-vignettes, --no-manual, and --no-docs)
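
As an illustration only, a minimal sketch of what such a minified build step could look like is below. The version tag (1.14.0), the set of removed paths, and the library location are assumptions for illustration; the exact files that can be safely dropped would need to be verified against the atime test suite.

```yaml
# Hypothetical workflow step; not the final design.
- name: Build and install a minified historical version
  run: |
    git clone --depth 1 --branch 1.14.0 https://github.com/Rdatatable/data.table dt-src
    cd dt-src
    # Drop files and directories not needed for performance testing
    rm -rf inst man tests vignettes NEWS.md
    cd ..
    # Skip vignettes and the PDF manual when building the tarball
    R CMD build --no-build-vignettes --no-manual dt-src
    # Install into a dedicated library without docs or help pages
    mkdir -p precompiled/1.14.0
    R CMD INSTALL --no-docs --no-help --library=precompiled/1.14.0 data.table_*.tar.gz
```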

Integration with the GHA

Modify this workflow to check for the availability of the precompiled artifacts: if they are available, download and install them directly; if not, rebuild minified versions and upload them as artifacts for future use. Simply put (a sketch follows the list below), one needs to:

  • Update the step which includes installation of the different R package versions via atime to include the artifact download logic with appropriate branching (which involves changes in atime code as well).
  • Modify the 'Upload results' step to save newly compiled historical versions.
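
For orientation only, here is one way the availability check and branching could be expressed. This sketch uses actions/cache/restore and actions/cache/save, whose cache-hit output makes the existence check straightforward; an equivalent setup built on download-artifact/upload-artifact would additionally need to look up the id of a previous successful run. The cache key, paths, and the build-minified-versions.sh helper are hypothetical.

```yaml
- name: Restore precompiled historical data.table versions
  id: restore
  uses: actions/cache/restore@v4
  with:
    path: precompiled
    key: datatable-minified-${{ hashFiles('.ci/atime/versions.txt') }}

# Only rebuild when nothing was restored (cache miss)
- name: Rebuild minified versions from source
  if: steps.restore.outputs.cache-hit != 'true'
  run: ./build-minified-versions.sh   # hypothetical helper that loops over the version list

# Save the freshly built versions so subsequent runs can reuse them
- name: Save precompiled versions for future runs
  if: steps.restore.outputs.cache-hit != 'true'
  uses: actions/cache/save@v4
  with:
    path: precompiled
    key: datatable-minified-${{ hashFiles('.ci/atime/versions.txt') }}
```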

Finally, ensure that the current set of tests with historical references uses cached versions of data.table, and estimate the total version count this optimization can support (mileage would vary based on how 'minified' each version is in terms of size), keeping in mind the usage limits for the free tier of GitHub Actions (500 MB and 2k minutes/month at present).
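
As a rough, purely illustrative estimate: if each minified installation compressed to around 5 MB, the 500 MB storage allowance would hold on the order of a hundred cached versions, minus whatever space the regular test-result artifacts already consume; actual numbers should be measured once the minification step exists.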

PRs from forks

Adapt the workflow to securely handle PRs from forks. For example: use pull_request_target events to allow access to repository secrets for testing while avoiding exposure to untrusted code, disable or modify artifact upload steps to avoid conflicts or unauthorized access, and clearly document how contributors can trigger/debug workflows on forked repositories.
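
One possible shape for such a setup is sketched below; it is only a starting point, and the trigger types, permissions, and step names are assumptions rather than the required configuration.

```yaml
# Assumed configuration for fork PRs; review carefully before adopting, since
# pull_request_target runs with access to the base repository's secrets.
on:
  pull_request_target:
    types: [opened, reopened, synchronize]

permissions:
  contents: read
  pull-requests: write   # needed if the action posts result comments on the PR

jobs:
  atime:
    runs-on: ubuntu-latest
    steps:
      # pull_request_target checks out the base branch by default, so the PR
      # head must be checked out explicitly for its changes to be tested
      - uses: actions/checkout@v4
        with:
          ref: ${{ github.event.pull_request.head.sha }}
      # Skip uploading shared artifacts for forked PRs so untrusted code
      # cannot overwrite the cached historical versions
      - name: Upload minified versions
        if: ${{ !github.event.pull_request.head.repo.fork }}
        uses: actions/upload-artifact@v4
        with:
          name: datatable-minified-versions
          path: precompiled
```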

Expected impact

It should yield immediately visible benefits by conserving CI runtime for data.table on GitHub-hosted runners. This, in turn, enables developers and contributors to iterate more quickly thanks to faster feedback loops on PRs. And by enabling support for PRs from forks, it will allow collaboration across a broader community of contributors (not restricted to project members).

Mentors

Contributors, please contact the mentors below after completing at least one of the tests below.

Tests

Contributors, please do one or more of the following tests before contacting the mentors above.

  • Easy - Given a tarball (e.g. data.table_1.16.99.tar.gz) of the data.table package, write a script that:
  1. Extracts it, then removes subdirectories and files that are not required for performance testing.
  2. Recreates a minimal version containing only the essential files for installation.
  3. Installs that package using R CMD INSTALL.
  • Medium - Write a YAML snippet that:
  1. Checks if a precompiled artifact for a specific historical version of data.table exists.
  2. If it is found, downloads and installs the artifact; otherwise builds the version from source, minifies it, and then uploads it as an artifact for future runs. (Name the artifacts based on version numbers to keep them distinct.)
  3. Implement a fallback mechanism in the workflow to build the missing version from source in case artifact retrieval fails.
  • Hard - Modify my GitHub Action to:
  1. Incorporate the steps you took to create the workflow for the 'Medium' test.
  2. Log the time taken to use the precompiled artifacts (it would be nice to numerically compare the time with this feature versus without it, i.e. building from source) and the disk space utilized by the downloaded artifacts. Additionally, if you can think of any other useful metrics, feel free to include them with your rationale.
  3. Simulate some CI runs to observe the caching in effect and document gains.
  4. Make the workflow work with PRs from any contributor's fork. (Optional)
  • Bonus points - Beyond minimizing packages and caching artifacts, identify one additional optimization that could further improve the performance of this workflow and share a working proof of concept. Make sure to provide links to the configured workflow and PRs demonstrating results.

Solutions of tests

Contributors, please post a link to your test results here.
