Support nvcc in sccache-dist #2247

Conversation
Codecov Report

Attention: Patch coverage is …

Additional details and impacted files:

```
@@            Coverage Diff            @@
##             main    #2247      +/-  ##
===========================================
+ Coverage   30.91%   55.14%   +24.23%
===========================================
  Files          53       57        +4
  Lines       20112    22104     +1992
  Branches     9755    10339      +584
===========================================
+ Hits         6217    12190     +5973
- Misses       7922     8080      +158
+ Partials     5973     1834     -4139
```

☔ View full report in Codecov by Sentry.
@trxcllnt I am off for two weeks.
```diff
@@ -1078,7 +1078,7 @@ mod client {
     use super::urls;
     use crate::errors::*;

-    const REQUEST_TIMEOUT_SECS: u64 = 600;
+    const REQUEST_TIMEOUT_SECS: u64 = 1200;
```
5 minutes is too short to compile some complex kernels, so I bumped to 10 minutes here. It would be good to make this configurable.
Cargo.toml (Outdated)

```diff
@@ -118,6 +119,7 @@ object = "0.32"
 rouille = { version = "3.6", optional = true, default-features = false, features = [
     "ssl",
 ] }
+shlex = "=1.3.0"
```
please document why "=1.3.0"
I think this is just what cargo selected when I did `cargo add shlex`. How do you suggest I update it?
Could you please document `NVCC_PREPEND_FLAGS` and `NVCC_APPEND_FLAGS`? Could you please split the work into smaller PRs?
```diff
@@ -390,11 +402,38 @@ where
             arg
         );
     }

+    let use_preprocessor_cache_mode = {
```
for example, this change seems to be independent from nvcc
It may seem that way, but unfortunately it is not.
The generated subcommands are run with `SCCACHE_DIRECT` set to false, because the preprocessed output is either impossible (or unnecessary) to cache, so attempting to cache it is just wasted cycles and messier logs.
It didn't seem like there was a good way to disable the preprocessor caching besides the envvar, but then the issue is that this envvar is only read on startup and not per file, so this change ensures it's read per file.
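For context, a minimal sketch of the effect once the envvar is read per compile request rather than once at startup (the compile commands here are hypothetical, not taken from this PR):

```sh
# Direct-mode (preprocessor) caching stays enabled for ordinary compiles...
sccache gcc -c x.c -o x.o

# ...but can be disabled for a single invocation, which is what the generated
# nvcc subcommands rely on.
SCCACHE_DIRECT=false sccache gcc -c x.cudafe1.cpp -o x.cu.o
```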
sorry for the latency but maybe add a comment in the code too :)
src/compiler/c.rs (Outdated)

```diff
         )
     )?;

+    let force_no_cache = env_vars
```
same here
Unfortunately this is also related to this feature, and I left an explanation in this comment. This behavior ensures we don't cache the same object file twice under two different hashes.

In the case where a call like `nvcc -c x.cu -o x.cu.o` is cached, `x.cu.o` is stored at a hash computed from the inputs. In the case that it's not cached, we generate a list of commands to produce `x.cu.o`, where the final command is the host compiler call that combines the fatbin + host code, i.e.:

```sh
<all other cicc, ptxas, fatbinary etc. calls>
gcc "$tmp/x.cudafe1.cpp" -o "x.cu.o"
```

We do want to run this host compiler call through sccache, so we can take advantage of distributed compilation, but doing that would normally cache `x.cu.o` with a hash computed from the gcc compilation flow. Since this is the object we're going to cache under the original hash computed for the outer nvcc call, and these objects can be large since they include the device code for all archs, we should avoid caching it twice.
Sure, these envvars are documented in the CUDA compiler docs. Since they can potentially affect the hash (and the nvcc-generated subcommands), sccache needs to intercept and prepend/append them manually.
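As a rough illustration of why they matter for hashing (the flag values below are made up):

```sh
# nvcc itself expands these variables into its effective command line, so they can
# change both the compile flags and the subcommands nvcc generates.
export NVCC_PREPEND_FLAGS="-ccbin /usr/bin/g++-12"
export NVCC_APPEND_FLAGS="--expt-relaxed-constexpr"
nvcc -c x.cu -o x.cu.o
# behaves roughly like:
#   nvcc -ccbin /usr/bin/g++-12 -c x.cu -o x.cu.o --expt-relaxed-constexpr
```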
This was my initial goal, but sticking to it was difficult. If I had to break it down into relatively isolated chunks, they could be: …

The challenge I had was that if broken into separate PRs, none of these changes make much sense on their own, and (more importantly) may not even be the way they should be implemented. For example, I anticipated reviewer questions like "why make …?" Similarly for the … changes.

So in conclusion, my main goal was to get the feature fully implemented and PR'd, then discuss details like these with the context of why they're necessary. I am 100% OK with splitting this out into separate PRs if that makes things easier to merge, I just want to make sure there's agreement on the implementation details first.
Hi, thank you so much for the thorough work you've put into this PR! 🙌 The level of detail and the description you provided in your comments are super helpful to understand the scope and rationale behind the changes. Really appreciate the thought you've given to this!

My first thought here would be that it would help with the review process if the changes were broken down into separate commits based on the different features and fixes you've described. This doesn't necessarily mean separate PRs, but organizing each distinct change into its own commit and force-pushing onto your branch would make it easier to go through and review each part more methodically. As it stands, the 47 commits make it a bit tricky to trace logical units of changes, and reviewing the full diff in one go is a tad overwhelming. Splitting these would definitely make the review process smoother.

That being said, coming from a place where I know nothing of nvcc nor why support for it was added, my understanding from what you wrote (and thanks again for the great description), …
Sure, I don't mind squashing the intermediate commits into logical units.
Just like other compiler drivers, …

Just like with …, I hope those answers help your understanding, but from your line of questioning I sense there's some deeper context I can provide as justification for this PR. I may not have made this clear from my issue and PR description, but the fundamental reason …

A very rough analogy is like if … So if …

The past few years we (the NVIDIA RAPIDS team) have seen huge improvements from shared build caches just using …
The fundamental reason sccache-dist preprocesses input is that it's the most efficient way to ensure everything required for compilation, aside from the toolchain, is available on the host where the job is distributed. If we could assume that the toolchain and all system headers were available there, at the right location, preprocessing wouldn't be necessary in the first place. That nvcc doesn't preprocess isn't the key issue. What's more relevant, though, is that this same constraint likely applies to some if not all of the commands it runs under the hood.

Preprocessing is actually more crucial to the caching part, and the way nvcc seems to operate makes me wonder if the current caching sccache does for nvcc is fully sound. Without looking into the exact implementation, I'm concerned there could be flaws, for example, when modifying headers that nvcc-invoked gcc calls rely on. Or when the gcc version changes. Such hidden issues might not surface during typical development cycles, but could lead to unexpected results at the most unexpected moments. This is the first thing that worries me, and entrenching the wrapping of nvcc, and the consequences that follow (need for re-entrancy for your use case), are not really enticing.

The second thing that worries me is that relying on the output from nvcc --dry-run to infer what commands it would run sounds brittle. Does it handle escaping reliably? Will sccache be able to parse those commands correctly, or could edge cases slip through due to inadequate or missing escaping?

Which brings me back to earlier questions: does nvcc also do anything unique itself, or is everything handled through the subcommands it executes? Does it call all its subcommands via $PATH or via absolute paths it derives from its own location?
Yes, I am aware of how sccache-dist works and why.
It is impractical, undesired, and out of scope of this PR to remove the client-preprocessing step from sccache-dist. For all intents and purposes, this is the reason nvcc compilations cannot presently be distributed.
Please review the steps in the PR description above the mermaid diagram, specifically steps 2, 3, and 4. These represent the internal host compiler and CUDA front-end (…) steps.
Current sccache (master branch) nvcc caching is almost fully sound, with one possible edge case that to my knowledge we've never hit in practice. That said, the changes in this branch could actually resolve that too, which I will describe below.
gcc headers are included in the preprocessed output, so changes to them are observed.
This is the edge case. Technically sccache should consider both … It currently doesn't, because that requires predicting which host compiler nvcc would choose when the …

Practically speaking, changing the host compiler version either involves changing a flag (…) …

However, this PR can ensure that never happens. By decomposing the nvcc call into its constituent commands, the final host compilation can run through sccache like any other compilation, at which point sccache considers its version in the computed hash.
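For illustration, one way the edge case can arise, assuming the usual `-ccbin` host-compiler flag (paths and versions below are hypothetical):

```sh
# The nvcc command line (and thus sccache's view of the inputs) is identical in
# both builds, but the compiler behind /usr/bin/g++ changed in between.
nvcc -ccbin /usr/bin/g++ -c x.cu -o x.cu.o   # today: /usr/bin/g++ -> g++-12
# ...later, the symlink is re-pointed (e.g. by a distro update)...
nvcc -ccbin /usr/bin/g++ -c x.cu -o x.cu.o   # now: /usr/bin/g++ -> g++-13
```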
Here we agree. This approach was neither pleasant to consider nor implement, but in my analysis represented the smallest ratio of sccache changes to new features. As a quick reminder, these are the new user-facing features enabled by this PR:
The first two are available to sccache clients running in local-compilation mode, and were consequences of the changes necessary to distribute nvcc compilations. However, they are huge quality-of-life improvements for relatively common scenarios encountered by users, even without distributed compilation. Integrating software that wasn't designed with my use-case in mind often involves compromises, and selecting the "least-bad" option from a set of bad options is sometimes necessary. I am absolutely open to suggestions on alternative implementations if you think there's something I've missed.
Yes, it is brittle. However, other tools rely on it as well. I am working with the NVIDIA compiler team to add a feature to future nvcc versions to produce its compilation DAG in a well-known format (e.g. graphviz dot) rather than shell format. When that lands, we can revisit the implementation here and use the nvcc DAG directly, rather than parsing one from the `--dryrun` output.
Yes.
That depends on the parser. I chose `shlex`.
I believe invoking … I briefed the CUDA compiler team on our plans to implement this feature, and they didn't raise any issues with the current approach. They are also aware of this PR, and we are working on plans to ensure these features are both stable and easier to achieve (e.g. enabling nvcc to output a DAG in a well-known format).
How nvcc locates its binaries depends on many factors; however, the source of truth is in the …
@sylvestre @glandium I've reordered and squashed the commits in this branch into logical units. Let me know if there's more I can do to make this easier to review. I have also removed the …

When we decompose nvcc into its subcommands, the final host compiler call is the …:

```sh
nvcc -x cu -c x.cu -o x.cu.o       # <-- cache x.cu.o (1)
- cicc -o x.ptx x.cu.ii            # <-- cache x.ptx
- ptxas -o x.cubin x.ptx           # <-- cache x.cubin
- gcc -c x.cudafe1.cpp -o x.cu.o   # <-- cache x.cu.o (2)
```

The … But as @glandium points out, this is potentially unsound in the case where the host compiler changes in an unobservable way, for example by switching the compiler symlink via … If this happens, the easiest and most reliable implementation is to always run each …:

```sh
nvcc -x cu -c x.cu -o x.cu.o       # <-- _do not_ cache x.cu.o here
- cicc -o x.ptx x.cu.ii            # <-- cache x.ptx
- ptxas -o x.cubin x.ptx           # <-- cache x.cubin
- gcc -c x.cudafe1.cpp -o x.cu.o   # <-- cache x.cu.o here
```

The above logic is now included in this PR (by always returning …). However, this method has potentially undesirable consequences. Impact on …
@trxcllnt and I did an offline code review.
nitpick: would like to see "Ensure `use_preprocessor_cache_mode` considers the current value of the `SCCACHE_DIRECT` envvar." as a separate PR.

nitpick: the extra dist files comment / variable should be updated to better reflect that these files won't be hashed (in our case the name of the file will change, but the contents won't).

nitpick: unhashflag: look at whether it should capture `-v`.

major: Need to pass `-rdc=true` when doing preprocessing when given `-dc`, so that we properly evaluate `__CUDACC_RDC__`.
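A sketch of that last point, using standard nvcc flags (this restates the reviewer's requirement; it is not code from the PR):

```sh
nvcc -dc x.cu -o x.cu.o                  # -dc enables relocatable device code
nvcc -rdc=true -E -x cu x.cu > x.cu.ii   # preprocess with __CUDACC_RDC__ defined,
                                         # so the hash matches what the real compile sees
```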
* Spawn the tokio compilation task in `check_compiler()` instead of in `start_compile_task()`
* Map the tokio compilation task into a stream, and return that stream as the response body

These changes reduce the amount of memory leaked when building the default linux target with glibc. Refactoring `start_compile_task()` to be an async function is a necessary change to support re-entrancy for sccache-dist nvcc support.
…he `SCCACHE_DIRECT` envvar.
* Refactor `CompileCommand` into a trait with an execute function that accepts an `SccacheService<T>` ref and a `CommandCreatorSync` ref
* Rename `CompileCommand` struct to `SingleCompileCommand`
* Update all `generate_compile_command()` implementations to return a boxed `CompileCommand` trait

This allows compilers to return custom implementations of the `CompileCommand` trait that execute more complex logic than simply launching a subprocess.
…ompilers can add additional files to be included for each distributed compilation
… the computed hash
* Adds top-level `Cicc` and `Ptxas` compiler types
* Updates `Nvcc` compiler implementation to decompose nvcc calls into its constituent subcompiler invocations via the `nvcc --dryrun` flag
* Bumps the `sccache-dist` request timeout from 5 to 10 minutes, because nvcc compilations can take a while
* Updates the CUDA tests, separates into tests for nvcc and clang since now their behavior is different
* Fixes lint
… caching outer nvcc invocations
* Test nvcc and clang-cuda in workflows/ci.yml * Fix clang-cuda tests * Ensure /tmp/sccache_*.txt files are included in failed job artifacts on Windows
…ng envvars are reset before unwrapping/asserting
…ations in the stats
Let's see!

Awesome, thanks for all the hard work.
Shoot, I didn't know this was close to merging. I have a few minor fixes to account for differences in the CUDA Toolkit v11.1, as well as an update to the stats for the new `SCCACHE_NO_CACHE`, so I'll open another PR.
This PR implements the features described in #2238. The reasons and benefits for this PR are described there, so I'll focus on the details of the implementation here.
First, I'll cover how to use and modify `nvcc --dryrun` output to reliably yield cache hits and misses. Then I'll cover the `sccache` changes necessary to implement this feature. Last, I'll show the results of a fully-uncached NVIDIA RAPIDS build with these changes applied.

Table of contents:

- Anatomy of an `nvcc` call
- Adding `cicc` and `ptxas` as top-level compilers
- Customizing `CompileCommand`
CUDA compilation
To distribute and cache CUDA compilations, it's necessary to understand the anatomy of an `nvcc` call, and how the GPU architecture options impact what is compiled into the final object.

Anatomy of an `nvcc` call

As noted in #2238, `nvcc` is a wrapper around the host compiler and internal CUDA device compilers. Notably for `sccache`, `nvcc` can compile a source file into an object that runs on multiple GPU architectures.

Two kinds of device code can be embedded into the final object: PTX and cubins. A `.ptx` file is assembly, and a `.cubin` file is the assembled GPU code. A `.cubin` is valid for any GPU architecture in the same family, e.g. `sm_70.cubin` runs on Volta and Turing (but not Ampere), and `sm_80.cubin` runs on Ampere and Ada (but not Hopper).

Applications that wish to run on heterogeneous GPU architectures embed cubins for their supported architecture families, as well as PTX for the latest architecture. If the application is run on a newer GPU than what's been embedded, the CUDA driver will JIT the embedded PTX into GPU code at runtime.
This is achieved by the `-gencode=` flags. The `nvcc --dryrun` flag shows how `nvcc`
achieves this:This output can be grouped into sections of commands that must run sequentially (whether local or distributed).
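As a representative invocation (architectures chosen to match the Volta/Ampere discussion above; these are not this project's exact flags):

```sh
# Embed sm_70 and sm_80 cubins plus compute_80 PTX for forward compatibility,
# and print the underlying commands instead of running them.
nvcc --dryrun -c x.cu -o x.cu.o \
  -gencode=arch=compute_70,code=sm_70 \
  -gencode=arch=compute_80,code=[sm_80,compute_80]
```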
Each group may depend on previous groups, but some groups can be executed in parallel.
1. This group is a list of environment variables needed by the later sections:
2. This group preprocesses the source file into a form that embeds the GPU code into the final object:
3. This group compiles the source file to arch 70 PTX, then assembles it into an arch 70 cubin:
4. This group does the same as the the third group, except for arch 80:
5. This group assembles the PTX and cubins into a fatbin, then compiles step 2's preprocessor output to an object with the fatbin embedded:
The above commands as a DAG:
The the nodes labeled "Can be distributed" are the most expensive to run, but are also cacheable.
For example, rebuilding with a subset of a prior build's architectures should be fast:
Impediments to caching
In theory,
sccache
should be able to parse thenvcc --dryrun
output and execute each command. In practice, directly executingnvcc --dryrun
commands yields cache misses when it should yield hits, necessitating careful modifications.Random strings in file names
By default nvcc's generated files have randomly-generated strings as part of their file names. These strings end up in the preprocessor output, making it impossible to cache when the post-processed file is included in the cache key.
This behavior is disabled when the
--keep
flag is present, so it is essential to usenvcc --dryrun --keep
to generate commands.Architecture-dependent file names
The filenames generated for the intermediate
.ii
,.stub.c
, and.cudafe1.{c,cpp}
files are sensitive to the set of-gencode=
flags. Because we're dealing with preprocessor output, these names leak into the post-processed output, again leading to cache misses.The file names aren't relevant to the final result, and we can rename them (i.e. auto-incrementing by type) to yield cache hits.
Choice of build directory
Since we'd like to run the underlying nvcc commands directly, we need a scratch directory in which to work.
nvcc --dryrun --keep --keep-dir $(mktemp -d)
would generate commands with paths to files in the temp dir, however this leads to the same issue as before: the path to the temp dir ends up as part of the post-processed files leading to unnecessary cache misses.Additionally, since the original
nvcc
invocation can include relative paths, it's essential to either run the preprocessor from the original cwd, or canonicalize all argument paths. To align with existing sccache behavior, I chose the former approach.My solution is to alternate which directory each command is executed from. For example, in pseudo-code:
This approach ensures only paths relative to the tempdir are ever in any post-processed output handled by nvcc. Along with renaming the files deterministically by extension, this ensures no false-negative cache misses from random paths leaking into preprocessor output.
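A rough sketch of the alternation described above, with assumed file names (`$orig_cwd` and `$tmp` are placeholders, and most flags are omitted):

```sh
tmp=$(mktemp -d)

# Preprocess from the original cwd so relative include paths still resolve,
# writing the result into the tempdir.
(cd "$orig_cwd" && gcc -E ... x.cu -o "$tmp/x.cpp1.ii")

# Run the device-side steps from inside the tempdir, so only bare relative
# file names can appear in any preprocessed output.
(cd "$tmp" && cicc ... -o x.ptx x.cpp1.ii)
(cd "$tmp" && ptxas ... -o x.cubin x.ptx)

# The final host compile runs from the original cwd, like any other compile.
(cd "$orig_cwd" && gcc -c "$tmp/x.cudafe1.cpp" -o x.cu.o)
```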
sccache implementation
The high-level design of this feature is as follows:
cicc
andptxas
as "top-level" compilers in sccacheNvcc
compiler to generate a compile command that:nvcc --keep --dryrun
for additional envvars and constructs the command group DAGnvcc --threads N
valuesccache
to produce each build product

Adding cicc and ptxas as top-level compilers

The `cicc` and `ptxas` compiler implementations are straightforward, if a bit odd compared to other `CCompilerImpl`s. `cicc` and `ptxas` arguments are not documented, and the NVIDIA compiler team may change them at any time. So the approach I've taken is to only process the subset of the arguments that impact caching (some of which are excluded from the computed hash), and pass through all other arguments as-is (including these in the computed hash).

In addition to the `-o x.ptx` argument, `cicc` has 3 options that cause it to create additional outputs: `--gen_c_file_name`, `--gen_device_file_name`, and `--stub_file_name`. The file names can be different based on the `-gencode` flags, so they are excluded from the hash computed for the output `.ptx` file.

`cicc` also requires the `.module_id` file generated by `cudafe++` as an input. This is available when performing a local compile, but requires adding an `extra_dist_files` list to the `ParsedArguments` and `CInputsPackager`
structs.

Making sccache reentrant

In order for the `Nvcc` compiler to generate a `CompileCommand` that can load-or-compile-and-cache the underlying ptx and cubins, a mechanism for recursively calling sccache with additional compile commands needs to exist.

Theoretically we could re-invoke the `sccache` client binary as a subprocess, passing the sub-compiler as arguments, e.g. `sccache cicc ...` (or make a request to the `sccache` server that communicates the same information), but this is a non-starter due to jobserver limitations. If each `nvcc` invocation spawns sub-invocations that are processed by the `sccache` server's `Command::Compile` matcher, a new jobserver slot is reserved. The outer `nvcc` slot is not released, and when all the jobserver slots are reserved, the server deadlocks. `nvcc` jobs are taking up slots waiting for sub-compiler invocations that never start due to being blocked by other nvcc jobs.

The other way to re-enter sccache with additional compile commands seems to be via `SccacheService<T>`. By providing the `SccacheService` instance to `CompileCommand::execute()`, Nvcc's `CompileCommand` implementation should be able to call it as necessary for each sub-command.

I refactored `start_compile_task()` into an async function that returns a `Future<Result<CompileFinished>>`, and refactored `check_compiler()` to spawn + join a tokio task for the `start_compile_task()` future. Then if we make both `compiler_info()` and `start_compile_task()` public, the Nvcc `CompileCommand` implementation can mimic the top-level `handle_compile()`
logic without spawning additional tasks.

Customizing CompileCommand

After making the above changes, the last step is to refactor `CompileCommand` into a trait that supports different implementations of its logic. This was straightforward, and I modeled the code after the relationship between `Compiler`/`CCompiler`/`CCompilerImpl`.

- `CompileCommand<T>` is a new trait with a constraint on `CommandCreatorSync` (because traits with generic functions are not "object safe")
- `CompileCommandImpl` is a new trait with a generic `execute<T>(service: &SccacheService<T>, creator: &T) -> Result<process::Output>`
- `CCompileCommand<I>` is a `CompileCommand<T>` implementation that owns a concrete `CompileCommandImpl`

There are two `CompileCommandImpl` implementations: `SingleCompileCommand`, which is exactly the same as the original `CompileCommand` struct, and `NvccCompileCommand`, which implements the additional logic for parsing `nvcc --dryrun` and transforming/calling the subcommands.

Final thoughts
I'm not a huge fan of the changes to `SccacheService<T>` and `CompileCommand`.

I don't feel great about passing around references to a global just so `CompileCommand` can re-enter the `sccache` compilation flow. I understand `SccacheService<T>::compiler_info()` requires mutable state in `SccacheService`, but needing to pass the `SccacheService` reference to `CompileCommand` (+ mock it in `CompileCommand` tests) doesn't pass my personal smell-test.

Comparative build times
Here are two fully uncached NVIDIA RAPIDS builds using sccache `v0.7.7` and the version in this PR, with total build time dropping from 4h 21m to 2h 18m 🎉:

sccache v0.7.7 local

sccache v0.8.1 distributed

Closes #2238