-
Notifications
You must be signed in to change notification settings - Fork 607
CPU Optimizations for FP8 #2559
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
93ee022
06338bc
50de9cd
5fee841
4c79ac7
62b88e1
99494d7
b157f85
2a7b627
b61a6a8
938651e
88dfdbd
1526eea
5809dcc
b3bd748
30fecf2
1b0d497
138b7bf
eec1e86
8169d9c
6fefaf2
a5feaf9
285dbff
afb2f23
3919cb8
fd36424
4668133
5a00652
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -961,7 +961,7 @@ void nvte_destroy_quantization_config(NVTEQuantizationConfig config) { | |||||||||||||||||
| } | ||||||||||||||||||
|
|
||||||||||||||||||
| int nvte_is_non_tn_fp8_gemm_supported() { | ||||||||||||||||||
| int num_devices = transformer_engine::cuda::num_devices(); | ||||||||||||||||||
| static int num_devices = transformer_engine::cuda::num_devices(); | ||||||||||||||||||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Critical: Potential out-of-bounds access with static num_devices Making This will cause out-of-bounds access on lines 968 and 975: std::call_once(flags[device_id], ...); // OOB if device_id >= num_devices
return cache[device_id]; // OOB if device_id >= num_devicesImpact: Undefined behavior, potential crashes, memory corruption. Fix: Query device count each time, or add bounds checking: int num_devices = transformer_engine::cuda::num_devices();
static std::vector<int> cache;
static std::vector<std::once_flag> flags;
static std::mutex resize_mutex;
int device_id = transformer_engine::cuda::current_device();
{
std::lock_guard<std::mutex> lock(resize_mutex);
if (device_id >= cache.size()) {
cache.resize(device_id + 1, -1);
flags.resize(device_id + 1);
}
}
std::call_once(flags[device_id], [&]() { ... });
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Potential race condition in static variable initialization Making Scenario:
However, this is likely safe in practice because:
But for correctness, consider wrapping in the existing static int num_devices = []() {
return transformer_engine::cuda::num_devices();
}();Or initialize it within the |
||||||||||||||||||
| static std::vector<int> cache(num_devices, -1); | ||||||||||||||||||
| static std::vector<std::once_flag> flags(num_devices); | ||||||||||||||||||
|
Comment on lines
+964
to
966
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Potential edge case: The While dynamic device changes are rare, consider adding a bounds check:
Suggested change
Comment on lines
+964
to
966
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Critical bug: Potential out-of-bounds access when Making
Fix: Either:
|
||||||||||||||||||
| int device_id = transformer_engine::cuda::current_device(); | ||||||||||||||||||
|
|
||||||||||||||||||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Making
num_devicesstatic has a subtle initialization order issue. The static initialization happens once when the function is first called, but the subsequent static vectorscacheandflagsdepend onnum_devicesfor their size.If
transformer_engine::cuda::num_devices()returns different values across multiple calls (which shouldn't happen in practice but isn't guaranteed by the API), the first call to this function will initializenum_devices, and subsequent calls will use that cached value. However, if the CUDA context changes or devices are added/removed (in rare scenarios), this could cause a mismatch.Consider:
This is initialized once, but
cacheandflagsvectors might need a different size if the device count somehow changes. While unlikely, this could cause out-of-bounds access.A safer approach might be:
Or simply document that the device count must not change during the application's lifetime.