chore(gpu): add benchmarking for 8xsxm5 with two processes #3465
andrei-stoian-zama wants to merge 3 commits into
8f3377c to 2389601
@@ -0,0 +1,67 @@
# CLAUDE.md
This file shouldn't be included
{
let mut rng = thread_rng();
let num_gpus = get_number_of_gpus() as u64;
let total_num_gpus = get_number_of_gpus() as u64;
I understand this is the logic that makes each process select only a subgroup of GPUs, but in practice all the GPUs are visible to each process. I think it would be easier to filter the GPUs with CUDA_VISIBLE_DEVICES when we launch the benchmark. Then all this extra logic is unnecessary, and it better simulates what would happen in reality.
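For illustration, here is a minimal sketch of what launch-time filtering could look like. It is not code from this PR: the helper names `gpu_groups` and `launch_groups` are hypothetical, and it assumes the GPUs are split into contiguous, equally sized groups. The only external behaviour it relies on is that CUDA restricts a process to the devices listed in `CUDA_VISIBLE_DEVICES`.

```python
import os
import subprocess


def gpu_groups(total_gpus, num_groups):
    """Split GPU indices 0..total_gpus-1 into contiguous, equally sized groups.

    Returns one comma-separated string per group, ready to be used as a
    CUDA_VISIBLE_DEVICES value.
    """
    if total_gpus <= 0 or num_groups <= 0 or total_gpus % num_groups != 0:
        raise ValueError("total_gpus must be a positive multiple of num_groups")
    size = total_gpus // num_groups
    return [
        ",".join(str(i) for i in range(g * size, (g + 1) * size))
        for g in range(num_groups)
    ]


def launch_groups(command, total_gpus, num_groups):
    """Start one copy of `command` per group, each seeing only its own GPUs.

    `command` is whatever benchmark invocation is used (hypothetical here).
    Returns the exit codes of all processes.
    """
    procs = []
    for devices in gpu_groups(total_gpus, num_groups):
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=devices)
        procs.append(subprocess.Popen(command, env=env))
    return [p.wait() for p in procs]
```

With this approach each process sees its devices renumbered from 0, so the benchmark itself would not need to know which half of the machine it was given.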
soonum left a comment
Divide and conquer ERC7984 transfer ⚔️ !
@soonum reviewed 10 files and all commit messages, and made 15 comments.
Reviewable status: all files reviewed, 14 unresolved discussions (waiting on andrei-stoian-zama, IceTDrinker, soonum, and SouchonTheo).
Makefile line 1958 at r1 (raw file):
--features=integer,gpu,internal-keycache,pbs-stats -p tfhe-benchmark --profile release_lto_off --
.PHONY: bench_hlapi_erc7984_multi_group_gpu # Runs ERC7984 bench in two processes (half of gpus for each) and aggregates results
This recipe doesn't aggregate the results.
Suggestion:
.PHONY: bench_hlapi_erc7984_multi_group_gpu # Runs ERC7984 throughput benchmarks in two processes (half of gpus for each)
.github/workflows/benchmark_gpu.yml line 79 at r1 (raw file):
- multi_bit_documentation - classical_documentation + multi_bit_documentation
This newline is not needed.
.github/workflows/benchmark_gpu_common.yml line 263 at r1 (raw file):
if: ${{ inputs.command == 'hlapi_erc7984_multi_group' }}
run: |
  pwd
I guess this is a debug command.
.github/workflows/benchmark_summary.yml line 117 at r1 (raw file):
SLAB_BASE_URL: ${{ secrets.SLAB_BASE_URL }}
run-benchmarks-gpu-erc7984-multi-group:
Which category does this benchmark fall into? A new line in the ERC7984 dashboard?
backends/tfhe-cuda-backend/CLAUDE.md line 1 at r1 (raw file):
# CLAUDE.md
I'd rather add this file in another PR.
This is not related to the benchmark changes.
ci/merge_multi_group_results.py line 2 at r1 (raw file):
#!/usr/bin/env python3
# TODO: ADD COMMENT
TODO to be done 😏 (same everywhere else)
ci/merge_multi_group_results.py line 8 at r1 (raw file):
# TODO: ADD COMMENT
def merge_multi_group_results(file1, file2, output_file):
Can't we make it generic for N files? This should be doable as long as we use *files and keep output_file as the last argument on the command line.
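A possible shape for the N-file version is sketched below. The result format is not shown in this PR, so the sketch assumes each input file holds a flat JSON object mapping a benchmark name to a numeric throughput, and that aggregating means summing per name. The `main` helper and its argument layout are hypothetical.

```python
import json


def merge_multi_group_results(input_files, output_file):
    """Sum per-benchmark values across any number of result files.

    Assumption: each file contains a flat JSON object {name: number}.
    """
    totals = {}
    for path in input_files:
        with open(path) as f:
            for name, value in json.load(f).items():
                totals[name] = totals.get(name, 0) + value
    with open(output_file, "w") as f:
        json.dump(totals, f)
    return totals


def main(argv):
    """Hypothetical CLI layout: <input1> <input2> ... <output>.

    Every argument except the last is an input; the last is the output file.
    """
    if len(argv) < 2:
        raise SystemExit("usage: merge_multi_group_results.py INPUT... OUTPUT")
    *input_files, output_file = argv
    return merge_multi_group_results(input_files, output_file)
```

Keeping the output path last means the script would be called the same way for two groups or for eight.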
ci/merge_multi_group_results.py line 37 at r1 (raw file):
with open(output_file, "w") as f:
    json.dump(result, f, indent=2)
I think Slab doesn't support pretty-printed JSON. To be tested.
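If pretty-printed output does turn out to be a problem, dropping `indent` is enough to get single-line JSON. The sketch below only shows the difference in output; the field names in `result` are made up for illustration and are not the real result schema.

```python
import json

# Hypothetical merged result; the real field names are not shown in the PR.
result = {"erc7984_transfer_tps": 24.0, "num_groups": 2}

# indent=2 spreads the object over several lines.
pretty = json.dumps(result, indent=2)

# No indent plus tight separators gives one line without padding.
compact = json.dumps(result, separators=(",", ":"))
```

Both strings parse back to the same object, so the change would only affect how the file looks on disk.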
tfhe-benchmark/benches/high_level_api/erc7984.rs line 552 at r1 (raw file):
let params_name = params.name();
// 300 * num_gpus seems to be enough for maximum throughput on 8xH100 SXM5
Comment to update.
tfhe-benchmark/src/utilities.rs line 483 at r1 (raw file):
}
pub fn get_bench_instances() -> Option<usize> {
Maybe rename it to get_bench_gpu_instances() to avoid confusion.
tfhe-benchmark/src/utilities.rs line 491 at r1 (raw file):
}
/// TODO: ADD COMMENT
I'm adding a comment 😂.
More seriously, a comment is more than needed here, since this function brings libc object handling into a native Rust file.
tfhe-benchmark/src/utilities.rs line 493 at r1 (raw file):
/// TODO: ADD COMMENT
#[cfg(target_os = "linux")]
pub fn bench_sync_barrier(num_instances: usize) {
Maybe we can have a cleanup function that is called in the caller before running the benchmark? That would remove the need to manually remove the files in the Makefile.
By using a struct that embeds all three sem_* handles, we could pass it to this cleanup function, which simply does:
unsafe {
libc::sem_close(mutex);
libc::sem_close(arrive);
libc::sem_close(gate);
libc::sem_unlink(sem_mutex.as_ptr());
libc::sem_unlink(sem_arrive.as_ptr());
libc::sem_unlink(sem_gate.as_ptr());
}
I'd prefer all filesystem ops to be handled at crate level rather than a mix of crate + Makefile.
tfhe-benchmark/src/utilities.rs line 500 at r1 (raw file):
const MUTEX_NAME_PREFIX: &str = "tfhe_bench";
let sem_mutex = CString::new(format!("/{MUTEX_NAME_PREFIX}_mutex")).unwrap();
What does sem_ stand for here? semaphore_?
tfhe-benchmark/src/utilities.rs line 504 at r1 (raw file):
let sem_gate = CString::new(format!("/{MUTEX_NAME_PREFIX}_gate")).unwrap();
let now = SystemTime::now().duration_since(UNIX_EPOCH).unwrap();
Do we want to panic if this call fails, despite being very unlikely, since it would abort the whole benchmark?
A current issue with cloud-init handling at Scaleway prevents us from using a custom image to spawn instances, so we use a default image and install all the required dependencies. Note that NVIDIA drivers are already installed system-wide on this default image, so we don't need to specify CUDA-related environment variables.
054f90a to 1503608
✅ Backward-compat snapshot: everything looks good! No backward-compatibility issues detected.
Runs TPS benchmarks in parallel in two processes, starting at the same time with synchronization. Aggregates the bench results and sends the aggregated TPS count to Slab.