chore(gpu): add benchmarking for 8xsxm5 with two processes #3465
andrei-stoian-zama wants to merge 3 commits into main
Conversation
@@ -0,0 +1,67 @@
# CLAUDE.md

This file shouldn't be included.
{
    let mut rng = thread_rng();
    let num_gpus = get_number_of_gpus() as u64;
    let total_num_gpus = get_number_of_gpus() as u64;
I understand this is the logic that makes each process select only a subgroup of GPUs, but in practice all the GPUs are visible to each process. I think it would be easier if, when we launch the benchmark, we filter the GPUs with CUDA_VISIBLE_DEVICES. Then all this extra logic is unnecessary, and it better simulates what would happen in reality.
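The CUDA_VISIBLE_DEVICES split suggested above could be sketched as follows. This is a minimal illustration, not code from the PR: the helper name and the contiguous partitioning scheme are assumptions.

```python
def split_gpu_groups(total_gpus, num_groups=2):
    """Partition GPU indices into contiguous groups and render each group
    as a CUDA_VISIBLE_DEVICES value (a comma-separated device list)."""
    per_group = total_gpus // num_groups
    groups = []
    for g in range(num_groups):
        start = g * per_group
        # The last group absorbs any remainder when total_gpus is not
        # evenly divisible by num_groups.
        end = total_gpus if g == num_groups - 1 else start + per_group
        groups.append(",".join(str(i) for i in range(start, end)))
    return groups
```

Each benchmark process would then be launched with `CUDA_VISIBLE_DEVICES=<group>` in its environment, so the GPU-selection logic in the benchmark itself can stay trivial.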
soonum left a comment:
Divide and conquer ERC7984 transfer ⚔️!
@soonum reviewed 10 files and all commit messages, and made 15 comments.
Reviewable status: all files reviewed, 14 unresolved discussions (waiting on andrei-stoian-zama, IceTDrinker, soonum, and SouchonTheo).
Makefile line 1958 at r1 (raw file):
--features=integer,gpu,internal-keycache,pbs-stats -p tfhe-benchmark --profile release_lto_off --

.PHONY: bench_hlapi_erc7984_multi_group_gpu # Runs ERC7984 bench in two processes (half of gpus for each) and aggregates results
This recipe doesn't aggregate the results.
Suggestion:
.PHONY: bench_hlapi_erc7984_multi_group_gpu # Runs ERC7984 throughput benchmarks in two processes (half of gpus for each)

.github/workflows/benchmark_gpu.yml line 79 at r1 (raw file):
- multi_bit_documentation
- classical_documentation
+ multi_bit_documentation
This newline is not needed.
.github/workflows/benchmark_gpu_common.yml line 263 at r1 (raw file):
if: ${{ inputs.command == 'hlapi_erc7984_multi_group' }}
run: |
  pwd
I guess this is a debug command.
.github/workflows/benchmark_summary.yml line 117 at r1 (raw file):
SLAB_BASE_URL: ${{ secrets.SLAB_BASE_URL }}

run-benchmarks-gpu-erc7984-multi-group:
Which category does this benchmark fall in? A new line in the ERC7984 dashboard?
backends/tfhe-cuda-backend/CLAUDE.md line 1 at r1 (raw file):
# CLAUDE.md
I'd rather add this file in another PR.
This is not related to the benchmark changes.
ci/merge_multi_group_results.py line 2 at r1 (raw file):
#!/usr/bin/env python3
# TODO: ADD COMMENT
TODO to be done 😏 (same everywhere else)
ci/merge_multi_group_results.py line 8 at r1 (raw file):
# TODO: ADD COMMENT
def merge_multi_group_results(file1, file2, output_file):
Can't we make it generic for N files? This should be doable as long as we use *files and keep output_file as the last argument on the command line.
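The N-file generalization suggested above could look like the sketch below. The actual merge semantics of this script aren't shown in the diff, so the aggregation rule here (flat dicts of numeric values, summed per key) is an assumption for illustration only.

```python
import json
import sys


def merge_multi_group_results(*files, output_file):
    """Merge N per-process result files into one, summing numeric values
    that share a key. The summing rule is illustrative, not the PR's."""
    merged = {}
    for path in files:
        with open(path) as f:
            data = json.load(f)
        for key, value in data.items():
            merged[key] = merged.get(key, 0) + value
    with open(output_file, "w") as f:
        json.dump(merged, f)  # compact output; Slab may reject indent=2
    return merged


if __name__ == "__main__":
    # Usage: merge_multi_group_results.py <file1> ... <fileN> <output>
    merge_multi_group_results(*sys.argv[1:-1], output_file=sys.argv[-1])
```

Keeping `output_file` as the last positional argument on the command line, as suggested, lets the script accept any number of input files without extra flags.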
ci/merge_multi_group_results.py line 37 at r1 (raw file):
with open(output_file, "w") as f:
    json.dump(result, f, indent=2)
I think Slab doesn't support pretty-printed JSON. To be tested.
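For reference, a quick sketch of the pretty-printed vs. compact forms (the result keys here are made up):

```python
import json

result = {"tps": 1234, "gpus": 8}

# Multi-line, human-readable form that Slab may not accept.
pretty = json.dumps(result, indent=2)

# Single-line form with no whitespace, the safest default for machine consumers.
compact = json.dumps(result, separators=(",", ":"))
```

If Slab indeed rejects pretty-printed JSON, dropping `indent=2` (and optionally tightening `separators`) is the one-line fix.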
tfhe-benchmark/benches/high_level_api/erc7984.rs line 552 at r1 (raw file):
let params_name = params.name();
// 300 * num_gpus seems to be enough for maximum throughput on 8xH100 SXM5
Comment to update.
tfhe-benchmark/src/utilities.rs line 483 at r1 (raw file):
}

pub fn get_bench_instances() -> Option<usize> {
Maybe rename it to get_bench_gpu_instances() to avoid confusion.
tfhe-benchmark/src/utilities.rs line 491 at r1 (raw file):
}

/// TODO: ADD COMMENT
I'm adding a comment 😂.
More seriously, a comment is more than needed here, since this brings libc object handling into a native Rust file.
tfhe-benchmark/src/utilities.rs line 493 at r1 (raw file):
/// TODO: ADD COMMENT
#[cfg(target_os = "linux")]
pub fn bench_sync_barrier(num_instances: usize) {
Maybe we can have a cleaning function that is called before running the benchmark in the caller? That would remove the need to manually remove the files in the Makefile.
By using a struct that embeds all three sem_* handles, we could pass it to this cleaning function and simply do:
unsafe {
libc::sem_close(mutex);
libc::sem_close(arrive);
libc::sem_close(gate);
libc::sem_unlink(sem_mutex.as_ptr());
libc::sem_unlink(sem_arrive.as_ptr());
libc::sem_unlink(sem_gate.as_ptr());
}I'd prefer all filesystem ops to be handled at crate-level rather a mix of crate + Makfeile.
tfhe-benchmark/src/utilities.rs line 500 at r1 (raw file):
const MUTEX_NAME_PREFIX: &str = "tfhe_bench";
let sem_mutex = CString::new(format!("/{MUTEX_NAME_PREFIX}_mutex")).unwrap();
What does sem_ stand for here? semaphore_?
tfhe-benchmark/src/utilities.rs line 504 at r1 (raw file):
let sem_gate = CString::new(format!("/{MUTEX_NAME_PREFIX}_gate")).unwrap();
let now = SystemTime::now().duration_since(UNIX_EPOCH).unwrap();
Do we want to panic if this call fails, despite being very unlikely, since it would abort the whole benchmark?
Runs TPS benchmarks in parallel in two processes, starting at the same time with synchronization. Aggregates the bench results and sends the aggregated TPS count to Slab.