chore(gpu): add benchmarking for 8xsxm5 with two processes by andrei-stoian-zama · Pull Request #3465 · zama-ai/tfhe-rs

andrei-stoian-zama · 2026-04-13T09:01:04Z

Runs TPS benchmarks in parallel in two processes, starting at the same time with synchronization. Aggregates the bnech results and sends the aggregated TPS count to slab

This change is

guillermo-oyarzun · 2026-04-16T07:45:36Z

@@ -0,0 +1,67 @@
+# CLAUDE.md


This file shouldn't be included

guillermo-oyarzun · 2026-04-16T07:52:15Z

 {
    let mut rng = thread_rng();
-    let num_gpus = get_number_of_gpus() as u64;
+    let total_num_gpus = get_number_of_gpus() as u64;


I understand this is the logic that makes each process select only a sub group of gpus, but in practice all the gpus are visible for each process, i think it would be easier if when we launch the benchmark we filter the gpus with CUDA_VISIBLE_DEVICES, then all this extra logic is not necessary, and it simulates in a better way what would happen in reality

soonum

Divide and conquer ERC7984 transfer ⚔️ !

@soonum reviewed 10 files and all commit messages, and made 15 comments.
Reviewable status: all files reviewed, 14 unresolved discussions (waiting on andrei-stoian-zama, IceTDrinker, soonum, and SouchonTheo).

Makefile line 1958 at r1 (raw file):

	--features=integer,gpu,internal-keycache,pbs-stats -p tfhe-benchmark --profile release_lto_off --

.PHONY: bench_hlapi_erc7984_multi_group_gpu # Runs ERC7984 bench in two processes (half of gpus for each) and aggregates results

This recipe doesn't aggregate the results.

Suggestion:

.PHONY: bench_hlapi_erc7984_multi_group_gpu # Runs ERC7984 throughput benchmarks in two processes (half of gpus for each)

.github/workflows/benchmark_gpu.yml line 79 at r1 (raw file):

          - multi_bit_documentation
          - classical_documentation + multi_bit_documentation

This newline is not needed.

.github/workflows/benchmark_gpu_common.yml line 263 at r1 (raw file):

        if: ${{ inputs.command == 'hlapi_erc7984_multi_group' }}
        run: |
          pwd

I guess this is a debug command.

.github/workflows/benchmark_summary.yml line 117 at r1 (raw file):

      SLAB_BASE_URL: ${{ secrets.SLAB_BASE_URL }}

  run-benchmarks-gpu-erc7984-multi-group:

In which category this benchmark falls ? A new line in ERC7984 dashboard ?

backends/tfhe-cuda-backend/CLAUDE.md line 1 at r1 (raw file):

# CLAUDE.md

I'd rather add this file in another PR.
This is not related to the benchmark changes.

ci/merge_multi_group_results.py line 2 at r1 (raw file):

#!/usr/bin/env python3
# TODO: ADD COMMENT

TODO to be done 😏 (same everywhere else)

ci/merge_multi_group_results.py line 8 at r1 (raw file):

# TODO: ADD COMMENT
def merge_multi_group_results(file1, file2, output_file):

Can't we make it generic for N files ? This should be doable as long as we use *files and keep output_file as the last argument in the command-line.

ci/merge_multi_group_results.py line 37 at r1 (raw file):

    with open(output_file, "w") as f:
        json.dump(result, f, indent=2)

I think Slab doesn't support pretty-printed JSON. To be tested.

tfhe-benchmark/benches/high_level_api/erc7984.rs line 552 at r1 (raw file):

    let params_name = params.name();

    // 300 * num_gpus seems to be enough for maximum throughput on 8xH100 SXM5

Comment to update.

tfhe-benchmark/src/utilities.rs line 483 at r1 (raw file):

}

pub fn get_bench_instances() -> Option<usize> {

Maybe renaming it toget_bench_gpu_instances() to avoid confusion.

tfhe-benchmark/src/utilities.rs line 491 at r1 (raw file):

}

/// TODO: ADD COMMENT

I'm adding a comment 😂 .
More seriously, it's more than needed since it brings libc objects handling in native Rust file.

tfhe-benchmark/src/utilities.rs line 493 at r1 (raw file):

/// TODO: ADD COMMENT
#[cfg(target_os = "linux")]
pub fn bench_sync_barrier(num_instances: usize) {

Maybe we can have a cleaning function that's is called before running the benchmark in the caller? That would remove the need to manually remove the files in the Makefile.
By using a struct that embed all three sem_*, pass it to this cleaning function and simply does

    unsafe {
        libc::sem_close(mutex);
        libc::sem_close(arrive);
        libc::sem_close(gate);
        libc::sem_unlink(sem_mutex.as_ptr());
        libc::sem_unlink(sem_arrive.as_ptr());
        libc::sem_unlink(sem_gate.as_ptr());
    }

I'd prefer all filesystem ops to be handled at crate-level rather a mix of crate + Makfeile.

tfhe-benchmark/src/utilities.rs line 500 at r1 (raw file):

    const MUTEX_NAME_PREFIX: &str = "tfhe_bench";

    let sem_mutex = CString::new(format!("/{MUTEX_NAME_PREFIX}_mutex")).unwrap();

What does sem_ stands for here? semaphore_?

tfhe-benchmark/src/utilities.rs line 504 at r1 (raw file):

    let sem_gate = CString::new(format!("/{MUTEX_NAME_PREFIX}_gate")).unwrap();

    let now = SystemTime::now().duration_since(UNIX_EPOCH).unwrap();

Do we want to panic if this call fails, despite being very unlikely, since it would abort the whole benchmark?

Current issue with cloud-init handling at Scaleway forbids us to use custom image to spawn instance. Then we use a default image and install all the required dependencies. Note that Nividia drivers are already installed system-wide on this default image for this reason we don't need to specify environment variables related to Cuda.

github-actions · 2026-05-21T07:28:39Z

✅ Backward-compat snapshot: everything looks good! No backward-compatibility issues detected.

cla-bot Bot added the cla-signed label Apr 13, 2026

andrei-stoian-zama force-pushed the as/benchmark_throughput_gpu_grouping branch 2 times, most recently from 8f3377c to 2389601 Compare April 15, 2026 14:54

andrei-stoian-zama marked this pull request as ready for review April 16, 2026 07:05

andrei-stoian-zama requested review from a team, IceTDrinker, SouchonTheo and soonum as code owners April 16, 2026 07:05

guillermo-oyarzun reviewed Apr 16, 2026

View reviewed changes

soonum reviewed Apr 16, 2026

View reviewed changes

soonum and others added 3 commits May 20, 2026 22:07

fix(gpu): pcc nvidia-smi on aws cpu

28fc388

chore(gpu): add benchmarking for 8xsxm5 with two processes

1503608

andrei-stoian-zama force-pushed the as/benchmark_throughput_gpu_grouping branch from 054f90a to 1503608 Compare May 21, 2026 07:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

chore(gpu): add benchmarking for 8xsxm5 with two processes#3465

chore(gpu): add benchmarking for 8xsxm5 with two processes#3465
andrei-stoian-zama wants to merge 3 commits into
mainfrom
as/benchmark_throughput_gpu_grouping

andrei-stoian-zama commented Apr 13, 2026 •

edited by soonum

Loading

Uh oh!

guillermo-oyarzun Apr 16, 2026

Uh oh!

guillermo-oyarzun Apr 16, 2026

Uh oh!

soonum left a comment

Uh oh!

github-actions Bot commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

andrei-stoian-zama commented Apr 13, 2026 • edited by soonum Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

guillermo-oyarzun Apr 16, 2026

Choose a reason for hiding this comment

Uh oh!

guillermo-oyarzun Apr 16, 2026

Choose a reason for hiding this comment

Uh oh!

soonum left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

andrei-stoian-zama commented Apr 13, 2026 •

edited by soonum

Loading