
chore(gpu): add benchmarking for 8xsxm5 with two processes#3465

Open
andrei-stoian-zama wants to merge 3 commits into main from as/benchmark_throughput_gpu_grouping

Conversation

@andrei-stoian-zama
Contributor

@andrei-stoian-zama commented Apr 13, 2026

Runs TPS benchmarks in parallel in two processes, started at the same time via synchronization. Aggregates the bench results and sends the aggregated TPS count to Slab.



@cla-bot added the cla-signed label Apr 13, 2026
@andrei-stoian-zama force-pushed the as/benchmark_throughput_gpu_grouping branch from 4ce499b to dc01cab on April 13, 2026 at 11:55
@andrei-stoian-zama force-pushed the as/benchmark_throughput_gpu_grouping branch from 8f3377c to 2389601 on April 15, 2026 at 14:54
@andrei-stoian-zama marked this pull request as ready for review on April 16, 2026 at 07:05
Comment thread: backends/tfhe-cuda-backend/CLAUDE.md (Outdated)
@@ -0,0 +1,67 @@
# CLAUDE.md
Member

This file shouldn't be included

{
let mut rng = thread_rng();
let num_gpus = get_number_of_gpus() as u64;
let total_num_gpus = get_number_of_gpus() as u64;
Member

I understand this is the logic that makes each process select only a subgroup of GPUs, but in practice all the GPUs are visible to each process. I think it would be easier to filter the GPUs with CUDA_VISIBLE_DEVICES when we launch the benchmark; then all this extra logic becomes unnecessary, and it better simulates what would happen in reality.
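A minimal sketch of the CUDA_VISIBLE_DEVICES approach suggested here, assuming 8 GPUs split evenly between 2 processes; the `make bench_hlapi_erc7984_gpu` launch line is hypothetical and stands in for the real benchmark command:

```python
def gpu_groups(total_gpus: int, num_groups: int) -> list[str]:
    """Split GPU indices 0..total_gpus-1 into contiguous groups and
    render each group as a CUDA_VISIBLE_DEVICES value."""
    per_group = total_gpus // num_groups
    return [
        ",".join(str(g * per_group + i) for i in range(per_group))
        for g in range(num_groups)
    ]


if __name__ == "__main__":
    for group in gpu_groups(8, 2):
        # Hypothetical launch line: each process only ever sees its own GPUs,
        # so no in-process GPU-selection logic is needed.
        print(f"CUDA_VISIBLE_DEVICES={group} make bench_hlapi_erc7984_gpu &")
```

With this filtering in place, `get_number_of_gpus()` inside each process would already return only that process's share of the devices.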

Contributor

@soonum left a comment

Divide and conquer ERC7984 transfer ⚔️!

@soonum reviewed 10 files and all commit messages, and made 15 comments.
Reviewable status: all files reviewed, 14 unresolved discussions (waiting on andrei-stoian-zama, IceTDrinker, soonum, and SouchonTheo).


Makefile line 1958 at r1 (raw file):

	--features=integer,gpu,internal-keycache,pbs-stats -p tfhe-benchmark --profile release_lto_off --

.PHONY: bench_hlapi_erc7984_multi_group_gpu # Runs ERC7984 bench in two processes (half of gpus for each) and aggregates results

This recipe doesn't aggregate the results.

Suggestion:

.PHONY: bench_hlapi_erc7984_multi_group_gpu # Runs ERC7984 throughput benchmarks in two processes (half of gpus for each)

.github/workflows/benchmark_gpu.yml line 79 at r1 (raw file):

          - multi_bit_documentation
          - classical_documentation + multi_bit_documentation

This newline is not needed.


.github/workflows/benchmark_gpu_common.yml line 263 at r1 (raw file):

        if: ${{ inputs.command == 'hlapi_erc7984_multi_group' }}
        run: |
          pwd

I guess this is a debug command.


.github/workflows/benchmark_summary.yml line 117 at r1 (raw file):

      SLAB_BASE_URL: ${{ secrets.SLAB_BASE_URL }}

  run-benchmarks-gpu-erc7984-multi-group:

Which category does this benchmark fall into? A new line in the ERC7984 dashboard?


backends/tfhe-cuda-backend/CLAUDE.md line 1 at r1 (raw file):

# CLAUDE.md

I'd rather add this file in another PR.
This is not related to the benchmark changes.


ci/merge_multi_group_results.py line 2 at r1 (raw file):

#!/usr/bin/env python3
# TODO: ADD COMMENT

TODO to be done 😏 (same everywhere else)


ci/merge_multi_group_results.py line 8 at r1 (raw file):

# TODO: ADD COMMENT
def merge_multi_group_results(file1, file2, output_file):

Can't we make it generic for N files? This should be doable as long as we use *files and keep output_file as the last argument on the command line.


ci/merge_multi_group_results.py line 37 at r1 (raw file):

    with open(output_file, "w") as f:
        json.dump(result, f, indent=2)

I think Slab doesn't support pretty-printed JSON. To be tested.
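If Slab does turn out to reject pretty-printed JSON, the fix is just dropping `indent=2` (optionally with compact separators). A minimal comparison of the two serializations:

```python
import json

result = {"benchmark": "erc7984", "value": 1234}

# Current form: multi-line, human-readable.
pretty = json.dumps(result, indent=2)

# Compact form: single line, no whitespace after separators.
compact = json.dumps(result, separators=(",", ":"))

print(pretty)
print(compact)
```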


tfhe-benchmark/benches/high_level_api/erc7984.rs line 552 at r1 (raw file):

    let params_name = params.name();

    // 300 * num_gpus seems to be enough for maximum throughput on 8xH100 SXM5

Comment to update.


tfhe-benchmark/src/utilities.rs line 483 at r1 (raw file):

}

pub fn get_bench_instances() -> Option<usize> {

Maybe rename it to get_bench_gpu_instances() to avoid confusion.


tfhe-benchmark/src/utilities.rs line 491 at r1 (raw file):

}

/// TODO: ADD COMMENT

I'm adding a comment 😂.
More seriously, a comment is more than needed here, since this brings libc object handling into a native Rust file.


tfhe-benchmark/src/utilities.rs line 493 at r1 (raw file):

/// TODO: ADD COMMENT
#[cfg(target_os = "linux")]
pub fn bench_sync_barrier(num_instances: usize) {

Maybe we can have a cleanup function that the caller invokes before running the benchmark? That would remove the need to manually delete the files in the Makefile.
Use a struct that embeds all three sem_* handles, pass it to this cleanup function, and simply do

    unsafe {
        libc::sem_close(mutex);
        libc::sem_close(arrive);
        libc::sem_close(gate);
        libc::sem_unlink(sem_mutex.as_ptr());
        libc::sem_unlink(sem_arrive.as_ptr());
        libc::sem_unlink(sem_gate.as_ptr());
    }

I'd prefer all filesystem ops to be handled at crate level rather than a mix of crate + Makefile.


tfhe-benchmark/src/utilities.rs line 500 at r1 (raw file):

    const MUTEX_NAME_PREFIX: &str = "tfhe_bench";

    let sem_mutex = CString::new(format!("/{MUTEX_NAME_PREFIX}_mutex")).unwrap();

What does sem_ stand for here? semaphore_?


tfhe-benchmark/src/utilities.rs line 504 at r1 (raw file):

    let sem_gate = CString::new(format!("/{MUTEX_NAME_PREFIX}_gate")).unwrap();

    let now = SystemTime::now().duration_since(UNIX_EPOCH).unwrap();

Do we want to panic if this call fails, despite being very unlikely, since it would abort the whole benchmark?
