
Conversation

@Purfview (Contributor) commented Dec 30, 2025

Optional batched VAD feature.

Prompted by #1388
[The user wanted a GPU option because it's ~5x faster in a slow-CPU/fast-GPU environment; this PR should give a similar speed increase on CPU.]

Enabled when the VAD attribute vad_batch_size > 1.
The optimal vad_batch_size value is probably somewhere between 1.5x and 2x the number of CPU threads.
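A minimal usage sketch (assuming the new attribute is accepted through faster-whisper's vad_parameters like the existing VadOptions fields; exact names may differ in the final merge):

from faster_whisper import WhisperModel

model = WhisperModel("small", device="cpu", cpu_threads=8)

# vad_batch_size is the attribute added by this PR; set it to ~1.5x-2x the
# logical CPU count, per the note above (assumption: it is passed through
# vad_parameters like the other VadOptions fields).
segments, info = model.transcribe(
    "audio.wav",
    vad_filter=True,
    vad_parameters={"vad_batch_size": 16},
)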

RAM usage increase [checked on 2h audio]:

Not batched RAM peak delta:   621.367 MB
Batch 8     RAM peak delta:  1770.285 MB
Batch 16    RAM peak delta:  2029.824 MB

Speed tests:
Up to 4.3 times faster with vad_batch_size=16 on a CPU with 8 logical processors.
Up to 8.9 times faster with vad_batch_size=24 on a CPU with 16 logical processors.

Probs difference [inspected 2h audio with vad_batch_size=8]:
VAD timestamps are a bit different: not worse, not better.
5% of the 1233 total timestamps differed; 92% of those diffs were insignificant.

Total probs               : 224965
Total diff probs          : 144714
Above > 0.01 diffs        : 4304
Above > 0.1 diffs         : 785
Mean abs of '> 0.01 diffs': 0.06389281
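Counts like the above can be reproduced along these lines (an illustrative numpy sketch, assuming probs and probs_batched are aligned per-window probability arrays from the sequential and batched runs; not the exact script used):

import numpy as np

def diff_stats(probs: np.ndarray, probs_batched: np.ndarray) -> None:
    # Absolute per-window probability difference between the two runs.
    diffs = np.abs(probs - probs_batched)
    print("Total probs               :", diffs.size)
    print("Total diff probs          :", np.count_nonzero(diffs))
    print("Above > 0.01 diffs        :", np.count_nonzero(diffs > 0.01))
    print("Above > 0.1 diffs         :", np.count_nonzero(diffs > 0.1))
    print("Mean abs of '> 0.01 diffs':", diffs[diffs > 0.01].mean())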
Just some stats for nerds
  
0s overlap
Above > 0.01 diffs      : 4600
Above > 0.1 diffs       : 845

1s overlap
Above > 0.01 diffs      : 4558
Above > 0.1 diffs       : 799

3s overlap
Above > 0.01 diffs      : 4304
Above > 0.1 diffs       : 785

30s overlap
Above > 0.01 diffs      : 3353
Above > 0.1 diffs       : 499

60s overlap
Above > 0.01 diffs      : 2535
Above > 0.1 diffs       : 387

120s overlap
Above > 0.01 diffs      : 1320
Above > 0.1 diffs       : 29

180s overlap
Above > 0.01 diffs      : 481
Above > 0.1 diffs       : 128

240s overlap [diffs stop having an effect on timestamps]
Above > 0.01 diffs      : 88
Above > 0.1 diffs       : 0

300s overlap
Above > 0.01 diffs      : 0
Above > 0.1 diffs       : 0
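For context on the overlap numbers above: they presumably come from re-running the batched VAD with chunks that share progressively longer leading context, so the recurrent state warms up before the region whose probs are kept. A minimal sketch of that kind of overlapped chunking (illustrative only, not the PR's actual code; chunk_s and overlap_s are hypothetical parameters):

import numpy as np

SAMPLE_RATE = 16000

def overlapped_chunks(audio: np.ndarray, chunk_s: float, overlap_s: float):
    # Each chunk repeats the last `overlap_s` seconds of the previous one;
    # the VAD would discard probs for the overlapped warm-up region.
    chunk = int(chunk_s * SAMPLE_RATE)
    step = chunk - int(overlap_s * SAMPLE_RATE)
    for start in range(0, len(audio), step):
        yield start, audio[start : start + chunk]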

Depends on #1406 [but its influence on the probs is not significant]

@sssshhhhhh

It's possible to get a similar speedup, but with mathematical equivalence, if you thread-pool the encoder and decode the outputs as they finish. Also, the model still wastes FLOPs on the STFT, which is a free 5%.

@Purfview (Contributor, Author) commented Jan 7, 2026

It's possible to get a similar speedup but with mathematical equivalence if you thread pool the encoder and decode as they finish.

I don't know how to do that because the model is stateful. Can you share / make a PR with such a function producing mathematical equivalence?

@sssshhhhhh

from concurrent.futures import ThreadPoolExecutor

import numpy as np


class SileroVADModelFast:
    def __init__(self, encoder_path, decoder_path):
        try:
            import onnxruntime
        except ImportError as e:
            raise RuntimeError(
                "Applying the VAD filter requires the onnxruntime package"
            ) from e

        opts = onnxruntime.SessionOptions()
        opts.inter_op_num_threads = 1
        opts.intra_op_num_threads = 1
        opts.enable_cpu_mem_arena = False
        opts.log_severity_level = 4

        self.encoder_session = onnxruntime.InferenceSession(
            encoder_path,
            providers=["CPUExecutionProvider"],
            sess_options=opts,
        )
        self.decoder_session = onnxruntime.InferenceSession(
            decoder_path,
            providers=["CPUExecutionProvider"],
            sess_options=opts,
        )

    def __call__(
        self,
        audio: np.ndarray,
        threads: int = 1,
        window_size_samples: int = 512,
        context_size_samples: int = 64,
    ) -> np.ndarray:
        assert (
            audio.ndim == 2
        ), "Input should be a 2D array with size (batch_size, num_samples)"

        batch_size, num_samples = audio.shape
        # Left-pad with the context size, right-pad to a whole number of windows.
        rhs_padding = -num_samples % window_size_samples
        audio = np.pad(audio, ((0, 0), (context_size_samples, rhs_padding)))
        num_samples = audio.shape[1]

        # Samples per encoder call; assumes batch_size <= encoder_batch_size.
        encoder_batch_size = 256
        batch_samples = encoder_batch_size // batch_size * window_size_samples
        input_size = window_size_samples + context_size_samples
        # Recurrent (LSTM) state of the stateful decoder.
        h = np.zeros((1, batch_size, 128), dtype=np.float32)
        c = np.zeros((1, batch_size, 128), dtype=np.float32)

        def encode(i):
            # View the chunk as overlapping windows of `input_size` samples,
            # stepping by `window_size_samples`, without copying.
            batch = audio[:, i : i + batch_samples + context_size_samples]
            shape = (batch_size, batch.shape[1] // window_size_samples, input_size)
            strides = (
                batch.strides[0],
                batch.strides[1] * window_size_samples,
                batch.strides[1],
            )
            batch = np.lib.stride_tricks.as_strided(batch, shape, strides)
            return self.encoder_session.run(None, {"input": batch})[0]

        outputs = []
        with ThreadPoolExecutor(threads) as executor:
            # Stateless encoder chunks run in parallel; the stateful decoder
            # consumes their outputs sequentially, in submission order.
            futures = [
                executor.submit(encode, i)
                for i in range(0, num_samples - context_size_samples, batch_samples)
            ]
            for future in futures:
                batch = future.result()
                output, h, c = self.decoder_session.run(
                    None, {"input": batch, "h": h, "c": c}
                )
                outputs.append(output)

        return np.concatenate(outputs, axis=0).T


# audio: float32 waveform at 16 kHz; audio[None] adds the batch dimension.
model = SileroVADModelFast("enc.onnx", "dec.onnx")
model(audio[None], 8)
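Side note on why this stays mathematically equivalent: the encoder is stateless, so its chunks can run on the pool in parallel, while the decoder's LSTM state (h, c) is threaded through the chunks strictly in submission order; only the scheduling changes, not the arithmetic.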

6.6hrs of audio (columns: threads, batched VAD from this PR, the threadpool func above; times apparently in ms):
2 15523.9138        12746.2859
3 10647.9426        8864.6105
4 8231.9762         6742.6077
5 6804.4266         5604.9741
6 5940.452          4912.6163
7 5332.413          4346.9303
8 4776.8938         4432.3617
9 4498.9325         4454.5463
10 4311.0307        4468.7211
11 4176.293         4430.4164
12 4139.2589        plateaus
13 3995.533
14 3730.9739
15 3854.1674
16 3723.9984

The needed ONNX files are attached. Faster when <10 threads but doesn't scale. Probs were identical with main, which took like 30s.

onnx.zip

@Purfview (Contributor, Author) commented Jan 8, 2026

@sssshhhhhh Thanks for the func.
It's a bit disappointing that it doesn't scale beyond 8 threads. Anyway, maybe it would be better, @MahmoudAshraf97?

As I understand, you benchmarked on a CPU with 8 logical processors, right?

@sssshhhhhh

16; it's limited by decoder speed. Can always chunk like you do if more speed is needed. But at ~1 s/hr already, I don't know if it'll make a big impact on latency.

@Purfview (Contributor, Author) commented Jan 8, 2026

16

Could you benchmark vad_batch_size=32?

@sssshhhhhh commented Jan 8, 2026

vad_batch_size  time (ms)
14 3663.7819
15 3638.2823
16 3623.5396
17 3731.0817
18 3511.4404
19 3430.9416
20 3478.4598
21 3361.2536
22 3437.7053
23 3477.2363
24 3460.3391
25 3558.7283
26 3768.5504
27 3724.8539
28 3726.5815
29 3786.6403
30 3845.2653
31 3864.2095
32 3812.9137

Pretty sure it's bandwidth-bound; power draw isn't that high for an all-core load.

@Purfview (Contributor, Author) commented Jan 9, 2026

Tested your func: on my CPU with 16 threads it's not as much slower as in your benchmark, just 5% slower.
