Skip to content

Conversation

mlodyjesienin
Copy link
Contributor

Description

This PR introduces voice activity detection (vad) feature into the react native executorch library.

This PR is not ready for merge yet,
however it is ready for the review of C++ code.

Things that are missing:

  • Documentation
  • exported model on official swm huggingface
  • benchmarks
  • maybe example usage in one app? idk

Introduces a breaking change?

  • Yes
  • No

Type of change

  • Bug fix (change which fixes an issue)
  • New feature (change which adds functionality)
  • Documentation update (improves or adds clarity to existing documentation)
  • Other (chores, tests, code style improvements etc.)

Tested on

  • iOS
  • Android

Testing instructions

Screenshots

Related issues

Closes #547

Checklist

  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have updated the documentation accordingly
  • My changes generate no new warnings

Additional notes

Comment on lines 8 to 18
inline constexpr uint32_t nextPowerOfTwo(uint32_t n) noexcept {
if (n <= 1)
return 1;
n--;
n |= n >> 1;
n |= n >> 2;
n |= n >> 4;
n |= n >> 8;
n |= n >> 16;
return n + 1;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is an util

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the end I got rid of this function; (thanks to the @msluszniak comment about std::bit_ceil ), but to clarify, I agree that it should be inside Utils.h, and that was my initial approach, but due to circular dependencies I could not make it work. Obviously, one could solve this by introducing new file etc, but actually moving it to the Constants.h due to its simplicity seemed like the best solution to me. Writing this comment to ask, are there any better solutions I am not aware of?

Comment on lines 9 to 17
const std::array<float, kWindowSize> generateHammingWindow() noexcept {
constexpr size_t size = static_cast<size_t>(constants::kWindowSize);
std::array<float, size> window;
for (size_t i = 0; i < size; ++i) {
window[i] =
0.54f -
0.46f * std::cos((2.0f * std::numbers::pi_v<float> * i) / (size - 1));
}
return window;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can check if this will work with the hannWindow defined in dsp.cpp. If so, then you don't need to define it. Otherwise, move this function to dsp.cpp

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have checked it, it works fine. I deleted this function entirely, and I use hannWindow instead.
The differences are very subtle (e.g. on the 60s audio clip with 16 speech segments, 14 of them are exactly the same, and two of them differ by circa 100ms, making it negligible.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

utils.cpp often ends up as a bit of a bag for random functions. Might be worth thinking about whether some of these belong in more focused files.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My approach is simple: In the model (VoiceActivityDetection.h/.cpp in this case ) files, I want only three functions, that is: preprocess, generate, postprocess. Basically everything else goes to Utils.cpp.
Often utils files are not very long so I think this works ok. Introducing some other files might be
less helpful, and more confusing but it is only my opinion. Also sometimes it is just hard to find the "common denominator" of more than one function.


std::vector<std::array<float, kPaddedWindowSize>>
VoiceActivityDetection::preprocess(std::span<float> waveform) const {
auto kHammingWindowArray = utils::generateHammingWindow();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since it's not actually depending on the size of the input, nor the input - calling this function on each preprocess call is redundant.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I mentioned in comment above, in the end I deleted this function, but for clarification I will say this:
Yes, this is redundant, but there is no easy way to bypass it. The problem is, std::cos is constexpr since C++23 and we use C++20. At first I tried to simply write it as constexpr and calculate it in compile time, but as it turns out it is impossible. Obviously, what could be done is one of these:

  1. we could write cosine function independently, and we could make it constexpr
  2. hardcode the output in some different way.

But I reckon it is just unnecessary complications (in the end it is 400 element array so it should be negligible)

Since I have decided to use the hannWindow instead, it does not matter (but obviously, the hannWindow has the same disadvantage of being calculated with each model run, and what is even worse - it is std::vector and not std::array.... )

@mlodyjesienin
Copy link
Contributor Author

I think it is ready to merge, but there is one thing that might be done before (or possibly after) merge:
I measured inference time for benchmarks for android only on OnePlus because Samsung Galaxy S24 was unavailable.
Since it is my last day in work, I won't be able to do it, so someone else should complete the benchmarking.

Obviously, review needs to be done, but I think we are in the clear.

Copy link
Member

@msluszniak msluszniak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

C++ code looks good but I haven't tested this feature

@msluszniak msluszniak added the feature PRs that implement a new feature label Oct 6, 2025
@msluszniak msluszniak changed the title feat:voice activity detection model Voice activity detection model Oct 6, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature PRs that implement a new feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement voice activity detection
3 participants