3 changes: 3 additions & 0 deletions .cspell-wordlist.txt
@@ -80,3 +80,6 @@ setpriority
errno
ifdef
elif
FSMN
fsmn
subarray
191 changes: 191 additions & 0 deletions docs/docs/02-hooks/01-natural-language-processing/useVAD.md
@@ -0,0 +1,191 @@
---
title: useVAD
---

Voice Activity Detection (VAD) is the task of analyzing an audio signal to identify time segments containing human speech, separating them from non-speech sections like silence and background noise.

:::caution
It is recommended to use models provided by us, which are available at our [Hugging Face repository](https://huggingface.co/software-mansion/react-native-executorch-fsmn-vad). You can also use [constants](https://github.com/software-mansion/react-native-executorch/blob/main/packages/react-native-executorch/src/constants/modelUrls.ts) shipped with our library.
:::

## Reference

You can obtain the waveform from audio in whatever way suits you best; in the snippet below, we use the `react-native-audio-api` library to process an `.mp3` file.

```typescript
import { useVAD, FSMN_VAD } from 'react-native-executorch';
import { AudioContext } from 'react-native-audio-api';
import * as FileSystem from 'expo-file-system';

const model = useVAD({
  model: FSMN_VAD,
});

const { uri } = await FileSystem.downloadAsync(
  'https://some-audio-url.com/file.mp3',
  FileSystem.cacheDirectory + 'audio_file'
);

const audioContext = new AudioContext({ sampleRate: 16000 });
const decodedAudioData = await audioContext.decodeAudioDataSource(uri);
const audioBuffer = decodedAudioData.getChannelData(0);

try {
  const speechSegments = await model.forward(audioBuffer);
  console.log(speechSegments);
} catch (error) {
  console.error('Error while running the VAD model', error);
}
```

### Arguments

**`model`** - Object containing the model source.

- **`modelSource`** - A string that specifies the location of the model binary.

**`preventLoad?`** - Boolean that prevents automatic model loading (and, on first use, downloading the model data) when the hook runs.

For more information on loading resources, take a look at the [loading models](../../01-fundamentals/02-loading-models.md) page.
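
For example, you can defer loading until the user actually needs the model. A minimal sketch, assuming loading begins once `preventLoad` flips back to `false`:

```typescript
import { useState } from 'react';
import { useVAD, FSMN_VAD } from 'react-native-executorch';

const [shouldLoad, setShouldLoad] = useState(false);

// Neither the download nor the load starts while `preventLoad` is true.
const model = useVAD({ model: FSMN_VAD, preventLoad: !shouldLoad });

// Later, e.g. in a button's onPress handler:
// setShouldLoad(true);
```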

### Returns

| Field | Type | Description |
| ------------------ | -------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------- |
| `forward` | `(waveform: Float32Array) => Promise<Segment[]>` | Executes the model's forward pass, where the input array should be a waveform sampled at 16 kHz. Returns a promise that resolves to an array of `Segment` objects. |
| `error` | <code>string &#124; null</code> | Contains the error message if the model failed to load. |
| `isGenerating` | `boolean` | Indicates whether the model is currently processing an inference. |
| `isReady` | `boolean` | Indicates whether the model has successfully loaded and is ready for inference. |
| `downloadProgress` | `number` | Represents the download progress as a value between 0 and 1. |

<details>
<summary>Type definitions</summary>

```typescript
interface Segment {
start: number;
end: number;
}
```

</details>
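
These fields can drive a simple loading UI. A minimal sketch (the component is illustrative):

```tsx
import React from 'react';
import { Text } from 'react-native';
import { useVAD, FSMN_VAD } from 'react-native-executorch';

function VADStatus() {
  const { error, isReady, isGenerating, downloadProgress } = useVAD({
    model: FSMN_VAD,
  });

  if (error) return <Text>Error: {error}</Text>;
  if (!isReady) {
    return <Text>Downloading: {Math.round(downloadProgress * 100)}%</Text>;
  }
  return <Text>{isGenerating ? 'Processing audio...' : 'Ready'}</Text>;
}
```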

## Running the model

Before running the model's `forward` method, make sure to extract the audio waveform you want to process. You'll need to handle this step yourself, ensuring the audio is sampled at 16 kHz. Once you have the waveform, pass it as an argument to the `forward` method. The method returns a promise that resolves to an array of detected speech segments.

:::info
Timestamps in the returned speech segments correspond to indices of the input array (waveform).
:::
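
Since the waveform is sampled at 16 kHz, you can convert a segment's sample indices to seconds by dividing by the sample rate. For example:

```typescript
const SAMPLE_RATE = 16000; // the model expects a 16 kHz waveform

const speechSegments = await model.forward(audioBuffer);
for (const { start, end } of speechSegments) {
  const startSec = (start / SAMPLE_RATE).toFixed(2);
  const endSec = (end / SAMPLE_RATE).toFixed(2);
  console.log(`Speech from ${startSec}s to ${endSec}s`);
}
```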

## Example

```tsx
import React from 'react';
import { Button, Text, SafeAreaView } from 'react-native';
import { useVAD, FSMN_VAD } from 'react-native-executorch';
import { AudioContext } from 'react-native-audio-api';
import * as FileSystem from 'expo-file-system';

export default function App() {
  const model = useVAD({
    model: FSMN_VAD,
  });

  const audioURL = 'https://some-audio-url.com/file.mp3';

  const handleAudio = async () => {
    if (!model.isReady) {
      console.error('VAD model is not loaded yet.');
      return;
    }

    console.log('Processing URL:', audioURL);

    try {
      const { uri } = await FileSystem.downloadAsync(
        audioURL,
        FileSystem.cacheDirectory + 'vad_example.tmp'
      );

      const audioContext = new AudioContext({ sampleRate: 16000 });
      const originalDecodedBuffer =
        await audioContext.decodeAudioDataSource(uri);
      const originalChannelData = originalDecodedBuffer.getChannelData(0);

      const segments = await model.forward(originalChannelData);
      if (segments.length === 0) {
        console.log('No speech segments were found.');
        return;
      }
      console.log(`Found ${segments.length} speech segments.`);

      // Total number of samples across all detected speech segments
      const totalLength = segments.reduce(
        (sum, seg) => sum + (seg.end - seg.start),
        0
      );
      const newAudioBuffer = audioContext.createBuffer(
        1, // Mono
        totalLength,
        originalDecodedBuffer.sampleRate
      );
      const newChannelData = newAudioBuffer.getChannelData(0);

      // Copy each speech segment into the new buffer, back to back
      let offset = 0;
      for (const segment of segments) {
        const slice = originalChannelData.subarray(segment.start, segment.end);
        newChannelData.set(slice, offset);
        offset += slice.length;
      }

      // Play the processed audio
      const source = audioContext.createBufferSource();
      source.buffer = newAudioBuffer;
      source.connect(audioContext.destination);
      source.start();
    } catch (error) {
      console.error('Error processing audio data:', error);
    }
  };

  return (
    <SafeAreaView>
      <Text>
        Press the button to process and play speech from a sample file.
      </Text>
      <Button onPress={handleAudio} title="Run VAD Example" />
    </SafeAreaView>
  );
}
```

## Supported models

- [fsmn-vad](https://huggingface.co/funasr/fsmn-vad)

## Benchmarks

### Model size

| Model | XNNPACK [MB] |
| -------- | :----------: |
| FSMN_VAD | 1.83 |

### Memory usage

| Model | Android (XNNPACK) [MB] | iOS (XNNPACK) [MB] |
| -------- | :--------------------: | :----------------: |
| FSMN_VAD | 97                     | 45.9               |

### Inference time

<!-- TODO: MEASURE INFERENCE TIME FOR SAMSUNG GALAXY S24 WHEN POSSIBLE -->

:::warning
Times presented in the tables are measured as consecutive runs of the model. Initial run times may be up to 2x longer due to model loading and initialization.
:::

Inference times were measured on a 60-second audio clip, which can be found [here](https://models.silero.ai/vad_models/en.wav).

| Model | iPhone 16 Pro (XNNPACK) [ms] | iPhone 14 Pro Max (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] |
| -------- | :--------------------------: | :------------------------------: | :------------------------: | :-----------------------: |
| FSMN_VAD | 151 | 171 | 180 | 109 |
@@ -0,0 +1,64 @@
---
title: VADModule
---

TypeScript API implementation of the [useVAD](../../02-hooks/01-natural-language-processing/useVAD.md) hook.

## Reference

```typescript
import { VADModule, FSMN_VAD } from 'react-native-executorch';

const model = new VADModule();
await model.load(FSMN_VAD, (progress) => {
  console.log(progress);
});

await model.forward(waveform);
```

### Methods

| Method | Type | Description |
| --------- | ------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `load` | `(model: { modelSource: ResourceSource }, onDownloadProgressCallback?: (progress: number) => void): Promise<void>` | Loads the model, where `modelSource` is a string that specifies the location of the model binary. To track the download progress, supply a callback function `onDownloadProgressCallback`. |
| `forward` | `(waveform: Float32Array): Promise<Segment[]>` | Executes the model's forward pass, where the input array should be a waveform sampled at 16 kHz. Returns a promise that resolves to an array of `Segment` objects. |
| `delete` | `(): void` | Releases the memory held by the module. Calling `forward` afterwards is invalid. |

<details>
<summary>Type definitions</summary>

```typescript
type ResourceSource = string | number | object;
```

```typescript
interface Segment {
start: number;
end: number;
}
```

</details>

## Loading the model

To load the model, create a new instance of the module and use the `load` method on it. It accepts an object:

**`model`** - Object containing the model source.

- **`modelSource`** - A string that specifies the location of the model binary.

**`onDownloadProgressCallback`** - (Optional) Function called on download progress.

This method returns a promise that resolves once the model is loaded, or rejects with an error if loading fails.

For more information on loading resources, take a look at the [loading models](../../01-fundamentals/02-loading-models.md) page.

## Running the model

To run the model, use the `forward` method on the module object. Before calling it, make sure to extract the audio waveform you want to process; you'll need to handle this step yourself, ensuring the audio is sampled at 16 kHz. Once you have the waveform, pass it as an argument to the `forward` method. The method returns a promise that resolves to an array of detected speech segments.
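
A minimal sketch, assuming `waveform` is already a `Float32Array` sampled at 16 kHz (how you obtain it is up to you):

```typescript
try {
  const segments = await model.forward(waveform);
  // Segment timestamps are sample indices into the 16 kHz waveform.
  console.log(`Detected ${segments.length} speech segments`);
} catch (error) {
  console.error('Error while running the VAD model', error);
}
```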

## Managing memory

The module is a regular JavaScript object, so its lifespan is managed by the garbage collector. In most cases this is enough, and you should not need to free the module's memory yourself. However, you may sometimes want to release the memory occupied by the module before the garbage collector steps in. In that case, call `delete()` on a module object you no longer use. Note that you cannot use `forward` after `delete` unless you load the module again.
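
For example:

```typescript
const model = new VADModule();
await model.load(FSMN_VAD);

const segments = await model.forward(waveform);
console.log(segments);

// Free the native memory once the module is no longer needed.
model.delete();
// Calling model.forward(...) from here on is invalid until load() is called again.
```
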
12 changes: 11 additions & 1 deletion docs/docs/04-benchmarks/inference-time.md
@@ -62,7 +62,7 @@ Times presented in the tables are measured as consecutive runs of the model. Ini

❌ - Insufficient RAM.

### Streaming mode
## Streaming mode

Notice that for the `Whisper` model, which takes 30-second audio chunks as input (shorter audio is automatically padded with silence to 30 seconds), the `fast` mode has the lowest latency (the time from starting transcription to the first returned token, a consequence of the streaming algorithm) but the slowest speed. If you believe this might be a problem for you, prefer the `balanced` mode instead.

@@ -119,3 +119,13 @@ Average time for generating one image of size 256×256 in 10 inference steps.
| Model | iPhone 16 Pro (XNNPACK) [ms] | iPhone 14 Pro Max (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] |
| --------------------- | :--------------------------: | :------------------------------: | :-------------------: | :-------------------------------: | :-----------------------: |
| BK_SDM_TINY_VPRED_256 | 19100 | 25000 | ❌ | ❌ | 23100 |

## Voice Activity Detection (VAD)

Average time for processing a 60-second audio clip.

<!-- TODO: MEASURE INFERENCE TIME FOR SAMSUNG GALAXY S24 WHEN POSSIBLE -->

| Model | iPhone 16 Pro (XNNPACK) [ms] | iPhone 14 Pro Max (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] |
| -------- | :--------------------------: | :------------------------------: | :------------------------: | :-----------------------: |
| FSMN_VAD | 151 | 171 | 180 | 109 |
6 changes: 6 additions & 0 deletions docs/docs/04-benchmarks/memory-usage.md
@@ -75,3 +75,9 @@ title: Memory Usage
| --------------------- | ---------------------- | ------------------ |
| BK_SDM_TINY_VPRED_256 | 2900 | 2800 |
| BK_SDM_TINY_VPRED | 6700 | 6560 |

## Voice Activity Detection (VAD)

| Model | Android (XNNPACK) [MB] | iOS (XNNPACK) [MB] |
| -------- | :--------------------: | :----------------: |
| FSMN_VAD | 97                     | 45.9               |
6 changes: 6 additions & 0 deletions docs/docs/04-benchmarks/model-size.md
@@ -88,3 +88,9 @@ title: Model Size
| Model | Text encoder (XNNPACK) [MB] | UNet (XNNPACK) [MB] | VAE decoder (XNNPACK) [MB] |
| ----------------- | --------------------------- | ------------------- | -------------------------- |
| BK_SDM_TINY_VPRED | 492 | 1290 | 198 |

## Voice Activity Detection (VAD)

| Model | XNNPACK [MB] |
| -------- | :----------: |
| FSMN_VAD | 1.83 |
@@ -6,13 +6,14 @@
#include <rnexecutorch/models/embeddings/image/ImageEmbeddings.h>
#include <rnexecutorch/models/embeddings/text/TextEmbeddings.h>
#include <rnexecutorch/models/image_segmentation/ImageSegmentation.h>
#include <rnexecutorch/models/llm/LLM.h>
#include <rnexecutorch/models/object_detection/ObjectDetection.h>
#include <rnexecutorch/models/ocr/OCR.h>
#include <rnexecutorch/models/speech_to_text/SpeechToText.h>
#include <rnexecutorch/models/style_transfer/StyleTransfer.h>
#include <rnexecutorch/models/text_to_image/TextToImage.h>
#include <rnexecutorch/models/vertical_ocr/VerticalOCR.h>
#include <rnexecutorch/models/voice_activity_detection/VoiceActivityDetection.h>
#include <rnexecutorch/threads/GlobalThreadPool.h>
#include <rnexecutorch/threads/utils/ThreadUtils.h>

@@ -51,8 +52,9 @@ void RnExecutorchInstaller::injectJSIBindings(

  jsiRuntime->global().setProperty(
      *jsiRuntime, "loadObjectDetection",
      RnExecutorchInstaller::loadModel<
          models::object_detection::ObjectDetection>(jsiRuntime, jsCallInvoker,
                                                     "loadObjectDetection"));

  jsiRuntime->global().setProperty(
      *jsiRuntime, "loadExecutorchModule",
@@ -93,6 +95,12 @@ void RnExecutorchInstaller::injectJSIBindings(
      RnExecutorchInstaller::loadModel<models::speech_to_text::SpeechToText>(
          jsiRuntime, jsCallInvoker, "loadSpeechToText"));

  jsiRuntime->global().setProperty(
      *jsiRuntime, "loadVAD",
      RnExecutorchInstaller::loadModel<
          models::voice_activity_detection::VoiceActivityDetection>(
          jsiRuntime, jsCallInvoker, "loadVAD"));

  threads::utils::unsafeSetupThreadPool();
  threads::GlobalThreadPool::initialize();
}