Have I written custom code (as opposed to using a stock example script provided in MediaPipe)
No — using stock LlmInference.createFromOptions() with the pre-built model from litert-community/gemma-4-E2B-it-litert-lm.
OS Platform and Distribution
macOS 15 (Sequoia), Darwin 25.3.0, Apple M4 (10 cores), 16 GB unified memory
Mobile device if the issue happens on mobile device
No response
Browser and version if the issue happens on browser
Google Chrome 146.0.0.0
Programming Language and version
JavaScript (ES Modules, browser)
MediaPipe version
@mediapipe/tasks-genai@0.10.26 (also reproduced on 0.10.27)
Bazel version
No response
Solution
LLM Inference (GenAI)
Android Studio, NDK, SDK versions (if issue is related to building in Android environment)
No response
Xcode & Tulsi version (if issue is related to building for iOS)
No response
Describe the actual behavior
LlmInference.createFromOptions() crashes with RuntimeError: memory access out of bounds in the WASM module approximately 43-63 seconds after the model graph starts. The model file (gemma-4-E2B-it-web.task, 1.93 GB, from litert-community/gemma-4-E2B-it-litert-lm) downloads successfully, the MediaPipe graph starts, LlmConfig is parsed correctly (model_type: 52, stack_size: 35, vocabulary_size: 262144), and Subgroups Enabled! is logged. The crash then occurs deterministically at wasm-function[489]:0x43b55 during what appears to be GPU program build.
The WASM crash is an uncaught exception — the createFromOptions promise never resolves or rejects. A try/catch around the call does not catch it.
This is not a JS-heap out-of-memory condition: at crash time, performance.memory.usedJSHeapSize is ~27 MB (of a 4 GB limit), and the device has 16 GB of unified memory with a valid WebGPU adapter available. (WASM linear memory is a separate question; see the diagnostics note below.)
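Because the returned promise never settles, the failure is only visible in DevTools. One way to surface it on the page, sketched below under the assumption that the internal WASM rejection reaches the global unhandledrejection handler (the DevTools label "Uncaught (in promise)" suggests it does); fileset and llmOptions stand in for the values used in the repro:

```js
// Hypothetical diagnostic (not part of the original repro): surface the
// internal rejection and bound the wait, since createFromOptions() itself
// never resolves or rejects.
window.addEventListener('unhandledrejection', (event) => {
  print('Unhandled rejection: ' + event.reason);
});

// Watchdog: turn the silent hang into a visible failure after 10 minutes.
const watchdog = new Promise((_, reject) =>
  setTimeout(() => reject(new Error('createFromOptions() timed out')), 10 * 60 * 1000)
);
// `fileset` and `llmOptions` are the same values used in index.html below.
const llm = await Promise.race([
  LlmInference.createFromOptions(fileset, llmOptions),
  watchdog,
]);
```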
Describe the expected behaviour
LlmInference.createFromOptions() should resolve with a usable LlmInference instance. The HF model card for litert-community/gemma-4-E2B-it-litert-lm benchmarks this model running on MacBook Pro M4 Max via Chrome WebGPU.
Standalone code/steps you may have used to try to get what you need
Minimal reproduction (2 files, verified to reproduce):
server.py — serves with COOP/COEP headers required for SharedArrayBuffer/WASM threading:
```python
#!/usr/bin/env python3
"""Serve with COOP/COEP headers for WebGPU + SharedArrayBuffer."""
import http.server

class COEPHandler(http.server.SimpleHTTPRequestHandler):
    def end_headers(self):
        self.send_header("Cross-Origin-Opener-Policy", "same-origin")
        self.send_header("Cross-Origin-Embedder-Policy", "credentialless")
        super().end_headers()

http.server.HTTPServer(("", 8080), COEPHandler).serve_forever()
```
index.html:
```html
<!DOCTYPE html>
<html>
<head><title>Gemma 4 E2B MediaPipe Test</title></head>
<body>
<h1>Gemma 4 E2B MediaPipe Test</h1>
<pre id="log">Loading MediaPipe...</pre>
<script type="module">
  const log = document.getElementById('log');
  function print(msg) { log.textContent += '\n' + msg; console.log(msg); }

  try {
    // CDN import works under COEP "credentialless" (verified)
    const { FilesetResolver, LlmInference } = await import(
      'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai@0.10.26/genai_bundle.mjs'
    );
    print('MediaPipe imported from CDN.');

    print('Resolving WASM fileset...');
    const fileset = await FilesetResolver.forGenAiTasks(
      'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai@0.10.26/wasm'
    );
    print('Fileset ready.');

    print('deviceMemory: ' + navigator.deviceMemory + ' GB');
    print('WebGPU adapter: ' + !!(await navigator.gpu?.requestAdapter()));

    print('Creating LlmInference (downloads ~1.93 GB, then builds GPU program)...');
    const llm = await LlmInference.createFromOptions(fileset, {
      baseOptions: {
        modelAssetPath: 'https://huggingface.co/litert-community/gemma-4-E2B-it-litert-lm/resolve/main/gemma-4-E2B-it-web.task'
      },
      maxTokens: 512,
      topK: 40,
      temperature: 0.1,
      randomSeed: 42,
    });
    print('SUCCESS: Model loaded!');

    const response = await llm.generateResponse('What is GDPR?');
    print('Response: ' + response);
  } catch (e) {
    // NOTE: The WASM crash does NOT reach this catch block.
    // The promise never resolves or rejects — the WASM exception is uncaught.
    print('FAILED: ' + e.message);
    print('Stack: ' + e.stack);
  }
</script>
</body>
</html>
```
Steps to reproduce:
```sh
# 1. Save both files to a directory
# 2. Start the server
python3 server.py
# 3. Open Chrome 146 on macOS with Apple M4
# 4. Open DevTools Console (Cmd+Option+I -> Console tab)
# 5. Navigate to http://localhost:8080
# 6. Wait ~3-5 min total (download + GPU program build)
```
Where to look for the result:
- On the page (<pre id="log">): shows SUCCESS: Model loaded! on success, or stays stuck at Creating LlmInference... on crash (the catch block never fires because the WASM exception is uncaught)
- In the DevTools Console: the crash appears as a red Uncaught (in promise) RuntimeError: memory access out of bounds from genai_wasm_internal.wasm
- Console timeline: Graph successfully started running. appears when the download completes and the GPU program build begins; the WASM exception follows ~43s later.
Tested mitigations (none resolved the crash):
- Chrome flag #enable-unsafe-webgpu: Enabled
- Chrome flag #enable-webgpu-developer-features: Enabled
- GPU hardware acceleration: Enabled in chrome://settings/system
- Reduced maxTokens to 64 (ignored; cache_size: 512 is embedded in the .task bundle)
- Upgraded from @mediapipe/tasks-genai@0.10.26 to @0.10.27
- Used LlmInference.createFromModelBuffer() with a ReadableStreamDefaultReader instead of modelAssetPath — same crash at the same address (see the sketch below)
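For reference, a minimal sketch of the streaming path we tested; MODEL_URL is a stand-in for the Hugging Face URL used in index.html, and the two-argument createFromModelBuffer(fileset, reader) form is the one from our test:

```js
// Stream the .task file from JS and hand the reader to MediaPipe,
// instead of letting the WASM side fetch modelAssetPath itself.
// MODEL_URL: stand-in for the gemma-4-E2B-it-web.task URL from index.html.
const response = await fetch(MODEL_URL);
const reader = response.body.getReader(); // ReadableStreamDefaultReader
const llm = await LlmInference.createFromModelBuffer(fileset, reader);
```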
Other info / Complete Logs
Console output from standalone repro (verified to reproduce):
```
Graph successfully started running.
I0409 13:48:37.737999 2106368 llm_gpu_calculator.cc:389] LlmConfig: model_type: 52 text decoder: (batch_size: 1 sequence_size: 0 model_dimension: 1536 hidden_dimension: 6144 head_dimension: 256 number_of_heads: 8 number_of_kv_heads: 1 vocabulary_size: 262144 stack_size: 35 attention_mask_type: 1 lora_rank: max_top_k: 40 cache_size: 512 post_norm_ff: true post_norm_attention: true attention_scale_type: 5 skip_absolute_positional_embeddings: true use_optional_transformer_layers: false selectable_final_ln: std::nullopt num_local_layers_per_global: 4 sliding_window_size: 512 global_rope_wavelength: 1000000.000000 global_rope_scaling: std::nullopt qk_norm: true ngram_vocabulary_size: std::nullopt load_ngrammer_embedding_dynamically: false per_layer_embedding_dimension: 256 load_per_layer_embedding_dynamically: true kv_cache_shared_start_layer: 15 activation_sparsity_stddev_multiplier: std::nullopt residual_adapter_interleave: std::nullopt residual_adapter_bottleneck_dimension: std::nullopt vision_tokens_num: 281 max_num_images: 0 audio_vocab_offset: 518272 audio_vocab_extra_dim: 128 audio_input_embedding_dim: 1536 audio_dual_norm_soft_and_hard_tokens: true bfloat16 fix: false int8_kv_cache: true apply_srq: true layer_skip_start: 0 num_layers_skipped: 0 matform_factor: 0 enable_external_embeddings: false)
I0409 13:48:37.737999 2106368 environment.cc:330] Subgroups Enabled!
Uncaught (in promise) RuntimeError: memory access out of bounds
at wasm-function[489]:0x43b55
at wasm-function[7691]:0xa3d6d7
at wasm-function[1511]:0x13bac4
at wasm-function[7684]:0xa3755d
at wasm-function[3667]:0x43c9e0
at wasm-function[15976]:0x122c1e5
at wasm-function[16953]:0x13ab534
at wasm-function[13767]:0x1079803
at wasm-function[5041]:0x63a57c
at wasm-function[1422]:0x122233
at wasm-function[5168]:0x6a390c
at wasm-function[10382]:0xeb5407
at wrapper (genai_wasm_internal.js:1:78622)
at Object.doRewind (genai_wasm_internal.js:1:80477)
at genai_wasm_internal.js:1:81035
```
Diagnostics at crash time:
- performance.memory.jsHeapSizeLimit: 4.00 GB
- performance.memory.totalJSHeapSize: 0.029 GB
- performance.memory.usedJSHeapSize: 0.027 GB
- navigator.deviceMemory: 16
- navigator.hardwareConcurrency: 10
- navigator.gpu.requestAdapter(): returns a valid adapter
- self.crossOriginIsolated: true
Note: performance.memory measures the JS heap only; WASM linear memory is separate and not covered by this API. The crash could be WASM linear-memory exhaustion, but the deterministic crash address ($func489:0x43b55) across all attempts suggests a code bug rather than a resource limit.
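To rule WASM linear-memory exhaustion in or out, one could sample total renderer memory; a sketch assuming Chrome's performance.measureUserAgentSpecificMemory(), which requires the crossOriginIsolated context this repro already has:

```js
// Unlike performance.memory, this Chrome API accounts for WASM linear
// memory as well as the JS heap. It samples asynchronously, so it brackets
// rather than pinpoints the crash-time state.
if ('measureUserAgentSpecificMemory' in performance) {
  const result = await performance.measureUserAgentSpecificMemory();
  print('Renderer memory: ' + (result.bytes / 1e9).toFixed(2) + ' GB');
}
```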
chrome://gpu summary (full output attached as about-gpu.txt):
- GPU: Apple M4
- Backend: Metal
- WebGPU: Enabled (after turning on GPU acceleration in chrome://settings/system)
- GL implementation: disabled (Metal only)
- WGSL subgroups: Enabled
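The adapter limits relevant to a 1.93 GB model can also be logged from the page; a sketch assuming adapter.info (GPUAdapterInfo) is exposed in this Chrome version:

```js
// Log the WebGPU limits most likely to constrain a ~2 GB model on Metal.
const adapter = await navigator.gpu.requestAdapter();
print('Adapter: ' + adapter.info.vendor + ' / ' + adapter.info.architecture);
print('maxBufferSize: ' + adapter.limits.maxBufferSize);
print('maxStorageBufferBindingSize: ' + adapter.limits.maxStorageBufferBindingSize);
```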
What we tested:
| Test | Method | Result |
| --- | --- | --- |
| modelAssetPath (URL) | createFromOptions({baseOptions: {modelAssetPath: url}}) | Crash at $func489:0x43b55 |
| modelAssetBuffer (stream) | createFromModelBuffer(fileset, reader) | Same crash at same address |
| Standalone HTML repro | server.py + index.html + CDN imports | Same crash (verified) |
| @mediapipe/tasks-genai@0.10.26 | npm version | Crash |
| @mediapipe/tasks-genai@0.10.27 | npm version | Same crash |
| maxTokens: 64 | Reduced from 512 | Ignored (cache_size baked into .task), same crash |
| #enable-unsafe-webgpu flag | Chrome flag | No effect |
| Gemma 3 270m control | gemma3-270m-it-q8-web.task (263 MB), same device, same code path | SUCCESS — loads, generates, closes cleanly |
The Gemma 3 270m control test confirms the MediaPipe WASM runtime, WebGPU adapter, and Metal shader compilation all work correctly on this M4 device. The crash is specific to the Gemma 4 E2B model.
Related issues searched (via gh search issues --repo google-ai-edge/mediapipe):
- RangeError: Array buffer allocation failed on 8 GB devices: different error type (ArrayBuffer vs WASM OOB), different model, different device RAM (8 GB vs 16 GB), but the same symptom of createFromOptions() failing for Gemma E2B-class models.
- Failed to build program executable - Out of host memory on Android Adreno GPU: different platform but a similar failure point (GPU program build).
- #6160: modelAssetPath method of loading in LlmInference.createFromOptions results in inefficient loading and crashes. Closed as stale without resolution. We tested modelAssetBuffer too; same crash, ruling out the loading method.
- LlmInference model not closing (resource leak): not directly related but the same component.
- A createFromOptions progress-reporting issue. Related: our progress wrapper (fetch intercept) doesn't fire because the model download happens inside WASM, not via JS fetch().

Searched terms: "memory access out of bounds", "LlmInference", "gemma-4 web", "gemma E2B web.task", "llm_gpu_calculator", "shader compilation", "Metal Apple", "web.task crash". No existing issue matches this combination.
Question:
Is gemma-4-E2B-it-web.task expected to work on Apple M4 (non-Max, 10 GPU cores) via Chrome WebGPU? The model card only benchmarks M4 Max (which has 40 GPU cores and more unified memory bandwidth). If M4 non-Max is not a supported target for this model size, could this be documented on the model card?