A Proof of Concept (PoC) demonstrating how to run Alibaba's Qwen 3.5 (0.8B) Multimodal Large Language Model natively inside the browser using WebGPU and Transformers.js.
This project explores client-side AI execution: token generation runs directly on the user's local GPU. It serves as an example of running modern Vision-Language Models (VLMs) without relying on backend APIs.
Demo video: qwebgpu-demo-720p.mp4
- 100% Local Inference: Data never leaves your machine.
- WebGPU Accelerated: Uses WebGPU to run token generation on the local GPU.
- Multimodal (Vision-Language): Support for both text-based chat and image understanding. Drag and drop an image to have the AI analyze it.
- Real-time Metrics: Displays Time to First Token (TTFT) and real-time generation speed (Tokens/Second).
- Vanilla JS Architecture: No frontend frameworks. Built with pure, optimized JavaScript and Vite for fast load times.
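The two metrics named in the list above can be derived from timestamps taken around a streaming generation loop. The sketch below is one possible way to compute them, not the project's actual code: `createMetricsTracker` and its method names are hypothetical, and it assumes the generation loop calls `onToken()` once per emitted token.

```javascript
// Hypothetical helper: tracks Time to First Token (TTFT) and decode speed
// for a single generation run. The clock is injectable so it can be tested.
function createMetricsTracker(now = () => performance.now()) {
  let startTime = null;
  let firstTokenTime = null;
  let lastTokenTime = null;
  let tokenCount = 0;

  return {
    // Call right before generation is started.
    start() {
      startTime = now();
      firstTokenTime = null;
      lastTokenTime = null;
      tokenCount = 0;
    },
    // Call once for every token the model emits.
    onToken() {
      const t = now();
      if (firstTokenTime === null) firstTokenTime = t;
      lastTokenTime = t;
      tokenCount += 1;
    },
    // Returns the current metrics without touching the clock.
    snapshot() {
      if (firstTokenTime === null) {
        return { ttftMs: null, tokenCount: 0, tokensPerSecond: 0 };
      }
      const ttftMs = firstTokenTime - startTime;
      // Speed is measured over the tokens that follow the first one,
      // so prompt processing time does not distort the decode rate.
      const decodeMs = lastTokenTime - firstTokenTime;
      const tokensPerSecond =
        tokenCount > 1 && decodeMs > 0 ? (tokenCount - 1) / (decodeMs / 1000) : 0;
      return { ttftMs, tokenCount, tokensPerSecond };
    },
  };
}
```

Excluding the first token from the rate keeps TTFT and tokens/second independent: a long image prompt raises TTFT without making the reported decode speed look slower.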
- Engine: Transformers.js by Hugging Face
- Runtime: ONNX Runtime Web (WebAssembly + WebGPU)
- Model: onnx-community/Qwen3.5-0.8B-ONNX (4-bit quantized)
- Build Tool: Vite
- A modern web browser with WebGPU support enabled.
- Node.js installed on your machine.
- A dedicated GPU for optimal token generation speeds.
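Whether a browser meets the WebGPU requirement above can be checked at runtime before any model download starts. The following is a minimal sketch using the standard `navigator.gpu.requestAdapter()` call. The function name `checkWebGPUSupport` and the shape of its return value are assumptions made for illustration.

```javascript
// Hypothetical helper: reports whether WebGPU is usable in this environment.
// `nav` defaults to the global navigator and can be replaced for testing.
async function checkWebGPUSupport(nav = globalThis.navigator) {
  // Browsers without WebGPU (or with it disabled) do not expose navigator.gpu.
  if (!nav || !nav.gpu) {
    return { supported: false, reason: 'navigator.gpu is not available' };
  }
  // requestAdapter() resolves to null when no suitable GPU adapter exists.
  const adapter = await nav.gpu.requestAdapter();
  if (!adapter) {
    return { supported: false, reason: 'no suitable GPU adapter was found' };
  }
  return { supported: true, reason: null };
}
```

A page could call this once on load and show the `reason` string instead of starting the download when `supported` is false.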
- Clone the repository:
    git clone https://github.com/jerrikchr/WGPU-Qwen3.50.8b.git
    cd WGPU-Qwen3.50.8b
- Install dependencies:
    npm install
- Start the development server:
    npm run dev
- Launch the Lab: Open your browser and navigate to http://localhost:5173.

Note: On your first visit, the browser will download the ~650MB model weights and compile the WebGPU shaders. Subsequent loads will be nearly instant as the weights are cached locally via the Cache API.
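The cache-first behaviour described in the note can be illustrated with the browser's Cache API. This is a sketch of the general pattern only. It does not show how Transformers.js implements its own caching, and `cachedFetch`, the cache name `model-cache-demo`, and the injectable parameters are hypothetical.

```javascript
// Hypothetical helper: returns a cached response when one exists,
// otherwise fetches from the network and stores a copy for next time.
async function cachedFetch(
  url,
  {
    cacheName = 'model-cache-demo', // assumed name, for illustration only
    cacheStorage = globalThis.caches,
    fetchFn = globalThis.fetch,
  } = {}
) {
  const cache = await cacheStorage.open(cacheName);

  // A hit means the file was downloaded on an earlier visit.
  const hit = await cache.match(url);
  if (hit) return { response: hit, fromCache: true };

  const response = await fetchFn(url);
  // Store a clone, because a response body can only be read once.
  if (response.ok) await cache.put(url, response.clone());
  return { response, fromCache: false };
}
```

Only successful responses are stored, so a failed download is retried on the next visit instead of being served from the cache.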
- Text Chat: Type your prompt into the terminal interface and hit Execute.
- Vision Analysis: Click the image icon (or drag and drop) to load an image into the context. Ask a question about the image to see Qwen 3.5's multimodal capabilities in action.
- Session Purge: Click the [ PURGE ] button in the top right to clear the context window and free up memory for a new task.
- Vanilla JS (Zero Telemetry): Chose vanilla JavaScript because it is small, fast, and contains no telemetry, ensuring the privacy of the local environment.
- VRAM OOM Prevention: Uploading large photos directly to the ONNX Runtime will crash the WebGPU sandbox (std::bad_alloc). Implemented a pre-processing step using the HTML5 Canvas API to downscale images before tensor conversion, preventing Out-Of-Memory errors on constrained hardware.
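The downscaling step described above can be sketched as follows. This is an illustration under stated assumptions rather than the project's actual code. The 1024-pixel limit, the names `fitWithin` and `downscaleToCanvas`, and the use of an `ImageBitmap` as input are all hypothetical, since the source does not state the real target size or function names.

```javascript
// Assumed limit for the longest image side; the real value is not documented.
const MAX_SIDE = 1024;

// Computes dimensions that fit within maxSide while keeping the aspect ratio.
// Images that are already small enough are returned unchanged.
function fitWithin(width, height, maxSide = MAX_SIDE) {
  const longest = Math.max(width, height);
  if (longest <= maxSide) return { width, height };
  const scale = maxSide / longest;
  return {
    width: Math.max(1, Math.round(width * scale)),
    height: Math.max(1, Math.round(height * scale)),
  };
}

// Browser-only: draws the image onto a canvas at the reduced size.
// `image` is assumed to be an ImageBitmap, e.g. from createImageBitmap(file).
function downscaleToCanvas(image, maxSide = MAX_SIDE) {
  const { width, height } = fitWithin(image.width, image.height, maxSide);
  const canvas = document.createElement('canvas');
  canvas.width = width;
  canvas.height = height;
  canvas.getContext('2d').drawImage(image, 0, 0, width, height);
  return canvas;
}
```

With these assumptions a 4000x3000 photo would be reduced to 1024x768 before tensor conversion, which shrinks the pixel buffer by roughly a factor of 15.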
This entire project was vibe-coded with Gemini CLI.
This application is client-side. The initial model weights are downloaded from the Hugging Face Hub, but all subsequent chat history, images, and user prompts remain strictly within the memory of your local browser.
MIT License. Feel free to fork, modify, and build your own local AI applications!