"Captions With Attitude" shows captions in your browser, generated from your webcam by a Vision Language Model (VLM), all from a Go program running entirely on your local machine using llama.cpp!
It uses yzma to perform local inference via llama.cpp and GoCV for the video processing, then runs a local web server so you can see the often comedic results.
You must install yzma and llama.cpp to run this program.
See https://github.com/hybridgroup/yzma/blob/main/INSTALL.md
You must also install OpenCV and GoCV, which, unlike yzma, require CGo.
See https://gocv.io/getting-started/
Although yzma itself does not use CGo, it can coexist in Go applications that do use CGo.
You will need a Vision Language Model (VLM). Download the model and projector files from Hugging Face in .gguf format, for example:

- Qwen3-VL-2B-Instruct model
  - Model: https://huggingface.co/ggml-org/Qwen3-VL-2B-Instruct-GGUF/blob/main/Qwen3-VL-2B-Instruct-Q8_0.gguf
- Qwen3-VL-8B-Abliterated-Caption-it-GGUF
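For example, the 2B model file linked above can be fetched from the command line (a setup sketch; the `~/models` directory is an assumption, and Hugging Face's `resolve/` path is used in place of the `blob/` browser URL for direct download):

```shell
# Assumed download location; adjust to taste.
mkdir -p ~/models

# Direct-download URL form of the model link above.
curl -L -o ~/models/Qwen3-VL-2B-Instruct-Q8_0.gguf \
  https://huggingface.co/ggml-org/Qwen3-VL-2B-Instruct-GGUF/resolve/main/Qwen3-VL-2B-Instruct-Q8_0.gguf
```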
```shell
go build .
```

```shell
$ ./captions-with-attitude
Usage:
captions-with-attitude
  -device string
        camera device ID (default "0")
  -host string
        web server host:port (default "localhost:8080")
  -model string
        model file to use
  -p string
        prompt (default "Give a very brief description of what is going on.")
  -projector string
        projector file to use
  -v    verbose logging
```

For example:

```shell
./captions-with-attitude -model ~/models/Qwen3-VL-8B-Abliterated-Caption-it.Q4_K_M.gguf -projector ~/models/Qwen3-VL-8B-Abliterated-Caption-it.mmproj-Q8_0.gguf -p "Describe the scene in the style of William Shakespeare."
```

Now open your web browser pointed to http://localhost:8080/
```mermaid
flowchart TD
    subgraph application
        subgraph vlm
            yzma<-->llama.cpp
        end
        subgraph video
            gocv<-->opencv
        end
        subgraph webserver
            index.html
            mjpeg
            caption
        end
        gocv-->mjpeg
        gocv-->yzma
        yzma-->caption
    end
    subgraph llama.cpp
        model
    end
    subgraph opencv
        webcam
    end
```
The "Captions With Attitude" application consists of three main parts:
- Video capture
- Vision Language Model inference using the webcam images and the text prompt
- Web server to serve the page, the streaming video from the webcam, and the latest caption produced by the VLM
