kinescript is a lightweight CLI tool that turns local screen interactions into a structured dataset for multimodal AI training. It records your on-screen actions (mouse and keyboard) and captures screenshots, then processes them with OCR and optional LLM steps to produce JSONL examples.
Creating high-quality multimodal datasets from real desktop workflows is often painful:
- Manual and repetitive: Taking screenshots and writing notes for each click/keypress is tedious and error-prone.
- Inconsistent formats: Logs, images, and annotations end up in different places and formats.
- Heavy alternatives: Full-fledged RPA/UI tools or custom capture apps are complex to set up when you just need reproducible traces.
kinescript provides a simpler path:
- Simple & focused: A small CLI with two commands: `record` and `process`.
- Event-driven capture: Listens for mouse/keyboard events and snapshots the screen on each event (sketched below).
- Structured outputs: Stores screenshots and an `actions.jsonl` log you can post-process deterministically.
- OCR + LLM-ready: Extracts text with Tesseract OCR and leaves a clean placeholder for LLM-based generation (OpenAI, Gemini, Ollama later).

Under the hood, the main components are:

- CLI (Typer): Clear UX for the `record` and `process` commands.
- Recording (mss, pynput): Captures screenshots and logs mouse/keyboard actions.
- Processing (Pillow, pytesseract, pandas): Runs OCR on images and joins it with the action logs.
- Dataset output: Produces a final `dataset.jsonl` suitable for multimodal training pipelines.
- Pluggable LLM step: Placeholder designed for provider-agnostic integration (OpenAI/Gemini/Ollama) later.
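To make the event-driven capture concrete, here is a minimal sketch of the idea using pynput and mss. It is illustrative only: the callback wiring, file naming, and JSON field names are assumptions, not kinescript's actual implementation.

```python
# Sketch only (not kinescript's code): take a full-screen screenshot on every
# mouse click and append one JSON line per event. Field names are illustrative.
import json
import time
from pathlib import Path

import mss
from pynput import mouse

session = Path("./sessions/session-001")
(session / "images").mkdir(parents=True, exist_ok=True)
log_path = session / "actions.jsonl"
counter = 0

def snapshot() -> str:
    """Save a full-screen screenshot and return its file name."""
    global counter
    counter += 1
    name = f"{counter:04d}.png"
    with mss.mss() as sct:
        sct.shot(output=str(session / "images" / name))
    return name

def on_click(x, y, button, pressed):
    if not pressed:
        return
    event = {
        "timestamp": time.time(),
        "type": "mouse_click",
        "x": x,
        "y": y,
        "button": str(button),
        "image": snapshot(),
    }
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")

# Blocks until the listener is stopped (e.g. with Ctrl+C).
with mouse.Listener(on_click=on_click) as listener:
    listener.join()
```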
kinescript runs locally as a Python process that (1) records events and screenshots, and (2) processes recorded sessions into structured data.
```
+------------------------------+               +-----------------------------+
|        Recording Step        |               |       Processing Step       |
|     (kinescript record)      |               |     (kinescript process)    |
+------------------------------+               +-----------------------------+
| - Mouse/keyboard listeners   |               | - Load actions.jsonl        |
| - Screenshot on each event   |   images →    | - OCR (pytesseract)         |
| - Write actions.jsonl        |   actions →   | - Join image+action+OCR     |
+--------------+---------------+               +---------------+-------------+
               |                                               |
               v                                               v
       session directory                                 dataset.jsonl
```
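The processing step can be pictured with the rough sketch below, under the assumption that each `actions.jsonl` record stores the screenshot file name in an `image` field; the real processor may structure this differently.

```python
# Sketch only: load the action log, OCR each referenced screenshot, and write
# a joined dataset.jsonl. Column names ("image", "ocr_text") are assumptions.
from pathlib import Path

import pandas as pd
import pytesseract
from PIL import Image

session = Path("./sessions/session-001")
actions = pd.read_json(session / "actions.jsonl", lines=True)

# Attach the OCR text of each screenshot to its action row.
actions["ocr_text"] = actions["image"].map(
    lambda name: pytesseract.image_to_string(Image.open(session / "images" / name))
)

# One JSON object per line, ready for downstream training pipelines.
actions.to_json("./dataset.jsonl", orient="records", lines=True, force_ascii=False)
```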
Requirements:

- Python 3.9+
- Tesseract OCR installed on your system
  - macOS: `brew install tesseract`
  - Ubuntu/Debian: `sudo apt-get install tesseract-ocr`
  - Windows: install Tesseract and ensure it is on PATH (see project docs)
Using uv (recommended):

```bash
# create a virtualenv
uv venv

# activate it
source .venv/bin/activate   # Windows: .venv\Scripts\activate

# editable install
uv pip install -e .
```

Using pip:

```bash
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -e .
```

Record a session:

```bash
kinescript record --output-dir ./sessions/session-001
# Stop recording with ESC key
```

Process a recorded session:

```bash
kinescript process \
  --input-dir ./sessions/session-001 \
  --output-file ./dataset.jsonl
```

You can also run via the module entry point:

```bash
python -m kinescript record --output-dir ./sessions/session-001
python -m kinescript process --input-dir ./sessions/session-001 --output-file ./dataset.jsonl
```

A recorded session directory looks like this:

```
sessions/session-001/
├── images/
│   ├── 0001.png
│   ├── 0002.png
│   └── ...
└── actions.jsonl
```
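For orientation, a single line in `actions.jsonl` might look roughly like the example below; the exact fields depend on the recorder and are shown purely as an illustration, not a guaranteed schema.

```
{"timestamp": 1718000000.123, "type": "mouse_click", "x": 512, "y": 300, "button": "Button.left", "image": "0001.png"}
```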
Planned features:

- Selective window capture and region filters
- Configurable event filters and sampling strategies
- Annotation/preview UI (optional)
- LLM integration (provider-agnostic; OpenAI/Gemini/Ollama)
- Export adapters for popular training formats
We welcome contributions! Please see CONTRIBUTING.md for guidelines, development setup, and workflow.
Licensed under the Apache License, Version 2.0. See LICENSE for full text.
You can generate Q/A locally using Ollama.
Install Ollama:

- macOS: `brew install ollama`
- Linux: follow your distro guide or `curl -fsSL https://ollama.com/install.sh | sh`
- Windows: use the official installer

```bash
ollama pull llama3.1
ollama serve   # if needed (often runs automatically in the background)
```

The default endpoint is http://localhost:11434.
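As a quick sanity check that the local server responds, something along these lines works against Ollama's standard /api/generate endpoint. The prompt wording and the `draft_qa` helper are placeholders for illustration, not kinescript's internal code.

```python
# Sketch: ask a local Ollama server to draft Q/A pairs from OCR text via the
# public /api/generate endpoint. Prompt and function name are illustrative.
import requests

def draft_qa(ocr_text: str,
             base_url: str = "http://localhost:11434",
             model: str = "llama3.1",
             timeout: int = 30) -> str:
    prompt = (
        "From the following on-screen text, write up to 3 question/answer "
        f"pairs as a JSON list:\n{ocr_text}"
    )
    resp = requests.post(
        f"{base_url}/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=timeout,
    )
    resp.raise_for_status()
    return resp.json()["response"]  # raw model output; validate before use

print(draft_qa("File  Edit  View  Help"))
```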
You can tweak defaults via environment variables. Place a .env file at the project root and it will be loaded automatically.
Example (.env):

```
# Provider: ollama | openai | gemini
KINESCRIPT_LLM_PROVIDER=ollama

# Ollama settings
KINESCRIPT_OLLAMA_BASE_URL=http://localhost:11434
KINESCRIPT_OLLAMA_MODEL=llama3.1
KINESCRIPT_OLLAMA_TIMEOUT=30

# Max number of Q/A pairs
KINESCRIPT_QA_MAX_PAIRS=3
```

Then run processing with Q/A generation:

```bash
kinescript process \
  --input-dir ./sessions/session-001 \
  --output-file ./dataset.jsonl
```

processor.py lets you choose an LLM provider via environment variables. A .env file at the repo root will be picked up automatically.
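If you want to read the same variables from your own scripts, a minimal approach looks like the sketch below (assuming python-dotenv is installed; this mirrors the behavior described above but is not necessarily identical to what processor.py does, and the fallback defaults are assumptions).

```python
# Sketch: load .env from the working directory, then read the variables with
# fallbacks. The default values shown here are assumptions, not guarantees.
import os

from dotenv import load_dotenv

load_dotenv()  # picks up a .env file in the current directory, if present

provider = os.getenv("KINESCRIPT_LLM_PROVIDER", "ollama")
base_url = os.getenv("KINESCRIPT_OLLAMA_BASE_URL", "http://localhost:11434")
model = os.getenv("KINESCRIPT_OLLAMA_MODEL", "llama3.1")
timeout = int(os.getenv("KINESCRIPT_OLLAMA_TIMEOUT", "30"))
max_pairs = int(os.getenv("KINESCRIPT_QA_MAX_PAIRS", "3"))

print(provider, model, max_pairs)
```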
Ollama:

```
KINESCRIPT_LLM_PROVIDER=ollama   # or openai, gemini
```

OpenAI:

```
KINESCRIPT_LLM_PROVIDER=openai
OPENAI_API_KEY=sk-...            # required
KINESCRIPT_OPENAI_MODEL=gpt-4o-mini
KINESCRIPT_OPENAI_TIMEOUT=30

# Optional: shared temperature for all providers
KINESCRIPT_LLM_TEMPERATURE=0.2
```

Gemini:

```
KINESCRIPT_LLM_PROVIDER=gemini
GOOGLE_API_KEY=AIza...           # required
KINESCRIPT_GEMINI_MODEL=gemini-1.5-flash
KINESCRIPT_GEMINI_TIMEOUT=30

# Optional: shared temperature
KINESCRIPT_LLM_TEMPERATURE=0.2
```

Notes:

- `.env` → environment variables → `KINESCRIPT_LLM_PROVIDER` decides which provider to call (Ollama/OpenAI/Gemini).
- If the response is not valid JSON, the parser safely returns an empty list.
- Number of Q/A pairs is limited by `KINESCRIPT_QA_MAX_PAIRS`.
- When using OpenAI/Gemini, ensure you understand platform costs and rate limits.
- In corporate networks or behind proxies, requests may fail due to network restrictions.
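The "invalid JSON → empty list" behavior and the `KINESCRIPT_QA_MAX_PAIRS` cap can be approximated with a defensive parser like the sketch below; this illustrates the general pattern rather than the project's exact code, and the `question`/`answer` field names are assumptions.

```python
# Sketch of defensive parsing: accept only a JSON list of Q/A dicts, cap it at
# max_pairs, and fall back to an empty list on anything malformed.
import json

def parse_qa(raw: str, max_pairs: int = 3) -> list[dict]:
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    pairs = [p for p in data
             if isinstance(p, dict) and "question" in p and "answer" in p]
    return pairs[:max_pairs]

print(parse_qa('[{"question": "Which app is open?", "answer": "A text editor"}]'))
print(parse_qa("not json at all"))  # -> []
```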