
kinescript


kinescript is a lightweight CLI tool that turns local screen interactions into a structured dataset for multimodal AI training. It records your on-screen actions (mouse and keyboard) and captures screenshots, then processes them with OCR and optional LLM steps to produce JSONL examples.

🤔 Why kinescript?

Creating high-quality multimodal datasets from real desktop workflows is often painful:

  • Manual and repetitive: Taking screenshots and writing notes for each click/keypress is tedious and error-prone.
  • Inconsistent formats: Logs, images, and annotations end up in different places and formats.
  • Heavy alternatives: Full-fledged RPA/UIs or custom capture apps are complex to set up when you just need reproducible traces.

kinescript provides a simpler path:

  • 🚀 Simple & focused: A small CLI with two commands, record and process.
  • Event-driven capture: Listens for mouse/keyboard events and snapshots the screen on each event.
  • Structured outputs: Stores screenshots and an actions.jsonl log (illustrated below) that you can post-process deterministically.
  • OCR + LLM-ready: Extracts text with Tesseract OCR and supports optional LLM-based Q/A generation (Ollama, OpenAI, Gemini).
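
For illustration, a single actions.jsonl record might look like the following; the field names and values here are hypothetical, not the exact schema:

{"timestamp": "2024-05-01T12:00:00.123Z", "event": "mouse_click", "button": "left", "x": 512, "y": 384, "screenshot": "images/0001.png"}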

✨ Features

  • CLI (Typer): Clear UX for record and process commands.
  • Recording (mss, pynput): Captures screenshots and logs mouse/keyboard actions.
  • Processing (Pillow, pytesseract, pandas): Runs OCR on images and joins it with the action logs (see the OCR sketch below).
  • Dataset output: Produces a final dataset.jsonl suitable for multimodal training pipelines.
  • Pluggable LLM step: Provider-agnostic Q/A generation (Ollama/OpenAI/Gemini), selected via environment variables.
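
As a minimal sketch of the OCR step, using the Pillow and pytesseract APIs the project depends on (illustrative, not the project's actual processor code):

import pytesseract
from PIL import Image

# Run OCR on one captured screenshot (path is illustrative)
image = Image.open("sessions/session-001/images/0001.png")
text = pytesseract.image_to_string(image)
print(text)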

πŸ—οΈ Architecture

kinescript runs locally as a Python process that (1) records events and screenshots, and (2) processes recorded sessions into structured data.

+------------------------------+            +-----------------------------+
|        Recording Step        |            |       Processing Step       |
|   (kinescript record)        |            |    (kinescript process)     |
+------------------------------+            +-----------------------------+
| - Mouse/keyboard listeners   |            | - Load actions.jsonl        |
| - Screenshot on each event   |  images →  | - OCR (pytesseract)         |
| - Write actions.jsonl        |  actions → | - Join image+action+OCR     |
+--------------+---------------+            +---------------+-------------+
               |                                                |
               v                                                v
        session directory                               dataset.jsonl

🚀 Getting Started

Prerequisites

  • Python 3.9+
  • Tesseract OCR installed on your system (you can verify the install with the snippet below)
    • macOS: brew install tesseract
    • Ubuntu/Debian: sudo apt-get install tesseract-ocr
    • Windows: install Tesseract and ensure it is on PATH (see project docs)
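
Once the Python dependencies are installed, you can confirm that pytesseract can locate the Tesseract binary (a quick sanity check, not an official setup step):

import pytesseract

# Raises TesseractNotFoundError if the tesseract binary is not on PATH
print(pytesseract.get_tesseract_version())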

Installation

Using uv (recommended):

# create a virtualenv
uv venv
# activate it
source .venv/bin/activate  # Windows: .venv\Scripts\activate
# editable install
uv pip install -e .

Using pip:

python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -e .

Usage

Record a session:

kinescript record --output-dir ./sessions/session-001
# Stop recording with ESC key

Process a recorded session:

kinescript process \
  --input-dir ./sessions/session-001 \
  --output-file ./dataset.jsonl

You can also run via the module entry point:

python -m kinescript record --output-dir ./sessions/session-001
python -m kinescript process --input-dir ./sessions/session-001 --output-file ./dataset.jsonl

Output Layout

sessions/session-001/
β”œβ”€ images/
β”‚  β”œβ”€ 0001.png
β”‚  β”œβ”€ 0002.png
β”‚  └─ ...
└─ actions.jsonl
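
Each line of dataset.jsonl joins an action with its screenshot path and OCR text (plus Q/A pairs when an LLM provider is configured). A hypothetical record, with illustrative field names:

{"image": "images/0001.png", "action": {"event": "mouse_click", "x": 512, "y": 384}, "ocr_text": "File Edit View", "qa_pairs": [{"question": "Which menu is visible?", "answer": "File"}]}

Because the output is JSON Lines, you can inspect it with pandas (already a project dependency):

import pandas as pd

# Load the JSON Lines dataset into a DataFrame for inspection
df = pd.read_json("dataset.jsonl", lines=True)
print(df.head())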

🧭 Roadmap (high level)

  • Selective window capture and region filters
  • Configurable event filters and sampling strategies
  • Annotation/preview UI (optional)
  • LLM integration (provider-agnostic; OpenAI/Gemini/Ollama)
  • Export adapters for popular training formats

🙌 Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines, development setup, and workflow.

📄 License

Licensed under the Apache License, Version 2.0. See LICENSE for full text.

🧰 Local LLM (Ollama) Setup

You can generate Q/A pairs locally using Ollama.

1) Install

  • macOS: brew install ollama
  • Linux: follow your distro guide or curl -fsSL https://ollama.com/install.sh | sh
  • Windows: use the official installer

2) Prepare and run a model

ollama pull llama3.1
ollama serve  # if needed (the server often runs automatically in the background)

Default endpoint is http://localhost:11434.
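
Before processing, you can confirm the server is reachable by querying Ollama's model-listing endpoint (a quick check using only the Python standard library):

import json
import urllib.request

# GET /api/tags returns the models available to the local Ollama server
with urllib.request.urlopen("http://localhost:11434/api/tags") as resp:
    models = json.load(resp)

print([m["name"] for m in models.get("models", [])])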

3) Integrate with kinescript

You can tweak defaults via environment variables. Place a .env file at the project root and it will be loaded automatically.

Example (.env):

# Provider: ollama | openai | gemini
KINESCRIPT_LLM_PROVIDER=ollama

# Ollama settings
KINESCRIPT_OLLAMA_BASE_URL=http://localhost:11434
KINESCRIPT_OLLAMA_MODEL=llama3.1
KINESCRIPT_OLLAMA_TIMEOUT=30

# Max number of Q/A pairs
KINESCRIPT_QA_MAX_PAIRS=3

Then run processing with Q/A generation:

kinescript process \
  --input-dir ./sessions/session-001 \
  --output-file ./dataset.jsonl

πŸ” Using Other LLM Providers (.env-based)

processor.py lets you choose an LLM provider via environment variables. A .env file at the repo root will be picked up automatically.

Select provider

KINESCRIPT_LLM_PROVIDER=ollama   # or openai, gemini

OpenAI example

KINESCRIPT_LLM_PROVIDER=openai
OPENAI_API_KEY=sk-...  # required
KINESCRIPT_OPENAI_MODEL=gpt-4o-mini
KINESCRIPT_OPENAI_TIMEOUT=30
# Optional: shared temperature for all providers
KINESCRIPT_LLM_TEMPERATURE=0.2

Google Gemini example

KINESCRIPT_LLM_PROVIDER=gemini
GOOGLE_API_KEY=AIza...  # required
KINESCRIPT_GEMINI_MODEL=gemini-1.5-flash
KINESCRIPT_GEMINI_TIMEOUT=30
# Optional: shared temperature
KINESCRIPT_LLM_TEMPERATURE=0.2

How it works

  • .env → environment variables → KINESCRIPT_LLM_PROVIDER decides which provider to call (Ollama/OpenAI/Gemini); see the sketch after this list.
  • If the response is not valid JSON, the parser safely returns an empty list.
  • The number of Q/A pairs is capped by KINESCRIPT_QA_MAX_PAIRS.
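
A minimal sketch of this flow, assuming python-dotenv for the .env loading; apart from the documented environment variables, the names here are illustrative rather than the actual processor.py internals:

import json
import os

from dotenv import load_dotenv

# Load .env from the working directory into the process environment
load_dotenv()

provider = os.getenv("KINESCRIPT_LLM_PROVIDER", "ollama")
max_pairs = int(os.getenv("KINESCRIPT_QA_MAX_PAIRS", "3"))

def parse_qa_response(raw: str) -> list:
    """Parse an LLM response, returning an empty list on invalid JSON."""
    try:
        pairs = json.loads(raw)
    except json.JSONDecodeError:
        return []
    return pairs[:max_pairs] if isinstance(pairs, list) else []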

Notes

  • When using OpenAI/Gemini, ensure you understand platform costs and rate limits.
  • In corporate networks or behind proxies, requests may fail due to network restrictions.
