v2026.02 macOS (Apple Silicon) · MLX Native
Local-first voice cloning, text-to-speech, Read Aloud document reader, audiobook creator, and agentic MCP automation.
Optimized for Apple Silicon with native Metal acceleration via MLX.
macOS (Apple Silicon) · MLX-Audio · Source Available
Windows support coming soon — the codebase runs on Windows, but we currently provide macOS binaries only.
Custom Voice Cloning | Text-to-Speech | PDF Read Aloud | Audiobook Creator | MCP & API Dashboard
A local-first application for macOS (Apple Silicon) with four integrated capabilities and production-oriented workflows: clone any voice from as little as 3 seconds of reference audio using multiple engines (Qwen3-TTS and Chatterbox), generate high-quality text-to-speech with fast and expressive model families (Kokoro and Supertonic), read documents aloud with sentence-level highlighting and synchronized progression (PDF, DOCX, EPUB, Markdown, TXT), and convert full documents to audiobooks with queueable chapter generation and reusable voice presets. MimikaStudio runs fully on-device, includes first-run model download management, and exposes both UI and API paths for advanced local automation.
License: Source code is licensed under Business Source License 1.1 (BSL-1.1), and binary distributions are licensed under the MimikaStudio Binary Distribution License. See LICENSE, BINARY-LICENSE.txt, and the website License page.
Note: Windows support is planned for a future release.
| Model | Parameters | Type | Languages |
|---|---|---|---|
| Kokoro-82M | 82M | Fast TTS | English (British RP + American) |
| Qwen3-TTS 0.6B Base | 600M | Voice Cloning | 10 languages |
| Qwen3-TTS 1.7B Base | 1.7B | Voice Cloning | 10 languages |
| Qwen3-TTS 0.6B CustomVoice | 600M | Preset Speakers | 4 languages (en, zh, ja, ko) |
| Qwen3-TTS 1.7B CustomVoice | 1.7B | Preset Speakers | 4 languages (en, zh, ja, ko) |
| Qwen3-TTS 0.6B Base-8bit | 600M | Voice Cloning (8-bit) | 10 languages |
| Qwen3-TTS 1.7B Base-8bit | 1.7B | Voice Cloning (8-bit) | 10 languages |
| Qwen3-TTS 0.6B CustomVoice-8bit | 600M | Preset Speakers (8-bit) | 4 languages (en, zh, ja, ko) |
| Qwen3-TTS 1.7B CustomVoice-8bit | 1.7B | Preset Speakers (8-bit) | 4 languages (en, zh, ja, ko) |
| Chatterbox Multilingual | — | Voice Cloning | 23 languages |
| Supertonic-2 | — | Multilingual TTS (ONNX) | 5 languages (en, ko, es, pt, fr) |
| CosyVoice3 ONNX | — | Expressive TTS (ONNX backend) | 10 languages (auto, en, zh, ja, ko, de, es, fr, it, ru) |
Note: CosyVoice3 uses its own dedicated ONNX model package (ayousanz/cosy-voice3-onnx) and is independent from Supertonic-2.
Listen to samples generated by each TTS engine. For voice cloning demos, compare the reference voice with the generated output.
Voice cloning from a 3-second sample. Compare the reference voice with the generated output.
| Voice | Reference | Generated |
|---|---|---|
| Natasha Clone (Genesis4 Style) | Natasha.wav | qwen3-natasha-genesis4-demo.wav |
| Suzan Clone (Genesis4 Style) | Suzan.wav | qwen3-suzan-genesis4-demo.wav |
| Natasha (Hebrew) (Cross-language) | Natasha.wav | qwen3-natasha-hebrew-demo.wav |
| Speaker | Sample |
|---|---|
| Ryan (English, dynamic male, Genesis4 Style) | qwen3-ryan-genesis4-demo.wav |
Expressive voice cloning with emotion control. Natural, emotive speech synthesis.
| Voice | Reference | Generated |
|---|---|---|
| Natasha Clone (Emotional Speech) | Natasha.wav | chatterbox-natasha-demo.wav |
| Suzan Clone (Emotional Speech) | Suzan.wav | chatterbox-suzan-demo.wav |
| Voice | Sample |
|---|---|
| Emma (British RP Female) | sentence-01-bf_emma.wav |
| George (British Male) | sentence-02-bm_george.wav |
| Lily (British Female) | sentence-03-bf_lily.wav |
| Voice | Sample |
|---|---|
| Female (F1) (Genesis4 Style) | supertonic-f1-genesis4-demo.wav |
| Male (M2) (Genesis4 Style) | supertonic-m2-genesis4-demo.wav |
| Voice | Sample |
|---|---|
| Female (F1 / Eden Alias) (Genesis4 Style) | cosyvoice3-f1-genesis4-demo.wav |
| Male (M2 / Atlas Alias) (Genesis4 Style) | cosyvoice3-m2-genesis4-demo.wav |
All shipped pregenerated demo files in backend/data/pregenerated:
| Engine | File | Purpose |
|---|---|---|
| Qwen3-TTS | qwen3-natasha-genesis4-demo.wav | Voice clone demo (Natasha, Genesis4 style) |
| Qwen3-TTS | qwen3-suzan-genesis4-demo.wav | Voice clone demo (Suzan, Genesis4 style) |
| Qwen3-TTS | qwen3-ryan-genesis4-demo.wav | Preset speaker demo (Ryan) |
| Qwen3-TTS | qwen3-natasha-hebrew-demo.wav | Cross-language clone demo (Hebrew) |
| Qwen3-TTS | qwen3-natasha-hebrew-demo.txt | Source text for Hebrew demo |
| Chatterbox | chatterbox-natasha-demo-1770830814.wav | Emotional clone demo (Natasha) |
| Chatterbox | chatterbox-suzan-demo-1770830815.wav | Emotional clone demo (Suzan) |
| Supertonic | supertonic-f1-genesis4-demo.wav | Preset F1 multilingual ONNX demo |
| Supertonic | supertonic-m2-genesis4-demo.wav | Preset M2 multilingual ONNX demo |
| CosyVoice3 | cosyvoice3-f1-genesis4-demo.wav | CosyVoice3 F1/Eden standalone ONNX demo |
| CosyVoice3 | cosyvoice3-m2-genesis4-demo.wav | CosyVoice3 M2/Atlas standalone ONNX demo |
Kokoro examples are bundled under backend/data/samples/kokoro/ and listed above in the Kokoro section.
| Component | Requirement |
|---|---|
| OS | macOS 13+ (Ventura or later) |
| Chip | Apple Silicon (M1/M2/M3/M4) — Intel not supported |
| RAM | 8GB minimum, 16GB+ recommended |
| Storage | 5-10GB for models |
| Python | 3.10 or later |
| Flutter | 3.x with desktop support |
Windows & Linux: The codebase supports these platforms, but pre-built binaries are currently macOS-only. Windows/Linux support is planned for future releases.
As of February 19, 2026, the MimikaStudio DMG is not yet signed/notarized by Apple.
macOS may block first launch until you explicitly allow it in security settings.
- Open the DMG and drag MimikaStudio.app to Applications.
- In Applications, right-click MimikaStudio.app and select Open.
- Click Open in the warning dialog.
- If macOS still blocks launch, go to System Settings -> Privacy & Security -> Open Anyway (for MimikaStudio), then confirm with password/Touch ID.
- On first launch, wait for the bundled backend to start. The startup log screen below is expected for a few seconds.
- On first use, click Download for the required model in the in-app model card.
A single install.sh in the project root handles everything: prerequisites,
virtual environment, all Python dependencies (including Qwen3-TTS, Chatterbox,
OmegaConf, Perth, etc.), database setup, and Flutter.
git clone https://github.com/BoltzmannEntropy/MimikaStudio.git
cd MimikaStudio
./install.sh
The script will:
- Check / install Homebrew, Python 3, espeak-ng, and ffmpeg
- Create a Python venv in the project root (./venv)
- Install all Python dependencies from the root requirements.txt
- Install chatterbox-tts with --no-deps (its runtime deps are already in requirements.txt)
- Download the Dicta ONNX Hebrew diacritizer model (~1.1 GB) for Chatterbox Hebrew TTS (skip with SKIP_DICTA=1)
- Verify that every critical import works
- Initialize the SQLite database
- Set up Flutter (if installed)
Note: ./install.sh creates the Python virtual environment and installs large dependencies, so the first run can take a few minutes.
After installation, start MimikaStudio:
source venv/bin/activate
./bin/mimikactl up # Backend + MCP + Desktop app
git clone https://github.com/BoltzmannEntropy/MimikaStudio.git
cd MimikaStudio
# System dependencies (macOS)
brew install espeak-ng ffmpeg python@3.11
# Python venv
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
# All Python dependencies (from project root)
pip install -r requirements.txt
# Chatterbox TTS (--no-deps to avoid version conflicts with its strict pins)
pip install --no-deps chatterbox-tts==0.1.6
# Initialize database
cd backend && python3 database.py && cd ..
# Flutter (optional, for desktop app UI)
cd flutter_app && flutter pub get && cd ..
# Start
./bin/mimikactl up
Models auto-download on first use (~3 GB total). To pre-download:
./bin/mimikactl models download kokoro # ~300 MB
./bin/mimikactl models download qwen3 # ~4 GB for 1.7B
The Dicta ONNX Hebrew diacritizer (~1.1 GB) is downloaded by install.sh automatically. If you skipped it (SKIP_DICTA=1) and need Hebrew TTS later, run:
mkdir -p backend/models/dicta-onnx
curl -L -o backend/models/dicta-onnx/dicta-1.0.onnx \
https://github.com/thewh1teagle/dicta-onnx/releases/download/model-files-v1.0/dicta-1.0.onnx
source venv/bin/activate
python -c "import kokoro; print('Kokoro OK')"
python -c "from qwen_tts import QwenTTS; print('Qwen3-TTS OK')"
python -c "from chatterbox import ChatterboxTTS; print('Chatterbox OK')"
python -c "import omegaconf; print('OmegaConf OK')"
python -c "import perth; print('Perth OK')"
# Start all services (Backend + MCP + Desktop UI)
./bin/mimikactl up
# Or: Backend + MCP only (no Flutter UI)
./bin/mimikactl up --no-flutter
# Check status
./bin/mimikactl status
# View logs
./bin/mimikactl logs backend
Example startup output:
=== Starting MimikaStudio ===
Starting backend...
Waiting for http://localhost:8000/api/health ...... OK
Starting MCP Server...
MCP Server started on port 8010
Starting Flutter UI (dev mode)...
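The "Waiting for http://localhost:8000/api/health" step in the output above can be reproduced from a script, for example before sending API requests in automation. The following is a minimal sketch, not the logic mimikactl itself uses: the function name `wait_for_health`, the timeout values, and the assumption that a healthy backend answers with HTTP 200 are all illustrative.

```python
import time
import urllib.error
import urllib.request


def wait_for_health(url="http://localhost:8000/api/health",
                    timeout_s=60.0, interval_s=1.0,
                    opener=urllib.request.urlopen):
    """Poll the health URL until it answers HTTP 200 or the timeout expires.

    `opener` is injectable so the loop can be exercised without a running
    backend; by default it is urllib.request.urlopen.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with opener(url, timeout=5) as resp:
                if resp.status == 200:  # assumption: healthy means HTTP 200
                    return True
        except (urllib.error.URLError, OSError):
            pass  # backend is not accepting connections yet
        time.sleep(interval_s)
    return False


if __name__ == "__main__":
    print("backend ready" if wait_for_health() else "backend did not start")
```

Run it after `./bin/mimikactl up --no-flutter` to block until the backend is reachable.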
MimikaStudio ships a desktop UI backed by the same local FastAPI server:
macOS Desktop App (default): ./bin/mimikactl up
MimikaStudio brings together the latest advances in neural text-to-speech into a unified desktop experience.
Kokoro TTS delivers sub-200ms latency with crystal-clear British and American accents. The 82M parameter model runs effortlessly on any machine, generating natural-sounding speech with Emma, George, Lily, and other premium voices.
Clone any voice from remarkably short audio samples. Qwen3-TTS requires just 3 seconds of reference audio to capture a speaker's characteristics. Upload a voice memo, a podcast clip, or any audio snippet, and MimikaStudio will synthesize new speech in that voice.
For multilingual cloning, Chatterbox Multilingual TTS supports 23 languages. Both Qwen3 and Chatterbox share a unified voice library — upload a voice sample once and use it across all cloning engines.
MimikaStudio includes 9 premium preset speakers across 4 languages (English, Chinese, Japanese, Korean), each with distinct personalities. These CustomVoice speakers require no audio samples at all.
| Model | Type | Strength |
|---|---|---|
| Kokoro-82M | Fast TTS | Sub-200ms latency, British RP & American accents |
| Qwen3-TTS 0.6B/1.7B Base | Voice Cloning | 3-second cloning, 10 languages |
| Qwen3-TTS 0.6B/1.7B CustomVoice | Preset Speakers | 9 premium voices, style control |
| Qwen3-TTS 8-bit variants (0.6B/1.7B Base + CustomVoice) | Low-memory mode | Smaller footprint with strong quality/speed tradeoff |
| Chatterbox Multilingual TTS | Voice Cloning | Multilingual cloning with prompt audio |
| Supertonic-2 | Multilingual ONNX TTS | Low-latency local synthesis across 5 languages |
| CosyVoice3 ONNX | Expressive preset TTS | Dedicated ONNX model with independent download/status and UI/API surface |
- Read Aloud Document Reader: Read PDF, DOCX, EPUB, Markdown, and TXT aloud with sentence-by-sentence highlighting
- Audiobook Creator: Convert documents into WAV/MP3/M4B audiobooks with smart chunking, crossfade merging, progress tracking, and chapter markers (Kokoro voices only)
- Unified Jobs Queue: Track every executed job (TTS, voice clone, and audiobook) with status and inline playback controls
- Shared Voice Library: Voice samples shared across all cloning engines (Qwen3, Chatterbox)
- Model Manager: In-app model download manager — check status and download models on demand
- Advanced Generation Controls: Temperature, top_p, top_k, repetition penalty, seed
- Style Instructions: Tell speakers how to speak - "whisper softly", "speak with excitement", etc.
- Real-time System Monitoring: CPU, RAM, and GPU usage in the app header
- Multi-LLM Support: Claude, OpenAI, Ollama (local), or Claude Code CLI
- Qwen3-TTS Voice Clone: Clone any voice from just 3+ seconds of audio
- Qwen3-TTS Custom Voice: 9 preset premium speakers (Ryan, Aiden, Vivian, Serena, Uncle Fu, Dylan, Eric, Ono Anna, Sohee)
- Chatterbox Voice Clone: Multilingual voice cloning with prompt audio
- Shared Voice Library: Voice samples uploaded to any engine are available across all voice cloning models
- Model Manager: In-app UI to check model download status and download models on demand
- Advanced Generation Controls: Temperature, top_p, top_k, repetition penalty, seed
- Model Size Selection: 0.6B (Fast) or 1.7B (Quality)
- Kokoro TTS: Fast, high-quality English synthesis with 21 British/American voices (IPA transcription is not part of the current release)
- Default Voice Samples: Max, Natasha, Sara, and Suzan ship with the app; user uploads are stored in ~/MimikaStudio/data/user_voices/cloners/ by default (or MIMIKA_DATA_DIR)
- User Voices in UI: Uploaded voices appear immediately under each engine's Your Voices section
- Jobs Tab: Unified queue of TTS, voice clone, and audiobook jobs with progress, completion state, and playback controls
- Folder View in Settings: View and open user home, Mimika data, logs, default voices (Natasha/Suzan), and user clone voices folders directly from the app
- Voice Previews: Tap play/pause/stop to audition voices before generating
- Document Reader: Read PDFs, TXT, and MD files aloud with Kokoro TTS
- Audiobook Creator: Convert full documents to audiobook files (WAV/MP3/M4B) with smart chunking, crossfade merging, progress tracking, and playback controls (Kokoro voices only)
- CLI Tool: Full command-line interface for Kokoro and Qwen3
- MCP & API Dashboard: Built-in tab showing all MCP tools and REST endpoints with live server status
- MCP Server: Full MCP integration for programmatic access to all API endpoints
- Windows Installer: PyInstaller + Inno Setup build script for standalone Windows distribution
- 60+ REST API endpoints with FastAPI (auto-documented at /docs)
# Service Commands
./bin/mimikactl up # Start all services
./bin/mimikactl up --no-flutter # Backend + MCP only
./bin/mimikactl down # Stop all services
./bin/mimikactl restart # Restart all
./bin/mimikactl status # Check status
# Backend Commands
./bin/mimikactl backend start # Start backend only
./bin/mimikactl backend stop # Stop backend
# Flutter Commands
./bin/mimikactl flutter start # Start Flutter (release mode)
./bin/mimikactl flutter start --dev # Start in dev mode
./bin/mimikactl flutter stop # Stop Flutter
./bin/mimikactl flutter build # Build macOS app
# MCP Server Commands
./bin/mimikactl mcp start # Start MCP server (port 8010)
./bin/mimikactl mcp stop # Stop MCP server
./bin/mimikactl mcp status # Check MCP status
# Utility Commands
./bin/mimikactl logs [service] # Tail logs (backend|mcp|flutter|all)
./bin/mimikactl test # Run API tests
./bin/mimikactl clean # Clean logs and temp files
./bin/mimikactl version # Show version info
Full command-line interface for voice cloning and TTS generation.
# Kokoro TTS (fast British/American voices)
./bin/mimika kokoro "Hello, world!" --voice bf_emma --output hello.wav
./bin/mimika kokoro input.txt --voice bm_george --speed 1.2
# Qwen3 Custom Voice (preset speakers)
./bin/mimika qwen3 "Hello, world!" --speaker Ryan --style "professional narration"
# Qwen3 Voice Clone (clone from reference audio)
./bin/mimika qwen3 "Hello, world!" --clone --reference Alina.wav
./bin/mimika qwen3 book.pdf --clone --reference Bella.wav --output book.wav
# PDF audiobook generation (Kokoro voices only)
./bin/mimika kokoro book.pdf --voice bf_emma --output audiobook.wav
# List available voices
./bin/mimika voices --engine kokoro
./bin/mimika voices --engine qwen3
| Variable | Default | Description |
|---|---|---|
| MIMIKA_API_URL | http://localhost:8000 | Backend API URL |
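A script that talks to the backend can honor the same MIMIKA_API_URL variable. The sketch below shows one way to resolve the base URL with the documented default; the helper names `api_base_url` and `endpoint` are hypothetical and are not part of the mimika CLI.

```python
import os

DEFAULT_API_URL = "http://localhost:8000"  # default from the table above


def api_base_url(env=None):
    """Return the backend base URL, preferring MIMIKA_API_URL when it is set."""
    env = os.environ if env is None else env
    # An unset or empty variable falls back to the documented default.
    return (env.get("MIMIKA_API_URL") or DEFAULT_API_URL).rstrip("/")


def endpoint(path, env=None):
    """Join a path such as /api/health onto the resolved base URL."""
    return f"{api_base_url(env)}/{path.lstrip('/')}"


if __name__ == "__main__":
    print(endpoint("/api/health"))
```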
./bin/mimika kokoro <input> [options]
| Parameter | Short | Default | Description |
|---|---|---|---|
| input | | required | Text string or file path (.txt, .pdf, .epub, .docx, .doc) |
| --voice | -v | bf_emma | Voice ID (see mimika voices --engine kokoro) |
| --speed | -s | 1.0 | Speech speed multiplier (0.5-2.0) |
| --output | -o | <input>.wav | Output WAV file path |
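When calling the Kokoro command from another program, it helps to build the argument list in one place and check the documented speed range before launching. This is a sketch under the assumption that the CLI is invoked as ./bin/mimika from the project root; `kokoro_command` is a hypothetical helper, and only the flags listed in the table above are used.

```python
import subprocess


def kokoro_command(text_or_path, voice="bf_emma", speed=1.0,
                   output=None, cli="./bin/mimika"):
    """Build the argv list for `mimika kokoro` using the documented flags."""
    if not 0.5 <= speed <= 2.0:
        raise ValueError("speed must be within the documented range 0.5-2.0")
    argv = [cli, "kokoro", text_or_path, "--voice", voice, "--speed", str(speed)]
    if output is not None:
        # When omitted, the CLI default (<input>.wav) applies.
        argv += ["--output", output]
    return argv


if __name__ == "__main__":
    cmd = kokoro_command("Hello, world!", voice="bf_emma", output="hello.wav")
    subprocess.run(cmd, check=True)  # requires a running backend
```

Passing a list to `subprocess.run` avoids shell quoting problems with text that contains spaces or punctuation.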
Available Kokoro Voices:
| Voice ID | Name | Gender | Accent |
|---|---|---|---|
| bf_emma | Emma | Female | British RP |
| bf_isabella | Isabella | Female | British |
| bf_alice | Alice | Female | British |
| bf_lily | Lily | Female | British |
| bm_george | George | Male | British |
| bm_lewis | Lewis | Male | British |
| bm_daniel | Daniel | Male | British |
| af_heart | Heart | Female | American |
| af_bella | Bella | Female | American |
| af_nicole | Nicole | Female | American |
| af_aoede | Aoede | Female | American |
| af_kore | Kore | Female | American |
| af_sarah | Sarah | Female | American |
| af_sky | Sky | Female | American |
| am_michael | Michael | Male | American |
| am_adam | Adam | Male | American |
| am_echo | Echo | Male | American |
| am_liam | Liam | Male | American |
| am_onyx | Onyx | Male | American |
| am_puck | Puck | Male | American |
| am_santa | Santa | Male | American |
./bin/mimika qwen3 <input> [options]
Common Parameters:
| Parameter | Short | Default | Description |
|---|---|---|---|
| input | | required | Text string or file path (.txt, .pdf, .epub, .docx, .doc) |
| --output | -o | <input>.wav | Output WAV file path |
| --model | -m | 1.7B | Model size: 0.6B (fast) or 1.7B (quality) |
| --language | -l | auto | Language code (auto, en, zh, ja, ko, de, fr, ru, pt, es, it) |
| --temperature | | 0.9 | Generation randomness (0.1-2.0) |
| --top-p | | 0.9 | Nucleus sampling threshold (0.1-1.0) |
| --top-k | | 50 | Top-k sampling (1-100) |
Custom Voice Mode (Preset Speakers):
| Parameter | Short | Default | Description |
|---|---|---|---|
| --speaker | | Ryan | Preset speaker name |
| --style | | see below | Style instruction for voice |
Default style: "Optimized for engaging, professional audiobook narration"
Available Preset Speakers:
| Speaker | Language | Character |
|---|---|---|
| Ryan | English | Dynamic male, strong rhythm |
| Aiden | English | Sunny American male |
| Vivian | Chinese | Bright young female |
| Serena | Chinese | Warm gentle female |
| Uncle_Fu | Chinese | Seasoned male, mellow timbre |
| Dylan | Chinese | Beijing youthful male |
| Eric | Chinese | Sichuan lively male |
| Ono_Anna | Japanese | Playful female |
| Sohee | Korean | Warm emotional female |
Voice Clone Mode:
| Parameter | Short | Default | Description |
|---|---|---|---|
| --clone | | flag | Enable voice cloning mode |
| --reference | -r | required | Reference audio file (WAV, 3+ seconds) |
| --reference-text | | optional | Transcript of reference audio (improves quality) |
./bin/mimika voices [--engine kokoro|qwen3]
| Format | Extension | Requirements |
|---|---|---|
| Plain Text | .txt | Built-in |
| PDF | .pdf | PyPDF2, pymupdf |
| EPUB | .epub | ebooklib, beautifulsoup4 |
| Word Document | .docx | python-docx |
| Legacy Word | .doc | docx2txt |
| Markdown | .md | Built-in |
All format dependencies are included in requirements.txt.
Fast, high-quality British English synthesis (82M parameters, 24kHz). Features 8 premium British voices including Emma, Alice, Isabella, Lily, George, Daniel, Fable, and Lewis.
Clone any voice from just 3+ seconds of reference audio.
Models:
- Qwen3-TTS-12Hz-0.6B-Base - Fast, 1.4GB
- Qwen3-TTS-12Hz-1.7B-Base - Higher quality, 3.6GB
Languages: Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian
How It Works:
- Upload a 3+ second audio sample
- (Optional) Provide transcript for better quality
- Enter text to synthesize
- Adjust generation parameters if needed
- Generate!
Use 9 premium preset speakers without any reference audio.
Models:
- Qwen3-TTS-12Hz-0.6B-CustomVoice
- Qwen3-TTS-12Hz-1.7B-CustomVoice
Style Instructions: Control tone with prompts like "Speak slowly", "Very happy", "Whisper", or use "Optimized for engaging, professional audiobook narration" for long-form content.
| Parameter | Default | Range | Description |
|---|---|---|---|
| Temperature | 0.9 | 0.1-2.0 | Randomness in generation |
| Top P | 0.9 | 0.1-1.0 | Nucleus sampling threshold |
| Top K | 50 | 1-100 | Top-k sampling |
| Repetition Penalty | 1.0 | 1.0-2.0 | Reduce repetition |
| Seed | -1 | -1 or 0+ | Reproducible generation (-1=random) |
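Client code can check these ranges before sending a request, so that out-of-range values fail early with a clear message. The sketch below mirrors the defaults and ranges in the table above; the class name `GenerationParams` is hypothetical and this is not the backend's own validation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenerationParams:
    """Generation controls with the documented defaults and ranges."""
    temperature: float = 0.9
    top_p: float = 0.9
    top_k: int = 50
    repetition_penalty: float = 1.0
    seed: int = -1  # -1 means a random seed

    def __post_init__(self):
        ranges = [
            ("temperature", self.temperature, 0.1, 2.0),
            ("top_p", self.top_p, 0.1, 1.0),
            ("top_k", self.top_k, 1, 100),
            ("repetition_penalty", self.repetition_penalty, 1.0, 2.0),
        ]
        for name, value, low, high in ranges:
            if not low <= value <= high:
                raise ValueError(f"{name}={value} is outside {low}-{high}")
        if self.seed < -1:
            raise ValueError("seed must be -1 (random) or a non-negative integer")


if __name__ == "__main__":
    print(GenerationParams(temperature=0.7, seed=42))
```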
Chatterbox adds multilingual voice cloning from a reference audio prompt. It uses the same voice library flow as Qwen3 (default samples + your uploads).
23 Supported Languages:
| Code | Language | Code | Language | Code | Language |
|---|---|---|---|---|---|
| ar | Arabic | he | Hebrew | no | Norwegian |
| da | Danish | hi | Hindi | pl | Polish |
| de | German | it | Italian | pt | Portuguese |
| el | Greek | ja | Japanese | ru | Russian |
| en | English | ko | Korean | sv | Swedish |
| es | Spanish | ms | Malay | sw | Swahili |
| fi | Finnish | nl | Dutch | tr | Turkish |
| fr | French | zh | Chinese |
Parameters:
- Temperature (randomness)
- CFG weight (conditioning strength)
- Exaggeration (style intensity)
- Seed (reproducibility)
Hebrew TTS: Chatterbox Hebrew requires the Dicta ONNX diacritizer model (dicta-1.0.onnx, ~1.1 GB) which adds vowel marks (nikud) to unvocalized Hebrew text before synthesis. Without it, Hebrew output quality is severely degraded. The model can be downloaded from the app's Model Manager, or automatically by install.sh (skip with SKIP_DICTA=1), and is stored at backend/models/dicta-onnx/dicta-1.0.onnx. To download manually:
mkdir -p backend/models/dicta-onnx
curl -L -o backend/models/dicta-onnx/dicta-1.0.onnx \
https://github.com/thewh1teagle/dicta-onnx/releases/download/model-files-v1.0/dicta-1.0.onnx
Note: On Apple Silicon, Chatterbox runs on CPU due to MPS resampling limitations.
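The manual Dicta download can also be scripted so that it only runs when the model file is missing. This is a sketch using the path and URL given above; the helper names `dicta_model_path` and `ensure_dicta_model` are hypothetical, and the check is a simple "file exists and is non-empty" test rather than a checksum.

```python
import urllib.request
from pathlib import Path

DICTA_URL = ("https://github.com/thewh1teagle/dicta-onnx/releases/download/"
             "model-files-v1.0/dicta-1.0.onnx")
DICTA_RELATIVE_PATH = Path("backend/models/dicta-onnx/dicta-1.0.onnx")


def dicta_model_path(project_root="."):
    """Location where the text above says the diacritizer model is stored."""
    return Path(project_root) / DICTA_RELATIVE_PATH


def ensure_dicta_model(project_root=".", download=urllib.request.urlretrieve):
    """Download the model (~1.1 GB) only if it is not already present."""
    target = dicta_model_path(project_root)
    if target.is_file() and target.stat().st_size > 0:
        return target  # already downloaded
    target.parent.mkdir(parents=True, exist_ok=True)  # same as mkdir -p
    download(DICTA_URL, str(target))
    return target


if __name__ == "__main__":
    print(ensure_dicta_model())
```

Restart the backend afterwards so Chatterbox picks up the model.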
The backend exposes 60+ REST endpoints via FastAPI. Full interactive docs at http://localhost:8000/docs.
| Endpoint | Method | Description |
|---|---|---|
| /api/health | GET | Health check |
| /api/system/info | GET | System information (Python, device, models, OS) |
| /api/system/stats | GET | Real-time CPU/RAM/GPU usage |
| Endpoint | Method | Description |
|---|---|---|
| /api/kokoro/generate | POST | Generate speech with Kokoro |
| /api/kokoro/voices | GET | List available voices |
| /api/kokoro/audio/list | GET | List generated audio files |
| /api/kokoro/audio/{filename} | DELETE | Delete audio file |
| Endpoint | Method | Description |
|---|---|---|
| /api/qwen3/generate | POST | Generate audio (clone or custom mode) |
| /api/qwen3/generate/stream | POST | Streaming audio generation |
| /api/qwen3/voices | GET | List saved voice samples |
| /api/qwen3/voices | POST | Upload new voice sample |
| /api/qwen3/voices/{name} | PUT | Update voice sample |
| /api/qwen3/voices/{name} | DELETE | Delete voice sample |
| /api/qwen3/voices/{name}/audio | GET | Preview voice sample audio |
| /api/qwen3/speakers | GET | List 9 preset speakers |
| /api/qwen3/models | GET | List available models |
| /api/qwen3/languages | GET | List supported languages |
| /api/qwen3/info | GET | Model info and status |
| /api/qwen3/clear-cache | POST | Clear voice prompt cache |
| Endpoint | Method | Description |
|---|---|---|
| /api/chatterbox/generate | POST | Generate speech (voice clone) |
| /api/chatterbox/voices | GET | List saved voice samples |
| /api/chatterbox/voices | POST | Upload new voice sample |
| /api/chatterbox/voices/{name} | PUT | Update voice sample |
| /api/chatterbox/voices/{name} | DELETE | Delete voice sample |
| /api/chatterbox/voices/{name}/audio | GET | Preview voice sample audio |
| /api/chatterbox/languages | GET | List supported languages |
| /api/chatterbox/info | GET | Model info |
| Endpoint | Method | Description |
|---|---|---|
| /api/models/status | GET | Check download status of all models |
| /api/models/{model_name}/download | POST | Trigger HuggingFace model download |
| Endpoint | Method | Description |
|---|---|---|
| /api/voices/custom | GET | All custom voices across all engines |
| Endpoint | Method | Description |
|---|---|---|
| /api/audiobook/generate | POST | Start audiobook generation from text |
| /api/audiobook/generate-from-file | POST | Generate from uploaded document file (PDF/TXT/MD/DOCX/EPUB) |
| /api/audiobook/status/{job_id} | GET | Job progress (chars/sec, ETA, chapters) |
| /api/audiobook/cancel/{job_id} | POST | Cancel in-progress job |
| /api/audiobook/list | GET | List generated audiobooks |
| /api/audiobook/{job_id} | DELETE | Delete audiobook file |
Performance: ~60 chars/sec on M2 MacBook Pro CPU.
Output Formats: WAV (lossless), MP3 (compressed), M4B (audiobook with chapter markers).
Subtitle Formats: SRT (VLC-compatible), VTT (web-compatible).
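The start-then-poll pattern for the audiobook endpoints can be sketched in Python as follows. The request fields (text, title, voice, output_format) match the curl examples in this README, but several details are assumptions: the base URL and port (use whichever port your backend listens on), the response keys `job_id` and `status`, the "failed" and "cancelled" state names, and all helper names. Only the "completed" state is mentioned in this document.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: adjust to your backend port


def audiobook_request(text, title, voice="bf_emma", output_format="m4b"):
    """Build the JSON body for POST /api/audiobook/generate."""
    if output_format not in {"wav", "mp3", "m4b"}:
        raise ValueError("output_format must be wav, mp3, or m4b")
    return {"text": text, "title": title, "voice": voice,
            "output_format": output_format}


def post_json(url, payload):
    """POST a JSON body and return the decoded JSON response."""
    req = urllib.request.Request(
        url, data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"}, method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def get_json(url):
    """GET a URL and return the decoded JSON response."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def wait_for_job(job_id, fetch=get_json, interval_s=2.0, max_polls=1000,
                 base_url=BASE_URL):
    """Poll the status endpoint until the job reaches a terminal state."""
    url = f"{base_url}/api/audiobook/status/{job_id}"
    for _ in range(max_polls):
        status = fetch(url)
        # Assumed response shape: a "status" key holding the job state.
        if status.get("status") in {"completed", "failed", "cancelled"}:
            return status
        time.sleep(interval_s)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")


if __name__ == "__main__":
    body = audiobook_request("Your document text...", "My Audiobook")
    job = post_json(f"{BASE_URL}/api/audiobook/generate", body)
    print(wait_for_job(job["job_id"]))  # "job_id" key is an assumption
```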
| Endpoint | Method | Description |
|---|---|---|
| /api/tts/audio/list | GET | List Kokoro-generated audio |
| /api/tts/audio/{filename} | DELETE | Delete TTS audio file |
| /api/voice-clone/audio/list | GET | List Qwen3/Chatterbox clone audio |
| /api/voice-clone/audio/{filename} | DELETE | Delete clone audio file |
| Endpoint | Method | Description |
|---|---|---|
| /api/samples/{engine} | GET | Sample texts for engine |
| /api/pregenerated | GET | Pregenerated audio samples |
| /api/voice-samples | GET | Voice sample sentences with audio |
# Start generation
curl -X POST http://localhost:7693/api/audiobook/generate \
-H "Content-Type: application/json" \
-d '{"text": "Your document text...", "title": "My Audiobook", "voice": "bf_emma", "output_format": "m4b"}'
# From file
curl -X POST http://localhost:7693/api/audiobook/generate-from-file \
-F "file=@mybook.pdf" -F "title=My Audiobook" -F "voice=bf_emma" -F "output_format=m4b"
# Poll progress
curl http://localhost:7693/api/audiobook/status/{job_id}
MimikaStudio includes a full MCP (Model Context Protocol) server that exposes every API endpoint as MCP tools for programmatic access via Claude Code CLI, Codex, or any MCP-compatible client.
Start: ./bin/mimikactl mcp start (port 8010)
The MCP server provides 50+ tools for:
- TTS generation (Kokoro, Qwen3, Chatterbox)
- Voice management (list, upload, delete, update, preview)
- Audiobook generation and management
- System info and monitoring
- LLM configuration
- Audio library management
This is the same JSON-RPC MCP workflow used by agent clients (Codex, Claude Code), without uploading audio anywhere.
# 1) Start backend + MCP
./bin/mimikactl up --no-flutter
# 2) (Optional) confirm Kokoro voices via MCP
curl -s http://127.0.0.1:8010 \
-H 'Content-Type: application/json' \
-d '{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"tts_list_voices","arguments":{"engine":"kokoro"}}}'
# 3) Start audiobook generation from a local PDF file
curl -s http://127.0.0.1:8010 \
-H 'Content-Type: application/json' \
-d '{"jsonrpc":"2.0","id":2,"method":"tools/call","params":{"name":"audiobook_generate_from_file","arguments":{"file_path":"/absolute/path/to/document.pdf","title":"My Oral Exam Notes","voice":"bf_emma","speed":1.0,"output_format":"mp3"}}}'
# 4) Poll status until "completed"
curl -s http://127.0.0.1:8010 \
-H 'Content-Type: application/json' \
-d '{"jsonrpc":"2.0","id":3,"method":"tools/call","params":{"name":"audiobook_status","arguments":{"job_id":"<JOB_ID>"}}}'
On completion, the file is created locally in backend/outputs/:
- backend/outputs/audiobook-<JOB_ID>.mp3
- Served locally at http://localhost:7693/audio/audiobook-<JOB_ID>.mp3
If your client is connected to the Mimika MCP server (http://127.0.0.1:8010), you can ask it to run the exact same flow.
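For scripted use without an agent client, the same JSON-RPC `tools/call` requests can be built in Python. The sketch below constructs the request envelope used in the curl examples; the helper names `tool_call` and `send` are hypothetical, and the structure of the server's response is not documented here, so `send` simply returns the decoded JSON.

```python
import itertools
import json
import urllib.request

MCP_URL = "http://127.0.0.1:8010"
_request_ids = itertools.count(1)  # JSON-RPC ids only need to be unique


def tool_call(name, arguments, request_id=None):
    """Build a JSON-RPC 2.0 tools/call request for the named MCP tool."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids) if request_id is None else request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


def send(payload, url=MCP_URL):
    """POST the request to the MCP server and return the decoded response."""
    req = urllib.request.Request(
        url, data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"}, method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Requires the MCP server to be running (./bin/mimikactl mcp start).
    print(send(tool_call("tts_list_voices", {"engine": "kokoro"})))
```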
Codex prompt example
Use Mimika MCP tool audiobook_generate_from_file with:
file_path=/absolute/path/to/document.pdf
title=My Oral Exam Notes
voice=bf_emma
output_format=mp3
Then poll audiobook_status until completed and return job_id + audio_url.
Claude Code prompt example
Call Mimika MCP audiobook_generate_from_file for /absolute/path/to/document.pdf
with voice bf_emma and output_format mp3.
Track audiobook_status every 10 seconds and report final audio_url.
The MCP & API tab in the Flutter app provides a live dashboard showing:
- Server status — Backend API (port 7693), MCP Server (port 8010), and API Docs availability with green/red indicators
- All MCP tools grouped by category (System, Kokoro, Qwen3, Chatterbox, Audiobook, Voice Management, Models, Samples) with expandable parameter details
- All 60+ REST API endpoints grouped by category with HTTP method badges (GET/POST/PUT/DELETE)
- Search — Filter tools and endpoints by name, path, or description
The dashboard fetches MCP tools live from the MCP server via JSON-RPC, so it always reflects the current tool set.
source venv/bin/activate
cd backend
# Run all tests (fast, no model loading required)
pytest tests/ -v
# Run specific test file
pytest tests/test_all_endpoints.py -v
# Run with actual model tests (slow, requires models downloaded)
RUN_MODEL_TESTS=1 pytest tests/
MimikaStudio/
├── install.sh # Single install script (run this first)
├── requirements.txt # All Python dependencies
├── venv/ # Python virtual environment (created by install.sh)
│
├── bin/
│ ├── mimikactl # Service control script
│ ├── mimika # CLI tool for TTS/voice cloning
│ └── tts_mcp_server.py # MCP server for programmatic access
│
├── pdf/ # Place read-aloud documents here (pdf/txt/md/docx/epub)
│
├── flutter_app/ # Flutter desktop application (~10,100 lines Dart)
│ ├── lib/
│ │ ├── main.dart # App entry, 6-tab navigation + Model Manager
│ │ ├── screens/
│ │ │ ├── quick_tts_screen.dart # Kokoro TTS
│ │ │ ├── qwen3_clone_screen.dart # Qwen3 voice cloning
│ │ │ ├── chatterbox_clone_screen.dart # Chatterbox voice cloning
│ │ │ ├── pdf_reader_screen.dart # PDF reader with TTS
│ │ │ ├── mcp_endpoints_screen.dart # MCP & API dashboard
│ │ │ └── models_dialog.dart # Model download manager
│ │ ├── widgets/
│ │ │ ├── audio_player_widget.dart # Shared audio player
│ │ │ └── multi_layer_text.dart # Text overlay widget
│ │ └── services/
│ │ └── api_service.dart # Backend API client (823 lines)
│ └── macos/ # macOS configuration
│
├── backend/ # FastAPI Python backend (~8,500 lines Python, 60+ endpoints)
│ ├── main.py # API endpoints (2,078 lines)
│ ├── database.py # SQLite initialization and seeding
│ ├── requirements.txt # (legacy, use root requirements.txt)
│ ├── tts/ # TTS engine wrappers
│ │ ├── kokoro_engine.py
│ │ ├── qwen3_engine.py # Clone + CustomVoice
│ │ ├── chatterbox_engine.py # Multilingual voice clone
│ │ ├── text_chunking.py # Smart text chunking for audiobooks
│ │ ├── audio_utils.py # Audio processing utilities
│ │ └── audiobook.py # Audiobook generation logic (822 lines)
│ ├── language/
│ ├── llm/ # LLM provider integration
│ │ ├── factory.py # Claude, OpenAI, Ollama support
│ │ ├── claude_provider.py
│ │ ├── openai_provider.py
│ │ └── codex_provider.py
│ ├── models/
│ │ ├── registry.py # Model registry (all engines)
│ │ └── dicta-onnx/ # Hebrew diacritizer (~1.1 GB, downloaded by install.sh)
│ ├── tests/ # Comprehensive test suite
│ └── data/
│ ├── samples/ # Shipped voice samples (shared across engines)
│ │ ├── qwen3_voices/ # Natasha, Suzan
│ │ ├── chatterbox_voices/ # Natasha, Suzan, Hebrew_Natasha
│ │ └── kokoro/ # Pre-generated Kokoro samples
│ ├── user_voices/ # User uploads (git-ignored, shared across engines)
│ │ ├── qwen3/
│ │ ├── chatterbox/
│ └── outputs/ # Generated audio files
│
├── scripts/ # Build & installer scripts
│ ├── build_installer.ps1 # Windows installer build (PyInstaller + Inno Setup)
│ ├── mimikastudio.spec # PyInstaller spec file
│ ├── mimikastudio.iss # Inno Setup installer script
│ ├── install_macos.sh # (legacy, use root install.sh)
│ └── setup.sh # (legacy, use root install.sh)
| Language | Lines of Code | Files |
|---|---|---|
| Python (backend, scripts, MCP server) | ~8,500 | 20+ |
| Dart (Flutter UI) | ~10,100 | 13 |
| Total | ~18,600 | 33+ |
| Directory | Lines | Description |
|---|---|---|
| backend/main.py | 2,078 | FastAPI endpoints |
| backend/tts/ | 2,037 | TTS engine wrappers (Kokoro, Qwen3, Chatterbox) |
| backend/tests/ | 1,567 | Comprehensive test suite |
| bin/tts_mcp_server.py | 1,438 | MCP server |
| backend/llm/ | 409 | LLM provider integration |
| backend/models/ | 163 | Model registry |
| scripts/ | 377 | Build & installer scripts |
| Directory | Lines | Description |
|---|---|---|
| lib/screens/ | 7,080 | 8 screens (Models, TTS, Qwen3, Chatterbox, PDF, MCP, Settings, About) |
| lib/services/ | 823 | API service client |
| lib/widgets/ | 952 | Shared widgets (audio player, text overlay) |
| lib/main.dart | 270 | App entry + 6-tab navigation |
| File | Lines |
|---|---|
| backend/main.py | 2,078 |
| screens/pdf_reader_screen.dart | 2,147 |
| bin/tts_mcp_server.py | 1,438 |
| screens/qwen3_clone_screen.dart | 1,482 |
| screens/chatterbox_clone_screen.dart | 1,243 |
| screens/quick_tts_screen.dart | 1,085 |
| backend/tests/test_all_endpoints.py | 927 |
| backend/tts/audiobook.py | 822 |
| services/api_service.dart | 823 |
"espeak-ng not found"
brew install espeak-ng
"ffmpeg not found" (for MP3/M4B export)
brew install ffmpeg
"No module named 'perth'" or "No module named 'omegaconf'"
These are Chatterbox runtime dependencies. Run ./install.sh, or install them manually:
source venv/bin/activate
pip install resemble-perth omegaconf conformer diffusers pyloudnorm pykakasi spacy-pkuseg
pip install --no-deps chatterbox-tts==0.1.6
Hebrew TTS sounds garbled or robotic
The Dicta ONNX diacritizer model is missing. Chatterbox requires it to add vowel marks (nikud) to Hebrew text. Download it:
mkdir -p backend/models/dicta-onnx
curl -L -o backend/models/dicta-onnx/dicta-1.0.onnx \
https://github.com/thewh1teagle/dicta-onnx/releases/download/model-files-v1.0/dicta-1.0.onnx
Then restart the backend. You should see [Chatterbox] Hebrew diacritizer loaded in the logs.
"spaCy not available" (warning, not error)
pip install spacy
# The app will use regex fallback if spaCy is not installed
Models not downloading
- Ensure you have internet access
- Models are stored in ~/.cache/huggingface/ (Qwen3) and backend/models/ (Kokoro)
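Several of the checks above (espeak-ng, ffmpeg, the Chatterbox modules, the Dicta model file, the Hugging Face cache) can be run in one pass. The following is a minimal sketch and not part of MimikaStudio: the function name preflight and the report keys are hypothetical, and the tool, module, and path names are taken from the entries above.

```python
import importlib.util
import shutil
from pathlib import Path

# Names below come from the troubleshooting entries; the grouping is an assumption.
REQUIRED_TOOLS = ("espeak-ng", "ffmpeg")
REQUIRED_MODULES = ("perth", "omegaconf")
OPTIONAL_MODULES = ("spacy",)  # the app falls back to regex without it
DICTA_MODEL = Path("backend/models/dicta-onnx/dicta-1.0.onnx")


def preflight(repo_root: Path) -> dict[str, bool]:
    """Return a mapping of check name to True (present) or False (missing)."""
    report: dict[str, bool] = {}
    for tool in REQUIRED_TOOLS:
        # shutil.which returns None when the executable is not on PATH.
        report[f"tool:{tool}"] = shutil.which(tool) is not None
    for module in REQUIRED_MODULES + OPTIONAL_MODULES:
        # find_spec locates a top-level module without importing it.
        report[f"module:{module}"] = importlib.util.find_spec(module) is not None
    report["file:dicta-1.0.onnx"] = (repo_root / DICTA_MODEL).is_file()
    report["dir:huggingface-cache"] = (Path.home() / ".cache" / "huggingface").is_dir()
    return report


if __name__ == "__main__":
    # Run from the repository root, inside the activated virtual environment.
    for name, ok in preflight(Path.cwd()).items():
        print(f"{'ok' if ok else 'MISSING':8}{name}")
```

A missing Hugging Face cache directory is expected before the first model download, so treat that line as informational rather than as an error.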
Flutter build fails
flutter clean && flutter pub get && flutter build macos --release
Port 8000 already in use
lsof -i :8000
kill -9 <PID>
MimikaStudio is optimized for Apple Silicon Macs with MPS (Metal Performance Shaders) acceleration where supported:
- Kokoro TTS: Uses MPS for GPU-accelerated inference, with sub-200 ms latency
- Qwen3-TTS: Runs on CPU (MPS support planned); still fast on M-series chips
- Chatterbox: Runs on CPU due to MPS resampling limitations
- Audiobook generation: Expect ~60 chars/sec on an M2 MacBook Pro (matching the audiblez benchmark)
- Memory: Close other apps when generating long audiobooks with the 1.7B model
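The ~60 chars/sec figure gives a rough way to budget audiobook generation time. The sketch below is plain arithmetic under that assumption; the function names are hypothetical and the rate will vary with model, hardware, and text.

```python
def estimate_generation_seconds(num_chars: int, chars_per_second: float = 60.0) -> float:
    """Estimate generation time, assuming a constant throughput in chars/sec."""
    if chars_per_second <= 0:
        raise ValueError("chars_per_second must be positive")
    return num_chars / chars_per_second


def format_eta(seconds: float) -> str:
    """Format a duration in seconds as 'Hh MMm SSs'."""
    total = int(round(seconds))
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}h {minutes:02d}m {secs:02d}s"


if __name__ == "__main__":
    # A 500,000-character book at the assumed 60 chars/sec.
    print(format_eta(estimate_generation_seconds(500_000)))  # 2h 18m 53s
```

At this rate a 500,000-character book takes a little over two hours, which is why queueing chapters is more practical than waiting on a single run.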
| Author | Shlomo Kashani |
| Affiliation | Johns Hopkins University, Maryland, U.S.A. |
@software{kashani2025mimikastudio,
  title={MimikaStudio: Local-First Voice Cloning and Text-to-Speech Desktop Application},
  author={Kashani, Shlomo},
  year={2025},
  institution={Johns Hopkins University},
  url={https://github.com/BoltzmannEntropy/MimikaStudio},
  note={Comprehensive desktop application integrating Qwen3-TTS and Kokoro for voice cloning and synthesis}
}
APA Format:
Kashani, S. (2025). MimikaStudio: Local-First Voice Cloning and Text-to-Speech Desktop Application. Johns Hopkins University. https://github.com/BoltzmannEntropy/MimikaStudio
IEEE Format:
S. Kashani, "MimikaStudio: Local-First Voice Cloning and Text-to-Speech Desktop Application," Johns Hopkins University, 2025. [Online]. Available: https://github.com/BoltzmannEntropy/MimikaStudio
| Project | Description | Key Features |
|---|---|---|
| audiblez | EPUB to audiobook converter using Kokoro TTS | spaCy sentence tokenization, M4B output with chapters |
| pdf-narrator | PDF to audiobook with smart text extraction | Skips headers/footers/page numbers, TOC-based chapter splitting |
| abogen | Full-featured audiobook generator GUI | Voice mixer, subtitle generation, batch processing |
| Qwen3-Audiobook-Converter | Qwen3-TTS audiobook tool | Style instructions for professional narration |
- From audiblez: spaCy-based sentence tokenization, character-based progress tracking, M4B with chapters
- From pdf-narrator: Smart PDF extraction that skips headers/footers/page numbers, TOC-based chapters
- From abogen: Multiple output formats (WAV/MP3/M4B), real-time progress with ETA
- Unique to MimikaStudio: Native macOS Flutter UI, 3-second voice cloning, voice library management, full MCP server integration, 60+ REST API endpoints, in-app MCP & API dashboard
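Sentence tokenization with a regex fallback, combined with character-based progress tracking, can be sketched as follows. This is an illustration and not the code in backend/tts/: the function names are hypothetical, the regex is a deliberately simple splitter, and the spaCy path assumes spaCy v3 with its rule-based sentencizer component (no model download needed).

```python
import re

# Split after '.', '!' or '?' when followed by whitespace (simplistic by design).
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences_regex(text: str) -> list[str]:
    """Regex fallback used when spaCy is not installed."""
    return [s.strip() for s in _SENTENCE_END.split(text.strip()) if s.strip()]


def split_sentences(text: str) -> list[str]:
    """Prefer spaCy's rule-based sentencizer; fall back to the regex splitter."""
    try:
        import spacy  # optional dependency
    except ImportError:
        return split_sentences_regex(text)
    nlp = spacy.blank("en")
    nlp.add_pipe("sentencizer")
    return [sent.text.strip() for sent in nlp(text).sents if sent.text.strip()]


def progress_by_chars(sentences: list[str], done_count: int) -> float:
    """Fraction of characters synthesized after the first done_count sentences."""
    total = sum(len(s) for s in sentences)
    if total == 0:
        return 1.0
    return sum(len(s) for s in sentences[:done_count]) / total


if __name__ == "__main__":
    parts = split_sentences("Hello world. How are you? Fine!")
    print(parts, round(progress_by_chars(parts, 1), 2))
```

Tracking progress by characters rather than by sentence count gives a steadier ETA, since sentence lengths vary widely within a book.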
MimikaStudio uses a dual-license model:
| License | Scope | File |
|---|---|---|
| Business Source License 1.1 | Source code | LICENSE |
| Binary Distribution License | DMG/executables | BINARY-LICENSE.txt |
Source Code (BSL-1.1):
- Free to use, modify, and build for personal/internal use
- Production use permitted under BSL terms
- Converts to GPL-2.0-or-later after the change date
Binary Distribution:
- Free for personal and evaluation use
- Commercial use requires a license
- Redistribution not permitted
See our License page for a plain-English overview, or contact solomon@qneura.ai for commercial licensing.
- Qwen3-TTS - 3-second voice cloning with CustomVoice
- Kokoro TTS - Fast, high-quality English TTS
- Chatterbox - Multilingual voice cloning
- Dicta ONNX - Hebrew diacritization for Chatterbox TTS
- Flutter - Cross-platform UI framework
- FastAPI - Python API framework
- spaCy - Industrial-strength NLP for sentence tokenization
- PyMuPDF - Smart PDF text extraction











