feat: add KittenTTS text-to-speech model by beshkenadze · Pull Request #118 · Blaizzy/mlx-audio-swift

beshkenadze · 2026-03-23T09:35:00Z

Summary

Adds KittenTTS, a Module-based TTS model with an ALBERT encoder, prosody predictor, and iSTFT-Net decoder
Uses shared types from PR #117 (feat: add shared TTS types, G2P stack, and text processing): Albert, PLBertConfig, ISTFTNetConfig (Shared/), MisakiTextProcessor (G2P/)
Uses MLXAudioModules shared NN blocks from PR #115 (feat: add MLXAudioModules and MLXAudioG2P foundation modules)
5 model files (~800 lines) + config + TTSModel registration + tests

Model

lucasnewman/kitten-tts-en-us (English US/GB)
Module-based architecture with @ModuleInfo, quantization support
Voices loaded from single voices.safetensors file

Dependencies

Depends on: PR #115 (feat: add MLXAudioModules and MLXAudioG2P foundation modules), which provides MLXAudioModules + MLXAudioG2P, and PR #117 (feat: add shared TTS types, G2P stack, and text processing), which provides the shared types + G2P stack
Adds: MLXAudioModules to MLXAudioTTS target dependencies

Files

File	Lines	Description
`KittenTTSConfig.swift`	72	Model config using shared `PLBertConfig` + `ISTFTNetConfig`
`KittenTTSModel.swift`	330	Main model: load, sanitize, quantize, generate
`KittenTTSModules.swift`	147	TextEncoder, DurationEncoder, ProsodyPredictor
`KittenTTSISTFTNet.swift`	231	Generator + Decoder with iSTFT
`KittenTTSTextCleaner.swift`	22	IPA symbol-to-index mapping
`TTSModel.swift`	+8	Add `kitten_tts` case + inference rule
Tests	+180	Config, text cleaner, model structure tests
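
The symbol-to-index mapping listed for `KittenTTSTextCleaner.swift` can be pictured with a minimal sketch like the one below. The symbol inventory and the `encode` helper are hypothetical and chosen only for illustration; the actual table and API in the PR may differ, including how unknown symbols are handled.

```swift
// Hypothetical symbol inventory; the real KittenTTS table is not reproduced here.
let symbols: [Character] = ["$", " ", "a", "b", "ə", "ˈ"]

// Build the symbol-to-index lookup once from the ordered inventory.
let symbolToIndex: [Character: Int] = Dictionary(
    uniqueKeysWithValues: symbols.enumerated().map { ($0.element, $0.offset) }
)

/// Maps each character of a phoneme string to its index.
/// Characters missing from the inventory are dropped (an assumption of this sketch).
func encode(_ phonemes: String) -> [Int] {
    phonemes.compactMap { symbolToIndex[$0] }
}

let ids = encode("ab ə?")
print(ids)
```

Dropping unknown symbols keeps the index sequence valid for an embedding lookup, at the cost of silently losing input; a stricter variant could throw instead.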

…o-phoneme

Add two new foundation modules:
- MLXAudioModules: reusable neural network building blocks (BiLSTM, WeightNormedConv, InstanceNorm, AdaIN, ResBlocks, SineGenerator, etc.) shared across TTS models
- MLXAudioG2P: clean-room grapheme-to-phoneme pipeline with CMUdict lexicon (BSD-2), text normalization, alignment, and extensible language pack architecture

Also updates CI workflow to use struct-based test filtering and adds AGENTS.local.md with build/test conventions.

CMUDictLoader.load(from:) accepts directory URL instead of Bundle.module. EnglishLanguagePack.withCMUDict(directory:) and G2PPipeline.english(cmuDictDirectory:) take directory URL. Resources uploaded to beshkenadze/cmudict-ipa on HuggingFace. Tests guarded with MLXAUDIO_CMUDICT_DIR env var.
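
As a rough illustration of the environment-variable gating described above, the sketch below resolves a directory URL from `MLXAUDIO_CMUDICT_DIR`. The helper name `cmuDictDirectory` and its parameters are assumptions made for this example; the PR's tests may check the variable differently.

```swift
import Foundation

/// Returns the directory named by the environment variable, or nil when the
/// variable is unset, empty, or does not point at an existing directory.
/// Hypothetical helper, not part of the PR.
func cmuDictDirectory(
    environment: [String: String] = ProcessInfo.processInfo.environment,
    variable: String = "MLXAUDIO_CMUDICT_DIR"
) -> URL? {
    guard let path = environment[variable], !path.isEmpty else { return nil }
    var isDirectory: ObjCBool = false
    guard FileManager.default.fileExists(atPath: path, isDirectory: &isDirectory),
          isDirectory.boolValue else { return nil }
    return URL(fileURLWithPath: path, isDirectory: true)
}

if let directory = cmuDictDirectory() {
    print("CMUdict tests would run against \(directory.path)")
} else {
    print("MLXAUDIO_CMUDICT_DIR not set; CMUdict tests would be skipped")
}
```

A test could call such a helper first and return early when it yields nil, so the suite still passes on machines without the downloaded resources.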

Copilot

Pull request overview

Adds a new KittenTTS text-to-speech implementation (ALBERT + prosody + iSTFT-Net) and introduces shared/supporting modules (G2P, neural G2P, reusable NN blocks) plus CI/test updates to support the new stack.

Changes:

Add KittenTTSModel + supporting modules/config/text cleaning, and register kitten_tts in TTS.loadModel.
Add new reusable libraries: MLXAudioModules, MLXAudioG2P, MLXAudioNeuralG2P (and associated unit/integration tests).
Update package products/targets and CI workflow to run tests in one pass while skipping SmokeTests.

Reviewed changes

Copilot reviewed 80 out of 81 changed files in this pull request and generated 6 comments.

File	Description
Tests/MLXAudioTTSTests.swift	Adds KittenTTS-focused tests (config, text cleaner, weight key integration).
Tests/MLXAudioNeuralG2PTests.swift	Adds tokenizer/weight sanitization/config tests + optional network integration tests.
Tests/MLXAudioG2PTextNormalizerTests.swift	Adds unit tests for normalization/tokenization behavior.
Tests/MLXAudioG2PSmokeTests.swift	Adds basic G2P pipeline smoke coverage.
Tests/MLXAudioG2PLexiconTests.swift	Adds lexicon + fallback behavior tests.
Tests/MLXAudioG2PCMUDictTests.swift	Adds CMUDict parsing/loading/IPA mapping tests (env-gated).
Tests/MLXAudioG2PAlignmentTests.swift	Adds alignment tests for heuristic aligner.
Sources/MLXAudioTTS/TextProcessor.swift	Introduces `TextProcessor` protocol with async `prepare()`.
Sources/MLXAudioTTS/TTSModel.swift	Adds `textProcessor` parameter + registers Kitten model type inference/loader.
Sources/MLXAudioTTS/Shared/SharedConfigs.swift	Adds shared `PLBertConfig` and `ISTFTNetConfig`.
Sources/MLXAudioTTS/Shared/Albert.swift	Adds ALBERT encoder implementation used by KittenTTS.
Sources/MLXAudioTTS/Models/KittenTTS/KittenTTSTextCleaner.swift	Adds symbol-to-index mapping + text cleaning for KittenTTS tokens.
Sources/MLXAudioTTS/Models/KittenTTS/KittenTTSModules.swift	Adds Kitten text/duration/prosody encoder modules.
Sources/MLXAudioTTS/Models/KittenTTS/KittenTTSModel.swift	Adds main KittenTTS model loading, quantization, and generation pipeline.
Sources/MLXAudioTTS/Models/KittenTTS/KittenTTSISTFTNet.swift	Adds iSTFT-Net generator/decoder implementation for waveform synthesis.
Sources/MLXAudioTTS/Models/KittenTTS/KittenTTSConfig.swift	Adds KittenTTS config decoding + quantization config support.
Sources/MLXAudioTTS/G2P/TokenContext.swift	Adds Misaki G2P token context helper.
Sources/MLXAudioTTS/G2P/MisakiTextProcessor.swift	Adds downloadable-resource-backed text processor for IPA phonemization.
Sources/MLXAudioTTS/G2P/MToken.swift	Adds Misaki token structures.
Sources/MLXAudioTTS/G2P/Lexicon/PennTagUtil.swift	Adds NLTag→Penn tag mapping utilities used by Misaki lexicon logic.
Sources/MLXAudioTTS/G2P/Lexicon/Lexicon.swift	Adds Misaki lexicon/transcription logic.
Sources/MLXAudioTTS/G2P/Lexicon/DataResourcesUtil.swift	Adds resource loading helpers for Misaki lexicon data.
Sources/MLXAudioTTS/G2P/G2PExtensions.swift	Adds small extensions used by Misaki implementation.
Sources/MLXAudioTTS/G2P/FallbackNetwork/MultiHeadAttention.swift	Adds BART attention block for Misaki neural fallback.
Sources/MLXAudioTTS/G2P/FallbackNetwork/FeedForward.swift	Adds FFN block for Misaki neural fallback.
Sources/MLXAudioTTS/G2P/FallbackNetwork/EnglishFallbackNetwork.swift	Adds BART-based fallback network wiring and resource loading.
Sources/MLXAudioTTS/G2P/FallbackNetwork/BARTModel.swift	Adds BART model implementation + greedy generation.
Sources/MLXAudioTTS/G2P/FallbackNetwork/BARTLayerNorm.swift	Adds BART-specific LayerNorm wrapper.
Sources/MLXAudioTTS/G2P/FallbackNetwork/BARTEncoderLayer.swift	Adds BART encoder layer implementation.
Sources/MLXAudioTTS/G2P/FallbackNetwork/BARTDecoderLayer.swift	Adds BART decoder layer implementation.
Sources/MLXAudioTTS/G2P/FallbackNetwork/BARTConfig.swift	Adds BART config decoding for fallback resources.
Sources/MLXAudioTTS/G2P/EnglishNum2Word.swift	Adds number-to-words conversion used by Misaki pipeline.
Sources/MLXAudioTTS/G2P/EnglishG2P.swift	Adds Misaki English G2P pipeline implementation.
Sources/MLXAudioNeuralG2P/Weights.swift	Adds T5 weight loading + key sanitization.
Sources/MLXAudioNeuralG2P/Tokenizer.swift	Adds ByT5 byte-level tokenizer implementation.
Sources/MLXAudioNeuralG2P/RelativePositionBias.swift	Adds relative position bias + bucketing logic.
Sources/MLXAudioNeuralG2P/NeuralPhonemizer.swift	Adds `Phonemizing` adapter around neural G2P.
Sources/MLXAudioNeuralG2P/Model.swift	Adds T5 model wrapper for conditional generation.
Sources/MLXAudioNeuralG2P/G2P.swift	Adds inference wrapper for converting words to IPA using T5.
Sources/MLXAudioNeuralG2P/FeedForward.swift	Adds T5 FFN block.
Sources/MLXAudioNeuralG2P/EncoderLayer.swift	Adds T5 encoder layer.
Sources/MLXAudioNeuralG2P/Encoder.swift	Adds T5 encoder.
Sources/MLXAudioNeuralG2P/DecoderLayer.swift	Adds T5 decoder layer with KV cache support.
Sources/MLXAudioNeuralG2P/Decoder.swift	Adds T5 decoder with causal + position bias masking.
Sources/MLXAudioNeuralG2P/Config.swift	Adds T5 config decoding/loading from model dir.
Sources/MLXAudioNeuralG2P/Attention.swift	Adds attention core + KV cache type.
Sources/MLXAudioModules/WeightNormedConv.swift	Adds reusable weight-normalized conv module.
Sources/MLXAudioModules/Utilities.swift	Adds 1D interpolation helper.
Sources/MLXAudioModules/UpSample1d.swift	Adds simple upsample wrapper module.
Sources/MLXAudioModules/SineGenerator.swift	Adds harmonic source generation (NSF-style).
Sources/MLXAudioModules/ResidualBlocks.swift	Adds AdaIN residual blocks used by vocoders/decoders.
Sources/MLXAudioModules/Normalization.swift	Adds InstanceNorm/AdaIN/AdaLayerNorm blocks.
Sources/MLXAudioModules/LinearNorm.swift	Adds linear projection wrapper.
Sources/MLXAudioModules/BiLSTM.swift	Adds BiLSTM implementation used by TTS modules.
Sources/MLXAudioG2P/Tokenization/TextTokenizer.swift	Adds deterministic tokenization.
Sources/MLXAudioG2P/Tokenization/G2PToken.swift	Adds token structure/types.
Sources/MLXAudioG2P/TextNormalization/TextNormalizer.swift	Adds rule-based normalizer.
Sources/MLXAudioG2P/TextNormalization/NormalizationRule.swift	Adds normalization rule primitives and defaults.
Sources/MLXAudioG2P/README.md	Documents MLXAudioG2P scope/usage.
Sources/MLXAudioG2P/Pipeline/G2PPipeline.swift	Adds pipeline orchestration (normalize→tokenize→lexicon/fallback→alignment).
Sources/MLXAudioG2P/Pipeline/G2POutput.swift	Adds pipeline output model.
Sources/MLXAudioG2P/Pipeline/G2PInput.swift	Adds pipeline input model (future-facing).
Sources/MLXAudioG2P/Pipeline/G2PError.swift	Adds typed pipeline error set.
Sources/MLXAudioG2P/Phonemes/PhonemeUnit.swift	Adds phoneme unit model.
Sources/MLXAudioG2P/Phonemes/PhonemeSequence.swift	Adds phoneme sequence + rendering helpers.
Sources/MLXAudioG2P/MLXAudioG2P.swift	Adds module namespace + version.
Sources/MLXAudioG2P/Lexicon/LexiconProviding.swift	Adds lexicon lookup protocol.
Sources/MLXAudioG2P/Lexicon/LexiconEntry.swift	Adds lexicon entry model.
Sources/MLXAudioG2P/Lexicon/InMemoryLexicon.swift	Adds simple in-memory lexicon implementation.
Sources/MLXAudioG2P/Lexicon/CMUDict/CMUDictParser.swift	Adds CMUDict parsing support.
Sources/MLXAudioG2P/Lexicon/CMUDict/CMUDictLoader.swift	Adds CMUDict loader producing `InMemoryLexicon`.
Sources/MLXAudioG2P/Lexicon/CMUDict/ARPAbetMapper.swift	Adds ARPAbet→IPA mapping.
Sources/MLXAudioG2P/Languages/English/EnglishLanguagePack.swift	Adds English defaults + CMUDict factory.
Sources/MLXAudioG2P/Fallback/FallbackPhonemizer.swift	Adds rule-based fallback phonemizer.
Sources/MLXAudioG2P/Alignment/TokenAlignment.swift	Adds token↔phoneme alignment model.
Sources/MLXAudioG2P/Alignment/TokenAligning.swift	Adds alignment protocol.
Sources/MLXAudioG2P/Alignment/HeuristicTokenAligner.swift	Adds heuristic aligner implementation.
Package.swift	Registers new products/targets and wires dependencies (incl. TTS depending on Modules).
Package.resolved	Updates resolved dependency versions (notably EventSource pin).
AGENTS.local.md	Adds repo-specific build/test rules and CI guidance.
.github/workflows/tests.yaml	Simplifies CI test run and skips SmokeTests + disables parallel testing.

beshkenadze · 2026-03-23T09:44:59Z

Copilot review — addressed

Fixed (commit 90f53da):

#2 fromPretrained: now calls textProcessor?.prepare() before returning model
#4 prepareInputs: guards tokens.isEmpty and throws AudioGenerationError.invalidInput
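
The empty-token guard mentioned for comment #4 might look roughly like the sketch below. `SketchGenerationError` and `validateTokens` are stand-ins invented for this example; the repository's `AudioGenerationError` and `prepareInputs` are not reproduced here and may have different shapes.

```swift
// Stand-in error type for illustration only; not the repository's AudioGenerationError.
enum SketchGenerationError: Error, Equatable {
    case invalidInput(String)
}

/// Rejects an empty token sequence before any model work is attempted.
/// Hypothetical helper showing the guard-and-throw pattern.
func validateTokens(_ tokens: [Int]) throws -> [Int] {
    guard !tokens.isEmpty else {
        throw SketchGenerationError.invalidInput("Text produced no tokens")
    }
    return tokens
}

do {
    _ = try validateTokens([])
} catch {
    print("Rejected: \(error)")
}
```

Failing early with a typed error gives callers something to handle, rather than letting a zero-length array reach the model.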

Replied (no change needed):

#1 iSTFT CPU loops: valid perf observation, logged as follow-up optimization
#3 hfToken not forwarded: pre-existing upstream pattern across all model loaders
#5 Hard-coded 24kHz: model family constraint, not configurable
#6 Hard-coded hop size 5: matches genIstftHopSize config, changing requires retraining

ByT5-based neural G2P supporting 100+ languages via beshkenadze/g2p-multilingual-byT5-tiny-mlx (20.8M params, MIT). Provides NeuralPhonemizer conforming to the Phonemizing protocol from MLXAudioG2P for seamless integration as a fallback phonemizer.
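
One way to picture a neural phonemizer acting as a fallback behind a lexicon is the sketch below. Every name in it (`SketchPhonemizing`, `DictionaryPhonemizer`, `PlaceholderNeuralPhonemizer`, `FallbackChain`) is hypothetical, and the placeholder does not run a model; the real `Phonemizing` protocol and `NeuralPhonemizer` signatures are not shown in this PR description.

```swift
// Hypothetical stand-in for a phonemizer protocol; the real signature may differ.
protocol SketchPhonemizing {
    func phonemize(_ word: String) -> String?
}

/// Looks words up in a fixed dictionary (case-insensitive on the key).
struct DictionaryPhonemizer: SketchPhonemizing {
    let entries: [String: String]
    func phonemize(_ word: String) -> String? { entries[word.lowercased()] }
}

/// Placeholder for a model-backed phonemizer: it only wraps the word in
/// angle brackets so the fallback path is visible in the output.
struct PlaceholderNeuralPhonemizer: SketchPhonemizing {
    func phonemize(_ word: String) -> String? { "<\(word)>" }
}

/// Tries the primary phonemizer first and consults the fallback on a miss.
struct FallbackChain: SketchPhonemizing {
    let primary: any SketchPhonemizing
    let fallback: any SketchPhonemizing
    func phonemize(_ word: String) -> String? {
        primary.phonemize(word) ?? fallback.phonemize(word)
    }
}

let chain = FallbackChain(
    primary: DictionaryPhonemizer(entries: ["hello": "həˈloʊ"]),
    fallback: PlaceholderNeuralPhonemizer()
)
print(chain.phonemize("Hello") ?? "nil")
print(chain.phonemize("zyx") ?? "nil")
```

Because both stages share one protocol, the chain itself is a phonemizer and can be nested or swapped without changing callers.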

Extract shared ALBERT and config types used by both KittenTTS and Kokoro:
- Shared/Albert.swift: ALBERT encoder (6 classes)
- Shared/SharedConfigs.swift: PLBertConfig + ISTFTNetConfig

Move G2P stack to top-level G2P/ directory (17 files):
- English G2P pipeline with BART fallback network
- Lexicon with gold/silver dictionary support
- MisakiTextProcessor with HF resource download

Add TextProcessor protocol for model-agnostic text processing.

BART G2P resources now downloaded from HuggingFace (beshkenadze/kitten-tts-g2p) instead of bundled in repo.

beshkenadze · 2026-03-23T20:00:46Z

Restructuring into cleaner PRs

MLXAudioCodecs/Mimi/Mimi.swift had `import Tokenizers` but no actual Tokenizers.* reference, so the import was dead. It only resolved before because mlx-swift-lm <2.x re-exported Tokenizers transitively through MLXLMCommon. After mlx-swift-lm#118 (tokenizer decoupling), the re-export is gone and downstream consumers fail with:

Mimi.swift:7:8: error: unable to resolve module dependency: 'Tokenizers'

Two fixes:
1. Add the explicit Transformers product dependency to MLXAudioCodecs target in Package.swift, mirroring what MLXAudioSTT/STS/TTS already do.
2. Remove the dead `import Tokenizers` line from Mimi.swift to reduce the surface area. The MimiTokenizer class defined inside Mimi.swift is a local type, not related to swift-transformers, so removing the import is safe.

Verified with `swift build` against the current mlx-swift-lm main pin (post-Blaizzy#118).

After mlx-swift-lm#118 (tokenizer decoupling), MLXLMCommon now defines its own internal Tokenizer protocol. The MLXAudioSTT models that import both MLXLMCommon and Tokenizers (swift-transformers) now hit:

error: 'Tokenizer' is ambiguous for type lookup in this context

Mechanical fix: qualify the standalone Tokenizer type references with the `Tokenizers.` namespace prefix in five files. The actual Tokenizer type used by these models is the swift-transformers one (loaded via AutoTokenizer.from), so this is the correct qualification.

Files touched:
- Qwen3ASR.swift (1 site)
- Qwen3ForcedAligner.swift (1 site)
- GLMASR.swift (3 sites)
- GraniteSpeech.swift (2 sites)
- StreamingInferenceSession.swift (1 site, "any Tokenizer" form)

Verified with `swift build` against the current mlx-swift-lm pin; forward compatible with the post-Blaizzy#118 API.

beshkenadze added 3 commits March 23, 2026 00:07

fix: address Copilot review — rename SmokeTests, reorder normalizer r…

fb88daf

…ules, remove stale doc link

Copilot AI review requested due to automatic review settings March 23, 2026 09:35

Copilot started reviewing on behalf of beshkenadze March 23, 2026 09:35

beshkenadze mentioned this pull request Mar 23, 2026

feat: add Kokoro TTS with multilingual support #119

Closed

Copilot AI reviewed Mar 23, 2026

beshkenadze added 2 commits March 23, 2026 18:10

chore: remove AGENTS.local.md from repo

edfd060

Local development notes — should not be tracked in version control.

chore: add AGENTS.local.md to .gitignore

e1de00e

beshkenadze force-pushed the feat/kitten-tts branch from b3a5fcf to 5bbb79b Compare March 23, 2026 16:12

fix: remove version constant from MLXAudioG2P namespace

0af92ec

beshkenadze force-pushed the feat/kitten-tts branch from 5bbb79b to ff438ab Compare March 23, 2026 18:35

beshkenadze added 5 commits March 23, 2026 21:21

refactor: move MLXAudioModules into Models/StyleTTS2/Blocks, remove a…

8ff81a3

…s separate package

fix: address Copilot review — trim phonemize output, pattern-match ig…

9afea89

…nored keys, use MLX range, update scheme list

fix: throwing init for EnglishFallbackNetwork + add prepare() to Text…

8b4d397

…Processor protocol

beshkenadze force-pushed the feat/kitten-tts branch from ff438ab to a5d5eea Compare March 23, 2026 19:24

feat: add KittenTTS text-to-speech model with Module-based architecture

5fa7de7

beshkenadze force-pushed the feat/kitten-tts branch from a5d5eea to 5fa7de7 Compare March 23, 2026 19:24

beshkenadze closed this Mar 23, 2026

beshkenadze wants to merge 12 commits into Blaizzy:main from beshkenadze:feat/kitten-tts