
feat: add KittenTTS text-to-speech model #118

Closed
beshkenadze wants to merge 12 commits into Blaizzy:main from beshkenadze:feat/kitten-tts

Conversation

@beshkenadze
Contributor

Summary

Model

  • lucasnewman/kitten-tts-en-us (English US/GB)
  • Module-based architecture with @ModuleInfo, quantization support
  • Voices loaded from single voices.safetensors file

Dependencies

Files

| File | Lines | Description |
|---|---|---|
| KittenTTSConfig.swift | 72 | Model config using shared PLBertConfig + ISTFTNetConfig |
| KittenTTSModel.swift | 330 | Main model: load, sanitize, quantize, generate |
| KittenTTSModules.swift | 147 | TextEncoder, DurationEncoder, ProsodyPredictor |
| KittenTTSISTFTNet.swift | 231 | Generator + Decoder with iSTFT |
| KittenTTSTextCleaner.swift | 22 | IPA symbol-to-index mapping |
| TTSModel.swift | +8 | Add kitten_tts case + inference rule |
| Tests | +180 | Config, text cleaner, model structure tests |
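As a rough illustration of what an "IPA symbol-to-index mapping" like the one attributed to KittenTTSTextCleaner.swift could look like, here is a minimal sketch. The type name `SymbolIndexCleaner`, the tiny symbol string, and the choice to drop unknown symbols are all assumptions for illustration; the real vocabulary and behavior are not shown in this PR page.

```swift
import Foundation

/// Hypothetical cleaner: maps each known symbol to its position in a symbol table.
/// The symbol set below is a placeholder, not the KittenTTS vocabulary.
struct SymbolIndexCleaner {
    private let indexBySymbol: [Character: Int]

    init(symbols: String) {
        var table: [Character: Int] = [:]
        // The first occurrence of a symbol wins if the table contains duplicates.
        for (index, symbol) in symbols.enumerated() where table[symbol] == nil {
            table[symbol] = index
        }
        indexBySymbol = table
    }

    /// Converts text to token indices, silently dropping symbols not in the table.
    func callAsFunction(_ text: String) -> [Int] {
        text.compactMap { indexBySymbol[$0] }
    }
}

let cleaner = SymbolIndexCleaner(symbols: "$ abəɛ")
let ids = cleaner("ab əɛ?")  // "?" is not in the table, so it is dropped
```

Dropping unknown symbols keeps the sketch short; a real cleaner might instead log or map them to a dedicated unknown index.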

…o-phoneme

Add two new foundation modules:

- MLXAudioModules: reusable neural network building blocks (BiLSTM,
  WeightNormedConv, InstanceNorm, AdaIN, ResBlocks, SineGenerator, etc.)
  shared across TTS models

- MLXAudioG2P: clean-room grapheme-to-phoneme pipeline with CMUdict
  lexicon (BSD-2), text normalization, alignment, and extensible
  language pack architecture

Also updates CI workflow to use struct-based test filtering and
adds AGENTS.local.md with build/test conventions.
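To make the CMUdict lexicon step concrete, the following is a small sketch of turning one CMUdict-style line into an IPA string. The function name `parseCMUDictLine`, the partial phone table, and the decision to strip stress digits are assumptions for illustration and are not taken from the MLXAudioG2P sources.

```swift
import Foundation

/// Partial ARPAbet-to-IPA table (illustrative subset only).
let arpabetToIPA: [String: String] = [
    "HH": "h", "L": "l", "W": "w", "R": "ɹ", "D": "d",
    "AH": "ʌ", "OW": "oʊ", "ER": "ɝ",
]

/// Converts one CMUdict-style line ("WORD  PH1 PH2 ...") into (word, IPA).
/// Stress digits are stripped; unstressed AH (AH0) is rendered as schwa.
/// Returns nil for comment lines, malformed lines, or phones missing from the table.
func parseCMUDictLine(_ line: String) -> (word: String, ipa: String)? {
    // Lines starting with ";;;" are comments in the CMUdict format.
    guard !line.hasPrefix(";;;") else { return nil }
    let fields = line.split(separator: " ", omittingEmptySubsequences: true)
    guard let word = fields.first, fields.count > 1 else { return nil }

    var ipa = ""
    for phone in fields.dropFirst() {
        if phone == "AH0" {
            ipa += "ə"
            continue
        }
        let base = String(phone.filter { !$0.isNumber })
        guard let symbol = arpabetToIPA[base] else { return nil }
        ipa += symbol
    }
    return (String(word).lowercased(), ipa)
}

let entry = parseCMUDictLine("HELLO  HH AH0 L OW1")
```

A fuller mapper would also emit primary and secondary stress marks rather than discarding the digits.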
CMUDictLoader.load(from:) accepts directory URL instead of Bundle.module.
EnglishLanguagePack.withCMUDict(directory:) and G2PPipeline.english(cmuDictDirectory:)
take directory URL. Resources uploaded to beshkenadze/cmudict-ipa on HuggingFace.
Tests guarded with MLXAUDIO_CMUDICT_DIR env var.
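The env-var guard described above could be sketched as follows. The helper name `cmuDictDirectory` is hypothetical; only the `MLXAUDIO_CMUDICT_DIR` variable and the directory-URL convention come from the commit message.

```swift
import Foundation

/// Returns the CMUdict directory named by MLXAUDIO_CMUDICT_DIR,
/// or nil when the variable is unset or empty (tests would then skip).
func cmuDictDirectory(
    environment: [String: String] = ProcessInfo.processInfo.environment
) -> URL? {
    guard let path = environment["MLXAUDIO_CMUDICT_DIR"], !path.isEmpty else {
        return nil
    }
    return URL(fileURLWithPath: path, isDirectory: true)
}

// In a test, the result would gate the body, for example:
//   guard let directory = cmuDictDirectory() else { return }  // skip without resources
//   ... pass `directory` to CMUDictLoader.load(from:) ...
```

Passing the environment as a parameter keeps the helper easy to exercise without mutating the process environment.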
Copilot AI review requested due to automatic review settings March 23, 2026 09:35

Copilot AI left a comment


Pull request overview

Adds a new KittenTTS text-to-speech implementation (ALBERT + prosody + iSTFT-Net) and introduces shared/supporting modules (G2P, neural G2P, reusable NN blocks) plus CI/test updates to support the new stack.

Changes:

  • Add KittenTTSModel + supporting modules/config/text cleaning, and register kitten_tts in TTS.loadModel.
  • Add new reusable libraries: MLXAudioModules, MLXAudioG2P, MLXAudioNeuralG2P (and associated unit/integration tests).
  • Update package products/targets and CI workflow to run tests in one pass while skipping SmokeTests.

Reviewed changes

Copilot reviewed 80 out of 81 changed files in this pull request and generated 6 comments.

| File | Description |
|---|---|
| Tests/MLXAudioTTSTests.swift | Adds KittenTTS-focused tests (config, text cleaner, weight key integration). |
| Tests/MLXAudioNeuralG2PTests.swift | Adds tokenizer/weight sanitization/config tests + optional network integration tests. |
| Tests/MLXAudioG2PTextNormalizerTests.swift | Adds unit tests for normalization/tokenization behavior. |
| Tests/MLXAudioG2PSmokeTests.swift | Adds basic G2P pipeline smoke coverage. |
| Tests/MLXAudioG2PLexiconTests.swift | Adds lexicon + fallback behavior tests. |
| Tests/MLXAudioG2PCMUDictTests.swift | Adds CMUDict parsing/loading/IPA mapping tests (env-gated). |
| Tests/MLXAudioG2PAlignmentTests.swift | Adds alignment tests for heuristic aligner. |
| Sources/MLXAudioTTS/TextProcessor.swift | Introduces TextProcessor protocol with async prepare(). |
| Sources/MLXAudioTTS/TTSModel.swift | Adds textProcessor parameter + registers Kitten model type inference/loader. |
| Sources/MLXAudioTTS/Shared/SharedConfigs.swift | Adds shared PLBertConfig and ISTFTNetConfig. |
| Sources/MLXAudioTTS/Shared/Albert.swift | Adds ALBERT encoder implementation used by KittenTTS. |
| Sources/MLXAudioTTS/Models/KittenTTS/KittenTTSTextCleaner.swift | Adds symbol-to-index mapping + text cleaning for KittenTTS tokens. |
| Sources/MLXAudioTTS/Models/KittenTTS/KittenTTSModules.swift | Adds Kitten text/duration/prosody encoder modules. |
| Sources/MLXAudioTTS/Models/KittenTTS/KittenTTSModel.swift | Adds main KittenTTS model loading, quantization, and generation pipeline. |
| Sources/MLXAudioTTS/Models/KittenTTS/KittenTTSISTFTNet.swift | Adds iSTFT-Net generator/decoder implementation for waveform synthesis. |
| Sources/MLXAudioTTS/Models/KittenTTS/KittenTTSConfig.swift | Adds KittenTTS config decoding + quantization config support. |
| Sources/MLXAudioTTS/G2P/TokenContext.swift | Adds Misaki G2P token context helper. |
| Sources/MLXAudioTTS/G2P/MisakiTextProcessor.swift | Adds downloadable-resource-backed text processor for IPA phonemization. |
| Sources/MLXAudioTTS/G2P/MToken.swift | Adds Misaki token structures. |
| Sources/MLXAudioTTS/G2P/Lexicon/PennTagUtil.swift | Adds NLTag→Penn tag mapping utilities used by Misaki lexicon logic. |
| Sources/MLXAudioTTS/G2P/Lexicon/Lexicon.swift | Adds Misaki lexicon/transcription logic. |
| Sources/MLXAudioTTS/G2P/Lexicon/DataResourcesUtil.swift | Adds resource loading helpers for Misaki lexicon data. |
| Sources/MLXAudioTTS/G2P/G2PExtensions.swift | Adds small extensions used by Misaki implementation. |
| Sources/MLXAudioTTS/G2P/FallbackNetwork/MultiHeadAttention.swift | Adds BART attention block for Misaki neural fallback. |
| Sources/MLXAudioTTS/G2P/FallbackNetwork/FeedForward.swift | Adds FFN block for Misaki neural fallback. |
| Sources/MLXAudioTTS/G2P/FallbackNetwork/EnglishFallbackNetwork.swift | Adds BART-based fallback network wiring and resource loading. |
| Sources/MLXAudioTTS/G2P/FallbackNetwork/BARTModel.swift | Adds BART model implementation + greedy generation. |
| Sources/MLXAudioTTS/G2P/FallbackNetwork/BARTLayerNorm.swift | Adds BART-specific LayerNorm wrapper. |
| Sources/MLXAudioTTS/G2P/FallbackNetwork/BARTEncoderLayer.swift | Adds BART encoder layer implementation. |
| Sources/MLXAudioTTS/G2P/FallbackNetwork/BARTDecoderLayer.swift | Adds BART decoder layer implementation. |
| Sources/MLXAudioTTS/G2P/FallbackNetwork/BARTConfig.swift | Adds BART config decoding for fallback resources. |
| Sources/MLXAudioTTS/G2P/EnglishNum2Word.swift | Adds number-to-words conversion used by Misaki pipeline. |
| Sources/MLXAudioTTS/G2P/EnglishG2P.swift | Adds Misaki English G2P pipeline implementation. |
| Sources/MLXAudioNeuralG2P/Weights.swift | Adds T5 weight loading + key sanitization. |
| Sources/MLXAudioNeuralG2P/Tokenizer.swift | Adds ByT5 byte-level tokenizer implementation. |
| Sources/MLXAudioNeuralG2P/RelativePositionBias.swift | Adds relative position bias + bucketing logic. |
| Sources/MLXAudioNeuralG2P/NeuralPhonemizer.swift | Adds Phonemizing adapter around neural G2P. |
| Sources/MLXAudioNeuralG2P/Model.swift | Adds T5 model wrapper for conditional generation. |
| Sources/MLXAudioNeuralG2P/G2P.swift | Adds inference wrapper for converting words to IPA using T5. |
| Sources/MLXAudioNeuralG2P/FeedForward.swift | Adds T5 FFN block. |
| Sources/MLXAudioNeuralG2P/EncoderLayer.swift | Adds T5 encoder layer. |
| Sources/MLXAudioNeuralG2P/Encoder.swift | Adds T5 encoder. |
| Sources/MLXAudioNeuralG2P/DecoderLayer.swift | Adds T5 decoder layer with KV cache support. |
| Sources/MLXAudioNeuralG2P/Decoder.swift | Adds T5 decoder with causal + position bias masking. |
| Sources/MLXAudioNeuralG2P/Config.swift | Adds T5 config decoding/loading from model dir. |
| Sources/MLXAudioNeuralG2P/Attention.swift | Adds attention core + KV cache type. |
| Sources/MLXAudioModules/WeightNormedConv.swift | Adds reusable weight-normalized conv module. |
| Sources/MLXAudioModules/Utilities.swift | Adds 1D interpolation helper. |
| Sources/MLXAudioModules/UpSample1d.swift | Adds simple upsample wrapper module. |
| Sources/MLXAudioModules/SineGenerator.swift | Adds harmonic source generation (NSF-style). |
| Sources/MLXAudioModules/ResidualBlocks.swift | Adds AdaIN residual blocks used by vocoders/decoders. |
| Sources/MLXAudioModules/Normalization.swift | Adds InstanceNorm/AdaIN/AdaLayerNorm blocks. |
| Sources/MLXAudioModules/LinearNorm.swift | Adds linear projection wrapper. |
| Sources/MLXAudioModules/BiLSTM.swift | Adds BiLSTM implementation used by TTS modules. |
| Sources/MLXAudioG2P/Tokenization/TextTokenizer.swift | Adds deterministic tokenization. |
| Sources/MLXAudioG2P/Tokenization/G2PToken.swift | Adds token structure/types. |
| Sources/MLXAudioG2P/TextNormalization/TextNormalizer.swift | Adds rule-based normalizer. |
| Sources/MLXAudioG2P/TextNormalization/NormalizationRule.swift | Adds normalization rule primitives and defaults. |
| Sources/MLXAudioG2P/README.md | Documents MLXAudioG2P scope/usage. |
| Sources/MLXAudioG2P/Pipeline/G2PPipeline.swift | Adds pipeline orchestration (normalize→tokenize→lexicon/fallback→alignment). |
| Sources/MLXAudioG2P/Pipeline/G2POutput.swift | Adds pipeline output model. |
| Sources/MLXAudioG2P/Pipeline/G2PInput.swift | Adds pipeline input model (future-facing). |
| Sources/MLXAudioG2P/Pipeline/G2PError.swift | Adds typed pipeline error set. |
| Sources/MLXAudioG2P/Phonemes/PhonemeUnit.swift | Adds phoneme unit model. |
| Sources/MLXAudioG2P/Phonemes/PhonemeSequence.swift | Adds phoneme sequence + rendering helpers. |
| Sources/MLXAudioG2P/MLXAudioG2P.swift | Adds module namespace + version. |
| Sources/MLXAudioG2P/Lexicon/LexiconProviding.swift | Adds lexicon lookup protocol. |
| Sources/MLXAudioG2P/Lexicon/LexiconEntry.swift | Adds lexicon entry model. |
| Sources/MLXAudioG2P/Lexicon/InMemoryLexicon.swift | Adds simple in-memory lexicon implementation. |
| Sources/MLXAudioG2P/Lexicon/CMUDict/CMUDictParser.swift | Adds CMUDict parsing support. |
| Sources/MLXAudioG2P/Lexicon/CMUDict/CMUDictLoader.swift | Adds CMUDict loader producing InMemoryLexicon. |
| Sources/MLXAudioG2P/Lexicon/CMUDict/ARPAbetMapper.swift | Adds ARPAbet→IPA mapping. |
| Sources/MLXAudioG2P/Languages/English/EnglishLanguagePack.swift | Adds English defaults + CMUDict factory. |
| Sources/MLXAudioG2P/Fallback/FallbackPhonemizer.swift | Adds rule-based fallback phonemizer. |
| Sources/MLXAudioG2P/Alignment/TokenAlignment.swift | Adds token↔phoneme alignment model. |
| Sources/MLXAudioG2P/Alignment/TokenAligning.swift | Adds alignment protocol. |
| Sources/MLXAudioG2P/Alignment/HeuristicTokenAligner.swift | Adds heuristic aligner implementation. |
| Package.swift | Registers new products/targets and wires dependencies (incl. TTS depending on Modules). |
| Package.resolved | Updates resolved dependency versions (notably EventSource pin). |
| AGENTS.local.md | Adds repo-specific build/test rules and CI guidance. |
| .github/workflows/tests.yaml | Simplifies CI test run and skips SmokeTests + disables parallel testing. |
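For readers unfamiliar with the "ByT5 byte-level tokenizer" listed for Sources/MLXAudioNeuralG2P/Tokenizer.swift, the published ByT5 convention reserves ids 0, 1, and 2 for padding, end-of-sequence, and unknown, and maps each UTF-8 byte `b` to id `b + 3`. The sketch below follows that convention; the type name `ByteLevelTokenizer` is hypothetical, and whether the PR's tokenizer adds prefixes or handles extra special tokens is not shown here.

```swift
import Foundation

/// Byte-level tokenizer sketch following the ByT5 convention:
/// ids 0, 1, 2 are pad, EOS, unknown; byte b maps to id b + 3.
struct ByteLevelTokenizer {
    static let padID = 0
    static let eosID = 1
    static let byteOffset = 3

    /// Encodes text as offset UTF-8 bytes followed by a single EOS id.
    func encode(_ text: String) -> [Int] {
        let byteIDs = text.utf8.map { Int($0) + Self.byteOffset }
        return byteIDs + [Self.eosID]
    }

    /// Decodes ids back to text, skipping special ids outside the byte range.
    func decode(_ ids: [Int]) -> String {
        let bytes = ids.compactMap { id -> UInt8? in
            let value = id - Self.byteOffset
            return (0...255).contains(value) ? UInt8(value) : nil
        }
        return String(decoding: bytes, as: UTF8.self)
    }
}

let tokenizer = ByteLevelTokenizer()
let encoded = tokenizer.encode("hi")  // "h" is byte 104, "i" is byte 105
```

Because the vocabulary is just the byte range plus a few specials, no vocabulary file is needed, which is one reason byte-level models suit multilingual G2P.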


Comment thread Sources/MLXAudioTTS/Models/StyleTTS2/KittenTTS/KittenTTSModel.swift
Comment thread Sources/MLXAudioTTS/TTSModel.swift
Comment thread Sources/MLXAudioTTS/Models/StyleTTS2/KittenTTS/KittenTTSModel.swift
@beshkenadze
Contributor Author

Copilot review — addressed

Fixed (commit 90f53da):

Replied (no change needed):

Local development notes — should not be tracked in version control.
ByT5-based neural G2P supporting 100+ languages via
beshkenadze/g2p-multilingual-byT5-tiny-mlx (20.8M params, MIT).

Provides NeuralPhonemizer conforming to the Phonemizing protocol
from MLXAudioG2P for seamless integration as a fallback phonemizer.
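The fallback arrangement described here (a lexicon first, a neural phonemizer when the lexicon misses) could be sketched as below. The protocol shape is an assumption: the real `Phonemizing` protocol in MLXAudioG2P may have a different signature, and `DictionaryPhonemizer`, `SpellingFallback`, and `ChainedPhonemizer` are hypothetical names, with the spelling fallback standing in for a neural model.

```swift
import Foundation

// Hypothetical stand-in for the PR's Phonemizing protocol; the real signature may differ.
protocol Phonemizing {
    func phonemize(_ word: String) throws -> String?
}

/// Lexicon-style lookup: returns nil when the word is not in the dictionary.
struct DictionaryPhonemizer: Phonemizing {
    let entries: [String: String]
    func phonemize(_ word: String) throws -> String? {
        entries[word.lowercased()]
    }
}

/// Placeholder for a neural model: it simply spells the word out letter by letter.
struct SpellingFallback: Phonemizing {
    func phonemize(_ word: String) throws -> String? {
        word.lowercased().map { String($0) }.joined(separator: " ")
    }
}

/// Tries each phonemizer in order and returns the first non-nil result.
struct ChainedPhonemizer: Phonemizing {
    let stages: [any Phonemizing]
    func phonemize(_ word: String) throws -> String? {
        for stage in stages {
            if let result = try stage.phonemize(word) { return result }
        }
        return nil
    }
}

let chain = ChainedPhonemizer(stages: [
    DictionaryPhonemizer(entries: ["hello": "həloʊ"]),
    SpellingFallback(),
])
```

Keeping the chain itself a `Phonemizing` value means callers do not need to know whether a lexicon, a neural model, or both are in play.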
…nored keys, use MLX range, update scheme list
Extract shared ALBERT and config types used by both KittenTTS and Kokoro:
- Shared/Albert.swift: ALBERT encoder (6 classes)
- Shared/SharedConfigs.swift: PLBertConfig + ISTFTNetConfig

Move G2P stack to top-level G2P/ directory (17 files):
- English G2P pipeline with BART fallback network
- Lexicon with gold/silver dictionary support
- MisakiTextProcessor with HF resource download

Add TextProcessor protocol for model-agnostic text processing.

BART G2P resources now downloaded from HuggingFace
(beshkenadze/kitten-tts-g2p) instead of bundled in repo.
@beshkenadze
Contributor Author

Restructuring into cleaner PRs

antmanler added a commit to platx-ai/mlx-audio-swift that referenced this pull request Apr 14, 2026
MLXAudioCodecs/Mimi/Mimi.swift had `import Tokenizers` but no actual
Tokenizers.* reference — the import was dead. It only resolved before
because mlx-swift-lm <2.x re-exported Tokenizers transitively through
MLXLMCommon. After mlx-swift-lm#118 (tokenizer decoupling), the
re-export is gone and downstream consumers fail with:

  Mimi.swift:7:8: error: unable to resolve module dependency: 'Tokenizers'

Two fixes:

1. Add the explicit Transformers product dependency to MLXAudioCodecs
   target in Package.swift, mirroring what MLXAudioSTT/STS/TTS already do.
2. Remove the dead `import Tokenizers` line from Mimi.swift to reduce
   the surface area.

The MimiTokenizer class defined inside Mimi.swift is a local type, not
related to swift-transformers, so removing the import is safe.

Verified with `swift build` against the current mlx-swift-lm main pin
(post-Blaizzy#118).
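For reference, fix 1 would look roughly like the Package.swift fragment below. This is a sketch, not the actual diff: the `Transformers` product name and the `MLXAudioCodecs` target come from the commit message, while the `swift-transformers` package identity and the placeholder for existing dependencies are assumptions.

```swift
// Fragment of the targets array in Package.swift (not a complete manifest).
.target(
    name: "MLXAudioCodecs",
    dependencies: [
        // ...existing dependencies stay as they are...
        // Explicit dependency instead of relying on a transitive re-export.
        .product(name: "Transformers", package: "swift-transformers"),
    ]
)
```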
antmanler added a commit to platx-ai/mlx-audio-swift that referenced this pull request Apr 14, 2026
After mlx-swift-lm#118 (tokenizer decoupling), MLXLMCommon now defines
its own internal Tokenizer protocol. The MLXAudioSTT models that import
both MLXLMCommon and Tokenizers (swift-transformers) now hit:

  error: 'Tokenizer' is ambiguous for type lookup in this context

Mechanical fix: qualify the standalone Tokenizer type references with
the Tokenizers. namespace prefix in five files. The actual Tokenizer
type used by these models is the swift-transformers one (loaded via
AutoTokenizer.from), so this is the correct qualification.

Files touched:
  - Qwen3ASR.swift                (1 site)
  - Qwen3ForcedAligner.swift      (1 site)
  - GLMASR.swift                  (3 sites)
  - GraniteSpeech.swift           (2 sites)
  - StreamingInferenceSession.swift (1 site, "any Tokenizer" form)

Verified with `swift build` against current mlx-swift-lm pin and forward
compatible with the post-Blaizzy#118 API.

2 participants