feat: add KittenTTS text-to-speech model #118
Closed
beshkenadze wants to merge 12 commits into Blaizzy:main
…o-phoneme

Add two new foundation modules:
- MLXAudioModules: reusable neural network building blocks (BiLSTM, WeightNormedConv, InstanceNorm, AdaIN, ResBlocks, SineGenerator, etc.) shared across TTS models
- MLXAudioG2P: clean-room grapheme-to-phoneme pipeline with CMUdict lexicon (BSD-2), text normalization, alignment, and extensible language pack architecture

Also updates CI workflow to use struct-based test filtering and adds AGENTS.local.md with build/test conventions.
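For readers unfamiliar with the lexicon-plus-fallback shape described above, here is a minimal sketch. All names (`G2PPipelineSketch`, `phonemize`) are hypothetical and the normalization is reduced to lowercasing; it is not the MLXAudioG2P API, only an illustration of looking words up in a lexicon and deferring to a fallback for out-of-vocabulary tokens.

```swift
import Foundation

/// Hypothetical sketch of a lexicon-first G2P pass; not the PR's G2PPipeline.
struct G2PPipelineSketch {
    /// Lowercased word -> IPA string (stands in for a CMUdict-backed lexicon).
    var lexicon: [String: String]
    /// Called for words that are missing from the lexicon.
    var fallback: (String) -> String

    func phonemize(_ text: String) -> [String] {
        // Normalize by lowercasing, then tokenize on anything that is
        // neither a letter nor an apostrophe.
        let tokens = text.lowercased()
            .split(whereSeparator: { !($0.isLetter || $0 == "'") })
            .map(String.init)
        // Lexicon hit wins; otherwise defer to the fallback phonemizer.
        return tokens.map { lexicon[$0] ?? fallback($0) }
    }
}
```

A caller would construct it with a dictionary and a closure, for example a fallback that wraps unknown words in angle brackets while a real phonemizer is not available.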
CMUDictLoader.load(from:) accepts directory URL instead of Bundle.module. EnglishLanguagePack.withCMUDict(directory:) and G2PPipeline.english(cmuDictDirectory:) take directory URL. Resources uploaded to beshkenadze/cmudict-ipa on HuggingFace. Tests guarded with MLXAUDIO_CMUDICT_DIR env var.
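A minimal sketch of how an env-gated test could resolve the directory before handing it to `CMUDictLoader.load(from:)`. The helper name `resolveCMUDictDirectory` is hypothetical; only the `MLXAUDIO_CMUDICT_DIR` variable comes from this commit.

```swift
import Foundation

/// Hypothetical helper: returns the directory named by MLXAUDIO_CMUDICT_DIR,
/// or nil when the variable is unset, empty, or not an existing directory.
func resolveCMUDictDirectory(
    environment: [String: String] = ProcessInfo.processInfo.environment
) -> URL? {
    guard let path = environment["MLXAUDIO_CMUDICT_DIR"], !path.isEmpty else {
        return nil
    }
    var isDirectory: ObjCBool = false
    guard FileManager.default.fileExists(atPath: path, isDirectory: &isDirectory),
          isDirectory.boolValue else {
        return nil
    }
    return URL(fileURLWithPath: path, isDirectory: true)
}
```

A test would return early (or skip) when this yields nil, so machines without the downloaded resources still pass.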
…ules, remove stale doc link
Pull request overview
Adds a new KittenTTS text-to-speech implementation (ALBERT + prosody + iSTFT-Net) and introduces shared/supporting modules (G2P, neural G2P, reusable NN blocks) plus CI/test updates to support the new stack.
Changes:
- Add `KittenTTSModel` + supporting modules/config/text cleaning, and register `kitten_tts` in `TTS.loadModel`.
- Add new reusable libraries: `MLXAudioModules`, `MLXAudioG2P`, `MLXAudioNeuralG2P` (and associated unit/integration tests).
- Update package products/targets and CI workflow to run tests in one pass while skipping SmokeTests.
Reviewed changes
Copilot reviewed 80 out of 81 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| Tests/MLXAudioTTSTests.swift | Adds KittenTTS-focused tests (config, text cleaner, weight key integration). |
| Tests/MLXAudioNeuralG2PTests.swift | Adds tokenizer/weight sanitization/config tests + optional network integration tests. |
| Tests/MLXAudioG2PTextNormalizerTests.swift | Adds unit tests for normalization/tokenization behavior. |
| Tests/MLXAudioG2PSmokeTests.swift | Adds basic G2P pipeline smoke coverage. |
| Tests/MLXAudioG2PLexiconTests.swift | Adds lexicon + fallback behavior tests. |
| Tests/MLXAudioG2PCMUDictTests.swift | Adds CMUDict parsing/loading/IPA mapping tests (env-gated). |
| Tests/MLXAudioG2PAlignmentTests.swift | Adds alignment tests for heuristic aligner. |
| Sources/MLXAudioTTS/TextProcessor.swift | Introduces TextProcessor protocol with async prepare(). |
| Sources/MLXAudioTTS/TTSModel.swift | Adds textProcessor parameter + registers Kitten model type inference/loader. |
| Sources/MLXAudioTTS/Shared/SharedConfigs.swift | Adds shared PLBertConfig and ISTFTNetConfig. |
| Sources/MLXAudioTTS/Shared/Albert.swift | Adds ALBERT encoder implementation used by KittenTTS. |
| Sources/MLXAudioTTS/Models/KittenTTS/KittenTTSTextCleaner.swift | Adds symbol-to-index mapping + text cleaning for KittenTTS tokens. |
| Sources/MLXAudioTTS/Models/KittenTTS/KittenTTSModules.swift | Adds Kitten text/duration/prosody encoder modules. |
| Sources/MLXAudioTTS/Models/KittenTTS/KittenTTSModel.swift | Adds main KittenTTS model loading, quantization, and generation pipeline. |
| Sources/MLXAudioTTS/Models/KittenTTS/KittenTTSISTFTNet.swift | Adds iSTFT-Net generator/decoder implementation for waveform synthesis. |
| Sources/MLXAudioTTS/Models/KittenTTS/KittenTTSConfig.swift | Adds KittenTTS config decoding + quantization config support. |
| Sources/MLXAudioTTS/G2P/TokenContext.swift | Adds Misaki G2P token context helper. |
| Sources/MLXAudioTTS/G2P/MisakiTextProcessor.swift | Adds downloadable-resource-backed text processor for IPA phonemization. |
| Sources/MLXAudioTTS/G2P/MToken.swift | Adds Misaki token structures. |
| Sources/MLXAudioTTS/G2P/Lexicon/PennTagUtil.swift | Adds NLTag→Penn tag mapping utilities used by Misaki lexicon logic. |
| Sources/MLXAudioTTS/G2P/Lexicon/Lexicon.swift | Adds Misaki lexicon/transcription logic. |
| Sources/MLXAudioTTS/G2P/Lexicon/DataResourcesUtil.swift | Adds resource loading helpers for Misaki lexicon data. |
| Sources/MLXAudioTTS/G2P/G2PExtensions.swift | Adds small extensions used by Misaki implementation. |
| Sources/MLXAudioTTS/G2P/FallbackNetwork/MultiHeadAttention.swift | Adds BART attention block for Misaki neural fallback. |
| Sources/MLXAudioTTS/G2P/FallbackNetwork/FeedForward.swift | Adds FFN block for Misaki neural fallback. |
| Sources/MLXAudioTTS/G2P/FallbackNetwork/EnglishFallbackNetwork.swift | Adds BART-based fallback network wiring and resource loading. |
| Sources/MLXAudioTTS/G2P/FallbackNetwork/BARTModel.swift | Adds BART model implementation + greedy generation. |
| Sources/MLXAudioTTS/G2P/FallbackNetwork/BARTLayerNorm.swift | Adds BART-specific LayerNorm wrapper. |
| Sources/MLXAudioTTS/G2P/FallbackNetwork/BARTEncoderLayer.swift | Adds BART encoder layer implementation. |
| Sources/MLXAudioTTS/G2P/FallbackNetwork/BARTDecoderLayer.swift | Adds BART decoder layer implementation. |
| Sources/MLXAudioTTS/G2P/FallbackNetwork/BARTConfig.swift | Adds BART config decoding for fallback resources. |
| Sources/MLXAudioTTS/G2P/EnglishNum2Word.swift | Adds number-to-words conversion used by Misaki pipeline. |
| Sources/MLXAudioTTS/G2P/EnglishG2P.swift | Adds Misaki English G2P pipeline implementation. |
| Sources/MLXAudioNeuralG2P/Weights.swift | Adds T5 weight loading + key sanitization. |
| Sources/MLXAudioNeuralG2P/Tokenizer.swift | Adds ByT5 byte-level tokenizer implementation. |
| Sources/MLXAudioNeuralG2P/RelativePositionBias.swift | Adds relative position bias + bucketing logic. |
| Sources/MLXAudioNeuralG2P/NeuralPhonemizer.swift | Adds Phonemizing adapter around neural G2P. |
| Sources/MLXAudioNeuralG2P/Model.swift | Adds T5 model wrapper for conditional generation. |
| Sources/MLXAudioNeuralG2P/G2P.swift | Adds inference wrapper for converting words to IPA using T5. |
| Sources/MLXAudioNeuralG2P/FeedForward.swift | Adds T5 FFN block. |
| Sources/MLXAudioNeuralG2P/EncoderLayer.swift | Adds T5 encoder layer. |
| Sources/MLXAudioNeuralG2P/Encoder.swift | Adds T5 encoder. |
| Sources/MLXAudioNeuralG2P/DecoderLayer.swift | Adds T5 decoder layer with KV cache support. |
| Sources/MLXAudioNeuralG2P/Decoder.swift | Adds T5 decoder with causal + position bias masking. |
| Sources/MLXAudioNeuralG2P/Config.swift | Adds T5 config decoding/loading from model dir. |
| Sources/MLXAudioNeuralG2P/Attention.swift | Adds attention core + KV cache type. |
| Sources/MLXAudioModules/WeightNormedConv.swift | Adds reusable weight-normalized conv module. |
| Sources/MLXAudioModules/Utilities.swift | Adds 1D interpolation helper. |
| Sources/MLXAudioModules/UpSample1d.swift | Adds simple upsample wrapper module. |
| Sources/MLXAudioModules/SineGenerator.swift | Adds harmonic source generation (NSF-style). |
| Sources/MLXAudioModules/ResidualBlocks.swift | Adds AdaIN residual blocks used by vocoders/decoders. |
| Sources/MLXAudioModules/Normalization.swift | Adds InstanceNorm/AdaIN/AdaLayerNorm blocks. |
| Sources/MLXAudioModules/LinearNorm.swift | Adds linear projection wrapper. |
| Sources/MLXAudioModules/BiLSTM.swift | Adds BiLSTM implementation used by TTS modules. |
| Sources/MLXAudioG2P/Tokenization/TextTokenizer.swift | Adds deterministic tokenization. |
| Sources/MLXAudioG2P/Tokenization/G2PToken.swift | Adds token structure/types. |
| Sources/MLXAudioG2P/TextNormalization/TextNormalizer.swift | Adds rule-based normalizer. |
| Sources/MLXAudioG2P/TextNormalization/NormalizationRule.swift | Adds normalization rule primitives and defaults. |
| Sources/MLXAudioG2P/README.md | Documents MLXAudioG2P scope/usage. |
| Sources/MLXAudioG2P/Pipeline/G2PPipeline.swift | Adds pipeline orchestration (normalize→tokenize→lexicon/fallback→alignment). |
| Sources/MLXAudioG2P/Pipeline/G2POutput.swift | Adds pipeline output model. |
| Sources/MLXAudioG2P/Pipeline/G2PInput.swift | Adds pipeline input model (future-facing). |
| Sources/MLXAudioG2P/Pipeline/G2PError.swift | Adds typed pipeline error set. |
| Sources/MLXAudioG2P/Phonemes/PhonemeUnit.swift | Adds phoneme unit model. |
| Sources/MLXAudioG2P/Phonemes/PhonemeSequence.swift | Adds phoneme sequence + rendering helpers. |
| Sources/MLXAudioG2P/MLXAudioG2P.swift | Adds module namespace + version. |
| Sources/MLXAudioG2P/Lexicon/LexiconProviding.swift | Adds lexicon lookup protocol. |
| Sources/MLXAudioG2P/Lexicon/LexiconEntry.swift | Adds lexicon entry model. |
| Sources/MLXAudioG2P/Lexicon/InMemoryLexicon.swift | Adds simple in-memory lexicon implementation. |
| Sources/MLXAudioG2P/Lexicon/CMUDict/CMUDictParser.swift | Adds CMUDict parsing support. |
| Sources/MLXAudioG2P/Lexicon/CMUDict/CMUDictLoader.swift | Adds CMUDict loader producing InMemoryLexicon. |
| Sources/MLXAudioG2P/Lexicon/CMUDict/ARPAbetMapper.swift | Adds ARPAbet→IPA mapping. |
| Sources/MLXAudioG2P/Languages/English/EnglishLanguagePack.swift | Adds English defaults + CMUDict factory. |
| Sources/MLXAudioG2P/Fallback/FallbackPhonemizer.swift | Adds rule-based fallback phonemizer. |
| Sources/MLXAudioG2P/Alignment/TokenAlignment.swift | Adds token↔phoneme alignment model. |
| Sources/MLXAudioG2P/Alignment/TokenAligning.swift | Adds alignment protocol. |
| Sources/MLXAudioG2P/Alignment/HeuristicTokenAligner.swift | Adds heuristic aligner implementation. |
| Package.swift | Registers new products/targets and wires dependencies (incl. TTS depending on Modules). |
| Package.resolved | Updates resolved dependency versions (notably EventSource pin). |
| AGENTS.local.md | Adds repo-specific build/test rules and CI guidance. |
| .github/workflows/tests.yaml | Simplifies CI test run and skips SmokeTests + disables parallel testing. |
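To illustrate the ARPAbet→IPA step listed for `ARPAbetMapper.swift`, here is a small sketch. The table is deliberately partial, the names `arpabetToIPASketch` and `ipaSketch(fromARPAbet:)` are hypothetical, and stress digits are simply dropped rather than rendered as IPA stress marks, which is an assumption and may differ from the PR's mapper.

```swift
import Foundation

// Partial, illustrative table; a real mapper covers the full ARPAbet inventory.
let arpabetToIPASketch: [String: String] = [
    "AE": "æ", "IY": "i", "CH": "tʃ",
    "K": "k", "T": "t", "S": "s", "P": "p", "Z": "z",
]

/// Converts a CMUdict-style pronunciation such as "K AE1 T" to an IPA string.
/// Returns nil if any symbol is missing from the (partial) table.
func ipaSketch(fromARPAbet pronunciation: String) -> String? {
    var result = ""
    for symbol in pronunciation.split(separator: " ") {
        // CMUdict vowels carry a trailing stress digit (0, 1, or 2); drop it.
        let base = symbol.last?.isNumber == true
            ? String(symbol.dropLast())
            : String(symbol)
        guard let ipa = arpabetToIPASketch[base] else { return nil }
        result += ipa
    }
    return result
}
```

With this table, `ipaSketch(fromARPAbet: "K AE1 T")` yields `"kæt"`, and an unmapped symbol yields nil so the caller can defer to a fallback phonemizer.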
Copilot review: addressed
Fixed (commit 90f53da):
Replied (no change needed):
Local development notes — should not be tracked in version control.
b3a5fcf to 5bbb79b
5bbb79b to ff438ab
…s separate package
ByT5-based neural G2P supporting 100+ languages via beshkenadze/g2p-multilingual-byT5-tiny-mlx (20.8M params, MIT). Provides NeuralPhonemizer conforming to the Phonemizing protocol from MLXAudioG2P for seamless integration as a fallback phonemizer.
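As background on the byte-level tokenization that ByT5 models use, here is a minimal sketch. It assumes the standard ByT5 ID layout (pad = 0, EOS = 1, unk = 2, raw UTF-8 bytes shifted by 3); `ByteTokenizerSketch` is a hypothetical type, and the PR's `Tokenizer.swift` may differ in details.

```swift
import Foundation

/// Hypothetical ByT5-style byte tokenizer; assumes the standard ID layout.
struct ByteTokenizerSketch {
    let padID = 0
    let eosID = 1
    let unkID = 2
    /// Raw byte values are shifted past the three special tokens.
    let byteOffset = 3

    /// Encodes text as UTF-8 bytes plus offset, optionally appending EOS.
    func encode(_ text: String, appendEOS: Bool = true) -> [Int] {
        var ids = text.utf8.map { Int($0) + byteOffset }
        if appendEOS { ids.append(eosID) }
        return ids
    }

    /// Decodes IDs back to text, skipping special tokens and out-of-range IDs.
    func decode(_ ids: [Int]) -> String {
        let bytes = ids.compactMap { id -> UInt8? in
            let value = id - byteOffset
            return (0...255).contains(value) ? UInt8(value) : nil
        }
        return String(decoding: bytes, as: UTF8.self)
    }
}
```

Because the vocabulary is just bytes, no per-language vocabulary file is needed, which fits the multilingual use described in the commit.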
…nored keys, use MLX range, update scheme list
Extract shared ALBERT and config types used by both KittenTTS and Kokoro:
- Shared/Albert.swift: ALBERT encoder (6 classes)
- Shared/SharedConfigs.swift: PLBertConfig + ISTFTNetConfig

Move G2P stack to top-level G2P/ directory (17 files):
- English G2P pipeline with BART fallback network
- Lexicon with gold/silver dictionary support
- MisakiTextProcessor with HF resource download

Add TextProcessor protocol for model-agnostic text processing.

BART G2P resources now downloaded from HuggingFace (beshkenadze/kitten-tts-g2p) instead of bundled in repo.
…Processor protocol
ff438ab to a5d5eea
a5d5eea to 5fa7de7
Restructuring into cleaner PRs
antmanler added a commit to platx-ai/mlx-audio-swift that referenced this pull request Apr 14, 2026
MLXAudioCodecs/Mimi/Mimi.swift had `import Tokenizers` but no actual Tokenizers.* reference; the import was dead. It only resolved before because mlx-swift-lm <2.x re-exported Tokenizers transitively through MLXLMCommon. After mlx-swift-lm#118 (tokenizer decoupling), the re-export is gone and downstream consumers fail with:

    Mimi.swift:7:8: error: unable to resolve module dependency: 'Tokenizers'

Two fixes:
1. Add the explicit Transformers product dependency to the MLXAudioCodecs target in Package.swift, mirroring what MLXAudioSTT/STS/TTS already do.
2. Remove the dead `import Tokenizers` line from Mimi.swift to reduce the surface area. The MimiTokenizer class defined inside Mimi.swift is a local type, not related to swift-transformers, so removing the import is safe.

Verified with `swift build` against the current mlx-swift-lm main pin (post-Blaizzy#118).
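The first fix can be pictured as the Package.swift fragment below. This is a sketch only: the package identity `swift-transformers` and the elided surrounding dependencies are assumptions, and the actual manifest may declare them differently.

```swift
// Package.swift (fragment, not a complete manifest)
.target(
    name: "MLXAudioCodecs",
    dependencies: [
        // ...existing dependencies elided...
        // Explicit dependency on the product that vends the Tokenizers module,
        // instead of relying on a transitive re-export. Package identity assumed.
        .product(name: "Transformers", package: "swift-transformers"),
    ]
),
```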
antmanler added a commit to platx-ai/mlx-audio-swift that referenced this pull request Apr 14, 2026
After mlx-swift-lm#118 (tokenizer decoupling), MLXLMCommon now defines its own internal Tokenizer protocol. The MLXAudioSTT models that import both MLXLMCommon and Tokenizers (swift-transformers) now hit:

    error: 'Tokenizer' is ambiguous for type lookup in this context

Mechanical fix: qualify the standalone Tokenizer type references with the Tokenizers. namespace prefix in five files. The actual Tokenizer type used by these models is the swift-transformers one (loaded via AutoTokenizer.from), so this is the correct qualification.

Files touched:
- Qwen3ASR.swift (1 site)
- Qwen3ForcedAligner.swift (1 site)
- GLMASR.swift (3 sites)
- GraniteSpeech.swift (2 sites)
- StreamingInferenceSession.swift (1 site, "any Tokenizer" form)

Verified with `swift build` against current mlx-swift-lm pin and forward compatible with the post-Blaizzy#118 API.
Summary
- `Albert`, `PLBertConfig`, `ISTFTNetConfig` (Shared/), `MisakiTextProcessor` (G2P/)
- `MLXAudioModules` shared NN blocks from PR feat: add MLXAudioModules and MLXAudioG2P foundation modules #115

Model
- `lucasnewman/kitten-tts-en-us` (English US/GB)
- `@ModuleInfo`, quantization support
- `voices.safetensors` file

Dependencies
- `MLXAudioModules` to MLXAudioTTS target dependencies

Files
- `KittenTTSConfig.swift`: `PLBertConfig` + `ISTFTNetConfig`
- `KittenTTSModel.swift`
- `KittenTTSModules.swift`
- `KittenTTSISTFTNet.swift`
- `KittenTTSTextCleaner.swift`
- `TTSModel.swift`: `kitten_tts` case + inference rule