feat: add MLXAudioNeuralG2P ByT5 multilingual G2P #116
beshkenadze wants to merge 9 commits into Blaizzy:main
…o-phoneme

Add two new foundation modules:
- MLXAudioModules: reusable neural network building blocks (BiLSTM, WeightNormedConv, InstanceNorm, AdaIN, ResBlocks, SineGenerator, etc.) shared across TTS models
- MLXAudioG2P: clean-room grapheme-to-phoneme pipeline with CMUdict lexicon (BSD-2), text normalization, alignment, and extensible language pack architecture

Also updates CI workflow to use struct-based test filtering and adds AGENTS.local.md with build/test conventions.
CMUDictLoader.load(from:) accepts directory URL instead of Bundle.module. EnglishLanguagePack.withCMUDict(directory:) and G2PPipeline.english(cmuDictDirectory:) take directory URL. Resources uploaded to beshkenadze/cmudict-ipa on HuggingFace. Tests guarded with MLXAUDIO_CMUDICT_DIR env var.
…ules, remove stale doc link
Pull request overview
Adds a new multilingual neural G2P engine (MLXAudioNeuralG2P) built around a ByT5-style T5 encoder/decoder in MLX, plus foundational/shared modules (MLXAudioModules) and a text-only G2P pipeline (MLXAudioG2P) with tests and CI updates.
Changes:
- Introduces `MLXAudioNeuralG2P` (tokenizer, T5 model, weight loading/sanitization, greedy decoding) and a `NeuralPhonemizer` adapter conforming to `Phonemizing`.
- Adds `MLXAudioG2P` pipeline (normalization, tokenization, lexicon + fallback phonemizer, optional alignment) with unit tests.
- Updates package products/targets and CI workflow to build/test via `xcodebuild`.
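
For readers unfamiliar with ByT5, the byte-level tokenization mentioned above can be sketched as follows. This is a minimal illustration of the standard ByT5 scheme (IDs 0, 1, 2 reserved for `<pad>`, `</s>`, `<unk>`, raw UTF-8 bytes shifted by 3), not the PR's `Tokenizer.swift`. The type name `ByteTokenizerSketch` is hypothetical, and any language prefix the model may expect is left out.

```swift
// Hypothetical stand-in for illustration; not the PR's Tokenizer.swift.
struct ByteTokenizerSketch {
    // ByT5 reserves IDs 0, 1, 2 for <pad>, </s>, <unk>; raw bytes start at 3.
    static let padID = 0
    static let eosID = 1
    static let byteOffset = 3

    /// Encodes text as UTF-8 bytes shifted by the offset, then appends </s>.
    static func encode(_ text: String) -> [Int] {
        text.utf8.map { Int($0) + byteOffset } + [eosID]
    }

    /// Drops special and out-of-range IDs, then decodes the remaining bytes as UTF-8.
    static func decode(_ ids: [Int]) -> String {
        let bytes = ids.compactMap { id -> UInt8? in
            let raw = id - byteOffset
            return (0...255).contains(raw) ? UInt8(raw) : nil
        }
        return String(decoding: bytes, as: UTF8.self)
    }
}

let ids = ByteTokenizerSketch.encode("hi")
print(ids)                                   // [107, 108, 1]
print(ByteTokenizerSketch.decode(ids))       // hi
```

Because the vocabulary is just bytes plus a few reserved IDs, no vocab file is needed, which matches the description of `Tokenizer.swift` in the file summary.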
Reviewed changes
Copilot reviewed 53 out of 54 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| Tests/MLXAudioNeuralG2PTests.swift | Unit + guarded network integration tests for tokenizer, config, weights, and neural G2P. |
| Tests/MLXAudioG2PTextNormalizerTests.swift | Tests for normalization + tokenization basics. |
| Tests/MLXAudioG2PSmokeTests.swift | Basic pipeline smoke coverage (module import + convert behavior). |
| Tests/MLXAudioG2PLexiconTests.swift | Tests for lexicon lookup and fallback behavior. |
| Tests/MLXAudioG2PCMUDictTests.swift | Tests for CMUdict parsing/loading and ARPAbet→IPA mapping. |
| Tests/MLXAudioG2PAlignmentTests.swift | Tests for heuristic token↔phoneme alignment behavior. |
| Sources/MLXAudioNeuralG2P/Weights.swift | Loads safetensors, sanitizes HF keys to MLX module keys, updates and freezes the model. |
| Sources/MLXAudioNeuralG2P/Tokenizer.swift | Byte-level ByT5 tokenizer (no vocab file). |
| Sources/MLXAudioNeuralG2P/RelativePositionBias.swift | Relative position bucket + bias embeddings for T5 attention. |
| Sources/MLXAudioNeuralG2P/NeuralPhonemizer.swift | Phonemizing adapter around neural G2P output. |
| Sources/MLXAudioNeuralG2P/Model.swift | T5 conditional generation wrapper (encoder/decoder + LM head tying). |
| Sources/MLXAudioNeuralG2P/G2P.swift | Public G2P API + greedy decoding loop. |
| Sources/MLXAudioNeuralG2P/FeedForward.swift | T5 gated-GELU feed-forward block. |
| Sources/MLXAudioNeuralG2P/EncoderLayer.swift | Encoder layer: attention + FFN with RMSNorm and residuals. |
| Sources/MLXAudioNeuralG2P/Encoder.swift | Encoder stack + shared relative attention bias module. |
| Sources/MLXAudioNeuralG2P/DecoderLayer.swift | Decoder layer: self-attn + cross-attn + FFN with caching. |
| Sources/MLXAudioNeuralG2P/Decoder.swift | Decoder stack + causal/self mask construction + KV cache plumbing. |
| Sources/MLXAudioNeuralG2P/Config.swift | Codable T5 config loader from config.json. |
| Sources/MLXAudioNeuralG2P/Attention.swift | Multi-head attention implementation with KV caching. |
| Sources/MLXAudioModules/WeightNormedConv.swift | Weight-normalized conv helper module. |
| Sources/MLXAudioModules/Utilities.swift | Shared utility: 1D interpolation. |
| Sources/MLXAudioModules/UpSample1d.swift | Upsampling helper module. |
| Sources/MLXAudioModules/SineGenerator.swift | Sine + noise source generation utilities. |
| Sources/MLXAudioModules/ResidualBlocks.swift | AdaIN/AdaIN-Snake style residual blocks using shared modules. |
| Sources/MLXAudioModules/Normalization.swift | InstanceNorm, AdaIN, and AdaLayerNorm implementations. |
| Sources/MLXAudioModules/LinearNorm.swift | Linear wrapper module with named key mapping. |
| Sources/MLXAudioModules/BiLSTM.swift | BiLSTM building block (manual gate math). |
| Sources/MLXAudioG2P/Tokenization/TextTokenizer.swift | Deterministic tokenizer into word/punct/whitespace tokens with ranges. |
| Sources/MLXAudioG2P/Tokenization/G2PToken.swift | Token model including kind + normalized-text range. |
| Sources/MLXAudioG2P/TextNormalization/TextNormalizer.swift | Normalizer applying a list of rules. |
| Sources/MLXAudioG2P/TextNormalization/NormalizationRule.swift | Rule representation + English default rules (quotes/dashes/whitespace). |
| Sources/MLXAudioG2P/README.md | Module README documenting scope and usage. |
| Sources/MLXAudioG2P/Pipeline/G2PPipeline.swift | Main pipeline: normalize → tokenize → lexicon-first → fallback → optional alignment. |
| Sources/MLXAudioG2P/Pipeline/G2POutput.swift | Pipeline output structure (text/tokens/phonemes/alignment). |
| Sources/MLXAudioG2P/Pipeline/G2PInput.swift | Input struct (text/locale/alignment flag). |
| Sources/MLXAudioG2P/Pipeline/G2PError.swift | Shared error enum for pipeline components. |
| Sources/MLXAudioG2P/Phonemes/PhonemeUnit.swift | Phoneme unit type. |
| Sources/MLXAudioG2P/Phonemes/PhonemeSequence.swift | Sequence container + render helper. |
| Sources/MLXAudioG2P/MLXAudioG2P.swift | Module namespace + version string. |
| Sources/MLXAudioG2P/Lexicon/LexiconProviding.swift | Lexicon protocol. |
| Sources/MLXAudioG2P/Lexicon/LexiconEntry.swift | Lexicon entry model. |
| Sources/MLXAudioG2P/Lexicon/InMemoryLexicon.swift | In-memory case-insensitive lexicon implementation. |
| Sources/MLXAudioG2P/Lexicon/CMUDict/CMUDictParser.swift | CMUdict line/text parsing into raw entries. |
| Sources/MLXAudioG2P/Lexicon/CMUDict/CMUDictLoader.swift | Loads cmudict.dict from directory into lexicon (ARPAbet→IPA). |
| Sources/MLXAudioG2P/Lexicon/CMUDict/ARPAbetMapper.swift | ARPAbet→IPA mapping + stress handling. |
| Sources/MLXAudioG2P/Languages/English/EnglishLanguagePack.swift | English defaults + CMUdict-backed factory. |
| Sources/MLXAudioG2P/Fallback/FallbackPhonemizer.swift | Rule-based fallback phonemizer + Phonemizing protocol. |
| Sources/MLXAudioG2P/Alignment/TokenAlignment.swift | Alignment model mapping tokenIndex→phonemeRange. |
| Sources/MLXAudioG2P/Alignment/TokenAligning.swift | Alignment protocol. |
| Sources/MLXAudioG2P/Alignment/HeuristicTokenAligner.swift | Simple heuristic aligner implementation. |
| Package.swift | Adds new library products/targets and wires them into umbrella MLXAudio + tests. |
| Package.resolved | Updates pinned dependency revisions/versions. |
| AGENTS.local.md | Documents repo-specific build/test conventions and CI constraints. |
| .github/workflows/tests.yaml | Simplifies CI to a single xcodebuild test run (skipping SmokeTests). |
```swift
public func phonemize(_ grapheme: String) throws -> [PhonemeUnit] {
    let ipa = g2p.convert(grapheme, language: language)

    guard !ipa.isEmpty else {
        throw G2PError.phonemizationFailed(
            token: grapheme,
            reason: "Neural model returned empty output"
        )
    }

    return ipa.map { PhonemeUnit(symbol: String($0)) }
}
```
phonemize converts the model output string into one PhonemeUnit per Character, which will split multi-character phonemes (e.g. "tʃ", "oʊ") and can also emit whitespace/newline units. This is inconsistent with the rest of MLXAudioG2P where a unit is a phoneme token. Consider trimming the decoded string and tokenizing it (e.g., splitting on whitespace and filtering empties) before building [PhonemeUnit], and treat empty/whitespace-only output as failure.
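
One way the suggestion above might look is sketched below. It assumes the model emits whitespace-separated phoneme tokens, which this thread does not confirm. If the output is an unsegmented IPA string, whitespace splitting would yield one unit per word and a different segmentation step would be needed. `PhonemeUnit` here is a local stand-in for the module's type, and `PhonemizeSketchError` and `phonemeUnits(fromDecoded:)` are hypothetical names.

```swift
// Local stand-in for the module's PhonemeUnit type (assumption for this sketch).
struct PhonemeUnit: Equatable {
    let symbol: String
}

// Hypothetical error type; the PR uses G2PError.phonemizationFailed instead.
enum PhonemizeSketchError: Error {
    case emptyOutput
}

/// Splits decoded model output on whitespace so multi-character phonemes stay intact.
/// Whitespace-only output is treated as a failure.
func phonemeUnits(fromDecoded ipa: String) throws -> [PhonemeUnit] {
    // split(whereSeparator:) omits empty subsequences by default,
    // so leading, trailing, and repeated whitespace produce no units.
    let symbols = ipa.split(whereSeparator: { $0.isWhitespace })
    guard !symbols.isEmpty else { throw PhonemizeSketchError.emptyOutput }
    return symbols.map { PhonemeUnit(symbol: String($0)) }
}

let units = try phonemeUnits(fromDecoded: " tʃ oʊ \n")
print(units.map { $0.symbol })
```

With this shape, `"tʃ"` and `"oʊ"` each become a single unit, and the trailing newline no longer turns into a unit of its own.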
```swift
private static let ignoredSuffixes: [String] = [
    "decoder.layers.0.cross_attention.relative_attention_bias.embeddings.weight"
]

static func sanitizeKey(_ key: String) -> String? {
    var key = key

    for (from, to) in sharedReplacements {
        key = key.replacingOccurrences(of: from, with: to)
    }

    if key.hasPrefix("encoder.") {
        for (from, to) in encoderReplacements {
            key = key.replacingOccurrences(of: from, with: to)
        }
    } else if key.hasPrefix("decoder.") {
        for (from, to) in decoderReplacements {
            key = key.replacingOccurrences(of: from, with: to)
        }
    }

    if ignoredSuffixes.contains(key) { return nil }
    return key
}
```
ignoredSuffixes only filters a single fully-qualified key (layer 0). If the source checkpoint contains the same cross_attention.relative_attention_bias... parameter for other blocks (or variants), sanitize will pass them through and update(verify: .noUnusedKeys) will fail due to unused keys. Consider ignoring by pattern (e.g., any key containing .cross_attention.relative_attention_bias.) rather than exact string match.
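
A minimal sketch of the pattern-based filter suggested above, shown in isolation from the replacement tables. The names `ignoredFragments` and `filterIgnored` are hypothetical, and the example keys are illustrative rather than taken from the actual checkpoint.

```swift
import Foundation

// Fragment taken from the review suggestion; matching is by substring, not exact key.
let ignoredFragments = [".cross_attention.relative_attention_bias."]

/// Returns nil for keys that should be dropped before updating the model,
/// otherwise returns the key unchanged.
func filterIgnored(_ key: String) -> String? {
    ignoredFragments.contains(where: { key.contains($0) }) ? nil : key
}

// Illustrative keys (assumed naming, not verified against the checkpoint).
let dropped = filterIgnored("decoder.layers.3.cross_attention.relative_attention_bias.embeddings.weight")
let kept = filterIgnored("decoder.layers.0.self_attention.q_proj.weight")
print(dropped as Any, kept as Any)
```

Substring matching covers every decoder block with one entry. The trade-off is that it would also hide a legitimately needed parameter if one ever contained the same fragment.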
```swift
let contextPosition = MLXArray(
    Array(offset ..< (offset + queryLength))
).expandedDimensions(axis: 1)
let memoryPosition = MLXArray(
    Array(0 ..< keyLength)
).expandedDimensions(axis: 0)
```
RelativePositionBias.callAsFunction builds Swift Arrays for positions on every call. This allocates/copies on the CPU and can become a bottleneck during autoregressive decoding. Prefer MLX.arange(...) (as used elsewhere in the repo) to build these tensors directly on the MLX side.
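
To make clear what the two position tensors are for, here is a plain-Swift illustration of the relative position matrix they produce when broadcast together. It assumes the standard T5 formulation (key position minus query position) and uses nested arrays only for readability. It is not the MLX code and says nothing about how the PR buckets these values. `relativePositions` is a hypothetical name.

```swift
/// Relative positions (key index minus query index) for a query window
/// that starts at `offset`. The result has shape [queryLength][keyLength].
func relativePositions(queryLength: Int, keyLength: Int, offset: Int = 0) -> [[Int]] {
    (offset ..< offset + queryLength).map { context in
        (0 ..< keyLength).map { memory in memory - context }
    }
}

// One new decoder token attending over 4 keys, with 3 tokens already processed.
print(relativePositions(queryLength: 1, keyLength: 4, offset: 3))   // [[-3, -2, -1, 0]]
```

During cached autoregressive decoding the query window is typically a single row, so the matrix is rebuilt at every step. That is why the comment above recommends constructing the positions on the MLX side rather than from Swift arrays.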
Available schemes: MLXAudio, MLXAudio-Package, MLXAudioCodecs, MLXAudioCore,
MLXAudioG2P, MLXAudioModules, MLXAudioLID, MLXAudioSTS, MLXAudioSTT,
The “Available schemes” list doesn’t include the newly-added MLXAudioNeuralG2P scheme/target, which may confuse contributors following this doc. Consider updating the scheme list to include it (or clarifying how schemes are generated).
```diff
- MLXAudioG2P, MLXAudioModules, MLXAudioLID, MLXAudioSTS, MLXAudioSTT,
+ MLXAudioG2P, MLXAudioNeuralG2P, MLXAudioModules, MLXAudioLID, MLXAudioSTS, MLXAudioSTT,
```
All 4 Copilot comments addressed in e4ba855:
Local development notes — should not be tracked in version control.
Force-pushed: e4ba855 to 4ccd40d
Force-pushed: 4ccd40d to 2d344b3
…s separate package
ByT5-based neural G2P supporting 100+ languages via beshkenadze/g2p-multilingual-byT5-tiny-mlx (20.8M params, MIT). Provides NeuralPhonemizer conforming to the Phonemizing protocol from MLXAudioG2P for seamless integration as a fallback phonemizer.
…nored keys, use MLX range, update scheme list
Force-pushed: 2d344b3 to 9afea89
Restructuring into cleaner PRs
Summary
- `MLXAudioNeuralG2P` module: a ByT5-based neural grapheme-to-phoneme engine supporting 100+ languages
- `beshkenadze/g2p-multilingual-byT5-tiny-mlx` (20.8M params, MIT license)
- `NeuralPhonemizer` conforms to `Phonemizing` protocol from `MLXAudioG2P` for use as fallback phonemizer

Details
13 source files implementing a complete ByT5 T5 encoder-decoder architecture:

Dependencies
- `MLXAudioG2P` (for `Phonemizing` protocol)
- `MLXAudioG2P`, `MLX`, `MLXFast`, `MLXNN`, `MLXRandom`

Testing
- `MLXAUDIO_ENABLE_NETWORK_TESTS=1` (downloads model from HF)
- `swift build --target MLXAudioNeuralG2P` ✓