feat: add MLXAudioNeuralG2P ByT5 multilingual G2P by beshkenadze · Pull Request #116 · Blaizzy/mlx-audio-swift

beshkenadze · 2026-03-23T08:32:23Z

Summary

Adds the MLXAudioNeuralG2P module: a ByT5-based neural grapheme-to-phoneme engine supporting 100+ languages
Model: beshkenadze/g2p-multilingual-byT5-tiny-mlx (20.8M params, MIT license)
NeuralPhonemizer conforms to the Phonemizing protocol from MLXAudioG2P, so it can be used as a fallback phonemizer

Details

13 source files implementing a complete ByT5 (T5-style) encoder-decoder architecture:

Byte-level tokenizer (no vocabulary file needed)
Encoder with relative position bias
Autoregressive decoder with cross-attention
Weight sanitization for HuggingFace → MLX key mapping
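
As a rough illustration of why a byte-level tokenizer needs no vocabulary file, here is a minimal sketch (not the PR's `Tokenizer.swift`). It assumes the standard ByT5 convention of reserving ids 0 (pad), 1 (EOS), and 2 (unk), and mapping each UTF-8 byte `b` to id `b + 3`. The type name `ByteTokenizerSketch` is hypothetical, and any language-tag prefix the model may expect is left out.

```swift
// Sketch of ByT5-style byte-level tokenization (hypothetical type, not the PR's code).
// Assumes ids 0 (pad), 1 (eos), 2 (unk) are reserved and byte b maps to id b + 3.
struct ByteTokenizerSketch {
    let padID = 0
    let eosID = 1
    let byteOffset = 3

    /// Encode text as UTF-8 bytes shifted by the special-token offset, then append EOS.
    func encode(_ text: String) -> [Int] {
        text.utf8.map { Int($0) + byteOffset } + [eosID]
    }

    /// Decode ids back to text, skipping special tokens and ids outside the byte range.
    func decode(_ ids: [Int]) -> String {
        let bytes = ids.compactMap { id -> UInt8? in
            let value = id - byteOffset
            return (0...255).contains(value) ? UInt8(value) : nil
        }
        return String(decoding: bytes, as: UTF8.self)
    }
}

let tokenizer = ByteTokenizerSketch()
let ids = tokenizer.encode("hi")
print(ids)                    // [107, 108, 1]
print(tokenizer.decode(ids))  // hi
```

Because the mapping is pure arithmetic on bytes, multi-byte IPA symbols such as "tʃ" round-trip without any lookup table.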

Dependencies

Depends on PR #115 ("feat: add MLXAudioModules and MLXAudioG2P foundation modules"), which provides MLXAudioG2P and its Phonemizing protocol
New deps: MLXAudioG2P, MLX, MLXFast, MLXNN, MLXRandom

Testing

Integration tests guarded by MLXAUDIO_ENABLE_NETWORK_TESTS=1 (downloads model from HF)
Build verified: swift build --target MLXAudioNeuralG2P ✓
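
The opt-in guard described above could look roughly like the following. This is a sketch only: the helper name `networkTestsEnabled` is hypothetical, and the PR's actual tests may check the variable differently.

```swift
import Foundation

/// Returns true only when the opt-in flag is set to "1".
/// The environment is injectable so the check itself can be tested offline.
func networkTestsEnabled(
    environment: [String: String] = ProcessInfo.processInfo.environment
) -> Bool {
    environment["MLXAUDIO_ENABLE_NETWORK_TESTS"] == "1"
}

// A network-dependent test would return early (or skip) when this is false.
print(networkTestsEnabled(environment: ["MLXAUDIO_ENABLE_NETWORK_TESTS": "1"])) // true
print(networkTestsEnabled(environment: [:]))                                    // false
```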

…o-phoneme Add two new foundation modules: - MLXAudioModules: reusable neural network building blocks (BiLSTM, WeightNormedConv, InstanceNorm, AdaIN, ResBlocks, SineGenerator, etc.) shared across TTS models - MLXAudioG2P: clean-room grapheme-to-phoneme pipeline with CMUdict lexicon (BSD-2), text normalization, alignment, and extensible language pack architecture Also updates CI workflow to use struct-based test filtering and adds AGENTS.local.md with build/test conventions.

CMUDictLoader.load(from:) accepts directory URL instead of Bundle.module. EnglishLanguagePack.withCMUDict(directory:) and G2PPipeline.english(cmuDictDirectory:) take directory URL. Resources uploaded to beshkenadze/cmudict-ipa on HuggingFace. Tests guarded with MLXAUDIO_CMUDICT_DIR env var.

Copilot

Pull request overview

Adds a new multilingual neural G2P engine (MLXAudioNeuralG2P) built around a ByT5-style T5 encoder/decoder in MLX, plus foundational/shared modules (MLXAudioModules) and a text-only G2P pipeline (MLXAudioG2P) with tests and CI updates.

Changes:

Introduces MLXAudioNeuralG2P (tokenizer, T5 model, weight loading/sanitization, greedy decoding) and a NeuralPhonemizer adapter conforming to Phonemizing.
Adds MLXAudioG2P pipeline (normalization, tokenization, lexicon + fallback phonemizer, optional alignment) with unit tests.
Updates package products/targets and CI workflow to build/test via xcodebuild.
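
For readers unfamiliar with greedy decoding, the loop mentioned above can be sketched as follows. This is not the code in `G2P.swift`: `nextLogits` is a hypothetical stand-in for one decoder step, and the start and EOS ids in the usage example assume the usual T5 convention (pad id 0 as decoder start, EOS id 1).

```swift
/// Greedy decoding sketch: repeatedly pick the highest-scoring next token.
/// `nextLogits` is a hypothetical stand-in for one decoder step; it returns
/// one score per vocabulary id given the tokens generated so far.
func greedyDecode(
    startID: Int,
    eosID: Int,
    maxLength: Int,
    nextLogits: ([Int]) -> [Float]
) -> [Int] {
    var tokens = [startID]
    for _ in 0..<maxLength {
        let logits = nextLogits(tokens)
        // Argmax over the vocabulary; stop if the step produced no scores.
        guard let best = logits.indices.max(by: { logits[$0] < logits[$1] }) else { break }
        if best == eosID { break }
        tokens.append(best)
    }
    return Array(tokens.dropFirst()) // drop the decoder start token
}

// Toy "model" that emits a fixed script of ids, ending with EOS (1).
let script = [5, 7, 1]
let output = greedyDecode(startID: 0, eosID: 1, maxLength: 10) { tokens in
    var logits = [Float](repeating: 0, count: 8)
    logits[script[tokens.count - 1]] = 1
    return logits
}
print(output) // [5, 7]
```

A real implementation would also pass a KV cache between steps so that each step only processes the newest token.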

Reviewed changes

Copilot reviewed 53 out of 54 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| Tests/MLXAudioNeuralG2PTests.swift | Unit + guarded network integration tests for tokenizer, config, weights, and neural G2P. |
| Tests/MLXAudioG2PTextNormalizerTests.swift | Tests for normalization + tokenization basics. |
| Tests/MLXAudioG2PSmokeTests.swift | Basic pipeline smoke coverage (module import + convert behavior). |
| Tests/MLXAudioG2PLexiconTests.swift | Tests for lexicon lookup and fallback behavior. |
| Tests/MLXAudioG2PCMUDictTests.swift | Tests for CMUdict parsing/loading and ARPAbet→IPA mapping. |
| Tests/MLXAudioG2PAlignmentTests.swift | Tests for heuristic token↔phoneme alignment behavior. |
| Sources/MLXAudioNeuralG2P/Weights.swift | Loads safetensors, sanitizes HF keys to MLX module keys, updates/freezes model. |
| Sources/MLXAudioNeuralG2P/Tokenizer.swift | Byte-level ByT5 tokenizer (no vocab file). |
| Sources/MLXAudioNeuralG2P/RelativePositionBias.swift | Relative position bucket + bias embeddings for T5 attention. |
| Sources/MLXAudioNeuralG2P/NeuralPhonemizer.swift | `Phonemizing` adapter around neural G2P output. |
| Sources/MLXAudioNeuralG2P/Model.swift | T5 conditional generation wrapper (encoder/decoder + LM head tying). |
| Sources/MLXAudioNeuralG2P/G2P.swift | Public G2P API + greedy decoding loop. |
| Sources/MLXAudioNeuralG2P/FeedForward.swift | T5 gated-GELU feed-forward block. |
| Sources/MLXAudioNeuralG2P/EncoderLayer.swift | Encoder layer: attention + FFN with RMSNorm and residuals. |
| Sources/MLXAudioNeuralG2P/Encoder.swift | Encoder stack + shared relative attention bias module. |
| Sources/MLXAudioNeuralG2P/DecoderLayer.swift | Decoder layer: self-attn + cross-attn + FFN with caching. |
| Sources/MLXAudioNeuralG2P/Decoder.swift | Decoder stack + causal/self mask construction + KV cache plumbing. |
| Sources/MLXAudioNeuralG2P/Config.swift | Codable T5 config loader from `config.json`. |
| Sources/MLXAudioNeuralG2P/Attention.swift | Multi-head attention implementation with KV caching. |
| Sources/MLXAudioModules/WeightNormedConv.swift | Weight-normalized conv helper module. |
| Sources/MLXAudioModules/Utilities.swift | Shared utility: 1D interpolation. |
| Sources/MLXAudioModules/UpSample1d.swift | Upsampling helper module. |
| Sources/MLXAudioModules/SineGenerator.swift | Sine + noise source generation utilities. |
| Sources/MLXAudioModules/ResidualBlocks.swift | AdaIN/AdaIN-Snake style residual blocks using shared modules. |
| Sources/MLXAudioModules/Normalization.swift | InstanceNorm, AdaIN, and AdaLayerNorm implementations. |
| Sources/MLXAudioModules/LinearNorm.swift | Linear wrapper module with named key mapping. |
| Sources/MLXAudioModules/BiLSTM.swift | BiLSTM building block (manual gate math). |
| Sources/MLXAudioG2P/Tokenization/TextTokenizer.swift | Deterministic tokenizer into word/punct/whitespace tokens with ranges. |
| Sources/MLXAudioG2P/Tokenization/G2PToken.swift | Token model including kind + normalized-text range. |
| Sources/MLXAudioG2P/TextNormalization/TextNormalizer.swift | Normalizer applying a list of rules. |
| Sources/MLXAudioG2P/TextNormalization/NormalizationRule.swift | Rule representation + English default rules (quotes/dashes/whitespace). |
| Sources/MLXAudioG2P/README.md | Module README documenting scope and usage. |
| Sources/MLXAudioG2P/Pipeline/G2PPipeline.swift | Main pipeline: normalize → tokenize → lexicon-first → fallback → optional alignment. |
| Sources/MLXAudioG2P/Pipeline/G2POutput.swift | Pipeline output structure (text/tokens/phonemes/alignment). |
| Sources/MLXAudioG2P/Pipeline/G2PInput.swift | Input struct (text/locale/alignment flag). |
| Sources/MLXAudioG2P/Pipeline/G2PError.swift | Shared error enum for pipeline components. |
| Sources/MLXAudioG2P/Phonemes/PhonemeUnit.swift | Phoneme unit type. |
| Sources/MLXAudioG2P/Phonemes/PhonemeSequence.swift | Sequence container + render helper. |
| Sources/MLXAudioG2P/MLXAudioG2P.swift | Module namespace + version string. |
| Sources/MLXAudioG2P/Lexicon/LexiconProviding.swift | Lexicon protocol. |
| Sources/MLXAudioG2P/Lexicon/LexiconEntry.swift | Lexicon entry model. |
| Sources/MLXAudioG2P/Lexicon/InMemoryLexicon.swift | In-memory case-insensitive lexicon implementation. |
| Sources/MLXAudioG2P/Lexicon/CMUDict/CMUDictParser.swift | CMUdict line/text parsing into raw entries. |
| Sources/MLXAudioG2P/Lexicon/CMUDict/CMUDictLoader.swift | Loads `cmudict.dict` from directory into lexicon (ARPAbet→IPA). |
| Sources/MLXAudioG2P/Lexicon/CMUDict/ARPAbetMapper.swift | ARPAbet→IPA mapping + stress handling. |
| Sources/MLXAudioG2P/Languages/English/EnglishLanguagePack.swift | English defaults + CMUdict-backed factory. |
| Sources/MLXAudioG2P/Fallback/FallbackPhonemizer.swift | Rule-based fallback phonemizer + `Phonemizing` protocol. |
| Sources/MLXAudioG2P/Alignment/TokenAlignment.swift | Alignment model mapping tokenIndex→phonemeRange. |
| Sources/MLXAudioG2P/Alignment/TokenAligning.swift | Alignment protocol. |
| Sources/MLXAudioG2P/Alignment/HeuristicTokenAligner.swift | Simple heuristic aligner implementation. |
| Package.swift | Adds new library products/targets and wires them into umbrella `MLXAudio` + tests. |
| Package.resolved | Updates pinned dependency revisions/versions. |
| AGENTS.local.md | Documents repo-specific build/test conventions and CI constraints. |
| .github/workflows/tests.yaml | Simplifies CI to a single xcodebuild test run (skipping SmokeTests). |

Copilot · 2026-03-23T08:37:39Z

+    public func phonemize(_ grapheme: String) throws -> [PhonemeUnit] {
+        let ipa = g2p.convert(grapheme, language: language)
+
+        guard !ipa.isEmpty else {
+            throw G2PError.phonemizationFailed(
+                token: grapheme,
+                reason: "Neural model returned empty output"
+            )
+        }
+
+        return ipa.map { PhonemeUnit(symbol: String($0)) }
+    }


phonemize converts the model output string into one PhonemeUnit per Character, which will split multi-character phonemes (e.g. "tʃ", "oʊ") and can also emit whitespace/newline units. This is inconsistent with the rest of MLXAudioG2P where a unit is a phoneme token. Consider trimming the decoded string and tokenizing it (e.g., splitting on whitespace and filtering empties) before building [PhonemeUnit], and treat empty/whitespace-only output as failure.
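
A sketch of the tokenization suggested above, under the assumption that the model emits whitespace-delimited phonemes (the author's later reply indicates this codebase instead uses per-character units, in which case only the trimming and the empty check apply). `PhonemeUnit` here is a hypothetical stand-in for the MLXAudioG2P type, and `PhonemizeSketchError` and `phonemeUnits(fromDecoded:)` are illustrative names.

```swift
// Hypothetical stand-in for the PhonemeUnit type in MLXAudioG2P.
struct PhonemeUnit: Equatable {
    let symbol: String
}

enum PhonemizeSketchError: Error {
    case emptyOutput
}

/// Split decoded model output on whitespace so that multi-character
/// phonemes such as "tʃ" stay in one unit; reject whitespace-only output.
func phonemeUnits(fromDecoded ipa: String) throws -> [PhonemeUnit] {
    let symbols = ipa.split(whereSeparator: \.isWhitespace).map(String.init)
    guard !symbols.isEmpty else { throw PhonemizeSketchError.emptyOutput }
    return symbols.map { PhonemeUnit(symbol: $0) }
}

let units = try! phonemeUnits(fromDecoded: " tʃ oʊ \n")
print(units.map(\.symbol)) // ["tʃ", "oʊ"]
```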

Copilot · 2026-03-23T08:37:39Z

+    private static let ignoredSuffixes: [String] = [
+        "decoder.layers.0.cross_attention.relative_attention_bias.embeddings.weight"
+    ]
+
+    static func sanitizeKey(_ key: String) -> String? {
+        var key = key
+
+        for (from, to) in sharedReplacements {
+            key = key.replacingOccurrences(of: from, with: to)
+        }
+
+        if key.hasPrefix("encoder.") {
+            for (from, to) in encoderReplacements {
+                key = key.replacingOccurrences(of: from, with: to)
+            }
+        } else if key.hasPrefix("decoder.") {
+            for (from, to) in decoderReplacements {
+                key = key.replacingOccurrences(of: from, with: to)
+            }
+        }
+
+        if ignoredSuffixes.contains(key) { return nil }
+        return key


ignoredSuffixes only filters a single fully-qualified key (layer 0). If the source checkpoint contains the same cross_attention.relative_attention_bias... parameter for other blocks (or variants), sanitize will pass them through and update(verify: .noUnusedKeys) will fail due to unused keys. Consider ignoring by pattern (e.g., any key containing .cross_attention.relative_attention_bias.) rather than exact string match.
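
The pattern-based variant could be sketched like this. The names `ignoredFragments` and `filterIgnored` are hypothetical, the key strings are examples only, and the renaming step that the real `sanitizeKey` performs first is omitted.

```swift
import Foundation

// Fragments that mark checkpoint keys with no counterpart in the model.
let ignoredFragments = [".cross_attention.relative_attention_bias."]

/// Returns nil for keys that should be dropped, otherwise the key unchanged.
/// The real sanitizeKey also renames keys first; that step is omitted here.
func filterIgnored(_ key: String) -> String? {
    let shouldIgnore = ignoredFragments.contains { fragment in key.contains(fragment) }
    return shouldIgnore ? nil : key
}

// Dropped for any layer index, not just layer 0.
print(filterIgnored("decoder.layers.3.cross_attention.relative_attention_bias.embeddings.weight") as Any) // nil
// Unrelated keys pass through untouched.
print(filterIgnored("decoder.layers.3.cross_attention.q.weight") as Any) // Optional("decoder.layers.3.cross_attention.q.weight")
```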

Copilot · 2026-03-23T08:37:40Z

+        let contextPosition = MLXArray(
+            Array(offset ..< (offset + queryLength))
+        ).expandedDimensions(axis: 1)
+        let memoryPosition = MLXArray(
+            Array(0 ..< keyLength)
+        ).expandedDimensions(axis: 0)


RelativePositionBias.callAsFunction builds Swift Arrays for positions on every call. This allocates/copies on the CPU and can become a bottleneck during autoregressive decoding. Prefer MLX.arange(...) (as used elsewhere in the repo) to build these tensors directly on the MLX side.
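
To make clear what the two position tensors in the snippet encode, here is the same arithmetic written with plain Swift arrays. It is purely illustrative and deliberately avoids MLX calls; it assumes the usual T5 convention that the relative position is the memory (key) position minus the context (query) position. The function name is hypothetical.

```swift
/// Relative positions for a decode step, using plain Swift arrays for illustration.
/// Entry [q][k] is memoryPosition[k] - contextPosition[q] (assumed T5 convention).
func relativePositions(offset: Int, queryLength: Int, keyLength: Int) -> [[Int]] {
    (offset ..< offset + queryLength).map { context in
        (0 ..< keyLength).map { memory in memory - context }
    }
}

// One new query token at position 3 attending over 4 cached keys.
print(relativePositions(offset: 3, queryLength: 1, keyLength: 4))
// [[-3, -2, -1, 0]]
```

Since `offset` advances on every decoding step, this matrix is rebuilt per step, which is why constructing it on the MLX side rather than through intermediate Swift arrays matters for decoding speed.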

Copilot · 2026-03-23T08:37:40Z

+```
+
+Available schemes: MLXAudio, MLXAudio-Package, MLXAudioCodecs, MLXAudioCore,
+MLXAudioG2P, MLXAudioModules, MLXAudioLID, MLXAudioSTS, MLXAudioSTT,


The “Available schemes” list doesn’t include the newly-added MLXAudioNeuralG2P scheme/target, which may confuse contributors following this doc. Consider updating the scheme list to include it (or clarifying how schemes are generated).

Suggested change

-MLXAudioG2P, MLXAudioModules, MLXAudioLID, MLXAudioSTS, MLXAudioSTT,
+MLXAudioG2P, MLXAudioNeuralG2P, MLXAudioModules, MLXAudioLID, MLXAudioSTS, MLXAudioSTT,

beshkenadze · 2026-03-23T08:44:25Z

All 4 Copilot comments addressed in e4ba855:

NeuralPhonemizer — Added whitespace trimming + filtering before building PhonemeUnits. Per-character splitting is consistent with CMUDict IPA output in this codebase.
Weights — Changed from exact key match to pattern-based filtering (.cross_attention.relative_attention_bias.) for safety across all layers.
RelativePositionBias — Replaced Swift Array allocation with MLXArray range construction directly.
AGENTS.local.md — Added MLXAudioNeuralG2P to scheme list.

ByT5-based neural G2P supporting 100+ languages via beshkenadze/g2p-multilingual-byT5-tiny-mlx (20.8M params, MIT). Provides NeuralPhonemizer conforming to the Phonemizing protocol from MLXAudioG2P for seamless integration as a fallback phonemizer.

beshkenadze · 2026-03-23T20:00:42Z

Restructuring into cleaner PRs

beshkenadze added 3 commits March 23, 2026 00:07

fix: address Copilot review — rename SmokeTests, reorder normalizer r…

fb88daf

…ules, remove stale doc link

Copilot AI review requested due to automatic review settings March 23, 2026 08:32

Copilot started reviewing on behalf of beshkenadze March 23, 2026 08:32

Copilot AI reviewed Mar 23, 2026

This was referenced Mar 23, 2026

feat: add shared TTS types, G2P stack, and text processing #117

Closed

feat: add Kokoro TTS with multilingual support #119

Closed

beshkenadze added 2 commits March 23, 2026 18:10

chore: remove AGENTS.local.md from repo

edfd060

Local development notes — should not be tracked in version control.

chore: add AGENTS.local.md to .gitignore

e1de00e

beshkenadze force-pushed the feat/neural-g2p branch from e4ba855 to 4ccd40d Compare March 23, 2026 16:11

fix: remove version constant from MLXAudioG2P namespace

0af92ec

beshkenadze force-pushed the feat/neural-g2p branch from 4ccd40d to 2d344b3 Compare March 23, 2026 18:35

beshkenadze added 3 commits March 23, 2026 21:21

refactor: move MLXAudioModules into Models/StyleTTS2/Blocks, remove a…

8ff81a3

…s separate package

fix: address Copilot review — trim phonemize output, pattern-match ig…

9afea89

…nored keys, use MLX range, update scheme list

beshkenadze force-pushed the feat/neural-g2p branch from 2d344b3 to 9afea89 Compare March 23, 2026 19:22

beshkenadze closed this Mar 23, 2026

beshkenadze wants to merge 9 commits into Blaizzy:main from beshkenadze:feat/neural-g2p

2 participants
