feat: add MLXAudioNeuralG2P ByT5 multilingual G2P #116

Closed
beshkenadze wants to merge 9 commits into Blaizzy:main from beshkenadze:feat/neural-g2p

Conversation

@beshkenadze
Contributor

Summary

  • Adds MLXAudioNeuralG2P module — a ByT5-based neural grapheme-to-phoneme engine supporting 100+ languages
  • Model: beshkenadze/g2p-multilingual-byT5-tiny-mlx (20.8M params, MIT license)
  • NeuralPhonemizer conforms to Phonemizing protocol from MLXAudioG2P for use as fallback phonemizer

Details

13 source files implementing a complete ByT5 (T5-style) encoder-decoder architecture:

  • Byte-level tokenizer (no vocabulary file needed)
  • Encoder with relative position bias
  • Autoregressive decoder with cross-attention
  • Weight sanitization for HuggingFace → MLX key mapping
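
The byte-level tokenizer needs no vocabulary file because token IDs can be derived directly from UTF-8 bytes. A minimal sketch, assuming the usual ByT5 convention (IDs 0, 1, and 2 reserved for pad, EOS, and unknown, with every byte shifted by 3); the type and method names are hypothetical and are not taken from this PR:

```swift
/// Hypothetical byte-level tokenizer following the ByT5 ID convention.
struct ByteTokenizerSketch {
    static let padID = 0
    static let eosID = 1
    static let unkID = 2
    static let byteOffset = 3  // byte value b maps to token ID b + 3

    /// Encode text as shifted UTF-8 bytes and append EOS.
    func encode(_ text: String) -> [Int] {
        text.utf8.map { Int($0) + Self.byteOffset } + [Self.eosID]
    }

    /// Decode IDs back to text, skipping special tokens and IDs outside the byte range.
    func decode(_ ids: [Int]) -> String {
        let bytes = ids.compactMap { id -> UInt8? in
            let value = id - Self.byteOffset
            return (0...255).contains(value) ? UInt8(value) : nil
        }
        return String(decoding: bytes, as: UTF8.self)
    }
}
```

Because multi-byte characters such as "ʃ" simply become several IDs, the same code path covers every language without a lookup table.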

Dependencies

Testing

  • Integration tests guarded by MLXAUDIO_ENABLE_NETWORK_TESTS=1 (downloads model from HF)
  • Build verified: swift build --target MLXAudioNeuralG2P
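
One way such an opt-in guard could be checked is sketched below. The helper name is hypothetical, and how the PR's tests actually read the variable is not shown in this conversation:

```swift
import Foundation

/// Returns true only when the opt-in flag is set to exactly "1".
/// The environment is injectable so the check itself can be unit-tested.
func networkTestsEnabled(
    environment: [String: String] = ProcessInfo.processInfo.environment
) -> Bool {
    environment["MLXAUDIO_ENABLE_NETWORK_TESTS"] == "1"
}
```

A test that downloads the model would return early (or be skipped) when this is false, so the default test run stays offline.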

…o-phoneme

Add two new foundation modules:

- MLXAudioModules: reusable neural network building blocks (BiLSTM,
  WeightNormedConv, InstanceNorm, AdaIN, ResBlocks, SineGenerator, etc.)
  shared across TTS models

- MLXAudioG2P: clean-room grapheme-to-phoneme pipeline with CMUdict
  lexicon (BSD-2), text normalization, alignment, and extensible
  language pack architecture

Also updates CI workflow to use struct-based test filtering and
adds AGENTS.local.md with build/test conventions.
CMUDictLoader.load(from:) accepts directory URL instead of Bundle.module.
EnglishLanguagePack.withCMUDict(directory:) and G2PPipeline.english(cmuDictDirectory:)
take directory URL. Resources uploaded to beshkenadze/cmudict-ipa on HuggingFace.
Tests guarded with MLXAUDIO_CMUDICT_DIR env var.
Copilot AI review requested due to automatic review settings March 23, 2026 08:32

Copilot AI left a comment

Pull request overview

Adds a new multilingual neural G2P engine (MLXAudioNeuralG2P) built around a ByT5-style T5 encoder/decoder in MLX, plus foundational/shared modules (MLXAudioModules) and a text-only G2P pipeline (MLXAudioG2P) with tests and CI updates.

Changes:

  • Introduces MLXAudioNeuralG2P (tokenizer, T5 model, weight loading/sanitization, greedy decoding) and a NeuralPhonemizer adapter conforming to Phonemizing.
  • Adds MLXAudioG2P pipeline (normalization, tokenization, lexicon + fallback phonemizer, optional alignment) with unit tests.
  • Updates package products/targets and CI workflow to build/test via xcodebuild.
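
For readers unfamiliar with the greedy decoding mentioned above, here is a model-agnostic sketch. The `nextLogits` closure is a hypothetical stand-in for one decoder step and is not the PR's API; the real loop in G2P.swift runs on MLX arrays with a KV cache:

```swift
/// Greedy decoding: pick the highest-scoring token at each step until EOS or a length cap.
/// `nextLogits` receives the decoder input so far and returns one score per vocabulary entry.
func greedyDecode(
    startToken: Int,
    eosToken: Int,
    maxLength: Int,
    nextLogits: ([Int]) -> [Float]
) -> [Int] {
    var tokens = [startToken]  // decoder input, seeded with the start token
    var output: [Int] = []

    while output.count < maxLength {
        let logits = nextLogits(tokens)
        // Argmax over the vocabulary; stop if the step produced no scores.
        guard let best = logits.indices.max(by: { logits[$0] < logits[$1] }) else { break }
        if best == eosToken { break }
        output.append(best)
        tokens.append(best)
    }
    return output
}
```

Greedy search is the simplest strategy and is deterministic, which suits G2P where a single pronunciation per input is expected.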

Reviewed changes

Copilot reviewed 53 out of 54 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| Tests/MLXAudioNeuralG2PTests.swift | Unit + guarded network integration tests for tokenizer, config, weights, and neural G2P. |
| Tests/MLXAudioG2PTextNormalizerTests.swift | Tests for normalization + tokenization basics. |
| Tests/MLXAudioG2PSmokeTests.swift | Basic pipeline smoke coverage (module import + convert behavior). |
| Tests/MLXAudioG2PLexiconTests.swift | Tests for lexicon lookup and fallback behavior. |
| Tests/MLXAudioG2PCMUDictTests.swift | Tests for CMUdict parsing/loading and ARPAbet→IPA mapping. |
| Tests/MLXAudioG2PAlignmentTests.swift | Tests for heuristic token↔phoneme alignment behavior. |
| Sources/MLXAudioNeuralG2P/Weights.swift | Loads safetensors, sanitizes HF keys to MLX module keys, updates/freeze model. |
| Sources/MLXAudioNeuralG2P/Tokenizer.swift | Byte-level ByT5 tokenizer (no vocab file). |
| Sources/MLXAudioNeuralG2P/RelativePositionBias.swift | Relative position bucket + bias embeddings for T5 attention. |
| Sources/MLXAudioNeuralG2P/NeuralPhonemizer.swift | Phonemizing adapter around neural G2P output. |
| Sources/MLXAudioNeuralG2P/Model.swift | T5 conditional generation wrapper (encoder/decoder + LM head tying). |
| Sources/MLXAudioNeuralG2P/G2P.swift | Public G2P API + greedy decoding loop. |
| Sources/MLXAudioNeuralG2P/FeedForward.swift | T5 gated-GELU feed-forward block. |
| Sources/MLXAudioNeuralG2P/EncoderLayer.swift | Encoder layer: attention + FFN with RMSNorm and residuals. |
| Sources/MLXAudioNeuralG2P/Encoder.swift | Encoder stack + shared relative attention bias module. |
| Sources/MLXAudioNeuralG2P/DecoderLayer.swift | Decoder layer: self-attn + cross-attn + FFN with caching. |
| Sources/MLXAudioNeuralG2P/Decoder.swift | Decoder stack + causal/self mask construction + KV cache plumbing. |
| Sources/MLXAudioNeuralG2P/Config.swift | Codable T5 config loader from config.json. |
| Sources/MLXAudioNeuralG2P/Attention.swift | Multi-head attention implementation with KV caching. |
| Sources/MLXAudioModules/WeightNormedConv.swift | Weight-normalized conv helper module. |
| Sources/MLXAudioModules/Utilities.swift | Shared utility: 1D interpolation. |
| Sources/MLXAudioModules/UpSample1d.swift | Upsampling helper module. |
| Sources/MLXAudioModules/SineGenerator.swift | Sine + noise source generation utilities. |
| Sources/MLXAudioModules/ResidualBlocks.swift | AdaIN/AdaIN-Snake style residual blocks using shared modules. |
| Sources/MLXAudioModules/Normalization.swift | InstanceNorm, AdaIN, and AdaLayerNorm implementations. |
| Sources/MLXAudioModules/LinearNorm.swift | Linear wrapper module with named key mapping. |
| Sources/MLXAudioModules/BiLSTM.swift | BiLSTM building block (manual gate math). |
| Sources/MLXAudioG2P/Tokenization/TextTokenizer.swift | Deterministic tokenizer into word/punct/whitespace tokens with ranges. |
| Sources/MLXAudioG2P/Tokenization/G2PToken.swift | Token model including kind + normalized-text range. |
| Sources/MLXAudioG2P/TextNormalization/TextNormalizer.swift | Normalizer applying a list of rules. |
| Sources/MLXAudioG2P/TextNormalization/NormalizationRule.swift | Rule representation + English default rules (quotes/dashes/whitespace). |
| Sources/MLXAudioG2P/README.md | Module README documenting scope and usage. |
| Sources/MLXAudioG2P/Pipeline/G2PPipeline.swift | Main pipeline: normalize → tokenize → lexicon-first → fallback → optional alignment. |
| Sources/MLXAudioG2P/Pipeline/G2POutput.swift | Pipeline output structure (text/tokens/phonemes/alignment). |
| Sources/MLXAudioG2P/Pipeline/G2PInput.swift | Input struct (text/locale/alignment flag). |
| Sources/MLXAudioG2P/Pipeline/G2PError.swift | Shared error enum for pipeline components. |
| Sources/MLXAudioG2P/Phonemes/PhonemeUnit.swift | Phoneme unit type. |
| Sources/MLXAudioG2P/Phonemes/PhonemeSequence.swift | Sequence container + render helper. |
| Sources/MLXAudioG2P/MLXAudioG2P.swift | Module namespace + version string. |
| Sources/MLXAudioG2P/Lexicon/LexiconProviding.swift | Lexicon protocol. |
| Sources/MLXAudioG2P/Lexicon/LexiconEntry.swift | Lexicon entry model. |
| Sources/MLXAudioG2P/Lexicon/InMemoryLexicon.swift | In-memory case-insensitive lexicon implementation. |
| Sources/MLXAudioG2P/Lexicon/CMUDict/CMUDictParser.swift | CMUdict line/text parsing into raw entries. |
| Sources/MLXAudioG2P/Lexicon/CMUDict/CMUDictLoader.swift | Loads cmudict.dict from directory into lexicon (ARPAbet→IPA). |
| Sources/MLXAudioG2P/Lexicon/CMUDict/ARPAbetMapper.swift | ARPAbet→IPA mapping + stress handling. |
| Sources/MLXAudioG2P/Languages/English/EnglishLanguagePack.swift | English defaults + CMUdict-backed factory. |
| Sources/MLXAudioG2P/Fallback/FallbackPhonemizer.swift | Rule-based fallback phonemizer + Phonemizing protocol. |
| Sources/MLXAudioG2P/Alignment/TokenAlignment.swift | Alignment model mapping tokenIndex→phonemeRange. |
| Sources/MLXAudioG2P/Alignment/TokenAligning.swift | Alignment protocol. |
| Sources/MLXAudioG2P/Alignment/HeuristicTokenAligner.swift | Simple heuristic aligner implementation. |
| Package.swift | Adds new library products/targets and wires them into umbrella MLXAudio + tests. |
| Package.resolved | Updates pinned dependency revisions/versions. |
| AGENTS.local.md | Documents repo-specific build/test conventions and CI constraints. |
| .github/workflows/tests.yaml | Simplifies CI to a single xcodebuild test run (skipping SmokeTests). |

Comment on lines +26 to +37
    public func phonemize(_ grapheme: String) throws -> [PhonemeUnit] {
        let ipa = g2p.convert(grapheme, language: language)

        guard !ipa.isEmpty else {
            throw G2PError.phonemizationFailed(
                token: grapheme,
                reason: "Neural model returned empty output"
            )
        }

        return ipa.map { PhonemeUnit(symbol: String($0)) }
    }

Copilot AI Mar 23, 2026

phonemize converts the model output string into one PhonemeUnit per Character, which will split multi-character phonemes (e.g. "tʃ", "oʊ") and can also emit whitespace/newline units. This is inconsistent with the rest of MLXAudioG2P where a unit is a phoneme token. Consider trimming the decoded string and tokenizing it (e.g., splitting on whitespace and filtering empties) before building [PhonemeUnit], and treat empty/whitespace-only output as failure.
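
The suggestion above could look roughly like the following sketch. It assumes the model emits whitespace-separated phoneme tokens, which this conversation does not confirm; `PhonemeUnitSketch` and `PhonemizeSketchError` are hypothetical stand-ins for the module's own types:

```swift
/// Hypothetical stand-in for the module's phoneme unit type.
struct PhonemeUnitSketch: Equatable {
    let symbol: String
}

enum PhonemizeSketchError: Error {
    case emptyOutput
}

/// Split decoded output on whitespace so multi-character phonemes stay intact.
/// Leading, trailing, and repeated whitespace produce no tokens, so
/// whitespace-only output is reported as a failure.
func phonemeUnits(fromDecoded decoded: String) throws -> [PhonemeUnitSketch] {
    let tokens = decoded
        .split(whereSeparator: { $0.isWhitespace })
        .map(String.init)
    guard !tokens.isEmpty else { throw PhonemizeSketchError.emptyOutput }
    return tokens.map { PhonemeUnitSketch(symbol: $0) }
}
```

If the model instead emits an unsegmented IPA string, whitespace splitting would yield one unit per word, so the right granularity depends on the model's output format.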

Comment thread Sources/MLXAudioNeuralG2P/Weights.swift Outdated
Comment on lines +36 to +58
    private static let ignoredSuffixes: [String] = [
        "decoder.layers.0.cross_attention.relative_attention_bias.embeddings.weight"
    ]

    static func sanitizeKey(_ key: String) -> String? {
        var key = key

        for (from, to) in sharedReplacements {
            key = key.replacingOccurrences(of: from, with: to)
        }

        if key.hasPrefix("encoder.") {
            for (from, to) in encoderReplacements {
                key = key.replacingOccurrences(of: from, with: to)
            }
        } else if key.hasPrefix("decoder.") {
            for (from, to) in decoderReplacements {
                key = key.replacingOccurrences(of: from, with: to)
            }
        }

        if ignoredSuffixes.contains(key) { return nil }
        return key

Copilot AI Mar 23, 2026

ignoredSuffixes only filters a single fully-qualified key (layer 0). If the source checkpoint contains the same cross_attention.relative_attention_bias... parameter for other blocks (or variants), sanitize will pass them through and update(verify: .noUnusedKeys) will fail due to unused keys. Consider ignoring by pattern (e.g., any key containing .cross_attention.relative_attention_bias.) rather than exact string match.
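
A pattern-based filter along the lines suggested here might look like this sketch. The names `ignoredKeyFragments` and `filterIgnored` are hypothetical, and the non-ignored key in the usage note is made up for illustration:

```swift
import Foundation

/// Hypothetical fragment list: any key containing one of these substrings is dropped,
/// regardless of which decoder layer it belongs to.
let ignoredKeyFragments = [".cross_attention.relative_attention_bias."]

/// Returns nil for keys that should be skipped, otherwise the key unchanged.
func filterIgnored(_ key: String) -> String? {
    ignoredKeyFragments.contains(where: { key.contains($0) }) ? nil : key
}
```

With this, `decoder.layers.3.cross_attention.relative_attention_bias.embeddings.weight` is filtered just like the layer 0 key, while an unrelated key such as `decoder.layers.0.self_attention.q.weight` passes through.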

Comment on lines +69 to +74
    let contextPosition = MLXArray(
        Array(offset ..< (offset + queryLength))
    ).expandedDimensions(axis: 1)
    let memoryPosition = MLXArray(
        Array(0 ..< keyLength)
    ).expandedDimensions(axis: 0)

Copilot AI Mar 23, 2026

RelativePositionBias.callAsFunction builds Swift Arrays for positions on every call. This allocates/copies on the CPU and can become a bottleneck during autoregressive decoding. Prefer MLX.arange(...) (as used elsewhere in the repo) to build these tensors directly on the MLX side.
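
For context on what these two position tensors feed into, here is a scalar, plain-Swift sketch of T5-style relative position bucketing. It assumes the reference T5 scheme (relative position is key index minus query index, 32 buckets, maximum distance 128) and is for illustration only; it is not the PR's implementation, which operates on MLX arrays:

```swift
import Foundation

/// Scalar T5-style bucketing. `relativePosition` is key position minus query position.
func relativePositionBucket(
    _ relativePosition: Int,
    bidirectional: Bool,
    numBuckets: Int = 32,
    maxDistance: Int = 128
) -> Int {
    var buckets = numBuckets
    var offset = 0
    let distance: Int

    if bidirectional {
        // Half the buckets cover keys after the query, half cover keys at or before it.
        buckets /= 2
        if relativePosition > 0 { offset = buckets }
        distance = abs(relativePosition)
    } else {
        // Causal attention only looks backwards; future keys collapse to distance 0.
        distance = max(-relativePosition, 0)
    }

    // Small distances get one bucket each; larger ones share log-spaced buckets.
    let maxExact = buckets / 2
    if distance < maxExact { return offset + distance }

    let scaled = log(Double(distance) / Double(maxExact))
        / log(Double(maxDistance) / Double(maxExact))
        * Double(buckets - maxExact)
    return offset + min(maxExact + Int(scaled), buckets - 1)
}

/// Relative positions for `queryLength` queries starting at `offset` against `keyLength` keys.
func relativePositions(offset: Int, queryLength: Int, keyLength: Int) -> [[Int]] {
    (offset ..< offset + queryLength).map { query in
        (0 ..< keyLength).map { key in key - query }
    }
}
```

During cached decoding `queryLength` is 1 and `offset` is the number of tokens already generated, so the grid is a single row; building it directly on the MLX side avoids a host allocation at every step.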

Comment thread AGENTS.local.md Outdated
```

Available schemes: MLXAudio, MLXAudio-Package, MLXAudioCodecs, MLXAudioCore,
MLXAudioG2P, MLXAudioModules, MLXAudioLID, MLXAudioSTS, MLXAudioSTT,

Copilot AI Mar 23, 2026

The “Available schemes” list doesn’t include the newly-added MLXAudioNeuralG2P scheme/target, which may confuse contributors following this doc. Consider updating the scheme list to include it (or clarifying how schemes are generated).

Suggested change
- MLXAudioG2P, MLXAudioModules, MLXAudioLID, MLXAudioSTS, MLXAudioSTT,
+ MLXAudioG2P, MLXAudioNeuralG2P, MLXAudioModules, MLXAudioLID, MLXAudioSTS, MLXAudioSTT,

@beshkenadze
Contributor Author

All 4 Copilot comments addressed in e4ba855:

  1. NeuralPhonemizer — Added whitespace trimming + filtering before building PhonemeUnits. Per-character splitting is consistent with CMUDict IPA output in this codebase.
  2. Weights — Changed from exact key match to pattern-based filtering (.cross_attention.relative_attention_bias.) for safety across all layers.
  3. RelativePositionBias — Replaced Swift Array allocation with MLXArray range construction directly.
  4. AGENTS.local.md — Added MLXAudioNeuralG2P to scheme list.

Local development notes — should not be tracked in version control.
ByT5-based neural G2P supporting 100+ languages via
beshkenadze/g2p-multilingual-byT5-tiny-mlx (20.8M params, MIT).

Provides NeuralPhonemizer conforming to the Phonemizing protocol
from MLXAudioG2P for seamless integration as a fallback phonemizer.
…nored keys, use MLX range, update scheme list
@beshkenadze
Contributor Author

Restructuring into cleaner PRs
