Add Moonshine to KerasHub #2093
base: master
Conversation
Thank you for the PR! I left some initial comments.
I would suggest following the format, structure, and naming conventions of the Whisper model here: https://github.com/keras-team/keras-hub/tree/master/keras_hub/src/models/whisper
- add docstrings
- convert the backbone to a functional model (see the sketch below)
- add a moonshine_audio_converter.py
- add a numerics verification colab to verify the implementation
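For context on the functional-model suggestion: a functional backbone wires symbolic inputs through layers into a `keras.Model` rather than overriding `call`. A minimal sketch of the pattern (dimensions and layer choices are hypothetical, not Moonshine's actual architecture):

```python
import keras

# Hypothetical dimension for illustration only.
hidden_dim = 256

# Functional style: declare symbolic inputs, chain layers over them,
# and hand the resulting tensors to keras.Model.
encoder_input = keras.Input(shape=(None, hidden_dim), name="encoder_input")
x = keras.layers.LayerNormalization()(encoder_input)
x = keras.layers.Dense(hidden_dim, activation="gelu")(x)
encoder_output = keras.layers.Dense(hidden_dim)(x)

backbone = keras.Model(
    inputs=encoder_input,
    outputs=encoder_output,
    name="toy_functional_backbone",
)
backbone.summary()
```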
Will make the changes at the earliest, thanks for the review!
You will need to run shell/api_gen.sh and shell/format.sh at the repo root to resolve the code formatting error.
Thanks for the review; I've made the changes! The build issue still persists.
Summary of Changes:
TODO:
Status of the PR: Outputs of the MD5 Checksum Comparison
What does librosa buy us? We definitely can't add it as a hard dependency. Most people using KerasHub will not be using it for audio modeling today, so adding a hard dep for them will just create headaches. However, we could add a conditional import of librosa in our API -- attempt to use Moonshine in KerasHub, get an error message asking for librosa. And of course if we just need this for data preparation outside of our API, that is easiest: just use librosa in any guides and examples (but ultimately leave it up to the user).
@mattdangerw, that sounds about right; a conditional import seems like the way to go.
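A conditional import along the lines discussed might look like this sketch (the `load_audio` helper name is illustrative, not an actual KerasHub API):

```python
# Sketch of a conditional dependency: librosa is only required when the
# audio path is actually exercised, never at package import time.
try:
    import librosa
except ImportError:
    librosa = None


def load_audio(path, sample_rate=16000):
    """Load an audio file, erroring helpfully if librosa is missing."""
    if librosa is None:
        raise ImportError(
            "Moonshine audio preprocessing requires the `librosa` package. "
            "Please install it with `pip install librosa`."
        )
    waveform, _ = librosa.load(path, sr=sample_rate)
    return waveform
```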
…kbone to fix test cases
… dict for inputs in the backbone and task model
…sary classes, and improve test robustness
…taryEmbedding by using Constant initializer for inv_freq
- Added both preset configurations in the weights conversion script.
- Verified edge cases and error handling.
- Implemented attention/caching speedup for improved inference performance.
- Finalized API docstrings.
…lving issues with model saving
I've incorporated all review feedback and completed the following updates:
Classes Merged/Removed:
Additionally, there's a bug in Hugging Face's implementation of the tiny preset, which has been corrected in the KerasHub version. I suspect the issue lies in the application of rotary positional embeddings in the HF implementation. In any case, the following code cell demonstrates the bug, and the notebook examples for each backend show how the KerasHub implementation's tiny preset overcomes it for the same audio sample: Code cell showing HF bug for the tiny preset.

I've chosen five samples, including the buggy one mentioned above, that have been tested for all backends and both presets in the notebook. All bells and whistles are covered with the new test suites: caching, forward passes, standardized tests, training flows, and beyond. Notebook: Colab Notebook

I've also shared the model weights for each preset so you can test them yourself, just as a KerasHub user would (this is the end-to-end example you asked about, @divyashreepathihalli and @mattdangerw). I'd love to make a few updates based on your reviews and wrap up the project!
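For background on where such a bug can hide: rotary embeddings rotate each query/key feature pair by a position-dependent angle, so any off-by-one in the position indices or pairing order changes the outputs. A minimal NumPy sketch of the standard interleaved formulation (not the HF or KerasHub code):

```python
import numpy as np


def apply_rotary(x, base=10000.0):
    """Apply rotary position embeddings to x of shape (seq_len, dim).

    Each consecutive feature pair (x[2i], x[2i+1]) is rotated by
    angle position * base**(-2i / dim), so relative offsets become
    rotations in attention dot products.
    """
    seq_len, dim = x.shape
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)           # (dim/2,)
    angles = np.arange(seq_len)[:, None] * inv_freq[None, :]   # (seq_len, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    rotated = np.empty_like(x)
    rotated[:, 0::2] = x1 * cos - x2 * sin
    rotated[:, 1::2] = x1 * sin + x2 * cos
    return rotated
```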
@harshaljanjani Thanks! I'll take a look more closely soon, but I think at a high level this is too close to huggingface's abstractions and not quite congruous with KerasHub's yet. We probably want to keep this most closely modeled off the following workflow:

```python
audio_tensor = load_audio_tensor_with_any_lib()
audio_batch = ...  # With a batch dim.
audio_dataset = ...  # Paired audio tensors and strings as a tf.data.Dataset.

# Load model arch and preprocessing.
audio_to_text = keras_hub.models.AudioToText.from_preset(
    "moonshine_preset_name_blah"
)
# Equivalent, no auto class functionality.
audio_to_text = keras_hub.models.MoonshineAudioToText.from_preset(
    "moonshine_preset_name_blah"
)

# Direct string output!
audio_to_text.generate(audio_tensor)
# List of strings output!
audio_to_text.generate(audio_batch)

# Change the generation sampler and regenerate.
audio_to_text.compile(sampler="top_k")
audio_to_text.generate(audio_tensor)
audio_to_text.compile(sampler=keras_hub.samplers.Greedy())
audio_to_text.generate(audio_tensor)

# Fine-tune with a dataset!
audio_to_text.compile(optimizer=...)
audio_to_text.enable_lora(4)  # Optional.
audio_to_text.fit(audio_dataset)

# Set max sequence length for encoder and decoder inputs.
audio_to_text.preprocessor.encoder_sequence_length = 1024
audio_to_text.preprocessor.decoder_sequence_length = 512
audio_to_text.generate(audio_tensor)

# Strip preprocessing from the generate function entirely.
preprocessor = audio_to_text.preprocessor
audio_to_text.preprocessor = None
# Run preprocessing separately.
preprocessed_batch = preprocessor.generate_preprocess(audio_batch)
# Returns a token id tensor!
generated_batch = audio_to_text.generate(preprocessed_batch)
# Converts token ids back to strings!
preprocessor.generate_postprocess(generated_batch)
```

I'd maybe start by trying to move the generation from this huggingface port to our infra.
…date conversion script)
… changes related to self_attention_cache_update_index to compare implementations apples-to-apples
Please do look into the issue, thanks!
…the issue still persists
…asHub; issue still persists
… issue still persists
@harshaljanjani caching logic is complicated. It is out of scope for the maintainers to debug contribution code. Please take your time to debug the model to get outputs matching those of the HF model.
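A common way to localize such mismatches is to run identical inputs through both models and find the first layer whose activations diverge; a small helper sketch (the per-layer output lists are assumed to be collected by whoever is debugging):

```python
import numpy as np


def first_divergence(reference_outputs, ported_outputs, atol=1e-4):
    """Report the first layer where two models' activations diverge.

    Both arguments are same-length lists of per-layer numpy arrays,
    collected from the reference (HF) model and the port respectively.
    """
    for i, (ref, port) in enumerate(zip(reference_outputs, ported_outputs)):
        if not np.allclose(ref, port, atol=atol):
            max_err = np.max(np.abs(ref - port))
            print(f"Layer {i}: max abs error {max_err:.2e}")
            return i
    print("All layers match within tolerance.")
    return None
```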
… the PyTorch backend, integrated into the KerasHub infra!
Moonshine ASR Model Implementation in Keras
This PR introduces the Moonshine Automatic Speech Recognition (ASR) model into the Keras ecosystem. The Moonshine model, originally developed by UsefulSensors and available via Hugging Face, is a transformer-based architecture designed to transcribe audio inputs into text. This implementation ports the model into Keras, complete with support for pre-trained weights from Hugging Face.
Overview
The Moonshine ASR model employs an encoder-decoder architecture. The encoder processes audio features, while the decoder generates text transcriptions. This implementation includes custom layers and components to mirror the original model's behavior, validated against the Hugging Face version for accuracy.
Files Added
The following files have been added to implement the Moonshine ASR model:
- `moonshine_backbone.py` defines the `MoonshineBackbone` class, the core of the model. It integrates the encoder and decoder blocks, embeddings, and layer normalization, forming the complete encoder-decoder pipeline.
- `moonshine_decoder.py` contains the `MoonshineDecoderBlock` class, a custom decoder block with self-attention (causal), cross-attention, and feedforward layers. It supports caching for efficient generation and uses SwiGLU activation by default.
- `moonshine_encoder.py` implements the `MoonshineEncoderBlock` class, the encoder component with self-attention and feedforward layers. It optionally uses SwiGLU activation, matching the original model's configuration.
- `moonshine_multi_head_attention.py` provides a custom multi-head attention layer, the `MoonshineMultiHeadAttention` class, that is used in three ways: encoder self-attention, decoder causal self-attention, and decoder cross-attention.
- `moonshine_layers.py` includes utility layers:
  - `MoonshineRotaryEmbedding`: rotary positional embeddings with dynamic scaling support.
  - `MoonshineMLP`: can be configured to use SwiGLU activation for feedforward networks or as a linear layer with GeLU activation (see the sketch after this list).
- `moonshine_audio_converter.py` implements the `MoonshineAudioConverter` class, a specialized audio preprocessing layer that converts raw audio waveforms into feature representations suitable for the Moonshine ASR model. It includes downsampling, feature extraction, normalization, and handling of attention masks.
- `moonshine_tokenizer.py` provides the `MoonshineTokenizer` class, which extends the `LlamaTokenizer` to handle text tokenization for the Moonshine model. It incorporates Moonshine-specific special tokens, including position embedding tokens, hex tokens, and empty tokens, and manages the conversion between raw text and token IDs.
- `moonshine_audio_to_text.py` implements the `MoonshineAudioToText` class, a task model that extends the `Seq2SeqLM` base class. This class integrates the audio converter, backbone, and tokenizer components to create a complete end-to-end ASR pipeline. It includes methods for text generation from audio inputs, with support for customizable generation parameters and built-in trimming of output sequences.
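For reference, the SwiGLU activation mentioned above gates one linear projection with the SiLU of another before down-projecting. A minimal Keras 3 sketch of the idea (not the exact `MoonshineMLP` implementation):

```python
import keras
from keras import ops


class SwiGLUFeedForward(keras.layers.Layer):
    """Minimal SwiGLU block: silu(x @ W_gate) * (x @ W_up) @ W_down."""

    def __init__(self, hidden_dim, intermediate_dim, **kwargs):
        super().__init__(**kwargs)
        self.gate_proj = keras.layers.Dense(intermediate_dim)
        self.up_proj = keras.layers.Dense(intermediate_dim)
        self.down_proj = keras.layers.Dense(hidden_dim)

    def call(self, x):
        # The SiLU-activated gate modulates the up projection elementwise.
        return self.down_proj(ops.silu(self.gate_proj(x)) * self.up_proj(x))
```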
Weights Conversion Script
A conversion script maps the pretrained Hugging Face checkpoints onto the Keras `MoonshineBackbone` model.
Dependencies
Notes for Reviewers
All custom classes implement `get_config()` and are registered with `@keras.saving.register_keras_serializable`, ensuring compatibility with Keras model saving/loading.
Closes issue #2083.
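For reviewers unfamiliar with the pattern, a generic serializable custom layer looks like this sketch (the layer itself is a toy example, not part of this PR):

```python
import keras


@keras.saving.register_keras_serializable(package="moonshine_example")
class ScaledDense(keras.layers.Layer):
    """Toy custom layer that round-trips through model save/load."""

    def __init__(self, units, scale=1.0, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        self.scale = scale
        self.dense = keras.layers.Dense(units)

    def call(self, x):
        return self.dense(x) * self.scale

    def get_config(self):
        # Everything passed to __init__ is serialized, so
        # keras.models.load_model can rebuild the layer.
        config = super().get_config()
        config.update({"units": self.units, "scale": self.scale})
        return config
```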