Llama3.1 #2132
base: master
Conversation
code fix
code fix
code fix
I'm surprised this is coming in as a separate arch. How hard would it be to consolidate and add new config options to the existing llama? Ideally a minor release like this is not an entire new set of symbols in the library.
The only difference is some changes in RoPE. Will fix!
I've removed that separate directory from keras-hub/models and made modifications to support Llama 3.1 weights using the old APIs. We only need to pass the RoPE-related parameters in the backbone.
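For reference, the difference shows up in the checkpoint config: the Llama 3.1 uploads on Hugging Face carry a `rope_scaling` block along these lines (values taken from the 8B config and shown here only for illustration), while Llama 3 checkpoints have no such block, so the rest of the architecture can stay shared.

```python
# Illustrative excerpt of a Llama 3.1 config.json, written as a Python dict.
rope_scaling = {
    "rope_type": "llama3",
    "factor": 8.0,
    "low_freq_factor": 1.0,
    "high_freq_factor": 4.0,
    "original_max_position_embeddings": 8192,
}
```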
Thanks for the PR! Did a quick review.
- Let's fill up the arg descriptions in the doc-string.
- We need better argument names. `old_context_len` is a bit awkward.
- Also, can the `if all(...)` condition be simplified? Alternatively, is there a way to set default values such that we don't need the if condition at all?

Pinging @mattdangerw for a review here as well. Wonder if the added ops are too custom (like, for example, `old_context_len` is weird) for the `RotaryEmbedding` layer, and they warrant a custom layer for Llama 3.1 after all.
    **kwargs,
):
    super().__init__(**kwargs)
    self.max_wavelength = max_wavelength
    self.sequence_axis = sequence_axis
    self.feature_axis = feature_axis
    self.scaling_factor = scaling_factor
    self.llow_freq_factor = low_freq_factor
Why `self.llow_freq_factor` and not `self.low_freq_factor`?
That was a typo.
inverse_freq = ops.where(
    is_medium_freq, smoothed_inv_freq, inverse_freq
)
ops.cast(inverse_freq, "float32")
Should this be `inverse_freq = ops.cast(inverse_freq, "float32")`? Or is this line meant to be removed?
removed this one, and verified.
@@ -66,13 +68,19 @@ def __init__(
    scaling_factor=1.0,
    sequence_axis=1,
    feature_axis=-1,
    low_freq_factor=None,
    high_freq_factor=None,
    old_context_len=None,
This naming is a bit awkward. I don't know what to call it either. Pinging @mattdangerw / @divyashreepathihalli, who might have a better idea.
old_context_len = self.old_context_len
low_freq_factor = self.llow_freq_factor
high_freq_factor = self.high_freq_factor
Let's just re-use `self.{variable}` instead of defining new variables?
modified this way
low_freq_factor=None,
high_freq_factor=None,
old_context_len=None,
Need to add description of all these arguments in the doc-string.
done
Yeah, can you explain more of what's going on here? What is "old context length"? Let's probably rename this.
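For background, a sketch of the reference algorithm (not the exact code in this PR): `old_context_len` is the original pre-training context length, 8192 for Llama 3, and it sets the wavelength thresholds between which the inverse frequencies are interpolated. Argument names below follow this PR; in the Hugging Face config the value appears as `original_max_position_embeddings`.

```python
import math

from keras import ops


def scale_inverse_freq(
    inverse_freq,
    factor=8.0,
    low_freq_factor=1.0,
    high_freq_factor=4.0,
    old_context_len=8192,
):
    """Llama 3.1 RoPE frequency adjustment (reference sketch)."""
    # Wavelength of each rotary component, and the two thresholds derived
    # from the original pre-training context length.
    wavelen = 2.0 * math.pi / inverse_freq
    low_freq_wavelen = old_context_len / low_freq_factor
    high_freq_wavelen = old_context_len / high_freq_factor
    # Long-wavelength (low frequency) components are divided by `factor`;
    # short-wavelength (high frequency) components are left untouched.
    scaled = ops.where(
        ops.greater(wavelen, low_freq_wavelen),
        inverse_freq / factor,
        inverse_freq,
    )
    # Components between the two thresholds are smoothly interpolated.
    smooth = (old_context_len / wavelen - low_freq_factor) / (
        high_freq_factor - low_freq_factor
    )
    smoothed = (1.0 - smooth) * inverse_freq / factor + smooth * inverse_freq
    is_medium = ops.logical_and(
        ops.greater_equal(wavelen, high_freq_wavelen),
        ops.less_equal(wavelen, low_freq_wavelen),
    )
    return ops.where(is_medium, smoothed, scaled)
```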
) * inverse_freq / factor + smooth_factor * inverse_freq
is_medium_freq = ops.logical_and(
    ops.cast(
        ops.greater_equal(wavelen, high_freq_wavelen), dtype="int8"
Does `dtype=bool` not work here? Why do we need to cast to `int8`?
Earlier we were using `ops.bitwise_and` and that was not working; it required an integer data type. Will need to check for `ops.logical_and`.
modified with bool
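If it helps, a quick check of the kind being discussed (assuming standard `keras.ops` behavior across backends): `ops.logical_and` accepts boolean tensors directly, so no integer cast is needed once `ops.bitwise_and` is out of the picture.

```python
from keras import ops

wavelen = ops.arange(1.0, 5.0)
is_medium_freq = ops.logical_and(
    ops.greater_equal(wavelen, 2.0),  # boolean tensor
    ops.less_equal(wavelen, 4.0),     # boolean tensor
)
# The boolean mask can be fed straight into ops.where, no int8 cast needed.
result = ops.where(is_medium_freq, wavelen, wavelen * 0.0)
```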
        self.old_context_len,
    )
):
    factor = self.scaling_factor
Is this the same as the scaling factor used to scale positions: `positions = positions / ops.cast(self.scaling_factor, "float32")`? Do we need a separate argument for this particular scaling factor?
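A side note on why the two factors look similar (hedged reasoning, not from the PR itself): RoPE angles are the outer product of positions and inverse frequencies, so dividing positions by a factor is numerically the same as dividing every inverse frequency by it. The Llama 3.1 `factor`, however, only rescales the low-frequency components and interpolates the rest, so it cannot be folded into the plain position scaling.

```python
from keras import ops

scaling_factor = 8.0
positions = ops.arange(0.0, 4.0)
inverse_freq = ops.convert_to_tensor([1.0, 0.1, 0.01])

# Uniform position scaling...
angles_a = ops.expand_dims(positions / scaling_factor, -1) * inverse_freq
# ...equals uniform frequency scaling, element for element.
angles_b = ops.expand_dims(positions, -1) * (inverse_freq / scaling_factor)
```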
    inverse_freq,
)

# otherwise: interpolate between the two, using a smooth factor
Why "otherwise"? Interpolation is happening irrespective of what you're doing above.
inverse_freq = ops.where(
    ops.greater(wavelen, low_freq_wavelen),
    inverse_freq / factor,
Wherever tensors are involved, prefer using `keras.ops.{op}`. So, `ops.divide(...)` here.
Is it required for operations between 2 tensors, or also for 1 scalar and 1 tensor?
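A small illustration of the question (assuming standard `keras.ops` behavior): both spellings work for a tensor and a Python scalar, since backend tensors overload the arithmetic operators, so the preference for `ops.divide` is mainly a style and consistency call.

```python
from keras import ops

inverse_freq = ops.arange(1.0, 5.0)
factor = 8.0

a = inverse_freq / factor             # operator form, tensor / scalar
b = ops.divide(inverse_freq, factor)  # explicit op form, same result
```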
if all(
    x is not None
    for x in (
        self.llow_freq_factor,
        self.high_freq_factor,
        self.old_context_len,
    )
):
Hmmm, is there a better way of specifying this condition?
We need to have a discussion on it. Either we can pass a boolean flag for the Llama 3.1-specific use case, or handle it another way. Currently these 3 parameters were newly introduced for Llama 3.1, and a user might provide a value for only one of them; then we don't know how to use it.
Typo fix
Removed repeated declarations of variables.
Colab Notebooks:
from keras_hub.src.api_export import keras_hub_export


@keras_hub_export("keras_hub.layers.LlamaRotaryEmbedding")
Sorry, just noticed we are adding this to `keras_hub.layers`. We should not. If we have a generic layer used by a few different models, we can put it here with a generic name.
But in this case, we should move this into llama, and not expose it (like all the other llama layers).
Moved it to llama.
args_none = [x is None for x in grouped_args]
if any(args_none) and not all(args_none):
    raise ValueError(
        "Either all of ... should be set, or none of ... should be set"
You need to actually fill this in. This is user facing!
done
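For illustration, a filled-in, user-facing message might read along these lines (hypothetical wording and helper, not the exact text that landed in the PR):

```python
def check_rope_scaling_args(low_freq_factor, high_freq_factor, old_context_len):
    """Hypothetical helper mirroring the any()/all() check in the diff above."""
    grouped_args = (low_freq_factor, high_freq_factor, old_context_len)
    args_none = [x is None for x in grouped_args]
    if any(args_none) and not all(args_none):
        raise ValueError(
            "Either all of `low_freq_factor`, `high_freq_factor` and "
            "`old_context_len` must be set, or none of them. Received: "
            f"low_freq_factor={low_freq_factor}, "
            f"high_freq_factor={high_freq_factor}, "
            f"old_context_len={old_context_len}."
        )
```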
}
if transformers_config.get("rope_scaling", None) is not None:
You might want to add some validation here for weirder llama uploads. E.g. inside the if block: `if transformers_config["rope_type"] != "llama3": raise ValueError(help message)`.
included this check, inside the if block.
This layer encodes absolute positional information with a rotation
matrix. It calculates the rotary encoding with a mix of sine and
cosine functions with geometrically increasing wavelengths.
Defined and formulated in
[RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864v4).
The input must be a tensor with shape a sequence dimension and a feature
dimension. Typically, this will either an input with shape
`(batch_size, sequence_length, feature_length)` or
`(batch_size, sequence_length, num_heads, feature_length)`.
This layer will return a new tensor with the rotary embedding applied to
the input tensor.
This reads the same as https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/layers/modeling/rotary_embedding.py#L11-L21. We should probably add a line on how it is different from the original layer.
I've added a line for it.
@keras_hub_export("keras_hub.layers.LlamaRotaryEmbedding")
class LlamaRotaryEmbedding(keras.layers.Layer):
Quick question - why can we not subclass the original RoPE layer, and override just `__init__` and the `_get_inv_freq` function? We don't need to copy the whole thing over, right?
Done, passed `position_scaling_factor` as `scaling_factor` to the `RotaryEmbedding` class through `super().__init__()`.
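A sketch of what that subclassing approach can look like (illustrative only: the argument names here are placeholders, the base-layer signature is assumed to be `max_wavelength`/`scaling_factor`/`sequence_axis`/`feature_axis`, and the overridden hook should match whatever the base `RotaryEmbedding` actually names it, e.g. the `_get_inv_freq` mentioned above):

```python
from keras_hub.layers import RotaryEmbedding


class LlamaRotaryEmbedding(RotaryEmbedding):
    def __init__(
        self,
        max_wavelength=10000,
        position_scaling_factor=1.0,
        frequency_adjustment_factor=None,
        low_freq_factor=None,
        high_freq_factor=None,
        pretraining_sequence_length=None,
        **kwargs,
    ):
        # Delegate plain position scaling to the base layer.
        super().__init__(
            max_wavelength=max_wavelength,
            scaling_factor=position_scaling_factor,
            **kwargs,
        )
        self.frequency_adjustment_factor = frequency_adjustment_factor
        self.low_freq_factor = low_freq_factor
        self.high_freq_factor = high_freq_factor
        self.pretraining_sequence_length = pretraining_sequence_length

    def _get_inverse_freq(self, rotary_dim):
        # Override only the frequency computation; the method name must
        # match the base layer's hook (check the base class before copying).
        inverse_freq = super()._get_inverse_freq(rotary_dim)
        if self.low_freq_factor is not None:
            # Apply the Llama 3.1 adjustment from the earlier sketch.
            inverse_freq = scale_inverse_freq(
                inverse_freq,
                factor=self.frequency_adjustment_factor,
                low_freq_factor=self.low_freq_factor,
                high_freq_factor=self.high_freq_factor,
                old_context_len=self.pretraining_sequence_length,
            )
        return inverse_freq
```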
@@ -40,10 +40,18 @@ class LlamaBackbone(Backbone):
        a three-layer feedforward network for each transformer.
    num_key_value_heads (int): The number of key and value attention heads
        for each transformer.
    rope_max_wavelength (int, optional): The maximum angular wavelength of
    rope_max_wavelength: (int, optional): The maximum angular wavelength of
Nit: this is not how we specify args. It should be: `rope_max_wavelength: int. The maximum .... Defaults to None.`
@abheesht17
This layer will return a new tensor with the rotary embedding applied to
the input tensor.
It is extended from `RotaryEmbedding` layer in `keras_hub.layers`.
It has additional smoothning and interpolation for some frequency ranges.
smoothning -> smoothening
}

if transformers_config.get("rope_scaling", None) is not None:
    if transformers_config["rope_scaling"]["rope_type"] != "llama3":
        raise ValueError("The config shall be a valid llama3 config.")
shall -> should?
@pctablet505 - can you please make this one last change? Thanks!
def __init__(
    self,
    max_wavelength=10000,
Need to pass this to the superclass
this was done
    calculation of rotary embedding. Defaults to `1.0`
rope_frequency_adjustment_factor: flaot. The scaling factor
    used to scale the inverse frequencies. Defaults to `None`.
rope_low_freq_factor: flaot. The low frequency scaling
flaot --> float. Here, and other places too.
LGTM, thanks!
Did you fix the `from_preset()` issue you were facing earlier?
Added Support for Llama3.1
Notebook to verify numerics using repo
Notebook for actual numeric verification and code modifications
The tokenizer part still remains to be done.