Changes for basic LLaDA style diffusion masking support #238

Draft

gopeshh wants to merge 2 commits into main from gopeshh/masked_diffusion

+620 −28

Collaborator

gopeshh commented Apr 21, 2025

✨ Description

Cleaned up the code a bit:

- Added a Diffusion config object, as we discussed
- Removed noise schedules for v1
- Moved the loss calculation to head.py (I noticed the language modelling loss is computed there)
- Moved bidirectional attention to preprocessing.py, since the attention mask seems to be computed there

This is still a WIP, of course, but feel free to leave comments and suggestions.

These changes address this comment on PR #208: #208 (comment)


          changes for basic LLaDA style diffusion masking support

db28a11

gopeshh requested a review from tscholak

April 21, 2025 12:12


          tests for masking and MLM loss

3d44671

PierreAndreNoel reviewed

PierreAndreNoel left a comment

This is just quick feedback, as I am very busy with other things. Please remind me to come back here next week and I'll dig in deeper.

fast_llm/layers/transformer/preprocessing.py


		t = torch.rand(batch_size, device=device)

		p_mask = (1 - diffusion_config.epsilon) * t + diffusion_config.epsilon

PierreAndreNoel Apr 22, 2025

Some questions/thoughts (I am just browsing quickly, and I am not looking at the paper right now):

- Why is the lower bound epsilon and the upper bound max_mask_prob?
- My gut tells me you never want the mask probability to be exactly 1, for the same kind of reasons you don't want it to be exactly 0.
- Clamping with torch.min puts a discrete probability mass on p_mask being exactly max_mask_prob.
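
To illustrate the last point, here is a minimal, dependency-free sketch in plain Python rather than torch. The values of `epsilon` and `max_mask_prob` are assumed for illustration, and `p_mask_rescaled` is a hypothetical alternative, not something the PR contains. It compares clamping from above against linearly rescaling `t` into the interval between `epsilon` and `max_mask_prob`.

```python
import random


def p_mask_clamped(t, epsilon, max_mask_prob):
    # Mirrors the approach under review: affine map of t into [epsilon, 1],
    # then clamp from above with min().
    return min((1 - epsilon) * t + epsilon, max_mask_prob)


def p_mask_rescaled(t, epsilon, max_mask_prob):
    # Hypothetical alternative: affine map of t straight into
    # [epsilon, max_mask_prob), so no value is clamped.
    return (max_mask_prob - epsilon) * t + epsilon


rng = random.Random(0)
epsilon, max_mask_prob = 1e-3, 0.9  # assumed values, for illustration only
ts = [rng.random() for _ in range(10_000)]

clamped = [p_mask_clamped(t, epsilon, max_mask_prob) for t in ts]
rescaled = [p_mask_rescaled(t, epsilon, max_mask_prob) for t in ts]

# Fraction of samples that land exactly on the upper bound.
frac_clamped = sum(p == max_mask_prob for p in clamped) / len(ts)
frac_rescaled = sum(p == max_mask_prob for p in rescaled) / len(ts)
print(f"clamped:  {frac_clamped:.3f} of samples equal max_mask_prob")
print(f"rescaled: {frac_rescaled:.3f} of samples equal max_mask_prob")
```

With clamping, every `t` above roughly `(max_mask_prob - epsilon) / (1 - epsilon)` collapses onto the same value, which is the point mass described above. Whether that is acceptable depends on what the paper prescribes.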

fast_llm/layers/transformer/preprocessing.py


		masked_indices = torch.rand((batch_size, seq_len), device=device) < p_mask

		if diffusion_config.pad_prob > 0:

PierreAndreNoel Apr 22, 2025

Meta: I currently can't comment about padding; it will have to wait for next week, as I need to re-read the paper better (our own work doesn't do padding).

fast_llm/layers/transformer/preprocessing.py

+                      p_mask = torch.min(p_mask, torch.tensor(diffusion_config.max_mask_prob))
+                      p_mask = p_mask[:, None].expand(-1, seq_len)
+                      masked_indices = torch.rand((batch_size, seq_len), device=device) < p_mask

PierreAndreNoel Apr 22, 2025

Assuming True means "masked".

fast_llm/layers/transformer/preprocessing.py

+                          attention_mask = torch.ones((batch_size, 1, seq_len, seq_len), device=device, dtype=torch.bool)
+                      else:
+                          # Causal attention
+                          attention_mask = torch.ones((batch_size, 1, seq_len, seq_len), device=device, dtype=torch.bool).tril_()

PierreAndreNoel Apr 22, 2025

My understanding is that you never want such a triangular causal attention, as this would give a strictly worse model than an autoregressive model.

Suppose that, at inference, tokens are unmasked in the order (4, 2, 3, 0, 1). Token 4 is unmasked first, but this triangular matrix prevents all other tokens from ever "seeing" it.

The closest variant that makes sense to me would be to permute the rows and columns of the triangular matrix using (4, 2, 3, 0, 1), so that token 2 can see token 4, token 3 can see tokens 2 and 4, etc.
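
A small sketch of that permutation idea, in plain Python with nested lists instead of torch tensors. The helper names `causal_mask`, `permuted_causal_mask` and `visible` are hypothetical and not part of the PR. It assumes that `True` at `[i][j]` means query token `i` may attend to key token `j`.

```python
def causal_mask(seq_len):
    # Standard lower-triangular mask: query i may attend to key j iff j <= i.
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def permuted_causal_mask(unmask_order):
    # rank[token] is the step at which that token is unmasked.
    rank = {token: step for step, token in enumerate(unmask_order)}
    n = len(unmask_order)
    # Query i may attend to key j iff j was unmasked no later than i.
    # This equals the triangular matrix with rows and columns permuted.
    return [[rank[j] <= rank[i] for j in range(n)] for i in range(n)]


def visible(mask, query):
    # Positions the given query token is allowed to attend to.
    return [key for key, allowed in enumerate(mask[query]) if allowed]


order = (4, 2, 3, 0, 1)
plain = causal_mask(len(order))
permuted = permuted_causal_mask(order)

for q in order:
    print(f"token {q}: plain sees {visible(plain, q)}, "
          f"permuted sees {visible(permuted, q)}")
```

In the plain triangular mask, token 4 is visible only to itself, while in the permuted mask every token unmasked later can attend to it.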

fast_llm/layers/transformer/preprocessing.py

+                      kwargs['masked_indices'] = masked_indices
+                      kwargs['p_mask'] = p_mask
+                      if self._config.diffusion.bidirectional_attention:

PierreAndreNoel Apr 22, 2025

You may want a string instead of a boolean, as there are many possible attention choices (e.g., blocks) that may come up. Also see the next comment below.
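
As a rough sketch of what a string-valued option could look like, using only the standard library: `AttentionPattern`, `build_mask` and the `block` variant are hypothetical names for illustration and do not reflect Fast-LLM's actual config system. The block variant shown here (full attention within a block, causal across blocks) is just one possible reading of "blocks".

```python
import enum


class AttentionPattern(str, enum.Enum):
    # Hypothetical option names; not part of the PR or of Fast-LLM.
    causal = "causal"
    bidirectional = "bidirectional"
    block = "block"


def build_mask(pattern, seq_len, block_size=2):
    # Accepts either a plain string or an AttentionPattern member;
    # unknown strings raise ValueError.
    pattern = AttentionPattern(pattern)
    if pattern is AttentionPattern.bidirectional:
        return [[True] * seq_len for _ in range(seq_len)]
    if pattern is AttentionPattern.causal:
        return [[j <= i for j in range(seq_len)] for i in range(seq_len)]
    # Block pattern (assumed semantics): full attention inside a block,
    # causal attention across blocks.
    return [
        [j // block_size <= i // block_size for j in range(seq_len)]
        for i in range(seq_len)
    ]


for name in ("causal", "bidirectional", "block"):
    print(name, build_mask(name, 4))
```

Adding a new pattern then means adding one enum member and one branch, instead of another boolean flag.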
