Implementation of generate #222


Draft · wants to merge 18 commits into base: main

Conversation

@bigximik (Contributor) commented on Apr 3, 2025

✨ Description

Part of #217.

Closes #

🔍 Type of change

Select all that apply:

  • πŸ› Bug fix (non-breaking change that addresses a specific issue)
  • πŸš€ New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • πŸ“ˆ Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • πŸ› οΈ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • πŸ“¦ Dependency bump (updates dependencies, including Dockerfile or package changes)
  • πŸ“ Documentation change (updates documentation, including new content or typo fixes)
  • πŸ”§ Infrastructure/Build change (affects build process, CI/CD, or dependencies)

πŸ“ Changes

List the key changes introduced in this PR:

  1. Change A
  2. Change B

✅ Checklist

Make sure the following tasks are completed before submitting the PR:

General

  • 📜 I have read and followed the contributing guidelines.
  • 🏷️ I am using a clear and descriptive PR title that summarizes the key change or feature introduced.
  • 🎉 The functionality is complete, and I have tested the changes.
  • 📝 I have updated the documentation if needed.
  • ⚠️ The change does not introduce any new issues (e.g., runtime warnings, type checker errors, linting problems, unhandled edge cases).
  • 🧩 I have commented my code, especially in hard-to-understand areas.

Dependencies and Configuration

  • πŸ‹ I have updated the Docker configuration or dependencies, if applicable.
  • πŸ”„ I have ensured compatibility with the existing setup after dependency changes.

Testing

  • 🧪 I have added or updated tests to cover my changes.
  • ✔️ New and existing tests pass locally with my changes.
  • 🚦 I have tested these changes on GPUs and verified training stability.
  • 🏋️ I have tested the changes on realistic training workloads, if applicable.

Performance Impact

  • 📊 I have run benchmarks where applicable to evaluate the performance impact.
  • ✅ The benchmarks show no performance regression.
  • 🚀 The benchmarks indicate a potential performance improvement.
  • ⚠️ The benchmarks indicate a potential performance degradation.
  • 📈 I have provided benchmark results and detailed any performance impact below, if applicable.

📊 Performance Impact Details

If there is any impact on performance, describe it and provide benchmark results, if applicable:


πŸ—’οΈ Additional Notes

Include any additional context, information, or considerations here, such as known issues, follow-up tasks, or backward compatibility concerns.

@bigximik (Contributor, Author) commented on Apr 3, 2025

I have created a debugging sandbox with manual tests for now. The results are as follows:

Ignoring attention_mask and position_ids:

| Batch Size | No Flash Attention (Float32) | No Flash Attention (BF16) | Flash Attention (BF16) |
|---|---|---|---|
| 1 | Same output (same model via HF and Fast-LLM) | Same output | Different output |
| 2 | Different output | Different output | Different output |

Converting attention_mask (from HF forward) to sequence_lengths:

| Batch Size | No Flash Attention (Float32) | No Flash Attention (BF16) | Flash Attention (BF16) |
|---|---|---|---|
| 1 | Fast-LLM empty output | Fast-LLM empty output | Different output |
| 2 | Fast-LLM empty output | Fast-LLM empty output | Different output |

It seems sequence_lengths is not supported for fused attention and does not help with Flash Attention either. Could that be correct?

If attention_mask is a left-padded mask like
[[0, 0, 0, 1, 1, 1, 1], ...],
I convert it to sequence_lengths = [[3, 4], ...].

import torch

# Index of the first non-zero entry in each row; argmax returns 0 for an all-zero (invalid) row.
first_non_zero_indexes = attention_mask.argmax(dim=1)

# Check that each sequence is left-padded, i.e. everything after the first 1 is a contiguous run of 1s.
assert (attention_mask.sum(dim=1) == (attention_mask.shape[1] - first_non_zero_indexes)).all()

sequence_lengths = [
    torch.tensor(
        [attention_mask.shape[1]] if el == 0 else [el, attention_mask.shape[1] - el], dtype=torch.int64
    )
    for el in first_non_zero_indexes.tolist()
]
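
For reference, a minimal standalone check of this conversion on the example mask above (I added an unpadded second row for contrast; the variable names simply mirror the snippet):

import torch

attention_mask = torch.tensor([[0, 0, 0, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1]])
first_non_zero_indexes = attention_mask.argmax(dim=1)  # tensor([3, 0])
assert (attention_mask.sum(dim=1) == (attention_mask.shape[1] - first_non_zero_indexes)).all()
sequence_lengths = [
    torch.tensor([attention_mask.shape[1]] if el == 0 else [el, attention_mask.shape[1] - el], dtype=torch.int64)
    for el in first_non_zero_indexes.tolist()
]
print(sequence_lengths)  # [tensor([3, 4]), tensor([7])]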

@bigximik (Contributor, Author) commented on Apr 3, 2025

@sohamparikh @jlamypoirier Hi, I am trying to use the cross-document attention prevention that @tscholak pointed me to (https://github.com/ServiceNow/Fast-LLM/pull/177/files) to mimic left padding for documents in a batch during generation. It appears to do the right things internally, such as building the internal mask and position IDs, but the outputs still do not match (see the tables above). Could you please comment on what might be wrong? Thanks!

completed_steps: int,
consumed_samples: int,
consumed_tokens: int,
) -> tuple[dict[str, any], str | None]: ...
Collaborator: use dataclass

)
end_time = time.perf_counter()
time_per_iteration = (end_time - begin_time) / num_iters
model_tflops, hardware_tflops = self._get_tflops_func(phase, time_per_iteration)
Collaborator: move downstream

Comment on lines +44 to +57
@classmethod
def build(
    cls,
    name: str,
    eval_config: EvaluationLossConfig,
    trainer_config: TrainerConfig,
    get_tflops_func: callable,
) -> "Evaluation":
    return cls(
        name=name,
        eval_config=eval_config,
        trainer_config=trainer_config,
        get_tflops_func=get_tflops_func,
    )
Collaborator: make dataclass fields

self._trainer_config = trainer_config
self._get_tflops_func = get_tflops_func

self._loss_defs = self._multi_stage.base_model.loss_defs
@tscholak (Collaborator), Apr 22, 2025: use __post_init__?
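
Taken together, the suggestions above (use a dataclass, turn the build() arguments into dataclass fields, derive the rest in __post_init__) might look roughly like the sketch below. Only the field names and types come from the diff; the placeholder config classes and the derived attribute are hypothetical, so treat this as an illustration rather than the actual implementation.

import dataclasses
import typing


@dataclasses.dataclass
class EvaluationLossConfig:
    # Hypothetical stand-in for the real config class referenced in the diff.
    iterations: int = 10


@dataclasses.dataclass
class TrainerConfig:
    # Hypothetical stand-in for the real trainer config.
    train_iters: int = 1000


@dataclasses.dataclass(kw_only=True)
class Evaluation:
    # The build() classmethod and the hand-written __init__ plumbing collapse into fields.
    name: str
    eval_config: EvaluationLossConfig
    trainer_config: TrainerConfig
    get_tflops_func: typing.Callable

    def __post_init__(self) -> None:
        # Derived state that __init__ currently sets by hand would move here.
        self._time_per_iteration: float | None = None


evaluation = Evaluation(
    name="loss",
    eval_config=EvaluationLossConfig(),
    trainer_config=TrainerConfig(),
    get_tflops_func=lambda phase, time_per_iteration: (0.0, 0.0),
)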

Comment on lines +335 to +341
assert not args.wandb_args # default empty string
assert not args.wandb_config_args # default empty string
assert args.model == "hf" # default value of 'hf'
assert not args.model_args # default empty string
assert args.batch_size == 1 # default value of 1
assert args.max_batch_size is None
assert args.device is None
Collaborator: make sure these are raised during config class validation
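
For illustration only, one way such checks could be raised during config validation instead of asserted at the call site; the LmEvalArgs class and its fields below are hypothetical stand-ins, not code from this PR:

import dataclasses


@dataclasses.dataclass
class LmEvalArgs:
    # Hypothetical subset of the arguments being asserted above.
    wandb_args: str = ""
    wandb_config_args: str = ""
    model: str = "hf"
    model_args: str = ""
    batch_size: int = 1
    max_batch_size: int | None = None
    device: str | None = None

    def validate(self) -> None:
        # Raise explicit errors during config validation rather than relying on bare asserts.
        if self.wandb_args or self.wandb_config_args:
            raise ValueError("wandb arguments are not supported and must stay empty.")
        if self.model != "hf" or self.model_args:
            raise ValueError("Only the default 'hf' model wrapper without extra model_args is supported.")
        if self.batch_size != 1 or self.max_batch_size is not None:
            raise ValueError("batch_size must stay at its default of 1 and max_batch_size must be unset.")
        if self.device is not None:
            raise ValueError("device must not be set explicitly.")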

continue


def setup_parser() -> argparse.ArgumentParser:

os.environ["TOKENIZERS_PARALLELISM"] = "false"

# update the evaluation tracker args with the output path and the HF token
if args.output_path:
@tscholak (Collaborator), Apr 22, 2025: Please clean this up; we are not pushing anything to the HF hub during eval, and the remainder should be controlled by Fast-LLM.

# utils.setup_logging(args.verbosity)
# eval_logger = logging.getLogger(__name__)

os.environ["TOKENIZERS_PARALLELISM"] = "false"
Collaborator: doesn't apply

@jlamypoirier (Collaborator) commented:
Can we please break down this PR? Otherwise it will make reviewing too difficult. Let's keep this one about the minimalistic generate, and move the rest to the next PR

@tscholak (Collaborator) commented:

> Can we please break down this PR? Otherwise it will make reviewing too difficult. Let's keep this one about the minimalistic generate, and move the rest to the next PR

Sure, eventually we can do that. @bigximik is currently iterating towards an end-to-end solution for running benchmarks, and he's solving issues as they arise. It makes sense for him to operate that way for the time being, but when the time comes to review the changes, we should separate the concerns.

@tscholak (Collaborator) commented:
@jlamypoirier, btw, we need your guidance in determining the best way to distribute generation across ranks.
Concretely, we are looking to implement this lm-eval-harness API:

    @abc.abstractmethod
    def generate_until(self, requests) -> List[str]:
        """Generate greedily until a stopping sequence

        :param requests: list[Instance]
            A list of Instance objects with property `args` which returns a tuple (context, gen_kwargs).
            context: str
                Context string
            gen_kwargs: dict
                A dictionary of keyword arguments to pass to the generation function e.g. top_k, until, etc.
        :return: list[str]
            A list of model generated continuations.
            continuation: str
                The generated continuation.
        """
        pass

where generate_until(requests: list[Instance], ...) is called from rank 0 and should distribute the Instances across the ranks, each of which calls the Fast-LLM model's generate(inputs: torch.Tensor, ...). An Instance is essentially a prompt plus metadata; see https://github.com/EleutherAI/lm-evaluation-harness/blob/e4a7b69fe0fc6cb430e12cf15c4109bf28185124/lm_eval/api/instance.py#L11.
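
Not a proposal for the final design, just a rough sketch of one possible split: rank 0 scatters the request contexts with torch.distributed object collectives, every rank runs generation on its shard, and rank 0 gathers the continuations back in the original order. The model.generate call, the tokenizer, and all names here are hypothetical placeholders rather than Fast-LLM's actual API.

import torch
import torch.distributed as dist


def generate_until_distributed(contexts: list[str] | None, model, tokenizer, **gen_kwargs) -> list[str] | None:
    """Rank 0 passes the list of context strings; the other ranks pass None."""
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # Rank 0 splits the requests into one shard per rank; scatter_object_list sends each shard to its rank.
    shards = None
    if rank == 0:
        shards = [contexts[i::world_size] for i in range(world_size)]
    local_shard = [None]
    dist.scatter_object_list(local_shard, shards, src=0)

    # Each rank generates for its own shard (model.generate / tokenizer are stand-ins, not the real API).
    local_outputs = []
    for context in local_shard[0]:
        inputs = tokenizer(context, return_tensors="pt").input_ids
        output_ids = model.generate(inputs, **gen_kwargs)
        local_outputs.append(tokenizer.decode(output_ids[0], skip_special_tokens=True))

    # Gather all shards back on rank 0 and restore the original request order.
    gathered = [None] * world_size if rank == 0 else None
    dist.gather_object(local_outputs, gathered, dst=0)
    if rank != 0:
        return None
    results = [None] * len(contexts)
    for shard_rank, outputs in enumerate(gathered):
        for j, text in enumerate(outputs):
            results[shard_rank + j * world_size] = text
    return results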
