Question about text generation of a GPT model #912
FlyingFish760 started this conversation in General
Replies: 1 comment 4 replies
No worries, that's what the discussions here are for :) This is the typical usage. But it's a good observation: the previous queries are not used, so it is quite wasteful to recompute them. In practice, that's where the KV cache comes into play: https://github.com/rasbt/LLMs-from-scratch/tree/main/ch04/03_kv-cache Does that address your question?
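To build some intuition for why the cache helps, here is a minimal single-head sketch of cached generation. It is only an illustration of the idea, not the implementation in ch04/03_kv-cache; the projection matrices W_q, W_k, W_v and the helper attend_with_cache are hypothetical names used just for this example:

```python
import torch

def attend_with_cache(x_new, W_q, W_k, W_v, cache=None):
    # x_new: (batch, 1, emb_dim) -- the embedding of only the newest token.
    q = x_new @ W_q                                # compute a query for the new position only
    k_new = x_new @ W_k
    v_new = x_new @ W_v
    if cache is None:
        k, v = k_new, v_new
    else:
        k = torch.cat([cache["k"], k_new], dim=1)  # reuse keys from earlier steps
        v = torch.cat([cache["v"], v_new], dim=1)  # reuse values from earlier steps
    cache = {"k": k, "v": v}
    scores = q @ k.transpose(1, 2) / k.shape[-1] ** 0.5
    context = torch.softmax(scores, dim=-1) @ v    # (batch, 1, head_dim)
    return context, cache

# Toy usage: feed one token embedding at a time, carrying the cache along.
torch.manual_seed(0)
emb_dim, head_dim = 8, 4
W_q, W_k, W_v = (torch.randn(emb_dim, head_dim) for _ in range(3))
cache = None
for _ in range(5):
    x_new = torch.randn(1, 1, emb_dim)             # stand-in for the newest token's embedding
    context, cache = attend_with_cache(x_new, W_q, W_k, W_v, cache)
```

Because the causal mask already restricts each position to attend only to itself and earlier positions, this single-query step produces the same attention output for the last position as recomputing attention over the full sequence would, so the next token that gets picked is identical; the cache only removes redundant computation.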
Hello! I have a maybe-stupid question about the text-generation process of a GPT model. In the generate_text_simple() function in the main code of chapter 4, "ch04.ipynb", I saw that only the logits for the last output position are selected.
So if only the last logits are selected, why are all the input tokens used as queries when generating the text? In other words, would the chosen next token be the same if we used only the last input token as the query and all the input tokens as keys and values? Thanks!
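For context, the loop being referenced looks roughly like the following; this is a paraphrase from memory of the chapter 4 generate_text_simple() function, not a verbatim copy, and model is assumed to be the chapter's GPT model returning logits of shape (batch, num_tokens, vocab_size):

```python
import torch

def generate_text_simple(model, idx, max_new_tokens, context_size):
    # idx: (batch, n_tokens) tensor of token IDs for the current context.
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -context_size:]        # crop the context to the supported length
        with torch.no_grad():
            logits = model(idx_cond)             # (batch, n_tokens, vocab_size)
        logits = logits[:, -1, :]                # keep only the last position's logits
        idx_next = torch.argmax(logits, dim=-1, keepdim=True)  # greedy next-token choice
        idx = torch.cat((idx, idx_next), dim=1)  # append the new token and feed everything back in
    return idx
```

Only logits[:, -1, :] feeds into the choice of idx_next; everything computed for the earlier positions in each forward pass is discarded in this simple loop, which is exactly the redundancy the KV cache discussed in the reply above avoids.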