Inference-time techniques #928
Replies: 1 comment
Good questions and suggestions! Regarding 1-3: I haven't covered those because they are pretty tool-specific, and I think if someone wants to learn about vLLM, for example, the best resource is the vLLM documentation.
Quantization has been on my list for some time; I plan to cover it in the future.
Same as above. I cover basic KV caching, but there's much more to the topic; I've actually been thinking about a dedicated book because of how much there is.
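For anyone skimming this thread who hasn't seen the idea before, here is a minimal NumPy sketch of what "basic KV caching" means: during autoregressive decoding, the key/value projections of past tokens are stored and reused, so each step only computes the projections for the one new token. All names and matrices here are toy stand-ins, not from any particular codebase.

```python
import numpy as np

def attend(q, K, V):
    # Scaled dot-product attention for a single query vector q.
    scores = (K @ q) / np.sqrt(len(q))
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

K_cache, V_cache = [], []   # grows by one row per generated token
x = rng.standard_normal(d)  # embedding of the first token (toy)

for step in range(5):
    # Only the NEW token's key/value are computed; past ones are reused
    # from the cache instead of being recomputed every step.
    K_cache.append(Wk @ x)
    V_cache.append(Wv @ x)
    out = attend(Wq @ x, np.stack(K_cache), np.stack(V_cache))
    x = out  # toy stand-in for feeding the output back as the next token
```

The "so much more" part is everything this sketch ignores: multi-head and grouped-query layouts, cache memory management (e.g. paging), quantized caches, and eviction strategies for long contexts.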
I am personally not very interested in JAX and XLA because they are heavily optimized for TPUs, and TPUs are (still) proprietary. But this may of course change in the future...
Do you have any recommended resources on inference-time techniques, such as:
My understanding is that these advanced inference techniques don’t necessarily improve model quality, but they do make models far more accessible by improving latency/cost and enabling deployment across a wider range of endpoints (e.g., phones, browsers via WebAssembly, and TPUs).
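To make the latency/cost point concrete, here is a minimal sketch of one such technique: symmetric per-tensor int8 weight quantization. This is a toy illustration of the memory-footprint benefit, not the scheme any particular tool uses; the single shared scale factor is an assumption for simplicity.

```python
import numpy as np

rng = np.random.default_rng(42)
w = rng.standard_normal(1024).astype(np.float32)  # toy weight tensor

# Map float32 weights into the int8 range [-127, 127] with one scale.
scale = np.abs(w).max() / 127.0
w_int8 = np.round(w / scale).astype(np.int8)

# Dequantize to check the round-trip error.
w_dequant = w_int8.astype(np.float32) * scale
max_err = np.abs(w - w_dequant).max()

# 4x smaller storage; rounding error is bounded by about scale / 2.
print(w.nbytes, "->", w_int8.nbytes, "bytes, max error", max_err)
```

The 4x memory reduction is what makes deployment on phones or in browsers feasible; the latency win comes from moving less data and (on supporting hardware) using integer arithmetic.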