Inference-time techniques #928
Replies: 1 comment
Good questions and suggestions! Regarding 1-3: I haven't covered those because they are pretty tool-specific, and I think if someone wants to learn about vLLM, for example, the best resource is the vLLM documentation.
Quantization has been on my list for some time; I plan to cover it in the future.
Same as above. I cover basic KV caching, but there's much more to the topic; I've actually been thinking about a dedicated book because of how much there is.
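For anyone skimming this thread who hasn't seen the idea before, here is a minimal NumPy sketch of what "basic KV caching" means: during autoregressive decoding, the key/value projections of past tokens are stored and reused, so each step only computes the projections for the one new token. All names and matrices here are toy stand-ins, not from any particular codebase.

```python
import numpy as np

def attend(q, K, V):
    # Scaled dot-product attention for a single query vector q.
    scores = (K @ q) / np.sqrt(len(q))
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

K_cache, V_cache = [], []   # grows by one row per generated token
x = rng.standard_normal(d)  # embedding of the first token (toy)

for step in range(5):
    # Only the NEW token's key/value are computed; past ones are reused
    # from the cache instead of being recomputed every step.
    K_cache.append(Wk @ x)
    V_cache.append(Wv @ x)
    out = attend(Wq @ x, np.stack(K_cache), np.stack(V_cache))
    x = out  # toy stand-in for feeding the output back as the next token
```

The "so much more" part is everything this sketch ignores: multi-head and grouped-query layouts, cache memory management (e.g. paging), quantized caches, and eviction strategies for long contexts.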
I am personally not very interested in JAX and XLA because they are heavily optimized for TPUs, and TPUs are (still) proprietary. But this may of course change in the future...
Do you have any recommended resources on inference-time techniques, such as:
My understanding is that these advanced inference techniques don’t necessarily improve model quality, but they do make models far more accessible by improving latency/cost and enabling deployment across a wider range of endpoints (e.g., phones, browsers via WebAssembly, and TPUs).
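To make the latency/cost point concrete, here is a minimal sketch of one such technique: symmetric per-tensor int8 weight quantization. This is a toy illustration of the memory-footprint benefit, not the scheme any particular tool uses; the single shared scale factor is an assumption for simplicity.

```python
import numpy as np

rng = np.random.default_rng(42)
w = rng.standard_normal(1024).astype(np.float32)  # toy weight tensor

# Map float32 weights into the int8 range [-127, 127] with one scale.
scale = np.abs(w).max() / 127.0
w_int8 = np.round(w / scale).astype(np.int8)

# Dequantize to check the round-trip error.
w_dequant = w_int8.astype(np.float32) * scale
max_err = np.abs(w - w_dequant).max()

# 4x smaller storage; rounding error is bounded by about scale / 2.
print(w.nbytes, "->", w_int8.nbytes, "bytes, max error", max_err)
```

The 4x memory reduction is what makes deployment on phones or in browsers feasible; the latency win comes from moving less data and (on supporting hardware) using integer arithmetic.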