Conversation
I like the idea of this but concerned it took over 35 min for Test Turbine Models. Maybe this belongs more in a nightly than for every patch?
Ah, thanks for the suggestion Ian. I think perhaps a good portion of the time is compiling the Stateless Llama. Let me try to make it reuse the vmfb when possible. If that doesn't work, I can move it into some nightly action thing. :)
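The vmfb-reuse idea above could look something like the sketch below: skip recompilation when a compiled artifact already exists on disk. This is a minimal illustration, not the actual SHARK/Turbine code; `get_or_compile_vmfb` and the `compile_fn` callback are hypothetical names.

```python
import os
import tempfile

def get_or_compile_vmfb(vmfb_path, compile_fn):
    """Reuse a cached .vmfb if present; otherwise compile and cache it.

    Returns (path, freshly_compiled). Hypothetical helper for illustration.
    """
    if os.path.exists(vmfb_path):
        return vmfb_path, False  # reuse the cached artifact, skip compilation
    compile_fn(vmfb_path)       # expensive compile step happens only once
    return vmfb_path, True

# Demo with a stand-in compile function:
calls = []
def fake_compile(path):
    calls.append(path)
    with open(path, "w") as f:
        f.write("vmfb")

with tempfile.TemporaryDirectory() as d:
    p = os.path.join(d, "llama.vmfb")
    _, first = get_or_compile_vmfb(p, fake_compile)
    _, second = get_or_compile_vmfb(p, fake_compile)

print(first, second, len(calls))
```

In CI this would cut the job time whenever the model source hasn't changed; a real implementation would also need to invalidate the cache when the MLIR or compile flags change.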
@saienduri before you left your internship you were working on benchmarking following Ben's fancy double vmfb thing. What happened to that?
Hey Dan, I think it's there, but it's using benchmark-module, which is good for microbenchmarking, as opposed to this one, which tests perf on an actual workload + e2e Python.
@IanNod I brought it down to 23 minutes. I think before this test, it's ~18 minutes. What do you think?
Huh, used to be ~10 mins. Wonder what brought it up to almost double that. I still feel this belongs more in a nightly but am fine with it for now as we have a lot of ramping up on CI work to do. |
hf_auth_token=None,
compile_to="vmfb",
external_weights="safetensors",
# external_weight_file="Llama-2-7b-chat-hf-function-calling-v2_f16_int4.safetensors",
# Do not export weights because this doesn't get quantized.
assert benchmark_result[1]["decoded_tokens"] == 25
assert benchmark_result[1]["num_iterations"] == 1
assert benchmark_result[1]["decode_speed(tok/s)"] > 0
assert benchmark_result[1]["prefill_speed(tok/s)"] > 0
Doesn't really test for regressions, just that it ran, right?
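One way to address the reviewer's point would be to compare measured throughput against a stored baseline with a tolerance, rather than only asserting it is positive. A minimal sketch, assuming hypothetical names (`check_regression`, a hand-picked 20% tolerance, and example tok/s numbers that are not from the PR):

```python
def check_regression(current_tok_s, baseline_tok_s, tolerance=0.20):
    """Return True if throughput is within `tolerance` of the baseline.

    A drop of more than `tolerance` (fractional) below the baseline
    counts as a regression. Names and numbers here are illustrative.
    """
    floor = baseline_tok_s * (1.0 - tolerance)
    return current_tok_s >= floor

# Example: baseline of 100 tok/s with a 20% tolerance allows >= 80 tok/s.
ok = check_regression(90.0, 100.0)        # within tolerance
regressed = check_regression(75.0, 100.0) # more than 20% below baseline
print(ok, regressed)
```

The harder part in practice is keeping the baseline current (e.g. updating it from nightly runs), since hardware variance across CI runners can otherwise make a fixed threshold flaky.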
Modifications to SharkLLM + implementation of a benchmarking script to track performance of SHARK-2.0 LLM models. Here is a sample output from the benchmarking script: https://gist.github.com/raikonenfnu/4120ddfdcb2964608c89d31079594d05