
[shortfin][ci] Flaky startup timer test failure #872

Open
dan-garvey opened this issue Jan 27, 2025 · 2 comments
@dan-garvey (Member)

app_tests/integration_tests/llm/server_management.py:118: TimeoutError
------------------------------ Captured log setup ------------------------------
INFO app_tests.integration_tests.llm.model_management:model_management.py:102 Copying local model from /data/llama3.1/weights/8b/fp16/llama3.1_8b_fp16_instruct.irpa
INFO app_tests.integration_tests.llm.model_management:model_management.py:138 Downloading tokenizer NousResearch/Meta-Llama-3.1-8B
INFO app_tests.integration_tests.llm.model_management:model_management.py:151 Exporting model with following settings:
MLIR Path: /shark-dev/pytest-of-nod/pytest-516/model_cache1/local/llama3.1_8b_fp16_instruct/model.mlir
Config Path: /shark-dev/pytest-of-nod/pytest-516/model_cache1/local/llama3.1_8b_fp16_instruct/config.json
Batch Sizes: 1,4
INFO app_tests.integration_tests.llm.model_management:model_management.py:172 Model successfully exported to /shark-dev/pytest-of-nod/pytest-516/model_cache1/local/llama3.1_8b_fp16_instruct/model.mlir
INFO app_tests.integration_tests.llm.model_management:model_management.py:178 Compiling model to /shark-dev/pytest-of-nod/pytest-516/model_cache1/local/llama3.1_8b_fp16_instruct/model.vmfb
INFO app_tests.integration_tests.llm.model_management:model_management.py:189 Model successfully compiled to /shark-dev/pytest-of-nod/pytest-516/model_cache1/local/llama3.1_8b_fp16_instruct/model.vmfb
=========================== short test summary info ============================
ERROR app_tests/integration_tests/llm/shortfin/cpu_llm_server_test.py::TestLLMServer::test_basic_generation[llama31_8b_none] - TimeoutError: Server failed to start within 10 seconds

This test is fairly flaky. Can we come up with a better testing methodology?
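One way to make the startup check less flaky is to poll for readiness with a generous deadline instead of a single fixed 10-second wait. A minimal sketch, assuming a hypothetical `is_ready` callable (in the real suite this might be an HTTP health check against the shortfin server; the helper name and defaults here are illustrative, not the project's actual API):

```python
import time

def wait_for_server(is_ready, timeout=60.0, interval=0.5):
    """Poll is_ready() until it returns True or the deadline passes.

    A longer, poll-based wait is less flaky than one fixed deadline:
    slow CI machines get extra headroom, while fast startups return
    almost immediately instead of sleeping the full timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if is_ready():
            return True
        time.sleep(interval)
    raise TimeoutError(f"Server failed to start within {timeout} seconds")
```

Because the loop returns as soon as the server answers, raising the timeout to 60 seconds costs nothing on healthy runs and only delays the failure report on genuinely broken ones.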

@dan-garvey dan-garvey changed the title [shortfin][ci] [shortfin][ci] Flaky startup timer test failure Jan 27, 2025
@renxida (Contributor) commented Jan 28, 2025

Huh. This one should be xfailed.

We don't have the Meta Llama 3.1 weights on all of the machines.

@stbaione (Contributor)

> Huh. This one should be xfailed.
>
> We don't have the Meta Llama 3.1 weights on all of the machines.

I recently removed the xfail because the test was moved to Mi300x-3, which had the .irpa file. But it makes sense to set it back to xfail, since the file isn't at the same location on all of the machines. We also had some bad .irpa files that were making the shortfin output appear corrupt; that was fixed by downloading the safetensors and regenerating the .irpa files, and that issue was on my mind when I removed the xfail.
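Rather than a blanket xfail, the test could skip itself when the weights file is absent on a given runner, so machines that do have the .irpa file still exercise the test. A minimal sketch (the helper name is hypothetical; the path mirrors the one in the captured log, and in practice the result would feed `pytest.mark.skipif`):

```python
import os

def skip_reason_for_weights(weights_path):
    """Return a skip-reason string when the weights file is missing,
    or None when the test can run on this machine.

    Skipping (instead of xfailing) keeps the test live on runners
    like Mi300x-3 that actually have the .irpa file, while runners
    without it report a clear skip rather than a flaky failure.
    """
    if not os.path.exists(weights_path):
        return f"model weights not found at {weights_path}"
    return None
```

In the suite this might be wired up as, e.g., `pytest.mark.skipif(skip_reason_for_weights(IRPA_PATH) is not None, reason=...)` applied to `test_basic_generation`; that wiring is an assumption, not the project's current code.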
