
[shortfin][ci] Flaky startup timer test failure #872

Open
dan-garvey opened this issue Jan 27, 2025 · 2 comments
@dan-garvey (Member)

app_tests/integration_tests/llm/server_management.py:118: TimeoutError
------------------------------ Captured log setup ------------------------------
INFO app_tests.integration_tests.llm.model_management:model_management.py:102 Copying local model from /data/llama3.1/weights/8b/fp16/llama3.1_8b_fp16_instruct.irpa
INFO app_tests.integration_tests.llm.model_management:model_management.py:138 Downloading tokenizer NousResearch/Meta-Llama-3.1-8B
INFO app_tests.integration_tests.llm.model_management:model_management.py:151 Exporting model with following settings:
MLIR Path: /shark-dev/pytest-of-nod/pytest-516/model_cache1/local/llama3.1_8b_fp16_instruct/model.mlir
Config Path: /shark-dev/pytest-of-nod/pytest-516/model_cache1/local/llama3.1_8b_fp16_instruct/config.json
Batch Sizes: 1,4
INFO app_tests.integration_tests.llm.model_management:model_management.py:172 Model successfully exported to /shark-dev/pytest-of-nod/pytest-516/model_cache1/local/llama3.1_8b_fp16_instruct/model.mlir
INFO app_tests.integration_tests.llm.model_management:model_management.py:178 Compiling model to /shark-dev/pytest-of-nod/pytest-516/model_cache1/local/llama3.1_8b_fp16_instruct/model.vmfb
INFO app_tests.integration_tests.llm.model_management:model_management.py:189 Model successfully compiled to /shark-dev/pytest-of-nod/pytest-516/model_cache1/local/llama3.1_8b_fp16_instruct/model.vmfb
=========================== short test summary info ============================
ERROR app_tests/integration_tests/llm/shortfin/cpu_llm_server_test.py::TestLLMServer::test_basic_generation[llama31_8b_none] - TimeoutError: Server failed to start within 10 seconds

This test is fairly flaky. Can we come up with a better testing methodology?
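One way to make the startup check less flaky is to poll for readiness with a generous deadline instead of a single fixed 10-second wait. A minimal sketch, assuming a hypothetical `is_ready` callable (in the real suite this might be an HTTP health check against the shortfin server; the helper name and defaults here are illustrative, not the project's actual API):

```python
import time

def wait_for_server(is_ready, timeout=60.0, interval=0.5):
    """Poll is_ready() until it returns True or the deadline passes.

    A longer, poll-based wait is less flaky than one fixed deadline:
    slow CI machines get extra headroom, while fast startups return
    almost immediately instead of sleeping the full timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if is_ready():
            return True
        time.sleep(interval)
    raise TimeoutError(f"Server failed to start within {timeout} seconds")
```

Because the loop returns as soon as the server answers, raising the timeout to 60 seconds costs nothing on healthy runs and only delays the failure report on genuinely broken ones.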

@dan-garvey dan-garvey changed the title [shortfin][ci] [shortfin][ci] Flaky startup timer test failure Jan 27, 2025
@renxida (Contributor) commented Jan 28, 2025

Huh. This one should be xfailed.

We don't have the Meta Llama 3.1 weights on all of the machines.

@stbaione (Contributor)

> Huh. This one should be xfailed.
>
> We don't have the Meta Llama 3.1 weights on all of the machines.

I recently removed the xfail because the test was moved to Mi300x-3, which had the .irpa file. But it makes sense to set it back to xfail, since the file isn't at the same location on all of the machines. We also had some bad .irpa files that were making the shortfin output appear corrupt; that was fixed by downloading the safetensors and regenerating the .irpa files, and that issue was on my mind when I removed the xfail.
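Rather than a blanket xfail, the test could skip itself when the weights file is absent on a given runner, so machines that do have the .irpa file still exercise the test. A minimal sketch (the helper name is hypothetical; the path mirrors the one in the captured log, and in practice the result would feed `pytest.mark.skipif`):

```python
import os

def skip_reason_for_weights(weights_path):
    """Return a skip-reason string when the weights file is missing,
    or None when the test can run on this machine.

    Skipping (instead of xfailing) keeps the test live on runners
    like Mi300x-3 that actually have the .irpa file, while runners
    without it report a clear skip rather than a flaky failure.
    """
    if not os.path.exists(weights_path):
        return f"model weights not found at {weights_path}"
    return None
```

In the suite this might be wired up as, e.g., `pytest.mark.skipif(skip_reason_for_weights(IRPA_PATH) is not None, reason=...)` applied to `test_basic_generation`; that wiring is an assumption, not the project's current code.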
