DO NOT COMMIT test for u55 + mv2 #9830
base: main
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/9830
Note: Links to docs will display an error until the docs builds have been completed.
❌ 2 New Failures
As of commit e466669 with merge base bcf4b46.
NEW FAILURES - The following jobs have failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This PR needs a
Force-pushed from 8a305e0 to 70d1da7
Force-pushed from 70d1da7 to e466669
@zingo - If I run MV2 locally I get a different number for the Ethos PMU cycles; any clue on how to start debugging? I believe George also confirmed that something odd is going on after comparing two runs.
When you run it locally, do you use run.sh or test_model.py? (--memory_mode Sram_Only seems to be missing from at least this version of the patch.) But I get the same big number, 8M, with it, so looking in the log there is probably a bug in the backends/arm/test/test_model.py test flow: I can spot that Vela is using Sram_Only but the elf is built with Shared_Sram. So until we fix this, run.sh would work best for testing (or running all the sub-scripts directly).
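To make the mismatch concrete, here is a minimal sketch of the kind of consistency check that would catch this divergence early. This is not code from this PR or from test_model.py; the function and parameter names are hypothetical.

```python
# Hypothetical sketch only - not part of this PR or of test_model.py.
# Idea: fail fast if the memory mode passed to Vela differs from the memory
# mode the runtime elf is built with (e.g. Sram_Only vs Shared_Sram).

def check_memory_mode_consistency(vela_memory_mode: str, elf_memory_mode: str) -> None:
    """Raise if the compile-time (Vela) and build-time (elf) memory modes diverge."""
    if vela_memory_mode != elf_memory_mode:
        raise RuntimeError(
            f"Memory mode mismatch: Vela compiled for {vela_memory_mode}, "
            f"but the elf was built for {elf_memory_mode}"
        )

# The reported bug, expressed with this check, would raise immediately:
check_memory_mode_consistency("Sram_Only", "Shared_Sram")
```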
Found the bug and a PR is on its way :) With it I now see 5.5M NPU cycles on mv2 when using Sram_Only with test_model.py as well, so it should match run.sh testing. Thanks for testing and reporting this :)
I hope this PR will fix the problem you hit.
Thanks @zingo @digantdesai!
For Shared_Sram, we see a difference in performance between the Ethos_U55_Deep_Embedded and the Ethos_U55_High_End_Embedded system configs, and that is expected. This is because the memory bandwidth and latency differ between the two system configurations. With the deeply embedded timing adapter settings, the NPU perceives a bus bandwidth of 4 bits/clock cycle (= 250 MB/s if you are at 500 MHz) on the external memory, whereas for High_End_Embedded the bus bandwidth on the external memory is 8 bits/cc (or 500 MB/s if the NPU is clocked at 500 MHz).

Note that even though the performance differs for Shared_Sram (8.3M vs 6.7M cycles), the PMU counters for the number of beats read on the AXI0/AXI1 interfaces are the same. In other words, the NPU reads the same amount of data in both the deeply embedded and high end embedded system configs. This is again expected because we run the same model; in the deeply embedded case the NPU simply has to wait longer for the data to arrive due to the lower bandwidth on the Flash.

As for Sram_Only, we place everything in the SRAM, and the SRAM bandwidth is the same between the Deeply Embedded and High End systems, therefore we see identical performance.
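For reference, the bandwidth figures quoted above follow directly from the bits-per-cycle numbers. A quick back-of-the-envelope check, assuming a 500 MHz NPU clock and MB meaning 10^6 bytes:

```python
# Sanity check of the quoted external-memory bandwidths (assumes a 500 MHz NPU clock).
def bandwidth_mb_per_s(bits_per_cycle: int, clock_hz: float = 500e6) -> float:
    """Convert a timing-adapter bus width in bits per NPU clock cycle to MB/s."""
    return bits_per_cycle * clock_hz / 8 / 1e6  # bits/s -> bytes/s -> MB/s

print(bandwidth_mb_per_s(4))  # Deep_Embedded external memory:     250.0 MB/s
print(bandwidth_mb_per_s(8))  # High_End_Embedded external memory: 500.0 MB/s
```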
Summary
[PLEASE REMOVE] See CONTRIBUTING.md's Pull Requests for ExecuTorch PR guidelines.
[PLEASE REMOVE] If this PR closes an issue, please add a Fixes #<issue-id> line.
[PLEASE REMOVE] If this PR introduces a fix or feature that should be in the upcoming release notes, please add a "Release notes: " label. For a list of available release notes labels, check out CONTRIBUTING.md's Pull Requests.
Test plan
[PLEASE REMOVE] How did you test this PR? Please write down any manual commands you used and note down tests that you have written if applicable.