E2E test suite for Attention #17892
From @raikonenfnu :
I'd like to be organized about this, considering the growing number of test suites while also recognizing the importance of test coverage for 'attention' ops. I'll write up some of my own thoughts too.
Existing test suites

This page documents many of the tests that we have in-tree and out-of-tree: https://iree.dev/developers/general/testing-guide/. Notably missing there are explanations for:

Matmul and convolution generated tests

The in-tree convolution and matmul tests work like this:

Pytest regression tests

https://github.com/iree-org/iree/tree/main/experimental/regression_suite

These tests are starting to get more focus, as a way to stay in Python and pass any flags.

Pytest iree_tests (onnx + others) test suite

https://github.com/nod-ai/SHARK-TestSuite/tree/main/iree_tests

These tests are generated offline from existing test suites and focus on using default flags, with a focus on large sets of XFAILs to track across in-tree and out-of-tree plugins.

Brainstorming

I've had my eye on refactoring the entire
I think Stan is on the right track here for a solution I'd be happier with, but depending on prioritization and implementation complexity, forking the existing matmul/convolution test suites may be more expedient.
Thanks for the summary of the state of affairs WRT the test suites, @ScottTodd! I think you covered the pros and cons very well. I think forking the matmul/convolution suites would work, although I'd like slight changes to them to fit our current requirements for the coming attention tests.
Optionally, IMO having some jinja + template MLIR + DB refactoring would also help clean things up and make adding tests for variants of ops much easier, although we can do that later. CC: @erman-gurses
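To make that concrete, here is a minimal sketch of the kind of Jinja + template-MLIR generation being suggested; the template body, shapes, and file names are purely illustrative and not IREE's actual attention test format:

```python
# Illustrative sketch: expand a Jinja template plus a small table ("DB") of
# shape configurations into per-variant MLIR test files.
from jinja2 import Template

ATTENTION_TEMPLATE = Template("""\
func.func @attention_{{ b }}x{{ m }}x{{ k }}(
    %query: tensor<{{ b }}x{{ m }}x{{ k }}xf16>,
    %key: tensor<{{ b }}x{{ m }}x{{ k }}xf16>,
    %value: tensor<{{ b }}x{{ m }}x{{ k }}xf16>) -> tensor<{{ b }}x{{ m }}x{{ k }}xf16> {
  // The real body would emit the attention op under test plus a reference
  // implementation to compare against; elided here.
}
""")

# The "DB" part: a plain table of variants to expand into test files.
CONFIGS = [
    {"b": 1, "m": 128, "k": 64},
    {"b": 2, "m": 256, "k": 64},
]

for cfg in CONFIGS:
    mlir = ATTENTION_TEMPLATE.render(**cfg)
    path = "attention_{b}x{m}x{k}.mlir".format(**cfg)
    with open(path, "w") as f:
        f.write(mlir)
```

Adding a new variant then just means adding a row to the table, and the same template can be reused by a benchmark generator.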
RE: performance tests / benchmarks, I think an offline generation approach helps manage complexity there too:
The thing to look for there is where a line can be drawn between discrete steps. A build system could add sugar that combines steps (like the
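Something like the following, as a rough sketch of keeping the steps discrete using the installed CLI tools (the file names, entry function, and the llvm-cpu target are assumptions for illustration, not a final design):

```python
# Sketch of keeping generation, compilation, and benchmarking as separate,
# inspectable steps rather than hiding them behind build-system sugar.
import subprocess

# Step 1: generate the test/benchmark MLIR offline (e.g. with a Jinja
# generator like the one sketched above); here we assume the file exists.
mlir_file = "attention_1x128x64.mlir"
vmfb_file = "attention_1x128x64.vmfb"

# Step 2: compile with the installed iree-compile tool.
subprocess.run(
    ["iree-compile", mlir_file,
     "--iree-hal-target-backends=llvm-cpu",
     "-o", vmfb_file],
    check=True)

# Step 3: benchmark the compiled module with iree-benchmark-module.
# A real run would also pass --input= flags matching the function signature.
subprocess.run(
    ["iree-benchmark-module",
     f"--module={vmfb_file}",
     "--device=local-task",
     "--function=main"],
    check=True)
```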
That makes sense! We are quite constrained on time though, and need these tests and benchmarks up and running ASAP for the coming sprint and the various optimization work we are going to do on attention. Based on your CI experience and expertise, do you think we can get there by EoW (i.e. landing @erman-gurses's base e2e attention test + wiring in the easy drop-in compile command flags + benchmarking performance)? If not, what is your estimate for how long it will take, and would it be worth setting up a temporary pkgCI in Python that can do this? :)
For just testing, 1 week seems like a reasonable target. I wouldn't want to retrofit benchmarking onto the
though. I don't know off the top of my head if those are expected to work with

Benchmarking also needs a very careful eye... these single-op tests are microbenchmarks, which IREE isn't designed (exclusively) to perform well on; IREE is a whole-program compiler. We have some dotprod / matmul microbenchmark suites (last updated in #17748, see the source at https://github.com/iree-org/iree-comparative-benchmark/blob/main/common_benchmark_suite/openxla/benchmark/comparative_suite/jax/scripts/generate_model_artifacts.py).

I'm not really sure what we should be doing for benchmarking. We currently "support" multiple generations of benchmarking infrastructure (longitudinal on perf.iree.dev, comparative with iree-comparative-benchmark, SDXL regression in pkgci, etc.) and this is different in its own ways, so we'd be looking at adding yet another system. What I'd be looking for is something simple enough that we can iterate on it easily and then delete it / promote it somewhere / etc. as we get some mileage on it and decide what we want to do with it. Files + tools in, .csv file or text logs out -> feed the output into a spreadsheet, learn something, iterate.
RE: benchmarking: agreed, there are already so many systems. Ideally we can piggyback off one of those haha. Do you think it's better to have a benchmarking system more interconnected with these tests, or should we just write a separate pkgCI workflow that does only perf checks? (Sorry for the constant comparison to pkgCI, that's the only CI I am familiar with :) )
Pkgci is just a set of actions that
The older ci.yml workflows rely on a tighter coupling between the build (CMake) and the tests (also CMake). The matmul and convolution test suites are covered by ci.yml as they are deeply integrated with Bazel and CMake. So for new tests (and benchmarks), I'd much prefer for them to just operate on installed packages. That can either mean using IREE's Python APIs or just working with
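For the installed-packages route, a rough sketch using the Python APIs from the iree-compiler / iree-runtime pip packages (treat the exact API names and the tiny test case as assumptions; they may differ across package versions):

```python
# Rough sketch: compile and run a generated test case using only installed
# pip packages (iree-compiler / iree-runtime), with no CMake/Bazel involved.
import numpy as np
import iree.compiler as ireec
import iree.runtime as ireert

# A trivial stand-in for a generated attention test case.
MLIR = """
func.func @main(%arg0: tensor<4xf32>) -> tensor<4xf32> {
  %0 = arith.addf %arg0, %arg0 : tensor<4xf32>
  return %0 : tensor<4xf32>
}
"""

# Compile to a VM flatbuffer for the local CPU backend.
vmfb = ireec.compile_str(MLIR, target_backends=["llvm-cpu"])

# Load and invoke through the runtime bindings, then check against a
# NumPy reference implementation.
module = ireert.load_vm_flatbuffer(vmfb, driver="local-task")
arg = np.arange(4, dtype=np.float32)
result = module.main(arg)
np.testing.assert_allclose(np.asarray(result), arg + arg)
```

The same pattern fits pytest naturally: parametrize over generated MLIR files and compare against NumPy/PyTorch reference results.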
Doesn't pkgCI also use the built iree-compile / iree-benchmark-module by somewhat pointing to the build path, as seen in iree/experimental/regression_suite/ireers/fixtures.py, lines 38 to 60 in 00daa02?
Also, do you have thoughts/ideas on the best way to build out the attention perf benchmarks? Would you recommend any one of these existing frameworks?
Those tools are available as console scripts after installing the Python packages: Lines 454 to 463 in 3dffadb
Lines 599 to 609 in 3dffadb
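(For context on what those snippets set up: console scripts are declared as entry points in the package metadata, roughly like this hypothetical, non-IREE example.)

```python
# Hypothetical illustration of how console scripts get installed with a pip
# package; the module paths and names below are made up, not IREE's real ones.
from setuptools import setup

setup(
    name="example-runtime",
    entry_points={
        "console_scripts": [
            # After `pip install`, these become executables on PATH that
            # forward to the named Python entry functions.
            "example-run-module = example_runtime.scripts.run_module:main",
            "example-benchmark-module = example_runtime.scripts.benchmark_module:main",
        ],
    },
)
```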
Pkgci is completely decoupled from the build system. Think about using packages (since that's what anyone other than a direct project developer should be using), not about interfacing with anything in CMake/Bazel.
Start simple. Have something write to JSON/XML that can be visualized somehow. Put historical data in a folder in a GitHub repo or a cloud document folder. Find a way to automate parts of that later.
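In that spirit, a minimal sketch of the "files and tools in, logs out" flow: run the benchmark tool, archive the raw output, and drop a small JSON record into a dated results folder (the folder layout, record fields, and module/function names are assumptions for illustration):

```python
# Minimal results capture: run iree-benchmark-module, archive its raw output,
# and record a small JSON summary in a dated folder that can live in a repo.
import datetime
import json
import pathlib
import subprocess

def record_benchmark(vmfb: str, function: str, results_dir: str = "results") -> pathlib.Path:
    stamp = datetime.datetime.now().strftime("%Y-%m-%dT%H-%M-%S")
    out_dir = pathlib.Path(results_dir) / stamp
    out_dir.mkdir(parents=True, exist_ok=True)

    # Run the installed benchmark tool and capture whatever it prints.
    proc = subprocess.run(
        ["iree-benchmark-module", f"--module={vmfb}",
         f"--function={function}", "--device=local-task"],
        capture_output=True, text=True, check=True)
    (out_dir / "raw_output.txt").write_text(proc.stdout)

    # A tiny machine-readable record; parse real numbers out of the raw
    # output later once we know what we actually want to track.
    record = {"timestamp": stamp, "module": vmfb, "function": function}
    (out_dir / "record.json").write_text(json.dumps(record, indent=2))
    return out_dir

# Example (assumes attention_1x128x64.vmfb was produced by an earlier step):
# record_benchmark("attention_1x128x64.vmfb", "main")
```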
Seems like we can also modify the underlying implementation (i.e. pkgCI's
CC: @suryajasper, who I heard from Harsh is going to expand his gemm-ai benchmark into an attention-ai-benchmark :)
The related PR is ready to land: #17751.
Request description
E2E test suite for Attention that has a reference implementation in it.
What component(s) does this issue relate to?
Compiler
Additional context
I raised a PR: #17751.