Conversation
shortfin/python/shortfin_apps/llm/components/batching/modes/extend_attention.py (outdated; resolved)
Codecov Report
✅ All modified and coverable lines are covered by tests.

```
@@ Coverage Diff @@
##        main    #2518   +/- ##
==================================
  Coverage      ?   77.55%
==================================
  Files         ?      264
  Lines         ?    25198
  Branches      ?        0
==================================
  Hits          ?    19543
  Misses        ?     5655
  Partials      ?        0
```
stbaione left a comment:
May want to add a case to accuracy_test and smoke tests for validation
I'll add the accuracy tests and smoke tests after the IREE issue with dynamic batch sizes is fixed.
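Until that fix lands, a skip-marked placeholder can keep the missing coverage visible in the test report. This is a hypothetical sketch (the test name and reason string are assumptions, not the actual shortfin test suite):

```python
import pytest

# Hypothetical placeholder for the extend-attention accuracy case.
# The skip reason documents the upstream blocker, so the gap shows up
# in every test run instead of being forgotten.
@pytest.mark.skip(reason="blocked on IREE dynamic batch size support")
def test_extend_attention_accuracy():
    raise NotImplementedError
```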
stbaione left a comment:
Looks good, just a couple questions
```python
    chunk_block_size=None,
)

async def prepare_args(self, batch_size: int) -> List[sfnp.device_array]:
```
Given the recent changes, I think this `prepare_args` produces the same results as the existing `PrefillTask.prepare_args` function, so I think we can just reuse the existing `PrefillTask`.
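One way to realize that suggestion is inheritance: if the two implementations are now identical, dropping the override makes the extend-attention task pick up the prefill version automatically. A minimal sketch, assuming class names like `PrefillTask`/`ExtendAttentionTask` (stand-ins, not the real shortfin API):

```python
# Hypothetical sketch: the body of prepare_args here is a stand-in for
# the existing implementation that builds device arrays for an invocation.
class PrefillTask:
    async def prepare_args(self, batch_size: int) -> list:
        # Stand-in for the real argument-preparation logic.
        return [f"prefill_arg_bs{batch_size}"]

class ExtendAttentionTask(PrefillTask):
    # No prepare_args override: the inherited PrefillTask version runs,
    # so the two batching modes cannot silently drift apart.
    pass
```

With no override, both classes share one code path, and a future fix to `prepare_args` applies to both modes at once.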
```python
    "Export from `sharktank` with `--has-prefill-position` for full trie prefix sharing benefits."
)

batch_mode = server_params.batch_mode
```
It looks like `chunk_block_size` isn't taken into account when extend_attention is used.
It might be good to log a warning when both have a value, stating that `chunk_block_size` is ignored when using extend_attention.
```python
seq_lens = torch.empty(bs_min, dtype=torch.int64)

print(f"Exporting prefill_bs{bs}")
# Use different naming for extend-attention mode to avoid confusion
```
What's the reasoning for adding this change?
Force-pushed e7018d3 to 3fdfded
… into update-extend-attn
Implement Extend-Attention in Shortfin