
[LLVMGPUVectorDistribute] Support vector.mask + vector.multi_reduce #19880

Open
wants to merge 3 commits into main

Conversation


@manupak manupak commented Feb 3, 2025

This commit enables vector layout propagation
into and out of vector.mask and its body.

Moreover, it enables the distribution of vector.multi_reduce
that is wrapped in a vector.mask.
This is done as follows:

  • The distributed mask is applied to the thread-local reduce.
  • The distributed operand is selected between the
    reduction identity and the provided operand using
    the distributed mask.
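The steps above can be sketched in simplified IR. Note the shapes, the `<add>` kind, and the value names below are illustrative assumptions, not taken from this PR:

```mlir
// Pre-distribution: a masked reduction.
%r = vector.mask %mask {
  vector.multi_reduction <add>, %src, %acc [0] : vector<128xf16> to f16
} : vector<128xi1> -> f16

// Post-distribution (per thread, sketch): masked-off lanes are replaced
// by the combining identity (0.0 for <add>) so they cannot affect the
// thread-local reduction.
%identity = arith.constant dense<0.000000e+00> : vector<4xf16>
%selected = arith.select %dist_mask, %dist_src, %identity
    : vector<4xi1>, vector<4xf16>
%local = vector.multi_reduction <add>, %selected, %local_acc [0]
    : vector<4xf16> to f16
```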

Depends on: #19830 (hence keeping this as a draft until that is merged)

@manupak manupak requested a review from Groverkss February 3, 2025 14:07
@manupak manupak marked this pull request as draft February 3, 2025 14:08

manupak commented Feb 3, 2025

@Groverkss is it fair to assume that region nesting will be honored when distributing?
(I know this is how ops are collected for distribution for now -- but I want to confirm whether that's a coincidence or by design)
i.e. innermost regions will be distributed prior to outer ones


manupak commented Feb 3, 2025

> @Groverkss is it fair to assume that region nesting will be honored when distributing? (I know this is how ops are collected for distribution for now -- but I want to confirm whether that's a coincidence or by design) i.e. innermost regions will be distributed prior to outer ones

Alright, I introduced MaskedOpDistributionPattern, which rewrites the vector.mask op wrapper away post-distribution.

This commit enables vector layout propagation
into and out of vector.mask and its body.

Moreover, it enables the distribution of vector.multi_reduce
that is wrapped in a vector.mask.
This is done as follows:
* The distributed mask is applied to the thread-local reduce
* The distributed operand is selected between the
  reduction identity and the provided operand using
  the distributed mask.

Signed-off-by: Manupa Karunaratne <[email protected]>
a hook to provide vector.mask { op } rewrites.

This removes the rewrite ordering constraint that
would otherwise exist, where the body op has to be
distributed prior to the mask op.

Now, using this hook, developers can write
masked op distribution patterns where the pre-distribution
mask op is removed as part of the rewrite.

Signed-off-by: Manupa Karunaratne <[email protected]>
@manupak manupak force-pushed the distribute-masked-reductions branch from a8dcc8b to 920fb53 Compare February 20, 2025 12:12
@manupak manupak marked this pull request as ready for review February 20, 2025 12:15

manupak commented Feb 20, 2025

PTAL @qedawkins if you have some time...


@qedawkins qedawkins left a comment


One main question about why we need both the local mask and the select, otherwise LGTM

std::function<void(DistributionLayout *, mlir::ChangeResult)> update) {
mask.getBody()->walk(
[&](Operation *traversed) { visitOperation(traversed); });
// Propogate from body to results

Suggested change
// Propogate from body to results
// Propagate from body to results.

Contributor Author


done

}
mask = getDistributed(rewriter, maskOp.getMask(), maskLayout);
Value passThruSrc = getCombiningIdentityValue(
loc, rewriter, multiReduceOp.getKind(), disSrc.getType());
Contributor


vector.mask can carry its own pass_thru, which I'm guessing goes here.

Contributor Author


skipped as discussed down below

loc, disSrc, localInit, distributedReductionMask,
multiReduceOp.getKind());
if (mask) {
localReduction =
vector::maskOperation(rewriter, localReduction.getDefiningOp(), mask)
Contributor


Why do we need the arith.select and the vector.mask?

Contributor Author


removed post-distribution masking for now as discussed.

// CHECK: %[[MASK_ITL_PCK:.+]] = vector.transpose %[[MASK_PCK]], [0, 3, 1, 4, 2, 5] : vector<2x2x2x1x1x8xi1> to vector<2x1x2x1x2x8xi1>

// CHECK: %[[SELECT:.+]] = arith.select %[[MASK_ITL_PCK]], {{.*}}, %[[RED_IDENTITY]] : vector<2x1x2x1x2x8xi1>, vector<2x1x2x1x2x8xf16>
// CHECK: vector.mask %[[MASK_ITL_PCK]] { vector.multi_reduction <add>, %[[SELECT]], {{.*}} [0, 2, 4] : vector<2x1x2x1x2x8xf16> to vector<1x1x8xf16> } : vector<2x1x2x1x2x8xi1> -> vector<1x1x8xf16>
Contributor


Can you add a test with pass_thru on the vector.mask?

Contributor Author


skipped as discussed down below


manupak commented Feb 21, 2025

> One main question about why we need both the local mask and the select, otherwise LGTM

So the masking is for thread-local reductions.
The distribution can and will happen across reduction dimensions. Therefore, in the corner case where there are no reductions to perform thread-locally, I thought it needed to select the reduction identity.

Re-thinking, maybe the init might cover that already -- I can give removing the select a go.


manupak commented Feb 21, 2025

wait .. no, if we want to support pass_thru then the select is needed.
So I'll add that with a test then?

@qedawkins

yeah, I think we need to keep the select (although it's also fine to just not support pass_thru right now) and we can try dropping the mask.


manupak commented Feb 21, 2025

I can add pass_thru, but why drop the mask?
It doesn't hurt to retain the mask post-distribution, no?
(well, except I need this: llvm/llvm-project#126722 to be integrated into IREE)

@qedawkins

Isn't the mask redundant if we have the select?


manupak commented Feb 21, 2025

For e.g., if the thread-local reduction dimension is long (longer than a machine vector), wouldn't that be used to cut down the instructions issued?
(though I don't know whether that's how it's lowered -- so happy to skip masking the distributed op if you think the cons outweigh the pros)

@qedawkins

> If the thread-local reduction dimension is long (longer than a machine vector), wouldn't that be used to cut down the instructions issued?

It didn't look like the existing lowerings were doing that to me, but I might not have looked close enough. If it does work out like that, keeping the mask makes sense. I've mostly been asking because the mask was surprising to me, I can approve and leave it as a future exercise to determine whether it's useful.


manupak commented Feb 21, 2025

I spent some time reading the upstream code and traces now.
As per the current upstream implementations, it seems the lowering does a select at a much finer granularity.
At the same time, I didn't see a single mention of the mask op's pass_thru in the upstream lowering, so it is likely not implemented either.

Thus, I'll leave a comment here and remove the post-distribution mask, just not to trip on anything.
(Sorry for carrying on my overthinking trip here... :) )

@qedawkins

ah ok, well ignore the pass_thru then. Sounds good to me! We can always add it back later if it's better.

@manupak manupak requested a review from qedawkins February 21, 2025 16:34