Decoupled lookahead scan MVP #6644

@bernhardmgruber

Christmas came early and @ahendriksen gifted us a faster implementation of inclusive scan using a new algorithm called decoupled lookahead. Let's implement it as part of the Munich hackathon.

For an internal presentation and recording of the new algorithm, as well as the source code of the prototype, check out the discussion at https://github.com/NVIDIA/cccl_private/issues/404. The most recent scan implementation is based on warpspeed, a small library of utilities for kernel authoring.

To begin, we should understand the various techniques and optimizations used and how we can integrate them into CUB:

  • warp specialization
  • 1D TMA loads and their alignment and size requirements (UBLKCP), and how to work around that (overcopy, etc.)
  • pipelining of TMA loads (overlapping multiple TMA stages)
  • persistent CTAs and work stealing (UGETNEXTWORKID)
  • the decoupled lookahead algorithm
  • dynamic SMEM allocation, its interplay with static SMEM, SMEM alignment, using arch-specific SMEM, setting custom utilization (cudaFuncAttributeMaxDynamicSharedMemorySize)
  • SMEM layout of the new algorithm including multiple mbarriers, aligned copy destinations, stages, partial aggregates, work ids, etc.
  • possibly fast block reductions; it remains to be seen whether we can just use CUB's BlockReduce or WarpReduce
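
As a starting point for the overcopy discussion, here is a minimal host-side sketch of how a tile's byte range could be widened to satisfy the 1D bulk copy requirements. It assumes the 16-byte alignment and size granularity of `cp.async.bulk` (the PTX instruction behind UBLKCP). The names `overcopy_range` and `make_overcopy_range` are hypothetical and are not taken from the prototype or from CUB.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Alignment and size granularity assumed for 1D bulk copies (cp.async.bulk).
constexpr std::size_t bulk_copy_alignment = 16;

// Hypothetical helper type: describes the widened ("overcopied") byte range.
struct overcopy_range
{
  std::uintptr_t aligned_begin; // source address rounded down to 16 bytes
  std::size_t copy_bytes;       // bytes to copy, always a multiple of 16
  std::size_t head_padding;     // bytes to skip at the start of the SMEM destination
};

inline std::uintptr_t round_down(std::uintptr_t x, std::size_t alignment)
{
  return x - (x % alignment);
}

inline std::uintptr_t round_up(std::uintptr_t x, std::size_t alignment)
{
  return round_down(x + alignment - 1, alignment);
}

// Widens [src, src + bytes) so that both ends fall on 16-byte boundaries.
inline overcopy_range make_overcopy_range(std::uintptr_t src, std::size_t bytes)
{
  assert(bytes > 0);
  const std::uintptr_t begin = round_down(src, bulk_copy_alignment);
  const std::uintptr_t end   = round_up(src + bytes, bulk_copy_alignment);
  return {begin, static_cast<std::size_t>(end - begin), static_cast<std::size_t>(src - begin)};
}
```

For example, `make_overcopy_range(0x1004, 100)` yields a copy of 112 bytes starting at `0x1000`, with the requested data beginning 4 bytes into the destination buffer. The sketch does not address whether the extra bytes past the end of the input may legally be read, so the last tile likely needs separate handling.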

To get us started, @ahendriksen will give a presentation of warpspeed and the current scan kernel.

Based on the internal prototype, this issue now contains sub-issues breaking down the integration. We will split the tasks as we see fit after the initial discussion. We should first focus on the tasks that could impact performance.

The goal is to create an (almost) production-ready PR that we can merge into CUB. We should focus on an implementation for Blackwell. If we can, it should also run on Hopper. Ampere is not necessary, since the status quo is fine there. We should support CTK >= 12.0.
