
[AMDGPU] Constrain AV->VReg if we do not exceed RP thresholds #150086


Open

jrbyrnes wants to merge 4 commits into main

Conversation

jrbyrnes (Contributor)

Adds a scheduling stage, AVGPRRewriteSchedule, to constrain AVRegs to VReg if we do not have excess pressure. As a standalone change, it is a little awkward to do this in a separate scheduling stage; however, there is a broader set of register class transforms that will eventually be done here (e.g., inflate OpA and OpB to AVGPR, rewrite VGPR MFMAs to AGPR). These transforms are all guided by pressure-based heuristics, and the scheduler has the most accurate RP analysis readily available. Moreover, these transforms will change the pressure problem, so rerunning scheduling afterwards is beneficial. (As a concrete example, I plan to extend this into #149367.)

This is built on top of #149863 .

As a standalone PR, this doesn't have much impact on CodeGen, since the allocation order for AVRegs and VRegs is basically the same and the transform is only done when we do not exceed the VGPR addressable limit. But after #146606, AVRegs will be allocated last. In the case where we don't actually need the AGPR allocation property of the AVGPR, this may cause degradations in RA, as we are now assigning larger registers after smaller ones. This PR helps that case by constraining to VGPR, which defaults to assigning larger registers first.

jrbyrnes added 2 commits July 21, 2025 11:11
llvmbot (Member) commented Jul 22, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Jeffrey Byrnes (jrbyrnes)


Patch is 38.14 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/150086.diff

7 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/GCNRegPressure.cpp (+5-1)
  • (modified) llvm/lib/Target/AMDGPU/GCNRegPressure.h (+28-13)
  • (modified) llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp (+88)
  • (modified) llvm/lib/Target/AMDGPU/GCNSchedStrategy.h (+25-5)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.iglp.opt.single.2b.mir (+20-20)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.iglp.opt.single.2c.mir (+8-8)
  • (added) llvm/test/CodeGen/AMDGPU/schedule-reconstrain-avgpr.mir (+188)
diff --git a/llvm/lib/Target/AMDGPU/GCNRegPressure.cpp b/llvm/lib/Target/AMDGPU/GCNRegPressure.cpp
index 7d6723a6108be..334afd3a2a5b4 100644
--- a/llvm/lib/Target/AMDGPU/GCNRegPressure.cpp
+++ b/llvm/lib/Target/AMDGPU/GCNRegPressure.cpp
@@ -38,7 +38,11 @@ bool llvm::isEqual(const GCNRPTracker::LiveRegSet &S1,
 
 unsigned GCNRegPressure::getRegKind(const TargetRegisterClass *RC,
                                     const SIRegisterInfo *STI) {
-  return STI->isSGPRClass(RC) ? SGPR : (STI->isAGPRClass(RC) ? AGPR : VGPR);
+  return STI->isSGPRClass(RC)
+             ? SGPR
+             : (STI->isAGPRClass(RC)
+                    ? AGPR
+                    : (STI->isVectorSuperClass(RC) ? AVGPR : VGPR));
 }
 
 void GCNRegPressure::inc(unsigned Reg,
diff --git a/llvm/lib/Target/AMDGPU/GCNRegPressure.h b/llvm/lib/Target/AMDGPU/GCNRegPressure.h
index 3749b6d1efc63..5ec898351f922 100644
--- a/llvm/lib/Target/AMDGPU/GCNRegPressure.h
+++ b/llvm/lib/Target/AMDGPU/GCNRegPressure.h
@@ -29,43 +29,58 @@ class raw_ostream;
 class SlotIndex;
 
 struct GCNRegPressure {
-  enum RegKind { SGPR, VGPR, AGPR, TOTAL_KINDS };
+  enum RegKind { SGPR, VGPR, AGPR, AVGPR, TOTAL_KINDS };
 
   GCNRegPressure() {
     clear();
   }
 
-  bool empty() const { return !Value[SGPR] && !Value[VGPR] && !Value[AGPR]; }
+  bool empty() const {
+    return !Value[SGPR] && !Value[VGPR] && !Value[AGPR] && !Value[AVGPR];
+  }
 
   void clear() { std::fill(&Value[0], &Value[ValueArraySize], 0); }
 
   /// \returns the SGPR32 pressure
   unsigned getSGPRNum() const { return Value[SGPR]; }
-  /// \returns the aggregated ArchVGPR32, AccVGPR32 pressure dependent upon \p
-  /// UnifiedVGPRFile
+  /// \returns the aggregated ArchVGPR32, AccVGPR32, and Pseudo AVGPR pressure
+  /// dependent upon \p UnifiedVGPRFile
   unsigned getVGPRNum(bool UnifiedVGPRFile) const {
     if (UnifiedVGPRFile) {
-      return Value[AGPR] ? getUnifiedVGPRNum(Value[VGPR], Value[AGPR])
-                         : Value[VGPR];
+      return Value[AGPR]
+                 ? getUnifiedVGPRNum(Value[VGPR], Value[AGPR], Value[AVGPR])
+                 : Value[VGPR] + Value[AVGPR];
     }
-    return std::max(Value[VGPR], Value[AGPR]);
+    // Until we hit the VGPRThreshold, we will assign AV as VGPR. After that
+    // point, we will assign as AGPR.
+    return std::max(Value[VGPR] + Value[AVGPR], Value[AGPR]);
   }
 
   /// Returns the aggregated VGPR pressure, assuming \p NumArchVGPRs ArchVGPRs
-  /// and \p NumAGPRs AGPRS, for a target with a unified VGPR file.
+  /// \p NumAGPRs AGPRS, and \p NumAVGPRs AVGPRs for a target with a unified
+  /// VGPR file.
   inline static unsigned getUnifiedVGPRNum(unsigned NumArchVGPRs,
-                                           unsigned NumAGPRs) {
-    return alignTo(NumArchVGPRs, AMDGPU::IsaInfo::getArchVGPRAllocGranule()) +
+                                           unsigned NumAGPRs,
+                                           unsigned NumAVGPRs) {
+
+    // Until we hit the VGPRThreshold, we will assign AV as VGPR. After that
+    // point, we will assign as AGPR.
+    return alignTo(NumArchVGPRs + NumAVGPRs,
+                   AMDGPU::IsaInfo::getArchVGPRAllocGranule()) +
            NumAGPRs;
   }
 
-  /// \returns the ArchVGPR32 pressure
-  unsigned getArchVGPRNum() const { return Value[VGPR]; }
+  /// \returns the ArchVGPR32 pressure, plus the AVGPRS which we assume will be
+  /// allocated as VGPR
+  unsigned getArchVGPRNum() const { return Value[VGPR] + Value[AVGPR]; }
   /// \returns the AccVGPR32 pressure
   unsigned getAGPRNum() const { return Value[AGPR]; }
+  /// \returns the AVGPR32 pressure
+  unsigned getAVGPRNum() const { return Value[AVGPR]; }
 
   unsigned getVGPRTuplesWeight() const {
-    return std::max(Value[TOTAL_KINDS + VGPR], Value[TOTAL_KINDS + AGPR]);
+    return std::max(Value[TOTAL_KINDS + VGPR] + Value[TOTAL_KINDS + AVGPR],
+                    Value[TOTAL_KINDS + AGPR]);
   }
   unsigned getSGPRTuplesWeight() const { return Value[TOTAL_KINDS + SGPR]; }
 
diff --git a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
index a6553083d722b..9189361324a1c 100644
--- a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
+++ b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
@@ -528,6 +528,7 @@ GCNMaxOccupancySchedStrategy::GCNMaxOccupancySchedStrategy(
     const MachineSchedContext *C, bool IsLegacyScheduler)
     : GCNSchedStrategy(C) {
   SchedStages.push_back(GCNSchedStageID::OccInitialSchedule);
+  SchedStages.push_back(GCNSchedStageID::AVGPRRewriteSchedule);
   SchedStages.push_back(GCNSchedStageID::UnclusteredHighRPReschedule);
   SchedStages.push_back(GCNSchedStageID::ClusteredLowOccupancyReschedule);
   SchedStages.push_back(GCNSchedStageID::PreRARematerialize);
@@ -778,6 +779,8 @@ GCNScheduleDAGMILive::createSchedStage(GCNSchedStageID SchedStageID) {
   switch (SchedStageID) {
   case GCNSchedStageID::OccInitialSchedule:
     return std::make_unique<OccInitialScheduleStage>(SchedStageID, *this);
+  case GCNSchedStageID::AVGPRRewriteSchedule:
+    return std::make_unique<AVGPRRewriteScheduleStage>(SchedStageID, *this);
   case GCNSchedStageID::UnclusteredHighRPReschedule:
     return std::make_unique<UnclusteredHighRPStage>(SchedStageID, *this);
   case GCNSchedStageID::ClusteredLowOccupancyReschedule:
@@ -941,10 +944,14 @@ void GCNScheduleDAGMILive::finalizeSchedule() {
   Pressure.resize(Regions.size());
   RegionsWithHighRP.resize(Regions.size());
   RegionsWithExcessRP.resize(Regions.size());
+  RegionsWithAVRegs.resize(Regions.size());
+  RegionsWithExcessVGPRRP.resize(Regions.size());
   RegionsWithMinOcc.resize(Regions.size());
   RegionsWithIGLPInstrs.resize(Regions.size());
   RegionsWithHighRP.reset();
   RegionsWithExcessRP.reset();
+  RegionsWithAVRegs.reset();
+  RegionsWithExcessVGPRRP.reset();
   RegionsWithMinOcc.reset();
   RegionsWithIGLPInstrs.reset();
 
@@ -1003,6 +1010,9 @@ raw_ostream &llvm::operator<<(raw_ostream &OS, const GCNSchedStageID &StageID) {
   case GCNSchedStageID::OccInitialSchedule:
     OS << "Max Occupancy Initial Schedule";
     break;
+  case GCNSchedStageID::AVGPRRewriteSchedule:
+    OS << "AVGPR Rewriting Reschedule";
+    break;
   case GCNSchedStageID::UnclusteredHighRPReschedule:
     OS << "Unclustered High Register Pressure Reschedule";
     break;
@@ -1036,6 +1046,78 @@ bool GCNSchedStage::initGCNSchedStage() {
   return true;
 }
 
+bool AVGPRRewriteScheduleStage::reconstrainRegClass(
+    Register Reg, const TargetRegisterClass *NewRC) const {
+  const TargetInstrInfo *TII = MF.getSubtarget().getInstrInfo();
+  const TargetRegisterClass *OldRC = DAG.MRI.getRegClass(Reg);
+  const TargetRegisterInfo *TRI = DAG.MRI.getTargetRegisterInfo();
+  const TargetRegisterClass *ConstrainRC = NewRC;
+  const SIRegisterInfo *SRI = MF.getSubtarget<GCNSubtarget>().getRegisterInfo();
+
+  // Stop early if there is nothing to do.
+  if (!NewRC || NewRC == OldRC)
+    return false;
+
+  // Accumulate constraints from all uses.
+  for (MachineOperand &MO : DAG.MRI.reg_nodbg_operands(Reg)) {
+    // Apply the effect of the given operand to NewRC.
+    MachineInstr *MI = MO.getParent();
+    unsigned OpNo = &MO - &MI->getOperand(0);
+    ConstrainRC = MI->getRegClassConstraintEffect(OpNo, ConstrainRC, TII, TRI);
+    if (!ConstrainRC)
+      return false;
+    if (MI->isCopy()) {
+      MachineOperand &OtherOp = MI->getOperand(1 - OpNo);
+      if (!OtherOp.isReg())
+        continue;
+
+      if (!SRI->isVGPR(DAG.MRI, OtherOp.getReg()))
+        return false;
+    }
+  }
+  DAG.MRI.setRegClass(Reg, ConstrainRC);
+  return true;
+}
+
+bool AVGPRRewriteScheduleStage::initGCNSchedStage() {
+  const GCNSubtarget &ST = MF.getSubtarget<GCNSubtarget>();
+
+  // The main benefit of AVReg usage is that the register can be assigned to
+  // either VGPR or AGPR. However, for the unified RF case, we should only be
+  // using AGPR if strictly necessary. That is, if the required number of VGPRs
+  // exceeds the addressable limit.  Otherwise, we should be stricly using VGPRs
+  // to minimize cross RC copies. Thus, if we are underc this limit, we should
+  // constrain AVReg- > VReg.
+  // TODO: AVReg constraining for non unified case.
+  if (!ST.hasGFX90AInsts() || DAG.RegionsWithAVRegs.empty() ||
+      DAG.RegionsWithExcessVGPRRP.any())
+    return false;
+
+  const SIRegisterInfo *SRI = ST.getRegisterInfo();
+
+  for (unsigned I = 0, E = DAG.MRI.getNumVirtRegs(); I != E; ++I) {
+    Register Reg = Register::index2VirtReg(I);
+    if (!DAG.LIS->hasInterval(Reg))
+      continue;
+    const TargetRegisterClass *RC = DAG.MRI.getRegClass(Reg);
+    if (!SRI->isVectorSuperClass(RC))
+      continue;
+
+    reconstrainRegClass(Reg, SRI->getEquivalentVGPRClass(RC));
+  }
+
+  // TODO -- opposite case, inflate to AV when we have AVGPR + VGPR RP greater
+  // than addressable limit.
+
+  // TODO - after we separate out AVGPR pressure from the e.g. getVGPRNum
+  // pressure queries, we may need to update the cached RP.
+
+  // TODO - there is a benefit to rescheduling with the constraints, as the
+  // generic trackers do not track AVGPR pressure. But we should teach the
+  // default trackers about AVGPR rather than doing rescheduling here.
+  return false;
+}
+
 bool UnclusteredHighRPStage::initGCNSchedStage() {
   if (DisableUnclusterHighRP)
     return false;
@@ -1278,6 +1360,9 @@ void GCNSchedStage::checkScheduling() {
   LLVM_DEBUG(dbgs() << "Pressure after scheduling: " << print(PressureAfter));
   LLVM_DEBUG(dbgs() << "Region: " << RegionIdx << ".\n");
 
+  if (PressureAfter.getAVGPRNum())
+    DAG.RegionsWithAVRegs[RegionIdx] = true;
+
   unsigned DynamicVGPRBlockSize = DAG.MFI.getDynamicVGPRBlockSize();
 
   if (PressureAfter.getSGPRNum() <= S.SGPRCriticalLimit &&
@@ -1331,6 +1416,9 @@ void GCNSchedStage::checkScheduling() {
   unsigned MaxArchVGPRs = std::min(MaxVGPRs, ST.getAddressableNumArchVGPRs());
   unsigned MaxSGPRs = ST.getMaxNumSGPRs(MF);
 
+  if (PressureAfter.getArchVGPRNum() > ST.getAddressableNumArchVGPRs())
+    DAG.RegionsWithExcessVGPRRP[RegionIdx] = true;
+
   if (PressureAfter.getVGPRNum(ST.hasGFX90AInsts()) > MaxVGPRs ||
       PressureAfter.getArchVGPRNum() > MaxArchVGPRs ||
       PressureAfter.getAGPRNum() > MaxArchVGPRs ||
diff --git a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.h b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.h
index 94cd795bbc8f6..7575d7611bbcb 100644
--- a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.h
+++ b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.h
@@ -28,11 +28,12 @@ class GCNSchedStage;
 
 enum class GCNSchedStageID : unsigned {
   OccInitialSchedule = 0,
-  UnclusteredHighRPReschedule = 1,
-  ClusteredLowOccupancyReschedule = 2,
-  PreRARematerialize = 3,
-  ILPInitialSchedule = 4,
-  MemoryClauseInitialSchedule = 5
+  AVGPRRewriteSchedule = 1,
+  UnclusteredHighRPReschedule = 2,
+  ClusteredLowOccupancyReschedule = 3,
+  PreRARematerialize = 4,
+  ILPInitialSchedule = 5,
+  MemoryClauseInitialSchedule = 6
 };
 
 #ifndef NDEBUG
@@ -224,6 +225,7 @@ using RegionBoundaries =
 class GCNScheduleDAGMILive final : public ScheduleDAGMILive {
   friend class GCNSchedStage;
   friend class OccInitialScheduleStage;
+  friend class AVGPRRewriteScheduleStage;
   friend class UnclusteredHighRPStage;
   friend class ClusteredLowOccStage;
   friend class PreRARematStage;
@@ -250,9 +252,15 @@ class GCNScheduleDAGMILive final : public ScheduleDAGMILive {
   // limit. Register pressure in these regions usually will result in spilling.
   BitVector RegionsWithExcessRP;
 
+  // Regions that have VGPR RP which exceed the addressable limit.
+  BitVector RegionsWithExcessVGPRRP;
+
   // Regions that has the same occupancy as the latest MinOccupancy
   BitVector RegionsWithMinOcc;
 
+  // Regions which use the AV RC.
+  BitVector RegionsWithAVRegs;
+
   // Regions that have IGLP instructions (SCHED_GROUP_BARRIER or IGLP_OPT).
   BitVector RegionsWithIGLPInstrs;
 
@@ -401,6 +409,18 @@ class OccInitialScheduleStage : public GCNSchedStage {
       : GCNSchedStage(StageID, DAG) {}
 };
 
+class AVGPRRewriteScheduleStage : public GCNSchedStage {
+private:
+  bool reconstrainRegClass(Register Reg,
+                           const TargetRegisterClass *NewRC) const;
+
+public:
+  bool initGCNSchedStage() override;
+
+  AVGPRRewriteScheduleStage(GCNSchedStageID StageID, GCNScheduleDAGMILive &DAG)
+      : GCNSchedStage(StageID, DAG) {}
+};
+
 class UnclusteredHighRPStage : public GCNSchedStage {
 private:
   // Save the initial occupancy before starting this stage.
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.iglp.opt.single.2b.mir b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.iglp.opt.single.2b.mir
index e93595b9ef273..0ec67f44e2cbb 100644
--- a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.iglp.opt.single.2b.mir
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.iglp.opt.single.2b.mir
@@ -23,10 +23,10 @@ body:             |
   ; GCN-NEXT: {{  $}}
   ; GCN-NEXT:   [[DEF:%[0-9]+]]:areg_512_align2 = IMPLICIT_DEF
   ; GCN-NEXT:   [[DEF1:%[0-9]+]]:areg_512_align2 = IMPLICIT_DEF
-  ; GCN-NEXT:   [[DEF2:%[0-9]+]]:av_128_align2 = IMPLICIT_DEF
-  ; GCN-NEXT:   [[DEF3:%[0-9]+]]:av_128_align2 = IMPLICIT_DEF
-  ; GCN-NEXT:   [[DEF4:%[0-9]+]]:av_128_align2 = IMPLICIT_DEF
-  ; GCN-NEXT:   [[DEF5:%[0-9]+]]:av_128_align2 = IMPLICIT_DEF
+  ; GCN-NEXT:   [[DEF2:%[0-9]+]]:vreg_128_align2 = IMPLICIT_DEF
+  ; GCN-NEXT:   [[DEF3:%[0-9]+]]:vreg_128_align2 = IMPLICIT_DEF
+  ; GCN-NEXT:   [[DEF4:%[0-9]+]]:vreg_128_align2 = IMPLICIT_DEF
+  ; GCN-NEXT:   [[DEF5:%[0-9]+]]:vreg_128_align2 = IMPLICIT_DEF
   ; GCN-NEXT:   [[DEF6:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
   ; GCN-NEXT:   [[DEF7:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
   ; GCN-NEXT:   [[DEF8:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
@@ -75,40 +75,40 @@ body:             |
   ; GCN-NEXT: bb.1:
   ; GCN-NEXT:   successors: %bb.1(0x40000000), %bb.2(0x40000000)
   ; GCN-NEXT: {{  $}}
-  ; GCN-NEXT:   [[DS_READ_B128_gfx9_:%[0-9]+]]:av_128_align2 = DS_READ_B128_gfx9 [[DEF6]], 0, 0, implicit $exec :: (load (s128) from %ir.in0, !alias.scope !0, addrspace 3)
-  ; GCN-NEXT:   [[DS_READ_B128_gfx9_1:%[0-9]+]]:av_128_align2 = DS_READ_B128_gfx9 [[DEF7]], 0, 0, implicit $exec :: (load (s128) from %ir.in4, !alias.scope !0, addrspace 3)
-  ; GCN-NEXT:   [[DS_READ_B128_gfx9_2:%[0-9]+]]:av_128_align2 = DS_READ_B128_gfx9 [[DEF6]], 1040, 0, implicit $exec :: (load (s128) from %ir.in1, !alias.scope !0, addrspace 3)
-  ; GCN-NEXT:   [[DS_READ_B128_gfx9_3:%[0-9]+]]:av_128_align2 = DS_READ_B128_gfx9 [[DEF7]], 2064, 0, implicit $exec :: (load (s128) from %ir.in5, !alias.scope !0, addrspace 3)
+  ; GCN-NEXT:   [[DS_READ_B128_gfx9_:%[0-9]+]]:vreg_128_align2 = DS_READ_B128_gfx9 [[DEF6]], 0, 0, implicit $exec :: (load (s128) from %ir.in0, !alias.scope !0, addrspace 3)
+  ; GCN-NEXT:   [[DS_READ_B128_gfx9_1:%[0-9]+]]:vreg_128_align2 = DS_READ_B128_gfx9 [[DEF7]], 0, 0, implicit $exec :: (load (s128) from %ir.in4, !alias.scope !0, addrspace 3)
+  ; GCN-NEXT:   [[DS_READ_B128_gfx9_2:%[0-9]+]]:vreg_128_align2 = DS_READ_B128_gfx9 [[DEF6]], 1040, 0, implicit $exec :: (load (s128) from %ir.in1, !alias.scope !0, addrspace 3)
+  ; GCN-NEXT:   [[DS_READ_B128_gfx9_3:%[0-9]+]]:vreg_128_align2 = DS_READ_B128_gfx9 [[DEF7]], 2064, 0, implicit $exec :: (load (s128) from %ir.in5, !alias.scope !0, addrspace 3)
   ; GCN-NEXT:   [[DEF:%[0-9]+]]:areg_512_align2 = contract V_MFMA_F32_32X32X8F16_mac_e64 [[DS_READ_B128_gfx9_]].sub0_sub1, [[DS_READ_B128_gfx9_1]].sub0_sub1, [[DEF]], 0, 0, 0, implicit $mode, implicit $exec
-  ; GCN-NEXT:   [[DS_READ_B128_gfx9_4:%[0-9]+]]:av_128_align2 = DS_READ_B128_gfx9 [[DEF6]], 2080, 0, implicit $exec :: (load (s128) from %ir.in2, !alias.scope !0, addrspace 3)
+  ; GCN-NEXT:   [[DS_READ_B128_gfx9_4:%[0-9]+]]:vreg_128_align2 = DS_READ_B128_gfx9 [[DEF6]], 2080, 0, implicit $exec :: (load (s128) from %ir.in2, !alias.scope !0, addrspace 3)
   ; GCN-NEXT:   [[V_ADD_U32_e32_:%[0-9]+]]:vgpr_32 = V_ADD_U32_e32 [[DEF42]], [[DEF8]], implicit $exec
   ; GCN-NEXT:   [[V_ADD_U32_e32_1:%[0-9]+]]:vgpr_32 = V_ADD_U32_e32 [[DEF42]], [[DEF9]], implicit $exec
   ; GCN-NEXT:   dead [[V_MFMA_F32_32X32X8F16_mac_e64_:%[0-9]+]]:areg_512_align2 = contract V_MFMA_F32_32X32X8F16_mac_e64 [[DS_READ_B128_gfx9_]].sub2_sub3, [[DS_READ_B128_gfx9_1]].sub2_sub3, [[DEF]], 0, 0, 0, implicit $mode, implicit $exec
-  ; GCN-NEXT:   [[DS_READ_B128_gfx9_5:%[0-9]+]]:av_128_align2 = DS_READ_B128_gfx9 [[DEF6]], 3120, 0, implicit $exec :: (load (s128) from %ir.in3, !alias.scope !0, addrspace 3)
+  ; GCN-NEXT:   [[DS_READ_B128_gfx9_5:%[0-9]+]]:vreg_128_align2 = DS_READ_B128_gfx9 [[DEF6]], 3120, 0, implicit $exec :: (load (s128) from %ir.in3, !alias.scope !0, addrspace 3)
   ; GCN-NEXT:   [[V_ADD_U32_e32_2:%[0-9]+]]:vgpr_32 = V_ADD_U32_e32 [[DEF42]], [[DEF10]], implicit $exec
   ; GCN-NEXT:   [[V_ADD_U32_e32_3:%[0-9]+]]:vgpr_32 = V_ADD_U32_e32 [[DEF42]], [[DEF11]], implicit $exec
   ; GCN-NEXT:   [[DEF:%[0-9]+]]:areg_512_align2 = contract V_MFMA_F32_32X32X8F16_mac_e64 [[DS_READ_B128_gfx9_2]].sub0_sub1, [[DS_READ_B128_gfx9_3]].sub0_sub1, [[DEF]], 0, 0, 0, implicit $mode, implicit $exec
-  ; GCN-NEXT:   [[DS_READ_B128_gfx9_6:%[0-9]+]]:av_128_align2 = DS_READ_B128_gfx9 [[DEF7]], 4128, 0, implicit $exec :: (load (s128) from %ir.in6, !alias.scope !0, addrspace 3)
+  ; GCN-NEXT:   [[DS_READ_B128_gfx9_6:%[0-9]+]]:vreg_128_align2 = DS_READ_B128_gfx9 [[DEF7]], 4128, 0, implicit $exec :: (load (s128) from %ir.in6, !alias.scope !0, addrspace 3)
   ; GCN-NEXT:   [[V_ADD_U32_e32_4:%[0-9]+]]:vgpr_32 = V_ADD_U32_e32 [[DEF42]], [[DEF12]], implicit $exec
   ; GCN-NEXT:   [[V_ADD_U32_e32_5:%[0-9]+]]:vgpr_32 = V_ADD_U32_e32 [[DEF42]], [[DEF13]], implicit $exec
   ; GCN-NEXT:   [[DEF:%[0-9]+]]:areg_512_align2 = contract V_MFMA_F32_32X32X8F16_mac_e64 [[DS_READ_B128_gfx9_2]].sub2_sub3, [[DS_READ_B128_gfx9_3]].sub2_sub3, [[DEF]], 0, 0, 0, implicit $mode, implicit $exec
-  ; GCN-NEXT:   [[DS_READ_B128_gfx9_7:%[0-9]+]]:av_128_align2 = DS_READ_B128_gfx9 [[DEF7]], 6192, 0, implicit $exec :: (load (s128) from %ir.in7, !alias.scope !0, addrspace 3)
+  ; GCN-NEXT:   [[DS_READ_B128_gfx9_7:%[0-9]+]]:vreg_128_align2 = DS_READ_B128_gfx9 [[DEF7]], 6192, 0, implicit $exec :: (load (s128) from %ir.in7, !alias.scope !0, addrspace 3)
   ; GCN-NEXT:   [[V_ADD_U32_e32_6:%[0-9]+]]:vgpr_32 = V_ADD_U32_e32 [[DEF42]], [[DEF14]], implicit $exec
   ; GCN-NEXT:   [[V_ADD_U32_e32_7:%[0-9]+]]:vgpr_32 = V_ADD_U32_e32 [[DEF42]], [[DEF15]], implicit $exec
   ; GCN-NEXT:   [[DEF:%[0-9]+]]:areg_512_align2 = contract V_MFMA_F32_32X32X8F16_mac_e64 [[DS_READ_B128_gfx9_4]].sub0_sub1, [[DS_READ_B128_gfx9_6]].sub0_sub1, [[DEF]], 0, 0, 0, implicit $mode, implicit $exec
-  ; GCN-NEXT:   [[DS_READ_B128_gfx9_8:%[0-9]+]]:av_128_align2 = DS_READ_B128_gfx9 [[DEF7]], 1024, 0, implicit $exec :: (load (s128) from %ir.in8, !alias.scope !0, addrspace 3)
+  ; GCN-NEXT:   [[DS_READ_B128_gfx9_8:%[0-9]+]]:vreg_128_align2 = DS_READ_B128_gfx9 [[DEF7]], 1024, 0, implicit $exec :: (load (s128) from %ir.in8, !alias.scope !0, addrspace 3)
   ; GCN-NEXT:   [[V_ADD_U32_e32_8:%[0-9]+]]:vgpr_32 = V_ADD_U32_e32 [[DEF42]], [[DEF16]], implicit $exec
   ; GCN-NEXT:   [[V_ADD_U32_e32_9:%[0-9]+]]:vgpr_32 = V_ADD_U32_e32 [[DEF42]], [[DEF17]], implicit $exec
   ; GCN-NEXT:   [[DEF:%[0-9]+]]:areg_512_align2 = contract V_MFMA_F32_32X32X8F16_mac_e64 [[DS_READ_B128_gfx9_4]].sub2_sub3, [[DS_READ_B128_gfx9_6]].sub2_sub3, [[DEF]], 0, 0, 0, implicit $mode, implicit $exec
-  ; GCN-NEXT:   [[DS_READ_B128_gfx9_9:%[0-9]+]]:av_128_align2 = DS_READ_B128_gfx9 [[DEF7]], 3088, 0, implicit $exec :: (load (s128) from %ir.in9, !alias.scope !0, addrspace 3)
+  ; GCN-NEXT:   [[DS_READ_B128_gfx9_9:%[0-9]+]]:vreg_128_align2 = DS_READ_B128_gfx9 [[DEF7]], 3088, 0, implicit $exec :: (load (s128) from %ir.in9, !alias.scope !0, addrspace 3)
   ; GCN-NEXT:   [[V_ADD_U32_e32_10:%[0-9]+]]:vgpr_32 = V_ADD_U32_e32 [[DEF42]], [[DEF18]], implicit $exec
   ; GCN-NEXT:   [[V_ADD_U32_e32_11:%[0-9]+]]:vgpr_32 = V_ADD_U32_e32 [[DEF42]], [[DEF19]], implicit $exec
   ; GCN-NEXT:   [[DEF:%[0-9]+]]:areg_512_align2 = contract V_MFMA_F32_32X32X8F16_mac_e64 [[DS_READ_B128_gfx9_5]].sub0_sub1, [[DS_READ_B128_gfx9_7]].sub0_sub1, [[DEF]], 0, 0, 0, implicit $mode, implicit $exec
-  ; GCN-NEXT:   [[DS_READ_B128_gfx9_10:%[0-9]+]]:av_128_align2 = DS_READ_B128_gfx9 [[DEF7]], 5152, 0, implicit $exec :: (load (s128) from %ir.in10, !alias.scope !0, addrspace 3)
+  ; GCN-NEXT:   [[DS_READ_B128_gfx9_10:%[0-9]+]]:vreg_128_align2 = DS_READ_B128_gfx9 [[DEF7]], 5152,...
[truncated]

arsenm (Contributor) left a comment

I don't really see the point of doing this. All the downstream register allocator code is free to re-inflate these right back to the AV_*. If we do not exceed the pressure threshold, we can always just remove the A registers from the allocation order by reserving them. The scheduler should not be actively trying to increase restrictions just in case. If you're changing the instruction type, you can adjust the register class directly as part of that transform.
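
A minimal sketch of the reservation alternative described above, assuming a hypothetical whole-function pressure query (functionFitsInArchVGPRs is not an existing API; the reservation loop mirrors how SIRegisterInfo::getReservedRegs handles other register ranges):

// Sketch only: take AGPRs out of the allocation order when VGPR pressure
// stays under the addressable limit, so AV_* vregs can only receive
// ArchVGPRs. functionFitsInArchVGPRs() is a hypothetical query, not an
// existing API.
BitVector SIRegisterInfo::getReservedRegs(const MachineFunction &MF) const {
  BitVector Reserved(getNumRegs());
  // ... existing reservations elided ...
  if (functionFitsInArchVGPRs(MF)) {
    for (unsigned I = 0, E = AMDGPU::AGPR_32RegClass.getNumRegs(); I != E; ++I)
      reserveRegisterTuples(Reserved, AMDGPU::AGPR_32RegClass.getRegister(I));
  }
  return Reserved;
}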

jrbyrnes (Contributor, Author) commented Jul 24, 2025

I don't really see the point of doing this. All the downstream register allocator code is free to re-inflate these right back to the AV_*.

The issue is that the best register class priority depends on the register pressure situation. If we need to use AGPRs, then it is best to assign AV last; otherwise, it is best to assign based on the bitwidth of the register. This addresses the minor increase in register usage caused by #146606.

A more direct solution is to provide a hook at some level in DefaultPriorityAdvisor::getPriority to allow dynamic RC priorities based on the RP situation.
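
A minimal sketch of such a hook, assuming a new target callback (adjustPriority and computeStaticPriority are hypothetical; only DefaultPriorityAdvisor::getPriority exists today):

// Sketch only: let the target re-rank a live interval based on its register
// class and the current pressure situation.
unsigned DefaultPriorityAdvisor::getPriority(const LiveInterval &LI) const {
  // Existing size/weight-based computation, elided behind a hypothetical
  // helper.
  unsigned Prio = computeStaticPriority(LI);
  const TargetRegisterInfo *TRI = MF.getSubtarget().getRegisterInfo();
  // Hypothetical hook: AMDGPU could demote AV_* classes here only when the
  // function actually needs the AGPR assignment property.
  return TRI->adjustPriority(LI, MF.getRegInfo().getRegClass(LI.reg()), Prio);
}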

Comment on lines +255 to +263
// Regions that have VGPR RP which exceed the addressable limit.
BitVector RegionsWithExcessVGPRRP;

// Regions that has the same occupancy as the latest MinOccupancy
BitVector RegionsWithMinOcc;

// Regions which use the AV RC.
BitVector RegionsWithAVRegs;


Can these two bitvectors be internal to the new stage instead? This avoids recomputation between all other stages that don't care about them and eliminates the possibility that we forget to keep them up to date (after a scheduling revert for example).
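
A minimal sketch of that refactor, deriving both sets from the cached per-region pressure inside the stage (names follow the patch; the loop body is illustrative):

// Sketch only: compute the region sets locally so no other stage has to
// keep them up to date after reverts.
bool AVGPRRewriteScheduleStage::initGCNSchedStage() {
  BitVector RegionsWithAVRegs(DAG.Regions.size());
  BitVector RegionsWithExcessVGPRRP(DAG.Regions.size());
  for (unsigned I = 0, E = DAG.Regions.size(); I != E; ++I) {
    const GCNRegPressure &RP = DAG.Pressure[I];
    if (RP.getAVGPRNum())
      RegionsWithAVRegs.set(I);
    if (RP.getArchVGPRNum() > ST.getAddressableNumArchVGPRs())
      RegionsWithExcessVGPRRP.set(I);
  }
  if (RegionsWithAVRegs.none() || RegionsWithExcessVGPRRP.any())
    return false;
  // ... constrain AV -> VGPR as in the patch ...
  return false;
}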

// using AGPR if strictly necessary. That is, if the required number of VGPRs
// exceeds the addressable limit. Otherwise, we should be stricly using VGPRs
// to minimize cross RC copies. Thus, if we are under this limit, we should
// constrain AVReg- > VReg.

Suggested change:
-// constrain AVReg- > VReg.
+// constrain AVReg -> VReg.

arsenm (Contributor) commented Jul 24, 2025

The issue is that the best register class priority depends on the register pressure situation. If we need to use AGPRs, then it is best to assign AV last; otherwise, it is best to assign based on the bitwidth of the register. This addresses the minor increase in register usage caused by #146606.

Can you just drop the AV priorities to the bottom regardless of bitwidth?

A more direct solution is to provide a hook at some level in DefaultPriorityAdvisor::getPriority to allow dynamic RC priorities based on the RP situation.

I don't think the scheduler should be touching register classes at all. It has impacts outside of the current scheduling region, and the scheduler should not be increasing constraints.

jrbyrnes (Contributor, Author)

Can you just drop the AV priorities to the bottom regardless of bitwidth?

We can, but we know it is better to use bitwidth if we don't need the AGPR assignment from AVGPR. So I think we should try to teach the compiler about this in some way to reduce regression likelihood -- so long as doing so doesn't require some excessive burden.

I don't think the scheduler should be touching register classes at all. It has impacts outside of the current scheduling region, and the scheduler should not be increasing constraints.

It is clear to me now that this particular RC transform should not happen in max-occupancy scheduling. It does not directly help reduce RP or directly help us improve occupancy. That said, I believe RC transforms which can improve RP and result in better scheduling should happen in the scheduler. There is precedent for pre-sched-stage function-wide transforms (e.g., remat).
