
[AMDGPU] Constrain AV->VReg if we do not exceed RP thresholds #150086


Open

jrbyrnes wants to merge 4 commits into main

Conversation

jrbyrnes (Contributor)

Adds a scheduling stage, AVGPRRewriteSchedule, to constrain AVRegs to VReg if we do not have excess pressure. As a standalone change, it is a little awkward to do this in a separate scheduling stage; however, there is a broader set of register class transforms that will eventually be done here (e.g., inflate OpA and OpB to AVGPR, rewrite VGPR MFMAs to AGPR). These transforms are all guided by pressure-based heuristics, and the scheduler has the most accurate RP analysis readily available. Moreover, these transforms will change the pressure problem, so rerunning scheduling afterwards is beneficial. (As a concrete example, I plan to extend this into #149367.)

This is built on top of #149863 .

As a standalone PR, this doesn't have much impact on CodeGen, since the allocation order for AVRegs and VRegs is basically the same and the transform is only done when we do not exceed the VGPR addressable limit. But after #146606, AVRegs will be allocated last. In the case where we don't actually need the AGPR allocation property of the AVGPR, this may cause degradations in RA, as we are now assigning larger registers after smaller ones. This PR helps that case by constraining to VGPR, which defaults to assigning larger registers first.

jrbyrnes added 2 commits July 21, 2025 11:11
llvmbot (Member) commented Jul 22, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Jeffrey Byrnes (jrbyrnes)


Patch is 38.14 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/150086.diff

7 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/GCNRegPressure.cpp (+5-1)
  • (modified) llvm/lib/Target/AMDGPU/GCNRegPressure.h (+28-13)
  • (modified) llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp (+88)
  • (modified) llvm/lib/Target/AMDGPU/GCNSchedStrategy.h (+25-5)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.iglp.opt.single.2b.mir (+20-20)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.iglp.opt.single.2c.mir (+8-8)
  • (added) llvm/test/CodeGen/AMDGPU/schedule-reconstrain-avgpr.mir (+188)
diff --git a/llvm/lib/Target/AMDGPU/GCNRegPressure.cpp b/llvm/lib/Target/AMDGPU/GCNRegPressure.cpp
index 7d6723a6108be..334afd3a2a5b4 100644
--- a/llvm/lib/Target/AMDGPU/GCNRegPressure.cpp
+++ b/llvm/lib/Target/AMDGPU/GCNRegPressure.cpp
@@ -38,7 +38,11 @@ bool llvm::isEqual(const GCNRPTracker::LiveRegSet &S1,
 
 unsigned GCNRegPressure::getRegKind(const TargetRegisterClass *RC,
                                     const SIRegisterInfo *STI) {
-  return STI->isSGPRClass(RC) ? SGPR : (STI->isAGPRClass(RC) ? AGPR : VGPR);
+  return STI->isSGPRClass(RC)
+             ? SGPR
+             : (STI->isAGPRClass(RC)
+                    ? AGPR
+                    : (STI->isVectorSuperClass(RC) ? AVGPR : VGPR));
 }
 
 void GCNRegPressure::inc(unsigned Reg,
diff --git a/llvm/lib/Target/AMDGPU/GCNRegPressure.h b/llvm/lib/Target/AMDGPU/GCNRegPressure.h
index 3749b6d1efc63..5ec898351f922 100644
--- a/llvm/lib/Target/AMDGPU/GCNRegPressure.h
+++ b/llvm/lib/Target/AMDGPU/GCNRegPressure.h
@@ -29,43 +29,58 @@ class raw_ostream;
 class SlotIndex;
 
 struct GCNRegPressure {
-  enum RegKind { SGPR, VGPR, AGPR, TOTAL_KINDS };
+  enum RegKind { SGPR, VGPR, AGPR, AVGPR, TOTAL_KINDS };
 
   GCNRegPressure() {
     clear();
   }
 
-  bool empty() const { return !Value[SGPR] && !Value[VGPR] && !Value[AGPR]; }
+  bool empty() const {
+    return !Value[SGPR] && !Value[VGPR] && !Value[AGPR] && !Value[AVGPR];
+  }
 
   void clear() { std::fill(&Value[0], &Value[ValueArraySize], 0); }
 
   /// \returns the SGPR32 pressure
   unsigned getSGPRNum() const { return Value[SGPR]; }
-  /// \returns the aggregated ArchVGPR32, AccVGPR32 pressure dependent upon \p
-  /// UnifiedVGPRFile
+  /// \returns the aggregated ArchVGPR32, AccVGPR32, and Pseudo AVGPR pressure
+  /// dependent upon \p UnifiedVGPRFile
   unsigned getVGPRNum(bool UnifiedVGPRFile) const {
     if (UnifiedVGPRFile) {
-      return Value[AGPR] ? getUnifiedVGPRNum(Value[VGPR], Value[AGPR])
-                         : Value[VGPR];
+      return Value[AGPR]
+                 ? getUnifiedVGPRNum(Value[VGPR], Value[AGPR], Value[AVGPR])
+                 : Value[VGPR] + Value[AVGPR];
     }
-    return std::max(Value[VGPR], Value[AGPR]);
+    // Until we hit the VGPRThreshold, we will assign AV as VGPR. After that
+    // point, we will assign as AGPR.
+    return std::max(Value[VGPR] + Value[AVGPR], Value[AGPR]);
   }
 
   /// Returns the aggregated VGPR pressure, assuming \p NumArchVGPRs ArchVGPRs
-  /// and \p NumAGPRs AGPRS, for a target with a unified VGPR file.
+  /// \p NumAGPRs AGPRS, and \p NumAVGPRs AVGPRs for a target with a unified
+  /// VGPR file.
   inline static unsigned getUnifiedVGPRNum(unsigned NumArchVGPRs,
-                                           unsigned NumAGPRs) {
-    return alignTo(NumArchVGPRs, AMDGPU::IsaInfo::getArchVGPRAllocGranule()) +
+                                           unsigned NumAGPRs,
+                                           unsigned NumAVGPRs) {
+
+    // Until we hit the VGPRThreshold, we will assign AV as VGPR. After that
+    // point, we will assign as AGPR.
+    return alignTo(NumArchVGPRs + NumAVGPRs,
+                   AMDGPU::IsaInfo::getArchVGPRAllocGranule()) +
            NumAGPRs;
   }
 
-  /// \returns the ArchVGPR32 pressure
-  unsigned getArchVGPRNum() const { return Value[VGPR]; }
+  /// \returns the ArchVGPR32 pressure, plus the AVGPRS which we assume will be
+  /// allocated as VGPR
+  unsigned getArchVGPRNum() const { return Value[VGPR] + Value[AVGPR]; }
   /// \returns the AccVGPR32 pressure
   unsigned getAGPRNum() const { return Value[AGPR]; }
+  /// \returns the AVGPR32 pressure
+  unsigned getAVGPRNum() const { return Value[AVGPR]; }
 
   unsigned getVGPRTuplesWeight() const {
-    return std::max(Value[TOTAL_KINDS + VGPR], Value[TOTAL_KINDS + AGPR]);
+    return std::max(Value[TOTAL_KINDS + VGPR] + Value[TOTAL_KINDS + AVGPR],
+                    Value[TOTAL_KINDS + AGPR]);
   }
   unsigned getSGPRTuplesWeight() const { return Value[TOTAL_KINDS + SGPR]; }
 
diff --git a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
index a6553083d722b..9189361324a1c 100644
--- a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
+++ b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
@@ -528,6 +528,7 @@ GCNMaxOccupancySchedStrategy::GCNMaxOccupancySchedStrategy(
     const MachineSchedContext *C, bool IsLegacyScheduler)
     : GCNSchedStrategy(C) {
   SchedStages.push_back(GCNSchedStageID::OccInitialSchedule);
+  SchedStages.push_back(GCNSchedStageID::AVGPRRewriteSchedule);
   SchedStages.push_back(GCNSchedStageID::UnclusteredHighRPReschedule);
   SchedStages.push_back(GCNSchedStageID::ClusteredLowOccupancyReschedule);
   SchedStages.push_back(GCNSchedStageID::PreRARematerialize);
@@ -778,6 +779,8 @@ GCNScheduleDAGMILive::createSchedStage(GCNSchedStageID SchedStageID) {
   switch (SchedStageID) {
   case GCNSchedStageID::OccInitialSchedule:
     return std::make_unique<OccInitialScheduleStage>(SchedStageID, *this);
+  case GCNSchedStageID::AVGPRRewriteSchedule:
+    return std::make_unique<AVGPRRewriteScheduleStage>(SchedStageID, *this);
   case GCNSchedStageID::UnclusteredHighRPReschedule:
     return std::make_unique<UnclusteredHighRPStage>(SchedStageID, *this);
   case GCNSchedStageID::ClusteredLowOccupancyReschedule:
@@ -941,10 +944,14 @@ void GCNScheduleDAGMILive::finalizeSchedule() {
   Pressure.resize(Regions.size());
   RegionsWithHighRP.resize(Regions.size());
   RegionsWithExcessRP.resize(Regions.size());
+  RegionsWithAVRegs.resize(Regions.size());
+  RegionsWithExcessVGPRRP.resize(Regions.size());
   RegionsWithMinOcc.resize(Regions.size());
   RegionsWithIGLPInstrs.resize(Regions.size());
   RegionsWithHighRP.reset();
   RegionsWithExcessRP.reset();
+  RegionsWithAVRegs.reset();
+  RegionsWithExcessVGPRRP.reset();
   RegionsWithMinOcc.reset();
   RegionsWithIGLPInstrs.reset();
 
@@ -1003,6 +1010,9 @@ raw_ostream &llvm::operator<<(raw_ostream &OS, const GCNSchedStageID &StageID) {
   case GCNSchedStageID::OccInitialSchedule:
     OS << "Max Occupancy Initial Schedule";
     break;
+  case GCNSchedStageID::AVGPRRewriteSchedule:
+    OS << "AVGPR Rewriting Reschedule";
+    break;
   case GCNSchedStageID::UnclusteredHighRPReschedule:
     OS << "Unclustered High Register Pressure Reschedule";
     break;
@@ -1036,6 +1046,78 @@ bool GCNSchedStage::initGCNSchedStage() {
   return true;
 }
 
+bool AVGPRRewriteScheduleStage::reconstrainRegClass(
+    Register Reg, const TargetRegisterClass *NewRC) const {
+  const TargetInstrInfo *TII = MF.getSubtarget().getInstrInfo();
+  const TargetRegisterClass *OldRC = DAG.MRI.getRegClass(Reg);
+  const TargetRegisterInfo *TRI = DAG.MRI.getTargetRegisterInfo();
+  const TargetRegisterClass *ConstrainRC = NewRC;
+  const SIRegisterInfo *SRI = MF.getSubtarget<GCNSubtarget>().getRegisterInfo();
+
+  // Stop early if there is nothing to do.
+  if (!NewRC || NewRC == OldRC)
+    return false;
+
+  // Accumulate constraints from all uses.
+  for (MachineOperand &MO : DAG.MRI.reg_nodbg_operands(Reg)) {
+    // Apply the effect of the given operand to NewRC.
+    MachineInstr *MI = MO.getParent();
+    unsigned OpNo = &MO - &MI->getOperand(0);
+    ConstrainRC = MI->getRegClassConstraintEffect(OpNo, ConstrainRC, TII, TRI);
+    if (!ConstrainRC)
+      return false;
+    if (MI->isCopy()) {
+      MachineOperand &OtherOp = MI->getOperand(1 - OpNo);
+      if (!OtherOp.isReg())
+        continue;
+
+      if (!SRI->isVGPR(DAG.MRI, OtherOp.getReg()))
+        return false;
+    }
+  }
+  DAG.MRI.setRegClass(Reg, ConstrainRC);
+  return true;
+}
+
+bool AVGPRRewriteScheduleStage::initGCNSchedStage() {
+  const GCNSubtarget &ST = MF.getSubtarget<GCNSubtarget>();
+
+  // The main benefit of AVReg usage is that the register can be assigned to
+  // either VGPR or AGPR. However, for the unified RF case, we should only be
+  // using AGPR if strictly necessary. That is, if the required number of VGPRs
+  // exceeds the addressable limit.  Otherwise, we should be stricly using VGPRs
+  // to minimize cross RC copies. Thus, if we are underc this limit, we should
+  // constrain AVReg- > VReg.
+  // TODO: AVReg constraining for non unified case.
+  if (!ST.hasGFX90AInsts() || DAG.RegionsWithAVRegs.empty() ||
+      DAG.RegionsWithExcessVGPRRP.any())
+    return false;
+
+  const SIRegisterInfo *SRI = ST.getRegisterInfo();
+
+  for (unsigned I = 0, E = DAG.MRI.getNumVirtRegs(); I != E; ++I) {
+    Register Reg = Register::index2VirtReg(I);
+    if (!DAG.LIS->hasInterval(Reg))
+      continue;
+    const TargetRegisterClass *RC = DAG.MRI.getRegClass(Reg);
+    if (!SRI->isVectorSuperClass(RC))
+      continue;
+
+    reconstrainRegClass(Reg, SRI->getEquivalentVGPRClass(RC));
+  }
+
+  // TODO -- opposite case, inflate to AV when we have AVGPR + VGPR RP greater
+  // than addressable limit.
+
+  // TODO - after we separate out AVGPR pressure from the e.g. getVGPRNum
+  // pressure queries, we may need to update the cached RP.
+
+  // TODO - there is a benefit to rescheduling with the constraints, as the
+  // generic trackers do not track AVGPR pressure. But we should teach the
+  // default trackers about AVGPR rather than doing rescheduling here.
+  return false;
+}
+
 bool UnclusteredHighRPStage::initGCNSchedStage() {
   if (DisableUnclusterHighRP)
     return false;
@@ -1278,6 +1360,9 @@ void GCNSchedStage::checkScheduling() {
   LLVM_DEBUG(dbgs() << "Pressure after scheduling: " << print(PressureAfter));
   LLVM_DEBUG(dbgs() << "Region: " << RegionIdx << ".\n");
 
+  if (PressureAfter.getAVGPRNum())
+    DAG.RegionsWithAVRegs[RegionIdx] = true;
+
   unsigned DynamicVGPRBlockSize = DAG.MFI.getDynamicVGPRBlockSize();
 
   if (PressureAfter.getSGPRNum() <= S.SGPRCriticalLimit &&
@@ -1331,6 +1416,9 @@ void GCNSchedStage::checkScheduling() {
   unsigned MaxArchVGPRs = std::min(MaxVGPRs, ST.getAddressableNumArchVGPRs());
   unsigned MaxSGPRs = ST.getMaxNumSGPRs(MF);
 
+  if (PressureAfter.getArchVGPRNum() > ST.getAddressableNumArchVGPRs())
+    DAG.RegionsWithExcessVGPRRP[RegionIdx] = true;
+
   if (PressureAfter.getVGPRNum(ST.hasGFX90AInsts()) > MaxVGPRs ||
       PressureAfter.getArchVGPRNum() > MaxArchVGPRs ||
       PressureAfter.getAGPRNum() > MaxArchVGPRs ||
diff --git a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.h b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.h
index 94cd795bbc8f6..7575d7611bbcb 100644
--- a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.h
+++ b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.h
@@ -28,11 +28,12 @@ class GCNSchedStage;
 
 enum class GCNSchedStageID : unsigned {
   OccInitialSchedule = 0,
-  UnclusteredHighRPReschedule = 1,
-  ClusteredLowOccupancyReschedule = 2,
-  PreRARematerialize = 3,
-  ILPInitialSchedule = 4,
-  MemoryClauseInitialSchedule = 5
+  AVGPRRewriteSchedule = 1,
+  UnclusteredHighRPReschedule = 2,
+  ClusteredLowOccupancyReschedule = 3,
+  PreRARematerialize = 4,
+  ILPInitialSchedule = 5,
+  MemoryClauseInitialSchedule = 6
 };
 
 #ifndef NDEBUG
@@ -224,6 +225,7 @@ using RegionBoundaries =
 class GCNScheduleDAGMILive final : public ScheduleDAGMILive {
   friend class GCNSchedStage;
   friend class OccInitialScheduleStage;
+  friend class AVGPRRewriteScheduleStage;
   friend class UnclusteredHighRPStage;
   friend class ClusteredLowOccStage;
   friend class PreRARematStage;
@@ -250,9 +252,15 @@ class GCNScheduleDAGMILive final : public ScheduleDAGMILive {
   // limit. Register pressure in these regions usually will result in spilling.
   BitVector RegionsWithExcessRP;
 
+  // Regions that have VGPR RP which exceed the addressable limit.
+  BitVector RegionsWithExcessVGPRRP;
+
   // Regions that has the same occupancy as the latest MinOccupancy
   BitVector RegionsWithMinOcc;
 
+  // Regions which use the AV RC.
+  BitVector RegionsWithAVRegs;
+
   // Regions that have IGLP instructions (SCHED_GROUP_BARRIER or IGLP_OPT).
   BitVector RegionsWithIGLPInstrs;
 
@@ -401,6 +409,18 @@ class OccInitialScheduleStage : public GCNSchedStage {
       : GCNSchedStage(StageID, DAG) {}
 };
 
+class AVGPRRewriteScheduleStage : public GCNSchedStage {
+private:
+  bool reconstrainRegClass(Register Reg,
+                           const TargetRegisterClass *NewRC) const;
+
+public:
+  bool initGCNSchedStage() override;
+
+  AVGPRRewriteScheduleStage(GCNSchedStageID StageID, GCNScheduleDAGMILive &DAG)
+      : GCNSchedStage(StageID, DAG) {}
+};
+
 class UnclusteredHighRPStage : public GCNSchedStage {
 private:
   // Save the initial occupancy before starting this stage.
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.iglp.opt.single.2b.mir b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.iglp.opt.single.2b.mir
index e93595b9ef273..0ec67f44e2cbb 100644
--- a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.iglp.opt.single.2b.mir
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.iglp.opt.single.2b.mir
@@ -23,10 +23,10 @@ body:             |
   ; GCN-NEXT: {{  $}}
   ; GCN-NEXT:   [[DEF:%[0-9]+]]:areg_512_align2 = IMPLICIT_DEF
   ; GCN-NEXT:   [[DEF1:%[0-9]+]]:areg_512_align2 = IMPLICIT_DEF
-  ; GCN-NEXT:   [[DEF2:%[0-9]+]]:av_128_align2 = IMPLICIT_DEF
-  ; GCN-NEXT:   [[DEF3:%[0-9]+]]:av_128_align2 = IMPLICIT_DEF
-  ; GCN-NEXT:   [[DEF4:%[0-9]+]]:av_128_align2 = IMPLICIT_DEF
-  ; GCN-NEXT:   [[DEF5:%[0-9]+]]:av_128_align2 = IMPLICIT_DEF
+  ; GCN-NEXT:   [[DEF2:%[0-9]+]]:vreg_128_align2 = IMPLICIT_DEF
+  ; GCN-NEXT:   [[DEF3:%[0-9]+]]:vreg_128_align2 = IMPLICIT_DEF
+  ; GCN-NEXT:   [[DEF4:%[0-9]+]]:vreg_128_align2 = IMPLICIT_DEF
+  ; GCN-NEXT:   [[DEF5:%[0-9]+]]:vreg_128_align2 = IMPLICIT_DEF
   ; GCN-NEXT:   [[DEF6:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
   ; GCN-NEXT:   [[DEF7:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
   ; GCN-NEXT:   [[DEF8:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
@@ -75,40 +75,40 @@ body:             |
   ; GCN-NEXT: bb.1:
   ; GCN-NEXT:   successors: %bb.1(0x40000000), %bb.2(0x40000000)
   ; GCN-NEXT: {{  $}}
-  ; GCN-NEXT:   [[DS_READ_B128_gfx9_:%[0-9]+]]:av_128_align2 = DS_READ_B128_gfx9 [[DEF6]], 0, 0, implicit $exec :: (load (s128) from %ir.in0, !alias.scope !0, addrspace 3)
-  ; GCN-NEXT:   [[DS_READ_B128_gfx9_1:%[0-9]+]]:av_128_align2 = DS_READ_B128_gfx9 [[DEF7]], 0, 0, implicit $exec :: (load (s128) from %ir.in4, !alias.scope !0, addrspace 3)
-  ; GCN-NEXT:   [[DS_READ_B128_gfx9_2:%[0-9]+]]:av_128_align2 = DS_READ_B128_gfx9 [[DEF6]], 1040, 0, implicit $exec :: (load (s128) from %ir.in1, !alias.scope !0, addrspace 3)
-  ; GCN-NEXT:   [[DS_READ_B128_gfx9_3:%[0-9]+]]:av_128_align2 = DS_READ_B128_gfx9 [[DEF7]], 2064, 0, implicit $exec :: (load (s128) from %ir.in5, !alias.scope !0, addrspace 3)
+  ; GCN-NEXT:   [[DS_READ_B128_gfx9_:%[0-9]+]]:vreg_128_align2 = DS_READ_B128_gfx9 [[DEF6]], 0, 0, implicit $exec :: (load (s128) from %ir.in0, !alias.scope !0, addrspace 3)
+  ; GCN-NEXT:   [[DS_READ_B128_gfx9_1:%[0-9]+]]:vreg_128_align2 = DS_READ_B128_gfx9 [[DEF7]], 0, 0, implicit $exec :: (load (s128) from %ir.in4, !alias.scope !0, addrspace 3)
+  ; GCN-NEXT:   [[DS_READ_B128_gfx9_2:%[0-9]+]]:vreg_128_align2 = DS_READ_B128_gfx9 [[DEF6]], 1040, 0, implicit $exec :: (load (s128) from %ir.in1, !alias.scope !0, addrspace 3)
+  ; GCN-NEXT:   [[DS_READ_B128_gfx9_3:%[0-9]+]]:vreg_128_align2 = DS_READ_B128_gfx9 [[DEF7]], 2064, 0, implicit $exec :: (load (s128) from %ir.in5, !alias.scope !0, addrspace 3)
   ; GCN-NEXT:   [[DEF:%[0-9]+]]:areg_512_align2 = contract V_MFMA_F32_32X32X8F16_mac_e64 [[DS_READ_B128_gfx9_]].sub0_sub1, [[DS_READ_B128_gfx9_1]].sub0_sub1, [[DEF]], 0, 0, 0, implicit $mode, implicit $exec
-  ; GCN-NEXT:   [[DS_READ_B128_gfx9_4:%[0-9]+]]:av_128_align2 = DS_READ_B128_gfx9 [[DEF6]], 2080, 0, implicit $exec :: (load (s128) from %ir.in2, !alias.scope !0, addrspace 3)
+  ; GCN-NEXT:   [[DS_READ_B128_gfx9_4:%[0-9]+]]:vreg_128_align2 = DS_READ_B128_gfx9 [[DEF6]], 2080, 0, implicit $exec :: (load (s128) from %ir.in2, !alias.scope !0, addrspace 3)
   ; GCN-NEXT:   [[V_ADD_U32_e32_:%[0-9]+]]:vgpr_32 = V_ADD_U32_e32 [[DEF42]], [[DEF8]], implicit $exec
   ; GCN-NEXT:   [[V_ADD_U32_e32_1:%[0-9]+]]:vgpr_32 = V_ADD_U32_e32 [[DEF42]], [[DEF9]], implicit $exec
   ; GCN-NEXT:   dead [[V_MFMA_F32_32X32X8F16_mac_e64_:%[0-9]+]]:areg_512_align2 = contract V_MFMA_F32_32X32X8F16_mac_e64 [[DS_READ_B128_gfx9_]].sub2_sub3, [[DS_READ_B128_gfx9_1]].sub2_sub3, [[DEF]], 0, 0, 0, implicit $mode, implicit $exec
-  ; GCN-NEXT:   [[DS_READ_B128_gfx9_5:%[0-9]+]]:av_128_align2 = DS_READ_B128_gfx9 [[DEF6]], 3120, 0, implicit $exec :: (load (s128) from %ir.in3, !alias.scope !0, addrspace 3)
+  ; GCN-NEXT:   [[DS_READ_B128_gfx9_5:%[0-9]+]]:vreg_128_align2 = DS_READ_B128_gfx9 [[DEF6]], 3120, 0, implicit $exec :: (load (s128) from %ir.in3, !alias.scope !0, addrspace 3)
   ; GCN-NEXT:   [[V_ADD_U32_e32_2:%[0-9]+]]:vgpr_32 = V_ADD_U32_e32 [[DEF42]], [[DEF10]], implicit $exec
   ; GCN-NEXT:   [[V_ADD_U32_e32_3:%[0-9]+]]:vgpr_32 = V_ADD_U32_e32 [[DEF42]], [[DEF11]], implicit $exec
   ; GCN-NEXT:   [[DEF:%[0-9]+]]:areg_512_align2 = contract V_MFMA_F32_32X32X8F16_mac_e64 [[DS_READ_B128_gfx9_2]].sub0_sub1, [[DS_READ_B128_gfx9_3]].sub0_sub1, [[DEF]], 0, 0, 0, implicit $mode, implicit $exec
-  ; GCN-NEXT:   [[DS_READ_B128_gfx9_6:%[0-9]+]]:av_128_align2 = DS_READ_B128_gfx9 [[DEF7]], 4128, 0, implicit $exec :: (load (s128) from %ir.in6, !alias.scope !0, addrspace 3)
+  ; GCN-NEXT:   [[DS_READ_B128_gfx9_6:%[0-9]+]]:vreg_128_align2 = DS_READ_B128_gfx9 [[DEF7]], 4128, 0, implicit $exec :: (load (s128) from %ir.in6, !alias.scope !0, addrspace 3)
   ; GCN-NEXT:   [[V_ADD_U32_e32_4:%[0-9]+]]:vgpr_32 = V_ADD_U32_e32 [[DEF42]], [[DEF12]], implicit $exec
   ; GCN-NEXT:   [[V_ADD_U32_e32_5:%[0-9]+]]:vgpr_32 = V_ADD_U32_e32 [[DEF42]], [[DEF13]], implicit $exec
   ; GCN-NEXT:   [[DEF:%[0-9]+]]:areg_512_align2 = contract V_MFMA_F32_32X32X8F16_mac_e64 [[DS_READ_B128_gfx9_2]].sub2_sub3, [[DS_READ_B128_gfx9_3]].sub2_sub3, [[DEF]], 0, 0, 0, implicit $mode, implicit $exec
-  ; GCN-NEXT:   [[DS_READ_B128_gfx9_7:%[0-9]+]]:av_128_align2 = DS_READ_B128_gfx9 [[DEF7]], 6192, 0, implicit $exec :: (load (s128) from %ir.in7, !alias.scope !0, addrspace 3)
+  ; GCN-NEXT:   [[DS_READ_B128_gfx9_7:%[0-9]+]]:vreg_128_align2 = DS_READ_B128_gfx9 [[DEF7]], 6192, 0, implicit $exec :: (load (s128) from %ir.in7, !alias.scope !0, addrspace 3)
   ; GCN-NEXT:   [[V_ADD_U32_e32_6:%[0-9]+]]:vgpr_32 = V_ADD_U32_e32 [[DEF42]], [[DEF14]], implicit $exec
   ; GCN-NEXT:   [[V_ADD_U32_e32_7:%[0-9]+]]:vgpr_32 = V_ADD_U32_e32 [[DEF42]], [[DEF15]], implicit $exec
   ; GCN-NEXT:   [[DEF:%[0-9]+]]:areg_512_align2 = contract V_MFMA_F32_32X32X8F16_mac_e64 [[DS_READ_B128_gfx9_4]].sub0_sub1, [[DS_READ_B128_gfx9_6]].sub0_sub1, [[DEF]], 0, 0, 0, implicit $mode, implicit $exec
-  ; GCN-NEXT:   [[DS_READ_B128_gfx9_8:%[0-9]+]]:av_128_align2 = DS_READ_B128_gfx9 [[DEF7]], 1024, 0, implicit $exec :: (load (s128) from %ir.in8, !alias.scope !0, addrspace 3)
+  ; GCN-NEXT:   [[DS_READ_B128_gfx9_8:%[0-9]+]]:vreg_128_align2 = DS_READ_B128_gfx9 [[DEF7]], 1024, 0, implicit $exec :: (load (s128) from %ir.in8, !alias.scope !0, addrspace 3)
   ; GCN-NEXT:   [[V_ADD_U32_e32_8:%[0-9]+]]:vgpr_32 = V_ADD_U32_e32 [[DEF42]], [[DEF16]], implicit $exec
   ; GCN-NEXT:   [[V_ADD_U32_e32_9:%[0-9]+]]:vgpr_32 = V_ADD_U32_e32 [[DEF42]], [[DEF17]], implicit $exec
   ; GCN-NEXT:   [[DEF:%[0-9]+]]:areg_512_align2 = contract V_MFMA_F32_32X32X8F16_mac_e64 [[DS_READ_B128_gfx9_4]].sub2_sub3, [[DS_READ_B128_gfx9_6]].sub2_sub3, [[DEF]], 0, 0, 0, implicit $mode, implicit $exec
-  ; GCN-NEXT:   [[DS_READ_B128_gfx9_9:%[0-9]+]]:av_128_align2 = DS_READ_B128_gfx9 [[DEF7]], 3088, 0, implicit $exec :: (load (s128) from %ir.in9, !alias.scope !0, addrspace 3)
+  ; GCN-NEXT:   [[DS_READ_B128_gfx9_9:%[0-9]+]]:vreg_128_align2 = DS_READ_B128_gfx9 [[DEF7]], 3088, 0, implicit $exec :: (load (s128) from %ir.in9, !alias.scope !0, addrspace 3)
   ; GCN-NEXT:   [[V_ADD_U32_e32_10:%[0-9]+]]:vgpr_32 = V_ADD_U32_e32 [[DEF42]], [[DEF18]], implicit $exec
   ; GCN-NEXT:   [[V_ADD_U32_e32_11:%[0-9]+]]:vgpr_32 = V_ADD_U32_e32 [[DEF42]], [[DEF19]], implicit $exec
   ; GCN-NEXT:   [[DEF:%[0-9]+]]:areg_512_align2 = contract V_MFMA_F32_32X32X8F16_mac_e64 [[DS_READ_B128_gfx9_5]].sub0_sub1, [[DS_READ_B128_gfx9_7]].sub0_sub1, [[DEF]], 0, 0, 0, implicit $mode, implicit $exec
-  ; GCN-NEXT:   [[DS_READ_B128_gfx9_10:%[0-9]+]]:av_128_align2 = DS_READ_B128_gfx9 [[DEF7]], 5152, 0, implicit $exec :: (load (s128) from %ir.in10, !alias.scope !0, addrspace 3)
+  ; GCN-NEXT:   [[DS_READ_B128_gfx9_10:%[0-9]+]]:vreg_128_align2 = DS_READ_B128_gfx9 [[DEF7]], 5152,...
[truncated]

arsenm (Contributor) left a comment

I don't really see the point of doing this. All the downstream register allocator code is free to re-inflate these right back to the AV_*. If we do not exceed the pressure threshold, we can always just remove the A registers from the allocation order by reserving them. The scheduler should not be actively trying to increase restrictions just in case. If you're changing the instruction type, you can adjust the register class directly as part of that transform.
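
A minimal sketch of the reservation alternative described above, assuming a hypothetical whole-function pressure query (functionFitsInArchVGPRs is not an existing API; the reservation loop mirrors how SIRegisterInfo::getReservedRegs handles other register ranges):

// Sketch only: take AGPRs out of the allocation order when VGPR pressure
// stays under the addressable limit, so AV_* vregs can only receive
// ArchVGPRs. functionFitsInArchVGPRs() is a hypothetical query, not an
// existing API.
BitVector SIRegisterInfo::getReservedRegs(const MachineFunction &MF) const {
  BitVector Reserved(getNumRegs());
  // ... existing reservations elided ...
  if (functionFitsInArchVGPRs(MF)) {
    for (unsigned I = 0, E = AMDGPU::AGPR_32RegClass.getNumRegs(); I != E; ++I)
      reserveRegisterTuples(Reserved, AMDGPU::AGPR_32RegClass.getRegister(I));
  }
  return Reserved;
}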

jrbyrnes (Contributor, Author) commented Jul 24, 2025

I don't really see the point of doing this. All the downstream register allocator code is free to re-inflate these right back to the AV_*.

The issue is that the best register class priority depends on the register pressure situation. If we need to use AGPRs, then it is best to assign AV last; otherwise, it is best to assign based on the bitwidth of the register. This addresses the minor increase in register usage caused by #146606.

A more direct solution is to provide a hook at some level in DefaultPriorityAdvisor::getPriority to allow dynamic RC priorities based on the RP situation.
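
A minimal sketch of such a hook, assuming a new target callback (adjustPriority and computeStaticPriority are hypothetical; only DefaultPriorityAdvisor::getPriority exists today):

// Sketch only: let the target re-rank a live interval based on its register
// class and the current pressure situation.
unsigned DefaultPriorityAdvisor::getPriority(const LiveInterval &LI) const {
  // Existing size/weight-based computation, elided behind a hypothetical
  // helper.
  unsigned Prio = computeStaticPriority(LI);
  const TargetRegisterInfo *TRI = MF.getSubtarget().getRegisterInfo();
  // Hypothetical hook: AMDGPU could demote AV_* classes here only when the
  // function actually needs the AGPR assignment property.
  return TRI->adjustPriority(LI, MF.getRegInfo().getRegClass(LI.reg()), Prio);
}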

Comment on lines +255 to +263
// Regions that have VGPR RP which exceed the addressable limit.
BitVector RegionsWithExcessVGPRRP;

// Regions that has the same occupancy as the latest MinOccupancy
BitVector RegionsWithMinOcc;

// Regions which use the AV RC.
BitVector RegionsWithAVRegs;


Can these two bitvectors be internal to the new stage instead? This avoids recomputation between all other stages that don't care about them and eliminates the possibility that we forget to keep them up to date (after a scheduling revert for example).
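
A minimal sketch of that refactor, deriving both sets from the cached per-region pressure inside the stage (names follow the patch; the loop body is illustrative):

// Sketch only: compute the region sets locally so no other stage has to
// keep them up to date after reverts.
bool AVGPRRewriteScheduleStage::initGCNSchedStage() {
  BitVector RegionsWithAVRegs(DAG.Regions.size());
  BitVector RegionsWithExcessVGPRRP(DAG.Regions.size());
  for (unsigned I = 0, E = DAG.Regions.size(); I != E; ++I) {
    const GCNRegPressure &RP = DAG.Pressure[I];
    if (RP.getAVGPRNum())
      RegionsWithAVRegs.set(I);
    if (RP.getArchVGPRNum() > ST.getAddressableNumArchVGPRs())
      RegionsWithExcessVGPRRP.set(I);
  }
  if (RegionsWithAVRegs.none() || RegionsWithExcessVGPRRP.any())
    return false;
  // ... constrain AV -> VGPR as in the patch ...
  return false;
}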

// using AGPR if strictly necessary. That is, if the required number of VGPRs
// exceeds the addressable limit. Otherwise, we should be stricly using VGPRs
// to minimize cross RC copies. Thus, if we are under this limit, we should
// constrain AVReg- > VReg.

Suggested change:
-// constrain AVReg- > VReg.
+// constrain AVReg -> VReg.

arsenm (Contributor) commented Jul 24, 2025

The issue is that the best register class priority depends on the register pressure situation. If we need to use AGPRs, then it is best to assign AV last; otherwise, it is best to assign based on the bitwidth of the register. This addresses the minor increase in register usage caused by #146606.

Can you just drop the AV priorities to the bottom regardless of bitwidth?

A more direct solution is to provide a hook at some level in DefaultPriorityAdvisor::getPriority to allow dynamic RC priorities based on the RP situation.

I don't think the scheduler should be touching register classes at all. It has impacts outside of the current scheduling region, and the scheduler should not be increasing constraints.

jrbyrnes (Contributor, Author)

Can you just drop the AV priorities to the bottom regardless of bitwidth?

We can, but we know it is better to use bitwidth if we don't need the AGPR assignment from AVGPR. So I think we should try to teach the compiler about this in some way to reduce regression likelihood -- so long as doing so doesn't require some excessive burden.

I don't think the scheduler should be touching register classes at all. It has impacts outside of the current scheduling region, and the scheduler should not be increasing constraints.

It is clear to me now that this particular RC transform should not happen in max-occupancy scheduling. It does not directly help reduce RP or directly help us improve occupancy. That said, I believe RC transforms which can improve RP and result in better scheduling should happen in the scheduler. There is precedent for pre-sched-stage function-wide transforms (e.g., remat).
