Commit 77f8fa3

Initial release: GPU PCIe diagnostic tool v2.7.4
0 parents  commit 77f8fa3

5 files changed

Lines changed: 1305 additions & 0 deletions

.gitattributes

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
*.cu text
*.md text
*.sh text

.gitignore

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
# Build artifacts
pcie_diag
*.o

# Runtime results
results/

# Nsight / profiling
*.nsys-rep
*.sqlite
*.qdrep

# Editor / OS noise
*~
.DS_Store

Makefile

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
NVCC := nvcc
NVCC_FLAGS := -O3
# Pass -pthread to the host compiler via -Xcompiler, matching the manual
# build command in the README.
LIBS := -lnvidia-ml -Xcompiler -pthread
TARGET := pcie_diag
SRC := pcie_diagnostic_pro.cu

.PHONY: all clean

all: $(TARGET)

$(TARGET): $(SRC)
	$(NVCC) $(NVCC_FLAGS) $(SRC) $(LIBS) -o $(TARGET)

clean:
	rm -f $(TARGET)

README.md

Lines changed: 243 additions & 0 deletions
@@ -0,0 +1,243 @@

# GPU PCIe Diagnostic & Bandwidth Analysis

A deterministic command-line tool for validating GPU PCIe link health, bandwidth, and real-world PCIe utilization using only observable hardware data.

This tool answers one question reliably:

> **Is my GPU’s PCIe link behaving as it should, and can I prove it?**

No registry hacks.
No BIOS assumptions.
No “magic” optimizations.

Only measurable link state, copy throughput, and hardware counters.

***

## What This Tool Does

This tool performs hardware-observable PCIe diagnostics and reports factual results with deterministic verdicts.

It measures and reports directly from GPU hardware:

- PCIe **current and maximum** link generation and width (via NVML)
- **Peak Host→Device and Device→Host copy bandwidth** using CUDA memcpy timing
- **Sustained PCIe utilization under load** using NVML TX/RX counters
- Efficiency relative to theoretical PCIe payload bandwidth
- A clear **verdict** derived from observable conditions only

The tool does not attempt to tune, fix, or modify system configuration.
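
The efficiency figure compares measured throughput with the link's theoretical bandwidth. The sketch below shows one common way to derive that ceiling from generation and width, assuming the standard per-lane transfer rates and line encodings (8b/10b for Gen1 and Gen2, 128b/130b for Gen3 to Gen5). It does not account for packet header overhead, the function names are illustrative, and the tool's own formula is not shown in this README, so small rounding differences from its output (for example 15.75 vs. 15.76 GB/s for Gen3 x16) are to be expected.

```python
# Per-lane raw transfer rate in GT/s for each PCIe generation (Gen1 to Gen5).
RAW_GT_PER_S = {1: 2.5, 2: 5.0, 3: 8.0, 4: 16.0, 5: 32.0}


def encoding_efficiency(gen: int) -> float:
    """Fraction of transferred bits that remain after line encoding."""
    return 8 / 10 if gen <= 2 else 128 / 130


def theoretical_gb_per_s(gen: int, width: int) -> float:
    """Theoretical one-direction bandwidth in GB/s (1 GB = 1e9 bytes)."""
    if gen not in RAW_GT_PER_S:
        raise ValueError(f"unsupported PCIe generation: {gen}")
    gbit_per_s = RAW_GT_PER_S[gen] * encoding_efficiency(gen) * width
    return gbit_per_s / 8  # Gbit/s -> GB/s


def efficiency_percent(measured_gb_per_s: float, gen: int, width: int) -> float:
    """Measured throughput as a percentage of the theoretical ceiling."""
    return 100.0 * measured_gb_per_s / theoretical_gb_per_s(gen, width)


if __name__ == "__main__":
    print(f"Gen3 x16: {theoretical_gb_per_s(3, 16):.2f} GB/s")
    print(f"Gen3 x8:  {theoretical_gb_per_s(3, 8):.2f} GB/s")
```

A link that trains at x8 instead of x16 halves this ceiling, which is why link state and throughput are reported side by side.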

***

## Verdict Semantics

- **OK**: The negotiated PCIe link and measured throughput are consistent with expected behavior.
- **DEGRADED**: The GPU is operating below its maximum supported PCIe generation or width.
- **UNDERPERFORMING**: The full link is negotiated, but sustained bandwidth is significantly lower than expected.

Verdicts are rule-based and derived only from measured data.
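
The three verdicts can be read as an ordered rule chain. The following is a minimal sketch of such a chain, not the tool's actual logic: the `LinkSample` type, the `verdict` function, and especially the 60% efficiency cutoff are hypothetical, since this README only says "significantly lower than expected".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkSample:
    cur_gen: int           # negotiated PCIe generation
    cur_width: int         # negotiated lane count
    max_gen: int           # maximum generation the GPU supports
    max_width: int         # maximum lane count the GPU supports
    efficiency_pct: float  # measured / theoretical bandwidth, in percent


# Hypothetical cutoff; the real threshold is not documented in this README.
MIN_EFFICIENCY_PCT = 60.0


def verdict(s: LinkSample, min_efficiency_pct: float = MIN_EFFICIENCY_PCT) -> str:
    """Map observed link state and efficiency to one of the three verdicts."""
    # Link state is checked first: a downgraded link already explains low throughput.
    if s.cur_gen < s.max_gen or s.cur_width < s.max_width:
        return "DEGRADED"
    if s.efficiency_pct < min_efficiency_pct:
        return "UNDERPERFORMING"
    return "OK"
```

Checking the link before the throughput keeps the two failure modes distinct: low bandwidth on a fully negotiated link points away from slot, riser, or lane-sharing problems.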

***

## Why This Tool Exists

Modern systems frequently exhibit PCIe issues that are difficult to diagnose:

- GPUs negotiating **x8 / x4 / x1** instead of **x16**
- PCIe generation downgrades after BIOS or firmware updates
- Slot bifurcation, riser cable, or motherboard lane-sharing issues
- Reduced PCIe bandwidth while the system otherwise reports normal status
- Confusion between PCIe transport limits and workload bottlenecks

This tool exists to provide:

1. A **reproducible PCIe diagnostic baseline**
2. **Hardware-level proof** of PCIe behavior
3. **Isolation of link negotiation** from kernel/workload effects

***

## Example Output

    GPU PCIe Diagnostic & Bandwidth Analysis v2.7.4
    GPU: NVIDIA GeForce GTX 1080
    BDF: 00000000:01:00.0
    UUID: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx (redacted)
    PCIe Link
    Current: Gen3 x16
    Max Cap: Gen3 x16
    Theoretical (payload): 15.76 GB/s
    Transfer Size: 1024 MiB
    Peak Copy Bandwidth
    Host → Device: 12.5 GB/s
    Device → Host: 12.7 GB/s
    Telemetry (NVML)
    Window: 5.0 s (50 samples @ 100 ms)
    TX avg: 7.6 GB/s
    RX avg: 7.1 GB/s
    Combined: 14.7 GB/s
    Verdict
    State: OK
    Reason: Throughput and link state are consistent with a healthy PCIe path
    Efficiency: 93.5%

    System Signals (informational)
    MaxReadReq: 512 bytes
    Persistence Mode: Disabled
    ASPM Policy (sysfs string): [default] performance powersave powersupersave
    IOMMU: Platform default (no explicit flags)

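
The Telemetry block above reports averages over a sampling window (50 samples at 100 ms in this example). Here is a minimal sketch of that reduction, assuming per-sample TX/RX readings in KB/s, the unit NVML's `nvmlDeviceGetPcieThroughput` reports. Treating 1 KB as 1000 bytes is an assumption, and the function name and the sample values used below are illustrative, not the tool's code or real measurements.

```python
from statistics import fmean


def summarize_window(tx_kb_s, rx_kb_s, interval_ms):
    """Average paired TX/RX samples (KB/s) and report GB/s plus window length."""
    if not tx_kb_s or len(tx_kb_s) != len(rx_kb_s):
        raise ValueError("need equal, non-empty TX and RX sample lists")
    kb_to_gb = 1e3 / 1e9  # assumption: 1 KB = 1000 bytes, 1 GB = 1e9 bytes
    tx_avg = fmean(tx_kb_s) * kb_to_gb
    rx_avg = fmean(rx_kb_s) * kb_to_gb
    return {
        "window_s": len(tx_kb_s) * interval_ms / 1000.0,
        "tx_avg_gb_s": tx_avg,
        "rx_avg_gb_s": rx_avg,
        "combined_gb_s": tx_avg + rx_avg,
    }


if __name__ == "__main__":
    # Synthetic, constant samples chosen to mirror the example's averages.
    summary = summarize_window([7_600_000] * 50, [7_100_000] * 50, interval_ms=100)
    for key, value in summary.items():
        print(f"{key}: {value:.1f}")
```

Averaging over more samples smooths out short bursts and idle gaps between copies, which is the reason a longer window gives steadier figures.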
***

## Requirements

- NVIDIA GPU with a supported driver
- CUDA Toolkit (for `nvcc`)
- NVML development library (`-lnvidia-ml`)
- Linux operating system (tested on Ubuntu 20.04+)

***

## Permissions & Logging Notes

On some Linux systems, PCIe and NVML diagnostics require elevated privileges due to kernel and driver access controls.
If log files were previously created using `sudo`, the results directory may become root-owned. In that case, subsequent runs may prompt for a password when appending logs.

To restore normal user access to the results directory:

```bash
sudo chown -R "$USER:$USER" results/
```

***

## Build

    make

or manually:

    nvcc -O3 pcie_diagnostic_pro.cu -lnvidia-ml -Xcompiler -pthread -o pcie_diag

***

## Usage

    ./pcie_diag 1024

The positional argument is the transfer size in MiB (`Transfer Size: 1024 MiB` in the example output).

***

## Logging

    ./pcie_diag 1024 --log --csv
    ./pcie_diag 1024 --log --json
    ./pcie_diag 1024 --log --csv --json

Logs are written to:

- `results/csv/pcie_log.csv`
- `results/json/pcie_sessions.json`

***

## Extended Telemetry Window

    ./pcie_diag 1024 --duration-ms 8000

Sets the telemetry window to 8000 ms. A longer window improves measurement stability.

***

## Optional Integrity Counters

    ./pcie_diag 1024 --integrity

- Enables read-only inspection of PCIe Advanced Error Reporting (AER) counters via Linux sysfs, if exposed by the platform.
- If counters are unavailable on the platform, integrity checks are skipped automatically and the skip is reported.

***

## Multi-GPU Logging Behavior

When running in multi-GPU mode (`--all-gpus`), each detected GPU is evaluated independently.

- One result row (CSV) or object (JSON) is emitted per GPU per run.
- Each entry includes device UUID and PCIe BDF for unambiguous attribution.
- Multi-GPU configurations have not been exhaustively validated on all platforms.
- Users are encouraged to verify results on their specific hardware.

Example:

```bash
./pcie_diag 1024 --all-gpus --log --csv
./pcie_diag 1024 --all-gpus --log --json
./pcie_diag 1024 --gpu-index 1   # Target single GPU by index
```
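
Because every entry carries the PCIe BDF, a result can be tied back to the device's sysfs node, which is also where Linux exposes the AER counters that `--integrity` inspects. The sketch below is an assumption about how such a lookup could be done, not the tool's implementation. It shortens the 8-digit PCI domain that NVML reports to the 4-digit form used under `/sys/bus/pci/devices`, then parses the kernel's `aer_dev_correctable`, `aer_dev_nonfatal`, and `aer_dev_fatal` files, which hold one `NAME COUNT` pair per line. All function names are illustrative.

```python
from pathlib import Path

AER_FILES = ("aer_dev_correctable", "aer_dev_nonfatal", "aer_dev_fatal")


def nvml_bdf_to_sysfs(bdf: str) -> str:
    """'00000000:01:00.0' (NVML style) -> '0000:01:00.0' (sysfs style)."""
    domain, bus, dev_fn = bdf.lower().split(":")
    return f"{domain[-4:]}:{bus}:{dev_fn}"


def parse_aer_stats(text: str) -> dict:
    """Parse 'NAME COUNT' lines such as 'BadTLP 0' into a dict of ints."""
    counts = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counts[parts[0]] = int(parts[1])
    return counts


def read_aer_counters(bdf: str, sysfs_root: str = "/sys/bus/pci/devices"):
    """Return {file name: counters}, or None if no AER files are exposed."""
    dev_dir = Path(sysfs_root) / nvml_bdf_to_sysfs(bdf)
    result = {}
    for name in AER_FILES:
        try:
            result[name] = parse_aer_stats((dev_dir / name).read_text())
        except OSError:
            continue  # counter file missing or unreadable: skip it
    return result or None


if __name__ == "__main__":
    counters = read_aer_counters("00000000:01:00.0")
    print(counters if counters is not None else "AER counters not exposed; skipped")
```

Returning `None` when nothing is readable mirrors the documented behavior of skipping integrity checks on platforms that do not expose the counters.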

***

## Logging & Reproducibility

- CSV and JSON logs include stable device identifiers
- Device UUIDs are reported at runtime via NVML for consistent identification across runs
- UUIDs shown in documentation are intentionally **redacted**
- Logs are append-friendly for time-series analysis and automated monitoring
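
An append-only log keyed by UUID lends itself to simple change detection. The sketch below flags rows where a GPU's verdict differs from its previous run. The column names `uuid` and `verdict` are hypothetical: the real header of `results/csv/pcie_log.csv` is not documented in this README, so pass the actual column names when adapting it.

```python
import csv
import io
from collections import defaultdict


def verdict_changes(csv_text, uuid_col="uuid", verdict_col="verdict"):
    """Return {uuid: [(row_number, old, new), ...]} for rows where the verdict changed."""
    last = {}
    changes = defaultdict(list)
    rows = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(rows, start=1):
        uuid, state = row[uuid_col], row[verdict_col]
        if uuid in last and last[uuid] != state:
            changes[uuid].append((row_number, last[uuid], state))
        last[uuid] = state
    return dict(changes)


if __name__ == "__main__":
    # Synthetic log with hypothetical column names, for illustration only.
    sample = "uuid,verdict\nGPU-a,OK\nGPU-b,OK\nGPU-a,DEGRADED\nGPU-b,OK\n"
    print(verdict_changes(sample))
```

Keying on the UUID rather than the GPU index keeps the history attached to the same physical device even if enumeration order changes between runs.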

***

## Scope & Limitations

- This tool evaluates PCIe transport behavior only
- It does not measure kernel performance or application-level efficiency
- It does not modify BIOS, firmware, registry, or PCIe configuration
- It reports observable facts only and never infers beyond available data

***

## Validation

- Memcpy timing and PCIe behavior were cross-validated during development using Nsight Systems.
- Nsight is not required to use this tool and is referenced only as an external correctness check.

***

## Author

Author: Joe McLaren (Human–AI collaborative engineering)
https://github.com/parallelArchitect

***

## License

MIT License

Copyright (c) 2025 Joe McLaren

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

## References

- **NVIDIA PCIe Logging & Counters**
  https://docs.nvidia.com/networking/display/bfswtroubleshooting/pcie#src-4103229342_PCIe-LoggingandCounters

- **Linux PCIe AER Documentation**
  https://docs.kernel.org/PCI/pcieaer-howto.html

- **Oracle Linux PCIe AER Overview**
  https://blogs.oracle.com/linux/pci-express-advanced-error-reporting
