Results¶
All benchmark results for the multi-GPU CG stencil solver. Measured on 8× NVIDIA A100-SXM4-80GB (NVLink NV12), median of 10 runs with 3 warmup runs discarded.
For the analysis behind these numbers, see profiling-2d.md (2D, kernel breakdown, roofline) and profiling-3d.md (3D, compute-communication overlap). For measurement methodology, see methodology.md.
2D — Strong Scaling (Custom CG)¶
Multi-GPU Strong Scaling on 8× NVIDIA A100-SXM4-80GB
| Problem Size | 1 GPU | 8 GPUs | Speedup | Efficiency |
|---|---|---|---|---|
| 100M unknowns (10k×10k stencil) | 133.9 ms | 19.3 ms | 6.94× | 86.8% |
| 225M unknowns (15k×15k stencil) | 300.1 ms | 40.4 ms | 7.43× | 92.9% |
| 400M unknowns (20k×20k stencil) | 531.4 ms | 71.0 ms | 7.48× | 93.5% |
Median of 10 runs; 3 warmup runs discarded.
2D — Detailed Scaling (1 / 2 / 4 / 8 GPUs)¶
10000×10000 stencil (100M unknowns)¶
| GPUs | Time (ms) | Speedup | Efficiency |
|---|---|---|---|
| 1 | 133.9 | 1.00× | 100.0% |
| 2 | 68.7 | 1.95× | 97.5% |
| 4 | 35.7 | 3.76× | 93.9% |
| 8 | 19.3 | 6.94× | 86.8% |
15000×15000 stencil (225M unknowns)¶
| GPUs | Time (ms) | Speedup | Efficiency |
|---|---|---|---|
| 1 | 300.1 | 1.00× | 100.0% |
| 2 | 152.5 | 1.97× | 98.4% |
| 4 | 77.7 | 3.86× | 96.5% |
| 8 | 40.4 | 7.43× | 92.9% |
20000×20000 stencil (400M unknowns)¶
| GPUs | Time (ms) | Speedup | Efficiency |
|---|---|---|---|
| 1 | 531.4 | 1.00× | 100.0% |
| 2 | 269.3 | 1.97× | 98.7% |
| 4 | 136.3 | 3.90× | 97.5% |
| 8 | 71.0 | 7.48× | 93.5% |
Convergence: 14 iterations across all configurations. Source: results_archive/results_problem_size_scaling_NVIDIAA100-SXM4-80GB_20260109_123920/.
2D — SpMV Format Comparison¶
Format Comparison on NVIDIA A100 80GB PCIe
| Matrix Size | CSR (cuSPARSE) | STENCIL5 (Custom) | Speedup | Bandwidth Improvement |
|---|---|---|---|---|
| 10k×10k (100M unknowns) | 6.77 ms | 3.25 ms | 2.08× | 1.98× (1182 → 2339 GB/s) |
| 15k×15k (225M unknowns) | 15.00 ms | 7.29 ms | 2.06× | 1.96× (1200 → 2346 GB/s) |
| 20k×20k (400M unknowns) | 26.77 ms | 12.86 ms | 2.08× | 1.98× (1195 → 2364 GB/s) |
Median of 10 runs; 3 warmup runs discarded.
2D — Custom CG vs NVIDIA AmgX¶
Hardware: 8× NVIDIA A100-SXM4-80GB · CUDA 12.8 · Driver 575.57 (same configuration for both solvers)
| Matrix Size | Implementation | 1 GPU | 8 GPUs | Speedup | Efficiency |
|---|---|---|---|---|---|
| 10k×10k | Custom CG | 133.9 ms | 19.3 ms | 6.94× | 86.8% |
| (100M unknowns) | NVIDIA AmgX | 188.7 ms | 27.0 ms | 6.99× | 87.4% |
| 15k×15k | Custom CG | 300.1 ms | 40.4 ms | 7.43× | 92.9% |
| (225M unknowns) | NVIDIA AmgX | 420.0 ms | 57.0 ms | 7.36× | 92.0% |
| 20k×20k | Custom CG | 531.4 ms | 71.0 ms | 7.48× | 93.5% |
| (400M unknowns) | NVIDIA AmgX | 746.7 ms | 102.3 ms | 7.30× | 91.3% |
3D — 7-Point Stencil (Sync vs Overlap)¶
Hardware: 8× NVIDIA A100-SXM4-80GB (NVLink)
| Grid | GPUs | Sync (ms) | Overlap (ms) | Overlap Gain | Iterations |
|---|---|---|---|---|---|
| 128³ | 1 | 73.2 | 74.0 | — | 261 |
| 128³ | 2 | 52.8 | 43.9 | 1.20× | 261 |
| 128³ | 4 | 51.4 | 46.7 | 1.10× | 261 |
| 128³ | 8 | 47.8 | 49.7 | 0.96× | 261 |
| 256³ | 1 | 970.3 | 972.4 | — | 527 |
| 256³ | 2 | 583.3 | 515.7 | 1.13× | 527 |
| 256³ | 4 | 409.0 | 318.0 | 1.29× | 527 |
| 256³ | 8 | 304.7 | 265.8 | 1.15× | 527 |
| 512³ | 1 | 15127 | 15129 | — | 1065 |
| 512³ | 2 | 8211 | 7682 | 1.07× | 1065 |
| 512³ | 4 | 5088 | 3944 | 1.29× | 1065 |
| 512³ | 8 | 3323 | 2453 | 1.36× | 1065 |
1-GPU rows show no overlap gain (no communication to hide). 128³/8GPU shows slight overhead (0.96×): per-GPU workload is too small for dual-stream overhead to pay off.
3D — 27-Point Stencil (Sync vs Overlap)¶
| Grid | GPUs | Sync (ms) | Overlap (ms) | Overlap Gain | Iterations |
|---|---|---|---|---|---|
| 128³ | 1 | 89.2 | 89.6 | — | 151 |
| 128³ | 2 | 57.3 | 51.1 | 1.12× | 151 |
| 128³ | 4 | 47.3 | 36.6 | 1.29× | 151 |
| 128³ | 8 | 40.5 | 33.6 | 1.21× | 151 |
| 256³ | 1 | 1315.4 | 1315.4 | — | 303 |
| 256³ | 2 | 718.9 | 680.3 | 1.06× | 303 |
| 256³ | 4 | 447.5 | 367.5 | 1.22× | 303 |
| 256³ | 8 | 294.0 | 203.5 | 1.45× | 303 |
| 512³ | 1 | 22016 | 21997 | — | 611 |
| 512³ | 2 | 11438 | 11142 | 1.03× | 611 |
| 512³ | 4 | 6461 | 5815 | 1.11× | 611 |
| 512³ | 8 | 3809 | 3110 | 1.23× | 611 |
3D — Strong Scaling Efficiency (overlap solver)¶
7-point stencil — speedup relative to 1-GPU sync baseline:
| Grid | 1 GPU | 2 GPUs | 4 GPUs | 8 GPUs |
|---|---|---|---|---|
| 128³ | 1.00× | 1.69× | 1.59× | 1.49× |
| 256³ | 1.00× | 1.88× | 3.06× | 3.66× |
| 512³ | 1.00× | 1.97× | 3.84× | 6.17× |
512³ at 8 GPUs: 15127/2453 = 6.17× → 77% parallel efficiency
27-point stencil — speedup relative to 1-GPU sync baseline:
| Grid | 1 GPU | 2 GPUs | 4 GPUs | 8 GPUs |
|---|---|---|---|---|
| 128³ | 1.00× | 1.75× | 2.44× | 2.66× |
| 256³ | 1.00× | 1.93× | 3.58× | 6.47× |
| 512³ | 1.00× | 1.98× | 3.79× | 7.08× |
512³ at 8 GPUs: 22016/3110 = 7.08× → 88% parallel efficiency