Methodology
This document describes how the performance results are measured: timing scope, statistical methodology, reproducibility conditions, compilation flags, and profiling tools.
For build and run instructions, see reproducing.md. For the full benchmark results, see results.md.
Hardware context. All headline results were measured on 8× NVIDIA A100-SXM4-80GB (NVLink NV12). Profiling for the roofline analysis was performed on an RTX 4060 Laptop GPU because of Nsight Compute (NCU) permission constraints on the shared A100 hosts. See profiling-2d.md and profiling-3d.md for the analyses themselves.
How results were measured:
| Parameter | Value |
|---|---|
| Runs per configuration | 10 (median reported) |
| Warmup runs | 3 (discarded) |
| Timing scope | Solver only (excludes I/O, matrix setup) |
| Convergence criterion | Relative residual < 1e-6 |
| Profiling tools | Nsight Systems (timeline), Nsight Compute (roofline) |
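
To make the convergence criterion in the table concrete, here is a minimal sketch of a relative-residual check. It assumes the common definition ‖b − Ax‖₂ / ‖b‖₂ with a nonzero right-hand side; the solver may instead normalize by the initial residual, so treat the formula as an assumption. The names `relative_residual` and `has_converged` are hypothetical and do not come from the project's code.

```python
import math

def relative_residual(A, x, b):
    """Return ||b - A x||_2 / ||b||_2 for a dense matrix A given as a list of rows.

    Assumption: the residual is normalized by the right-hand side b (b != 0).
    """
    # Residual vector r = b - A x, computed row by row.
    r = [bi - sum(aij * xj for aij, xj in zip(row, x)) for row, bi in zip(A, b)]
    norm_r = math.sqrt(sum(ri * ri for ri in r))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    return norm_r / norm_b

def has_converged(A, x, b, tol=1e-6):
    """Apply the 'relative residual < 1e-6' criterion from the table."""
    return relative_residual(A, x, b) < tol
```

For example, with `A = [[2.0, 0.0], [0.0, 4.0]]` and `b = [2.0, 4.0]`, the exact solution `x = [1.0, 1.0]` passes the check, while `x = [1.0, 1.1]` does not.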
Reproducibility conditions: Identical test matrices, GPU clocks at default (no boost lock), 3 warmup runs before measurement, separate process per configuration, same binary for all runs.
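The measurement protocol above (3 discarded warmup runs, 10 timed runs, median reported) can be sketched as a small harness. This is an illustration under stated assumptions, not the project's actual benchmark driver: `measure` and its `solve` callable are hypothetical names, and `solve` is assumed to cover only the solver phase, with I/O and matrix setup done beforehand.

```python
import statistics
import time

WARMUP_RUNS = 3     # discarded, per the table above
MEASURED_RUNS = 10  # the median of these is reported

def measure(solve, warmup=WARMUP_RUNS, runs=MEASURED_RUNS, clock=time.perf_counter):
    """Time `solve` and return (median_seconds, all_samples).

    Assumption: `solve` is a hypothetical zero-argument callable that runs
    only the solver and blocks until the work has finished.
    """
    # Warmup runs are executed but never timed.
    for _ in range(warmup):
        solve()

    samples = []
    for _ in range(runs):
        start = clock()
        solve()
        samples.append(clock() - start)

    # With an even number of samples, statistics.median averages the two middle values.
    return statistics.median(samples), samples
```

For GPU solvers, `solve` would need to synchronize the device before returning; otherwise the timer captures only the asynchronous kernel launches. The separate-process-per-configuration condition is not modeled here: a driver script would call this harness once per process.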
Compilation flags (release build):
Run benchmarks on your hardware:
# Quick test (512×512)
./scripts/run_all.sh --quick
# Full benchmark suite
./scripts/run_all.sh --size=1000
Results are saved to results/raw/ (TXT) and results/json/ (structured data).
Note: The showcase results (1.44× vs AmgX, multi-GPU scaling) were measured on 8× NVIDIA A100-SXM4-80GB with 10k-20k matrices. To reproduce those specific results, use --size=10000 (or larger) on equivalent hardware.