Abstract
We develop a microbenchmark-based performance model for NVIDIA GeForce 200-series GPUs.
Our model identifies GPU program bottlenecks and quantitatively analyzes performance, and thus allows
programmers and architects to predict the benefits of potential program optimizations and architectural
improvements. In particular, we use a microbenchmark-based approach to develop a throughput model
for three major components of GPU execution time: the instruction pipeline, shared memory access,
and global memory access. Because our model is based on the GPU’s native instruction set, we can
predict performance with a 5–15% error. To demonstrate the usefulness of the model, we analyze
three representative, real-world, and already highly optimized programs: dense matrix multiply, tridiagonal
systems solver, and sparse matrix-vector multiply. The model provides a detailed quantitative analysis
of performance, allowing us to understand the configuration of the fastest dense matrix multiply implementation
and to optimize the tridiagonal solver and sparse matrix-vector multiply by 60% and 18%,
respectively. Furthermore, applying our model to these codes allows us to suggest architectural
improvements in hardware resource allocation, bank conflict avoidance, block scheduling, and memory
transaction granularity.