Abhishek A Patwardhan and Ramakrishna Upadrasta
We propose a compiler-driven acceleration of parallel computations on GPUs that exploits the special varieties of caches (texture, surface, and constant caches on NVIDIA GPUs). We show that, for a class of computations, our method obtains better performance than earlier methods that use on-chip shared memory. We provide an end-to-end solution by developing a fully automatic, sound, static framework within PPCG, a state-of-the-art source-to-source polyhedral compiler, to exploit these GPU caches. We use the polyhedral model for profitability modeling of each particular variety of GPU cache. We evaluate our implementation on PolyBench/C benchmark kernels and report up to 1.5x speedups over the current memory mapping strategy used by the PPCG compiler. We also consider representative real-world kernels: PageRank, a DNN layer (LSTM), and solvers (Poisson and DWE-FDTD stencil), and show that using the special GPU caches in these programs results in up to 2.6x speedup over a standard shared-memory-based implementation. We believe that our contribution is a step towards automatic exploitation of the GPU cache/memory hierarchy, as it demonstrates general-purpose computing usage of special GPU caches that were originally designed for image processing applications.