
Analyzing GPU Tensor Core Potential for Fast Reductions

EasyChair Preprint no. 565

5 pages. Date: October 7, 2018


The Nvidia GPU architecture has introduced new computing elements such as tensor cores, special processing units dedicated to performing fast matrix-multiply-accumulate (MMA) operations and accelerating Deep Learning applications. In this work we present the idea of using tensor cores for a different purpose, the parallel arithmetic reduction problem. We propose a new GPU tensor-core based algorithm and analyze its potential performance benefits in comparison to a traditional GPU-based one. The proposed method encodes the reduction of n numbers as a set of m × m MMA tensor-core operations (for Nvidia’s Volta architecture, m = 16) and takes advantage of the fact that each MMA operation takes just one GPU cycle. When the cost is analyzed under a simplified GPU computing model, the new algorithm reduces a problem of n numbers in T(n) = 5 log_{m^2}(n) steps, with a speedup of S = (4/5) log_2(m^2).
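To make the encoding concrete, the following is a minimal CPU-side sketch in plain Python, not the authors' implementation. It assumes one plausible way to express a sum as MMA operations: multiplying an m × m tile by all-ones matrices on the left and then on the right, so that every entry of the result holds the sum of the tile. The names `mma`, `tile_sum`, `reduce_with_mma`, and `predicted_speedup` are hypothetical helpers introduced for illustration, and the simulation does not model the constant factor 5 or GPU cycle counts.

```python
import math


def mma(a, b, c):
    """Simulate one MMA: return a*b + c for m x m matrices (lists of lists)."""
    m = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(m)) + c[i][j]
             for j in range(m)] for i in range(m)]


def tile_sum(tile):
    """Sum all entries of an m x m tile using two simulated MMA operations."""
    m = len(tile)
    ones = [[1] * m for _ in range(m)]
    zero = [[0] * m for _ in range(m)]
    col_sums = mma(ones, tile, zero)   # every row holds the column sums
    total = mma(col_sums, ones, zero)  # every entry holds the full tile sum
    return total[0][0]


def reduce_with_mma(values, m=16):
    """Reduce values level by level; each level shrinks the input by m*m.

    Returns (sum, number_of_levels). Assumption: groups are zero-padded
    to a full m x m tile, which leaves the sum unchanged.
    """
    values = list(values)
    if not values:
        return 0, 0
    group = m * m
    levels = 0
    while len(values) > 1:
        next_level = []
        for start in range(0, len(values), group):
            chunk = values[start:start + group]
            chunk += [0] * (group - len(chunk))
            tile = [chunk[r * m:(r + 1) * m] for r in range(m)]
            next_level.append(tile_sum(tile))
        values = next_level
        levels += 1
    return values[0], levels


def predicted_speedup(m):
    """Speedup from the abstract's formula S = (4/5) * log_2(m^2)."""
    return 4 / 5 * math.log2(m * m)


if __name__ == "__main__":
    total, levels = reduce_with_mma(range(1000), m=4)
    print(total, levels)
    print(predicted_speedup(16))
```

Each level divides the problem size by m^2, which is where the log_{m^2}(n) term in T(n) comes from. Evaluating the stated speedup formula for m = 16 gives S = (4/5) · 8 = 6.4 under the paper's simplified model.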

Keyphrases: GPU computing, matrix-multiply-accumulate, NVIDIA Tensor Cores, reduction

BibTeX entry
BibTeX does not have the right entry for preprints. This is a hack for producing the correct reference:
@booklet{EasyChair:565,
  author = {Roberto A. Carrasco Cavieres and Raimundo Vega and Cristóbal A. Navarro},
  title = {Analyzing GPU Tensor Core Potential for Fast Reductions},
  howpublished = {EasyChair Preprint no. 565},
  doi = {10.29007/zlmg},
  year = {EasyChair, 2018}}