
Root Causing MPI Workloads Imbalance Issues via Scalable MPI Critical Path Analysis

EasyChair Preprint no. 8087

9 pages. Date: May 24, 2022

Abstract

Analyzing the performance of MPI applications usually requires non-trivial approaches. Classical hotspot-based analysis is often misleading for such applications, because optimizing a hotspot might not produce any speedup and may only increase the time ranks spend waiting for each other. One solution is to represent the MPI program as a graph (known as a Program Activity Graph) and analyze only the activities on the Critical Path of this graph (the longest path containing computation and communication, but not waiting). Reducing computing time on the Critical Path reduces the elapsed time of the whole application. While there are many papers in this area, support for Critical Path analysis in well-known performance tools is still quite limited. One reason is that real-life HPC applications running at large scale produce huge Program Activity Graphs, and classical graph algorithms scale too poorly to calculate the Critical Path reasonably fast. Another is that performance tools offer limited capabilities on top of the Critical Path because they use timing information only. This paper describes an algorithm for building the Program Activity Graph and calculating the Critical Path that naturally scales to the same number of CPU cores as the profiled MPI application uses. We also show how to combine Critical Path analysis with Performance Monitoring Unit (PMU) data to enable efficient root causing of MPI imbalance issues even at very high scale.
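
To make the Critical Path definition in the abstract concrete, here is a minimal serial sketch in Python. It is not the scalable algorithm the paper describes; it only illustrates the idea of taking the longest path through a small Program Activity Graph while ignoring waiting. The edge format, the node names, the two-rank example and the `critical_path` function are all hypothetical and chosen for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical two-rank trace: (source event, destination event, seconds, kind).
# Rank 0 computes for 5 s and sends; rank 1 computes for 2 s, then waits.
EDGES = [
    ("r0_start", "r0_send", 5.0, "compute"),
    ("r1_start", "r1_recv_begin", 2.0, "compute"),
    ("r1_recv_begin", "r1_recv_end", 4.0, "wait"),  # rank 1 idles for the message
    ("r0_send", "r1_recv_end", 1.0, "comm"),        # message transfer
    ("r0_send", "r0_end", 1.0, "compute"),
    ("r1_recv_end", "r1_end", 2.0, "compute"),
]


def critical_path(edges):
    """Return (length, path) of the longest path that skips wait edges."""
    preds = {}
    for src, dst, duration, kind in edges:
        preds.setdefault(src, [])
        preds.setdefault(dst, [])
        if kind != "wait":  # waiting never contributes to the Critical Path
            preds[dst].append((src, duration))

    # Visit every event after all of its predecessors.
    order = TopologicalSorter(
        {node: [p for p, _ in ps] for node, ps in preds.items()}
    ).static_order()

    dist, best_pred = {}, {}
    for node in order:
        dist[node], best_pred[node] = 0.0, None
        for p, duration in preds[node]:
            if dist[p] + duration > dist[node]:
                dist[node], best_pred[node] = dist[p] + duration, p

    # Walk back from the event that finishes last.
    node = max(dist, key=dist.get)
    length, path = dist[node], []
    while node is not None:
        path.append(node)
        node = best_pred[node]
    return length, path[::-1]


length, path = critical_path(EDGES)
print(length, path)
```

In this made-up trace the path runs through rank 0's computation, the message transfer, and rank 1's final computation. Shortening rank 1's initial 2 s of computation would only lengthen its wait, which is the situation the abstract warns about for hotspot-based analysis.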

Keyphrases: critical path, imbalance, MPI, PMU

BibTeX entry
BibTeX does not have the right entry for preprints. This is a hack for producing the correct reference:
@Booklet{EasyChair:8087,
  author = {Maksim Fatin and Artem Shatalin and Vitaly Slobodskoy},
  title = {Root Causing MPI Workloads Imbalance Issues via Scalable MPI Critical Path Analysis},
  howpublished = {EasyChair Preprint no. 8087},

  year = {EasyChair, 2022}}