Parallelizing heavyweight debugging tools with mpiecho

Barry Rountree, Todd Gamblin, Bronis R. De Supinski, Martin Schulz, David K Lowenthal, Guy Cobb, Henry Tufo

Research output: Contribution to journal (Article)

1 Citation (Scopus)

Abstract

Idioms created for debugging execution on single processors and multicore systems have been successfully scaled to thousands of processors, but there is little hope that this class of techniques can continue to scale out to tens of millions of cores. To allow development of more scalable debugging idioms, we introduce mpiecho, a novel runtime platform that enables cloning of MPI ranks. Given identical execution on each clone, we then show how heavyweight debugging approaches can be parallelized, reducing their overhead to a fraction of the serialized case. We also show how this platform can be useful in isolating the source of hardware-based nondeterministic behavior and provide a case study based on a recent processor bug at LLNL. While total overhead will depend on the individual tool, we show that the platform itself contributes little: 512x tool parallelization incurs at worst 2x overhead across the NAS Parallel Benchmarks, and hardware fault isolation contributes at worst an additional 44% overhead. Finally, we show how mpiecho can lead to near-linear reduction in overhead when combined with maid, a heavyweight memory tracking tool provided with Intel's Pin platform. We demonstrate overhead reduction from 1466% to 53% and from 740% to 14% for cg (class D, 64 processes) and lu (class D, 64 processes), respectively, using only an additional 64 cores.
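To put the reported percentages in perspective, the following is a minimal sketch of the arithmetic, not code from the paper. It assumes that "overhead" means extra runtime relative to an uninstrumented baseline, so an overhead of p% corresponds to a runtime of (1 + p/100) times the baseline. The function and variable names are hypothetical.

```python
# Sketch only: assumes overhead is measured relative to an uninstrumented
# baseline run, i.e. runtime = baseline * (1 + overhead_pct / 100).

def overhead_summary(before_pct, after_pct):
    """Compare tool overhead before and after parallelization."""
    runtime_before = 1.0 + before_pct / 100.0  # in units of baseline runtime
    runtime_after = 1.0 + after_pct / 100.0
    return {
        # How many times smaller the overhead itself becomes.
        "overhead_ratio": before_pct / after_pct,
        # How many times faster the instrumented run completes.
        "runtime_speedup": runtime_before / runtime_after,
    }

# Overhead percentages quoted in the abstract (class D, 64 processes).
RESULTS = {"cg": (1466, 53), "lu": (740, 14)}

summaries = {name: overhead_summary(*pair) for name, pair in RESULTS.items()}

for name, s in summaries.items():
    print(
        f"{name}: overhead shrinks {s['overhead_ratio']:.1f}x, "
        f"instrumented run is {s['runtime_speedup']:.1f}x faster"
    )
```

Under that assumption, the overhead itself shrinks by roughly 27.7x for cg and 52.9x for lu, while the instrumented runs finish about 10.2x and 7.4x faster, respectively.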

Original language: English (US)
Pages (from-to): 156-166
Number of pages: 11
Journal: Parallel Computing
Volume: 39
Issue number: 3
DOI: https://doi.org/10.1016/j.parco.2012.11.002
State: Published - 2013

Fingerprint

Debugging
Hardware
Fault Isolation
Cloning
Clone
Parallelization
Continue
Benchmark
Data storage equipment
Demonstrate
Class

Keywords

  • Dynamic binary instrumentation
  • Heavyweight tools
  • MPI

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Software
  • Hardware and Architecture
  • Artificial Intelligence
  • Computer Graphics and Computer-Aided Design
  • Theoretical Computer Science

Cite this

Rountree, B., Gamblin, T., De Supinski, B. R., Schulz, M., Lowenthal, D. K., Cobb, G., & Tufo, H. (2013). Parallelizing heavyweight debugging tools with mpiecho. Parallel Computing, 39(3), 156-166. https://doi.org/10.1016/j.parco.2012.11.002

Parallelizing heavyweight debugging tools with mpiecho. / Rountree, Barry; Gamblin, Todd; De Supinski, Bronis R.; Schulz, Martin; Lowenthal, David K; Cobb, Guy; Tufo, Henry.

In: Parallel Computing, Vol. 39, No. 3, 2013, p. 156-166.

Rountree, B, Gamblin, T, De Supinski, BR, Schulz, M, Lowenthal, DK, Cobb, G & Tufo, H 2013, 'Parallelizing heavyweight debugging tools with mpiecho', Parallel Computing, vol. 39, no. 3, pp. 156-166. https://doi.org/10.1016/j.parco.2012.11.002

Rountree B, Gamblin T, De Supinski BR, Schulz M, Lowenthal DK, Cobb G et al. Parallelizing heavyweight debugging tools with mpiecho. Parallel Computing. 2013;39(3):156-166. https://doi.org/10.1016/j.parco.2012.11.002

Rountree, Barry ; Gamblin, Todd ; De Supinski, Bronis R. ; Schulz, Martin ; Lowenthal, David K ; Cobb, Guy ; Tufo, Henry. / Parallelizing heavyweight debugging tools with mpiecho. In: Parallel Computing. 2013 ; Vol. 39, No. 3. pp. 156-166.

@article{ae8c90399e7e4ab2bcdbcf7c3d8d9a74,
title = "Parallelizing heavyweight debugging tools with mpiecho",
keywords = "Dynamic binary instrumentation, Heavyweight tools, MPI",
author = "Barry Rountree and Todd Gamblin and {De Supinski}, {Bronis R.} and Martin Schulz and Lowenthal, {David K} and Guy Cobb and Henry Tufo",
year = "2013",
doi = "10.1016/j.parco.2012.11.002",
language = "English (US)",
volume = "39",
pages = "156--166",
journal = "Parallel Computing",
issn = "0167-8191",
publisher = "Elsevier",
number = "3",

}

TY - JOUR

T1 - Parallelizing heavyweight debugging tools with mpiecho

AU - Rountree, Barry

AU - Gamblin, Todd

AU - De Supinski, Bronis R.

AU - Schulz, Martin

AU - Lowenthal, David K

AU - Cobb, Guy

AU - Tufo, Henry

PY - 2013

Y1 - 2013

KW - Dynamic binary instrumentation

KW - Heavyweight tools

KW - MPI

UR - http://www.scopus.com/inward/record.url?scp=84875935688&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84875935688&partnerID=8YFLogxK

U2 - 10.1016/j.parco.2012.11.002

DO - 10.1016/j.parco.2012.11.002

M3 - Article

AN - SCOPUS:84875935688

VL - 39

SP - 156

EP - 166

JO - Parallel Computing

JF - Parallel Computing

SN - 0167-8191

IS - 3

ER -