Accurately selecting block size at runtime in pipelined parallel programs

Research output: Contribution to journalArticle

15 Citations (Scopus)

Abstract

Loops that contain cross-processor data dependencies, known as DOACROSS loops, are often found in scientific programs. Efficiently parallelizing such loops is important yet nontrivial. One useful parallelization technique for DOACROSS loops is pipelining, where each processor (node) performs its computation in blocks; after each, it sends data to the next node in the pipeline. The amount of computation before sending a message is called the block size; its choice, although difficult to make statically, is important for efficient execution. This paper describes a flexible runtime approach to choosing the block size. Rather than rely on static estimation of workload, our system takes measurements during the first two iterations of a program and then uses the results to build an execution model and choose an appropriate block size which, unlike a static choice, may be nonuniform. To increase accuracy of the chosen block size, our execution model takes intra- and inter-node performance into account. It is important to note that our system finds an effective block size automatically, without experimentation that is necessary when using a statically chosen block size. Performance on a network of workstations shows that programs that use our runtime analysis outperform those that use static block sizes by as much as 18% when the workload is unbalanced. When the workload is balanced, competitive performance is achieved as long as the initial overhead is sufficiently amortized.

Original languageEnglish (US)
Pages (from-to)245-274
Number of pages30
JournalInternational Journal of Parallel Programming
Volume28
Issue number3
DOIs
StatePublished - Jun 2000
Externally publishedYes

Fingerprint

Parallel Programs
Pipelines
Workload
Vertex of a graph
Runtime Analysis
Network of Workstations
Data Dependency
Pipelining
Parallelization
Experimentation
Choose
Iteration
Necessary
Model

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Theoretical Computer Science

Cite this

Accurately selecting block size at runtime in pipelined parallel programs. / Lowenthal, David K.

In: International Journal of Parallel Programming, Vol. 28, No. 3, 06.2000, p. 245-274.

Research output: Contribution to journalArticle

@article{41ec81385e1143988b734b6f8178a145,
title = "Accurately selecting block size at runtime in pipelined parallel programs",
abstract = "Loops that contain cross-processor data dependencies, known as DOACROSS loops, are often found in scientific programs. Efficiently parallelizing such loops is important yet nontrivial. One useful parallelization technique for DOACROSS loops is pipelining, where each processor (node) performs its computation in blocks; after each, it sends data to the next node in the pipeline. The amount of computation before sending a message is called the block size; its choice, although difficult to make statically, is important for efficient execution. This paper describes a flexible runtime approach to choosing the block size. Rather than rely on static estimation of workload, our system takes measurements during the first two iterations of a program and then uses the results to build an execution model and choose an appropriate block size which, unlike a static choice, may be nonuniform. To increase accuracy of the chosen block size, our execution model takes intra- and inter-node performance into account. It is important to note that our system finds an effective block size automatically, without experimentation that is necessary when using a statically chosen block size. Performance on a network of workstations shows that programs that use our runtime analysis outperform those that use static block sizes by as much as 18{\%} when the workload is unbalanced. When the workload is balanced, competitive performance is achieved as long as the initial overhead is sufficiently amortized.",
author = "Lowenthal, {David K}",
year = "2000",
month = "6",
doi = "10.1023/A:1007577115980",
language = "English (US)",
volume = "28",
pages = "245--274",
journal = "International Journal of Parallel Programming",
issn = "0885-7458",
publisher = "Springer New York",
number = "3",

}

TY - JOUR

T1 - Accurately selecting block size at runtime in pipelined parallel programs

AU - Lowenthal, David K

PY - 2000/6

Y1 - 2000/6

N2 - Loops that contain cross-processor data dependencies, known as DOACROSS loops, are often found in scientific programs. Efficiently parallelizing such loops is important yet nontrivial. One useful parallelization technique for DOACROSS loops is pipelining, where each processor (node) performs its computation in blocks; after each, it sends data to the next node in the pipeline. The amount of computation before sending a message is called the block size; its choice, although difficult to make statically, is important for efficient execution. This paper describes a flexible runtime approach to choosing the block size. Rather than rely on static estimation of workload, our system takes measurements during the first two iterations of a program and then uses the results to build an execution model and choose an appropriate block size which, unlike a static choice, may be nonuniform. To increase accuracy of the chosen block size, our execution model takes intra- and inter-node performance into account. It is important to note that our system finds an effective block size automatically, without experimentation that is necessary when using a statically chosen block size. Performance on a network of workstations shows that programs that use our runtime analysis outperform those that use static block sizes by as much as 18% when the workload is unbalanced. When the workload is balanced, competitive performance is achieved as long as the initial overhead is sufficiently amortized.

AB - Loops that contain cross-processor data dependencies, known as DOACROSS loops, are often found in scientific programs. Efficiently parallelizing such loops is important yet nontrivial. One useful parallelization technique for DOACROSS loops is pipelining, where each processor (node) performs its computation in blocks; after each, it sends data to the next node in the pipeline. The amount of computation before sending a message is called the block size; its choice, although difficult to make statically, is important for efficient execution. This paper describes a flexible runtime approach to choosing the block size. Rather than rely on static estimation of workload, our system takes measurements during the first two iterations of a program and then uses the results to build an execution model and choose an appropriate block size which, unlike a static choice, may be nonuniform. To increase accuracy of the chosen block size, our execution model takes intra- and inter-node performance into account. It is important to note that our system finds an effective block size automatically, without experimentation that is necessary when using a statically chosen block size. Performance on a network of workstations shows that programs that use our runtime analysis outperform those that use static block sizes by as much as 18% when the workload is unbalanced. When the workload is balanced, competitive performance is achieved as long as the initial overhead is sufficiently amortized.

UR - http://www.scopus.com/inward/record.url?scp=0034207513&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0034207513&partnerID=8YFLogxK

U2 - 10.1023/A:1007577115980

DO - 10.1023/A:1007577115980

M3 - Article

AN - SCOPUS:0034207513

VL - 28

SP - 245

EP - 274

JO - International Journal of Parallel Programming

JF - International Journal of Parallel Programming

SN - 0885-7458

IS - 3

ER -