Exploiting Redundancy and Application Scalability for Cost-Effective, Time-Constrained Execution of HPC Applications on Amazon EC2

Aniruddha Marathe, Rachel Harris, David K Lowenthal, Bronis R. De Supinski, Barry Rountree, Martin Schulz

Research output: Contribution to journal (Article)

4 Citations (Scopus)

Abstract

The use of clouds to execute high-performance computing (HPC) applications has greatly increased recently. Clouds provide several potential advantages over traditional supercomputers and in-house clusters. The most popular cloud is currently Amazon EC2, which provides fixed-cost and variable-cost, auction-based options. The auction market trades lower cost for potential interruptions that necessitate checkpointing; if the market price exceeds the bid price, a node is taken away from the user without warning. We explore techniques to maximize performance per dollar given a time constraint within which an application must complete. Specifically, we design and implement multiple techniques to reduce expected cost by exploiting redundancy in the EC2 auction market. We then design an adaptive algorithm that selects a scheduling algorithm and determines the bid price. We show that our adaptive algorithm executes programs up to seven times cheaper than using the on-demand market and up to 44 percent cheaper than the best non-redundant, auction-market algorithm. We extend our adaptive algorithm to incorporate application scalability characteristics for further cost savings. We show that the adaptive algorithm informed with scalability characteristics of applications achieves up to 56 percent cost savings compared to the expected cost for the base adaptive algorithm run at a fixed, user-defined scale.
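The interruption rule described in the abstract (a node is lost once the market price exceeds the bid) and the idea of redundancy can be illustrated with a minimal sketch. This is not the authors' algorithm: the function names, the per-step price traces, and the billing model (each replica pays the market price for every step it runs) are assumptions made for illustration only, and checkpointing and restarts are ignored.

```python
from typing import Optional, Sequence


def first_interruption(prices: Sequence[float], bid: float) -> Optional[int]:
    """Return the first step at which the market price exceeds the bid, or None."""
    for step, price in enumerate(prices):
        if price > bid:
            return step
    return None


def replica_cost(prices: Sequence[float], bid: float, steps_needed: int) -> float:
    """Cost of one replica: market price for each step it runs before being
    interrupted or finishing (assumed billing model, not taken from the paper)."""
    window = prices[:steps_needed]
    cut = first_interruption(window, bid)
    ran = window if cut is None else window[:cut]
    return sum(ran)


def job_completes(traces: Sequence[Sequence[float]], bid: float, steps_needed: int) -> bool:
    """With redundant replicas in separate markets, the job finishes on time
    if at least one replica runs all required steps without interruption."""
    return any(
        len(prices) >= steps_needed
        and first_interruption(prices[:steps_needed], bid) is None
        for prices in traces
    )


if __name__ == "__main__":
    # Hypothetical price traces for two independent markets.
    market_a = [0.10, 0.12, 0.30, 0.11]
    market_b = [0.11, 0.11, 0.12, 0.13]
    bid, steps = 0.20, 4
    print("completes:", job_completes([market_a, market_b], bid, steps))
    print("total cost:", sum(replica_cost(t, bid, steps) for t in (market_a, market_b)))
```

In this toy setting the replica in the first market is interrupted at its third step, while the second replica finishes, so redundancy trades extra spend for a higher chance of meeting the time constraint.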

Original language: English (US)
Article number: 7355374
Pages (from-to): 2574-2588
Number of pages: 15
Journal: IEEE Transactions on Parallel and Distributed Systems
Volume: 27
Issue number: 9
DOI: 10.1109/TPDS.2015.2508457
State: Published - Sep 1 2016

Fingerprint

Redundancy
Scalability
Adaptive algorithms
Costs
Supercomputers
Scheduling algorithms

Keywords

  • cloud computing
  • cost optimization
  • Fault tolerance
  • reliability
  • resource provisioning

ASJC Scopus subject areas

  • Signal Processing
  • Hardware and Architecture
  • Computational Theory and Mathematics

Cite this

Exploiting Redundancy and Application Scalability for Cost-Effective, Time-Constrained Execution of HPC Applications on Amazon EC2. / Marathe, Aniruddha; Harris, Rachel; Lowenthal, David K; De Supinski, Bronis R.; Rountree, Barry; Schulz, Martin.

In: IEEE Transactions on Parallel and Distributed Systems, Vol. 27, No. 9, 7355374, 01.09.2016, p. 2574-2588.

Marathe, Aniruddha ; Harris, Rachel ; Lowenthal, David K ; De Supinski, Bronis R. ; Rountree, Barry ; Schulz, Martin. / Exploiting Redundancy and Application Scalability for Cost-Effective, Time-Constrained Execution of HPC Applications on Amazon EC2. In: IEEE Transactions on Parallel and Distributed Systems. 2016 ; Vol. 27, No. 9. pp. 2574-2588.
@article{c0f13777fb6046d3addf63757bbabd71,
title = "Exploiting Redundancy and Application Scalability for Cost-Effective, Time-Constrained Execution of HPC Applications on Amazon EC2",
abstract = "The use of clouds to execute high-performance computing (HPC) applications has greatly increased recently. Clouds provide several potential advantages over traditional supercomputers and in-house clusters. The most popular cloud is currently Amazon EC2, which provides fixed-cost and variable-cost, auction-based options. The auction market trades lower cost for potential interruptions that necessitate checkpointing; if the market price exceeds the bid price, a node is taken away from the user without warning. We explore techniques to maximize performance per dollar given a time constraint within which an application must complete. Specifically, we design and implement multiple techniques to reduce expected cost by exploiting redundancy in the EC2 auction market. We then design an adaptive algorithm that selects a scheduling algorithm and determines the bid price. We show that our adaptive algorithm executes programs up to seven times cheaper than using the on-demand market and up to 44 percent cheaper than the best non-redundant, auction-market algorithm. We extend our adaptive algorithm to incorporate application scalability characteristics for further cost savings. We show that the adaptive algorithm informed with scalability characteristics of applications achieves up to 56 percent cost savings compared to the expected cost for the base adaptive algorithm run at a fixed, user-defined scale.",
keywords = "cloud computing, cost optimization, Fault tolerance, reliability, resource provisioning",
author = "Aniruddha Marathe and Rachel Harris and Lowenthal, {David K} and {De Supinski}, {Bronis R.} and Barry Rountree and Martin Schulz",
year = "2016",
month = "9",
day = "1",
doi = "10.1109/TPDS.2015.2508457",
language = "English (US)",
volume = "27",
pages = "2574--2588",
journal = "IEEE Transactions on Parallel and Distributed Systems",
issn = "1045-9219",
publisher = "IEEE Computer Society",
number = "9",

}

TY - JOUR

T1 - Exploiting Redundancy and Application Scalability for Cost-Effective, Time-Constrained Execution of HPC Applications on Amazon EC2

AU - Marathe, Aniruddha

AU - Harris, Rachel

AU - Lowenthal, David K

AU - De Supinski, Bronis R.

AU - Rountree, Barry

AU - Schulz, Martin

PY - 2016/9/1

Y1 - 2016/9/1

AB - The use of clouds to execute high-performance computing (HPC) applications has greatly increased recently. Clouds provide several potential advantages over traditional supercomputers and in-house clusters. The most popular cloud is currently Amazon EC2, which provides fixed-cost and variable-cost, auction-based options. The auction market trades lower cost for potential interruptions that necessitate checkpointing; if the market price exceeds the bid price, a node is taken away from the user without warning. We explore techniques to maximize performance per dollar given a time constraint within which an application must complete. Specifically, we design and implement multiple techniques to reduce expected cost by exploiting redundancy in the EC2 auction market. We then design an adaptive algorithm that selects a scheduling algorithm and determines the bid price. We show that our adaptive algorithm executes programs up to seven times cheaper than using the on-demand market and up to 44 percent cheaper than the best non-redundant, auction-market algorithm. We extend our adaptive algorithm to incorporate application scalability characteristics for further cost savings. We show that the adaptive algorithm informed with scalability characteristics of applications achieves up to 56 percent cost savings compared to the expected cost for the base adaptive algorithm run at a fixed, user-defined scale.

KW - cloud computing

KW - cost optimization

KW - Fault tolerance

KW - reliability

KW - resource provisioning

UR - http://www.scopus.com/inward/record.url?scp=84982108499&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84982108499&partnerID=8YFLogxK

U2 - 10.1109/TPDS.2015.2508457

DO - 10.1109/TPDS.2015.2508457

M3 - Article

VL - 27

SP - 2574

EP - 2588

JO - IEEE Transactions on Parallel and Distributed Systems

JF - IEEE Transactions on Parallel and Distributed Systems

SN - 1045-9219

IS - 9

M1 - 7355374

ER -