Design and evaluation of a self-healing Kepler for scientific workflows

Arjun Hary, Ali Akoglu, Youssif AlNashif, Salim A Hariri, Darrel Jenerette

Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)

2 Citations (Scopus)

Abstract

Kepler is a popular open-source scientific workflow (SWF) system because it simplifies the construction of complex data flow models through a visual interface. As the workflow applications that run on heterogeneous distributed systems grow more complex, fault management becomes a critical design issue for large-scale scientific and engineering applications. Because these applications have long execution times, it is important that they be fault tolerant; i.e., the workflow application can recover gracefully from faults without restarting from the beginning. The current implementation of the Kepler tool does not support fault tolerance or recovery mechanisms. In this paper, we extend Kepler to support fault-tolerant scientific workflows (FT-SWF) with a checkpoint mechanism in which corrective measures are taken seamlessly and autonomically whenever a fault is detected. To the best of our knowledge, this is the first approach to adding autonomic operations to Kepler. We evaluated FT-Kepler on a distributed application used by ecosystem researchers, measuring workflow performance under hardware- and software-based fault scenarios in terms of execution time, recovery time, and checkpoint overhead. The experimental evaluations indicate that the checkpoint mechanism adds negligible overhead to the total execution time of the workflow and that, as the fault rate increases, the number of checkpoints should be increased.
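
The checkpoint-and-recover idea summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' FT-Kepler implementation and does not use any Kepler API; the function names (`run_workflow`, `run_with_recovery`), the `fail_at` fault-injection parameter, the JSON checkpoint file, and the toy ten-step workflow are all hypothetical, chosen only to show how resuming from the last checkpoint avoids restarting from the beginning.

```python
import json
import os
import tempfile


class InjectedFault(Exception):
    """Hypothetical stand-in for a hardware or software fault."""


def run_workflow(steps, state, ckpt_path, ckpt_every, fail_at=None):
    """Run `steps` in order, saving a checkpoint every `ckpt_every` steps.

    If a checkpoint file exists, execution resumes after the last saved step.
    `fail_at` (hypothetical) injects one fault just before that step index.
    Returns (final_state, number_of_steps_executed_in_this_call).
    """
    start = 0
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            saved = json.load(f)
        start, state = saved["next_step"], saved["state"]

    executed = 0
    for i in range(start, len(steps)):
        if fail_at is not None and i == fail_at:
            raise InjectedFault(f"fault before step {i}")
        state = steps[i](state)
        executed += 1
        if (i + 1) % ckpt_every == 0:
            # Persist progress so a later run can resume from here.
            with open(ckpt_path, "w") as f:
                json.dump({"next_step": i + 1, "state": state}, f)
    return state, executed


def run_with_recovery(steps, state, ckpt_every, fail_at):
    """Run once with an injected fault, then recover from the checkpoint."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "workflow.ckpt")
        try:
            return run_workflow(steps, state, path, ckpt_every, fail_at)
        except InjectedFault:
            # Recovery: rerun without the fault; resumes from the checkpoint.
            return run_workflow(steps, state, path, ckpt_every)


# Toy workflow (assumption): ten steps, each adding 1 to an integer state.
steps = [lambda x: x + 1] * 10
```

In this sketch the second value returned by `run_with_recovery` is the number of steps re-executed after the fault. Checkpointing more often shrinks that rework, which is consistent in spirit with the abstract's observation that a higher fault rate calls for more checkpoints.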

Original language: English (US)
Title of host publication: HPDC 2010 - Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
Pages: 340-343
Number of pages: 4
DOIs: https://doi.org/10.1145/1851476.1851525
State: Published - 2010
Event: 19th ACM International Symposium on High Performance Distributed Computing, HPDC 2010 - Chicago, IL, United States
Duration: Jun 21 2010 to Jun 25 2010

Other

Other: 19th ACM International Symposium on High Performance Distributed Computing, HPDC 2010
Country: United States
City: Chicago, IL
Period: 6/21/10 to 6/25/10

Fingerprint

Recovery
Fault tolerance
Ecosystems
Hardware

Keywords

  • Autonomic
  • Fault tolerant
  • Kepler
  • Scientific workflow

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Computer Science Applications
  • Software

Cite this

Hary, A., Akoglu, A., AlNashif, Y., Hariri, S. A., & Jenerette, D. (2010). Design and evaluation of a self-healing Kepler for scientific workflows. In HPDC 2010 - Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing (pp. 340-343) https://doi.org/10.1145/1851476.1851525

Design and evaluation of a self-healing Kepler for scientific workflows. / Hary, Arjun; Akoglu, Ali; AlNashif, Youssif; Hariri, Salim A; Jenerette, Darrel.

HPDC 2010 - Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing. 2010. p. 340-343.

Hary, A, Akoglu, A, AlNashif, Y, Hariri, SA & Jenerette, D 2010, Design and evaluation of a self-healing Kepler for scientific workflows. in HPDC 2010 - Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing. pp. 340-343, 19th ACM International Symposium on High Performance Distributed Computing, HPDC 2010, Chicago, IL, United States, 6/21/10. https://doi.org/10.1145/1851476.1851525
Hary A, Akoglu A, AlNashif Y, Hariri SA, Jenerette D. Design and evaluation of a self-healing Kepler for scientific workflows. In HPDC 2010 - Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing. 2010. p. 340-343 https://doi.org/10.1145/1851476.1851525
Hary, Arjun ; Akoglu, Ali ; AlNashif, Youssif ; Hariri, Salim A ; Jenerette, Darrel. / Design and evaluation of a self-healing Kepler for scientific workflows. HPDC 2010 - Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing. 2010. pp. 340-343
@inproceedings{2299643b6cee4e96877fbade792b94de,
title = "Design and evaluation of a self-healing Kepler for scientific workflows",
abstract = "Kepler is a popular open-source scientific workflow (SWF) system because it simplifies the construction of complex data flow models through a visual interface. As the workflow applications that run on heterogeneous distributed systems grow more complex, fault management becomes a critical design issue for large-scale scientific and engineering applications. Because these applications have long execution times, it is important that they be fault tolerant; i.e., the workflow application can recover gracefully from faults without restarting from the beginning. The current implementation of the Kepler tool does not support fault tolerance or recovery mechanisms. In this paper, we extend Kepler to support fault-tolerant scientific workflows (FT-SWF) with a checkpoint mechanism in which corrective measures are taken seamlessly and autonomically whenever a fault is detected. To the best of our knowledge, this is the first approach to adding autonomic operations to Kepler. We evaluated FT-Kepler on a distributed application used by ecosystem researchers, measuring workflow performance under hardware- and software-based fault scenarios in terms of execution time, recovery time, and checkpoint overhead. The experimental evaluations indicate that the checkpoint mechanism adds negligible overhead to the total execution time of the workflow and that, as the fault rate increases, the number of checkpoints should be increased.",
keywords = "Autonomic, Fault tolerant, Kepler, Scientific workflow",
author = "Arjun Hary and Ali Akoglu and Youssif AlNashif and Hariri, {Salim A} and Darrel Jenerette",
year = "2010",
doi = "10.1145/1851476.1851525",
language = "English (US)",
isbn = "9781605589428",
pages = "340--343",
booktitle = "HPDC 2010 - Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing",

}

TY - GEN

T1 - Design and evaluation of a self-healing Kepler for scientific workflows

AU - Hary, Arjun

AU - Akoglu, Ali

AU - AlNashif, Youssif

AU - Hariri, Salim A

AU - Jenerette, Darrel

PY - 2010

Y1 - 2010

AB - Kepler is a popular open-source scientific workflow (SWF) system because it simplifies the construction of complex data flow models through a visual interface. As the workflow applications that run on heterogeneous distributed systems grow more complex, fault management becomes a critical design issue for large-scale scientific and engineering applications. Because these applications have long execution times, it is important that they be fault tolerant; i.e., the workflow application can recover gracefully from faults without restarting from the beginning. The current implementation of the Kepler tool does not support fault tolerance or recovery mechanisms. In this paper, we extend Kepler to support fault-tolerant scientific workflows (FT-SWF) with a checkpoint mechanism in which corrective measures are taken seamlessly and autonomically whenever a fault is detected. To the best of our knowledge, this is the first approach to adding autonomic operations to Kepler. We evaluated FT-Kepler on a distributed application used by ecosystem researchers, measuring workflow performance under hardware- and software-based fault scenarios in terms of execution time, recovery time, and checkpoint overhead. The experimental evaluations indicate that the checkpoint mechanism adds negligible overhead to the total execution time of the workflow and that, as the fault rate increases, the number of checkpoints should be increased.

KW - Autonomic

KW - Fault tolerant

KW - Kepler

KW - Scientific workflow

UR - http://www.scopus.com/inward/record.url?scp=78649998209&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=78649998209&partnerID=8YFLogxK

U2 - 10.1145/1851476.1851525

DO - 10.1145/1851476.1851525

M3 - Conference contribution

AN - SCOPUS:78649998209

SN - 9781605589428

SP - 340

EP - 343

BT - HPDC 2010 - Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing

ER -