SDM: A scientific dataset delivery platform

Illyoung Choi, Jude Nelson, Larry Lee Peterson, John Hartman

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Scientific computing is becoming more data-centric and more collaborative, requiring increasingly large datasets to be transferred across the Internet. Transferring these datasets efficiently and making them accessible to scientific workflows is an increasingly difficult task. In addition, the data transfer time can be a significant portion of the overall workflow running time. This paper presents SDM (Syndicate Dataset Manager), a scientific dataset delivery platform. Unlike general-purpose data transfer tools, SDM offers on-demand access to remote scientific datasets. On-demand access doesn't require staging datasets to local file systems prior to computing on them, and it also enables overlapping computation and I/O. In addition, SDM offers a simple interface for users to locate and access datasets. To validate the usefulness of SDM, we performed realistic metagenomic sequence analysis workflows on remote genomic datasets. In general, SDM configured with a CDN outperforms existing data access methods. With warm CDN caches, SDM completes the workflow 17-20% faster than staging methods. Its performance is even comparable to local storage. SDM is only 9% slower than local HDD storage and 18% slower than local SSD storage. Together, its performance and its ease-of-use make SDM an attractive platform for performing scientific workflows on remote datasets.

Original language: English (US)
Title of host publication: Proceedings - IEEE 15th International Conference on eScience, eScience 2019
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 378-387
Number of pages: 10
ISBN (Electronic): 9781728124513
DOI: https://doi.org/10.1109/eScience.2019.00049
State: Published - Sep 2019
Event: 15th IEEE International Conference on eScience, eScience 2019 - San Diego, United States
Duration: Sep 24 2019 to Sep 27 2019

Publication series

Name: Proceedings - IEEE 15th International Conference on eScience, eScience 2019

Conference

Conference: 15th IEEE International Conference on eScience, eScience 2019
Country: United States
City: San Diego
Period: 9/24/19 to 9/27/19

Keywords

  • Cloud storage
  • Data delivery platform
  • Data transfer
  • Scientific computing
  • Wide-area network

ASJC Scopus subject areas

  • Computer Science Applications
  • Software
  • Ecological Modeling
  • Modeling and Simulation

Cite this

Choi, I., Nelson, J., Peterson, L. L., & Hartman, J. (2019). SDM: A scientific dataset delivery platform. In Proceedings - IEEE 15th International Conference on eScience, eScience 2019 (pp. 378-387). [9041779] (Proceedings - IEEE 15th International Conference on eScience, eScience 2019). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/eScience.2019.00049