Evaluating Distributed Computing Infrastructures: An Empirical Study Comparing Hadoop Deployments on Cloud and Local Systems

Devipsita Bhattacharya, Faiz Currim, Sudha Ram

Research output: Contribution to journal (Article)

1 Citation (Scopus)

Abstract

The popularity of distributed computing platforms (e.g., Hadoop) is largely due to their ability to address scalability issues that arise from the data storage and processing limitations of standard computing systems. However, the decision to dedicate organizational resources and capital to such systems requires careful consideration of several factors, including an evaluation of cloud-based distributed computing options. We propose a framework of metrics, which we used to conduct an in-depth performance and cost-benefit analysis of two standard Hadoop infrastructural choices, i.e., a Platform as a Service (PaaS) on-demand cloud setup and a local organizational setup. We evaluated the framework with an exploratory data analysis use case for a large-scale graph processing research problem. Our analysis considered highly granular aspects of distributed computing performance and studied how utilization rates and infrastructure amortization times affect break-even times. We identified that virtual memory management adversely affects the performance of a cloud cluster during the reduce phase, with the magnitude of degradation dependent on the type of MapReduce operation. Our study is intended not only as an evaluation of infrastructural choices but also as the development of a metric framework that can serve as a baseline for researchers examining distributed infrastructures.

Original language: English (US)
Journal: IEEE Transactions on Cloud Computing
DOI: 10.1109/TCC.2019.2902377
State: Accepted/In press - Jan 1 2019

Fingerprint

Distributed computer systems
Data storage equipment
Cost benefit analysis
Processing
Scalability
Degradation

Keywords

  • Cloud computing
  • Computer architecture
  • Computers and information processing
  • Cost benefit analysis
  • Data processing
  • Distributed computing
  • Measurement
  • Parallel processing
  • Performance evaluation
  • Platform-as-a-Service
  • Task analysis

ASJC Scopus subject areas

  • Software
  • Information Systems
  • Hardware and Architecture
  • Computer Science Applications
  • Computer Networks and Communications

Cite this

@article{a76512dccc4c4c17953cf433cd94dbee,
title = "Evaluating Distributed Computing Infrastructures: An Empirical Study Comparing Hadoop Deployments on Cloud and Local Systems",
abstract = "The popularity of distributed computing platforms (e.g., Hadoop) is largely due to their ability to address scalability issues that arise from the data storage and processing limitations of standard computing systems. However, the decision to dedicate organizational resources and capital to such systems requires careful consideration of several factors, including an evaluation of cloud-based distributed computing options. We propose a framework of metrics, which we used to conduct an in-depth performance and cost-benefit analysis of two standard Hadoop infrastructural choices, i.e., a Platform as a Service (PaaS) on-demand cloud setup and a local organizational setup. We evaluated the framework with an exploratory data analysis use case for a large-scale graph processing research problem. Our analysis considered highly granular aspects of distributed computing performance and studied how utilization rates and infrastructure amortization times affect break-even times. We identified that virtual memory management adversely affects the performance of a cloud cluster during the reduce phase, with the magnitude of degradation dependent on the type of MapReduce operation. Our study is intended not only as an evaluation of infrastructural choices but also as the development of a metric framework that can serve as a baseline for researchers examining distributed infrastructures.",
keywords = "Cloud computing, Computer architecture, Computers and information processing, Cost benefit analysis, Data processing, Distributed computing, Measurement, Parallel processing, Performance evaluation, Platform-as-a-Service, Task analysis",
author = "Devipsita Bhattacharya and Faiz Currim and Sudha Ram",
year = "2019",
month = "1",
day = "1",
doi = "10.1109/TCC.2019.2902377",
language = "English (US)",
journal = "IEEE Transactions on Cloud Computing",
issn = "2168-7161",
publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - JOUR

T1 - Evaluating Distributed Computing Infrastructures

T2 - An Empirical Study Comparing Hadoop Deployments on Cloud and Local Systems

AU - Bhattacharya, Devipsita

AU - Currim, Faiz

AU - Ram, Sudha

PY - 2019/1/1

Y1 - 2019/1/1

N2 - The popularity of distributed computing platforms (e.g., Hadoop) is largely due to their ability to address scalability issues that arise from the data storage and processing limitations of standard computing systems. However, the decision to dedicate organizational resources and capital to such systems requires careful consideration of several factors, including an evaluation of cloud-based distributed computing options. We propose a framework of metrics, which we used to conduct an in-depth performance and cost-benefit analysis of two standard Hadoop infrastructural choices, i.e., a Platform as a Service (PaaS) on-demand cloud setup and a local organizational setup. We evaluated the framework with an exploratory data analysis use case for a large-scale graph processing research problem. Our analysis considered highly granular aspects of distributed computing performance and studied how utilization rates and infrastructure amortization times affect break-even times. We identified that virtual memory management adversely affects the performance of a cloud cluster during the reduce phase, with the magnitude of degradation dependent on the type of MapReduce operation. Our study is intended not only as an evaluation of infrastructural choices but also as the development of a metric framework that can serve as a baseline for researchers examining distributed infrastructures.

AB - The popularity of distributed computing platforms (e.g., Hadoop) is largely due to their ability to address scalability issues that arise from the data storage and processing limitations of standard computing systems. However, the decision to dedicate organizational resources and capital to such systems requires careful consideration of several factors, including an evaluation of cloud-based distributed computing options. We propose a framework of metrics, which we used to conduct an in-depth performance and cost-benefit analysis of two standard Hadoop infrastructural choices, i.e., a Platform as a Service (PaaS) on-demand cloud setup and a local organizational setup. We evaluated the framework with an exploratory data analysis use case for a large-scale graph processing research problem. Our analysis considered highly granular aspects of distributed computing performance and studied how utilization rates and infrastructure amortization times affect break-even times. We identified that virtual memory management adversely affects the performance of a cloud cluster during the reduce phase, with the magnitude of degradation dependent on the type of MapReduce operation. Our study is intended not only as an evaluation of infrastructural choices but also as the development of a metric framework that can serve as a baseline for researchers examining distributed infrastructures.

KW - Cloud computing

KW - Computer architecture

KW - Computers and information processing

KW - Cost benefit analysis

KW - Data processing

KW - Distributed computing

KW - Measurement

KW - Parallel processing

KW - Performance evaluation

KW - Platform-as-a-Service

KW - Task analysis

UR - http://www.scopus.com/inward/record.url?scp=85062675578&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85062675578&partnerID=8YFLogxK

U2 - 10.1109/TCC.2019.2902377

DO - 10.1109/TCC.2019.2902377

M3 - Article

AN - SCOPUS:85062675578

JO - IEEE Transactions on Cloud Computing

JF - IEEE Transactions on Cloud Computing

SN - 2168-7161

ER -