On the power of in-network caching in the Hadoop distributed file system

Eric Newberry, Beichuan Zhang

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

3 Scopus citations

Abstract

The Hadoop Distributed File System (HDFS) is a network file system used to support multiple widely used big data frameworks that can scale to run on large clusters. In this paper, we evaluate the effectiveness of in-network caching on switches in HDFS-supported clusters as a way to reduce per-link bandwidth usage in the network. We found that some applications featured large amounts of data requested by multiple clients and that, by caching read data in the network, the average per-link bandwidth usage of read operations in these applications could be reduced by more than half. We also found that the choice of cache replacement policy could have a significant impact on caching effectiveness in this environment, with LIRS and ARC generally performing the best for larger and smaller cache sizes, respectively. Moreover, given the structure of HDFS write operations, we developed a mechanism to reduce the total per-link bandwidth usage of HDFS write operations by replacing write pipelining with multicast. To evaluate the potential of in-network caching, we developed a simulator that replays real traces through a fat tree network modeling the caching design of the Named Data Networking (NDN) information-centric networking (ICN) architecture. Our results suggest that ICN-style in-network caching can provide significant benefits to HDFS-supported big data clusters, justifying future work to apply ICN architectures to cluster environments.
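
The trace-replay idea in the abstract can be illustrated with a minimal sketch. This is not the authors' simulator: it assumes a single switch with one upstream link, a hypothetical trace format of (block ID, size) read requests, and plain LRU as a simple stand-in for the LIRS and ARC policies the paper evaluates. All names below (`LRUCache`, `upstream_bytes`, the `blk_*` IDs) are illustrative.

```python
from collections import OrderedDict


class LRUCache:
    """Byte-capacity LRU cache keyed by block ID (a stand-in for LIRS/ARC)."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.blocks = OrderedDict()  # block_id -> size in bytes

    def lookup(self, block_id):
        """Return True on a hit and mark the block as most recently used."""
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)
            return True
        return False

    def insert(self, block_id, size):
        """Admit a block, evicting least recently used blocks to make room."""
        if size > self.capacity or block_id in self.blocks:
            return
        while self.used + size > self.capacity:
            _, evicted_size = self.blocks.popitem(last=False)
            self.used -= evicted_size
        self.blocks[block_id] = size
        self.used += size


def upstream_bytes(trace, cache=None):
    """Replay read requests and count the bytes that cross the upstream link."""
    total = 0
    for block_id, size in trace:
        if cache is not None and cache.lookup(block_id):
            continue  # served from the switch cache; nothing goes upstream
        total += size
        if cache is not None:
            cache.insert(block_id, size)
    return total


# Hypothetical trace: several clients behind one switch re-read the same blocks.
trace = [("blk_1", 128), ("blk_2", 128), ("blk_1", 128),
         ("blk_3", 128), ("blk_2", 128), ("blk_1", 128)]

baseline = upstream_bytes(trace)                              # no caching
cached = upstream_bytes(trace, LRUCache(capacity_bytes=256))  # room for 2 blocks
```

In this toy trace, a cache that holds only two blocks saves a single transfer, while one that holds all three distinct blocks halves the upstream traffic, which hints at why cache size and replacement policy both matter in the evaluation described above.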

Original language: English (US)
Title of host publication: ICN 2019 - Proceedings of the 2019 Conference on Information-Centric Networking
Publisher: Association for Computing Machinery, Inc
Pages: 89-99
Number of pages: 11
ISBN (Electronic): 9781450369701
State: Published - Sep 24 2019
Event: 6th ACM Conference on Information-Centric Networking, ICN 2019 - Macau, China
Duration: Sep 24 2019 to Sep 26 2019

Publication series

Name: ICN 2019 - Proceedings of the 2019 Conference on Information-Centric Networking

Conference

Conference: 6th ACM Conference on Information-Centric Networking, ICN 2019
Country: China
City: Macau
Period: 9/24/19 to 9/26/19

Keywords

  • Big data
  • Caching
  • HDFS
  • ICN
  • Information-centric networking
  • NDN
  • Named data networking
  • Spark

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Information Systems
