Architecture-aware coding for distributed storage

Repairable block failure resilient codes

Gokhan Calis, Onur Ozan Koyluoglu

Research output: Contribution to journalArticle

Abstract

In large scale distributed storage systems (DSS) deployed in cloud computing, correlated failures resulting in simultaneous failure (or, unavailability) of blocks of nodes are common. In such scenarios, the stored data or a content of a failed node can only be reconstructed from the available live nodes belonging to the available blocks. To analyze the resilience of the system against such block failures, this work introduces the framework of Block Failure Resilient (BFR) codes, wherein the data (e.g., a file in DSS) can be decoded by reading out from a same number of codeword symbols (nodes) from a subset of available blocks of the underlying codeword. Further, repairable BFR codes are introduced, wherein any codeword symbol in a failed block can be repaired by contacting a subset of remaining blocks in the system. File size bounds for repairable BFR codes are derived, and the trade-off between per node storage and repair bandwidth is analyzed, and the corresponding minimum storage regenerating (BFR-MSR) and minimum bandwidth regenerating (BFR-MBR) points are derived. Explicit codes achieving the two operating points for a special case of parameters are constructed, wherein the underlying regenerating codewords are distributed to BFR codeword symbols according to combinatorial designs. Finally, BFR locally repairable codes (BFR-LRC) are introduced, an upper bound on the resilience is derived and optimal code construction are provided by a concatenation of Gabidulin and MDS codes. Repair efficiency of BFR-LRC is further studied via the use of BFR-MSR/MBR codes as local codes. Code constructions achieving optimal resilience for BFR-MSR/MBR-LRCs are provided for certain parameter regimes. Overall, this work introduces the framework of block failures along with optimal code constructions, and the study of architecture-aware coding for distributed storage systems.

Original languageEnglish (US)
Pages (from-to)465-503
Number of pages39
JournalAdvances in Mathematics of Communications
Volume12
Issue number3
DOIs
StatePublished - Jan 1 2018
Externally publishedYes

Fingerprint

Coding
Repair
Bandwidth
Cloud computing
Resilience
Storage System
Distributed Systems
Optimal Codes
Vertex of a graph
Architecture
MDS Codes
Combinatorial Design
Subset
Concatenation
Cloud Computing
Trade-offs
Upper bound
Scenarios

Keywords

  • Distributed storage and regenerating codes
  • Information theory

ASJC Scopus subject areas

  • Algebra and Number Theory
  • Computer Networks and Communications
  • Discrete Mathematics and Combinatorics
  • Applied Mathematics

Cite this

Architecture-aware coding for distributed storage : Repairable block failure resilient codes. / Calis, Gokhan; Koyluoglu, Onur Ozan.

In: Advances in Mathematics of Communications, Vol. 12, No. 3, 01.01.2018, p. 465-503.

Research output: Contribution to journalArticle

@article{c3dd17cbcccf4a879352144bbdc70303,
title = "Architecture-aware coding for distributed storage: Repairable block failure resilient codes",
abstract = "In large scale distributed storage systems (DSS) deployed in cloud computing, correlated failures resulting in simultaneous failure (or, unavailability) of blocks of nodes are common. In such scenarios, the stored data or a content of a failed node can only be reconstructed from the available live nodes belonging to the available blocks. To analyze the resilience of the system against such block failures, this work introduces the framework of Block Failure Resilient (BFR) codes, wherein the data (e.g., a file in DSS) can be decoded by reading out from a same number of codeword symbols (nodes) from a subset of available blocks of the underlying codeword. Further, repairable BFR codes are introduced, wherein any codeword symbol in a failed block can be repaired by contacting a subset of remaining blocks in the system. File size bounds for repairable BFR codes are derived, and the trade-off between per node storage and repair bandwidth is analyzed, and the corresponding minimum storage regenerating (BFR-MSR) and minimum bandwidth regenerating (BFR-MBR) points are derived. Explicit codes achieving the two operating points for a special case of parameters are constructed, wherein the underlying regenerating codewords are distributed to BFR codeword symbols according to combinatorial designs. Finally, BFR locally repairable codes (BFR-LRC) are introduced, an upper bound on the resilience is derived and optimal code construction are provided by a concatenation of Gabidulin and MDS codes. Repair efficiency of BFR-LRC is further studied via the use of BFR-MSR/MBR codes as local codes. Code constructions achieving optimal resilience for BFR-MSR/MBR-LRCs are provided for certain parameter regimes. Overall, this work introduces the framework of block failures along with optimal code constructions, and the study of architecture-aware coding for distributed storage systems.",
keywords = "Distributed storage and regenerating codes, Information theory",
author = "Gokhan Calis and Koyluoglu, {Onur Ozan}",
year = "2018",
month = "1",
day = "1",
doi = "10.3934/amc.2018028",
language = "English (US)",
volume = "12",
pages = "465--503",
journal = "Advances in Mathematics of Communications",
issn = "1930-5346",
publisher = "American Institute of Mathematical Sciences",
number = "3",

}

TY - JOUR

T1 - Architecture-aware coding for distributed storage

T2 - Repairable block failure resilient codes

AU - Calis, Gokhan

AU - Koyluoglu, Onur Ozan

PY - 2018/1/1

Y1 - 2018/1/1

N2 - In large scale distributed storage systems (DSS) deployed in cloud computing, correlated failures resulting in simultaneous failure (or, unavailability) of blocks of nodes are common. In such scenarios, the stored data or a content of a failed node can only be reconstructed from the available live nodes belonging to the available blocks. To analyze the resilience of the system against such block failures, this work introduces the framework of Block Failure Resilient (BFR) codes, wherein the data (e.g., a file in DSS) can be decoded by reading out from a same number of codeword symbols (nodes) from a subset of available blocks of the underlying codeword. Further, repairable BFR codes are introduced, wherein any codeword symbol in a failed block can be repaired by contacting a subset of remaining blocks in the system. File size bounds for repairable BFR codes are derived, and the trade-off between per node storage and repair bandwidth is analyzed, and the corresponding minimum storage regenerating (BFR-MSR) and minimum bandwidth regenerating (BFR-MBR) points are derived. Explicit codes achieving the two operating points for a special case of parameters are constructed, wherein the underlying regenerating codewords are distributed to BFR codeword symbols according to combinatorial designs. Finally, BFR locally repairable codes (BFR-LRC) are introduced, an upper bound on the resilience is derived and optimal code construction are provided by a concatenation of Gabidulin and MDS codes. Repair efficiency of BFR-LRC is further studied via the use of BFR-MSR/MBR codes as local codes. Code constructions achieving optimal resilience for BFR-MSR/MBR-LRCs are provided for certain parameter regimes. Overall, this work introduces the framework of block failures along with optimal code constructions, and the study of architecture-aware coding for distributed storage systems.

AB - In large scale distributed storage systems (DSS) deployed in cloud computing, correlated failures resulting in simultaneous failure (or, unavailability) of blocks of nodes are common. In such scenarios, the stored data or a content of a failed node can only be reconstructed from the available live nodes belonging to the available blocks. To analyze the resilience of the system against such block failures, this work introduces the framework of Block Failure Resilient (BFR) codes, wherein the data (e.g., a file in DSS) can be decoded by reading out from a same number of codeword symbols (nodes) from a subset of available blocks of the underlying codeword. Further, repairable BFR codes are introduced, wherein any codeword symbol in a failed block can be repaired by contacting a subset of remaining blocks in the system. File size bounds for repairable BFR codes are derived, and the trade-off between per node storage and repair bandwidth is analyzed, and the corresponding minimum storage regenerating (BFR-MSR) and minimum bandwidth regenerating (BFR-MBR) points are derived. Explicit codes achieving the two operating points for a special case of parameters are constructed, wherein the underlying regenerating codewords are distributed to BFR codeword symbols according to combinatorial designs. Finally, BFR locally repairable codes (BFR-LRC) are introduced, an upper bound on the resilience is derived and optimal code construction are provided by a concatenation of Gabidulin and MDS codes. Repair efficiency of BFR-LRC is further studied via the use of BFR-MSR/MBR codes as local codes. Code constructions achieving optimal resilience for BFR-MSR/MBR-LRCs are provided for certain parameter regimes. Overall, this work introduces the framework of block failures along with optimal code constructions, and the study of architecture-aware coding for distributed storage systems.

KW - Distributed storage and regenerating codes

KW - Information theory

UR - http://www.scopus.com/inward/record.url?scp=85052139536&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85052139536&partnerID=8YFLogxK

U2 - 10.3934/amc.2018028

DO - 10.3934/amc.2018028

M3 - Article

VL - 12

SP - 465

EP - 503

JO - Advances in Mathematics of Communications

JF - Advances in Mathematics of Communications

SN - 1930-5346

IS - 3

ER -