Tackling permanent faults in the Network-on-Chip router pipeline

Pavan Poluri, Ahmed Louri

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

9 Citations (Scopus)

Abstract

The proliferation of multi-core and many-core chips for performance scaling is causing the Network-on-Chip (NoC) to occupy a growing amount of silicon area spanning several metal layers. The NoC is not immune to hard and transient faults, nor to the increase in hard faults caused by technology scaling. The ramifications are severe: a single fault in the NoC may paralyze the entire chip. To this end, we propose a Permanent Fault Tolerant Router (PFTR) that is capable of tolerating multiple permanent faults in the pipeline. PFTR is designed by making architectural modifications to the individual pipeline stages of the baseline NoC router. These modifications add minimal extra circuitry and exploit temporal parallelism to accomplish fault tolerance. Tolerance of multiple faults is achieved by striking a balance among three important design factors, namely area overhead, power overhead, and reliability. We use the Silicon Protection Factor (SPF) [13] as the metric to assess the reliability improvement of the proposed architecture. SPF takes into account both the number of faults required to cause failure and the area overhead of the additional circuitry. SPF calculation reveals that the proposed PFTR is 11 times more reliable than the baseline NoC router. Synthesis results using Cadence Encounter RTL Compiler at 45 nm technology show that the additional circuitry adds an area overhead of 31% and a power overhead of 30% with respect to the baseline NoC router. PFTR provides higher reliability at lower overhead than other fault-tolerant routers such as BulletProof [13], Vicis [14], and RoCo [15].
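
The abstract says SPF combines the number of faults needed to cause failure with the area overhead of the added circuitry, but it does not give the formula. The sketch below assumes one common formulation, in which the mean number of faults to failure is divided by the area normalized to the baseline (1 + area overhead). The function name, the formula, and the fault counts are illustrative assumptions and are not taken from the paper; only the 31% area overhead comes from the abstract.

```python
def silicon_protection_factor(mean_faults_to_failure: float,
                              area_overhead: float) -> float:
    """Assumed SPF formulation: faults to failure per unit of normalized area.

    mean_faults_to_failure: average number of permanent faults needed
        before the router stops working (hypothetical input).
    area_overhead: extra area as a fraction of the baseline (0.31 = 31%).
    """
    if mean_faults_to_failure < 0 or area_overhead < 0:
        raise ValueError("inputs must be non-negative")
    normalized_area = 1.0 + area_overhead  # baseline area counts as 1.0
    return mean_faults_to_failure / normalized_area


# Hypothetical illustration: an unprotected router with no added area
# that fails on its first fault.
baseline_spf = silicon_protection_factor(1.0, 0.0)

# Hypothetical fault count for a protected router, combined with the
# 31% area overhead reported in the abstract.
protected_spf = silicon_protection_factor(4.0, 0.31)

print(f"baseline SPF:  {baseline_spf:.2f}")
print(f"protected SPF: {protected_spf:.2f}")
```

Under this assumed formulation, extra circuitry raises SPF only if it increases the number of tolerated faults faster than it increases area, which is the trade-off between area overhead and reliability that the abstract describes.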

Original language: English (US)
Title of host publication: Proceedings - Symposium on Computer Architecture and High Performance Computing
Publisher: IEEE Computer Society
Pages: 49-56
Number of pages: 8
ISBN (Print): 9781479929276
DOI: https://doi.org/10.1109/SBAC-PAD.2013.32
State: Published - 2013
Event: 2013 25th International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD 2013 - Porto de Galinhas, PE, Brazil
Duration: Oct 23 2013 to Oct 26 2013

Fingerprint

Routers
Pipelines
Silicon
Network-on-chip
Fault tolerance
Metals

Keywords

  • Area
  • Latency
  • Network-on-Chip
  • Power
  • Reliability
  • Router architecture

ASJC Scopus subject areas

  • Hardware and Architecture
  • Software

Cite this

Poluri, P., & Louri, A. (2013). Tackling permanent faults in the Network-on-Chip router pipeline. In Proceedings - Symposium on Computer Architecture and High Performance Computing (pp. 49-56). [6702579] IEEE Computer Society. https://doi.org/10.1109/SBAC-PAD.2013.32

@inproceedings{42ff5b26fca74d618fda712de02aba43,
title = "Tackling permanent faults in the Network-on-Chip router pipeline",
keywords = "Area, Latency, Network-on-Chip, Power, Reliability, Router architecture",
author = "Pavan Poluri and Ahmed Louri",
year = "2013",
doi = "10.1109/SBAC-PAD.2013.32",
language = "English (US)",
isbn = "9781479929276",
pages = "49--56",
booktitle = "Proceedings - Symposium on Computer Architecture and High Performance Computing",
publisher = "IEEE Computer Society",

}

TY - GEN

T1 - Tackling permanent faults in the Network-on-Chip router pipeline

AU - Poluri, Pavan

AU - Louri, Ahmed

PY - 2013

Y1 - 2013

KW - Area

KW - Latency

KW - Network-on-Chip

KW - Power

KW - Reliability

KW - Router architecture

UR - http://www.scopus.com/inward/record.url?scp=84893542726&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84893542726&partnerID=8YFLogxK

U2 - 10.1109/SBAC-PAD.2013.32

DO - 10.1109/SBAC-PAD.2013.32

M3 - Conference contribution

SN - 9781479929276

SP - 49

EP - 56

BT - Proceedings - Symposium on Computer Architecture and High Performance Computing

PB - IEEE Computer Society

ER -