High-performance, Energy-efficient, Fault-tolerant Network-on-Chip Design Using Reinforcement Learning

Ke Wang, Ahmed Louri, Avinash Karanth, Razvan Bunescu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

2 Citations (Scopus)

Abstract

Networks-on-chip (NoCs) are becoming the standard communication fabric for multi-core and system-on-chip (SoC) architectures. As technology continues to scale, transistors and wires on the chip are becoming increasingly vulnerable to various fault mechanisms, especially timing errors, which degrade the energy efficiency and performance of NoCs. Typical techniques for handling timing errors are reactive: they respond to faults after they occur. They rely on error detection/correction hardware that is constantly enabled, which results in excessive power consumption and degraded performance. On the other hand, indiscriminately disabling error-handling hardware can induce more errors and intrusive retransmission traffic. The challenge, therefore, is to balance the trade-offs among error rate, packet retransmission, performance, and energy. In this paper, we propose a proactive fault-tolerant mechanism that uses reinforcement learning (RL) to optimize energy efficiency and performance. First, we propose a new proactive error-handling technique comprising a dynamic scheme for enabling per-router error detection/correction hardware and an effective retransmission mechanism. Second, we propose the use of RL to train the dynamic control policy, with the goals of increased fault tolerance, reduced power consumption, and improved performance compared to conventional techniques. Our evaluation indicates that, on average, end-to-end packet latency is lowered by 55%, energy efficiency is improved by 64%, and retransmission caused by faults is reduced by 48% relative to reactive error correction techniques.
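The abstract does not specify the RL algorithm, state features, action set, or reward used in the paper. As a rough illustration of the general idea only (a learned per-router policy that decides when to enable error detection/correction hardware), the sketch below uses tabular Q-learning as a stand-in. Everything in it is an assumption made for illustration: the three hardware modes, the single boolean "error-prone" state, and the cost numbers in toy_reward are hypothetical and are not taken from the paper.

```python
import random
from collections import defaultdict

# Hypothetical action set. The abstract only says the error detection/correction
# hardware is enabled dynamically per router; these three modes are assumptions.
ACTIONS = ("off", "detect_only", "detect_and_correct")


class RouterAgent:
    """Tabular Q-learning agent for a single router (illustrative only)."""

    def __init__(self, alpha=0.1, gamma=0.5, epsilon=0.1, seed=0):
        self.alpha = alpha      # learning rate
        self.gamma = gamma      # discount factor
        self.epsilon = epsilon  # exploration probability
        self.rng = random.Random(seed)
        # Q-table: state -> one value per action, initialised to zero.
        self.q = defaultdict(lambda: [0.0] * len(ACTIONS))

    def choose(self, state):
        """Epsilon-greedy action selection; returns an index into ACTIONS."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(ACTIONS))
        values = self.q[state]
        return values.index(max(values))

    def update(self, state, action, reward, next_state):
        """Standard one-step Q-learning update."""
        td_target = reward + self.gamma * max(self.q[next_state])
        self.q[state][action] += self.alpha * (td_target - self.q[state][action])


def toy_reward(error_prone, action):
    """Made-up cost model: protection hardware costs energy, and weaker
    protection costs retransmissions when the router is error-prone."""
    energy_cost = (0.0, 0.3, 0.6)[action]
    retransmit_cost = (2.0, 1.0, 0.0)[action] if error_prone else 0.0
    return -(energy_cost + retransmit_cost)


def train(steps=20000, seed=0):
    """Train one agent in a toy environment where the router is error-prone
    with probability 0.5 at every step, independent of the action taken."""
    agent = RouterAgent(seed=seed)
    env_rng = random.Random(seed + 1)
    state = env_rng.random() < 0.5  # True means "error-prone"
    for _ in range(steps):
        action = agent.choose(state)
        reward = toy_reward(state, action)
        next_state = env_rng.random() < 0.5
        agent.update(state, action, reward, next_state)
        state = next_state
    return agent


def greedy_policy(agent):
    """Map each toy state to the name of its highest-valued action."""
    policy = {}
    for state in (False, True):
        values = agent.q[state]
        policy[state] = ACTIONS[values.index(max(values))]
    return policy


if __name__ == "__main__":
    print(greedy_policy(train()))
```

Under this toy cost model the learned policy should leave the hardware off when the router is not error-prone and enable full detection and correction when it is, which mirrors the trade-off the abstract describes between always-on protection and retransmission traffic. A realistic controller would need richer state and a reward tied to measured latency and energy, which this sketch does not attempt.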

Original language: English (US)
Title of host publication: Proceedings of the 2019 Design, Automation and Test in Europe Conference and Exhibition, DATE 2019
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1166-1171
Number of pages: 6
ISBN (Electronic): 9783981926323
DOI: https://doi.org/10.23919/DATE.2019.8714869
State: Published - May 14 2019
Externally published: Yes
Event: 22nd Design, Automation and Test in Europe Conference and Exhibition, DATE 2019 - Florence, Italy
Duration: Mar 25 2019 to Mar 29 2019

Fingerprint

Reinforcement
Fault-tolerant
Energy Efficient
High Performance
Error Detection
Energy Efficiency
Fault
Hardware
Reinforcement Learning
Power Consumption
Timing
Chip
Electric power utilization
Dynamic Control
Control Policy
Error Correction
Router

ASJC Scopus subject areas

  • Hardware and Architecture
  • Electrical and Electronic Engineering
  • Safety, Risk, Reliability and Quality
  • Control and Optimization

Cite this

Wang, K., Louri, A., Karanth, A., & Bunescu, R. (2019). High-performance, Energy-efficient, Fault-tolerant Network-on-Chip Design Using Reinforcement Learning. In Proceedings of the 2019 Design, Automation and Test in Europe Conference and Exhibition, DATE 2019 (pp. 1166-1171). [8714869] (Proceedings of the 2019 Design, Automation and Test in Europe Conference and Exhibition, DATE 2019). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.23919/DATE.2019.8714869

@inproceedings{5a6c02201dd64ea4a34b17edc1e5522d,
title = "High-performance, Energy-efficient, Fault-tolerant Network-on-Chip Design Using Reinforcement Learning",
author = "Ke Wang and Ahmed Louri and Avinash Karanth and Razvan Bunescu",
year = "2019",
month = "5",
day = "14",
doi = "10.23919/DATE.2019.8714869",
language = "English (US)",
series = "Proceedings of the 2019 Design, Automation and Test in Europe Conference and Exhibition, DATE 2019",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
pages = "1166--1171",
booktitle = "Proceedings of the 2019 Design, Automation and Test in Europe Conference and Exhibition, DATE 2019",

}

TY - GEN

T1 - High-performance, Energy-efficient, Fault-tolerant Network-on-Chip Design Using Reinforcement Learning

AU - Wang, Ke

AU - Louri, Ahmed

AU - Karanth, Avinash

AU - Bunescu, Razvan

PY - 2019/5/14

Y1 - 2019/5/14

UR - http://www.scopus.com/inward/record.url?scp=85066612816&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85066612816&partnerID=8YFLogxK

U2 - 10.23919/DATE.2019.8714869

DO - 10.23919/DATE.2019.8714869

M3 - Conference contribution

AN - SCOPUS:85066612816

T3 - Proceedings of the 2019 Design, Automation and Test in Europe Conference and Exhibition, DATE 2019

SP - 1166

EP - 1171

BT - Proceedings of the 2019 Design, Automation and Test in Europe Conference and Exhibition, DATE 2019

PB - Institute of Electrical and Electronics Engineers Inc.

ER -