TY - JOUR
T1 - Hardware-Level Thread Migration to Reduce On-Chip Data Movement Via Reinforcement Learning
AU - Fettes, Quintin
AU - Karanth, Avinash
AU - Bunescu, Razvan
AU - Louri, Ahmed
AU - Shiflett, Kyle
N1 - Funding Information:
Manuscript received April 17, 2020; revised June 12, 2020; accepted July 6, 2020. Date of publication October 2, 2020; date of current version October 27, 2020. This work was supported in part by NSF under Award CCF-1513606, Award CCF-1513923, Award CCF-1702980, Award CCF-1703013, Award CCF-1812495, Award CCF-1901192, and Award CCF-1936794. This article was presented in the International Conference on Hardware/Software Codesign and System Synthesis 2020 and appears as part of the ESWEEK-TCAD special issue. (Corresponding author: Avinash Karanth.) Quintin Fettes, Avinash Karanth, Razvan Bunescu, and Kyle Shiflett are with EECS Department, Ohio University, Athens, OH 45701 USA.
Publisher Copyright:
© 1982-2012 IEEE.
PY - 2020/11
Y1 - 2020/11
AB - As the number of processing cores and associated threads in chip multiprocessors (CMPs) continues to scale out, on-chip memory access latency dominates application execution time due to increased data movement. Although tiled CMP architectures with distributed shared caches provide a scalable design, increased physical distance between requesting and responding cores has led to both increased on-chip memory access latency and excess energy consumption. Near data processing is a promising approach that can migrate threads closer to data; however, prior hand-engineered rules for fine-grained hardware-level thread migration are either too slow to react to changes in data access patterns, or unable to exploit the large variety of data access patterns. In this article, we propose to use reinforcement learning (RL) to learn relatively complex data access patterns to improve on hardware-level thread migration techniques. By utilizing the recent history of memory access locations as input, each thread learns to recognize the relationship between prior access patterns and future memory access locations. This leads to the unique ability of the proposed technique to make fewer, more effective migrations to intermediate cores that minimize the distance to multiple distinct memory access locations. By allowing a low-overhead RL agent to learn a policy from real interaction with parallel programming benchmarks in a parallel simulator, we show that a migration policy which recognizes more complex data access patterns can be learned. The proposed approach reduces on-chip data movement and energy consumption by an average of 41%, while reducing execution time by 43% when compared to a simple baseline with no thread migration; furthermore, energy consumption and execution time are reduced by an additional 10% when compared to a hand-engineered fine-grained migration policy.
KW - Chip multiprocessors (CMPs)
KW - data movement
KW - reinforcement learning (RL)
KW - thread migration
UR - http://www.scopus.com/inward/record.url?scp=85096035984&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85096035984&partnerID=8YFLogxK
U2 - 10.1109/TCAD.2020.3012650
DO - 10.1109/TCAD.2020.3012650
M3 - Article
AN - SCOPUS:85096035984
VL - 39
SP - 3638
EP - 3649
JO - IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
JF - IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
SN - 0278-0070
IS - 11
M1 - 9211404
ER -