TY - GEN
T1 - Mitigating inter-job interference using adaptive flow-aware routing
AU - Smith, Staci A.
AU - Cromey, Clara E.
AU - Lowenthal, David K.
AU - Domke, Jens
AU - Jain, Nikhil
AU - Thiagarajan, Jayaraman J.
AU - Bhatele, Abhinav
N1 - Funding Information:
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory (LLNL) under Contract DE-AC52-07NA27344 (LLNL-CONF-745538). We are also indebted to Dave Dannenberg, Trent D’Hooge, Don Frederick, Jim Silva, Greg Tomaschke, Py Watson, and many others in Livermore Computing at LLNL for their support of our DAT runs.
Publisher Copyright:
© 2018 IEEE.
PY - 2019/3/11
Y1 - 2019/3/11
AB - On most high-performance computing platforms, concurrently executing jobs share network resources. This sharing can lead to inter-job network interference, which can have a significant impact on the performance of communication-intensive applications. No satisfactory solutions yet exist for mitigating such performance degradation on systems that allow jobs to share the network for the sake of higher utilization. In this paper, we analyze network congestion caused by multi-job workloads on two production systems that use popular network topologies: fat-tree and dragonfly. For each system, we establish a regression model to relate network hotspots to application performance degradation. The models show that current routing strategies are ineffective at balancing network traffic and mitigating interference on production systems. We propose an alternative routing strategy, which we call adaptive flow-aware routing. We implement our strategy on a fat-tree system, and tests on the system show up to a 46% improvement in job run time when compared to the default routing.
KW - Adaptive routing
KW - Congestion
KW - Fat-tree topology
KW - High-speed networks
KW - Performance degradation
KW - Routing protocols
UR - http://www.scopus.com/inward/record.url?scp=85064140506&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85064140506&partnerID=8YFLogxK
U2 - 10.1109/SC.2018.00030
DO - 10.1109/SC.2018.00030
M3 - Conference contribution
AN - SCOPUS:85064140506
T3 - Proceedings - International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018
SP - 346
EP - 360
BT - Proceedings - International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2018 International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018
Y2 - 11 November 2018 through 16 November 2018
ER -