Mitigating inter-job interference using adaptive flow-aware routing

Staci A. Smith, Clara E. Cromey, David K. Lowenthal, Jens Domke, Nikhil Jain, Jayaraman J. Thiagarajan, Abhinav Bhatele

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

On most high-performance computing platforms, concurrently executing jobs share network resources. This sharing can lead to inter-job network interference, which can have a significant impact on the performance of communication-intensive applications. No satisfactory solutions yet exist for mitigating such performance degradation on systems that allow jobs to share the network for the sake of higher utilization. In this paper, we analyze network congestion caused by multi-job workloads on two production systems that use popular network topologies: fat-tree and dragonfly. For each system, we establish a regression model to relate network hotspots to application performance degradation. The models show that current routing strategies are ineffective at balancing network traffic and mitigating interference on production systems. We propose an alternative routing strategy, which we call adaptive flow-aware routing. We implement our strategy on a fat-tree system, and tests on the system show up to a 46% improvement in job run time when compared to the default routing.

Original language: English (US)
Title of host publication: Proceedings - International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 346-360
Number of pages: 15
ISBN (Electronic): 9781538683842
DOI: https://doi.org/10.1109/SC.2018.00030
State: Published - Mar 11 2019
Event: 2018 International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018 - Dallas, United States
Duration: Nov 11 2018 to Nov 16 2018

Publication series

Name: Proceedings - International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018

Conference

Conference: 2018 International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018
Country: United States
City: Dallas
Period: 11/11/18 to 11/16/18

Keywords

  • Adaptive routing
  • Congestion
  • Fat-tree topology
  • High-speed networks
  • Performance degradation
  • Routing protocols

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Computer Networks and Communications
  • Hardware and Architecture
  • Theoretical Computer Science

Cite this

Smith, S. A., Cromey, C. E., Lowenthal, D. K., Domke, J., Jain, N., Thiagarajan, J. J., & Bhatele, A. (2019). Mitigating inter-job interference using adaptive flow-aware routing. In Proceedings - International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018 (pp. 346-360). [8665797] (Proceedings - International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/SC.2018.00030

@inproceedings{e747ec8a2c5b4182be4db7abb4f2a087,
title = "Mitigating inter-job interference using adaptive flow-aware routing",
abstract = "On most high performance computing platforms, concurrently executing jobs share network resources. This sharing can lead to inter-job network interference, which can have a significant impact on the performance of communication-intensive applications. No satisfactory solutions yet exist for mitigating such performance degradation on systems that allow jobs to share the network for the sake of higher utilization. In this paper, we analyze network congestion caused by multi-job workloads on two production systems that use popular network topologies - fat-tree and dragonfly. For each system, we establish a regression model to relate network hotspots to application performance degradation. The models show that current routing strategies are ineffective at balancing network traffic and mitigating interference on production systems. We propose an alternative routing strategy, which we call adaptive flow-aware routing. We implement our strategy on a fat-tree system, and tests on the system show up to a 46{\%} improvement in job run time when compared to the default routing.",
keywords = "Adaptive routing, Congestion, Fat-tree topology, High-speed networks, Performance degradation, Routing protocols",
author = "Smith, {Staci A.} and Cromey, {Clara E.} and Lowenthal, {David K} and Jens Domke and Nikhil Jain and Thiagarajan, {Jayaraman J.} and Abhinav Bhatele",
year = "2019",
month = "3",
day = "11",
doi = "10.1109/SC.2018.00030",
language = "English (US)",
series = "Proceedings - International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
pages = "346--360",
booktitle = "Proceedings - International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018",

}

TY  - GEN
T1  - Mitigating inter-job interference using adaptive flow-aware routing
AU  - Smith, Staci A.
AU  - Cromey, Clara E.
AU  - Lowenthal, David K
AU  - Domke, Jens
AU  - Jain, Nikhil
AU  - Thiagarajan, Jayaraman J.
AU  - Bhatele, Abhinav
PY  - 2019/3/11
Y1  - 2019/3/11
AB  - On most high performance computing platforms, concurrently executing jobs share network resources. This sharing can lead to inter-job network interference, which can have a significant impact on the performance of communication-intensive applications. No satisfactory solutions yet exist for mitigating such performance degradation on systems that allow jobs to share the network for the sake of higher utilization. In this paper, we analyze network congestion caused by multi-job workloads on two production systems that use popular network topologies - fat-tree and dragonfly. For each system, we establish a regression model to relate network hotspots to application performance degradation. The models show that current routing strategies are ineffective at balancing network traffic and mitigating interference on production systems. We propose an alternative routing strategy, which we call adaptive flow-aware routing. We implement our strategy on a fat-tree system, and tests on the system show up to a 46% improvement in job run time when compared to the default routing.
KW  - Adaptive routing
KW  - Congestion
KW  - Fat-tree topology
KW  - High-speed networks
KW  - Performance degradation
KW  - Routing protocols
UR  - http://www.scopus.com/inward/record.url?scp=85064140506&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=85064140506&partnerID=8YFLogxK
U2  - 10.1109/SC.2018.00030
DO  - 10.1109/SC.2018.00030
M3  - Conference contribution
AN  - SCOPUS:85064140506
T3  - Proceedings - International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018
SP  - 346
EP  - 360
BT  - Proceedings - International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018
PB  - Institute of Electrical and Electronics Engineers Inc.
ER  -