CUDA optimization strategies for compute- and memory-bound neuroimaging algorithms

Daren Lee, Ivo Dinov, Bin Dong, Boris Gutman, Igor Yanovsky, Arthur W. Toga

Research output: Contribution to journal › Article

29 Citations (Scopus)

Abstract

As neuroimaging algorithms and technology continue to grow faster than CPU performance in complexity and image resolution, data-parallel computing methods will be increasingly important. The high-performance, data-parallel architecture of modern graphics processing units (GPUs) can reduce computational times by orders of magnitude. However, their massively threaded architecture introduces challenges when GPU resources are exceeded. This paper presents optimization strategies for compute- and memory-bound algorithms on the CUDA architecture. For compute-bound algorithms, register usage is reduced through variable reuse via shared memory, and data throughput is increased through heavier thread workloads and a thread configuration that maximizes occupancy with a single thread block per multiprocessor. For memory-bound algorithms, the data are fitted into the fast but limited GPU resources by reorganizing them into self-contained structures and employing a multi-pass approach. Memory latencies are reduced by selecting memory resources whose cache behavior is suited to the algorithm's access patterns. We demonstrate the strategies on two computationally expensive algorithms and achieve optimized GPU implementations that perform up to 6× faster than unoptimized ones. Compared to CPU implementations, we achieve peak GPU speedups of 129× for the 3D unbiased nonlinear image registration technique and 93× for the non-local means surface denoising algorithm.
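Two of the compute-bound strategies the abstract names — staging values in shared memory instead of per-thread registers, and giving each thread a heavier workload so fewer, fuller blocks cover the volume — can be sketched generically as below. This is a minimal illustration of those techniques on a toy per-voxel operation, not the authors' registration or denoising kernel; the kernel and parameter names are hypothetical.

```cuda
// Hedged sketch: generic per-voxel kernel illustrating (a) shared-memory
// staging of block-wide values to relieve register pressure and
// (b) a grid-stride loop so each thread processes several voxels.
__global__ void scale_volume(const float* __restrict__ in,
                             float* __restrict__ out,
                             float scale, int n)
{
    // A value needed by every thread is kept once per block in shared
    // memory rather than in a register of each thread.
    __shared__ float s_scale;
    if (threadIdx.x == 0) s_scale = scale;
    __syncthreads();

    // Grid-stride loop: a smaller grid of heavier threads covers the
    // whole volume, raising per-thread data throughput.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x) {
        out[i] = in[i] * s_scale;
    }
}
```

Under this scheme the launch configuration can be chosen so that one large thread block occupies each multiprocessor (e.g. `scale_volume<<<numSMs, maxThreadsPerBlock>>>(...)`), in the spirit of the single-block-per-multiprocessor configuration the paper describes.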

Original language: English (US)
Pages (from-to): 175-187
Number of pages: 13
Journal: Computer Methods and Programs in Biomedicine
ISSN: 0169-2607
Publisher: Elsevier Ireland Ltd
Volume: 106
Issue number: 3
DOI: 10.1016/j.cmpb.2010.10.013
State: Published - Jun 2012
Externally published: Yes

Keywords

  • Compute-bound
  • CUDA
  • Graphics Processing Unit (GPU)
  • Memory-bound
  • Neuroimaging
  • Performance Optimization

ASJC Scopus subject areas

  • Computer Science Applications
  • Software
  • Health Informatics
