
Scout: High-Performance Heterogeneous Computing Made Simple

2011, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum

https://doi.org/10.1109/IPDPS.2011.385

Abstract

Researchers must often write their own simulation and analysis software. During this process they simultaneously confront both computational and scientific problems. Current strategies for aiding the generation of performance-oriented programs do not abstract the software development from the science. Furthermore, the problem is becoming increasingly complex and pressing with the continued development of many-core and heterogeneous (CPU-GPU) architectures. To achieve high performance, scientists must expertly navigate both software and hardware. Co-design between computer scientists and research scientists can alleviate but not solve this problem. The science community requires better tools for developing, optimizing, and future-proofing codes, allowing scientists to focus on their research while still achieving high computational performance. Scout is a parallel programming language and extensible compiler framework targeting heterogeneous architectures. It provides the abstraction required to buffer scientists from the constantly shifting details of hardware while still realizing high performance by encapsulating software and hardware optimization within a compiler framework.

Key takeaways

  1. Scout abstracts hardware details while enabling high-performance execution on heterogeneous architectures.
  2. The language offers a simpler programming model compared to C/C++ and CUDA, facilitating easier debugging and maintenance.
  3. Scout's toolchain builds on the LLVM compiler infrastructure and applies target-specific optimization to CPU and GPU code; a JIT profiler is planned as future work.
  4. Future developments aim to enhance portability and integrate data visualization directives in Scout.
  5. Collaborations with Los Alamos National Laboratory demonstrate Scout's application in large-scale data analysis and visualization.
LA-UR [report number illegible in source]
Approved for public release; distribution is unlimited.

Title: Scout: High-Performance Heterogeneous Computing Made Simple
Author(s): James Jablin, Patrick McCormick, Maurice Herlihy
Intended for: IPDPS 2011 PhD Forum

Los Alamos National Laboratory, an affirmative action/equal opportunity employer, is operated by the Los Alamos National Security, LLC for the National Nuclear Security Administration of the U.S. Department of Energy under contract DE-AC52-06NA25396. By acceptance of this article, the publisher recognizes that the U.S. Government retains a nonexclusive, royalty-free license to publish or reproduce the published form of this contribution, or to allow others to do so, for U.S. Government purposes. Los Alamos National Laboratory requests that the publisher identify this article as work performed under the auspices of the U.S. Department of Energy. Los Alamos National Laboratory strongly supports academic freedom and a researcher's right to publish; as an institution, however, the Laboratory does not endorse the viewpoint of a publication or guarantee its technical correctness. Form 836 (7/06)

Grad: James A. Jablin, 3rd year (Brown University)
Advisor: Patrick McCormick (Los Alamos National Laboratory)
Advisor: Maurice Herlihy (Brown University)

Keywords: GPU; high-performance; parallel programming language; compilers

I. INTRODUCTION

Advances in multicore and heterogeneous hardware have outpaced software development. Previously, programmers could expect increasing performance without intervention, based on regular increases in clock frequency. Clock frequency increases eventually stopped due to concern about heat dissipation, while Moore's Law continued unabated. Consequently, chip manufacturers used the extra transistors to create multiple cores without increasing overall clock speed, rather than a single faster core. The advent of multicore architectures drastically changed the architectural landscape. The number of cores has increased while clock frequency has either remained constant or decreased.

These radical architectural changes are supported by improved CPU performance. Results achieved on the GPU are equally impressive [1]. Unfortunately, high performance on multicore and heterogeneous architectures often demands expert manual optimization. Programs written in current programming languages are a challenge for automatic parallelization and provide no hardware abstraction. Static compiler optimizations produce conservative results. Without hardware abstraction the programmer must acquire target architecture knowledge to aid optimization. Programmers require better programming abstractions for developing, optimizing, and future-proofing codes.

The most common programming languages for scientific workloads are C/C++ and Fortran. Invented in a time when single-core CPUs dominated, these languages contain constructs that improve single-core performance but foil autoparallelization techniques. For example, in C/C++, pointer manipulation and subversive typecasting thwart alias analysis, creating missed opportunities for optimization. Inadequate programmer education about parallel processing compounds the problem, yielding programs that defy static optimization.

A programming language should reflect the ease of use and intuitiveness of sequential programming and still admit directed compiler optimization and parallelization. Domain-specific languages have limited use cases. Parallel programming languages deliver a highly parallel and scalable programming model, but none has gained widespread appeal. Sequential languages with parallel extensions better exploit concurrency but still contain sequential language pitfalls.

For developing GPU codes, programmers use CUDA C [10] or OpenCL [4]. Both require explicit management of data between CPU and GPU, a tedious and error-prone process. Management of linked data structures poses an even greater challenge than arrays, further limiting utility. CUDA C and OpenCL are extensions of C. As such, the programmer must balance between avoiding C's problems and exploiting the performance potential of the GPU.

This paper presents Scout, a parallel language and extensible compiler framework. It shares similar concepts and the same name with its predecessor [8]. New contributions focus on a refactored programming language and a completely redesigned compiler framework and optimization techniques. Scout combines the effective strategy of explicitly identifying concurrency through program directives with an extensible compiler framework for profiling, statically optimizing, and tuning. These new modifications enable Scout to provide a high-level programming abstraction and still generate efficient executable code for heterogeneous architectures.

II. SCOUT PROGRAMMING LANGUAGE

The Scout parallel programming language facilitates analysis and simulation of computationally intensive data sets by providing constructs for describing concurrency while abstracting architectural details and data management. It is an imperative language providing strong type information, an abstract data layout, and explicit concurrency constructs.

Listing 1 is an example program written in CUDA and Listing 2 is an example of the same program written in Scout. The code in the listings simulates two-dimensional heat transfer of a point source. When compiled, both codes execute on the GPU with equivalent performance. Comparing the listings highlights Scout's ability to hide architectural detail behind an abstract yet descriptive language. Scout programs are easier to program, debug, and maintain without loss of performance.

Listing 1: CUDA 2D Heat Simulation

    ■ __global__ void heat2d(unsigned N, float heatIn[N][N],
    ■                        float heatOut[N][N]) {
    ■   int row = blockIdx.x * blockDim.x + threadIdx.x;
    ■   int col = blockIdx.y * blockDim.y + threadIdx.y;
    ■   if (row < N && col < N) {
    ■     float delta = 0.0;
    ■     if (col < N-1) delta += heatIn[row  ][col+1];
    ■     if (col > 0)   delta += heatIn[row  ][col-1];
    ■     if (row < N-1) delta += heatIn[row+1][col  ];
    ■     if (row > 0)   delta += heatIn[row-1][col  ];
    ■     heatOut[row][col] = heatIn[row][col] +
    ■                         0.25 * (delta - 4 * heatIn[row][col]);
    ■   }
    ■ }

      __host__ void main(unsigned COUNT) {
        /* Declare 2D array of data structure of 1,048,576 elements */
        float heatIn[1024][1024], heatOut[1024][1024];
        /* Initialize heatIn and heatOut */

        /* Malloc memory on the GPU */
        float *d_heatIn, *d_heatOut;
        int const N_BYTES = 1024 * 1024 * sizeof(float);
    ☒   cudaMalloc(&d_heatIn, N_BYTES);
    ☒   cudaMalloc(&d_heatOut, N_BYTES);

        /* Copy 2D arrays from the CPU to the GPU */
    ~   cudaMemcpy(d_heatIn, heatIn, N_BYTES,
    ~              cudaMemcpyHostToDevice);
    ~   cudaMemcpy(d_heatOut, heatOut, N_BYTES,
    ~              cudaMemcpyHostToDevice);

        for (unsigned i = 0; i < COUNT; ++i) {
          /* Define grid and block size to generate 1,048,576 threads */
          dim3 grid(64, 64), block(16, 16);
          /* Perform 2D heat simulation */
    ■     heat2d<<<grid, block>>>(1024, d_heatIn, d_heatOut);
          d_heatIn = d_heatOut;
        }

        /* Copy the 2D array back from GPU to CPU */
    ~   cudaMemcpy(heatOut, d_heatOut, N_BYTES,
    ~              cudaMemcpyDeviceToHost);
        /* Free the 2D arrays */
    ☒   cudaFree(d_heatIn);
    ☒   cudaFree(d_heatOut);
      }

    ■ GPU kernel    ~ CPU-GPU communication    ☒ Memory (de)allocation

Listing 2: Scout 2D Heat Simulation

      void main(unsigned COUNT) {
        /* Declare 2D array of data structure of 1,048,576 elements */
        uniform grid Grid[1024,1024];
        /* Declare 2D variables of type float */
        float@cell Grid:Grid heatIn, heatOut;
        /* Initialize heatIn and heatOut */

        for (unsigned i = 0; i < COUNT; ++i) {
          /* Perform 2D heat simulation */
    ■     forall cells in Grid
    ■       heatOut = heatIn + 0.25 * (north(heatIn, 0.0) +
    ■                 south(heatIn, 0.0) + east(heatIn, 0.0) +
    ■                 west(heatIn, 0.0) - 4 * heatIn);
          heatIn = heatOut;
        }
      }

    ■ GPU kernel    ~ CPU-GPU communication    ☒ Memory (de)allocation

In Listing 1, memory for GPU variables must be malloc'ed. Next, there is code to copy the variables from CPU to GPU. The number of GPU threads per grid and per block is declared. Then the GPU kernel heat2d is executed. Finally, the results are copied from GPU to CPU and GPU memory is freed.

The Scout version in Listing 2 is noticeably more concise. Scout abstracts the tasks related to GPU initialization and CPU-GPU communication. Initially, a Scout grid type is defined. Scout grid types define a multidimensional statically allocated data structure. The rank of each dimension need not be the same. Next, the type of each grid element is declared. The forall keyword identifies an explicitly concurrent block of code. The Grid type beside forall indicates that lines nested within forall execute on all 1,048,576 elements.

In addition, the Scout GPU kernel is slimmer and more abstract compared to the CUDA GPU kernel. The GPU kernel of each listing is marked with black squares. Notice the declaration of GPU threads in the CUDA GPU kernel and the absence of thread declarations in the Scout example. In the CUDA example, perimeter elements must be explicitly handled. By contrast, Scout's GPU kernel implicitly handles GPU threads, and the corner cases of the perimeter elements are defined with the intrinsic stencil operations: north, south, east, and west. Stencil operations query a neighbor element's value without explicit reference to the data structure. The second parameter represents the value returned if an element beyond the dimension's size is accessed. Scout also includes a circular version of handling out-of-bounds access. Abstracting data reference enables optimization of the data structure for the target architecture.

Scout leverages an efficient and powerful programming abstraction to allow the programmer to easily write programs for heterogeneous architectures. The following section details the work performed in the compiler to transform a Scout program into a high-performing executable.

III. SCOUT TOOLCHAIN

Scout bootstraps from the LLVM [5] compiler infrastructure. LLVM provides common compiler parts as modules assembled based upon user discretion. A schematic diagram of Scout's toolchain appears in Figure 1. The current status of the toolchain is marked in solid black, and future work is marked in dashed grey. Implementing Scout in LLVM facilitates portability. LLVM maintains many CPU architecture backends. Future work on Scout will expand the number of GPU backends.

[Figure 1. Scout toolchain. Recoverable labels: Inputs, Optimization, JIT Profiler, Outputs; legend: current status.]

The Scout compiler takes a Scout source program, like Listing 2, as input and outputs executable CPU and GPU code. The frontend parses and translates the Scout program into LLVM assembly language, the input and output of all LLVM modules. Sections of concurrent code in a Scout program amenable to GPU execution are identified within the frontend.

Kernel identification partitions LLVM assembly into GPU-bound and CPU-bound codes. Each code receives target-specific optimization. For GPU kernels, CUDA declarations are automatically inserted based on each GPU kernel's characteristics. After GPU optimization, LLVM assembly is lowered to PTX by Scout's PTX backend, a branch of Rhodin's backend [11]. PTX is the intermediate assembly language for NVIDIA GPUs. Likewise, after CPU optimization, LLVM assembly is lowered to the target CPU architecture.

A. CPU Optimization

Common compiler optimizations are applied to the CPU code. Because the bus connecting CPU and GPU has characteristics of high bandwidth and high latency, GPU kernels tend to dominate runtime, and the work of the CPU is often restricted to copying data and starting GPU kernels. Accordingly, minimizing the number of CPU-GPU communications, rather than the volume of data communicated, will achieve best performance. Consequently, improvements to performance for heterogeneous architectures reduce the CPU workload to CPU-GPU communication and synchronization.

B. GPU Optimization

The important factors for maximizing GPU kernel performance are coalescing memory operations, removing control flow, and distributing work to all GPU cores. All these factors accentuate the GPU's single-instruction, multiple-thread (SIMT) architecture and are more fully discussed in "CUDA C Best Practices" [9]. The compiler supports the different kinds of GPU memory.

The data layout of the GPU kernel's arguments is first optimized for coalesced memory access on the GPU. Coalesced memory operations achieve maximum throughput for the same latency as a single memory access. This optimization is possible by transforming the variable's data footprint in memory. In C/C++, changing the data layout of variables while enforcing pointer arithmetic semantics is impractical.

The GPU's SIMT architecture forces every GPU thread to execute in lock step. Control flow within a GPU kernel will degrade performance by introducing branch divergence. Only threads that satisfy the conditional will execute, while the rest idle until those threads finish the branch. Removing control flow from within a kernel prevents branch divergence.

Distributing work between GPU cores hides latency by overlapping processes stalled on memory accesses with processes able to compute. Uncoalesced memory operations and under-utilization of GPU cores have severe performance implications for a memory-bound kernel.

Finally, variables with high dimensionality present a difficult problem for efficient memory access. Scout has a special optimization for this case. The variable is tiled into two-dimensional blocks. This process has the added benefit of enabling better cache coherence.

C. CUDA Declarations

The CUDA declarations facilitate GPU kernel management. As shown in Listing 1, CUDA declarations allocate and deallocate GPU memory and provide routines for CPU-GPU communication.

Inefficient CPU-GPU communication patterns result in poor overall performance. Where possible, the compiler optimizes communication to minimize redundant copies. Also, the compiler transforms cyclic CPU-GPU communication patterns into acyclic ones.

CUDA declarations set the number of threads per grid and per block. Poor choices of grid and block size decrease GPU utilization and consequently decrease performance. The compiler sets the number of threads based upon kernel arguments. Programmers have difficulty choosing an appropriate thread count because the choice depends on code and GPU architecture. Automatically defining thread count relieves this burden from the programmer. As shown in Listing 1, branch statements must be manually calculated and inserted in the kernel to prevent out-of-bound errors. The Scout compiler automates this process.

IV. RESULTS

Collaborations are underway with several groups from Los Alamos National Laboratory to use Scout to aid in analysis and visualization of large-scale data sets. With the Physics and Chemistry of Materials group, short-range molecular dynamics codes are being adapted to run on a GPU cluster. Additionally, with the cooperation of the Space Science and Applications, Nuclear and Particle Physics, and Astrophysics and Cosmology groups, work continues on cosmology codes.

The Rodinia Benchmark Suite [3] is composed of programs with parallelized versions using OpenMP and CUDA for multicore CPUs and GPUs respectively. Work continues on manually porting the CUDA versions to Scout and comparing performance. Results are encouraging.

V. FUTURE WORK

Scout eases the transition for programmers to leverage GPUs for high performance. Future work will continue to improve performance and portability. Additions to Scout's toolchain include an AMD IL backend, OpenCL declarations, and a JIT profiler.

An AMD IL backend and OpenCL declarations would enable Scout to target AMD GPUs. AMD IL is analogous to PTX: it is the intermediate assembly language for AMD GPUs.

LLVM has a module for JIT compilation of LLVM assembly. As depicted in the Scout toolchain diagram in Figure 1, the JIT could be used to profile and tune GPU kernels for performance based on the number of threads allocated per grid and per block. Profiling could also identify CPU codes sandwiched between GPU codes causing a long latency chain. Lowering these sandwiched CPU codes to the GPU will improve performance by removing communication. To further enhance performance, kernel fusion and fission optimizations would improve cache coherence via locality of reference.

Currently, the Scout compiler does not statically analyze programs for parallelizable loops, relying solely on Scout's concurrent annotations. Prior work to automatically detect parallelizable loops in programs complements Scout's annotations and could further improve performance.

Besides high computational performance, data visualization is also important. The Scout programming language will integrate performance and visualization directives, providing facilities for analysis and simulation of computationally intensive data sets.

VI. RELATED WORK

Prior work on automatic parallelization for GPUs focuses on improving codes written in a mixture of C and CUDA. No prior work abstracts details about concurrency or memory management. For programs written in CUDA, CUDA-lite [13] improves GPU kernel performance by optimizing GPU memory access. Using the polyhedral model, "C-to-CUDA for Affine Programs" [2] and "A mapping path for GPGPU" [7] optimize C codes into efficient CUDA C. For a less portable solution, "OpenMP to GPGPU" [6] translates programs annotated with OpenMP pragmas to CUDA C.

The PGI Fortran and C compiler [12] advertises semi-automatic GPU parallelization. Users must mark loops, and the optimization is not tolerant of general pointer arithmetic.

VII. CONCLUSION

Current programming languages and tools divert attention away from science to software development. Scout showcases how a mix of language abstraction and compiler automation results in high performance without the burden of manual optimization. Using an explicitly parallel language with few universal primitives results in a less computationally intensive analysis and broader application of parallelization techniques. Future work focuses on optimization opportunities not targeted by the current framework and improvements to portability.

REFERENCES

[1] CUDA Community Showcase. http://www.nvidia.com/object/cuda_showcase_html.html.
[2] M. M. Baskaran, J. Ramanujam, and P. Sadayappan. Automatic C-to-CUDA code generation for affine programs. In R. Gupta, editor, CC, volume 6011 of Lecture Notes in Computer Science. Springer, 2010.
[3] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron. Rodinia: A benchmark suite for heterogeneous computing. In IEEE International Symposium on Workload Characterization, 2009. IISWC 2009., pages 44-54, October 2009.
[4] Khronos OpenCL Working Group. The OpenCL Specification, September 2010.
[5] C. Lattner and V. Adve. LLVM: A compilation framework for lifelong program analysis & transformation. In CGO '04: Proceedings of the International Symposium on Code Generation and Optimization, page 75, Washington, DC, USA, 2004. IEEE Computer Society.
[6] S. Lee, S.-J. Min, and R. Eigenmann. OpenMP to GPGPU: a compiler framework for automatic translation and optimization. In PPoPP '09: Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, New York, NY, USA, 2009.
[7] A. Leung, N. Vasilache, B. Meister, M. Baskaran, D. Wohlford, C. Bastoul, and R. Lethin. A mapping path for multi-GPGPU accelerated computers from a portable high level programming abstraction. In GPGPU '10: Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, pages 51-61, New York, NY, USA, 2010.
[8] P. McCormick, J. Inman, J. Ahrens, J. Mohd-Yusof, G. Roth, and S. Cummins. Scout: a data-parallel programming language for graphics processors. Parallel Comput., 33:648-662, November 2007.
[9] NVIDIA Corporation. CUDA C Best Practices Guide 3.2, 2010.
[10] NVIDIA Corporation. NVIDIA CUDA C Programming Guide 3.2, Nov. 2010.
[11] H. Rhodin. LLVM PTX Backend. http://sourceforge.net/projects/llvmptxbackend.
[12] The Portland Group. PGI Fortran & C Accelerator Programming Model. White Paper, 2010.
[13] S.-Z. Ueng, M. Lathara, S. S. Baghsorkhi, and W.-M. W. Hwu. CUDA-Lite: Reducing GPU Programming Complexity. In Languages and Compilers for Parallel Computing, pages 1-15. Springer-Verlag, Berlin, Heidelberg, 2008.
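As a reference point for the stencil semantics described in Section II, the following plain-C sketch applies one heat-diffusion step on the CPU. It is an illustration under stated assumptions, not Scout's implementation: the function names `neighbor_or` and `heat_step` and the flat row-major layout are hypothetical, and the fallback value stands in for the second argument of Scout's north, south, east, and west intrinsics.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: value at (row, col), or `fallback` when the index
 * lies outside the n x n grid. This mirrors a stencil query that returns
 * a default value for out-of-bounds neighbors. */
float neighbor_or(const float *grid, size_t n, long row, long col,
                  float fallback) {
    if (row < 0 || col < 0 || row >= (long)n || col >= (long)n)
        return fallback;
    return grid[(size_t)row * n + (size_t)col];
}

/* One explicit step: out = in + 0.25 * (north + south + east + west - 4 * in).
 * `in` and `out` must be distinct buffers of n * n floats. */
void heat_step(const float *in, float *out, size_t n) {
    assert(in != NULL && out != NULL && in != out && n > 0);
    for (long r = 0; r < (long)n; ++r) {
        for (long c = 0; c < (long)n; ++c) {
            float center = in[(size_t)r * n + (size_t)c];
            float sum = neighbor_or(in, n, r - 1, c, 0.0f)
                      + neighbor_or(in, n, r + 1, c, 0.0f)
                      + neighbor_or(in, n, r, c + 1, 0.0f)
                      + neighbor_or(in, n, r, c - 1, 0.0f);
            out[(size_t)r * n + (size_t)c] =
                center + 0.25f * (sum - 4.0f * center);
        }
    }
}
```

Because the boundary test lives in one helper, the update loop carries no per-direction guards, which is the same separation the stencil intrinsics provide in Listing 2.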


FAQs


What are Scout's key advancements over existing parallel programming languages?

Scout abstracts hardware details and simplifies concurrency management, making programs easier to write while maintaining performance. In contrast to C-based GPU languages such as CUDA C and OpenCL, it automates tasks like GPU memory management and CPU-GPU data transfer.

How does Scout optimize CPU and GPU communication?

The Scout compiler minimizes CPU-GPU communications by transforming cyclic patterns into acyclic ones and streamlining data transfer. This reduces latency and allows better workload distribution between CPU and GPU.
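To make the cyclic versus acyclic distinction concrete, here is a minimal C sketch of a transfer cost model. Everything in it is an assumption for illustration: the paper does not describe how the compiler detects or rewrites communication patterns, and the function names and the latency-plus-bandwidth formula are hypothetical.

```c
#include <assert.h>

/* Hypothetical cost model: every transfer pays a fixed latency plus
 * bytes / bandwidth. Units are seconds and bytes per second. */
double transfer_time(unsigned long transfers, double bytes_per_transfer,
                     double latency_s, double bandwidth_bytes_per_s) {
    assert(latency_s >= 0.0 && bandwidth_bytes_per_s > 0.0);
    return (double)transfers *
           (latency_s + bytes_per_transfer / bandwidth_bytes_per_s);
}

/* Cyclic pattern: copy in and copy out around every kernel launch. */
unsigned long transfers_cyclic(unsigned long iterations) {
    return 2UL * iterations;
}

/* Acyclic pattern: one copy in before the loop, one copy out after it. */
unsigned long transfers_acyclic(unsigned long iterations) {
    return iterations > 0 ? 2UL : 0UL;
}
```

Under this model, an iterative code such as the heat simulation pays the fixed latency twice in total instead of twice per iteration once the copies are hoisted out of the loop.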

What role do CUDA declarations play in Scout's functionality?

CUDA declarations, which the Scout compiler inserts automatically, allocate and deallocate GPU memory, provide routines for CPU-GPU communication, and set the number of threads per grid and per block. This automation relieves the programmer of manual GPU kernel management.
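The thread-count choice can be sketched in C as a ceiling division over the problem size. This is only an assumption about what such a calculation might look like: the paper does not give Scout's heuristic, and the `launch_dims` struct and both function names are hypothetical. The 16 x 16 block used in the usage note comes from the paper's CUDA listing.

```c
#include <assert.h>

/* Hypothetical container for a 2D launch configuration. */
typedef struct {
    unsigned grid_x, grid_y;   /* blocks per grid in each dimension */
    unsigned block_x, block_y; /* threads per block in each dimension */
} launch_dims;

/* Ceiling division: fewest blocks of size `block` that cover `n` items. */
unsigned blocks_needed(unsigned n, unsigned block) {
    assert(block > 0);
    return n / block + (n % block != 0);
}

/* Pick a grid that covers rows x cols elements for a given block shape. */
launch_dims choose_launch(unsigned rows, unsigned cols,
                          unsigned block_x, unsigned block_y) {
    launch_dims d;
    d.block_x = block_x;
    d.block_y = block_y;
    d.grid_x = blocks_needed(cols, block_x);
    d.grid_y = blocks_needed(rows, block_y);
    return d;
}
```

For a 1024 x 1024 grid with 16 x 16 blocks this yields a 64 x 64 grid, that is 1,048,576 threads. When the size is not a multiple of the block size, the grid over-covers the data, which is why a kernel needs a bounds check such as `row < N && col < N`.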

How does Scout achieve performance similar to CUDA despite abstraction?

The paper reports that Scout maintains performance by optimizing data layouts and GPU kernel characteristics during compilation. For the 2D heat simulation example, it states that the Scout and CUDA versions execute on the GPU with equivalent performance.
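One data-layout transformation the paper mentions is tiling a variable into two-dimensional blocks. The C sketch below shows a simplified version of the idea for a square 2D array: it maps a (row, col) index to its offset in a layout stored tile by tile. The function name, the square-array restriction, and the requirement that the size be a multiple of the tile size are assumptions for illustration, not details taken from Scout.

```c
#include <assert.h>
#include <stddef.h>

/* Offset of element (row, col) in an n x n array stored as contiguous
 * tile x tile blocks, with tiles laid out in row-major order and elements
 * inside each tile also in row-major order.
 * Assumes n is a multiple of tile. */
size_t tiled_offset(size_t row, size_t col, size_t n, size_t tile) {
    assert(tile > 0 && n % tile == 0 && row < n && col < n);
    size_t tiles_per_row = n / tile;
    size_t tile_index = (row / tile) * tiles_per_row + (col / tile);
    size_t within_tile = (row % tile) * tile + (col % tile);
    return tile_index * tile * tile + within_tile;
}
```

With this mapping, the elements of one tile occupy one contiguous span of memory, so neighbors in both dimensions stay close together in the address space.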

What future enhancements are planned for the Scout toolchain?

Future work includes integrating an AMD IL backend and OpenCL declarations to support more GPU architectures. Additionally, profiling capabilities for tuning GPU kernels will improve performance optimization further.