Scout: High-Performance Heterogeneous Computing Made Simple
2011, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum
https://doi.org/10.1109/IPDPS.2011.385…
5 pages
Abstract
Researchers must often write their own simulation and analysis software. During this process they simultaneously confront both computational and scientific problems. Current strategies for aiding the generation of performance-oriented programs do not abstract the software development from the science. Furthermore, the problem is becoming increasingly complex and pressing with the continued development of many-core and heterogeneous (CPU-GPU) architectures. To achieve high performance, scientists must expertly navigate both software and hardware. Co-design between computer scientists and research scientists can alleviate but not solve this problem. The science community requires better tools for developing, optimizing, and future-proofing codes, allowing scientists to focus on their research while still achieving high computational performance. Scout is a parallel programming language and extensible compiler framework targeting heterogeneous architectures. It provides the abstraction required to buffer scientists from the constantly shifting details of hardware while still realizing high performance by encapsulating software and hardware optimization within a compiler framework.
Key takeaways
- Scout abstracts hardware details while enabling high-performance execution on heterogeneous architectures.
- The language offers a simpler programming model compared to C/C++ and CUDA, facilitating easier debugging and maintenance.
- Scout's toolchain employs LLVM for JIT compilation, optimizing both CPU and GPU performance efficiently.
- Future developments aim to enhance portability and integrate data visualization directives in Scout.
- Collaborations with Los Alamos National Laboratory demonstrate Scout's application in large-scale data analysis and visualization.
Related papers
2005
Modern large-scale scientific computation problems must execute in a parallel computational environment to achieve acceptable performance. Target parallel environments range from the largest tightly-coupled supercomputers to heterogeneous clusters of workstations. Grid technologies make Internet execution more likely. Hierarchical and heterogeneous systems are increasingly common. Processing and communication capabilities can be nonuniform, non-dedicated, transient or unreliable. Even when targeting homogeneous computing environments, each environment may differ in the number of processors per node, the relative costs of computation, communication, and memory access, and the availability of programming paradigms and software tools. Architecture-aware computation requires knowledge of the computing environment and software performance characteristics, and tools to make use of this knowledge. These challenges may be addressed by compilers, low-level tools, dynamic load balancing or solution procedures, middleware layers, high-level software development techniques, and choice of programming languages and paradigms. Computation and communication may be reordered. Data or computation may be replicated or a load imbalance may be tolerated to avoid costly communication. This paper samples a variety of approaches to architecture-aware parallel computation.
2009
There are a number of challenges facing the High Performance Computing (HPC) community, including increasing levels of concurrency (threads, cores, nodes), deeper and more complex memory hierarchies (register, cache, disk, network), mixed hardware sets (CPUs and GPUs) and increasing scale (tens or hundreds of thousands of processing elements). Assessing the performance of complex scientific applications on specialised high-performance computing architectures is difficult. In many cases, traditional computer benchmarking is insufficient as it typically requires access to physical machines of equivalent (or similar) specification and rarely relates to the potential capability of an application. A technique known as application performance modelling addresses many of these additional requirements. Modelling allows future architectures and/or applications to be explored in a mathematical or simulated setting, thus enabling hypothetical questions relating to the configuration of a potential future architecture to be assessed in terms of its impact on key scientific codes.
The Journal of Supercomputing, 2014
Optimization of data-parallel applications for modern HPC platforms requires partitioning the computations between the heterogeneous computing devices in proportion to their speed. Heterogeneous data partitioning algorithms are based on computation performance models of the executing platforms. Their implementation is not trivial as it requires: accurate and efficient benchmarking of computing devices, which may share resources and/or execute different codes; appropriate interpolation methods to predict performance; and advanced mathematical methods to solve the data partitioning problem. In this paper, we present FuPerMod, a software tool that addresses these implementation issues and automates the development of data partitioning code in data-parallel applications for heterogeneous HPC platforms.
GPUs are becoming pervasive in scientific computing. Originally serving as peripheral accelerators, they are now gradually turning into central computing nodes. However, most current directive-based approaches for parallelizing sequential legacy code, such as OpenACC and HMPP, simply off-load "hot" CPU code onto GPUs, entailing many limitations such as unsupported external calls and coarse-grained data dependence analysis. This paper introduces KernelGen, a parallelization framework with a robust parallelism detection mechanism and a novel GPU-centric execution model. KernelGen supports the major scientific programming languages, including C and Fortran, and has multiple backends that can generate target code for both x86 CPUs and NVIDIA GPUs. The efficiency of KernelGen has been demonstrated by a performance improvement of up to 5.4× compared with three major commercial OpenACC compilers over a benchmark suite of numerical kernels.
arXiv (Cornell University), 2002
Scientific codes are increasingly being used in compositional settings, especially problem solving environments (PSEs). Typical compositional modeling frameworks require significant buy-in, in the form of commitment to a particular style of programming (e.g., distributed object components). While this solution is feasible for newer generations of component-based scientific codes, large legacy code bases present a veritable software engineering nightmare. We introduce Weaves, a novel framework that enables modeling, composition, direct code execution, performance characterization, adaptation, and control of unmodified high performance scientific codes. Weaves is an efficient generalized framework for parallel compositional modeling that is a proper superset of the threads and processes models of programming. In this paper, our focus is on the transparent code execution interface enabled by Weaves. We identify design constraints, their impact on implementation alternatives, configuration scenarios, and present results from a prototype implementation on Intel x86 architectures.
1994
There is a need for compiler technology that, given the source program, will generate efficient parallel codes for different architectures with minimal user involvement. Parallel computation is becoming indispensable in solving large-scale problems in science and engineering. Yet, the use of parallel computation is limited by the high costs of developing the needed software.
The mapping process of high performance embedded applications to today's multiprocessor system-on-chip devices suffers from a complex toolchain and programming process. The problem is the expression of parallelism with a pure imperative programming language, which is commonly C. This traditional approach limits the mapping, partitioning and the generation of optimized parallel code, and consequently the achievable performance and power consumption of applications from different domains. The Architecture oriented paraLlelization for high performance embedded Multicore systems using scilAb (ALMA) European project aims to bridge these hurdles through the introduction and exploitation of a Scilab-based toolchain which enables the efficient mapping of applications on multiprocessor platforms from a high level of abstraction. The holistic solution of the ALMA toolchain allows the complexity of both the application and the architecture to be hidden, which leads to better acceptance, reduced development cost, and shorter time-to-market. Driven by the technology restrictions in chip design, the end of exponential growth of clock speeds and an unavoidable increasing request of computing performance, ALMA is a fundamental step forward in the necessary introduction of novel computing paradigms and methodologies.
2003
Lie-Quan Lee. Generic programming is an important paradigm for software development, with an emphasis on reusability and performance, qualities that would seemingly make this paradigm especially suited for application to scientific computing. We apply generic programming to the development of a message passing framework (the Generic Message Passing library) for parallel computing in hybrid execution architectures (i.e., those having both shared and distributed memory). Although GMP supports both shared-memory and distributed-memory execution, it explicitly separates its programming and execution models, presenting a uniform message-based programming interface to enable source-...
ArXiv, 2019
High-Performance Computing (HPC) platforms enable scientific software to achieve breakthroughs in many research fields such as physics, biology, and chemistry, by employing Research Software Engineering (RSE) techniques. These include 1) novel parallelism paradigms such as Shared Memory Parallelism (with e.g. OpenMP 4.5); Distributed Memory Parallelism (with e.g. MPI 4); Hybrid Parallelism which combines them; and Heterogeneous Parallelism (for CPUs, co-processors and accelerators), 2) introducing advanced Software Engineering concepts such as Object Oriented Parallel Programming (OOPP); Parallel Unit testing; Parallel I/O Formats; Hybrid Parallel Visualization; and 3) Selecting the Best Practices in other necessary areas such as User Interface; Automatic Documentation; Version Control and Project Management. In this work we present BACKUS: Comprehensive High-Performance Research Software Engineering Approach for Simulations in Supercomputing Systems, which we found to fit best for ...
2008
In this work, we examine the computational efficiency of scientific applications on three high-performance computing systems based on processors of varying degrees of specialization: an x86 server processor, the AMD Opteron; a more specialized System-on-Chip solution, the BlueGene/L and BlueGene/P; and a configurable embedded core, the Tensilica Xtensa. We use the atmospheric component of the global Community Atmospheric Model to motivate our study by defining a problem that requires exascale-class computing performance currently beyond the capabilities of existing systems. Significant advances in power-efficiency are necessary to make such a system practical to field.
FAQs
What are Scout's key advancements over existing parallel programming languages?
Scout abstracts hardware details and simplifies concurrency management, making programs easier to write while maintaining performance. Unlike traditional languages, it automates tasks such as GPU memory management.
How does Scout optimize CPU and GPU communication?
The Scout compiler minimizes CPU-GPU communications by transforming cyclic patterns into acyclic ones and streamlining data transfer. This reduces latency and allows better workload distribution between CPU and GPU.
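One way to picture the cyclic-to-acyclic transformation is as hoisting transfers out of a loop. The sketch below is a toy model only and is not taken from the paper: `FakeDevice`, `cyclic`, and `acyclic` are hypothetical names, and the "device" simply counts transfers rather than touching a real GPU. It assumes that "cyclic" means a CPU-to-GPU round trip on every iteration, which is one plausible reading of the answer above.

```python
class FakeDevice:
    """Hypothetical stand-in for a GPU; it only counts host<->device transfers."""

    def __init__(self):
        self.transfers = 0
        self.memory = []

    def upload(self, data):
        # Host -> device copy.
        self.transfers += 1
        self.memory = list(data)

    def download(self):
        # Device -> host copy.
        self.transfers += 1
        return list(self.memory)

    def run_kernel(self):
        # Pretend kernel: add 1 to every element held in device memory.
        self.memory = [x + 1 for x in self.memory]


def cyclic(data, steps):
    """Round-trip through the host on every step (CPU -> GPU -> CPU -> ...)."""
    dev = FakeDevice()
    for _ in range(steps):
        dev.upload(data)
        dev.run_kernel()
        data = dev.download()
    return data, dev.transfers


def acyclic(data, steps):
    """Upload once, keep the data resident on the device, download once."""
    dev = FakeDevice()
    dev.upload(data)
    for _ in range(steps):
        dev.run_kernel()
    return dev.download(), dev.transfers


cyclic_result, cyclic_transfers = cyclic([0, 1, 2], steps=10)
acyclic_result, acyclic_transfers = acyclic([0, 1, 2], steps=10)
```

Both versions compute the same result, but the transfer count in the first grows with the number of steps while the second stays constant. That difference is the kind of saving the answer attributes to the compiler.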
What role do CUDA declarations play in Scout's functionality?
CUDA declarations automate GPU memory allocation and deallocation routines while optimizing thread count for maximum GPU utilization. This automation relieves the programmer of manual optimizations related to GPU kernels.
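To make the thread-count part concrete, here is a minimal sketch of the launch-size arithmetic a CUDA programmer usually writes by hand. The function name `launch_config` and the default of 256 threads per block are illustrative assumptions; the source does not say how Scout actually chooses these values.

```python
def launch_config(n_elements, threads_per_block=256):
    """Return (blocks, threads_per_block) giving each element its own thread.

    The default of 256 is an arbitrary illustrative choice, not Scout's.
    """
    if n_elements <= 0 or threads_per_block <= 0:
        raise ValueError("sizes must be positive")
    # Ceiling division: round up so a partial final block is still launched.
    blocks = (n_elements + threads_per_block - 1) // threads_per_block
    return blocks, threads_per_block


blocks, threads = launch_config(1_000_000)
total_threads = blocks * threads  # may exceed n_elements; the kernel must bounds-check
```

The rounding up means the last block can contain threads with no element to process, which is why hand-written kernels typically guard with an index check. A compiler that generates the launch can emit that guard automatically.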
How does Scout achieve performance similar to CUDA despite abstraction?
According to the paper, Scout maintains performance by optimizing data layouts and GPU kernel characteristics during compilation. Its comparison with CUDA shows that Scout can execute tasks with equivalent performance.
What future enhancements are planned for the Scout toolchain?
Future work includes integrating an AMD IL backend and OpenCL declarations to support more GPU architectures. Additionally, profiling capabilities for tuning GPU kernels will improve performance optimization further.
Patrick McCormick