A Programming Language Interface to Describe Transformations and Code Generation
2011, Languages and Compilers for Parallel Computing
https://doi.org/10.1007/978-3-642-19595-2_10…
2 pages
Sign up for access to the world's latest research
Abstract
This paper presents a programming language interface, a complete scripting language, to describe composable compiler transformations. These transformation programs can be written, shared and reused by non-expert application and library developers. From a compiler writer's perspective, a scripting language interface permits rapid prototyping of compiler algorithms that can mix levels and compose different sequences of transformations, producing readable code as output. From a library or application developer's perspective, the use of transformation programs permits expression of clean high-level code, and a separate description of how to map that code to architectural features, easing maintenance and porting to new architectures. We illustrate this interface in the context of CUDA-CHiLL, a sourceto-source compiler transformation and code generation framework that transforms sequential loop nests to high-performance GPU code. We show how this high-level transformation and code generation language can be used to express: (1) complex transformation sequences, exemplified by a single loop restructuring construct used to generate a series of tiling and permute commands; and, (2) complex code generation sequences to produce CUDA code from a high-level specification. We demonstrate that the automatically-generated code either performs closely or outperforms two hand-tuned GPU library kernels from Nvidia's CUBLAS 2.2 and 3.2 libraries.
Related papers
Proceedings of the 2011 ACM SIGPLAN X10 Workshop, 2011
GPU architectures have emerged as a viable way of considerably improving performance for appropriate applications. Program fragments (kernels) appropriate for GPU execution can be implemented in CUDA or OpenCL and glued into an application via an API. While there is plenty of evidence of performance improvements using this approach, there are many issues with productivity. Programmers must understand an additional programming model and API to program the accelerator; concurrency and synchronization in this programming model is typically expressed differently from the programming model for the host. On top of this, the languages used to write kernels are very low level and thus prone to the kinds of errors that one does not encounter in higher level languages. Programmers must explicitly deal with moving data back-and-forth between the host and the accelerator. These problems are compounded when the user code must be run across a cluster of accelerated nodes. Now the host programming model must further be extended with constructs to deal with scale-out and remote accelerators. We believe there is a critical need for a single source programming model that can be used to write clean, efficient code for heterogeneous, multi-core and scale-out architectures. The APGAS programming model has been developed for such architectures over the past six years. APGAS is based on four fundamental (and architecture-independent) notions: locality, asynchrony, conditional atomicity and order. X10 is an instantiation of the APGAS programming model on top of a base sequential language with Java-style productivity. Earlier work has shown that X10 can be used to write clean and efficient code for homogeneous multi-cores, SMPs, Cell-accelerated nodes, and clusters of such nodes. In this paper we show how X10 programmers can write code that can be compiled and run on GPUs. GPU programming idioms such as threads, blocks, barriers, constant memory, local registers, shared memory variables, etc. can be directly expressed in X10, and do not require new language extensions. We present the design of an extension of the X10-to-C++ compiler which recognizes such idioms and produces CUDA kernel code. We show several benchmarks written in this style. The performance of these kernels is within 80% of handwritten CUDA kernels. [Copyright notice will appear here once 'preprint' option is removed.] We believe these results establish X10 as a single-source programming language in which clean, efficient programs can be written for GPU-accelerated clusters.
Lecture Notes in Computer Science, 2010
Graphics Processing Units (GPUs) offer tremendous computational power. CUDA (Compute Unified Device Architecture) provides a multi-threaded parallel programming model, facilitating high performance implementations of general-purpose computations. However, the explicitly managed memory hierarchy and multi-level parallel view make manual development of high-performance CUDA code rather complicated. Hence the automatic transformation of sequential input programs into efficient parallel CUDA programs is of considerable interest.
Lecture Notes in Computer Science, 2011
The evolution of high performance computers is progressing toward increasingly heterogeneous systems. These new architectures pose new challenges, particularly in the field of programming languages. New tools and languages are needed if we want to make a full use of the advantages offered by these new architectures. llc is a language with a C-like syntax where parallelism is expressed using compiler directives. In this work we focus our attention on the new backend of our prototype compiler for llc which generates CUDA code. We evaluate the performance of the target code using three different applications. The preliminary results that we present make us believe that our approach is worth to be explored more deeply.
Concurrency and Computation: Practice and Experience, 2015
Manycore accelerators have the potential to significantly improve performance of scientific applications when offloading computationally intensive program portions to accelerators. Directive-based high-level programming models, such as OpenACC and OpenMP, are used to create applications for accelerators through annotating regions of code meant for offloading. OpenACC is an emerging directive-based programming model for programming accelerators that typically enable inexperienced programmers to achieve portable and productive performance within applications. In this paper, we present our research in developing challenges and solutions when creating an open-source OpenACC compiler in an industrial framework (OpenUH as a branch of Open64). We then discuss in detail techniques we developed for loop scheduling reduction operations on general purpose GPUs. The compiler is evaluated with benchmarks from the NAS Parallel Benchmarks suite and self-written micro-benchmarks for reduction operations. This implementation has been designed to serve as a compiler infrastructure for researchers to explore advanced compiler techniques, extend OpenACC to other programming models, and build performance tools used in conjunction with OpenACC programs.
ACM Transactions on Architecture and Code Optimization, 2013
This article addresses the compilation of a sequential program for parallel execution on a modern GPU. To this end, we present a novel source-to-source compiler called PPCG. PPCG singles out for its ability to accelerate computations from any static control loop nest, generating multiple CUDA kernels when necessary. We introduce a multilevel tiling strategy and a code generation scheme for the parallelization and locality optimization of imperfectly nested loops, managing memory and exposing concurrency according to the constraints of modern GPUs. We evaluate our algorithms and tool on the entire PolyBench suite.
2010
General-purpose computing on GPUs (graphics processing units) has received much attention lately due to the benefits of stream processing to exploit limitations of parallel processing. However, programming GPUs has several challenges with respect to the amount of effort spent in combining the kernel functional code of an application with the parallel concerns offered by APIs from various GPUs. This paper introduces our approach for raising the level of abstaction for programming GPUs. We have implemented an abstract API that can be used with the Compute Unified Device Architecture (CUDA) and the Open Compute Language (OpenCL) frameworks, so that the mechanical steps involved in writing the GPU code are abstracted in separate modules. The approach involves static code analysis and generative programming techniques for automatically generating the host code required for CUDA and OpenCL frameworks from minimal specifications provided by the programmers. The generated code resembles the...
2009
High-performance computing has recently seen a surge of interest in heterogeneous systems, with an emphasis on modern Graphics Processing Units (GPUs). These devices offer tremendous potential for performance and efficiency in important large-scale applications of computational science. However, exploiting this potential can be challenging, as one must adapt to the specialized and rapidly evolving computing environment currently exhibited by GPUs. One way of addressing this challenge is to embrace better techniques and develop tools tailored to their needs. This article presents one simple technique, GPU run-time code generation (RTCG), along with PyCUDA and PyOpenCL, two open-source toolkits that support this technique.
International Journal of Parallel Programming, 2016
Recent development in Graphic Processing Units (GPUs) has opened a new challenge in harnessing their computing power as a new general-purpose computing paradigm with its CUDA parallel programming. However, porting applications to CUDA remains a challenge to average programmers. In this thesis work we have developed a restructuring software compiler (RT-CUDA) with best possible kernel optimizations to bridge the gap between high-level languages and the machine dependent CUDA environment. RT-CUDA is based upon a set of compiler optimizations. RT-CUDA takes a C-like program and convert it into an optimized CUDA kernel with user directives in a configuration file for guiding the compiler. While the invocation of external libraries is not possible with OpenACC commercial compiler, RT-CUDA allows transparent invocation of the most optimized xvi external math libraries like cuSparse and cuBLAS. For this, RT-CUDA uses interfacing APIs, error handling interpretation, and user transparent programming. This enables efficient design of linear algebra solvers (LAS). Evaluation of RT-CUDA has been performed on Tesla K20c GPU with a variety of basic linear algebra operators (M+, MM, MV, VV, etc.) as well as the programming of solvers of systems of linear equations like Jacobi and Conjugate Gradient. We obtained significant speedup over other compilers like OpenACC and GPGPU compilers. RT-CUDA facilitates the design of efficient parallel software for developing parallel simulators (reservoir simulators, molecular dynamics, etc.) which are critical for Oil & Gas industry in KSA. We expect RT-CUDA to be needed by many KSA industries dealing with science and engineering simulation on massively parallel computers like NVIDIA GPUs.
2010
Abstract Graphical Processing Unit (GPU) programming languages are used extensively for general-purpose computations. However, GPU programming languages are at a level of abstraction suitable only for use by expert parallel programmers. This paper presents a new approach through whichC'or Java programmers can access these languages without having to focus on the technical or language-specific details.
Haiwad Khattak