Academia.edu

Data Parallelism

674 papers
2,788 followers
About this topic
Data parallelism is a computational paradigm that involves distributing data across multiple processing units, allowing simultaneous execution of operations on the data. This approach enhances performance and efficiency in processing large datasets by leveraging parallel computing architectures, such as multi-core processors or distributed systems.
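The definition above can be illustrated with a minimal sketch, assuming a simple sum-of-squares workload. The function names (`split_into_chunks`, `partial_sum_of_squares`, `parallel_sum_of_squares`) are hypothetical and chosen only for this example. The data is split into chunks, the same operation runs on every chunk at once, and the partial results are combined. Threads are used here for portability; in CPython a process pool would typically be needed for real speedup on CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, n_chunks):
    """Split data into n_chunks contiguous pieces of near-equal size."""
    size, extra = divmod(len(data), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks each take one additional element.
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def partial_sum_of_squares(chunk):
    """The same operation applied independently to each chunk."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data, n_workers=4):
    """Distribute chunks across workers, then combine the partial results."""
    chunks = split_into_chunks(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_sum_of_squares, chunks))
    return sum(partials)
```

Calling `parallel_sum_of_squares(list(range(10)), n_workers=3)` gives the same result as the sequential sum, since each chunk is processed independently and the final reduction is associative.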

Key research themes

1. How can nested and irregular data-parallelism be implemented efficiently for algorithms on complex data structures?

This research theme addresses the challenge of supporting efficient data-parallel computations on irregular and nested data structures such as graphs, sparse matrices, trees, and complex objects within parallel computing frameworks and languages. Efficient execution models and programming abstractions are needed that can represent irregular data and enable parallel traversals or computations without excessive overhead, while preserving portability and performance across different hardware architectures.
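One common way to represent irregular nested data for data-parallel execution is a flat value array plus a per-segment length array. The sketch below is only an illustration of that idea, not the implementation used by any of the papers listed here; the names `flatten` and `segmented_sum` are hypothetical. It is written sequentially in Python, but every step is a uniform operation over flat arrays (a prefix sum followed by an element-wise lookup), which is the kind of operation that maps well onto parallel hardware.

```python
from itertools import accumulate


def flatten(nested):
    """Turn a list of variable-length rows into (values, lengths)."""
    values = [x for row in nested for x in row]
    lengths = [len(row) for row in nested]
    return values, lengths


def segmented_sum(values, lengths):
    """Sum each segment using one prefix sum over the flat values."""
    prefix = [0] + list(accumulate(values))   # prefix[i] = sum(values[:i])
    ends = list(accumulate(lengths))          # exclusive end index of each segment
    starts = [0] + ends[:-1]                  # start index of each segment
    # Each segment's sum is the difference of two prefix-sum entries.
    return [prefix[e] - prefix[s] for s, e in zip(starts, ends)]
```

For example, `flatten([[1, 2, 3], [], [4, 5]])` yields the values `[1, 2, 3, 4, 5]` with lengths `[3, 0, 2]`, and `segmented_sum` on that pair returns one sum per original row, including the empty one.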

Key finding: The paper presents NESL, the first portable nested data-parallel language supporting nested data structures and nested data-parallel functions, enabling concise expression of parallel algorithms over irregular data like... Read more
by Wolfram Schulte and 1 more
Key finding: The study proposes an intermediate language and runtime scheduling method to map independent traversals of multiple irregular pointer-based data structures (like forests of decision trees and regular expressions) onto SIMD... Read more
Key finding: This work investigates intra- and inter-object parallelism for query processing on complex, nested objects such as 3D models or molecules in non-standard databases. It proposes a layered architecture using nested transactions... Read more

2. What strategies enable efficient hybrid parallelism for training large-scale deep neural networks beyond conventional data or model parallelism?

As deep learning models grow in size and complexity, data parallelism or model parallelism alone is often insufficient, whether because of memory capacity or communication efficiency. This research theme explores techniques combining or extending data, model, pipeline, and operator-level parallelism to efficiently train huge networks on distributed systems. Key questions include how to split models and data, partition operators, optimize communication, and schedule tasks for maximal scalability and throughput while maintaining training accuracy.

Key finding: RaNNC middleware automatically partitions PyTorch models for hybrid parallelism by identifying atomic subcomponents and grouping them into blocks, then using dynamic programming to find balanced pipeline partitions fitting... Read more
Key finding: Alpa framework unifies data, operator (intra-operator), and pipeline (inter-operator) parallelisms into a hierarchical execution plan space and automatically derives efficient distributed training plans. By mapping... Read more
Key finding: SplitBrain introduces hybrid parallelism for distributed CNN training by combining data parallelism and model parallelism with layer-specific partitioning. Compute-intensive convolution layers are co-located while... Read more
Key finding: This work extends data parallelism by incorporating spatial parallelism to partition single samples in large 3D CNNs, enabling strong scaling beyond mini-batch size limitations. Implemented within LBANN for CosmoFlow and 3D... Read more

3. How can task-parallel pipeline programming models and asynchronous execution improve performance and composability in parallel algorithms?

Pipeline parallelism is a fundamental pattern capturing sequences of task stages with dependencies, common in streaming and hierarchical computations. Current pipeline frameworks focus on data-centric abstractions which are convenient but can be inefficient and inflexible for purely task-parallel pipeline algorithms. This theme investigates programming models, scheduling algorithms, and runtime techniques that separate data abstraction from pipeline task scheduling, enhance composability with other parallel paradigms, and enable efficient dynamic load balancing and resource utilization.

Key finding: Pipeflow is a novel C++ task-parallel pipeline programming framework built atop the Taskflow system that decouples pipeline scheduling from data abstractions. It provides a composable interface enabling users to explicitly... Read more
Key finding: The Encore programming language integrates active object parallelism with unshared local heaps and capabilities to guarantee race-free concurrency without complex synchronization. Combining message-based concurrency with... Read more

All papers in Data Parallelism

Enormous improvements in efficiency can be achieved through exploiting parallelism and realizing implementation in hardware. On the other hand, conventional methods for achieving these improvements are traditionally costly, complex and... more
Emerging computing architectures such as near-memory computing (NMC) promise improved performance for applications by reducing the data movement between CPU and memory. However, detecting such applications is not a trivial task. In this... more
Current developments in computing have shown the advantage of using one or more Graphics Processing Units (GPUs) to boost the performance of many computationally intensive applications, but there are still limits to these GPU-enhanced... more
In this paper, we present our joint efforts to design and develop parallel implementations of the GNU Scientific Library for a wide variety of parallel platforms.
In this paper, we explore the newly introduced array notation syntax extension in a recent release of the Intel Compiler with a few representative quantitative finance workloads. We will explore the array syntax both as an abstraction tool to... more
The size of deep neural networks (DNNs) grows rapidly as the complexity of the machine learning algorithm increases. To satisfy the requirement of computation and memory of DNN training, distributed deep learning based on model... more
An Image Mining System (IMS) requires real-time processing, often using special-purpose hardware. The work presented here applies cluster computing to online image processing inside an IMS, where the end user... more
This paper presents the concept of adaptive programs, whose computation and communication structures can morph to adapt to environmental and demand changes to save energy and computing resources. In this approach, programmers write one... more
The introduction of General Purpose computation on GPUs (GPGPU) has changed the landscape for the future of parallel computing. At the core of this phenomenon are massively-multithreaded, data-parallel architectures possessing impressive... more
Semantics is a field of language study that examines meaning across the varied uses of language as a medium. This discussion of meaning covers the whole of the meaning the writer wishes to convey, whether it is related to... more
The implementation of a parallel functional language is discussed. 2DT programs are composed of local SPMD computations and global transformations of 2-dimensional data structures, leading to a coarse-grain compute-communicate scheme. The... more
The Network of Tasks (NOT) model allows adaptive node programs written in a variety of parallel languages to be connected together in an almost acyclic task graph. The main difference between NOT and other task graphs is that it is... more
The use of graphics hardware for non-graphics applications has become popular among many scientific programmers and researchers, as its theoretical performance has increased at a higher rate than that of CPUs in recent years. However,... more
The main objective of this paper is to provide a state-of-the-art survey of advanced optimization methods used in machine learning. It starts with a short introduction to machine learning followed by the formulation of optimization... more
Data Parallelism (DP) and Model Parallelism (MP) are two common paradigms to enable large-scale distributed training of neural networks. Recent trends, such as the improved model performance of deeper and wider neural networks when... more
In this paper, we consider hybrid parallelism, a paradigm that employs both Data Parallelism (DP) and Model Parallelism (MP), to scale distributed training of large recommendation models. We propose a compression framework called Dynamic... more
We explore two different threading approaches on a graphics processing unit (GPU) exploiting two different characteristics of the current GPU architecture. The fat thread approach tries to minimize data access time by relying on shared... more
Image processing is widely used in many applications, including medical imaging, industrial manufacturing and security systems. In these applications, the size of the image is often very large, the processing time should be very small and... more
The Cell Broadband Engine (BE) processor provides the potential to achieve an impressive level of performance for scientific applications. This level of performance can be reached by exploiting several dimensions of parallelism, such as... more
We explore the link between dependence abstractions and maximal parallelism extraction in nested loops. Our goal is to find, for each dependence abstraction, the minimal transformations needed for maximal parallelism extraction. The... more
This paper describes and evaluates three architectural methods for accomplishing data parallel computation in a programmable embedded system. Comparisons are made between the well-studied Very Long Instruction Word (VLIW) and Single... more
Knowledge distillation (KD) is a popular method for compressing large language models (LLMs) into smaller, faster, more efficient versions without sacrificing performance. However, it is no longer feasible to run distillation on a... more
Task-based libraries such as Intel's Threading Building Blocks (TBB) provide higher levels of abstraction than threads for parallel programming. Work remains, however, to determine how straightforward it is to use these libraries to... more
This paper describes a numerical method for the parallel solution of the differential measure inclusion problem posed by mechanical multibody systems containing bilateral and unilateral frictional constraints. The method proposed has been... more
With current systems, some important complex queries may take days to complete because of: (1) the volume of data to be processed, (2) limited aggregate resources. Introducing parallelism addresses the first problem. Cheaper, but powerful... more
Automatic optimization of application-specific instruction-set processor (ASIP) architectures mostly focuses on the internal memory hierarchy design, or the extension of reduced instruction-set architectures with complex custom... more
SIMD (single instruction multiple data)-type processors have been found very efficient in image processing applications, because their repetitive structure is able to exploit the huge amount of data-level parallelism in pixel-type... more
Despite the various research initiatives and proposed programming models, efficient solutions for parallel programming in HPC clusters still rely on a complex combination of different programming models (e.g., OpenMP and MPI), languages... more
Codes written in a naive way seldom effectively exploit the computing resources, while writing optimized codes is usually a complex task that requires certain levels of expertise. This problem is further increased in the presence of... more
Heterogeneous devices require much more work from programmers than traditional CPUs, particularly when there are several of them, as each one has its own memory space. Multidevice applications require distributing kernel executions and,... more
Multicore machines are becoming common. There are many languages, language extensions and libraries devoted to improving the programmability and performance of these machines. In this paper we compare two libraries that address the problem of... more
Machine learning can provide deep insights into data, allowing machines to make high-quality predictions, and has been widely used in real-world applications such as text mining, visual classification, and recommender systems. However,... more
The Abstract Parallel Machine (APM) model separates the definitions of parallel operations from the application algorithm, which defines the sequence of parallel operations to be executed. An APM contains a set of parallel operation definitions, which... more
SystemML aims at declarative, large-scale machine learning (ML) on top of MapReduce, where high-level ML scripts with R-like syntax are compiled to programs of MR jobs. The declarative specification of ML algorithms enables---in contrast... more
Feautrier's scheduling algorithm is the most powerful existing algorithm for parallelism detection and extraction. But it has always been known to be suboptimal. However, the question whether it may miss some parallelism because of its... more