Data Parallelism

674 papers

2,788 followers

About this topic

Data parallelism is a computational paradigm that involves distributing data across multiple processing units, allowing simultaneous execution of operations on the data. This approach enhances performance and efficiency in processing large datasets by leveraging parallel computing architectures, such as multi-core processors or distributed systems.
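As a minimal sketch of the definition above, the following Python splits a list into chunks, applies the same operation to every chunk concurrently, and combines the partial results. The function names (`split_into_chunks`, `sum_of_squares`, `parallel_sum_of_squares`) are hypothetical and chosen only for illustration. A thread pool keeps the example self-contained; in CPython, threads do not speed up CPU-bound work, so this shows the structure of the pattern rather than a performance gain.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, num_chunks):
    """Split data into num_chunks contiguous, nearly equal slices."""
    size, extra = divmod(len(data), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        # The first `extra` chunks each take one additional element.
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def sum_of_squares(chunk):
    """The single operation that every worker applies to its own chunk."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data, num_workers=4):
    """Distribute the data, run the same operation per chunk, then combine."""
    chunks = split_into_chunks(data, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_sums = list(pool.map(sum_of_squares, chunks))
    return sum(partial_sums)
```

The key design point is that the workers share no state: each one sees only its own chunk, and the only coordination is the final reduction over the partial sums.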

Key research themes

1. How can nested and irregular data-parallelism be implemented efficiently for algorithms on complex data structures?

This research theme addresses the challenge of supporting efficient data-parallel computations on irregular and nested data structures such as graphs, sparse matrices, trees, and complex objects within parallel computing frameworks and languages. Efficient execution models and programming abstractions are needed that can represent irregular data and enable parallel traversals or computations without excessive overhead, while preserving portability and performance across different hardware architectures.

Implementation of a Portable Nested Data-Parallel Language

by Jonathan C Hardwick

2022

Key finding: The paper presents NESL, the first portable nested data-parallel language supporting nested data structures and nested data-parallel functions, enabling concise expression of parallel algorithms over irregular data like...

SIMD Parallelization of Applications that Traverse Irregular Data Structures

by Wolfram Schulte and

2015

Key finding: The study proposes an intermediate language and runtime scheduling method to map independent traversals of multiple irregular pointer-based data structures (like forests of decision trees and regular expressions) onto SIMD...

Parallelism in Processing Queries on Complex Objects

by Harald Schöning

2016

Key finding: This work investigates intra- and inter-object parallelism for query processing on complex, nested objects such as 3D models or molecules in non-standard databases. It proposes a layered architecture using nested transactions...

2. What strategies enable efficient hybrid parallelism for training large-scale deep neural networks beyond conventional data or model parallelism?

As deep learning models grow in size and complexity, data parallelism or model parallelism alone often becomes insufficient, whether because of memory capacity or communication efficiency. This research theme explores techniques that combine or extend data, model, pipeline, and operator-level parallelism to train very large networks efficiently on distributed systems. Key questions include how to split models and data, partition operators, optimize communication, and schedule tasks for maximal scalability and throughput while maintaining training accuracy.
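To make the data-parallel baseline in this theme concrete, here is a small sketch (not taken from any of the papers listed) of data-parallel gradient computation for a one-parameter model y = w * x under mean squared error. Each "replica" computes a gradient on its own shard of the batch, and the results are averaged; in a real distributed system that averaging step is typically an all-reduce across devices. The function names are hypothetical, the replicas run sequentially here, and equal-sized shards are assumed.

```python
def gradient(w, xs, ys):
    """Gradient of mean squared error with respect to w for the model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def data_parallel_gradient(w, xs, ys, num_replicas):
    """Each replica computes a gradient on its shard; the results are averaged.

    Assumes len(xs) is divisible by num_replicas so all shards are equal in size.
    """
    shard = len(xs) // num_replicas
    local_grads = [
        gradient(w, xs[i * shard:(i + 1) * shard], ys[i * shard:(i + 1) * shard])
        for i in range(num_replicas)
    ]
    # Stand-in for the all-reduce (average) that a distributed system would perform.
    return sum(local_grads) / num_replicas
```

With equal-sized shards, the averaged gradient matches the full-batch gradient, which is why pure data parallelism leaves the training mathematics unchanged. It does not reduce per-device model memory, which is one motivation for the hybrid schemes in this theme.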

Automatic Graph Partitioning for Very Large-scale Deep Learning

by Toshihiro Hanawa

2024, arXiv (Cornell University)

Key finding: RaNNC middleware automatically partitions PyTorch models for hybrid parallelism by identifying atomic subcomponents and grouping them into blocks, then using dynamic programming to find balanced pipeline partitions fitting...

Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning

by Yonghao Zhuang

2022

Key finding: Alpa framework unifies data, operator (intra-operator), and pipeline (inter-operator) parallelisms into a hierarchical execution plan space and automatically derives efficient distributed training plans. By mapping...

SplitBrain: Hybrid Data and Model Parallel Deep Learning

by Erik Kruus

2023, arXiv (Cornell University)

Key finding: SplitBrain introduces hybrid parallelism for distributed CNN training by combining data parallelism and model parallelism with layer-specific partitioning. Compute-intensive convolution layers are co-located while...

Implementing a neural network interatomic model with performance portability for emerging exascale architectures

by James Belak

2022, Computer Physics Communications

Key finding: This work extends data parallelism by incorporating spatial parallelism to partition single samples in large 3D CNNs, enabling strong scaling beyond mini-batch size limitations. Implemented within LBANN for CosmoFlow and 3D...

3. How can task-parallel pipeline programming models and asynchronous execution improve performance and composability in parallel algorithms?

Pipeline parallelism is a fundamental pattern capturing sequences of task stages with dependencies, common in streaming and hierarchical computations. Current pipeline frameworks focus on data-centric abstractions which are convenient but can be inefficient and inflexible for purely task-parallel pipeline algorithms. This theme investigates programming models, scheduling algorithms, and runtime techniques that separate data abstraction from pipeline task scheduling, enhance composability with other parallel paradigms, and enable efficient dynamic load balancing and resource utilization.

Pipeflow: An Efficient Task-Parallel Pipeline Programming Framework using Modern C++

by Zhishan Guo

2025, arXiv (Cornell University)

Key finding: Pipeflow is a novel C++ task-parallel pipeline programming framework built atop the Taskflow system that decouples pipeline scheduling from data abstractions. It provides a composable interface enabling users to explicitly...

Parallel Objects for Multicores: A Glimpse at the Parallel Language Encore

by Silvia Lizeth Tapia Tarifa

2022, Lecture Notes in Computer Science

Key finding: The Encore programming language integrates active object parallelism with unshared local heaps and capabilities to guarantee race-free concurrency without complex synchronization. Combining message-based concurrency with...

All papers in Data Parallelism

Formal behavioural synthesis of Handel-C parallel hardware implementations from functional specifications

by Ali E. Abdallah

2026

Enormous improvements in efficiency can be achieved through exploiting parallelism and realizing implementation in hardware. On the other hand, conventional methods for achieving these improvements are traditionally costly, complex and...

Memory and Parallelism Analysis Using a Platform-Independent Approach

by Ahsan Awan

2026, Proceedings of the 22nd International Workshop on Software and Compilers for Embedded Systems

Emerging computing architectures such as near-memory computing (NMC) promise improved performance for applications by reducing the data movement between CPU and memory. However, detecting such applications is not a trivial task. In this...

Architecture--Performance Interrelationship Analysis in Single/Multiple CPU/GPU Computing Systems: Application to Composite Process Flow Modeling

by Richard Haney

2026

Current developments in computing have shown the advantage of using one or more Graphic Processing Units (GPU) to boost the performance of many computationally intensive applications but there are still limits to these GPU-enhanced...

Toward the parallelization of GSL

by José Manuel Badia Contelles

2026, The Journal of Supercomputing

In this paper, we present our joint efforts to design and develop parallel implementations of the GNU Scientific Library for a wide variety of parallel platforms.

Extract and Extend Parallelism using C/C++ Extension for Array Notation on Multicore and Many-core Platforms

by Robert Geva

2026

In this paper, we explore the newly introduced array notation syntax extension in a recent release of the Intel Compiler with a few representative quantitative finance workloads. We will explore the array syntax both as an abstraction tool to...

Extract and Extend Parallelism using C/C++ Extension for Array Notation on Multicore and Many-core Platforms

by Robert Geva

2026, Proceedings of ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming

BaPipe: Exploration of Balanced Pipeline Parallelism for DNN Training

by Tianqi Wang

2026, arXiv (Cornell University)

The size of deep neural networks (DNNs) grows rapidly as the complexity of the machine learning algorithm increases. To satisfy the requirement of computation and memory of DNN training, distributed deep learning based on model...

A Distributed Image Processing Function Set for an Image Mining System

by Roberto Guerrero

2026

An Image Mining System (IMS) requires real time processing often using special purpose hardware. The work herein presented refers to the application of cluster computing for on line image processing inside an IMS, where the end user...

StreaMorph: A case for synthesizing energy-efficient adaptive programs using high-level abstractions

by Dai Bui

2026, 2013 Proceedings of the International Conference on Embedded Software (EMSOFT)

This paper presents the concept of adaptive programs, whose computation and communication structures can morph to adapt to environmental and demand changes to save energy and computing resources. In this approach, programmers write one...

From algorithm parallelism to instruction-level parallelism

by Uzi Vishkin

2026

Exploiting Memory Access Patterns to Improve Memory Performance in Data-Parallel Architectures

by David Kaeli

2026, IEEE Transactions on Parallel and Distributed Systems

The introduction of General Purpose computation on GPUs (GPGPU) has changed the landscape for the future of parallel computing. At the core of this phenomenon are massively-multithreaded, data-parallel architectures possessing impressive...

Makna dalam puisi terpilih Usman Awang

by Nur Amirah Che Soh

2026

Semantics is one of the fields of language study that examines meaning across the varied uses of language as a medium. This discussion of meaning covers the full meaning the writer wishes to convey, whether it is linked to...

Implementing 2DT on a multiprocessor

by Yosi Asher

2026, Lecture Notes in Computer Science

The implementation of a parallel functional language is discussed. 2DT programs are composed of local SPMD-computations and global transformations of 2-dimensional data structures leading to a coarse grain compute-communicate scheme. The...

Building programs in the network of tasks model

by Susanna Pelagatti

2025, Proceedings of the 2000 ACM symposium on Applied computing - SAC '00

The Network of Tasks (NOT) model allows adaptive node programs written in a variety of parallel languages to be connected together in an almost acyclic task graph. The main difference between NOT and other task graphs is that it is...

Skeletons for data parallelism in p3l

by Susanna Pelagatti

2025, Lecture Notes in Computer Science

Benchmarking Data and Compute Intensive Applications on Modern CPU and GPU Architectures

by Pawel Gepner

2025, Procedia Computer Science

The use of graphics hardware for non-graphics applications has become popular among many scientific programmers and researchers as we have observed a higher rate of theoretical performance increase than the CPUs in recent years. However,...

A State-of-the-art Survey of Advanced Optimization Methods in Machine Learning

by Marenglen Biba

2025

The main objective of this paper is to provide a state-of-the-art survey of advanced optimization methods used in machine learning. It starts with a short introduction to machine learning followed by the formulation of optimization...

Fast Distributed Training of Deep Neural Networks: Dynamic Communication Thresholding for Model and Data Parallelism

by Ping Tak Peter Tang

2025, ArXiv

Data Parallelism (DP) and Model Parallelism (MP) are two common paradigms to enable large-scale distributed training of neural networks. Recent trends, such as the improved model performance of deeper and wider neural networks when...

Training Recommender Systems at Scale: Communication-Efficient Model and Data Parallelism

by Ping Tak Peter Tang

2025, Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining

In this paper, we consider hybrid parallelism, a paradigm that employs both Data Parallelism (DP) and Model Parallelism (MP), to scale distributed training of large recommendation models. We propose a compression framework called Dynamic...

Fat versus Thin Threading Approach on GPUs: Application to Stochastic Simulation of Chemical Reactions

by M. Giles

2025, IEEE Transactions on Parallel and Distributed Systems

We explore two different threading approaches on a graphics processing unit (GPU) exploiting two different characteristics of the current GPU architecture. The fat thread approach tries to minimize data access time by relying on shared...

Designing Area and Performance Constrained SIMD/VLIW Image Processing Architectures

by Henk Corporaal

2025, Lecture Notes in Computer Science

Image processing is widely used in many applications, including medical imaging, industrial manufacturing and security systems. In these applications, the size of the image is often very large, the processing time should be very small and...

A Distributed Image Processing Function Set for an Image Mining System

by Juan Fernandez

2025

Multicore Surprises: Lessons Learned from Optimizing Sweep3D on the Cell Broadband Engine

by Juan Fernandez

2025, 2007 IEEE International Parallel and Distributed Processing Symposium

The Cell Broadband Engine (BE) processor provides the potential to achieve an impressive level of performance for scientific applications. This level of performance can be reached by exploiting several dimensions of parallelism, such as...

Teaching distributed memory programming from mental models

by Victor Eijkhout

2025, Journal of Parallel and Distributed Computing

On the optimality of Allen and Kennedy's algorithm for parallelism extraction in nested loops

by Frédéric Vivien

2025, Springer eBooks

We explore the link between dependence abstractions and maximal parallelism extraction in nested loops. Our goal is to find, for each dependence abstraction, the minimal transformations needed for maximal parallelism extraction. The...

On the Optimality of Allen and Kennedy's Algorithm for Parallelism Extraction in Nested Loops

by Frédéric Vivien

2025, Parallel Algorithms and Applications

A new look at exploiting data parallelism in embedded systems

by Jaime Moreno

2025

This paper describes and evaluates three architectural methods for accomplishing data parallel computation in a programmable embedded system. Comparisons are made between the well-studied Very Long Instruction Word (VLIW) and Single...

A new look at exploiting data parallelism in embedded systems

by Jaime Moreno

2025, Proceedings of the 2003 international conference on Compilers, architecture and synthesis for embedded systems

Scalable knowledge distillation for large language models on multi-GPU systems

by Wary Hossain Rabby

2025, International Journal of Science and Research archive

One well-liked method for condensing massive language models (LLMs) into smaller, faster, more effective versions without sacrificing performance is knowledge distillation (KD). However, it is no longer feasible to run distillation on a...

Expressing pipeline parallelism using TBB constructs

by Eric Reed

2025, Proceedings of the compilation of the co-located workshops on DSM'11, TMC'11, AGERE!'11, AOOPES'11, NEAT'11, & VMIL'11 - SPLASH '11 Workshops

Task-based libraries such as Intel's Threading Building Blocks (TBB) provide higher levels of abstraction than threads for parallel programming. Work remains, however, to determine how straightforward it is to use these libraries to...

A Parallel Algorithm for Solving Complex Multibody Problems With Stream Processors

by Mihai Anitescu

2025, Volume 4: 7th International Conference on Multibody Systems, Nonlinear Dynamics, and Control, Parts A, B and C

This paper describes a numerical method for the parallel solution of the differential measure inclusion problem posed by mechanical multibody systems containing bilateral and unilateral frictional constraints. The method proposed has been...

Parallelism in relational data base systems

by C. Mohan

2025, Proceedings of the second international symposium on Databases in parallel and distributed systems - DPDS '90

With current systems, some important complex queries may take days to complete because of: (1) the volume of data to be processed, (2) limited aggregate resources. Introducing parallelism addresses the first problem. Cheaper, but powerful...

Memory and Parallelism Analysis Using a Platform-Independent Approach

by Henk Corporaal

2025, Proceedings of the 22nd International Workshop on Software and Compilers for Embedded Systems

Exploring processor parallelism: Estimation methods and optimization strategies

by Henk Corporaal

2025, 2013 IEEE 16th International Symposium on Design and Diagnostics of Electronic Circuits & Systems (DDECS)

Automatic optimization of application-specific instruction-set processor (ASIP) architectures mostly focuses on the internal memory hierarchy design, or the extension of reduced instruction-set architectures with complex custom...

DC-SIMD : Dynamic communication for SIMD processors

by Henk Corporaal

2025, 2008 IEEE International Symposium on Parallel and Distributed Processing

SIMD (single instruction multiple data)-type processors have been found very efficient in image processing applications, because their repetitive structure is able to exploit the huge amount of data-level parallelism in pixel-type...

The OpenMP Cluster Programming Model

by Marcio Machado Pereira

2025, Workshop Proceedings of the 51st International Conference on Parallel Processing

Despite the various research initiatives and proposed programming models, efficient solutions for parallel programming in HPC clusters still rely on a complex combination of different programming models (e.g., OpenMP and MPI), languages...

An automatic optimizer for heterogeneous devices

by Diego Andrade

2025, Future Generation Computer Systems

Codes written in a naive way seldom effectively exploit the computing resources, while writing optimized codes is usually a complex task that requires certain levels of expertise. This problem is further increased in the presence of...

High productivity multi-device exploitation with the Heterogeneous Programming Library

by Diego Andrade

2025, Journal of Parallel and Distributed Computing

Heterogeneous devices require much more work from programmers than traditional CPUs, particularly when there are several of them, as each one has its own memory space. Multidevice applications require to distribute kernel executions and,...

Task-Parallel versus Data-Parallel Library-Based Programming in Multicore Systems

by Diego Andrade

2025, 2009 17th Euromicro International Conference on Parallel, Distributed and Network-based Processing

Multicore machines are becoming common. There are many languages, language extensions and libraries devoted to improve the programmability and performance of these machines. In this paper we compare two libraries, that face the problem of...

An Introduction to the Pthales Domain of Ptolemy II

by R. Barrere

2025

A Survey on Large-scale Machine Learning

by Weijie Fu

2025, arXiv (Cornell University)

Machine learning can provide deep insights into data, allowing machines to make high-quality predictions and having been widely used in real-world applications, such as text mining, visual classification, and recommender systems. However,...

Cost Hierarchies for Abstract Parallel Machines

by John O'Donnell

2025, Springer eBooks

The Abstract Parallel Machine (APM) model separates the definitions of parallel operations from the application algorithm, which defines the sequence of parallel operations to be executed. An APM contains a set of parallel operation definitions, which...

Hybrid parallelization strategies for large-scale machine learning in SystemML

by Berthold Reinwald

2025, Proceedings of the VLDB Endowment

SystemML aims at declarative, large-scale machine learning (ML) on top of MapReduce, where high-level ML scripts with R-like syntax are compiled to programs of MR jobs. The declarative specification of ML algorithms enables---in contrast...

On the optimality of Feautrier's scheduling algorithm

by Frédéric Vivien

2025, Concurrency and Computation: Practice and Experience

Feautrier's scheduling algorithm is the most powerful existing algorithm for parallelism detection and extraction. But it has always been known to be suboptimal. However, the question whether it may miss some parallelism because of its...
