Academia.edu

Modulo Scheduling

180 papers
24 followers
About this topic
Modulo scheduling is a compiler optimization technique for software pipelining, used in instruction scheduling of loops on superscalar and VLIW architectures. It overlaps the execution of successive loop iterations, starting each new iteration at a fixed initiation interval (II), so that available resources are used effectively while data dependencies and resource constraints are respected.

Key research themes

1. How can modulo scheduling algorithms be scaled and optimized for efficient loop pipelining in high-level synthesis and VLIW architectures?

This theme focuses on improving the scalability, efficiency, and quality of modulo scheduling algorithms, particularly for loop pipelining in High-Level Synthesis (HLS) targeting hardware like VLIW processors and heterogeneous systems. It addresses key challenges including minimizing initiation intervals (II), balancing computation time with schedule quality, integrating resource constraints, and reducing leakage power through scheduling optimizations. Advances aim to reduce compilation time while maintaining high throughput, enabling practical acceleration of large, complex loops.
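The initiation interval mentioned above is usually bounded from below before any schedule is searched for. The following is a minimal sketch of that textbook bound, not the method of any paper listed here. It assumes every operation occupies one unit of a single resource type for one cycle, and that the dependence cycles of the loop are already known as (total latency, total iteration distance) pairs. All names (`res_mii`, `rec_mii`, `min_ii`, the example loop) are hypothetical.

```python
from math import ceil

def res_mii(op_resources, units):
    """Resource bound: for each resource type, ceil(ops using it / units available)."""
    counts = {}
    for resource in op_resources.values():
        counts[resource] = counts.get(resource, 0) + 1
    return max(ceil(n / units[r]) for r, n in counts.items())

def rec_mii(cycles):
    """Recurrence bound: for each dependence cycle, ceil(latency / distance)."""
    return max((ceil(lat / dist) for lat, dist in cycles), default=1)

def min_ii(op_resources, units, cycles):
    """Lower bound on the initiation interval: the larger of the two bounds."""
    return max(res_mii(op_resources, units), rec_mii(cycles))

# Hypothetical loop body: three memory operations, two ALU operations.
ops = {"load_a": "mem", "load_b": "mem", "mul": "alu", "add": "alu", "store": "mem"}
units = {"mem": 1, "alu": 2}   # assumed machine: one memory port, two ALUs
cycles = [(4, 1)]              # one recurrence: 4 cycles of latency, distance 1

print(min_ii(ops, units, cycles))
```

A scheduler would typically start at this bound and increase the II until a valid schedule is found; the bound itself does not guarantee that a schedule exists at that II.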

Key finding: Proposes a new modulo scheduling algorithm that reformulates the classical problem, separating scheduling and allocation, resulting in linear scalability with loop size compared to previous quadratic methods, and enabling a...
Key finding: Introduces a hybrid method combining decomposed software pipelining to obtain a valid retiming and an integer linear programming (ILP) formulation with reduced size to solve the resource-constrained modulo scheduling problem...
Key finding: Develops a leakage-aware modulo scheduling algorithm tailored for VLIW architectures with dual-threshold domino logic, maximizing idle time of functional units and reducing transitions between active and sleep modes...
Key finding: Introduces a novel framework that reformulates modulo scheduling as an acyclic scheduling problem on a 'regular unwinded' problem with a regularity constraint enforcing fixed spacing between operation instances. This reduces...

2. How can modulo scheduling be effectively applied and optimized on Coarse-Grained Reconfigurable Architectures (CGRAs) for loop-level parallelism?

This research area investigates modulo scheduling and mapping techniques to exploit loop-level parallelism on CGRAs. CGRAs offer a balance between programmability and power efficiency, but effective compiler support is crucial, particularly in scheduling, placement, routing, and managing resource constraints. The challenges include integrating scheduling with mapping and routing, handling limited registers and memory ports, and producing mappings that maximize throughput and resource utilization. Advances include heuristic and metaheuristic algorithms, routing-aware scheduling frameworks, and exploiting recomputation to overcome resource limitations.
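The resource side of such a mapping can be illustrated with a modulo reservation table. This is a deliberately simplified sketch, assuming each operation occupies one processing element (PE) for one cycle and ignoring routing, registers, and memory ports entirely. The function name and the example schedule are hypothetical and not taken from any of the tools named below.

```python
def find_conflicts(schedule, ii):
    """Return pairs of operations that claim the same PE in the same slot modulo ii.

    schedule maps an operation name to (start_cycle, pe_name).
    """
    table = {}       # (cycle mod ii, pe) -> operation already holding that slot
    conflicts = []
    for op, (start, pe) in schedule.items():
        key = (start % ii, pe)
        if key in table:
            conflicts.append((table[key], op))
        else:
            table[key] = op
    return conflicts

# Hypothetical placement of four operations on two PEs.
schedule = {"a": (0, "pe0"), "b": (1, "pe0"), "c": (2, "pe0"), "d": (3, "pe1")}

print(find_conflicts(schedule, ii=2))  # "a" and "c" collide: both use pe0 in slot 0
print(find_conflicts(schedule, ii=3))  # no collisions at the larger interval
```

Because iterations repeat every II cycles, two operations scheduled II cycles apart on the same PE collide in steady state even though their absolute start times differ.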

Key finding: Presents RAMP, a CGRA mapping approach that integrates various routing options explicitly and intelligently before scheduling to improve ability and quality of data routing. By considering routing through PEs, registers,...
Key finding: Proposes MCHPSO, a Modulo-Constrained Hybrid Particle Swarm Optimization algorithm for software pipelining that schedules, places, and routes loops onto CGRAs simultaneously. Experiments on DSP benchmarks and ADRES...
Key finding: Introduces a general problem formulation for application mapping on CGRAs that includes re-computation along with routing to alleviate resource limitations. EPIMap transforms input dependency graphs into epimorphic...

3. How can resource constraints, such as memory ports and fairness, be incorporated and optimized within modulo scheduling frameworks?

This theme explores integrating resource constraints—like memory bandwidth limits, fairness across multiple scheduling days, and resource sharing—into modulo scheduling models. The goal is to maintain high throughput while minimizing resource usage or ensuring equitable service. Research addresses trade-offs between ideal execution times and resource usage, multi-day scheduling fairness, and the application of Boolean and pseudo-Boolean optimization techniques to reduce model sizes and improve solution scalability.
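For the memory-port case, the trade-off between throughput and resource usage can be made concrete with a small count. The sketch below assumes each memory access occupies one port for exactly one cycle; the function name and the start times are hypothetical, and this is only an illustration of the constraint, not the optimization method of any paper listed here.

```python
from collections import Counter

def ports_needed(mem_op_starts, ii):
    """Ports required in steady state: the busiest slot modulo ii."""
    slots = Counter(start % ii for start in mem_op_starts)
    return max(slots.values(), default=0)

# Two hypothetical schedules of the same four memory accesses, both with II = 2.
clustered = [0, 2, 4, 5]   # three accesses fall into slot 0
balanced = [0, 1, 4, 5]    # accesses spread evenly over both slots

print(ports_needed(clustered, ii=2))
print(ports_needed(balanced, ii=2))
```

Both schedules sustain the same initiation interval, but the second needs fewer ports, which is the kind of saving a resource-aware formulation looks for.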

Key finding: Focuses on reducing the required number of memory ports in modulo scheduling for FPGA synthesis while preserving the minimal initiation interval (ideal parallelism). By targeting 'gradual' solutions that optimize resources...
Key finding: Introduces the equitable scheduling problem that generalizes single-machine scheduling by considering multiple days and guaranteeing each client meets deadlines in at least k out of m days. This model addresses fairness in...
Key finding: Demonstrates that reformulating scheduling problems, including round-robin and job-shop scheduling, from Boolean satisfiability (SAT) to linear pseudo-Boolean (PB) constraints can significantly reduce model sizes without...

All papers in Modulo Scheduling

Current trends in many-core architectures show a switch from a small number of architecturally sophisticated cores (e.g. Intel Core2, IBM PowerPC) to many simple cores (e.g. SiCortex and Tilera multiprocessors). These simple cores lack many...
Application-specific optimization of embedded systems becomes inevitable to satisfy the market demand for designers to meet tighter constraints on cost, performance and power. On the other hand, the flexibility of a system is also...
In current embedded systems processors, multi-ported register files are one of the most power hungry parts of the processor, even when they are clustered. This paper presents a novel register file architecture, which has single ported...
Users expect future handheld devices to provide extended multimedia functionality and have long battery life. This type of application imposes heavy constraints on both (realtime) performance and energy consumption and forces designers to...
Given a spatial crime data warehouse that is updated infrequently, a set of operations O, and constraints on storage and update overheads, the index type selection problem is to find a set of index types that can reduce the I/O...
Digital Signal Processing (DSP) architectures are specialized for high performance numerical algorithms such as those found in communication and multimedia applications. The development of efficient compilers for DSP processors is a...
Pharmacological treatment of chronic venous disease (CVD) at CEAP clinical stages C3 and C4 has been little evaluated. Objective: To assess whether treatment with sulodexide is effective in CVD at CEAP clinical stages C3 and C4. Materials and...
This paper describes complementary software- and hardware-based approaches for handling overlapping register lifetimes that occur in modulo scheduled loops. Modulo scheduling takes the N instructions in a loop body and constructs an M-stage...
This paper presents AGAMOS, a technique to modulo schedule loops on clustered micro-architectures. The proposed scheme uses a multi-level graph partitioning strategy to distribute the workload among clusters and reduces the number of...
This paper presents effective metrics to evaluate the power dissipation of scheduled data flow graphs (DFGs). This enables early evaluation of schedules without performing the computationally expensive resource-binding step. Our metrics...
Scheduling for speculative parallelization is a problem that remained unsolved despite its importance. Simple methods such as Fixed-Size Chunking (FSC) need several 'dry-runs' before an acceptable chunk size is found. Other traditional...
High-Level Synthesis tools have been increasingly used within the hardware design community to bridge the gap between productivity and the need to design large and complex systems. When targeting heterogeneous systems, where the CPU and...
SPAID is a design tool that maps digital signal processing (DSP) algorithms into a multibus VLSI architecture. Algorithm structure, design style of functional units (FUs), and parallelism of the architecture are all explored in...
Nano-sized molecular motors, which consume chemicals and do mechanical work are ubiquitous in nature. One of the most powerful such motors is the viral packaging motor, which consumes ATP and packages the viral DNA into the procapsid (the...
In this paper, we develop a method for automatically selecting groups of loops to fuse in an image processing data flow graph, here referred to as a "fusing configuration". The method is designed for use on Digital Signal Processors...
Polychlorinated dioxins (PCDD) and dibenzofurans (PCDF) have been identified in technical products and pesticides, most of which are not very widely used today. Other sources are incinerators of various types like MSW incinerators,...
Coarse-grained reconfigurable architectures (CGRA) are designed to deliver high-performance computing while drastically reducing the latency of the computing system. Although they are often highly domain-specifically optimized, they keep...
Soft cores are used as flexible software programmable components in FPGA designs. Transport-Triggered Architecture (TTA) is interesting for this use due to its scalability, modularity, simplified register files (RF) and fine-grained...
The testability of an MCM can be enhanced significantly for very little cost whenever a reprogrammable FPGA component that is already embedded in the MCM for functionality is utilized for diagnostics. This approach can have some of the...
Ayala, Artigues, Gacias (LAAS). Lagrangian relaxation for the RCMSP. ISCO 2010. Outline: 3. Lagrangian relaxation; 4. Experimental results; 5. Conclusion and...
We revisit the optimal code generation or evaluation order determination problem, that is, the problem of generating an instruction sequence from a data dependence graph (DDG). In particular, we are interested in generating an instruction sequence...
Exploiting instruction-level parallelism (ILP) is extremely important for achieving high performance in application specific instruction set processors (ASIPs) and embedded processors. Existing techniques deal with either scheduling...
In this paper we propose Co-Scheduling, a framework for simultaneous design of hardware pipeline structures and software-pipelined schedules. Two important components of the Co-Scheduling framework are: (1) An extension to the...
Instruction scheduling methods which use the concepts developed by the classical pipeline theory have been proposed for architectures involving deeply pipelined function units. These methods rely on the construction of state diagrams (or...
Usual periodic scheduling problems deal with precedence
One of the characteristics of the Madurese variety used in Situbondo Regency is its lexical differences. Focusing on the Madurese variety used by people to communicate in their daily life, this study aims to describe the lexical...
Coarse-Grained Reconfigurable Architecture (CGRA) is a high-performance computing architecture. However, existing CGRA silicon utilization is low due to the lack of fine-grained parallelism inside Processing Element (PE) and general...
Instruction Scheduling is the task of deciding what instruction will be executed at which unit of time. The objective is to extract maximum instruction level parallelism for the code. Compilers designed for VLIW and EPIC architectures do...
Packet filters play an essential role in traffic management and security management on the Internet. In order to create software-based packet filters that are fast enough to work even under a DoS attack, it is vital to effectively combine...
Increasing performance, while at the same time reducing power consumption, is a major design tradeoff in current microprocessors. In this paper, we investigate the potential of using a heterogeneous clustered VLIW microarchitecture. In...
Energy has emerged as a critical constraint for a large number of portable, wireless devices. For data intensive applications, a significant amount of energy is dissipated in the memory. Advanced memory architectures support multiple...
Software Pipelining is a fine-grain loop optimization technique for architectures that support synchronous parallel execution. We compare Lam's software pipelining algorithm with Ebcioğlu and Nakatani's technique. This research seems to...
Under timing constraints, local compaction may fail because of poor scheduling decisions. Su [SDWX87] uses foresight to avoid some of the poor scheduling decisions. However, the foresight takes a considerable amount of time. In this paper...
Historically, instruction schedulers have been developed in an ad hoc manner. This paper explores using one scheduler for a number of different architectures and the ramifications of this. In order to achieve this generality, a machine...
Coarse-grained reconfigurable architectures (CGRA) require many processing elements (PEs) and a configuration memory unit (configuration cache) for reconfiguration of its PE array. Although this structure is meant for high performance and...
Stream languages explicitly describe fork-join parallelism and pipelines, offering a powerful programming model for many-core Multi-Processor Systems on Chip (MPSoC). In an embedded resource-constrained system, adapting stream programs to...
In this paper, we focus on the resource-constrained modulo scheduling problem, a general periodic scheduling problem, abstracted from the problem solved by compilers when optimizing inner loops at instruction level for VLIW parallel...
This paper describes a generalisation of modulo scheduling to parallelise loops for SpMT processors that exploits simultaneously both instruction-level parallelism and thread-level parallelism while preserving the simplicity and...