Sarma Vrudhula

Papers by Sarma Vrudhula

Heterogeneous FPGA Architecture Using Threshold Logic Gates for Improved Area, Power, and Performance

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2021

Multi-level logic optimization for low power using local logic transformations

Proceedings of International Conference on Computer Aided Design

In this paper we present an efficient technique to reduce the switching activity in a CMOS combinational logic network based on local logic transformations. These transformations consist of adding redundant connections or gates so as to reduce the switching activity. Simple and efficient procedures, based on logic implication, for identifying the sources and targets of the redundant connections are presented. Additionally, procedures that permit the designer to trade off power and delay after the transformations are described. Results of experiments on the MCNC benchmark circuits are given. The results indicate that significant reduction of the switching activities of a CMOS combinational circuit can be achieved with a very low area overhead and low computational cost.

A single-chip, asynchronous echo canceller for high-speed data communication

Proceedings of Eighth International Application Specific Integrated Circuits Conference

A single-chip, 128-coefficient, asynchronous echo canceller has been developed. Cancellation is performed by an FIR filter whose coefficients are adapted using the power-of-two modified LMS algorithm. The pipelined circuit updates all coefficients and generates the filtered output every cycle while allowing a sampling rate greater than 205 kHz.
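
As a rough illustration of the adaptation scheme named in the abstract, the sketch below runs an adaptive FIR echo canceller whose LMS update rounds the error to a signed power of two, so that in hardware the coefficient update could be done with shifts instead of multipliers. The quantizer, the step size `2**-shift`, the tap count, and the echo path in `demo` are all assumptions made for illustration; the abstract does not specify which variant of the algorithm the chip uses.

```python
import math
import random

def pow2_quantize(x):
    """Round x to the nearest signed power of two in the log domain (0 stays 0)."""
    if x == 0.0:
        return 0.0
    return math.copysign(2.0 ** round(math.log2(abs(x))), x)

def echo_cancel(far_end, received, num_taps=8, shift=4):
    """Adaptive FIR echo canceller; the step size 2**-shift is a hypothetical value."""
    w = [0.0] * num_taps              # adaptive coefficients
    x = [0.0] * num_taps              # delay line of far-end samples
    mu = 2.0 ** -shift
    residual = []
    for s, d in zip(far_end, received):
        x = [s] + x[:-1]                              # shift in the newest sample
        y = sum(wi * xi for wi, xi in zip(w, x))      # echo estimate
        e = d - y                                     # residual after cancellation
        q = pow2_quantize(e)                          # power-of-two error
        w = [wi + mu * q * xi for wi, xi in zip(w, x)]
        residual.append(e)
    return w, residual

def demo(num_samples=2000, seed=0):
    """Identify a hypothetical 3-tap echo path from random +/-1 data."""
    rng = random.Random(seed)
    path = [0.5, -0.25, 0.125]
    far_end = [rng.choice([-1.0, 1.0]) for _ in range(num_samples)]
    received = [
        sum(p * far_end[n - k] for k, p in enumerate(path) if n >= k)
        for n in range(num_samples)
    ]
    return echo_cancel(far_end, received)
```

In this noise-free setting the quantized error stays within a factor of √2 of the true error, so the update behaves like LMS with a slightly varying step size and the coefficients approach the assumed echo path.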

Rapid prototyping of networks of asynchronous multiple functional units

Proceedings 8th IEEE International Workshop on Rapid System Prototyping Shortening the Path from Specification to Prototype

The design cycle of the proposed asynchronous multiple functional unit networks, from CAD tool coding to post-layout scalability, adheres to the attributes of rapid prototyping. These attributes come in five flavors: an OOP style in CAD tools; a short design, modify, evaluate and profile cycle at the dataflow graph level; the reuse of predesigned components; effective event realization of the asynchronous behavior; and rapid VLSI realization. At the modeling level, a dataflow graph modeling tool specifies and profiles the asynchronous systems rapidly and accurately. At the architectural level, several multiple functional unit networks illustrate two rarely addressed issues in asynchronous design: modularity and scalability, which are the keys to rapid prototyping. Networks in a distributor approach and a tournament protocol are presented, where fixed and greedy operand assignments are used respectively. The tournament protocol also leads to a short physical design time and a compact VLSI layout for its regular structure.

Algorithm-hardware co-design of single shot detector for fast object detection on FPGAs

Proceedings of the International Conference on Computer-Aided Design, 2018

The rapid improvement in computation capability has made convolutional neural networks (CNNs) a great success in recent years on image classification tasks, which has in turn spurred the development of object detection algorithms with significantly improved accuracy. However, during the deployment phase, many applications demand low-latency processing of a single image under strict power consumption requirements, which reduces the efficiency of GPUs and other general-purpose platforms and creates opportunities for dedicated acceleration hardware, e.g. FPGAs, whose digital circuits can be customized for the inference algorithm. Therefore, this work proposes to customize the detection algorithm, e.g. SSD, to benefit its hardware implementation with low data precision at the cost of marginal accuracy degradation. The proposed FPGA-based deep learning inference accelerator is demonstrated on two Intel FPGAs for the SSD algorithm, achieving up to 2.18 TOPS throughput and up to 3.3× superior energy efficiency compared to a GPU.

THRESHOLD LOGIC GENE REGULATORY MODEL - Prediction of Dorsal-ventral Patterning and Hardware-based Simulation of Drosophila

Proceedings of the First International Conference on Biomedical Electronics and Devices, 2008

Precise characterization of gene regulatory mechanisms is a fundamental problem in developmental biology. In this paper we present a new gene regulatory network (GRN) model which is based on threshold logic (TL). Two different sets of genes are responsible for the cell patterning of the Drosophila embryo. By using the proposed threshold logic gene regulatory model (TLGRM), we derive the different gene regulatory rules for the gene products involved. We use these rules to model and explain the interaction between the genes. Very large or complex gene regulatory networks are difficult to simulate using a general purpose CPU. Specialized programmable hardware provides additional concurrency and is an alternative to a large and expensive cluster of machines. The steady state gene expression predicted by the model clearly mimics the actual wild-type gene expression along the dorsal-ventral axis in the Drosophila embryo. We thus demonstrate that for a well characterized gene regulatory system, the nature and topology of interaction is enough to model gene regulation. We also demonstrate through proof of concept that using hardware-based simulation, it is possible to achieve orders of magnitude of performance improvement over conventional CPU-based simulation.
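
To make the modeling idea concrete: a threshold logic function outputs 1 when a weighted sum of its binary inputs reaches a threshold, and a network of such functions can be iterated until the expression state stops changing. The sketch below shows that mechanism on a toy three-gene network. The gene names (`act`, `rep`, `tgt`), weights, thresholds, and the synchronous update scheme are hypothetical and are not the rules derived in the paper.

```python
def threshold_gate(inputs, weights, threshold):
    """Return 1 if the weighted sum of binary inputs reaches the threshold, else 0."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= threshold else 0

def steady_state(state, rules, max_steps=100):
    """Synchronously update all regulated genes until the state stops changing.

    state: dict gene -> 0/1
    rules: dict gene -> (regulator names, weights, threshold)
    Genes without a rule are treated as fixed inputs.
    """
    for _ in range(max_steps):
        nxt = dict(state)
        for gene, (regs, weights, thr) in rules.items():
            nxt[gene] = threshold_gate([state[r] for r in regs], weights, thr)
        if nxt == state:
            return state
        state = nxt
    raise RuntimeError("no steady state reached")

# Hypothetical network: "act" is a fixed input, "rep" is switched on by "act",
# and "tgt" is expressed only when "act" is present and "rep" is absent.
rules = {
    "rep": (["act"], [1], 1),
    "tgt": (["act", "rep"], [1, -1], 1),
}
```

Starting from `{"act": 1, "rep": 0, "tgt": 0}`, the target gene switches on for one step and is then shut off once the repressor appears, which is the kind of steady-state behavior such a model is meant to expose.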

The Stochastic Loss of Spikes in Spiking Neural P Systems: Design and Implementation of Reliable Arithmetic Circuits

Fundamenta Informaticae, 2014

Spiking neural P systems (in short, SN P systems) have been introduced as computing devices inspired by the structure and functioning of neural cells. The presence of unreliable components in SN P systems can be considered in many different aspects. In this paper we focus on two types of unreliability: the stochastic delays of the spiking rules and the stochastic loss of spikes. We propose the implementation of elementary SN P systems with DRAM-based CMOS circuits that are able to cope with these two forms of unreliability in an efficient way. The constructed bio-inspired circuits can be used to encode basic arithmetic modules.

Proposed design improves single neuron reliability more efficiently than traditional Triple Modular Redundancy methods.

Optimizing the Convolution Operation to Accelerate Deep Neural Networks on FPGA

IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2018

As convolution contributes most operations in a convolutional neural network (CNN), the convolution acceleration scheme significantly affects the efficiency and performance of a hardware CNN accelerator. Convolution involves multiply and accumulate (MAC) operations with four levels of loops, which results in a large design space. Prior works either employ limited loop optimization techniques, e.g. loop unrolling, tiling and interchange, or only tune some of the design variables after the accelerator architecture and dataflow are already fixed. Without fully studying the convolution loop optimization before the hardware design phase, the resulting accelerator can hardly exploit the data reuse and manage data movement efficiently. This work overcomes these barriers by quantitatively analyzing and optimizing the design objectives (e.g. memory access) of the CNN accelerator based on multiple design variables. Then, we propose a specific dataflow of hardware CNN acceleration to minimize the data communication while maximizing the resource utilization to achieve high performance. The proposed CNN acceleration scheme and architecture are demonstrated by implementing end-to-end CNNs including NiN, VGG-16 and ResNet-50/ResNet-152 for inference. For VGG-16 CNN, the overall throughputs achieve 348 GOPS and 715 GOPS on Intel Stratix V and Arria 10 FPGAs, respectively.
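
For readers unfamiliar with the loop structure being optimized, the following is a plain reference convolution written as a nested loop over MAC operations. Grouping the loops into four levels (kernel window, input feature maps, spatial scan, output feature maps) is an assumption about what the abstract's "four levels" refers to; stride 1 and no padding are also assumed, and the function name and data layout are illustrative only.

```python
def conv_layer(inp, weights):
    """Direct convolution with stride 1 and no padding.

    inp:     [n_in][H][W] nested lists
    weights: [n_out][n_in][K][K] nested lists
    returns: [n_out][H-K+1][W-K+1] nested lists
    """
    n_in, height, width = len(inp), len(inp[0]), len(inp[0][0])
    n_out, k = len(weights), len(weights[0][0])
    out_h, out_w = height - k + 1, width - k + 1
    out = [[[0.0] * out_w for _ in range(out_h)] for _ in range(n_out)]
    for o in range(n_out):                      # level 4: across output feature maps
        for y in range(out_h):                  # level 3: scan over one
            for x in range(out_w):              #          output feature map
                acc = 0.0
                for i in range(n_in):           # level 2: across input feature maps
                    for ky in range(k):         # level 1: MACs within one
                        for kx in range(k):     #          K x K kernel window
                            acc += inp[i][y + ky][x + kx] * weights[o][i][ky][kx]
                out[o][y][x] = acc
    return out
```

Unrolling, tiling, or interchanging any of these loops changes how many MACs run in parallel and how often each input, weight, and partial sum must be fetched, which is the design space the abstract describes.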

Mitigating effects of non-ideal synaptic device characteristics for on-chip learning

2015 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2015

The cross-point array architecture with resistive synaptic devices has been proposed for on-chip implementation of weighted sum and weight update in the training process of learning algorithms. However, the non-ideal properties of the synaptic devices available today, such as the nonlinearity in weight update, limited ON/OFF range and device variations, can potentially hamper the learning accuracy. This paper focuses on the impact of these realistic properties on the learning accuracy and proposes mitigation strategies. Unsupervised sparse coding is selected as a case study algorithm. With the calibration of the realistic synaptic behavior from the measured experimental data, our study shows that the recognition accuracy of MNIST handwritten digits degrades from ~97% to ~65%. To mitigate this accuracy loss, the proposed strategies include 1) smart programming schemes for achieving linear weight update; 2) a dummy column to eliminate the off-state current; 3) the use of multiple cells for each weight element to alleviate the impact of device variations. With the synaptic behavior improved by these strategies, the accuracy increases back to ~95%, enabling the reliable integration of realistic synaptic devices in neuromorphic systems.

Throughput of multi-core processors under thermal constraints

Proceedings of the 2007 international symposium on Low power electronics and design - ISLPED '07, 2007

We analyze the effect of thermal constraints on the performance and power of multi-core processors. We propose system-level power and thermal models, and derive expressions for (a) the maximum number of cores that can be activated, with and without throttling, and (b) the speedup (multi-core over single core) and the total power consumption, both as functions of the number of active cores. These expressions involve parameters like power per core, thermal resistance of the hottest die block and package, and leakage dependence on temperature. We also computed the above metrics (a) and (b) numerically by solving the detailed Hotspot circuit of a multicore processor driven by a block-level exponential temperature-dependent leakage model. When compared to these numerical results, the expressions for (a) underpredicted by at most 8%, while those for (b) were accurate. The proposed analytical approach is the first of its kind to relate metrics of interest in multi-core processors to high-level design parameters. Compared to numerical approaches, it provides much faster computation time and valuable insight for processor designers.

An optimal analytical solution for processor speed control with thermal constraints

Proceedings of the 2006 international symposium on Low power electronics and design - ISLPED '06, 2006

As semiconductor manufacturing technology scales to smaller device sizes, the power consumption of clocked digital ICs begins to increase. Dynamic voltage and frequency scaling (DVFS) is a well-known technique for conserving energy. Recently, it has also been used to control the CPU temperature as part of Dynamic Thermal Management (DTM) techniques. Most works in these areas assume that the optimum speed profile (for either minimizing energy or maximizing performance) is a constant profile. However, in the presence of thermal constraints, we show that the optimal profile is, in general, a time-varying function. We formulate the problem of maximizing the average throughput of a processor over a given time period, subject to thermal and speed constraints, as a problem in the calculus of variations. The variational approach provides a powerful framework for precisely specifying and solving the speed control problem, and allows us to obtain an exact analytical solution. The solution methodology is very general, and works for any convex power model and simple lumped RC thermal models. The resulting speed profiles were found to consist of up to three segments, of which one is a decreasing function of time and the others are constant. We analyze the effect of different parameters like the initial temperature, thermal capacitance and the maximum rated speed on the nature and the cost of the optimum solution. We also propose a two-speed solution that approximates the optimal speed curve. This solution was found to achieve a performance close to that of the optimum, and is also easier to implement in real processors.

Categories and Subject Descriptors: C.4 [Performance of systems]: modeling techniques, performance attributes; G.1.6 [Optimization]: constrained optimization. We gratefully acknowledge the support for this work by the Consortium for Embedded Systems at the Arizona State University and by a grant from the National Science Foundation, grant number CNS-0509540.

Battery optimization vs energy optimization: which to choose and when?

ICCAD-2005. IEEE/ACM International Conference on Computer-Aided Design, 2005.

Batteries are non-ideal energy sources: minimizing the energy consumption of a battery-powered system is not equivalent to maximizing its battery life. We propose an alternative interpretation of a previously proposed battery model, which indicates that the deviation from ideal behavior is due to the buildup of "unavailable charge" during the discharge process. Previously, battery-aware task scheduling algorithms and power management policies have been developed, which try to reduce the unavailable charge at the end of a given workload. However, they do not account for the occurrence of rest periods (user enforced, naturally occurring, or due to finite load horizon), which are present in a variety of workloads. We first obtain an analytical bound on the recovery time of a battery as a function of the extent of recovery. Then, we show that the effect of the rest periods is to reduce the improvement of battery-charge optimizing techniques over traditional energy-optimizing techniques. Under certain conditions, the policy that only minimizes energy consumption can actually achieve a longer battery lifetime than a battery-aware policy. A formal criterion based on the recovery time is proposed to choose between a candidate battery-aware policy and a candidate energy-aware policy. We also model the battery discharge process as a Linear Time Invariant system and obtain the frequency response of a battery. This is then used to study the effect of task granularity on the improvement achieved by battery-aware task scheduling. It was observed that the response times of typical batteries are of the order of seconds to several minutes. This, along with the charge recovery effect, was seen to cause battery-aware task scheduling methods to become ineffective for both very fine-grained (less than 10 ms) and very coarse-grained (greater than 30 min) task granularities.

Rest periods longer than 30 minutes lead to less than 5% improvement of battery-aware policies over energy-aware policies, suggesting limited utility of such strategies.

Energy optimal speed control of devices with discrete speed sets

Proceedings of the 42nd annual conference on Design automation - DAC '05, 2005

We obtain analytically the energy-optimal speed profile of a generic multi-speed device with a discrete set of speeds, to execute a given task within a given time. Current implementations of energy-efficient speed control policies (including DVFS) almost exclusively use the minimum feasible speed pair, which has been shown before to be suboptimal. Unlike previous works, ours does not require an explicit functional relationship between the device's power and speed (e.g. the CMOS power model), but only assumes that the power-speed relationship is a W-convex (a discrete equivalent of a convex) function. This assumption allowed us to show that the optimal speed profile uses at most two speeds, and that all the essential characteristics of the power-speed relationship can be encapsulated within a single speed, ω_u. The latter speed is intrinsic to the device (i.e. task independent) and can be readily computed from its power-speed values (without any curve fit). Further, ω_u is also the speed at which the device consumes the least energy per unit work done. The problem formulation reduces to a linear program in the number of supported speeds, which, in general, is difficult to solve analytically. However, the optimum solution has a very simple form: it is either ω_u or the minimum feasible speed pair for the given task. We verified that a number of commercial DVFS processors, and other devices like disk drives, satisfy our model of the W-convex power-speed relationship.
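
The selection rule described in the abstract can be sketched from a device's power-speed table alone. The code below computes the speed with the least energy per unit work (called ω_u above, `critical_speed` here) as the table entry minimizing power divided by speed, then picks either that speed or the pair of adjacent supported speeds that bracket the required average speed. The function names, the rule "use ω_u whenever the required speed does not exceed it", and the assumption that an idle device draws no power are illustrative simplifications, not details taken from the paper.

```python
def critical_speed(table):
    """Speed with the least energy per unit work, i.e. minimum power/speed.

    table: list of (speed, power) pairs with speed > 0.
    """
    return min(table, key=lambda sp: sp[1] / sp[0])[0]

def speed_schedule(table, work, deadline):
    """Return (speed, duration) segments that finish `work` by `deadline`.

    Assumes an idle device draws no power (an illustrative simplification).
    """
    speeds = sorted(s for s, _ in table)
    required = work / deadline                  # average speed needed
    if required > speeds[-1]:
        raise ValueError("task is infeasible at the maximum speed")
    w_u = critical_speed(table)
    if required <= w_u:
        return [(w_u, work / w_u)]              # run at w_u, then idle
    # Minimum feasible speed pair: adjacent speeds bracketing the required speed.
    hi = min(s for s in speeds if s >= required)
    if hi == required:
        return [(hi, deadline)]
    lo = max(s for s in speeds if s < required)
    # Solve lo * t_lo + hi * t_hi = work with t_lo + t_hi = deadline.
    t_hi = (work - lo * deadline) / (hi - lo)
    return [(lo, deadline - t_hi), (hi, t_hi)]
```

For a hypothetical table `[(1.0, 2.0), (2.0, 3.0), (3.0, 6.0), (4.0, 12.0)]`, the energy per unit work is lowest at speed 2.0, so a task with a loose deadline runs at 2.0 and then idles, while a task needing an average speed of 3.5 splits its time between speeds 3.0 and 4.0.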

A framework for statistical timing analysis using non-linear delay and slew models

Proceedings of the 2006 IEEE/ACM international conference on Computer-aided design - ICCAD '06, 2006

In this paper we propose a framework for Statistical Static Timing Analysis (SSTA) considering intra-die process variations. Given a cell library, we propose an accurate method to characterize the gate and interconnect delay as well as slew as a function of underlying parameter variations. Using these accurate delay models, we propose a method to perform SSTA based on a quadratic delay and slew model. The method is based on an efficient dimensionality reduction technique used for accurate computation of the max of two delay expansions. Our results indicate less than 4% error in the variance of the delay models compared to SPICE Monte Carlo and less than 1% error in the variance of the circuit delay compared to Monte Carlo simulations.

Analytical Results for Design Space Exploration of Many-Core Processors for Sound Synthesis of Guitar

2011 International Conference on Information Science and Applications, 2011

In this paper, we present a design space exploration of optimal many-core processors for...

Energy profiler for hardware/software co-design

17th International Conference on VLSI Design. Proceedings.

The increasing popularity of low power computing drives the need for increasing battery lifetime by power-optimizing application programs. This requires a tool capable of providing a system-level energy model. We present a methodology for simulating and profiling energy consumed by software applications running on computing systems. The uniqueness of our framework lies in its ability to capture, at a fine granularity, the power behavior of the software executing on the system as well as of each of the hardware components in the system, and in its applicability to a wide variety of computing systems. We demonstrate our work on a real world platform, the Itsy, a handheld computer developed by Compaq's Western Research Labs.

Variational delay metrics for interconnect timing analysis

Proceedings of the 41st annual conference on Design automation - DAC '04, 2004

In this paper we develop an approach to model interconnect delay under process variability for timing analysis and physical design optimization. The technique allows for closed-form computation of interconnect delay probability density functions (PDFs) given variations in relevant process parameters such as linewidth, metal thickness, and dielectric thickness. We express the resistance and capacitance of a line as a linear function of random variables and then use these to compute circuit moments. Finally, these variability-aware moments are used in known closed-form delay metrics to compute interconnect delay PDFs. We compare the approach to SPICE based Monte Carlo simulations and report an error in mean and standard deviation of delay of 1% and 4% on average, respectively.

Hardware-software bipartitioning for dynamically reconfigurable systems

Proceedings of the tenth international symposium on Hardware/software codesign - CODES '02, 2002

The main unique feature of dynamically reconfigurable systems is the ability to time-share the same reconfigurable hardware resources. However, the energy-delay cost associated with reconfiguration must be accounted for during hardware-software partitioning. We propose a method for mapping nodes of an application control flow graph either to software or reconfigurable hardware, explicitly targeting minimization of the energy-delay cost due to both computation and configuration. The addressed problems are energy-delay product minimization, delay-constrained energy minimization, and energy-constrained delay minimization. We show how these problems can be tackled by using network flow techniques, after transforming the original control flow graph into an equivalent network. If there are no constraints, as in the case of the energy-delay product minimization, we are able to generate an optimal solution in polynomial time.

Performance optimal speed control of multi-core processors under thermal constraints

2009 Design, Automation & Test in Europe Conference & Exhibition, 2009

Advances in chip-multiprocessor processing capabilities have led to increased power consumption and temperature hotspots. Maintaining the on-chip temperature is important for power reduction and reliability. Achieving the highest performance while maintaining the temperature constraint is a challenge. We develop analytical solutions for the optimal control of frequencies for each core in a chip-multiprocessor. The objective is to reduce the makespan, or the latest task completion time of all tasks. We show that the optimal frequency policy is bang-bang when the temperature constraint is not active and is exponential when the temperature constraint is active. We show that there is a significant improvement in overall throughput with our proposed solution and yet all cores operate under the thermal maximum.

Implicit pseudo boolean enumeration algorithms for input vector control

Proceedings of the 41st annual conference on Design automation - DAC '04, 2004

In a CMOS combinational logic circuit, the subthreshold leakage current in the standby state depends on the state of the inputs. In this paper we present a new approach to identify the minimum leakage set of input vectors (MLS). Applying a vector in the MLS is known as Input Vector Control (IVC), and has proven to be very useful in reducing gate oxide leakage and sub-threshold leakage in standby mode of operation. The approach presented here is based on Implicit Enumeration of integer-valued decision diagrams. Since the search space for minimum leakage vector increases exponentially with the number of primary inputs, the enumeration is done with respect to the minimum balanced cut of the digraph representation of the circuit. To reduce the switching power dissipated when the inputs are driven to a given state (during entry into and exit from the standby state), we extend the MLS algorithm to compute a bounded leakage set (BLS). Given a bound of standby leakage, we present an algorithm for computing minimal switching cost partial input vectors such that the leakage of the circuit is always less than the upper bound.