Academia.eduAcademia.edu

Recovery Technique For Implementing DSP Algorithms,”

1999

Abstract

In this IC&D report, we provide two research works entitled:

LPGD A Low-Power Design Methodology/Flow and its Application to the Implementation of a DCS1800-GSM/DECT Modulator/Demodulator (ESPRIT 25256) “Proceedings of second Workshop” D. Soudris1, C. Dre2, and C. Goutis3 1 Democritus University of Thrace (DUTH) 2 Intracom 3 University of Patras (UP) Editor: DUTH+UP+INTRACOM Type: Report LPGD Id: LPGD/WP1/DUTH+UP+INTRACOM/D1.2R1 CEC Identifier: EP25256/ DUTH+UP+INTRACOM/D1.2R1 Document Version: 1 Status: Deliverable Confidentiality: Public Actual Date: December 31, 1999 Contractual Date: November 30, 1999 WorkPackage: WP1 Keywords: logic-level estimation, real-delay gate model, RTL optimization, energy recovery, array processor Abstract: In this IC&D report, we provide two research works entitled: i) “Accurate And Fast Power Estimation Of Large Combinational Circuits” DQG LL  “A Modified Energy Recovery Technique For Implementing DSP Algorithms,” ZKLFK ZHUH LQFOXGHG LQ WKH 3URFHHGLQJV RI WK ,QW :RUNVKRS 3RZHU DQG 7LPLQJ 0RGHOLQJ 2SWLPL]DWLRQ DQG 6LPXODWLRQ 3DWPRV C9  Kos, Greece, 2FWREHU 8, 1999. Title: Author(s): Copyright © 1999 INTRACOM UP DUTH Proceedings of second Workshop Date December 31, 1999 LPGD/WP1/DUTH+UP+INTRACOM/D1.2R1 History Version 1 Comments Deliverable Purpose /Scope The purpose of the present Information Capturing and Dissemination report is to provide two papers entitled: i) “Accurate And Fast Power Estimation Of Large Combinational Circuits” >@ DQG LL “A Modified Energy Recovery Technique For Implementing DSP Algorithms,” [2], ZKLFK ZHUH LQFOXGHG LQ WKH 3URFHHGLQJV RI WK ,QW :RUNVKRS 3RZHU DQG 7LPLQJ 0RGHOLQJ 2SWLPL]DWLRQ DQG 6LPXODWLRQ 3DWPRV C9  Kos, Greece, 2FWREHU 8, 1999, pp. 199-208. The published material includes two low power techniques, the first of which aims at the power estimation of large gate level circuits under a real delay gate model. The second approach is an RTL power optimization technique for implementing DSP applications based on an area-efficient energy recovery technique. Here, it should be stressed that the General Chair of PATMOS 99 is also the responsible person for the Information Capturing and Dissemination of LPGD. In the context of PATMOS 99 there was close co-operation between PATMOS 99 organizing committee and DIMES (Prof. R. Nouta and Prof. R. Van Lecken ). Except that report, we attached the Proceedings and the Technical Program of PATMOS 99. Publications [1] S. Theoharis, G. Theodoridis, N. Zervas, and C. Goutis, “Accurate And Fast Power Estimation Of Large Combinational Circuits,” in Proc. of 9WK ,QW :RUNVKRS 3RZHU $QG 7LPLQJ 0RGHOLQJ 2SWLPL]DWLRQ $QG 6LPXODWLRQ 3DWPRV C9  Kos, Greece, 2FWREHU 8, 1999, pp. 199-208. [2] D. Soudris, C. Z. Lolas, And A. Thanailakis, “A Modified Energy Recovery Technique For Implementing DSP Algorithms,” In Proc. of 9th ,QW :RUNVKRS 3RZHU $QG 7LPLQJ 0RGHOLQJ 2SWLPL]DWLRQ $QG 6LPXODWLRQ 3DWPRV C9  Kos, Greece, 2FWREHU 8, 1999, pp. 113-122. Public Page 2 Proceedings of second Workshop LPGD/WP1/DUTH+UP+INTRACOM/D1.2R1 TABLE OF CONTENTS ACCURATE AND FAST POWER ESTIMATION OF LARGE COMBINATIONAL CIRCUITS......................5 ABSTRACT ....................................................................................................................... .....................................5 1. INTRODUCTION................................................................................................................ ...............................5 2. PROBLEM FORMULATION ............................................................................................................................6 3. SWITCHING ACTIVITY COMPUTATION .............................................................................................. .......8 4. EXPERIMENTAL RESULTS ........................................................................................................ ..................11 6. CONCLUSIONS ...............................................................................................................................................14 REFERENCES ..................................................................................................................... .................................14 A MODIFIED ENERGY RECOVERY TECHNIQUE FOR IMPLEMENTING DSP ALGORITHMS.............. 17 ABSTRACT .......................................................................................................................................................... 17 1. INTRODUCTION ...........................................................................................................................................17 2. THE BASIC DESIGN CONCEPT.................................................................................................................... 18 3. THE PROPOSED DESIGN METHODOLOGY...............................................................................................19 3.1 THE MAPPING PROCEDURE ............................................................................................................................. 19 3.2 METHOD FOR SELECTING OPTIMAL CLOCKING SCHEME................................................................................... 20 4. HARDWARE COMPLEXITY/REDUCTION..................................................................................................22 5. APPLICATION ................................................................................................................................................. 23 6. CONCLUSIONS ...............................................................................................................................................25 Public Page 3 Reprinted from Proceedings of 9WK ,QW :RUNVKRS 3RZHU $QG 7LPLQJ 0RGHOLQJ 2SWLPL]DWLRQ$QG6LPXODWLRQ 3DWPRVC9 Kos, Greece, 2FWREHU8, 1999. S. Theoharis, G. Theodoridis, N. Zervas, and C. Goutis, “Accurate And Fast Power Estimation Of Large Combinational Circuits,” in Proc. of 9WK ,QW :RUNVKRS 3RZHU $QG 7LPLQJ 0RGHOLQJ 2SWLPL]DWLRQ $QG 6LPXODWLRQ 3DWPRV C9  Kos, Greece, 2FWREHU8, 1999, pp. 199-208. Proceedings of second Workshop LPGD/WP1/DUTH+UP+INTRACOM/D1.2R1 Accurate And Fast Power Estimation Of Large Combinational Circuits1 S. Theoharis, G. Theodoridis, N. Zervas, and C. Goutis. VLSI Design Lab, Dept. of Elect. and Comp. Eng. University of Patras, Rio 26 110, Greece Abstract A novel probabilistic method to estimate the switching activity of a logic circuit under a real delay gate model, is introduced. Based on Markov stochastic processes and generalizing the basic concepts of zero delay-based methods, a novel probabilistic model to estimate accurately the power consumption, is developed. More specifically, a set of new formulas, which describe first-order temporal correlation, under real delay model, are derived. The chosen gate model allows accurate estimation of the functional and spurious (glitches) transitions, leading to accurate power estimation. The proposed approach manipulates efficiently large circuits employing a new partitioning heuristic, which estimates the switching activity with reduced computational complexity. Comparative study and analysis of benchmark circuits demonstrates the accuracy and efficiency of the proposed method. 1. Introduction The development of competitive market sectors such as wireless applications, laptops and portable medical devices make the power dissipation the most important parameter in the modern VLSI design field[1]. The average number of transitions of the circuit nodes can express the average power consumption of a circuit. A variety of probabilistic power estimation methods have been developed to estimate the power consumption of the combinational circuits [2]-[9]. In this methods probabilistic properties, such as the average steady state and the average transition probability of the circuit nodes, are used in order to evaluate the transitions of any circuit node. The above methods can be classified according to assumed gate delay model in the following categories i) the zero delay methods [3]-[6], where only the functional transitions are considered and the ii) real gate delay methods [7]-[9], where both the functional and spurious transitions are taken into account. 1 This work was partially supported by the LPGD ESPRIT IV 25256 project of EU. Public Page 5 Proceedings of second Workshop LPGD/WP1/DUTH+UP+INTRACOM/D1.2R1 Hence, real gate delay model is needed for accurate power estimation. In addition temporal correlation, input pattern dependencies and spatial (structural) dependencies have to be considered to estimate accurate the power dissipation. In [7] the concept of the probability waveforms to estimate the average power dissipation of a combinational circuit was introduced. Given the probability waveforms of the primary inputs, a propagation algorithm of these waveforms through the circuit to evaluate the corresponding waveforms of the internal circuit nodes was developed. However, both the structural and input dependencies were not taken into consideration. Additionally, the efficiency of the algorithm for large circuits is not proved by the presented experimental results. In [8], a symbolic simulation algorithm has been proposed. Having the switching activities of the primary inputs and using global Ordered Binary Decisions Diagrams (OBDDs), where each node is expressed by the primary input signals, the transition probability of a node at time t resulted from the probability of a new boolean function. The new boolean function is derived by XOR’ing the boolean functions that correspond to two successive switching time instances, i.e. t and t+1. This method handled the structural and the first-order temporal correlations, but the input pattern dependency was not included. Moreover, this method is inefficient for large circuits as global OBDDs are used. A new method for calculating the transition probabilities was proposed in [9]. The structural and the first order temporal correlations of the primary inputs were captured. In order to capture large circuits the method is parameterized in terms of the depth of the circuit levels that are used for reconvergent structural analysis. In this paper, a novel probabilistic method for accurate power estimation of large combinational circuits is introduced. Taking into account simultaneous multiple-input transitions, temporal and structural correlation of the circuit and assuming that primary inputs are independent (i.e. the input data correlations do not taken into account), the proposed method calculates the switching activity of any logic circuit node under a real delay gate model. The main objective is the development of a new mathematical model to estimate accurately the power consumption of a logic level circuit under real delay model generalizing fundamental principles of zero delay-based methods. Since we assume a real delay gate model, the time parameter influences the logic behavior of a circuit node. We propose a new heuristic for handling efficiently large combinational circuits. The proposed method is compared with the method reported in [9]. Employing a set of benchmark circuits, the comparison results prove the efficiency of the proposed method. 2. Problem Formulation It is assumed that the combinational circuit is a part of a synchronous sequential circuit, which means that its inputs can switch synchronously with the clock, performing at most one transition at time t=0 during the clock period [0,T). Public Page 6 Proceedings of second Workshop LPGD/WP1/DUTH+UP+INTRACOM/D1.2R1 The power estimation problem of a combina-tional logic circuit, under real gate delay model can be stated as follows: “Given the gate level description of a combinational circuit with n inputs and m outputs and the inertial delays of its gates, and, assuming that the time between two successive applied input vectors is greater or equal to the settling time of the circuit, estimate the average power consumption of the circuit for an input vector stream through the calculation of its average switching activity.” The output of a gate generates a glitch, if two conditions are satisfied: i) the necessary condition, which requires the difference of the transition arrival times of the input signals to be equal or greater than the inertial delay of the gate and ii) the sufficient condition, which requires the appropriate transitions of the input signal(s) to switch the gate output. Consequently, the time parameter plays a critical role in the power consumption estimation. That is, the exact description of a logic circuit should include not only its logic behavior, but also the time dependence among the logic signals. Since the glitch generation is strongly dependent on the time, a “modified” boolean function, from now on called timed boolean function, which can describe the logic and timing behavior, is needed. [ 1 n1 x2 1 f Figure 1 : The logic circuit with Unit Delay AND gates. Example: We assume a logic circuit with gate delay equals to one delay unit, d, as it shown in Figure 1. The logic behavior of the node f can be described in time domain as follows: f = F ( x1 , x 2 , t ) = x1 ( t − 2 d ) x 2 ( t − 2 d ) x 2 ( t − d ) (1) The signal f may switch at two time instances, i.e. t 1 = d and t 2 = 2d . More f f specifically, the transition of the signal f at t1, t 1 = d , depends on the transitions of the f primary inputs x1 and x2 at time points t1x1 = −d , t1x2 = −d , and t1x2 = 0 , while the transition of f at t 2 = 2d depends on the transitions of the signals x1 and x2 at t2x1 = 0 , f t 2x2 = 0 , and t2x2 = d . Having as starting point eq. (1), a novel mathematical model, which describes the behavior of a logic signal in terms of time, should be introduced. We aim at the development of a new method, which reduces the power estimation problem considering real delay model to a zero delay problem at certain switching time points. For that purpose, we introduce new concepts and formulas, which express parameters of real delay modeled power estimation problem in terms of zero delay parameters. Public Page 7 Proceedings of second Workshop LPGD/WP1/DUTH+UP+INTRACOM/D1.2R1 3. Switching Activity Computation In this section, we provide all the necessary definitions and the appropriate lemmas for modeling the temporal correlation in time domain. More specifically, we begin with the fact that a signal can be modeled as Markov stochastic process. Then, we provide the appropriate material for temporal correlation modeling, extending the concept of transition probability for certain time intervals. The behavior of a binary signal, x, at a time point, t, i.e. x(t), is modeled as a random variable of a time homogeneous, Strict Sense Stationary, lag-one Markov stochastic process having two states, s, with s ∈S = {0,1} . Also, it is assumed that the Markov process is finite, aperiodic, and irreducible with all states recurrent non-null [6]. The transition probability, p klx (t ) , expresses the probability of a signal x to perform a transition from the state k to the state l within two successive time points t-1 and t. That is: p klx (t ) = p ( x(t − 1) = k ∧ x(t ) = l ) ∀ k , l ∈ S (2) The switching activity, E ( x, t ) , of a signal x at time instance t is given by: x x E ( x, t ) = p 01 (t ) + p10 (t ) (3) where p01x (t ) (or p10x (t ) ) is the transition probability of the signal x to perform the transition from state 0 to 1 (or 1 to 0) at time instance t. For the rest of the paper, the term "input signal" stands for primary input signal and the term "signal" stands for either a primary input signal or an internal signal of the logic circuit. x Definition 1. A Signal Transition Probability Vector, P (t ) , of a signal x at a time x instance t, is defined as the vector of all transition probabilities p kl (t ) , with k , l ∈ S : [ P x ( t ) = p00x (t ), p01x ( t ), p10x (t ), p11x (t ) ] (4) We introduce the transition probability of a signal x in time intervals (ti-1, ti) and (ti, ti+1) as P x (ti− ) and P x (ti+ ) , respectively. Their corresponding values are computed by the next lemma. Lemma 1. The transition probability of a signal, x, at a time point t ′∈ {t − , t + } is expressed in terms of the transition probability at time point t as follows: P x (t ′) = f (P x (t ) ) (5) and are computed by the following new formulas: pkkx (t − ) = pkkx (t )+ pkx(1−k ) (t ) ∀ k ∈ S Public (5.1) Page 8 Proceedings of second Workshop LPGD/WP1/DUTH+UP+INTRACOM/D1.2R1 pllx (t + ) = pllx (t )+ p(x1−l ) l (t ) ∀ l ∈ S (5.2) pklx (t − ) = pklx (t ) = 0 (5.3) ∀ k, l ∈ S ∧ k ≠ l Proof: Due to lack of space the proof is omitted here. Definition 2: We define as Valid Time Points set, T = {t1 ,... , tr }, the transition time x points for a signal x, i.e. the time points that a transition may occur in the signal x. The switching activity estimation problem is reduced to the estimation of P x (t ) ∀ t ∈T x . For example the T f = {1,2} for the signal f of the circuit shown in Fig. 1. For an input signal x the T x = {0 − ,0,0 + } Generally, a signal x of the logic circuit is a timed boolean function of a subset of v signals, i.e. x (t ) = f (x1 (t ) , ... , xv (t ) ) . For instance, the nodes n1 and f shown in Fig. 1 have as timed boolean function n1 (t ) = x1 (t − 1) x2 (t − 1) and f (t)=n1 (t −1) x2 (t −1) =x1 (t −2) x2 (t −2) x2 (t −1) respectively (one unit time delay is assumed). For any timed boolean function, x (t ) = f (x1 (t ) , ... , xv (t ) ) , we can efficiently construct the OBDD using the BDD package reported in [11]. The v variables that the signal x depends on, at any time point, are called the support set of signal x(t). In order to find out the support sets of all circuit nodes, a limited reconvergent path analysis is performed. This analysis results into a new heuristic, which is based on the limited depth spatial correlation of [9]. More specifically, the notion of active nodes introduced in [9] is used in order to perform an initial partition for any circuit signal according to a user defined parameter, called Depth. This parameter determines the depth in terms of logic levels from each signal x that will be searched in order to determine if two paths starting at x will reconverge. Spatial correlation corresponding to two paths starting at x that reconvergence within Depth logic levels will be accurately taken into account. If reconvergent paths meet after Depth logic levels, then they are assumed to be independent. Even if the two paths meet within Depth logic levels, another user defined parameter , called N_vars, limits the number of variables v , i.e v ≤ N_vars for any circuit signal at any valid time point. The algorithm that limits the number of variables consist a modified heuristic similar to that reported in [9]. A circuit node, f, switches at time t = i , if the derivative with respect to time of its timed boolean function is equal to 1. Thus, the average switching activity, E ( f ) , in probabilistic domain can be described by:  ∂f (t) E( f ) = p  ∂t t =tif ) (  = 1 = p lim{f (tif − ε ) ⊕ f (tif + ε )}= 1 ε →0  Public (6) Page 9 Proceedings of second Workshop LPGD/WP1/DUTH+UP+INTRACOM/D1.2R1 Instead of performing the XORing between the boolean functions corresponding to the time intervals (ti-1, ti) and (ti, ti+1) we can manipulate efficiently the OBDD, that corresponds to each time point, in order to evaluate the probability of switching at t = i . Taking into consideration the representation of the ordinary boolean function on the OBDD the evaluation of the transition k → l with k , l ∈ {0, 1 } at time instance ti can f be done as described by the following steps (using the same procedure as in [6]): i) Find the set of paths to state k, Π k (t = i ) , of the OBDD of the boolean function corresponding to the timed boolean function on time point i, ii) Find the set of paths to state l, Π l (t = i ) , of the OBDD of the boolean function corresponding to the timed boolean function on time point i, iii) Combine each path of Π k (t = i ) with all paths of Π l (t = i ) and extract the switching behavior taking into account the temporal compatibility of each signal v using the following equation: p x kl (t i ) = ∑ ∑ ∏ π ∈ Π k π ′∈ Π l j =1 p kxiil i (t j ) (7) The pseudo-code of the proposed switching activity estimation is shown in Fig. 2. Switching_Activity_Estimation (Network, Vector , Depth, N_vars){ Gates = Topological_Sort ( Network ) ; -- (1) : Perform a limited depth reconrvergent analysis Find_Active_Nodes ( Gates , Depth ) ; -- (2) : Find out the valid time points of each circuit signal. For each primary input signal xi { see [9] T xi = {0} ; P xi (0) = Analyze_Input_Vector( Vector , i ) ; } for each gate g in Gates { m = delay of g ; T g = NIL (LIST) ; for each input gi of g { for each time point k ∈ ∪ T gi { T g = T g ∪ {k + ∆ } if k + m satisfies the necessary condition ; } } } -- (3) : Calculate the switching activity of any circuit node at any valid time point. For each gate g in Gates { for each time point k of T g { Sup(g(k)) = Find_Support_Set ( g , k , Depth , N_vars ) ; OBDD(g(k)) = Construct_OBDD ( Sup (g(k)) ) ; Public see [11] Page 10 Proceedings of second Workshop LPGD/WP1/DUTH+UP+INTRACOM/D1.2R1 P g (k ) = Combine_Paths( OBDD(g(k) ) ; see [6] } } } Find_Support_Set ( g , k , Depth , N_vars ) { Sup(g(k)) = Construct_Support_Set ( g, k , g.active_nodes ) ; Depth’ = Depth ; While | Sup(g(k)) | > N_vars { Depth’ = Depth’ – 1 ; Find_Active_Nodes ( g , Depth’ ) ; Sup(g(k)) = Construct_Support_Set ( g , k, g.active_nodes’ ) ; } return(Sup(g(k)) ; } Figure 2 : Pseudo-code of the proposed method. 4. Experimental Results The proposed power estimation method is implemented by ANSI C language, while its efficiency is proved by a number of ISCAS'91 benchmark multilevel circuits. For technology mapping, a library of primitive gates (i.e. NOT, BUFF, AND, NAND, OR, NOR, XOR, XNOR) of up to 4 primary inputs, is used. All power estimations are measured in ìW with 20 MHz clock frequency and 5 V power supply. The presented results are obtained using the proposed method with Depth = 2 and 3 and N_vars = 8 and the method presented in [9] with the same partitioning depths. For a circuit node ni (either primary input or internal node), we define as Real Node Error the quantity Err (ni ) = E (ni ) − E ′(ni ) where E (ni ) is the real switching activity of E (ni ) node ni and E ′(ni ) is the estimated switching activity of this node. For a combinational circuit with N gates and a specific input vector set Vj, we define as Total Power Consumption the quantity Total Power (V j ) = N 1 2 ⋅ Vdd ⋅ f ⋅ ∑ E Eff (ni ) . The effective 2 i =1 switched capacitance, E Eff (ni ), is defined as the product of the switching activity E (ni ) and the total capacitance load of node ni, C ni = Fni C g where Fni is the fanout of this node and Cg = 0.05 pF is a typical input gate capacitance. We define as Real Total Error the quantity Total Error (V j ) = ′ Total Power (V j )− Total Power (V j ) Total Power (V j ) Public , as Real Mean Error the Page 11 Proceedings of second Workshop LPGD/WP1/DUTH+UP+INTRACOM/D1.2R1 N quantity Mean Error (V j ) = ∑ Err (n ) i i =1 N and finally as Real Maximum Error the quantity Max Error (V j ) = max{ Err (n1 ), Err (n2 ), ... , Err (n N ) }. For comparison reasons, a random vector file of 100,000 vectors with temporal correlations is generated for each benchmark circuit. More specifically any input bit xi has a random static, p(xi )∈[0,1] , and switching probability E (xi )∈[0,1] and E(xi )≠ 2 p(xi )(1 − p(xi ) ) . We compare the proposed method and the method reported in [9] with Mentor's Graphics QUICKSIM II switch level simulator. The power consumption differences between each method and switch level simulator are depicted in Table 2-5 (i.e. proposed method and [9]). Circuit C1355 C1908 C2670 C3540 C499 C6288 C7552 C880 alu4 cu des f51m rca16 vda # inputs 41 33 33 50 41 32 207 60 14 14 256 8 33 17 # outputs 32 25 64 22 32 32 107 26 8 11 245 8 17 39 # gates 180 205 422 817 176 1491 1154 221 543 104 2437 83 112 391 # levels 15 21 20 34 17 73 21 22 35 14 17 10 49 16 Table 1: Circuits characteristics. C1355 C1908 C2670 C3540 C499 C6288 C7552 C880 alu4 cu des f51m rca16 vda Average Power Prop. Power' 3149 3061 4322 4205 4827 4792 8864 8010 2710 2661 75583 94825 12457 11733 1722 1691 3858 3683 510 502 16987 16497 532 519 934 874 2465 2310 9923 11097 TOTAL MEAN 2.803 4.724 2.703 12.239 0.703 6.886 9.628 12.059 1.820 4.367 25.461 19.721 5.799 10.791 1.745 3.695 4.452 16.115 1.506 3.283 2.862 3.721 2.492 8.357 6.454 7.000 6.213 10.064 5.332 8.787 MAX 34.081 321.325 231.785 112.329 28.167 136.845 129.737 39.836 114.390 21.651 60.855 173.159 26.902 97.258 109.166 Time 0.890 1.030 5.030 3.350 0.910 10.340 7.320 1.290 2.220 0.330 15.270 0.260 0.650 1.290 3.584 Table 2: Real power estimation errors of the proposed method for Depth=2, N_vars=8. Public Page 12 Proceedings of second Workshop LPGD/WP1/DUTH+UP+INTRACOM/D1.2R1 For some large circuits (even for Depth=2) the method presented in [9] runs out of memory and this is indicated with a “N/A” in the corresponding table row. The slightly differences in power consumptions (column Power is the real power consumption while column Power’ is the estimated one) are present due to the fact that the [9] method estimates the switching activities using polynomial simulation (without OBDDs) while the Proposed method estimate the switching activities using OBDDs. Finally in column Time, the corresponding run times for each method are given. The reported run times are in seconds and were obtained on an HP-UX C180 with 512 Mbytes of main memory. As it shown the run times of the proposed method are about 6 times faster than those of [9] (wherever [9] is applicable) and the average run time is 3.5 sec. Circuit C1355 C1908 C2670 C3540 C499 C6288 C7552 C880 alu4 cu des F51m Rca16 vda Power [9] Power' TOTAL MEAN 3149 3128 0.657 5.204 4322 4314 0.181 11.516 4827 N/A N/A N/A 8864 7912 10.732 12.273 2710 2661 1.822 4.369 75583 95590 26.473 19.726 12457 N/A N/A N/A 1722 1693 1.580 3.526 3858 3710 3.756 14.606 510 501 1.518 3.251 16987 N/A N/A N/A 532 514 3.297 5.744 934 874 6.449 6.997 2465 2426 1.501 6.109 MAX 33.160 382.930 N/A 111.899 28.162 135.901 N/A 22.534 113.917 21.652 N/A 59.363 26.880 97.256 Time 3.920 6.870 N/A 21.550 3.880 140.580 N/A 5.830 16.790 0.910 N/A 1.090 2.060 8.900 Table 3: Real power estimation errors of method [9] for Depth=2. Choosing Depth=3 and N_vars=8, Table 4 shows the errors in power estimation (%) of the proposed method, while Table 5 shows the corresponding errors of [9]. As it is depicted, the proposed method reduces the power estimation errors, while method [9] is not applicable for the majority of the large circuits. More specifically the average TOTAL error is about 4.5 %, while the average MEAN error and the average MAX error are 8.6 % and 106.8 % respectively. The average run time is slightly increased (i.e. 4.4 sec) compared to the average run time of Table 2. Circuit C1355 C1908 C2670 C3540 C499 C6288 C7552 C880 alu4 cu des f51m Prop.Power' TOTAL Power MEAN 3149 3075 2.357 3.863 4322 4224 2.256 12.377 4827 4787 0.807 7.039 8864 7945 10.368 12.135 2710 2667 1.595 4.997 75583 61981 17.995 20.270 12457 11815 5.146 9.857 1722 1721 0.020 3.142 3858 3665 4.919 16.457 510 502 1.506 3.283 16987 16504 2.823 3.754 532 517 2.694 9.118 Public MAX 34.081 371.130 200.166 98.021 35.530 99.297 158.658 40.119 114.218 21.651 60.757 173.146 Time 0.930 1.240 5.380 4.760 0.940 14.440 8.960 1.550 3.190 0.330 18.180 0.340 Page 13 Proceedings of second Workshop rca16 vda Average 934 2465 9923 LPGD/WP1/DUTH+UP+INTRACOM/D1.2R1 872 2362 8760 6.586 4.124 4.514 7.188 7.296 8.627 26.902 62.578 106.875 0.670 1.610 4.466 Table 4: Real power estimation errors of the proposed method for Depth=3, N_vars=8. 6. Conclusions We have proposed a novel method for the estimation of the power dissipation of a logic circuit that is described at the gate level, assuming a real gate delay. The proposed method constitutes an extension of the zero delay probabilistic method that presented in [9] and takes into account the first-order temporal correlations and the spatial correlations for the logic circuit structural dependencies. Since the proposed method does not take into account the spatial correlations due to input data dependencies, our future work is to implement a method that propagates the primary input statistics and correlation coefficients through the logic network in order to estimate efficiently the switching activity of any node of the logic circuit at any valid time point and for any input vector stream. Circuit Power [9] Power' TOTAL MEAN MAX Time 3149 3121 0.904 4.095 24.054 4.920 C1355 4322 N/A N/A N/A N/A N/A C1908 4827 N/A N/A N/A N/A N/A C2670 8864 N/A N/A N/A N/A N/A C3540 2710 2690 0.738 4.481 30.072 5.250 C499 75583 N/A N/A N/A N/A N/A C6288 N/A N/A N/A N/A N/A C7552 12457 1722 N/A N/A N/A N/A N/A C880 3858 N/A N/A N/A N/A N/A alu4 510 501 1.518 3.251 21.652 0.910 cu N/A N/A N/A N/A N/A des 16987 532 519 2.463 4.730 60.613 3.560 F51m 934 937 0.314 0.866 7.030 2.980 Rca16 2465 2442 0.857 4.150 28.290 51.320 vda Table 5: Real power estimation errors of method [9] for Depth=3. References [1] J. Rabaey and M. Pedram, “Low Power Design Methodologies,” Kluwer Academic Publishers, 1996. [2] F. Najm, “A Survey of Power Estimation Techniques in VLSI circuits” in IEEE Trans. on VLSI, vol 2, no 4, pp. 446-455, December 1995. [3] F. Najm, “Transition Density: A new measure of activity in digital circuits,” in IEEE Trans. on CAD, Vol. 12, No. 2, pp. 310-323, February 1995. [4] B. Kapoor, “Improving the accuracy of circuit activity measurement”, in Proc. of Int. Workshop on Low Power Design, pp. 111-116, April 1994. Public Page 14 Proceedings of second Workshop LPGD/WP1/DUTH+UP+INTRACOM/D1.2R1 [5] P. Schneider and U. Schlichmann, “Decomposition of Boolean functions for low power based on a new power estimation technique,” in Proc. of IWLPD, pp. 123-128, April 1994. [6] R. Marculescu, D. Marculescu, and M. Pedram “Efficient Power estimation for highly correlated input streams,” in Proc. of DAC, pp.628-634, 1995. [7] F. Najm, R. Burch, P.Yang and I. Hajj “Probabilistic Simulation for Reliability Analysis of CMOS VLSI circuits,” in IEEE Trans. on CAD, 9 (4) pp. 439-450, Apr. 1990. [8] J. Monteiro, A. Ghosh, S. Devadas, K. Keutzer, and J. White, “Estimation of average switching activity in combinatorial and sequential circuits”, in IEEE Trans. on CAD, Vol. 16, No.1, pp. 121-127, January 1997. [9] J.C. Costa, J.C. Monteiro, and S. Devadas, “Switching Activity Estimation using Limited Depth Reconvergent Path analysis”, In Proc. of ISLPD, pp. 184-189, 1997. [10] R. Bryant, “Graph-based algorithms for Boolean function manipulation”, in IEEE Trans. on Computers, C-35:677-691, August 1986. [11] J. Lind-Nielsen, “BuDDy – A Binary Decision Diagram Package”, IT-TR : 1999-028, Technical University Of Denmark, 1999. Public Page 15 Proceedings of second Workshop LPGD/WP1/DUTH+UP+INTRACOM/D1.2R1 Reprinted from Proceedings of 9WK ,QW :RUNVKRS 3RZHU $QG 7LPLQJ 0RGHOLQJ 2SWLPL]DWLRQ$QG6LPXODWLRQ 3DWPRVC9 Kos, Greece, 2FWREHU8, 1999. D. Soudris, C. Z. Lolas, And A. Thanailakis, “A Modified Energy Recovery Technique For Implementing DSP Algorithms,” in Proc. of 9WK,QW:RUNVKRS3RZHU$QG7LPLQJ 0RGHOLQJ 2SWLPL]DWLRQ $QG 6LPXODWLRQ 3DWPRV C9  Kos, Greece, 2FWREHU 8, 1999, pp. 113-122. Public Page 16 Proceedings of second Workshop LPGD/WP1/DUTH+UP+INTRACOM/D1.2R1 A modified Energy recovery Technique FOR implementing DSP algorithms2 D. Soudris, C.Z. Lolas*, and A. Thanailakis VLSI Design and Testing Center, Dept. of Electrical and Computer Eng. Democritus University of Thrace, 67 100 Xanthi, Greece, [email protected] *Department of Computer Science, University of Crete, P.O.Box 1470, Heraklion, Creece Abstract A new method for implementing low power architectures of DSP applications based on the energy recovery techniques is introduced. Exploiting the regular and iterative structure of DSP algorithms, the proposed architectures embody the principles of the reversible pipeline approach. Since a reversible pipeline system uses the “reverse” logic blocks in different time instances, we can reduce the hardware complexity significantly. An efficient multiplexing and clocking scheme preserve the reversible operation with small hardware cost. A series of proven lemmas specify the optimal characteristics of the chosen clocks, which ensure adiabatic operation and high speed. Furthermore, we specify the necessary conditions which ensure optimally-adiabatic algorithm mappings. The proposed approach is demonstrated by the low power architecture design of the convolution algorithm. 1. INTRODUCTION Low power dissipation has become a significant parameter for implementing efficient digital systems. The market demands for portable devices with high performance forced the designers to invent methods for reducing power dissipation in VLSI systems. Plethora low power design methods have been published, which can be divided into two main categories: i) the conventional low power design approaches [1, 2] and ii) nonconventional [3]. More specifically, the energy recovery is a non-conventional low power design approach, which is based on adiabatic switching principles [3,4,5]. Using adiabatic techniques, the signal transfer between circuit capacitances should be sufficiently slow, which implies that the energy dissipated as heat may be asymptotically low, during transfer. Moreover, the energy recovery systems can recycle back to a power source, the remaining energy stored on circuit capacitances. Two approaches for 2 This work was partially supported by the LPGD ESPRIT IV 25256 project of EU Public Page 17 Proceedings of second Workshop LPGD/WP1/DUTH+UP+INTRACOM/D1.2R1 designing multi-block energy recovery systems have been reported: i) the Retractile Cascade [3] and ii) the Reversible Pipeline (RP) [3,5,6]. In particular, since the throughput of a Retractile Cascade system is inversely proportional to the number of used blocks, composite systems with large number of blocks are become impractical. Using partially overlapped clocks and the RP approach, the throughput of the system can be improved. However, the large amount of the required hardware (we need a reverse block for each forward logic block) derived inefficient implementations, mostly for systemand/or architecture-level design. In [7], the whole design of a FIR filter starting from the algorithm level and ending to layout level was presented. There were comparisons with conventional FIR designs and the resulting architectures had very positive features. However, it cannot be seen as a design methodology for DSP algorithms. In this paper, a new methodology for designing low power architectures based on the energy recovery technique, is introduced. The proposed methodology consists of two fundamental steps: i) Determination of the appropriate direction and schedule vectors and ii) Determination of the optimal clocking scheme. More specifically, the first step specify the array architecture. The latter step allows the redesign of clocking strategy resulting into hardware efficient architectures with energy recovery features . The proposed architectures implement a certain class of DSP algorithms called as Weak Single Assignment Codes (WSACs) [8, 9], such as Discrete Fourier/Cosine Transform, matrix-matrix multiplication, and convolution. These algorithms can be implemented, applying conventional design methodologies, by regular, modular, and iterative architectures [9, 10], which consists of many identical blocks (or processing elements (PE)). The derived architectures are characterized by reversible pipeline operation and small hardware complexity. A typical reversible pipeline system consists of a series of “forward” logic blocks and a series of the corresponding “reverse” pipeline logic blocks. Therefore, it is possible to merge the identical parallel blocks to one, employing appropriate time multiplexing of the common block. A set of lemmas specify the optimal characteristics of the chosen clocks, which ensure adiabatic operation and high speed. The effectiveness proposed method is illustrated by the architecture level design of the convolution algorithm. 2. The Basic Design Concept The Reversible Pipeline principle was described in detail manner by Athas et al [4]. A typical Reversible Pipeline system comprises: i) N forward logic blocks, F1, F2,..., FN, ii) N reverse logic blocks and iii) N power-clocks Φ1, Φ2,..., ΦN, as is it depicted in Fig. 1. Concerning the Reversible Pipeline operation, the role of the reverse blocks is to recover circuit energy back to power supply and do not contribute to calculation process itself. In other words, their only usage is to establish a path to recover the energy back to Public Page 18 Proceedings of second Workshop LPGD/WP1/DUTH+UP+INTRACOM/D1.2R1 the power source. Thus, if a forward logic block is removed from the pipeline, the system logic operation will be changed. In contrary, if a reversible logic block is removed, the calculation sequence will not be influenced. If some of the reverse blocks are identical, alternative energy paths are available in a reversible pipeline system. By definition, a reversible pipeline system uses the reverse logic blocks in different time instances as the calculation and pipeline proceeds from a logic block, Fm to the next Fk one. Therefore, the hardware utilization of the reverse blocks is very small, i.e. 1/N %. To improve the hardware utilization, which coincides with the hardware reduction, we can merge appropriate groups of reverse logic blocks to one block. Since the hardware-reduced system should preserve the reversible pipeline principles (i.e. energy recovery), an component, which multiplexes in-time the paths of the reverse blocks and manipulates the associated clocks of the merged blocks in effective fashion, should be used. Practically, the implementation of the time multiplexed energy recovery is performed by appropriately-designed multiplexers (MUXs). The derived system can function as a RP system with less hardware complexity. A systematic methodology for designing the structure of the multiplexer and the characteristics of the multiplexed clocks will be described in the following sections. "forward" logic blocks ƒ ƒ ƒ ) ) )  ,QSXW   ƒP    ƒ1 ƒN ... )P ... )N ... )1 ... )P ... )N ... )1 2XWSXW D )   ƒ  )   ƒ  )   ƒ   ƒP   ƒN   ƒ1  "reverse" logic blocks Figure 1. The general structure of a reversible pipeline architecture [4] 3. The Proposed Design Methodology 3.1 THE Mapping Procedure The proposed methodology aims at the architecture level design of DSP algorithms, deriving implementations which exhibit energy-recovery features. Although the reversible pipeline exhibits several advantages a major obstacle regarding the overall hardware complexity of an implemented circuit restricts the applicability of this technique to a limited number of applications. Additional limitations are: i) the lack of generalized method to reverse a function and ii) many useful functions are not reversible, e.g. AND, OR, etc. The latter characteristic implies that the logic level expressions are not always reversible. To overcome the reversible pipeline limitations, we are working in Public Page 19 Proceedings of second Workshop LPGD/WP1/DUTH+UP+INTRACOM/D1.2R1 the architecture level because the basic operators of the algorithm expressions are commutative and associative. In order to map an iterative algorithm onto a regular array architecture there exists many alternatives, i.e. many sets of projection and schedule vectors [9,10]. However, only a subset is suitable for designing power efficient array architectures. Studying the reversible pipeline concept, we infer that each algorithm should be input and output of the same processing element. This requirement is satisfied by the fact a DSP algorithm can be expressed in terms of Uniform Recurrent Equations [9,10]. The mapping procedure of an algorithm results into a power efficient architecture & & & when the number of the products d × ei ≠ 0 , i = 1,2,..., N is maximum, where d is the & direction vector and ei is the i-th edge vector of the Dependence Graph. 3.2 Method for Selecting Optimal Clocking Scheme In the beginning, we define some variables, which will help the formal description of the proposed clocking scheme. Referring to Fig. 2, we define: i) T = raising/falling time of a clock, ii) SET = time interval with length s, iii) HOLD = time interval with length h, iv) RESET = time interval with length r, v) IDLE = time interval with length i, and vi) DIFFERENCE = time difference interval between two clocks with length d. +2/' 5(6(7 6(7 7 V 7 K U ,'/( L Figure 2. The clock characteristics of a a reversible pipeline system. Since the raising/falling edges occur during the SET/RESET intervals, we deduce that: s = r = T . Without loss of generality, it is assumed that the lengths of HOLD, IDLE, and DIFFERENCE intervals can be expressed in terms of T. That is: i) h = x T , ii) i = y T , iii) d = z T , where x, y ∈ R + − {0}, and (R+ is the positive of real numbers). In general, any two clocks used in RP systems are identical and have equal phase difference. These features arise from the fact that the clock implementation is easier . Due to the fact that the difference between two successive blocks is assumed to be constant, d, the time shift between the two clocks Φm and Φk is: (m - k) z T (1) where m > k with m, k ∈ Z + − {0,1} (Z is the set of positive integer numbers). During the energy recovery operation of a stage Fm (Fig.1), the rising clock edge of the next stage Fm+1, Φm+1, drives the inputs of the block Fm−1 and open the path it constitutes. Then, the falling edge of the clock Φm-1 attached on the reverse block Fm−1 Public Page 20 Proceedings of second Workshop LPGD/WP1/DUTH+UP+INTRACOM/D1.2R1 discharges the stage. To merge the reverse blocks Fm−1 and Fk−1 , a designer should ensure that the energy recovery operation of each block occurs in different time intervals. This implies that the falling edges of the clocks Φm-1 and Φk-1 attached to the reverse blocks Fm-1 and Fk-1 and the rising edges of the clocks Φm+1 and Φk+1 of the next blocks Fm+1-1 and Fk+1-1, occur in different time instances. In Section 2.2, it is mentioned that there is a critical point concerning the adiabatic operation of the multiplexers. All the input signals of MUXes are clocks and change state periodically passing a clock signal to their output, one input at a time. If a clock signal swings from its IDLE value, V0, to its HOLD value, V, and does not return back to V0 before the multiplexer change state, is it possible to occur a non-adiabatic energy dissipation during the successive clock period. This happens because the voltage drop (V-V0) across the multiplexer results into higher power dissipation. To avoid this problem the signal clocks, which pass through the MUXes should satisfy the condition: if the first clock is in “high” logic level, the other clock remains in its IDLE value and vice versa. In other words, the pulses of the clock-signals should not be overlapped. The clocks (denoted as Ti), which control MUXes have to keep their logic values `0` and `1` for the whole time interval of each pulse. The following lemmas facilitate the extraction of fast and secure solutions about the appropriate clocking scheme for the optimization technique. Lemma 1: Let s, h, r and i be a set of the time intervals of a clock pulse. The relationship between the first three intervals and last one is given: 2+x ≤ y (2). Lemma 2: Two successive clock signals are partially overlapped if it holds: 1≤ z < x (3). Lemma 3: Let Fm and Fk, be two identical reverse blocks and Öm and Φk be the associated clocks, which are non-overlapped. The phase difference between Φm and Φk should meets the formulas: (m - k) z ≥ 2 + x (4) Lemma 4: Let Fm and Fk, be two identical reverse blocks and Φm and Φk be associated clocks, which are non-overlapped. The phase difference between Φm and Φk should meets the formula: (m - k) z ≤ y (5) Given the values m and k of the identical logic blocks and (2), (4) and (5), we can determine a clocking scheme choosing one of the possible combinations of x, y, and z, which satisfy the inequalities: 2 + x ≤ (m - k) z ≤ y (6) 1≤ z<x (7) Public Page 21 Proceedings of second Workshop LPGD/WP1/DUTH+UP+INTRACOM/D1.2R1 Lemma 5. Given the inqualities (6) and (7), it holds that: i) y ≥ 2 (8) ii) (m - k) > 1 (9) The inequality (9) means that the merge of identical reverse blocks, which belong to successive stages cannot be performed. The clocks deriving from the above inequalities are suitable for merging every pair of identical blocks, which has the same relative position in the pipeline stream, i.e. constant difference ( m − k ) between any Fm and Fk. This important result has impact on those systems, which are characterized by iterative (repeated) identical structures. Such typical systems are the array processors [9,10], whose role are critical on several DSP applications. The proposed procedure can be easily generalized for RP systems, which contain groups with more than two identical blocks. The permissible numbers of identical blocks in each group are powers of 2, due to multiplexer features. In order to find the appropriate clocking scheme, we apply the inequalities (6) and (7) to all possible combinations (by pairs) of identical blocks. The clocks resulting from (6) and (7) do not lead necessarily to maximum speed or at least equal to the speed of a conventional RP system. Thus, the designer should examine how can maximize the speed of the whole system. It has been already mentioned that the factors, which influence the features of a clock pulse are T, x, y, and z. More specifically, in circuit level, the speed depends on the rising/falling time T. In system level, the speed depends on the time interval, t1, needed to obtain the first output (i.e. latency) and the time interval, t2, in which a logic block can load new inputs after the execution of a computation (i.e. throughput). The time interval t1 and t2 can be expressed in terms of T, x, y, and z as follows: t1 = a z T (10) t2 = (2+x+y)T (11) where a is the number of the forward logic blocks. Therefore, from all the possible solutions of (6) and (7), the set of x, y, and z, which minimizes eq. (10) and (11) result into maximum speed. 4. HARDWARE COMPLEXITY/REDUCTION In order to estimate the total hardware complexity of the proposed pipeline implementation, we have to explore the hardware complexity of the additional components, i.e. the multiplexers. A fully-adiabatic multiplexer, which is based on a fully adiabatic tree decoder designed by T-gate logic [3] is selected. In order to estimate the hardware savings of the proposed system, we define the total Relative Hardware Saving (RHST), which can be expressed by a more specific Public Page 22 Proceedings of second Workshop LPGD/WP1/DUTH+UP+INTRACOM/D1.2R1 formula, if all the blocks of the RP system are identical, for instance, the array processor architectures (systolic architectures) [9,10]. It is proved that: RHST =1 − 1 2( M + N + 1) − g NT (12) where: A = the number of reverse blocks of the whole system, NT = the number of transistors of each block, g = the number of blocks of each group, and C = the total number of groups. A significant conclusion comes from eq. (12): if all the blocks are identical, RHST is expressed in terms of a block characteristics and the number of the identical blocks of each group, and it is independent from of the total number of the reverse blocks. The RHST is becoming better as g is increasing. However, if g increases, (m-k) becomes larger as well as the values x, y, and z of (6) and (7). This fact implies increased latency, t1, and thus lower speed. Depending on the application requirements, the designer should make the appropriate trade-offs between RHST and speed, choosing the suitable value of g. 5. APPLICATION The features and the advantages of the proposed modified RP approach are illustrated by the well-known DSP algorithm of convolution. It has been proven that the convolution can be implemented by regular architectures [9,10]. Since an array implementation includes many identical PEs (e.g. multiplier-accumulation units), the proposed method can produce low power systems, which embody the properties of the reversible pipeline with affordable hardware complexity. The Dependence Graph and the Signal Flow Graph of the convolution are shown in Fig. 3(a) and (b), respectively. The convolution array processor can operate in a RP fashion because: i) the whole system exhibits the pipeline property [10] and ii) the function of the identical blocks (or processing elements) is reversible. The structure of the “forward” and “reverse” PE is shown in Fig. 3(c) and (d). Having the forward and the reverse logic blocks (processing elements) it is possible to design a convolution array processor operating as a RP system. This system is partially adiabatic and it is designed to recover the external signal energy, that is the energy of the input/output signals of PEs. For comparison reasons, we choose a specific array processor, which implement the convolution of two sequences of eight (A=8) points, with certain 32×32-bit multiplieraccumulator unit (i.e. PE) of 28,500 transistors [11]. Thus, the MUX has M=N=96 input/output bit lines. Hence, from eq. (12), it is obtained: RHTT = 0,986 − 1 g (13) Public Page 23 LPGD/WP1/DUTH+UP+INTRACOM/D1.2R1 \ \  X X \  \  \  \  \ \ \    '  '  \ '  '  \ X '  \  X  y (i+1 ) (c) '  ' '  Z L X L \ L '  \ ; y (i ) '  \ u (i ) w (i+1 ) '  X w (i ) '  \ X \   X X \  ;  Z L \ L G '  '  ' Z  Z  Z  Z  Z  Z Z  Z  Z Z  (a) Z   ... \  ... Proceedings of second Workshop  (b) Figure 3. (a) Index Transformed Dependence Graph for convolution and (b) Signal Flow Graph obtained by projection along [0 1], (c) The architecture of the “forward” PE, and (d) the “reverse” PE. There are two different ways for merging identical the blocks of the Signal Flow Graph, that is: g=2 and g=4. For arbitrary length of the convolved sequences, Figure 4 depicts the relationship between RHST and the number of blocks of each group, g. 5+77                    J Figure 4. The relationship between RHST and the number of blocks of each group, g. Groups of two There are two ways for merging the reverse blocks of this system: i) to merge blocks by pairs having m-k = 2 (i.e. blocks 1-3, 2-4, 5-7, 6-8) and ii) to merge blocks with m-k =4 (i.e. blocks 1-5, 2-6, 3-7, 4-8). From the proposed method, choosing m-k = 2, we result into an architecture (Fig. 5) with higher speed. The optimal clocking scheme, Public Page 24 Proceedings of second Workshop LPGD/WP1/DUTH+UP+INTRACOM/D1.2R1 which results from (8) and (9), has x = 2.2, y = 4.2 and z = 2.1. The associated values of RHST, throughput and latency are in Table1. ƒ ƒ 1 ƒ ƒ 3 ƒ 4 ƒ 5 ƒ 6 ƒ 7 8    08;V      08; 0 1 0 0 08;V 1 1 0 0 08;V 1 1 0 08;V 1 0 08;V 1 1 08; Œ 08; ƒ ƒ Œ 08; ƒ ƒ Œ 08; ƒ 71 0 ƒ 08; ƒ 72 Œ ƒ Figure 5. The reversible-pipeline array processor architecture of convolution with g=2. Groups of four The merging of the eight reverse blocks with groups of four (g=4) has only one combination of identical blocks, that is, 1-3-5-7 and 2-4-6-8. The deriving system is shown in Fig. 6. Similarly, from (8) and (9) we infer that the minimum solution is satisfied by x = 2.2, y = 12.6 and z = 2.1. The values of RHST, throughput and latency are depicted in Table1. In conclusion, the merge as many as possible identical blocks results into larger hardware reduction at the expense of the architecture latency. G RHST Throughput (t1) Latency (t2) 2 48.6% 16.8 T 8.4 T 4 73.6% 16.8 T 14.8 T Table 1. The hardware reduction and timing requirements of two linear array processors. 6. CONCLUSIONS A new systematic methodology for implementing low power architectures, which are based on adiabatic techniques, was presented. The derived architectures exploit the properties and features of the reversible pipeline and exhibit significantly-reduced hardware complexity. A series of proven lemmas determine the exact and optimal characteristics of the used clocks. The efficacy of the proposed architectures was illustrated by a certain DSP algorithm (convolution). Public Page 25 Proceedings of second Workshop ƒ1 ƒ ƒ3 LPGD/WP1/DUTH+UP+INTRACOM/D1.2R1 ƒ4 ƒ5 ƒ6 ƒ7 ƒ8        08;V       08;V         08;V        08;     08; ƒ1 ƒ3 ƒ5 ƒ7 71 72 ƒ2 ƒ4 ƒ6 ƒ8 Figure 6. The reversible-pipeline array processor architecture of convolution with g=4. REFERENCES [1] J. Rabaey and M. Pedram, “Low Power Design Methodologies,” Kluwer Academic Publishers, 1996. [2] A.P. Chandrakasan and R.W. Brodersen, “Low-power Digital CMOS Design” Kluwer Academic Publishers, 1995. [3] W. Athas, “Energy-Recovery CMOS”, Book Chapter in “Low Power Design Methodologies” editors J.M. Rabaey and M. Pedram, Kluwer Academic Publishers, pp. 65-100, 1996. [4] W.C. Athas, L. Svensson, J.G. Koller, N. Tzartzanis, and E.Y.C. Chou, “Low-Power Digital Systems Based on Adiabatic-Switching Principles,” in IEEE Trans. on VLSI Systems, vol. 2, no.4, Dec. 1994. [5] S.G. Younis and T.F. Knight, Jr., “A asymptotically zero energy split-level charge recovery logic,” in Proc. of IEEE Int. Workshop on Low Power Design, 1994, NapaValley, USA. [6] K. Jung and W. Kim, “A Logic gate for Reversible Pipelining,” in Proc. of IEEE Int. Workshop on low Power Design, 1997, pp. 1928-1931. [7] W.C. Athas, W-C Liu, and L. Svensson, “Energy-recovery CMOS for Highly Pipelined DSP Designs,” in Proc. of Int. Symp. on Low Power Electronics and Design (ISLPED), pp. 101-104, Monterey, CA, August 12 - 14 1996. >@ 95R\FKRZGKXU\6.5DR/7KLHOHDQG7.DLODWK2QWKH/RFDOL]DWLRQRI DOJRULWKPVIRU9/6,SURFHVVRUDUUD\VLQ9/6,6LJQDO3URFHVVLQJ,,,HGLWHGE\ 5%URDGHUVHQDQG+0RVFZRYLW],(((3UHVV1HZ<RUNSS [9] '6RXGULVet al2QWKH'HVLJQRI7ZR/HYHO3LSHOLQHG3URFHVVRU$UUD\VLQ $SSOLFDWLRQ'ULYHQ$UFKLWHFWXUH6\QWKHVLV)&DWWKRRUDQG/6YHQVVRQHGLWRUV .OXZHU$FDGHPLF3XEOLVKHUV-XQH [10] S.Y. Kung, “VLSI Array Processors,” Prentice Hall, Eaglewood Cliffs, 1988. [11] H. Murakami, K. Yano, Y. Ootaguro, Y. Sugeno, M. Ueno, Y. Muroya, and T. Aramaki, “A multiplier-accumulator macro for a 45 MIPS embedded RISC processor,” in IEEE J. Solid State Circuits, July 1996, 31, pp. 1067-1071. Public Page 26

References (24)

  1. S. Theoharis, G. Theodoridis, N. Zervas, and C. Goutis, "Accurate And Fast Power Estimation Of Large Combinational Circuits," in Proc. of 9WK ,QW :RUNVKRS 3RZHU $QG 7LPLQJ 0RGHOLQJ 2SWLPL]DWLRQ $QG 6LPXODWLRQ 3DWPRV C9 Kos, Greece, 2FWREHU 8, 1999, pp. 199-208.
  2. D. Soudris, C. Z. Lolas, And A. Thanailakis, "A Modified Energy Recovery Technique For Implementing DSP Algorithms," In Proc. of 9 th ,QW :RUNVKRS 3RZHU $QG 7LPLQJ 0RGHOLQJ 2SWLPL]DWLRQ $QG 6LPXODWLRQ 3DWPRV C9 Kos, Greece, 2FWREHU 8, 1999, pp. 113-122. References
  3. J. Rabaey and M. Pedram, "Low Power Design Methodologies," Kluwer Academic Publishers, 1996.
  4. F. Najm, "A Survey of Power Estimation Techniques in VLSI circuits" in IEEE Trans. on VLSI, vol 2, no 4, pp. 446-455, December 1995.
  5. F. Najm, "Transition Density: A new measure of activity in digital circuits," in IEEE Trans. on CAD, Vol. 12, No. 2, pp. 310-323, February 1995.
  6. B. Kapoor, "Improving the accuracy of circuit activity measurement", in Proc. of Int. Workshop on Low Power Design, pp. 111-116, April 1994.
  7. P. Schneider and U. Schlichmann, "Decomposition of Boolean functions for low power based on a new power estimation technique," in Proc. of IWLPD, pp. 123-128, April 1994.
  8. R. Marculescu, D. Marculescu, and M. Pedram "Efficient Power estimation for highly correlated input streams," in Proc. of DAC, pp.628-634, 1995.
  9. F. Najm, R. Burch, P.Yang and I. Hajj "Probabilistic Simulation for Reliability Analysis of CMOS VLSI circuits," in IEEE Trans. on CAD, 9 (4) pp. 439-450, Apr. 1990.
  10. J. Monteiro, A. Ghosh, S. Devadas, K. Keutzer, and J. White, "Estimation of average switching activity in combinatorial and sequential circuits", in IEEE Trans. on CAD, Vol. 16, No.1, pp. 121-127, January 1997.
  11. J.C. Costa, J.C. Monteiro, and S. Devadas, "Switching Activity Estimation using Limited Depth Reconvergent Path analysis", In Proc. of ISLPD, pp. 184-189, 1997.
  12. R. Bryant, "Graph-based algorithms for Boolean function manipulation", in IEEE Trans. on Computers, C-35:677-691, August 1986.
  13. J. Lind-Nielsen, "BuDDy -A Binary Decision Diagram Package", IT-TR : 1999-028, Technical University Of Denmark, 1999. Reprinted from Proceedings of 9WK ,QW :RUNVKRS 3RZHU $QG 7LPLQJ 0RGHOLQJ 2SWLPL]DWLRQ$QG6LPXODWLRQ3DWPRVC9Kos, Greece, 2FWREHU8, 1999.
  14. D. Soudris, C. Z. Lolas, And A. Thanailakis, "A Modified Energy Recovery Technique For Implementing DSP Algorithms," in Proc. of 9WK,QW:RUNVKRS3RZHU$QG7LPLQJ 0RGHOLQJ 2SWLPL]DWLRQ $QG 6LPXODWLRQ 3DWPRV C9 Kos, Greece, 2FWREHU 8, 1999, pp. 113-122. REFERENCES
  15. J. Rabaey and M. Pedram, "Low Power Design Methodologies," Kluwer Academic Publishers, 1996.
  16. A.P. Chandrakasan and R.W. Brodersen, "Low-power Digital CMOS Design" Kluwer Academic Publishers, 1995.
  17. W. Athas, "Energy-Recovery CMOS", Book Chapter in "Low Power Design Methodologies" editors J.M. Rabaey and M. Pedram, Kluwer Academic Publishers, pp. 65-100, 1996.
  18. W.C. Athas, L. Svensson, J.G. Koller, N. Tzartzanis, and E.Y.C. Chou, "Low-Power Digital Systems Based on Adiabatic-Switching Principles," in IEEE Trans. on VLSI Systems, vol. 2, no.4, Dec. 1994.
  19. S.G. Younis and T.F. Knight, Jr., "A asymptotically zero energy split-level charge recovery logic," in Proc. of IEEE Int. Workshop on Low Power Design, 1994, Napa- Valley, USA.
  20. K. Jung and W. Kim, "A Logic gate for Reversible Pipelining," in Proc. of IEEE Int. Workshop on low Power Design, 1997, pp. 1928-1931.
  21. W.C. Athas, W-C Liu, and L. Svensson, "Energy-recovery CMOS for Highly Pipelined DSP Designs," in Proc. of Int. Symp. on Low Power Electronics and Design (ISLPED), pp. 101-104, Monterey, CA, August 12 -14 1996.
  22. '6RXGULVet al2QWKH'HVLJQRI7ZR/HYHO3LSHOLQHG3URFHVVRU$UUD\VLQ $SSOLFDWLRQ'ULYHQ$UFKLWHFWXUH6\QWKHVLV)&DWWKRRUDQG/6YHQVVRQHGLWRUV .OXZHU$FDGHPLF3XEOLVKHUV-XQH
  23. S.Y. Kung, "VLSI Array Processors," Prentice Hall, Eaglewood Cliffs, 1988.
  24. H. Murakami, K. Yano, Y. Ootaguro, Y. Sugeno, M. Ueno, Y. Muroya, and T. Aramaki, "A multiplier-accumulator macro for a 45 MIPS embedded RISC processor," in IEEE J. Solid State Circuits, July 1996, 31, pp. 1067-1071.
About the author
Papers
17
Followers
1
View all papers from Chrissavgi Drearrow_forward