LPGD
A Low-Power Design Methodology/Flow and its Application to the
Implementation of a DCS1800-GSM/DECT
Modulator/Demodulator
(ESPRIT 25256)
“Proceedings of second Workshop”
D. Soudris1, C. Dre2, and C. Goutis3
1
Democritus University of Thrace (DUTH)
2
Intracom
3
University of Patras (UP)
Editor:
DUTH+UP+INTRACOM
Type:
Report
LPGD Id:
LPGD/WP1/DUTH+UP+INTRACOM/D1.2R1
CEC Identifier:
EP25256/ DUTH+UP+INTRACOM/D1.2R1
Document Version: 1
Status:
Deliverable
Confidentiality:
Public
Actual Date:
December 31, 1999
Contractual Date: November 30, 1999
WorkPackage:
WP1
Keywords:
logic-level estimation, real-delay gate model, RTL
optimization, energy recovery, array processor
Abstract:
In this IC&D report, we provide two research works entitled:
i) “Accurate And Fast Power Estimation Of Large
Combinational Circuits” DQG LL “A Modified Energy
Recovery Technique For Implementing DSP Algorithms,”
ZKLFK ZHUH LQFOXGHG LQ WKH 3URFHHGLQJV RI WK ,QW
:RUNVKRS 3RZHU DQG 7LPLQJ 0RGHOLQJ 2SWLPL]DWLRQ
DQG 6LPXODWLRQ 3DWPRV C9 Kos, Greece, 2FWREHU 8,
1999.
Title:
Author(s):
Copyright © 1999
INTRACOM
UP
DUTH
Proceedings of second Workshop
Date
December 31, 1999
LPGD/WP1/DUTH+UP+INTRACOM/D1.2R1
History
Version
1
Comments
Deliverable
Purpose /Scope
The purpose of the present Information Capturing and Dissemination report is to provide two papers entitled: i)
“Accurate And Fast Power Estimation Of Large Combinational Circuits” >@ DQG LL “A Modified Energy
Recovery Technique For Implementing DSP Algorithms,” [2], ZKLFK ZHUH LQFOXGHG LQ WKH 3URFHHGLQJV RI WK
,QW :RUNVKRS 3RZHU DQG 7LPLQJ 0RGHOLQJ 2SWLPL]DWLRQ DQG 6LPXODWLRQ 3DWPRV C9 Kos, Greece,
2FWREHU 8, 1999, pp. 199-208.
The published material includes two low power techniques, the first of which aims at the power estimation of
large gate level circuits under a real delay gate model. The second approach is an RTL power optimization
technique for implementing DSP applications based on an area-efficient energy recovery technique.
Here, it should be stressed that the General Chair of PATMOS 99 is also the responsible person for the
Information Capturing and Dissemination of LPGD. In the context of PATMOS 99 there was close co-operation
between PATMOS 99 organizing committee and DIMES (Prof. R. Nouta and Prof. R. Van Lecken ). Except that
report, we attached the Proceedings and the Technical Program of PATMOS 99.
Publications
[1] S. Theoharis, G. Theodoridis, N. Zervas, and C. Goutis, “Accurate And Fast Power Estimation Of Large
Combinational Circuits,” in Proc. of 9WK ,QW :RUNVKRS 3RZHU $QG 7LPLQJ 0RGHOLQJ 2SWLPL]DWLRQ $QG
6LPXODWLRQ 3DWPRV C9 Kos, Greece, 2FWREHU 8, 1999, pp. 199-208.
[2] D. Soudris, C. Z. Lolas, And A. Thanailakis, “A Modified Energy Recovery Technique For Implementing
DSP Algorithms,” In Proc. of 9th ,QW :RUNVKRS 3RZHU $QG 7LPLQJ 0RGHOLQJ 2SWLPL]DWLRQ $QG
6LPXODWLRQ 3DWPRV C9 Kos, Greece, 2FWREHU 8, 1999, pp. 113-122.
Public
Page 2
Proceedings of second Workshop
LPGD/WP1/DUTH+UP+INTRACOM/D1.2R1
TABLE OF CONTENTS
ACCURATE AND FAST POWER ESTIMATION OF LARGE COMBINATIONAL CIRCUITS......................5
ABSTRACT ....................................................................................................................... .....................................5
1. INTRODUCTION................................................................................................................ ...............................5
2. PROBLEM FORMULATION ............................................................................................................................6
3. SWITCHING ACTIVITY COMPUTATION .............................................................................................. .......8
4. EXPERIMENTAL RESULTS ........................................................................................................ ..................11
6. CONCLUSIONS ...............................................................................................................................................14
REFERENCES ..................................................................................................................... .................................14
A MODIFIED ENERGY RECOVERY TECHNIQUE FOR IMPLEMENTING DSP ALGORITHMS.............. 17
ABSTRACT .......................................................................................................................................................... 17
1. INTRODUCTION ...........................................................................................................................................17
2. THE BASIC DESIGN CONCEPT.................................................................................................................... 18
3. THE PROPOSED DESIGN METHODOLOGY...............................................................................................19
3.1 THE MAPPING PROCEDURE ............................................................................................................................. 19
3.2 METHOD FOR SELECTING OPTIMAL CLOCKING SCHEME................................................................................... 20
4. HARDWARE COMPLEXITY/REDUCTION..................................................................................................22
5. APPLICATION ................................................................................................................................................. 23
6. CONCLUSIONS ...............................................................................................................................................25
Public
Page 3
Reprinted from Proceedings of 9WK ,QW :RUNVKRS 3RZHU $QG 7LPLQJ 0RGHOLQJ
2SWLPL]DWLRQ$QG6LPXODWLRQ 3DWPRVC9 Kos, Greece, 2FWREHU8, 1999.
S. Theoharis, G. Theodoridis, N. Zervas, and C. Goutis, “Accurate And Fast Power
Estimation Of Large Combinational Circuits,” in Proc. of 9WK ,QW :RUNVKRS 3RZHU
$QG 7LPLQJ 0RGHOLQJ 2SWLPL]DWLRQ $QG 6LPXODWLRQ 3DWPRV C9 Kos, Greece,
2FWREHU8, 1999, pp. 199-208.
Proceedings of second Workshop
LPGD/WP1/DUTH+UP+INTRACOM/D1.2R1
Accurate And Fast Power Estimation Of Large Combinational Circuits1
S. Theoharis, G. Theodoridis, N. Zervas, and C. Goutis.
VLSI Design Lab, Dept. of Elect. and Comp. Eng.
University of Patras, Rio 26 110, Greece
Abstract
A novel probabilistic method to estimate the switching activity of a logic circuit under
a real delay gate model, is introduced. Based on Markov stochastic processes and
generalizing the basic concepts of zero delay-based methods, a novel probabilistic model
to estimate accurately the power consumption, is developed. More specifically, a set of
new formulas, which describe first-order temporal correlation, under real delay model,
are derived. The chosen gate model allows accurate estimation of the functional and
spurious (glitches) transitions, leading to accurate power estimation. The proposed
approach manipulates efficiently large circuits employing a new partitioning heuristic,
which estimates the switching activity with reduced computational complexity.
Comparative study and analysis of benchmark circuits demonstrates the accuracy and
efficiency of the proposed method.
1. Introduction
The development of competitive market sectors such as wireless applications, laptops
and portable medical devices make the power dissipation the most important parameter in
the modern VLSI design field[1]. The average number of transitions of the circuit nodes
can express the average power consumption of a circuit. A variety of probabilistic power
estimation methods have been developed to estimate the power consumption of the
combinational circuits [2]-[9]. In this methods probabilistic properties, such as the
average steady state and the average transition probability of the circuit nodes, are used in
order to evaluate the transitions of any circuit node. The above methods can be classified
according to assumed gate delay model in the following categories i) the zero delay
methods [3]-[6], where only the functional transitions are considered and the ii) real gate
delay methods [7]-[9], where both the functional and spurious transitions are taken into
account.
1
This work was partially supported by the LPGD ESPRIT IV 25256 project of EU.
Public
Page 5
Proceedings of second Workshop
LPGD/WP1/DUTH+UP+INTRACOM/D1.2R1
Hence, real gate delay model is needed for accurate power estimation. In addition
temporal correlation, input pattern dependencies and spatial (structural) dependencies
have to be considered to estimate accurate the power dissipation. In [7] the concept of the
probability waveforms to estimate the average power dissipation of a combinational
circuit was introduced. Given the probability waveforms of the primary inputs, a
propagation algorithm of these waveforms through the circuit to evaluate the
corresponding waveforms of the internal circuit nodes was developed. However, both the
structural and input dependencies were not taken into consideration. Additionally, the
efficiency of the algorithm for large circuits is not proved by the presented experimental
results. In [8], a symbolic simulation algorithm has been proposed. Having the switching
activities of the primary inputs and using global Ordered Binary Decisions Diagrams
(OBDDs), where each node is expressed by the primary input signals, the transition
probability of a node at time t resulted from the probability of a new boolean function.
The new boolean function is derived by XOR’ing the boolean functions that correspond
to two successive switching time instances, i.e. t and t+1. This method handled the
structural and the first-order temporal correlations, but the input pattern dependency was
not included. Moreover, this method is inefficient for large circuits as global OBDDs are
used. A new method for calculating the transition probabilities was proposed in [9]. The
structural and the first order temporal correlations of the primary inputs were captured. In
order to capture large circuits the method is parameterized in terms of the depth of the
circuit levels that are used for reconvergent structural analysis.
In this paper, a novel probabilistic method for accurate power estimation of large
combinational circuits is introduced. Taking into account simultaneous multiple-input
transitions, temporal and structural correlation of the circuit and assuming that primary
inputs are independent (i.e. the input data correlations do not taken into account), the
proposed method calculates the switching activity of any logic circuit node under a real
delay gate model. The main objective is the development of a new mathematical model to
estimate accurately the power consumption of a logic level circuit under real delay model
generalizing fundamental principles of zero delay-based methods. Since we assume a real
delay gate model, the time parameter influences the logic behavior of a circuit node. We
propose a new heuristic for handling efficiently large combinational circuits. The
proposed method is compared with the method reported in [9]. Employing a set of
benchmark circuits, the comparison results prove the efficiency of the proposed method.
2. Problem Formulation
It is assumed that the combinational circuit is a part of a synchronous sequential
circuit, which means that its inputs can switch synchronously with the clock, performing
at most one transition at time t=0 during the clock period [0,T).
Public
Page 6
Proceedings of second Workshop
LPGD/WP1/DUTH+UP+INTRACOM/D1.2R1
The power estimation problem of a combina-tional logic circuit, under real gate delay
model can be stated as follows: “Given the gate level description of a combinational
circuit with n inputs and m outputs and the inertial delays of its gates, and, assuming that
the time between two successive applied input vectors is greater or equal to the settling
time of the circuit, estimate the average power consumption of the circuit for an input
vector stream through the calculation of its average switching activity.”
The output of a gate generates a glitch, if two conditions are satisfied: i) the necessary
condition, which requires the difference of the transition arrival times of the input signals
to be equal or greater than the inertial delay of the gate and ii) the sufficient condition,
which requires the appropriate transitions of the input signal(s) to switch the gate output.
Consequently, the time parameter plays a critical role in the power consumption
estimation. That is, the exact description of a logic circuit should include not only its
logic behavior, but also the time dependence among the logic signals.
Since the glitch generation is strongly dependent on the time, a “modified” boolean
function, from now on called timed boolean function, which can describe the logic and
timing behavior, is needed.
[
1
n1
x2
1
f
Figure 1 : The logic circuit with Unit Delay AND gates.
Example: We assume a logic circuit with gate delay equals to one delay unit, d, as it
shown in Figure 1. The logic behavior of the node f can be described in time domain as
follows:
f = F ( x1 , x 2 , t ) = x1 ( t − 2 d ) x 2 ( t − 2 d ) x 2 ( t − d )
(1)
The signal f may switch at two time instances, i.e. t 1 = d and t 2 = 2d . More
f
f
specifically, the transition of the signal f at t1, t 1 = d , depends on the transitions of the
f
primary inputs x1 and x2 at time points t1x1 = −d , t1x2 = −d , and t1x2 = 0 , while the
transition of f at t 2 = 2d depends on the transitions of the signals x1 and x2 at t2x1 = 0 ,
f
t 2x2 = 0 , and t2x2 = d .
Having as starting point eq. (1), a novel mathematical model, which describes the
behavior of a logic signal in terms of time, should be introduced. We aim at the
development of a new method, which reduces the power estimation problem considering
real delay model to a zero delay problem at certain switching time points. For that
purpose, we introduce new concepts and formulas, which express parameters of real
delay modeled power estimation problem in terms of zero delay parameters.
Public
Page 7
Proceedings of second Workshop
LPGD/WP1/DUTH+UP+INTRACOM/D1.2R1
3. Switching Activity Computation
In this section, we provide all the necessary definitions and the appropriate lemmas for
modeling the temporal correlation in time domain. More specifically, we begin with the
fact that a signal can be modeled as Markov stochastic process. Then, we provide the
appropriate material for temporal correlation modeling, extending the concept of
transition probability for certain time intervals.
The behavior of a binary signal, x, at a time point, t, i.e. x(t), is modeled as a random
variable of a time homogeneous, Strict Sense Stationary, lag-one Markov stochastic
process having two states, s, with s ∈S = {0,1} . Also, it is assumed that the Markov
process is finite, aperiodic, and irreducible with all states recurrent non-null [6].
The transition probability, p klx (t ) , expresses the probability of a signal x to perform a
transition from the state k to the state l within two successive time points t-1 and t. That
is:
p klx (t ) = p ( x(t − 1) = k ∧ x(t ) = l ) ∀ k , l ∈ S
(2)
The switching activity, E ( x, t ) , of a signal x at time instance t is given by:
x
x
E ( x, t ) = p 01
(t ) + p10
(t )
(3)
where p01x (t ) (or p10x (t ) ) is the transition probability of the signal x to perform the
transition from state 0 to 1 (or 1 to 0) at time instance t.
For the rest of the paper, the term "input signal" stands for primary input signal and the
term "signal" stands for either a primary input signal or an internal signal of the logic
circuit.
x
Definition 1. A Signal Transition Probability Vector, P (t ) , of a signal x at a time
x
instance t, is defined as the vector of all transition probabilities p kl (t ) , with k , l ∈ S :
[
P x ( t ) = p00x (t ), p01x ( t ), p10x (t ), p11x (t )
]
(4)
We introduce the transition probability of a signal x in time intervals (ti-1, ti) and (ti,
ti+1) as P x (ti− ) and P x (ti+ ) , respectively. Their corresponding values are computed by
the next lemma.
Lemma 1. The transition probability of a signal, x, at a time point t ′∈ {t − , t + } is
expressed in terms of the transition probability at time point t as follows:
P x (t ′) = f (P x (t ) )
(5)
and are computed by the following new formulas:
pkkx (t − ) = pkkx (t )+ pkx(1−k ) (t ) ∀ k ∈ S
Public
(5.1)
Page 8
Proceedings of second Workshop
LPGD/WP1/DUTH+UP+INTRACOM/D1.2R1
pllx (t + ) = pllx (t )+ p(x1−l ) l (t ) ∀ l ∈ S
(5.2)
pklx (t − ) = pklx (t ) = 0
(5.3)
∀ k, l ∈ S ∧ k ≠ l
Proof: Due to lack of space the proof is omitted here.
Definition 2: We define as Valid Time Points set, T = {t1 ,... , tr }, the transition time
x
points for a signal x, i.e. the time points that a transition may occur in the signal x.
The switching activity estimation problem is reduced to the estimation of
P x (t ) ∀ t ∈T x . For example the T f = {1,2} for the signal f of the circuit shown in Fig.
1. For an input signal x the T x = {0 − ,0,0 + }
Generally, a signal x of the logic circuit is a timed boolean function of a subset of v
signals, i.e. x (t ) = f (x1 (t ) , ... , xv (t ) ) . For instance, the nodes n1 and f shown in Fig. 1
have
as
timed
boolean
function
n1 (t ) = x1 (t − 1) x2 (t − 1)
and
f (t)=n1 (t −1) x2 (t −1) =x1 (t −2) x2 (t −2) x2 (t −1) respectively (one unit time delay is assumed).
For any timed boolean function, x (t ) = f (x1 (t ) , ... , xv (t ) ) , we can efficiently
construct the OBDD using the BDD package reported in [11]. The v variables that the
signal x depends on, at any time point, are called the support set of signal x(t). In order to
find out the support sets of all circuit nodes, a limited reconvergent path analysis is
performed. This analysis results into a new heuristic, which is based on the limited depth
spatial correlation of [9]. More specifically, the notion of active nodes introduced in [9] is
used in order to perform an initial partition for any circuit signal according to a user
defined parameter, called Depth. This parameter determines the depth in terms of logic
levels from each signal x that will be searched in order to determine if two paths starting
at x will reconverge. Spatial correlation corresponding to two paths starting at x that
reconvergence within Depth logic levels will be accurately taken into account. If
reconvergent paths meet after Depth logic levels, then they are assumed to be
independent. Even if the two paths meet within Depth logic levels, another user defined
parameter , called N_vars, limits the number of variables v , i.e v ≤ N_vars for any
circuit signal at any valid time point. The algorithm that limits the number of variables
consist a modified heuristic similar to that reported in [9].
A circuit node, f, switches at time t = i , if the derivative with respect to time of its
timed boolean function is equal to 1. Thus, the average switching activity, E ( f ) , in
probabilistic domain can be described by:
∂f (t)
E( f ) = p
∂t
t =tif
)
(
= 1 = p lim{f (tif − ε ) ⊕ f (tif + ε )}= 1
ε →0
Public
(6)
Page 9
Proceedings of second Workshop
LPGD/WP1/DUTH+UP+INTRACOM/D1.2R1
Instead of performing the XORing between the boolean functions corresponding to the
time intervals (ti-1, ti) and (ti, ti+1) we can manipulate efficiently the OBDD, that
corresponds to each time point, in order to evaluate the probability of switching at t = i .
Taking into consideration the representation of the ordinary boolean function on the
OBDD the evaluation of the transition k → l with k , l ∈ {0, 1 } at time instance ti can
f
be done as described by the following steps (using the same procedure as in [6]):
i)
Find the set of paths to state k, Π k (t = i ) , of the OBDD of the boolean function
corresponding to the timed boolean function on time point i,
ii)
Find the set of paths to state l, Π l (t = i ) , of the OBDD of the boolean function
corresponding to the timed boolean function on time point i,
iii)
Combine each path of Π k (t = i ) with all paths of Π l (t = i ) and extract the
switching behavior taking into account the temporal compatibility of each signal
v
using the following equation: p
x
kl
(t i ) = ∑ ∑ ∏
π ∈ Π k π ′∈ Π l
j =1
p kxiil i (t j )
(7)
The pseudo-code of the proposed switching activity estimation is shown in Fig. 2.
Switching_Activity_Estimation (Network, Vector , Depth, N_vars){
Gates = Topological_Sort ( Network ) ;
-- (1) : Perform a limited depth reconrvergent analysis
Find_Active_Nodes ( Gates , Depth ) ;
-- (2) : Find out the valid time points of each circuit signal.
For each primary input signal xi {
see [9]
T xi = {0} ;
P xi (0) = Analyze_Input_Vector( Vector , i ) ;
}
for each gate g in Gates {
m = delay of g ;
T g = NIL (LIST) ;
for each input gi of g {
for each time point k ∈ ∪ T gi {
T g = T g ∪ {k + ∆ } if k + m satisfies the necessary condition ;
}
}
}
-- (3) : Calculate the switching activity of any circuit node at any valid time point.
For each gate g in Gates {
for each time point k of T g {
Sup(g(k)) = Find_Support_Set ( g , k , Depth , N_vars ) ;
OBDD(g(k)) = Construct_OBDD ( Sup (g(k)) ) ;
Public
see [11]
Page 10
Proceedings of second Workshop
LPGD/WP1/DUTH+UP+INTRACOM/D1.2R1
P g (k ) = Combine_Paths( OBDD(g(k) ) ;
see [6]
}
}
}
Find_Support_Set ( g , k , Depth , N_vars ) {
Sup(g(k)) = Construct_Support_Set ( g, k , g.active_nodes ) ;
Depth’ = Depth ;
While | Sup(g(k)) | > N_vars {
Depth’ = Depth’ – 1 ;
Find_Active_Nodes ( g , Depth’ ) ;
Sup(g(k)) = Construct_Support_Set ( g , k, g.active_nodes’ ) ;
}
return(Sup(g(k)) ;
}
Figure 2 : Pseudo-code of the proposed method.
4. Experimental Results
The proposed power estimation method is implemented by ANSI C language, while its
efficiency is proved by a number of ISCAS'91 benchmark multilevel circuits. For
technology mapping, a library of primitive gates (i.e. NOT, BUFF, AND, NAND, OR,
NOR, XOR, XNOR) of up to 4 primary inputs, is used. All power estimations are
measured in ìW with 20 MHz clock frequency and 5 V power supply. The presented
results are obtained using the proposed method with Depth = 2 and 3 and N_vars = 8
and the method presented in [9] with the same partitioning depths.
For a circuit node ni (either primary input or internal node), we define as Real Node
Error the quantity Err (ni ) =
E (ni ) − E ′(ni )
where E (ni ) is the real switching activity of
E (ni )
node ni and E ′(ni ) is the estimated switching activity of this node. For a combinational
circuit with N gates and a specific input vector set Vj, we define as Total Power
Consumption the quantity Total Power (V j ) =
N
1 2
⋅ Vdd ⋅ f ⋅ ∑ E Eff (ni ) . The effective
2
i =1
switched capacitance, E Eff (ni ), is defined as the product of the switching activity E (ni )
and the total capacitance load of node ni, C ni = Fni C g where Fni is the fanout of this node
and Cg = 0.05 pF is a typical input gate capacitance. We define as Real Total Error the
quantity
Total Error (V j ) =
′
Total Power (V j )− Total Power (V j )
Total Power (V j )
Public
, as Real Mean Error the
Page 11
Proceedings of second Workshop
LPGD/WP1/DUTH+UP+INTRACOM/D1.2R1
N
quantity Mean Error (V j ) =
∑ Err (n )
i
i =1
N
and finally as Real Maximum Error the quantity
Max Error (V j ) = max{ Err (n1 ), Err (n2 ), ... , Err (n N ) }.
For comparison reasons, a random vector file of 100,000 vectors with temporal
correlations is generated for each benchmark circuit. More specifically any input bit xi
has a random static,
p(xi )∈[0,1] , and switching probability
E (xi )∈[0,1] and
E(xi )≠ 2 p(xi )(1 − p(xi ) ) . We compare the proposed method and the method reported in [9]
with Mentor's Graphics QUICKSIM II switch level simulator. The power consumption
differences between each method and switch level simulator are depicted in Table 2-5
(i.e. proposed method and [9]).
Circuit
C1355
C1908
C2670
C3540
C499
C6288
C7552
C880
alu4
cu
des
f51m
rca16
vda
# inputs
41
33
33
50
41
32
207
60
14
14
256
8
33
17
# outputs
32
25
64
22
32
32
107
26
8
11
245
8
17
39
# gates
180
205
422
817
176
1491
1154
221
543
104
2437
83
112
391
# levels
15
21
20
34
17
73
21
22
35
14
17
10
49
16
Table 1: Circuits characteristics.
C1355
C1908
C2670
C3540
C499
C6288
C7552
C880
alu4
cu
des
f51m
rca16
vda
Average
Power Prop. Power'
3149
3061
4322
4205
4827
4792
8864
8010
2710
2661
75583
94825
12457
11733
1722
1691
3858
3683
510
502
16987
16497
532
519
934
874
2465
2310
9923
11097
TOTAL
MEAN
2.803
4.724
2.703
12.239
0.703
6.886
9.628
12.059
1.820
4.367
25.461
19.721
5.799
10.791
1.745
3.695
4.452
16.115
1.506
3.283
2.862
3.721
2.492
8.357
6.454
7.000
6.213
10.064
5.332
8.787
MAX
34.081
321.325
231.785
112.329
28.167
136.845
129.737
39.836
114.390
21.651
60.855
173.159
26.902
97.258
109.166
Time
0.890
1.030
5.030
3.350
0.910
10.340
7.320
1.290
2.220
0.330
15.270
0.260
0.650
1.290
3.584
Table 2: Real power estimation errors of the proposed method for Depth=2, N_vars=8.
Public
Page 12
Proceedings of second Workshop
LPGD/WP1/DUTH+UP+INTRACOM/D1.2R1
For some large circuits (even for Depth=2) the method presented in [9] runs out of
memory and this is indicated with a “N/A” in the corresponding table row. The slightly
differences in power consumptions (column Power is the real power consumption while
column Power’ is the estimated one) are present due to the fact that the [9] method
estimates the switching activities using polynomial simulation (without OBDDs) while
the Proposed method estimate the switching activities using OBDDs. Finally in column
Time, the corresponding run times for each method are given. The reported run times are
in seconds and were obtained on an HP-UX C180 with 512 Mbytes of main memory. As
it shown the run times of the proposed method are about 6 times faster than those of [9]
(wherever [9] is applicable) and the average run time is 3.5 sec.
Circuit
C1355
C1908
C2670
C3540
C499
C6288
C7552
C880
alu4
cu
des
F51m
Rca16
vda
Power
[9] Power' TOTAL
MEAN
3149
3128
0.657
5.204
4322
4314
0.181
11.516
4827
N/A
N/A
N/A
8864
7912
10.732
12.273
2710
2661
1.822
4.369
75583
95590
26.473
19.726
12457
N/A
N/A
N/A
1722
1693
1.580
3.526
3858
3710
3.756
14.606
510
501
1.518
3.251
16987
N/A
N/A
N/A
532
514
3.297
5.744
934
874
6.449
6.997
2465
2426
1.501
6.109
MAX
33.160
382.930
N/A
111.899
28.162
135.901
N/A
22.534
113.917
21.652
N/A
59.363
26.880
97.256
Time
3.920
6.870
N/A
21.550
3.880
140.580
N/A
5.830
16.790
0.910
N/A
1.090
2.060
8.900
Table 3: Real power estimation errors of method [9] for Depth=2.
Choosing Depth=3 and N_vars=8, Table 4 shows the errors in power estimation (%)
of the proposed method, while Table 5 shows the corresponding errors of [9]. As it is
depicted, the proposed method reduces the power estimation errors, while method [9] is
not applicable for the majority of the large circuits. More specifically the average TOTAL
error is about 4.5 %, while the average MEAN error and the average MAX error are 8.6
% and 106.8 % respectively. The average run time is slightly increased (i.e. 4.4 sec)
compared to the average run time of Table 2.
Circuit
C1355
C1908
C2670
C3540
C499
C6288
C7552
C880
alu4
cu
des
f51m
Prop.Power' TOTAL
Power
MEAN
3149
3075
2.357
3.863
4322
4224
2.256
12.377
4827
4787
0.807
7.039
8864
7945
10.368
12.135
2710
2667
1.595
4.997
75583
61981
17.995
20.270
12457
11815
5.146
9.857
1722
1721
0.020
3.142
3858
3665
4.919
16.457
510
502
1.506
3.283
16987
16504
2.823
3.754
532
517
2.694
9.118
Public
MAX
34.081
371.130
200.166
98.021
35.530
99.297
158.658
40.119
114.218
21.651
60.757
173.146
Time
0.930
1.240
5.380
4.760
0.940
14.440
8.960
1.550
3.190
0.330
18.180
0.340
Page 13
Proceedings of second Workshop
rca16
vda
Average
934
2465
9923
LPGD/WP1/DUTH+UP+INTRACOM/D1.2R1
872
2362
8760
6.586
4.124
4.514
7.188
7.296
8.627
26.902
62.578
106.875
0.670
1.610
4.466
Table 4: Real power estimation errors of the proposed method for Depth=3, N_vars=8.
6. Conclusions
We have proposed a novel method for the estimation of the power dissipation of a
logic circuit that is described at the gate level, assuming a real gate delay. The proposed
method constitutes an extension of the zero delay probabilistic method that presented in
[9] and takes into account the first-order temporal correlations and the spatial correlations
for the logic circuit structural dependencies. Since the proposed method does not take
into account the spatial correlations due to input data dependencies, our future work is to
implement a method that propagates the primary input statistics and correlation
coefficients through the logic network in order to estimate efficiently the switching
activity of any node of the logic circuit at any valid time point and for any input vector
stream.
Circuit
Power
[9] Power'
TOTAL
MEAN
MAX
Time
3149
3121
0.904
4.095
24.054
4.920
C1355
4322
N/A
N/A
N/A
N/A
N/A
C1908
4827
N/A
N/A
N/A
N/A
N/A
C2670
8864
N/A
N/A
N/A
N/A
N/A
C3540
2710
2690
0.738
4.481
30.072
5.250
C499
75583
N/A
N/A
N/A
N/A
N/A
C6288
N/A
N/A
N/A
N/A
N/A
C7552 12457
1722
N/A
N/A
N/A
N/A
N/A
C880
3858
N/A
N/A
N/A
N/A
N/A
alu4
510
501
1.518
3.251
21.652
0.910
cu
N/A
N/A
N/A
N/A
N/A
des 16987
532
519
2.463
4.730
60.613
3.560
F51m
934
937
0.314
0.866
7.030
2.980
Rca16
2465
2442
0.857
4.150
28.290
51.320
vda
Table 5: Real power estimation errors of method [9] for Depth=3.
References
[1] J. Rabaey and M. Pedram, “Low Power Design Methodologies,” Kluwer Academic
Publishers, 1996.
[2] F. Najm, “A Survey of Power Estimation Techniques in VLSI circuits” in IEEE Trans. on
VLSI, vol 2, no 4, pp. 446-455, December 1995.
[3] F. Najm, “Transition Density: A new measure of activity in digital circuits,” in IEEE Trans.
on CAD, Vol. 12, No. 2, pp. 310-323, February 1995.
[4] B. Kapoor, “Improving the accuracy of circuit activity measurement”, in Proc. of Int.
Workshop on Low Power Design, pp. 111-116, April 1994.
Public
Page 14
Proceedings of second Workshop
LPGD/WP1/DUTH+UP+INTRACOM/D1.2R1
[5] P. Schneider and U. Schlichmann, “Decomposition of Boolean functions for low power based
on a new power estimation technique,” in Proc. of IWLPD, pp. 123-128, April 1994.
[6] R. Marculescu, D. Marculescu, and M. Pedram “Efficient Power estimation for highly
correlated input streams,” in Proc. of DAC, pp.628-634, 1995.
[7] F. Najm, R. Burch, P.Yang and I. Hajj “Probabilistic Simulation for Reliability Analysis of
CMOS VLSI circuits,” in IEEE Trans. on CAD, 9 (4) pp. 439-450, Apr. 1990.
[8] J. Monteiro, A. Ghosh, S. Devadas, K. Keutzer, and J. White, “Estimation of average
switching activity in combinatorial and sequential circuits”, in IEEE Trans. on CAD, Vol. 16,
No.1, pp. 121-127, January 1997.
[9] J.C. Costa, J.C. Monteiro, and S. Devadas, “Switching Activity Estimation using Limited
Depth Reconvergent Path analysis”, In Proc. of ISLPD, pp. 184-189, 1997.
[10] R. Bryant, “Graph-based algorithms for Boolean function manipulation”, in IEEE Trans. on
Computers, C-35:677-691, August 1986.
[11] J. Lind-Nielsen, “BuDDy – A Binary Decision Diagram Package”, IT-TR : 1999-028,
Technical University Of Denmark, 1999.
Public
Page 15
Proceedings of second Workshop
LPGD/WP1/DUTH+UP+INTRACOM/D1.2R1
Reprinted from Proceedings of 9WK ,QW :RUNVKRS 3RZHU $QG 7LPLQJ 0RGHOLQJ
2SWLPL]DWLRQ$QG6LPXODWLRQ 3DWPRVC9 Kos, Greece, 2FWREHU8, 1999.
D. Soudris, C. Z. Lolas, And A. Thanailakis, “A Modified Energy Recovery Technique
For Implementing DSP Algorithms,” in Proc. of 9WK,QW:RUNVKRS3RZHU$QG7LPLQJ
0RGHOLQJ 2SWLPL]DWLRQ $QG 6LPXODWLRQ 3DWPRV C9 Kos, Greece, 2FWREHU 8,
1999, pp. 113-122.
Public
Page 16
Proceedings of second Workshop
LPGD/WP1/DUTH+UP+INTRACOM/D1.2R1
A modified Energy recovery Technique FOR implementing DSP algorithms2
D. Soudris, C.Z. Lolas*, and A. Thanailakis
VLSI Design and Testing Center, Dept. of Electrical and Computer Eng.
Democritus University of Thrace, 67 100 Xanthi, Greece,
[email protected]
*Department of Computer Science, University of Crete,
P.O.Box 1470, Heraklion, Creece
Abstract
A new method for implementing low power architectures of DSP applications based on
the energy recovery techniques is introduced. Exploiting the regular and iterative
structure of DSP algorithms, the proposed architectures embody the principles of the
reversible pipeline approach. Since a reversible pipeline system uses the “reverse” logic
blocks in different time instances, we can reduce the hardware complexity significantly.
An efficient multiplexing and clocking scheme preserve the reversible operation with
small hardware cost. A series of proven lemmas specify the optimal characteristics of the
chosen clocks, which ensure adiabatic operation and high speed. Furthermore, we specify
the necessary conditions which ensure optimally-adiabatic algorithm mappings. The
proposed approach is demonstrated by the low power architecture design
of the
convolution algorithm.
1. INTRODUCTION
Low power dissipation has become a significant parameter for implementing
efficient digital systems. The market demands for portable devices with high performance
forced the designers to invent methods for reducing power dissipation in VLSI systems.
Plethora low power design methods have been published, which can be divided into two
main categories: i) the conventional low power design approaches [1, 2] and ii) nonconventional [3]. More specifically, the energy recovery is a non-conventional low power
design approach, which is based on adiabatic switching principles [3,4,5]. Using
adiabatic techniques, the signal transfer between circuit capacitances should be
sufficiently slow, which implies that the energy dissipated as heat may be asymptotically
low, during transfer. Moreover, the energy recovery systems can recycle back to a power
source, the remaining energy stored on circuit capacitances. Two approaches for
2
This work was partially supported by the LPGD ESPRIT IV 25256 project of EU
Public
Page 17
Proceedings of second Workshop
LPGD/WP1/DUTH+UP+INTRACOM/D1.2R1
designing multi-block energy recovery systems have been reported: i) the Retractile
Cascade [3] and ii) the Reversible Pipeline (RP) [3,5,6]. In particular, since the
throughput of a Retractile Cascade system is inversely proportional to the number of used
blocks, composite systems with large number of blocks are become impractical. Using
partially overlapped clocks and the RP approach, the throughput of the system can be
improved. However, the large amount of the required hardware (we need a reverse block
for each forward logic block) derived inefficient implementations, mostly for systemand/or architecture-level design. In [7], the whole design of a FIR filter starting from the
algorithm level and ending to layout level was presented. There were comparisons with
conventional FIR designs and the resulting architectures had very positive features.
However, it cannot be seen as a design methodology for DSP algorithms.
In this paper, a new methodology for designing low power architectures based on
the energy recovery technique, is introduced. The proposed methodology consists of two
fundamental steps: i) Determination of the appropriate direction and schedule vectors and
ii) Determination of the optimal clocking scheme. More specifically, the first step specify
the array architecture. The latter step allows the redesign of clocking strategy resulting
into hardware efficient architectures with energy recovery features .
The proposed architectures implement a certain class of DSP algorithms called as
Weak Single Assignment Codes (WSACs) [8, 9], such as Discrete Fourier/Cosine
Transform, matrix-matrix multiplication, and convolution. These algorithms can be
implemented, applying conventional design methodologies, by regular, modular, and
iterative architectures [9, 10], which consists of many identical blocks (or processing
elements (PE)). The derived architectures are characterized by reversible pipeline
operation and small hardware complexity.
A typical reversible pipeline system consists of a series of “forward” logic blocks
and a series of the corresponding “reverse” pipeline logic blocks. Therefore, it is possible
to merge the identical parallel blocks to one, employing appropriate time multiplexing of
the common block. A set of lemmas specify the optimal characteristics of the chosen
clocks, which ensure adiabatic operation and high speed. The effectiveness proposed
method is illustrated by the architecture level design of the convolution algorithm.
2. The Basic Design Concept
The Reversible Pipeline principle was described in detail manner by Athas et al
[4]. A typical Reversible Pipeline system comprises: i) N forward logic blocks, F1, F2,...,
FN, ii) N reverse logic blocks and iii) N power-clocks Φ1, Φ2,..., ΦN, as is it depicted in
Fig. 1. Concerning the Reversible Pipeline operation, the role of the reverse blocks is to
recover circuit energy back to power supply and do not contribute to calculation process
itself. In other words, their only usage is to establish a path to recover the energy back to
Public
Page 18
Proceedings of second Workshop
LPGD/WP1/DUTH+UP+INTRACOM/D1.2R1
the power source. Thus, if a forward logic block is removed from the pipeline, the system
logic operation will be changed. In contrary, if a reversible logic block is removed, the
calculation sequence will not be influenced. If some of the reverse blocks are identical,
alternative energy paths are available in a reversible pipeline system. By definition, a
reversible pipeline system uses the reverse logic blocks in different time instances as the
calculation and pipeline proceeds from a logic block, Fm to the next Fk one. Therefore,
the hardware utilization of the reverse blocks is very small, i.e. 1/N %. To improve the
hardware utilization, which coincides with the hardware reduction, we can merge
appropriate groups of reverse logic blocks to one block. Since the hardware-reduced
system should preserve the reversible pipeline principles (i.e. energy recovery), an
component, which multiplexes in-time the paths of the reverse blocks and manipulates
the associated clocks of the merged blocks in effective fashion, should be used.
Practically, the implementation of the time multiplexed energy recovery is performed by
appropriately-designed multiplexers (MUXs). The derived system can function as a RP
system with less hardware complexity. A systematic methodology for designing the
structure of the multiplexer and the characteristics of the multiplexed clocks will be
described in the following sections.
"forward" logic blocks
)
)
)
,QSXW
P
1
N
...
)P
...
)N
...
)1
...
)P
...
)N
...
)1
2XWSXW
D
)
)
)
P
N
1
"reverse" logic blocks
Figure 1. The general structure of a reversible pipeline architecture [4]
3. The Proposed Design Methodology
3.1 THE Mapping Procedure
The proposed methodology aims at the architecture level design of DSP
algorithms, deriving implementations which exhibit energy-recovery features. Although
the reversible pipeline exhibits several advantages a major obstacle regarding the overall
hardware complexity of an implemented circuit restricts the applicability of this
technique to a limited number of applications. Additional limitations are: i) the lack of
generalized method to reverse a function and ii) many useful functions are not reversible,
e.g. AND, OR, etc. The latter characteristic implies that the logic level expressions are
not always reversible. To overcome the reversible pipeline limitations, we are working in
Public
Page 19
Proceedings of second Workshop
LPGD/WP1/DUTH+UP+INTRACOM/D1.2R1
the architecture level because the basic operators of the algorithm expressions are
commutative and associative. In order to map an iterative algorithm onto a regular array
architecture there exists many alternatives, i.e. many sets of projection and schedule
vectors [9,10]. However, only a subset is suitable for designing power efficient array
architectures. Studying the reversible pipeline concept, we infer that each algorithm
should be input and output of the same processing element. This requirement is satisfied
by the fact a DSP algorithm can be expressed in terms of Uniform Recurrent Equations
[9,10]. The mapping procedure of an algorithm results into a power efficient architecture
& &
&
when the number of the products d × ei ≠ 0 , i = 1,2,..., N is maximum, where d is the
&
direction vector and ei is the i-th edge vector of the Dependence Graph.
3.2 Method for Selecting Optimal Clocking Scheme
In the beginning, we define some variables, which will help the formal
description of the proposed clocking scheme. Referring to Fig. 2, we define: i) T =
raising/falling time of a clock, ii) SET = time interval with length s, iii) HOLD = time
interval with length h, iv) RESET = time interval with length r, v) IDLE = time interval
with length i, and vi) DIFFERENCE = time difference interval between two clocks with
length d.
+2/'
5(6(7
6(7
7
V
7
K
U
,'/(
L
Figure 2. The clock characteristics of a a reversible pipeline system.
Since the raising/falling edges occur during the SET/RESET intervals, we deduce that: s
= r = T . Without loss of generality, it is assumed that the lengths of HOLD, IDLE, and
DIFFERENCE intervals can be expressed in terms of T. That is: i) h = x T , ii) i = y T ,
iii) d = z T , where x, y ∈ R + − {0}, and (R+ is the positive of real numbers). In general,
any two clocks used in RP systems are identical and have equal phase difference. These
features arise from the fact that the clock implementation is easier . Due to the fact that
the difference between two successive blocks is assumed to be constant, d, the time shift
between the two clocks Φm and Φk is:
(m - k) z T
(1)
where m > k with m, k ∈ Z + − {0,1} (Z is the set of positive integer numbers).
During the energy recovery operation of a stage Fm (Fig.1), the rising clock edge
of the next stage Fm+1, Φm+1, drives the inputs of the block Fm−1 and open the path it
constitutes. Then, the falling edge of the clock Φm-1 attached on the reverse block Fm−1
Public
Page 20
Proceedings of second Workshop
LPGD/WP1/DUTH+UP+INTRACOM/D1.2R1
discharges the stage. To merge the reverse blocks Fm−1 and Fk−1 , a designer should
ensure that the energy recovery operation of each block occurs in different time intervals.
This implies that the falling edges of the clocks Φm-1 and Φk-1 attached to the reverse
blocks Fm-1 and Fk-1 and the rising edges of the clocks Φm+1 and Φk+1 of the next blocks
Fm+1-1 and Fk+1-1, occur in different time instances.
In Section 2.2, it is mentioned that there is a critical point concerning the
adiabatic operation of the multiplexers. All the input signals of MUXes are clocks and
change state periodically passing a clock signal to their output, one input at a time. If a
clock signal swings from its IDLE value, V0, to its HOLD value, V, and does not return
back to V0 before the multiplexer change state, is it possible to occur a non-adiabatic
energy dissipation during the successive clock period. This happens because the voltage
drop (V-V0) across the multiplexer results into higher power dissipation.
To avoid this problem the signal clocks, which pass through the MUXes should
satisfy the condition: if the first clock is in “high” logic level, the other clock remains in
its IDLE value and vice versa. In other words, the pulses of the clock-signals should not
be overlapped. The clocks (denoted as Ti), which control MUXes have to keep their logic
values `0` and `1` for the whole time interval of each pulse.
The following lemmas facilitate the extraction of fast and secure solutions about
the appropriate clocking scheme for the optimization technique.
Lemma 1: Let s, h, r and i be a set of the time intervals of a clock pulse. The relationship
between the first three intervals and last one is given:
2+x ≤ y
(2).
Lemma 2: Two successive clock signals are partially overlapped if it holds:
1≤ z < x
(3).
Lemma 3: Let Fm and Fk, be two identical reverse blocks and Öm and Φk be the associated
clocks, which are non-overlapped. The phase difference between Φm and Φk
should meets the formulas:
(m - k) z ≥ 2 + x
(4)
Lemma 4: Let Fm and Fk, be two identical reverse blocks and Φm and Φk be associated
clocks, which are non-overlapped. The phase difference between Φm and Φk should
meets the formula:
(m - k) z ≤ y
(5)
Given the values m and k of the identical logic blocks and (2), (4) and (5), we can
determine a clocking scheme choosing one of the possible combinations of x, y, and z,
which satisfy the inequalities:
2 + x ≤ (m - k) z ≤ y
(6)
1≤ z<x
(7)
Public
Page 21
Proceedings of second Workshop
LPGD/WP1/DUTH+UP+INTRACOM/D1.2R1
Lemma 5. Given the inqualities (6) and (7), it holds that:
i) y ≥ 2
(8)
ii) (m - k) > 1
(9)
The inequality (9) means that the merge of identical reverse blocks, which belong to
successive stages cannot be performed.
The clocks deriving from the above inequalities are suitable for merging every
pair of identical blocks, which has the same relative position in the pipeline stream, i.e.
constant difference ( m − k ) between any Fm and Fk. This important result has impact on
those systems, which are characterized by iterative (repeated) identical structures. Such
typical systems are the array processors [9,10], whose role are critical on several DSP
applications.
The proposed procedure can be easily generalized for RP systems, which contain
groups with more than two identical blocks. The permissible numbers of identical blocks
in each group are powers of 2, due to multiplexer features. In order to find the appropriate
clocking scheme, we apply the inequalities (6) and (7) to all possible combinations (by
pairs) of identical blocks.
The clocks resulting from (6) and (7) do not lead necessarily to maximum speed
or at least equal to the speed of a conventional RP system. Thus, the designer should
examine how can maximize the speed of the whole system. It has been already mentioned
that the factors, which influence the features of a clock pulse are T, x, y, and z. More
specifically, in circuit level, the speed depends on the rising/falling time T. In system
level, the speed depends on the time interval, t1, needed to obtain the first output (i.e.
latency) and the time interval, t2, in which a logic block can load new inputs after the
execution of a computation (i.e. throughput). The time interval t1 and t2 can be expressed
in terms of T, x, y, and z as follows:
t1 = a z T
(10)
t2 = (2+x+y)T
(11)
where a is the number of the forward logic blocks. Therefore, from all the possible
solutions of (6) and (7), the set of x, y, and z, which minimizes eq. (10) and (11) result
into maximum speed.
4. HARDWARE COMPLEXITY/REDUCTION
In order to estimate the total hardware complexity of the proposed pipeline
implementation, we have to explore the hardware complexity of the additional
components, i.e. the multiplexers. A fully-adiabatic multiplexer, which is based on a fully
adiabatic tree decoder designed by T-gate logic [3] is selected.
In order to estimate the hardware savings of the proposed system, we define the
total Relative Hardware Saving (RHST), which can be expressed by a more specific
Public
Page 22
Proceedings of second Workshop
LPGD/WP1/DUTH+UP+INTRACOM/D1.2R1
formula, if all the blocks of the RP system are identical, for instance, the array processor
architectures (systolic architectures) [9,10]. It is proved that:
RHST =1 −
1 2( M + N + 1)
−
g
NT
(12)
where: A = the number of reverse blocks of the whole system, NT = the number of
transistors of each block, g = the number of blocks of each group, and C = the total
number of groups.
A significant conclusion comes from eq. (12): if all the blocks are identical, RHST
is expressed in terms of a block characteristics and the number of the identical blocks of
each group, and it is independent from of the total number of the reverse blocks. The
RHST is becoming better as g is increasing. However, if g increases, (m-k) becomes larger
as well as the values x, y, and z of (6) and (7). This fact implies increased latency, t1, and
thus lower speed. Depending on the application requirements, the designer should make
the appropriate trade-offs between RHST and speed, choosing the suitable value of g.
5. APPLICATION
The features and the advantages of the proposed modified RP approach are illustrated
by the well-known DSP algorithm of convolution. It has been proven that the convolution
can be implemented by regular architectures [9,10]. Since an array implementation
includes many identical PEs (e.g. multiplier-accumulation units), the proposed method
can produce low power systems, which embody the properties of the reversible pipeline
with affordable hardware complexity. The Dependence Graph and the Signal Flow Graph
of the convolution are shown in Fig. 3(a) and (b), respectively. The convolution array
processor can operate in a RP fashion because: i) the whole system exhibits the pipeline
property [10] and ii) the function of the identical blocks (or processing elements) is
reversible. The structure of the “forward” and “reverse” PE is shown in Fig. 3(c) and (d).
Having the forward and the reverse logic blocks (processing elements) it is possible to
design a convolution array processor operating as a RP system. This system is partially
adiabatic and it is designed to recover the external signal energy, that is the energy of the
input/output signals of PEs.
For comparison reasons, we choose a specific array processor, which implement the
convolution of two sequences of eight (A=8) points, with certain 32×32-bit multiplieraccumulator unit (i.e. PE) of 28,500 transistors [11]. Thus, the MUX has M=N=96
input/output bit lines. Hence, from eq. (12), it is obtained:
RHTT = 0,986 −
1
g
(13)
Public
Page 23
LPGD/WP1/DUTH+UP+INTRACOM/D1.2R1
\
\
X
X
\
\
\
\
\
\
\
'
'
\
'
'
\
X
'
\
X
y (i+1 )
(c)
'
'
'
Z L
X L
\ L
'
\
;
y (i )
'
\
u (i )
w (i+1 )
'
X
w (i )
'
\
X
\
X
X
\
;
Z L
\ L
G
'
'
'
Z
Z
Z
Z
Z
Z
Z
Z
Z
Z
(a)
Z
...
\
...
Proceedings of second Workshop
(b)
Figure 3. (a) Index Transformed Dependence Graph for convolution and (b) Signal Flow
Graph obtained by projection along [0 1], (c) The architecture of the “forward” PE, and
(d) the “reverse” PE.
There are two different ways for merging identical the blocks of the Signal Flow Graph,
that is: g=2 and g=4. For arbitrary length of the convolved sequences, Figure 4 depicts
the relationship between RHST and the number of blocks of each group, g.
5+77
J
Figure 4. The relationship between RHST and the number of blocks of each group, g.
Groups of two
There are two ways for merging the reverse blocks of this system: i) to merge
blocks by pairs having m-k = 2 (i.e. blocks 1-3, 2-4, 5-7, 6-8) and ii) to merge blocks with
m-k =4 (i.e. blocks 1-5, 2-6, 3-7, 4-8). From the proposed method, choosing m-k = 2, we
result into an architecture (Fig. 5) with higher speed. The optimal clocking scheme,
Public
Page 24
Proceedings of second Workshop
LPGD/WP1/DUTH+UP+INTRACOM/D1.2R1
which results from (8) and (9), has x = 2.2, y = 4.2 and z = 2.1. The associated values of
RHST, throughput and latency are in Table1.
1
3
4
5
6
7
8
08;V
08;
0
1
0
0 08;V 1
1
0
0 08;V 1 1
0 08;V 1
0 08;V 1 1
08;
08;
08;
08;
71
0
08;
72
Figure 5. The reversible-pipeline array processor architecture of convolution with g=2.
Groups of four
The merging of the eight reverse blocks with groups of four (g=4) has only one
combination of identical blocks, that is, 1-3-5-7 and 2-4-6-8. The deriving system is
shown in Fig. 6. Similarly, from (8) and (9) we infer that the minimum solution is
satisfied by x = 2.2, y = 12.6 and z = 2.1. The values of RHST, throughput and latency
are depicted in Table1.
In conclusion, the merge as many as possible identical blocks results into larger
hardware reduction at the expense of the architecture latency.
G
RHST
Throughput (t1)
Latency (t2)
2
48.6%
16.8 T
8.4 T
4
73.6%
16.8 T
14.8 T
Table 1. The hardware reduction and timing requirements of two linear array processors.
6. CONCLUSIONS
A new systematic methodology for implementing low power architectures, which are
based on adiabatic techniques, was presented. The derived architectures exploit the
properties and features of the reversible pipeline and exhibit significantly-reduced
hardware complexity. A series of proven lemmas determine the exact and optimal
characteristics of the used clocks. The efficacy of the proposed architectures was
illustrated by a certain DSP algorithm (convolution).
Public
Page 25
Proceedings of second Workshop
1
3
LPGD/WP1/DUTH+UP+INTRACOM/D1.2R1
4
5
6
7
8
08;V
08;V
08;V
08;
08;
1 3 5 7
71
72
2 4 6 8
Figure 6. The reversible-pipeline array processor architecture of convolution with g=4.
REFERENCES
[1] J. Rabaey and M. Pedram, “Low Power Design Methodologies,” Kluwer Academic
Publishers, 1996.
[2] A.P. Chandrakasan and R.W. Brodersen, “Low-power Digital CMOS Design”
Kluwer Academic Publishers, 1995.
[3] W. Athas, “Energy-Recovery CMOS”, Book Chapter in “Low Power Design
Methodologies” editors J.M. Rabaey and M. Pedram, Kluwer Academic Publishers,
pp. 65-100, 1996.
[4] W.C. Athas, L. Svensson, J.G. Koller, N. Tzartzanis, and E.Y.C. Chou, “Low-Power
Digital Systems Based on Adiabatic-Switching Principles,” in IEEE Trans. on VLSI
Systems, vol. 2, no.4, Dec. 1994.
[5] S.G. Younis and T.F. Knight, Jr., “A asymptotically zero energy split-level charge
recovery logic,” in Proc. of IEEE Int. Workshop on Low Power Design, 1994, NapaValley, USA.
[6] K. Jung and W. Kim, “A Logic gate for Reversible Pipelining,” in Proc. of IEEE Int.
Workshop on low Power Design, 1997, pp. 1928-1931.
[7] W.C. Athas, W-C Liu, and L. Svensson, “Energy-recovery CMOS for Highly
Pipelined DSP Designs,” in Proc. of Int. Symp. on Low Power Electronics and
Design (ISLPED), pp. 101-104, Monterey, CA, August 12 - 14 1996.
>@ 95R\FKRZGKXU\6.5DR/7KLHOHDQG7.DLODWK2QWKH/RFDOL]DWLRQRI
DOJRULWKPVIRU9/6,SURFHVVRUDUUD\VLQ9/6,6LJQDO3URFHVVLQJ,,,HGLWHGE\
5%URDGHUVHQDQG+0RVFZRYLW],(((3UHVV1HZ<RUNSS
[9] '6RXGULVet al2QWKH'HVLJQRI7ZR/HYHO3LSHOLQHG3URFHVVRU$UUD\VLQ
$SSOLFDWLRQ'ULYHQ$UFKLWHFWXUH6\QWKHVLV)&DWWKRRUDQG/6YHQVVRQHGLWRUV
.OXZHU$FDGHPLF3XEOLVKHUV-XQH
[10] S.Y. Kung, “VLSI Array Processors,” Prentice Hall, Eaglewood Cliffs, 1988.
[11] H. Murakami, K. Yano, Y. Ootaguro, Y. Sugeno, M. Ueno, Y. Muroya, and T.
Aramaki, “A multiplier-accumulator macro for a 45 MIPS embedded RISC
processor,” in IEEE J. Solid State Circuits, July 1996, 31, pp. 1067-1071.
Public
Page 26