INTERNATIONAL JOURNAL OF CIRCUIT THEORY AND APPLICATIONS
Int. J. Circ. Theor. Appl. 2006; 34:429–443
Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/cta.365
Dense CMOS implementation of a binary-programmable
cellular neural network
Jacek Flak1, ∗, † , Mika Laiho2, 1 , Ari Paasio2 and Kari Halonen1
1 Helsinki University of Technology, Electronic Circuit Design Laboratory, P.O. Box 3000, FIN-02015 TKK, Finland
2 University of Turku, Microelectronics Laboratory, Lemminkäisenkatu 14-18, 20520 Turku, Finland
SUMMARY
An implementation of a cellular neural/non-linear network (CNN) for processing black-and-white (B/W)
images is presented in which the template terms are 1-bit programmable. Such an approach leads to a very
compact implementation of the coefficient circuits and fast (digital) programming. In this programming
scheme, the more complex templates are split into subtasks that are run successively. The structure allows
a direct or algorithmic evaluation of the majority of templates proposed for B/W images. The transient
mask is utilized in performing the local logic operations as well as in template operations. The proposed
architecture is suitable for high-density implementations. A test structure of a 4 × 4 network has been
implemented with a standard digital 0.18-µm CMOS process. One cell occupies only 155 µm², making
possible the implementations of very large networks on a single chip. The algorithms used for the logic
function computations and selected template evaluations are described, and the corresponding measurement
results are shown. Copyright © 2006 John Wiley & Sons, Ltd.
Received 8 June 2005; Revised 16 March 2006
KEY WORDS: cellular neural networks; CMOS integrated circuits; image processing; measurement
1. INTRODUCTION
The theory of cellular neural/non-linear networks (CNN) [1] is a powerful tool for many image
processing tasks, and serves as a theoretical base for development of the vision chips such as [2, 3].
However, its practical implementation is a real challenge. Therefore, designers have come up with
implementation-oriented simplifications to the CNN theory. For instance, the full signal range (FSR)
model [4], that truncates the cell state between −1 and 1, has led to the realization of a grey-scale
∗ Correspondence to: Jacek Flak, Helsinki University of Technology, Electronic Circuit Design Laboratory, P.O. Box
3000, FIN-02015 TKK, Finland.
Contract/grant sponsor: Academy of Finland; contract/grant numbers: 205443, 106451
CNN with 128 × 128 cells [3]. An alternative approach is to realize the grey-scale and the black-and-
white (B/W) image processing parts separately so that their implementations can be optimized. The
B/W data processing is an extremely important class of operations, being an integral part of most
of the algorithms proposed for CNN. Paasio and Halonen [5] proposed a positive-range high-gain
CNN output non-linearity that enables a significant simplification of the B/W cell structure. Based
on that approach, a 176 × 144 (QCIF resolution) CNN chip has been implemented [6]. Further
simplification of this structure was obtained by reducing the programmability of the weights to
1 bit, i.e. both the multiplier and the multiplicand are 1-bit values as depicted in Reference [7].
Due to limited programmability, more complex templates need to be divided into a set of simple
sub-templates that are run successively. The result of each subtask is either the initial state for the
following one or it is stored in a local memory of a cell to be later combined with others into
the final result. If the bias has 2-bit programmability as recently proposed in Reference [8], all
the templates handling B/W images could be transformed into a set of (typically, one to three)
binary sub-templates by, e.g. separating the positive terms of the conventional templates given in
Reference [9] from the negative ones. The general rules on how to design binary-programmable
templates are given in Reference [8].
This paper describes a hardware realization that is simplified to process B/W data only. The
structure is somewhat similar to the B/W data processing part, called global logic unit (GLU),
of the near-sensor image processor (NSIP) presented in Reference [10]. However, the GLU
implementation given in Reference [10] lacks the threshold logic capability. Here, both the coupling
coefficients and the bias (setting the threshold level) are 1-bit programmable. This enables the bias
programming to either 0.5 or 1.5. The 1.5 bias is needed, e.g. for the Insignificant Line Remover
template [11]. Binary-valued weights of the cells’ interconnections lead to a very compact coef-
ficient circuit implementation with a small die area. Additionally, the write-time of the template
terms is very short due to digital programming. The other very important aspect in array computing,
namely, the power consumption is also addressed here. A dissipation in the range of microwatts
per cell along with small cell-size opens the door for high-frame-rate implementations with very
high spatial resolution. The reduced programmability does not induce any severe limitation to the
processor versatility, due to template division and fast programming [8].
The main objective of this article is to demonstrate the realizability and measured performance
of the proof-of-concept test structure. The chip architecture and circuitry are presented in Section
2, and the silicon implementation issues are covered in Section 3. A selection of templates is
described and visualized by the measurement results in Section 4. Discussion regarding the chip
performance is given in Section 5, and the conclusions are drawn in Section 6.
2. ARCHITECTURE
Figure 1 presents a block diagram of the chip consisting of a 4 × 4 array of processing elements
(PEs). In addition, a shift register used for programming the template terms was placed as a means
to minimize the number of pins used. That is because two other test structures were implemented
along with this one on the same chip. Only 10 bits (including the bias term) are needed for the
template terms, since only one of either A or B template matrix has non-zero elements. The supply
voltage of the PEs, VDD1, is kept at least one NMOS threshold voltage lower than the high level of
the global control signals. This way, the NMOS switches can convey full logic levels. The supply
voltage of the shift register, VDD2, is the same as the logic HI level of the global signals. The right
Figure 1. Block diagram of the chip architecture. [Blocks: row selection; 4 × 4 computing array; PE detail with bias, computing kernel with local memories, and coefficient circuits; shift register with serial input; data bus and control bus carrying the template terms, clock and global control signals; supplies VDD1 and VDD2.]
Figure 2. Cell output non-linearity. [Two panels plotting VY against VX and against ∼VX; axis marks at Vth and VDD1.]
side of Figure 1 shows the three main building blocks that can be distinguished in the PE structure:
the bias circuit, the cell's computing kernel with digital local memories, and the coefficient
circuits.
2.1. Cell model
The cells in the presented approach use a modified version of the positive-range high-gain output
non-linearity resulting in an inverting threshold function (see Figure 2). This non-linearity can be
mathematically expressed as
$$V_Y = f(V_X) = \begin{cases} V_{DD1}, & 0 \le V_X < V_{th} \\ 0, & V_X \ge V_{th} \end{cases} \qquad (1)$$
where VY is the cell output, VX stands for the cell state, and Vth represents the comparator threshold.
With the modification shown in Figure 2, the state equation can be written in the form of
$$C_X \frac{d(\sim V_X)}{dt} = \sum_{k,l \in S_{ij}} AB_{kl} V_{Ykl} - I = D - I \qquad (2)$$
where CX is the state capacitance, Sij is the first neighbourhood of a cell in the ith row and jth
column, ABkl are the coefficients, i.e. the elements of either A or B template matrix (depending
on the global signal AorB), I is the threshold set by the positive bias, and D is the number of black
neighbourhood pixels (cells with low logic level at VX) marked by the template.
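The sign of D − I in Equation (2) decides the settled state of a cell. The following minimal Python sketch illustrates that decision for one neighbourhood; the function names and the 0/1 list encoding are illustrative assumptions and are not part of the chip design.

```python
def net_drive(template, outputs_hi, bias):
    """Return D - I of Equation (2), in multiples of the unit current.

    template   : 3x3 list of 0/1 template terms ABkl (hypothetical encoding)
    outputs_hi : 3x3 list of 0/1, 1 where a neighbour output VY is HI,
                 i.e. the neighbour state VX is LO (black pixel)
    bias       : 0.5 or 1.5
    """
    d = sum(t * y
            for t_row, y_row in zip(template, outputs_hi)
            for t, y in zip(t_row, y_row))
    return d - bias


def settles_black(template, outputs_hi, bias):
    """A positive drive raises ~VX, so VX ends LO and the cell turns black."""
    return net_drive(template, outputs_hi, bias) > 0


ALL_ONES = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
ONE_BLACK = [[0, 0, 0], [0, 0, 1], [0, 0, 0]]
TWO_BLACK = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
```

With the 0.5 bias a single marked black neighbour is enough to turn the cell black, whereas the 1.5 bias requires at least two.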
2.2. Computing kernel
Figure 3 shows the detailed structure of the cell’s computing kernel. It comprises two inverters
INV1 and INV2, four SRAMs grouped in the local memories block, a transient mask based on the
programmable transmission gate, state capacitance CX (implemented with a capacitor-connected
3 µm × 3 µm NMOS), and a number of switches. Additionally, four parasitic gate capacitances CS,
CC, Cm, and ∼Cm utilized as dynamic memories are shown with the small dashed-line symbols.
The computing kernel of a cell has three operating modes:
• a multi-input threshold logic gate (TLG) with programmable inputs
• a multi-input (programmable inputs) TLG with a fixed state map
• a local logic device (transient mask operations).
In the first two modes, the cell can operate either in discrete time or in a way that allows
asynchronous propagation (A-templates). If the computing kernel operates in the multi-input TLG
Figure 3. Schematic of the computing kernel of a cell. [Labelled elements: local memories (wr1 to wr4, rd1 to rd4); inverters INV1 and INV2; transient mask (set_mask, tr_mask, invert); switches start, AorB, hold1, hold2, unit, zero and dbus_r/w; capacitances CX, CS, CC and Cm; nodes VX, VS and VY; data_bus; current contributions from the bias and the neighbourhood coefficient circuits; output to the coefficient circuits; supplies VDD1 and VSS.]
mode, the start switch is used for connecting the neighbours' outputs to the cell input, thus
initializing the processing. The input currents set the voltage at the state capacitance CX. The
voltage VX is thresholded by INV1, which provides the output voltage VY. The cell state can be
latched in the SRAM-like loop formed by the inverters INV1 and INV2, and the switches hold1
and hold2. Control signal AorB is used to determine whether an A- or B-template is run. When it
is kept HI, the cell output contribution is enabled to change during the processing, and therefore
an A-template is evaluated. For a B-template this signal is active only for a short while in order to
write VY into the parasitic gate capacitance of the coefficient circuits, CC. Transistor dbus_r/w
is used to enable the image data transfer to and from a data bus. The control signals unit and zero
are used to preset the cell state VX to either VDD1 or VSS, respectively.
The transient mask is used for implementing a fixed state map as well as for performing all logic
functions. It can be programmed by writing the desired values via the set_mask switches into
Cm and ∼Cm. When the mask is inactive (transmission gate is not conducting), the cell state VX
evolves according to Equation (2). When the mask is active (transmission gate is conducting),
VX is forced to either VS or its inverse, depending on whether the tr_mask or the invert signal
is set HI. Such enforcement is possible because the transistor controlled by the start signal has a
small W/L ratio.
When the computing kernel operates in the local logic device mode, the signals start and AorB
are kept LO, thus there is no interaction between the cells.
2.3. Coefficient circuits and border cells
In the presented approach the coefficient circuits, shown in Figure 4, are implemented with pull-
down circuits. The output of INV1 controls the gates of the analogue transistors (shown using
larger symbols than template term-controlled switches) in each coefficient circuit. Assuming the
corresponding template term (ABkl ) to be ‘1’, the coupling becomes active when the cell output is
HI, i.e. the cell state VX is LO. Such an activated coefficient circuit works as a unit current sink
at the input of the corresponding neighbour. Since a low supply voltage is used, the gate of the
NMOS current source can be connected to VDD1, while a reasonable current level is maintained.
In this way, the unit current is determined by the supply voltage of the cell. Therefore, the trade-
off between speed and power consumption can be controlled with the value of VDD1. When a
B-template is run, the parasitic gate capacitances of the analogue transistors maintain the desired
control value (either zero or VDD1 volts). Since the processing is fast, the gate parasitics do not
need to hold the voltage for a long time. Therefore, the transistor leakage is not a critical issue.
The border cells surrounding the computing array have structures identical to the coefficient
circuits. A border cell switch is controlled by a suitable template term (e.g. AB21 in left border),
and the analogue transistor is driven by the global border signal.
2.4. Bias circuit
The pull-up bias circuit, shown in Figure 5, comprises a simple current mirror with a scaled
replication of the current defined by an NMOS current source (identical to that in the coefficient
circuits). The total bias current is either 0.5 or 1.5 times the unit current depending on the global
signal biasbit. The bias programmability could easily be extended to two bits, by placing in the
circuit another path with a scaled (× 2) current replication, as proposed in Reference [8]. The
analogue power consumption of a cell after settling is bounded above by the bias current.
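As a worked reading of the two bias levels, assume (from the mirror ratios labelled in Figure 5) that the 0.5x branch is always connected and the 1x branch is switched by biasbit. With $I_u$ denoting the unit current and $b_0$ the value of biasbit, the bias current would then be

$$I_{bias} = (0.5 + b_0)\, I_u, \qquad b_0 \in \{0, 1\} \;\Rightarrow\; I_{bias} \in \{0.5,\ 1.5\}\, I_u$$

For the 2-bit extension mentioned above, a hypothetical second control bit $b_1$ switching the additional × 2 path would give

$$I_{bias} = (0.5 + b_0 + 2 b_1)\, I_u \in \{0.5,\ 1.5,\ 2.5,\ 3.5\}\, I_u$$

The symbols $I_u$, $b_0$ and $b_1$ are introduced here for illustration only.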
Figure 4. Schematic of the coefficient circuits. [Nine pull-down branches to VSS, one per neighbourhood position X11 to X33 with the self-feedback (SFB) at the centre; each branch is switched by its template term AB11 to AB33, driven from the computing kernel output, and connected to the corresponding neighbourhood input.]
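Figure 5. Schematic of the bias circuit. [Labelled elements: PMOS mirror transistors MP1 (1x), MP2 (0.5x) and MP3 (1x); NMOS current source MN; switch biasbit; output to the kernel input; supplies VDD1 and VSS.]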
3. SILICON IMPLEMENTATION
3.1. Robustness
When implementing an array processor, mismatch caused by processing variations has to be taken
into account. To estimate the probability of an error due to mismatch we Monte Carlo-simulated the
cell with 1000 iterations. The mismatch parameters and models provided by the foundry were used
within the Eldo simulator. Transistors with the following sizes (given in micrometers) were used: NMOS
transistors with W/L = 0.5/0.5 (for both the analogue transistors in the coefficient circuits and
MN in the bias circuit), and PMOS transistors with W/L = 2.0/0.3 (for MP1), W/L = 2.17/0.28
(for MP3), and W/L = 2.17/0.56 (for MP2). The absolute robustness (minimum separation of
the two logic states) of the binary-programmable CNN is min|D − I | = 0.5, and the minimum
relative robustness occurs when D = I + 0.5 with bias I at its maximum value [8]. Therefore,
the bias term set to 1.5 and two black neighbours can be considered as the worst case here. With
the bias programmed to 0.5, failure-free results were obtained at each supply voltage and neigh-
bourhood condition. For the case of the bias programmed to 1.5, failure-free results were obtained
with a supply voltage VDD1 ≥ 0.8 V. As the supply voltage is scaled down, the failure percentage
increases. As can be deduced, an increase of the supply voltage (if programmed differently for different
templates) can be used as a means to increase the robustness of a certain template evaluation.
Table I presents the percentage of incorrect cell states at given bias and neighbourhood vs the
supply voltage.
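The supply-voltage limit quoted above can be read directly from the simulated data. The short Python sketch below encodes the two rows of Table I and returns the lowest simulated VDD1 from which no failures occur; the data structure and function name are illustrative assumptions.

```python
# Simulated failure percentages from Table I, indexed by (bias I, black neighbours D).
VDD1_VOLTS = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2]
FAILURE_PERCENT = {
    (1.5, 1): [0.8, 0, 0, 0, 0, 0, 0, 0],
    (1.5, 2): [21.1, 11.3, 1.1, 0, 0, 0, 0, 0],
}


def min_failure_free_vdd(bias, black_neighbours):
    """Lowest tabulated VDD1 at and above which every entry is failure-free."""
    failures = FAILURE_PERCENT[(bias, black_neighbours)]
    for index, volts in enumerate(VDD1_VOLTS):
        if all(f == 0 for f in failures[index:]):
            return volts
    return None  # no failure-free voltage in the tabulated range
```

For the worst case (I = 1.5, D = 2) this lookup gives 0.8 V, in agreement with the limit stated in the text.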
3.2. Layout
For maximal density the cells are grouped into pairs of partially overlapping cells. It is important
to mention here that the cell array does not require any additional spacing (e.g. for signal wires)
between those pairs. In fact, they also slightly overlap each other, thus reducing the array area.
Figure 6 presents the layout of two connected cells designed for a standard digital 0.18-µm CMOS
Table I. Simulated failure state percentage vs supply voltage.
Bias and no. of                        VDD1 (V)
black neighbours    0.5    0.6    0.7    0.8    0.9    1.0    1.1    1.2
I = 1.5, D = 1      0.8    0      0      0      0      0      0      0
I = 1.5, D = 2      21.1   11.3   1.1    0      0      0      0      0
Figure 6. Layout of two cells with marked regions occupied by each of the PE’s components.
process. The process has a single poly layer, while six metal layers are available for routing. In
this design, only four metal layers were used, thus there is room for an improved distribution of
the global signals and supply voltage or a shielding layer, if a large network is to be built. The area
of such a two-cell pair is 310 µm². The entire 4 × 4 array occupies 62.2 µm × 38.5 µm without the
borders, and 67 µm × 46 µm with the border cells. The layout was designed in a full-custom manner.
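The area figures translate into cell density as follows. This is a back-of-the-envelope Python sketch; the QCIF example only scales the cell area and ignores borders, pads, routing and peripheral circuits, so it is an assumption rather than a layout estimate.

```python
PAIR_AREA_UM2 = 310.0      # area of a two-cell pair, from the layout
UM2_PER_MM2 = 1_000_000    # square micrometres per square millimetre

cell_area_um2 = PAIR_AREA_UM2 / 2              # 155 um^2 per cell
density_pe_per_mm2 = UM2_PER_MM2 / cell_area_um2


def array_area_mm2(rows, cols):
    """Core area of a rows x cols array, cell area only (no borders or pads)."""
    return rows * cols * cell_area_um2 / UM2_PER_MM2


qcif_area_mm2 = array_area_mm2(176, 144)       # about 3.93 mm^2
```

The resulting density of about 6451 PE/mm² is the value listed in the chip comparison.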
4. MEASUREMENT RESULTS
The chip measurements were conducted with a dedicated test board comprising external buffers
for bidirectional data bus. All the control signals as well as image data were provided by a pattern
generator with 8 ns minimum pulse width. The images resulting from processing were read out
from the chip and the cell states were determined by a logic analyser. The presented array has been
provided with separate supply rails. A microampere-meter in series with power supply source was
used for the measurements of the power consumption.
4.1. Logic operations: NOT, XOR, NAND, NOR
The logic operations (Boolean combinations) can be computed by CNN either with the use of
templates or by a local logic unit (LLU) placed within each PE as in Reference [3]. In the presented
approach the logic operations do not use templates, neither is there a dedicated LLU. Instead, a
sequence of global control signals makes use of the programmable transmission gate for computing
these functions.
Figure 7 shows the simplified equivalent cell structures for different logic operations. For clarity,
we omitted the complementary control of the transmission gate's PMOS (values stored on ∼Cm) as
obvious. In the XOR case, a computing kernel performs a conditional inversion of one operand
if the other operand is in a logic HI state. Analogously, the NAND function means conditional
inversion of one operand when the other is HI or otherwise the cell state is set HI. NOR is done
by inverting one operand when the other is LO or otherwise the cell state is set LO.
In the presented approach the NOT operation is used in many algorithms. Though it is a triv-
ial function, its evaluation with the proposed cell structure is not so straightforward. Therefore,
the detailed NOT algorithm is given here to show how this logic operation translates into a
sequence of control signals. The inversion is done with the aid of the transient mask programmed
to conduct.
(1) Set the cell state HI: hold1, unit
(2) Program the transient mask: set_mask
(3) Read out the operand from memory: rd1
(4) Separate CX from CS: hold2
(5) Perform the inversion (force VX to the inverse of VS): invert ...
(6) Charge sharing between CX and CS: hold2
Since CX/CS ≈ 5, the voltage VS at the input of INV1 will become close to VX. Then hold1 is applied
and the result is latched in the cell, ready to be read out or used in further calculation. The evaluation
of NOT becomes straightforward with a slightly modified cell structure as proposed in Reference [8],
where the transient mask can be bypassed with a switch.
Figure 7. The equivalent cell structures for different logic functions: (a) NOT; (b) XOR; (c) NAND; and (d) NOR. [Each panel shows INV1, the invert-controlled transmission gate and the capacitances CS, Cm and CX, annotated with the operands OP1 and OP2 or the constants '1' and '0'.]
Figure 8. The operands and the measured results of the logic operations. [Image panels: OPERAND 1, OPERAND 2, XOR, NAND, NOR.]
The signal sequences for the two-operand logic functions are designed in a similar way. In the
XOR case the first operand, instead of unity, is used to program the mask. A similar sequence is
required for a NAND function. The only difference lies in the need to initialize the cell to HI (unit)
before starting the inversion with invert. That ensures the HI state of the cells where no
inversion takes place, i.e. where the first operand is LO. Only a slightly more complicated algorithm is
needed to compute the NOR function. Namely, the first operand needs to be inverted before being written to
the transient mask to ensure the inversion takes place for its LO state (i.e. the NOR algorithm begins
with computing the NOT). Also, the cell state needs to be set LO (zero) before the signal
invert. Figure 8 shows the operands and the measured results of the two-operand logic functions.
In this implementation, black pixels correspond to logical '0' and are represented by a LO cell state
(voltage at VX), and white pixels correspond to logical '1' and are represented by a HI cell state.
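The logic functions above all reduce to one primitive: an inversion that takes place only where the transient mask conducts. The following Python sketch is a purely behavioural model of that idea on single bits; the function names and the `preset` argument are illustrative assumptions and do not model the control-signal timing.

```python
def conditional_invert(mask_bit, operand, preset):
    """Return NOT operand where the mask conducts, otherwise the preset state."""
    return (1 - operand) if mask_bit else preset


def logic_not(a):
    # The mask is programmed to conduct everywhere.
    return conditional_invert(1, a, preset=a)


def logic_xor(a, b):
    # The first operand programs the mask; elsewhere the second operand is kept.
    return conditional_invert(a, b, preset=b)


def logic_nand(a, b):
    # Cells are preset HI, so they stay HI where the first operand is LO.
    return conditional_invert(a, b, preset=1)


def logic_nor(a, b):
    # The first operand is inverted before programming the mask; cells are preset LO.
    return conditional_invert(logic_not(a), b, preset=0)
```

Applying these functions pixel by pixel to two binary images reproduces the usual truth tables, with 0 standing for black and 1 for white.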
4.2. Processing of templates
The first examined template is the Shadow into South-West (SW) direction, which in our case takes
the form of
$$A = \begin{bmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix}, \quad I = 0.5 \qquad (3)$$
Table II. Measured wave propagation speed and 0.5 bias current vs supply voltage.
Supply voltage (V)   Propagation time per cell (ns)   0.5 bias current per cell (µA)
1.2                  4.0                              6.58
1.1                  4.7                              5.19
1.0                  5.3                              3.92
0.9                  6.3                              2.78
0.8                  9.0                              1.07
0.7                  16.3                             0.68
0.6                  41.7                             0.35
0.55                 78.3                             0.34
Figure 9. Evaluation of the Shadow template into SW direction. [Image panels: INIT, RESULT.]
Each cell checks the state of the neighbour specified by the non-zero template term. If it is LO
(black pixel), the cell also turns black. The Shadow template gives an opportunity to measure the
speed of a wave propagating throughout the network. The measurements were conducted with a
supply voltage VDD1 varying from 0.55 to 1.2 V. At each VDD1 value we measured the minimum
time required for the propagation of a black pixel from one cell through the rest of the array (three
cells). Depending on the supply voltage a wave needs from 4 ns at 1.2 V to 78.3 ns at 0.55 V to
propagate a distance of one cell (see Table II). The table also shows the measured 0.5 bias currents
at the given voltages. The Shadow template evaluation into the SW direction is shown in Figure 9.
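The wave propagation described above can be mimicked with a simple behavioural model. The Python sketch below applies template (3) in synchronous steps until nothing changes; the synchronous update, the white border and the helper names are assumptions made for illustration, since the chip propagates asynchronously.

```python
BLACK, WHITE = 0, 1
SHADOW_SW = [[0, 0, 1],
             [0, 1, 0],
             [0, 0, 0]]
BIAS = 0.5


def step(image, template, bias, border=WHITE):
    """One synchronous update: a cell turns black when D exceeds the bias."""
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(rows):
        row = []
        for c in range(cols):
            d = 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if not template[dr + 1][dc + 1]:
                        continue
                    rr, cc = r + dr, c + dc
                    inside = 0 <= rr < rows and 0 <= cc < cols
                    pixel = image[rr][cc] if inside else border
                    d += pixel == BLACK
            row.append(BLACK if d > bias else WHITE)
        out.append(row)
    return out


def run_a_template(image, template, bias, border=WHITE):
    """Iterate until the image is stable; return the image and the step count."""
    steps = 0
    while True:
        nxt = step(image, template, bias, border)
        if nxt == image:
            return image, steps
        image, steps = nxt, steps + 1


init = [[1, 1, 1, 0],
        [1, 1, 1, 1],
        [1, 1, 1, 1],
        [1, 1, 1, 1]]
result, steps = run_a_template(init, SHADOW_SW, BIAS)
```

In this model a black pixel in the top-right corner reaches the bottom-left corner in three steps, which corresponds to roughly 3 × 4 ns = 12 ns at VDD1 = 1.2 V according to Table II.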
The Hole Filler template causes the white areas surrounded by black pixels to turn black. Therefore,
the white holes within black objects are being filled. In this implementation, it has the form of
$$A = \begin{bmatrix} 0 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix}, \quad I = 0.5 \qquad (4)$$
The algorithm is a little bit more complicated than for Shadow, therefore it is described in more
detail here.
(1) Load an image: dbus_r/w
(2) Invert the image: NOT operation as described in Section 4.1
(3) Set the transient mask condition: set_mask
(4) Initialize all cells to white: hold1, unit
(5) Separate CX from CS: hold2
(6) Upon the mask condition force VX to VS: tr_mask
(7) Select the A-template evaluation: AorB
(8) Set the border black: border
(9) Evaluate the template: start ...
(10) Invert the image: NOT operation
Figure 10. Evaluation of the Hole Filler template. [Image panels: ORIGIN, INVERTED ORIGIN (MASK), INIT, INTERMEDIATE RESULT, RESULT.]
Figure 11. Evaluation of the Figure Reconstruction template. [Image panels: ORIGIN (MASK), MARKER (INIT), RESULT.]
After the algorithm evaluation is completed the result can be read out from the cell or used in
further processing. Figure 10 presents the original image and the measured result of Hole Filler
evaluation.
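The Hole Filler steps can be summarized in a behavioural Python sketch. It assumes that the transient mask pins the pixels of the original objects to white while black floods in from the black border through the remaining cells; this reading of the mask, the in-place update order and the function names are illustrative assumptions, not a description of the control-signal sequence.

```python
BLACK, WHITE = 0, 1
CROSS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # non-zero terms of template (4)


def invert(image):
    """NOT operation on every pixel."""
    return [[1 - p for p in row] for row in image]


def hole_filler(image):
    rows, cols = len(image), len(image[0])
    mask = invert(image)                              # steps (1) to (3)
    state = [[WHITE] * cols for _ in range(rows)]     # step (4)
    changed = True
    while changed:                                    # steps (5) to (9)
        changed = False
        for r in range(rows):
            for c in range(cols):
                # Cells where the mask is HI (original object pixels) stay white.
                if mask[r][c] == WHITE or state[r][c] == BLACK:
                    continue
                for dr, dc in CROSS:
                    rr, cc = r + dr, c + dc
                    outside = not (0 <= rr < rows and 0 <= cc < cols)
                    # The border is black, so an outside neighbour counts as black.
                    if outside or state[rr][cc] == BLACK:
                        state[r][c] = BLACK
                        changed = True
                        break
    return invert(state)                              # step (10)


ring = [[1, 1, 1, 1, 1],
        [1, 0, 0, 0, 1],
        [1, 0, 1, 0, 1],
        [1, 0, 0, 0, 1],
        [1, 1, 1, 1, 1]]
filled = hole_filler(ring)
```

For the ring-shaped object above, the enclosed white pixel is not reached by the wave from the border and therefore ends up black after the final inversion.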
The Figure Reconstruction template (also known as Selected Object Extraction) extracts from
the image the objects marked by the black pixels in the marker image. In other words, the marked
objects are preserved and appear in the result, while all the other objects are being erased. With
original image written to the mask and marker as the initial image we evaluate the template (5).
The measured example is shown in Figure 11.
$$A = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix}, \quad I = 0.5 \qquad (5)$$
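A behavioural Python sketch of the Figure Reconstruction with template (5) is given below. It assumes that the mask keeps every pixel that is white in the original image white, and that the border is white; both points, as well as the function name, are assumptions made for illustration.

```python
BLACK, WHITE = 0, 1


def figure_reconstruction(original, marker):
    rows, cols = len(original), len(original[0])
    # Initial state: marker pixels, restricted to the objects of the original.
    state = [[BLACK if marker[r][c] == BLACK and original[r][c] == BLACK else WHITE
              for c in range(cols)] for r in range(rows)]
    changed = True
    while changed:
        changed = False
        for r in range(rows):
            for c in range(cols):
                if original[r][c] == WHITE or state[r][c] == BLACK:
                    continue  # pinned white by the mask, or already black
                neighbours = [(r + dr, c + dc)
                              for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                              if (dr, dc) != (0, 0)]
                # With I = 0.5 a single black neighbour is sufficient.
                if any(0 <= rr < rows and 0 <= cc < cols and state[rr][cc] == BLACK
                       for rr, cc in neighbours):
                    state[r][c] = BLACK
                    changed = True
    return state


original = [[0, 0, 1, 1],
            [0, 1, 1, 1],
            [1, 1, 1, 0],
            [1, 1, 1, 0]]
marker = [[0, 1, 1, 1],
          [1, 1, 1, 1],
          [1, 1, 1, 1],
          [1, 1, 1, 1]]
reconstructed = figure_reconstruction(original, marker)
```

In this example only the object touched by the marker survives, while the unmarked object in the lower-right corner is erased.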
The Object Increase, like every other discrete time CNN (DT-CNN) template [9], is evaluated
as a B-template. It is expressed as
$$B = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix}, \quad I = 0.5 \qquad (6)$$
This operation causes black objects in the image to grow by one pixel in every direction. Figure 12
shows the initial image and the measured result.
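Because a B-template depends only on the input image, the Object Increase amounts to a single pass over the array. The Python sketch below illustrates this; the white border and the function name are assumptions.

```python
BLACK, WHITE = 0, 1


def object_increase(image, border=WHITE):
    """Single B-template pass with template (6) and bias 0.5."""
    rows, cols = len(image), len(image[0])

    def pixel(r, c):
        return image[r][c] if 0 <= r < rows and 0 <= c < cols else border

    result = []
    for r in range(rows):
        row = []
        for c in range(cols):
            # D counts the black pixels in the 3x3 neighbourhood of the input.
            d = sum(pixel(r + dr, c + dc) == BLACK
                    for dr in (-1, 0, 1) for dc in (-1, 0, 1))
            row.append(BLACK if d > 0.5 else WHITE)
        result.append(row)
    return result


grown = object_increase([[1, 1, 1, 1],
                         [1, 0, 1, 1],
                         [1, 1, 1, 1],
                         [1, 1, 1, 1]])
```

A single black pixel thus grows into a 3 × 3 black square.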
Figure 12. Evaluation of the Object Increase template. [Image panels: INIT, RESULT.]
Table III. Execution time of basic operations at VDD1 = 1.2 V.
Operation Execution time
Write to memory 8 ns
Read from memory 8 ns
Write the transient mask 8 ns
Load image from data bus 61 ns per row
NOT 60 ns
XOR 80 ns
NAND 96 ns
NOR 160 ns
B-template 11 ns
A-template 4 ns per cell
Figure 13. The initial image and the resulting one obtained in power measurements. [Image panels: INIT, RESULT.]
4.3. Speed of operation and power consumption
Due to the output frequency limitation of the pattern generator, the maximal speed of some oper-
ations could not be determined. Namely, the writing to and reading from a local memory, writing
the transient mask as well as all the operations containing these tasks could be performed even
faster. Therefore, their execution times are marked in Table III as ‘less or equal to’ the minimum
measured value.
The dynamic power consumption was measured as the current drawn from the power supply
during looped evaluation of the B-template:
$$B = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix}, \quad I = 0.5 \qquad (7)$$
The borders were set black, the template terms were programmed, and the initial image, shown
in Figure 13, was stored in one of the local memories. Then, the loop consisting of the following
operations was executed at the speed of 25 × 10^6 cycles per second.
(1) Read out the initial image from memory.
(2) Run the template (7).
This loop comprises five lines of code. One line is spent for reading out the image from memory,
two lines for the template evaluation, and additional two lines to assure the proper sequence of the
control signals. With the initial image as shown in Figure 13, all but one of the cells in the array
are forced to change the state at the same time, resulting in a power dissipation of 9.8 µW per cell
at the supply voltage VDD1 = 1.2 V.
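The measured power can be converted into an energy per loop iteration. The Python sketch below assumes that the quoted 25 × 10^6 cycles per second refers to complete loop iterations, which would be consistent with five lines of code at the 8 ns minimum pulse width of the pattern generator; the QCIF extrapolation simply multiplies the per-cell value and is a hypothetical worst case, not a measurement.

```python
POWER_PER_CELL_W = 9.8e-6   # measured dynamic power per cell at VDD1 = 1.2 V
LOOP_RATE_HZ = 25e6         # assumed: loop iterations per second


def energy_per_loop_joules(power_w=POWER_PER_CELL_W, rate_hz=LOOP_RATE_HZ):
    """Energy drawn by one cell during one loop iteration."""
    return power_w / rate_hz


def array_power_watts(rows, cols, power_w=POWER_PER_CELL_W):
    """Power of a rows x cols array if every cell dissipated the measured value."""
    return rows * cols * power_w


energy_pj = energy_per_loop_joules() * 1e12     # about 0.39 pJ per cell and loop
qcif_power_w = array_power_watts(176, 144)      # about 0.25 W
```

Under these assumptions each cell consumes roughly 0.39 pJ per loop iteration.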
5. DISCUSSION
The measurement results presented in the previous section confirm that the structure works properly
in performing both the local logic operation and the template evaluation. If the supply voltage VDD1
Table IV. Chip characteristics.
Technology                  6M-1P 0.18-µm CMOS
Supply voltage              1.2 V
Control signal voltage      1.8 V
Array size                  4 × 4
No. of transistors per PE   64
PE area                     155 µm²
State representation        1-bit
State dynamics              Inverted Positive-Range High-Gain
I/O                         Digital
Weight programmability      1-bit
Dynamic power per PE        9.8 µW
Table V. Chip comparison.
                          This design      Reference [3]    Reference [6]    Reference [10]   Reference [13]
Technology                0.18 µm CMOS     0.35 µm CMOS     0.25 µm CMOS     0.8 µm CMOS      0.5 µm CMOS
                          1P-6M            1P-5M            1P-6M            1P-2M            1P-3M
Array size                4 × 4            128 × 128        176 × 144        32 × 32          48 × 48
PE density (PE/mm²)       6451             180              3027             71               295
Supported images          B/W              Grey             B/W              Grey             B/W
Photosensors              No               Yes              No               Yes              No
No. memories per PE       4 B/W            2 B/W & 8 Grey   6 B/W            8 B/W            4 B/W
No. transistors per PE    64               198              73               97               N/A
Weight distribution       Digital          Analogue         Analogue         Digital          Analogue
Bits per term A, B, I     1, 1, 1∗         8, 8, 8          6, 6, 9          1, 1, 0∗, †      6, 6, 6
(unlabelled in source)    4 ns             135 ns           N/A              N/A              50 ns
Power per PE              9.8 µW           180 µW           N/A              N/A              81 µW
∗ No separate bits for the A and B templates.
† No threshold functionality, only cross connection (no diagonal connection) to the neighbourhood.
is made variable for template-by-template evaluation, speed and robustness can be effectively traded
for power consumption. The external buffers in the bidirectional data bus stopped working properly
below 0.55 V and this limited the minimum supply voltage to 0.55 V in the measurements. Correct
results with the Shadow template were obtained at VDD1 as low as 0.55 V, while with the logic
operations the limit was 0.6 V. The correct and data-independent results of the more complex
algorithms were achieved at a supply voltage of 1.2 V. Thus, this value is listed in Table IV as the
supply voltage. Also, the given dynamic power consumption was measured with VDD1 = 1.2 V.
Although the presented cell has the bias term programmable to either 0.5 or 1.5, only operations
requiring 0.5 bias were presented. That is because, in the layout drawing process, the bias PMOS
transistors were improperly scaled (MP1 with W/L = 0.5/0.3, MP2 with W/L = 0.5/0.6, and MP3
with W/L = 0.5/0.3) according to an older version of the schematic. Thus, the mirrored currents
were made smaller (about 3 times smaller at 1.2 V supply voltage). For the case of 0.5 bias, having
a lower value of the bias than half of the unit current can actually be seen as a means to improve
the robustness [12]. However, for the case of 1.5 bias, the mirrored current should be close to 1.5 of
the unit current. Nevertheless, the functionality of the 1.5 bias was verified with the measurements
at very low supply voltage (VDD1 = 0.3 V, the mirroring of the bias current works better in the
subthreshold region), while the programming and reading out were conducted at VDD1 = 0.8 V. If
the PMOS devices in the bias circuit were properly scaled as in the simulations of Section 3.1, the
cell size would increase to about 170 µm².
Table V provides a comparison of the presented design and other chips. However, the reported
test-structure can fairly be compared with designs of References [6, 13] only, as they implement
similar functionality. The chips of References [3, 10] can operate on grey-scale images and have
built-in photosensors, and are placed here to give a broader context.
6. CONCLUSION
A hardware realization of a CNN for processing B/W image data was presented. Both the coefficient
circuits and the bias are 1-bit programmable. Therefore, a very compact implementation of the
couplings was obtained. Due to limited programmability, the more complex templates are evaluated
as a set of simple subtasks. Since the weights are programmed digitally, the write-time of the
template terms is fast, and thus the overall performance stays competitive. The proposed transient
mask structure proved to be useful in the implementation of a fixed state map, required in some
algorithms as well as in the evaluation of logic functions. The small cell dimensions allow the
implementation of a very large array on the same chip.
ACKNOWLEDGEMENTS
This work was funded by the Academy of Finland in the projects 205443 and 106451.
REFERENCES
1. Chua LO, Yang L. Cellular neural networks: theory. IEEE Transactions on Circuits and Systems 1988; 35:
1257–1272.
2. Linan G, Espejo S, Dominguez-Castro R, Rodriguez-Vazquez A. ACE4k: an analogue I/O 64 × 64 visual
microprocessor chip with 7-bit analogue accuracy. International Journal of Circuit Theory and Applications 2002;
30(2/3):89–116.
3. Rodriguez-Vazquez A et al. ACE16k: the third generation of mixed-signal SIMD-CNN ACE chips toward VSoCs.
IEEE Transactions on Circuits and Systems—Part I 2004; 51(5):851–863.
4. Espejo S, Carmona R, Dominguez-Castro R, Rodriguez-Vazquez A. A VLSI oriented continuous-time CNN
model. International Journal of Circuit Theory and Applications 1996; 24:341–356.
5. Paasio A, Halonen K. A new cell output nonlinearity for dense cellular nonlinear network integration. IEEE
Transactions on Circuits and Systems—Part I 2001; 48(3):272–280.
6. Paasio A, Kananen A, Halonen K, Porra V. A QCIF resolution binary I/O CNN-UM chip. Journal of VLSI Signal
Processing 1999; 23:281–290.
7. Paasio A, Laiho M, Kananen A, Halonen K. An analogue array processor hardware realization with multiple new
features. Proceedings of the 2002 International Joint Conference on Neural Networks, Honolulu, Hawaii, 2002;
1952–1955.
8. Laiho M, Paasio A, Flak J, Halonen K. Template design for binary-programmable cellular nonlinear networks.
IEEE International Symposium on Circuits and Systems, Kobe, Japan, 2005; 3981–3941.
9. Roska T et al. CNN software library version 1.1 [On-line]. Available: http://lab.analogic.sztaki.hu/Candy/csl.html,
2000.
10. Eklund J-E, Svensson C, Astrom A. VLSI implementation of a focal plane image processor—a realization of
the near-sensor image processing concept. IEEE Transactions on Very Large Scale Integration Systems 1996;
4(3):322–335.
11. Stoffels A, Roska T, Chua LO. Object-oriented image analysis for very-low-bitrate video-coding systems using
the CNN universal machine. International Journal of Circuit Theory and Applications 1997; 25:235–258.
12. Brea V, Laiho M, Paasio A. Robustness improvement in binary cellular non-linear network architectures.
Proceedings of the 2005 European Conference on Circuit Theory and Design, Cork, Ireland, 2005; I-149–I-152.
13. Paasio A. Integration of cellular nonlinear network universal machine. Ph.D. Dissertation, Helsinki University of
Technology, Espoo, Finland, 1999.