Dense CMOS implementation of a binary-programmable cellular neural network

Jacek Flak

doi:10.1002/CTA.365

2006, International Journal of Circuit Theory and Applications

https://doi.org/10.1002/CTA.365

15 pages

Abstract

An implementation of a cellular neural/non-linear network (CNN) for processing black-and-white (B/W) images is presented in which the template terms are 1-bit programmable. Such an approach leads to a very compact implementation of the coefficient circuits and fast (digital) programming. In this programming scheme, the more complex templates are split into subtasks that are run successively. The structure allows a direct or algorithmic evaluation of the majority of templates proposed for B/W images. The transient mask is utilized in performing the local logic operations as well as in template operations. The proposed architecture is suitable for high-density implementations. A test structure of a 4 × 4 network has been implemented with a standard digital 0.18-µm CMOS process. One cell occupies only 155 µm², making possible the implementation of very large networks on a single chip. The algorithms used for the logic function computations and selected template evaluations are described, and the corresponding measurement results are shown.

INTERNATIONAL JOURNAL OF CIRCUIT THEORY AND APPLICATIONS
Int. J. Circ. Theor. Appl. 2006; 34:429–443
Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/cta.365

Jacek Flak (1, ∗, †), Mika Laiho (2, 1), Ari Paasio (2) and Kari Halonen (1)

1 Helsinki University of Technology, Electronic Circuit Design Laboratory, P.O. Box 3000, FIN-02015 TKK, Finland
2 University of Turku, Microelectronics Laboratory, Lemminkäisenkatu 14-18, 20520 Turku, Finland

Copyright © 2006 John Wiley & Sons, Ltd.
Received 8 June 2005; Revised 16 March 2006

KEY WORDS: cellular neural networks; CMOS integrated circuits; image processing; measurement

1. INTRODUCTION

The theory of cellular neural/non-linear networks (CNN) [1] is a powerful tool for many image processing tasks, and serves as a theoretical base for development of the vision chips such as [2, 3].
However, its practical implementation is a real challenge. Therefore, designers have come up with implementation-oriented simplifications to the CNN theory. For instance, the full signal range (FSR) model [4], which truncates the cell state between −1 and 1, has led to the realization of a grey-scale CNN with 128 × 128 cells [3]. An alternative approach is to realize the grey-scale and the black-and-white (B/W) image processing parts separately so that their implementations can be optimized. The B/W data processing is an extremely important class of operations, being an integral part of most of the algorithms proposed for CNN.

∗ Correspondence to: Jacek Flak, Helsinki University of Technology, Electronic Circuit Design Laboratory, P.O. Box 3000, FIN-02015 TKK, Finland.
† E-mail: [email protected]
Contract/grant sponsor: Academy of Finland; contract/grant numbers: 205443, 106451

Paasio and Halonen [5] proposed a positive-range high-gain CNN output non-linearity that enables a significant simplification of the B/W cell structure. Based on that approach, a 176 × 144 (QCIF resolution) CNN chip has been implemented [6]. Further simplification of this structure was obtained by reducing the programmability of the weights to 1 bit, i.e. both the multiplier and the multiplicand are 1-bit values as depicted in Reference [7]. Due to limited programmability, more complex templates need to be divided into a set of simple sub-templates that are run successively. The result of each subtask is either the initial state for the following one or it is stored in a local memory of a cell to be later combined with others into the final result. If the bias has 2-bit programmability as recently proposed in Reference [8], all the templates handling B/W images could be transformed into a set of (typically, one to three) binary sub-templates by, e.g.
separating the positive terms of the conventional templates given in Reference [9] from the negative ones. The general rules on how to design binary-programmable templates are given in Reference [8].

This paper describes a hardware realization that is simplified to process B/W data only. The structure is somewhat similar to the B/W data processing part, called global logic unit (GLU), of the near-sensor image processor (NSIP) presented in Reference [10]. However, the GLU implementation given in Reference [10] lacks the threshold logic capability. Here, both the coupling coefficients and the bias (setting the threshold level) are 1-bit programmable. This enables the bias programming to either 0.5 or 1.5. The 1.5 bias is needed, e.g. for the Insignificant Line Remover template [11]. Binary-valued weights of the cells' interconnections lead to a very compact coefficient circuit implementation with a small die area. Additionally, the write-time of the template terms is very short due to digital programming. The other very important aspect in array computing, namely the power consumption, is also addressed here. A dissipation in the range of microwatts per cell along with a small cell size opens the door for high-frame-rate implementations with very high spatial resolution. The reduced programmability does not induce any severe limitation to the processor versatility, due to template division and fast programming [8].

The main objective of this article is to demonstrate the realizability and measured performance of the proof-of-concept test structure. The chip architecture and circuitry are presented in Section 2, and the silicon implementation issues are covered in Section 3. A selection of templates is described and visualized by the measurement results in Section 4. Discussion regarding the chip performance is given in Section 5, and the conclusions are drawn in Section 6.

2. ARCHITECTURE

Figure 1 presents a block diagram of the chip consisting of a 4 × 4 array of processing elements (PEs). In addition, a shift register used for programming the template terms was included to minimize the number of pins used, because two other test structures were implemented along with this one on the same chip. Only 10 bits (including the bias term) are needed for the template terms, since only one of either A or B template matrix has non-zero elements. The supply voltage of the PEs, VDD1, is kept at least one NMOS threshold voltage lower than the high level of the global control signals. This way, the NMOS switches can convey full logic levels. The supply voltage of the shift register, VDD2, is the same as the logic HI level of the global signals. The right side of Figure 1 shows three main building blocks that can be distinguished in the PE structure: the bias circuit, the cell's computing kernel with digital local memories, and the coefficient circuits.

Figure 1. Block diagram of the chip architecture.

Figure 2. Cell output non-linearity.

2.1. Cell model

The cells in the presented approach use a modified version of the positive-range high-gain output non-linearity resulting in an inverting threshold function (see Figure 2). This non-linearity can be mathematically expressed as

\[
V_Y = f(V_X) =
\begin{cases}
V_{DD1}, & 0 \le V_X < V_{th} \\
0, & V_X \ge V_{th}
\end{cases}
\tag{1}
\]

where VY is the cell output, VX stands for the cell state, and Vth represents the comparator threshold.
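As a minimal sketch of Equation (1), the inverting threshold can be written as a small Python function. The function name and the numeric values in the usage lines (VDD1 = 1.2 V and a threshold at half the supply) are illustrative assumptions, not figures taken from the design.

```python
def cell_output(v_x: float, v_th: float, v_dd1: float) -> float:
    """Inverting threshold of Equation (1).

    Returns V_DD1 (logic HI) while the state is below the threshold and
    0 V (logic LO) once the state reaches or exceeds it.
    """
    return v_dd1 if v_x < v_th else 0.0


# Hypothetical operating point: V_DD1 = 1.2 V, V_th assumed at V_DD1 / 2.
V_DD1 = 1.2
V_TH = 0.6
low_state_output = cell_output(0.1, V_TH, V_DD1)   # LO state -> HI output
high_state_output = cell_output(1.1, V_TH, V_DD1)  # HI state -> LO output
```

The output is therefore HI exactly when the cell state is LO, which is the inverted sense used throughout the rest of the cell model.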
With the modification shown in Figure 2, the state equation can be written in the form of

\[
C_X \frac{d({\sim}V_X)}{dt} = \sum_{k,l \in S_{ij}} AB_{kl}\, V_{Ykl} - I = D - I
\tag{2}
\]

where CX is the state capacitance, Sij is the first neighbourhood of a cell in the ith row and jth column, ABkl are the coefficients, i.e. the elements of either A or B template matrix (depending on the global signal AorB), I is the threshold set by the positive bias, and D is the number of black neighbourhood pixels (cells with low logic level at VX) marked by the template.

2.2. Computing kernel

Figure 3 shows the detailed structure of the cell's computing kernel. It comprises two inverters INV1 and INV2, four SRAMs grouped in the local memories block, a transient mask based on the programmable transmission gate, state capacitance CX (implemented with a capacitor-connected 3 µm × 3 µm NMOS), and a number of switches. Additionally, four parasitic gate capacitances CS, CC, Cm, and ~Cm utilized as dynamic memories are shown with the small dashed-line symbols.

Figure 3. Schematic of the computing kernel of a cell.

The computing kernel of a cell has three operating modes:

• a multi-input threshold logic gate (TLG) with programmable inputs
• a multi-input (programmable inputs) TLG with a fixed state map
• a local logic device (transient mask operations).

In the first two modes, the cell can operate either in discrete time or in a way that allows asynchronous propagation (A-templates). If the computing kernel operates in the multi-input TLG
mode, the start switch is used for connecting the neighbours' outputs to the cell input, thus initializing the processing. The input currents set the voltage at the state capacitance CX. The voltage VX is thresholded by INV1, which provides the output voltage VY. The cell state can be latched in the SRAM-like loop formed by the inverters INV1 and INV2, and the switches hold1 and hold2. Control signal AorB is used to determine whether an A- or B-template is run. When it is kept HI, the cell output contribution is enabled to change during the processing, and therefore an A-template is evaluated. For a B-template this signal is active only for a short while in order to write VY into the parasitic gate capacitance of the coefficient circuits, CC. Transistor dbus_r/w is used to enable the image data transfer to and from a data bus. The control signals unit and zero are used to preset the cell state VX to either VDD1 or VSS, respectively.

The transient mask is used for implementing a fixed state map as well as for performing all logic functions. It can be programmed by writing the desired values via the set_mask switches into Cm and ~Cm. When the mask is inactive (transmission gate is not conducting), the cell state VX evolves according to Equation (2). When the mask is active (transmission gate is conducting), VX is forced to either VS or its inverse ~VS depending on whether the tr_mask or the invert signal is set HI. Such enforcement is possible because the transistor controlled by the start signal has a small W/L ratio. When the computing kernel operates in the local logic device mode, the signals start and AorB are kept LO, thus there is no interaction between the cells.

2.3. Coefficient circuits and border cells

In the presented approach the coefficient circuits, shown in Figure 4, are implemented with pull-down circuits.
The output of INV1 controls the gates of the analogue transistors (shown using larger symbols than the template term-controlled switches) in each coefficient circuit. Assuming the corresponding template term (ABkl) to be '1', the coupling becomes active when the cell output is HI, i.e. the cell state VX is LO. Such an activated coefficient circuit works as a unit current sink at the input of the corresponding neighbour. Since a low supply voltage is used, the gate of the NMOS current source can be connected to VDD1, while a reasonable current level is maintained. In this way, the unit current is determined by the supply voltage of the cell. Therefore, the trade-off between speed and power consumption can be controlled with the value of VDD1. When a B-template is run, the parasitic gate capacitances of the analogue transistors maintain the desired control value (either zero or VDD1 volts). Since the processing is fast, the gate parasitics do not need to hold the voltage for a long time. Therefore, the transistor leakage is not a critical issue.

The border cells surrounding the computing array have structures identical to the coefficient circuits. A border cell switch is controlled by a suitable template term (e.g. AB21 in the left border), and the analogue transistor is driven by the global border signal.

2.4. Bias circuit

The pull-up bias circuit, shown in Figure 5, comprises a simple current mirror with a scaled replication of the current defined by an NMOS current source (identical to the one in the coefficient circuits). The total bias current is either 0.5 or 1.5 times the unit current depending on the global signal biasbit. The bias programmability could easily be extended to two bits by adding another path with a scaled (×2) current replication, as proposed in Reference [8]. The analogue power consumption of a cell after settling is bounded above by the bias current.
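The threshold decision implied by Equation (2) together with the coefficient and bias circuits can be sketched in a few lines of Python. This is a behavioural illustration only: currents are counted in multiples of the unit current, the helper names are hypothetical, and the mapping biasbit = 1 to a bias of 1.5 is an assumption, since the text does not state the polarity of biasbit.

```python
def bias_level(biasbit: int) -> float:
    """Bias in unit currents; assumes biasbit = 1 selects 1.5 (polarity not given in the text)."""
    return 1.5 if biasbit else 0.5


def active_sinks(template, black) -> int:
    """D of Equation (2): number of black neighbourhood cells marked by a '1' template term.

    template and black are 3x3 lists of 0/1; black[r][c] = 1 means the cell at
    that neighbourhood position has a LO state (black pixel, HI output).
    """
    return sum(template[r][c] * black[r][c] for r in range(3) for c in range(3))


def turns_black(template, black, biasbit: int) -> bool:
    """~V_X rises, i.e. the state V_X falls to LO (black), when D - I > 0."""
    return active_sinks(template, black) - bias_level(biasbit) > 0


# All-ones template with one or two black cells in the neighbourhood.
ones = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
one_black = [[1, 0, 0], [0, 0, 0], [0, 0, 0]]
two_black = [[1, 0, 0], [0, 0, 0], [0, 0, 1]]
```

Under these assumptions a single marked black neighbour is enough to pull the state LO with the 0.5 bias, whereas the 1.5 bias requires two, so the margin |D − I| is never smaller than 0.5 unit currents.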
Figure 4. Schematic of the coefficient circuits.

Figure 5. Schematic of the bias circuit.

3. SILICON IMPLEMENTATION

3.1. Robustness

When implementing an array processor, mismatch caused by processing variations has to be taken into account. To estimate the probability of an error due to mismatch, we Monte Carlo-simulated the cell with 1000 iterations. The mismatch parameters and models provided by the foundry were used within the Eldo simulator. Transistors with the following sizes (given in micrometers) were used: NMOS transistors with W/L = 0.5/0.5 (for both the analogue transistors in the coefficient circuits and MN in the bias circuit), and PMOS transistors with W/L = 2.0/0.3 (for MP1), W/L = 2.17/0.28 (for MP3), and W/L = 2.17/0.56 (for MP2).

The absolute robustness (minimum separation of the two logic states) of the binary-programmable CNN is min|D − I| = 0.5, and the minimum relative robustness occurs when D = I + 0.5 with bias I at its maximum value [8]. Therefore, the bias term set to 1.5 and two black neighbours can be considered as the worst case here. With the bias programmed to 0.5, failure-free results were obtained at each supply voltage and neighbourhood condition. For the case of the bias programmed to 1.5, failure-free results were obtained with a supply voltage VDD1 ≥ 0.8 V. As the supply voltage is scaled down, the failure percentage increases.
As can be deduced, an increase of the supply voltage (if programmed differently for different templates) can be used as a means to increase the robustness of a certain template evaluation. Table I presents the percentage of incorrect cell states at given bias and neighbourhood vs the supply voltage.

Table I. Simulated failure state percentage vs supply voltage.
Bias and no. of black neighbours | VDD1 = 0.5 V | 0.6 V | 0.7 V | 0.8 V | 0.9 V | 1.0 V | 1.1 V | 1.2 V
I = 1.5, D = 1 | 0.8 | 0 | 0 | 0 | 0 | 0 | 0 | 0
I = 1.5, D = 2 | 21.1 | 11.3 | 1.1 | 0 | 0 | 0 | 0 | 0

3.2. Layout

For maximal density the cells are grouped into pairs of partially overlapping cells. It is important to mention here that the cell array does not require any additional spacing (e.g. for signal wires) between those pairs. In fact, they also slightly overlap each other, thus reducing the array area. Figure 6 presents the layout of two connected cells designed for a standard digital 0.18-µm CMOS process. The process has a single poly layer, while six metal layers are available for routing. In this design, only four metal layers were used, thus there is room for an improved distribution of the global signals and supply voltage or a shielding layer, if a large network is to be built. The area of such a two-cell pair is 310 µm². The entire 4 × 4 array occupies 62.2 µm × 38.5 µm without the borders, and 67 µm × 46 µm with the border cells. The layout was designed in a full-custom manner.

Figure 6. Layout of two cells with marked regions occupied by each of the PE's components.

4. MEASUREMENT RESULTS

The chip measurements were conducted with a dedicated test board comprising external buffers for the bidirectional data bus. All the control signals as well as image data were provided by a pattern generator with 8 ns minimum pulse width.
The images resulting from processing were read out from the chip and the cell states were determined by a logic analyser. The presented array has been provided with separate supply rails. A microampere-meter in series with the power supply source was used for the measurements of the power consumption.

4.1. Logic operations: NOT, XOR, NAND, NOR

The logic operations (Boolean combinations) can be computed by CNN either with the use of templates or by a local logic unit (LLU) placed within each PE as in Reference [3]. In the presented approach the logic operations do not use templates, nor is there a dedicated LLU. Instead, a sequence of global control signals makes use of the programmable transmission gate for computing these functions. Figure 7 shows the simplified equivalent cell structures for different logic operations. For clarity, we omitted the complementary control of the transmission gate's PMOS (values stored on ~Cm) as obvious. In the XOR case, a computing kernel performs a conditional inversion of one operand if the other operand is in a logic HI state. Analogously, the NAND function means conditional inversion of one operand when the other is HI or otherwise the cell state is set HI. NOR is done by inverting one operand when the other is LO or otherwise the cell state is set LO.

In the presented approach the NOT operation is used in many algorithms. Though it is a trivial function, its evaluation with the proposed cell structure is not so straightforward. Therefore, the detailed NOT algorithm is given here to show how this logic operation translates into a sequence of control signals. The inversion is done with the aid of the transient mask programmed to conduct.

(1) Set the cell state HI: hold1, unit
(2) Program the transient mask: set_mask
(3) Read out the operand from memory: rd1
(4) Separate CX from CS: hold2
(5) Perform the inversion (force VX to ~VS): invert ...
(6) Charge sharing between CX and CS: hold2

Since CX/CS ≈ 5, the voltage VS at the input of INV1 will become close to VX. Then hold1 is switched and the result is latched in the cell, ready to be read out or used in further calculation. The evaluation of NOT becomes straightforward with a slightly modified cell structure as proposed in Reference [8], where the transient mask can be bypassed with a switch.

Figure 7. The equivalent cell structures for different logic functions: (a) NOT; (b) XOR; (c) NAND; and (d) NOR.

Figure 8. The operands and the measured results of the logic operations.

The signal sequences for the two-operand logic functions are designed in a similar way. In the XOR case the first operand instead of unity is used to program the mask. A similar sequence is required for a NAND function. The only difference lies in the need to initialize the cell to HI (unit) before starting the inversion with invert. That will ensure the HI state of the cells where no inversion takes place, i.e. the first operand is LO. Only a slightly more complicated algorithm is needed to compute the NOR function. Namely, the first operand needs to be inverted before being written to the transient mask to ensure the inversion takes place for its LO state (i.e. the NOR algorithm begins with computing the NOT). Also, the cell state needs to be set LO (zero) before the signal invert. Figure 8 shows the operands and the measured results of the two-operand logic functions.
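The four logic functions described above reduce to one primitive: where the transient mask conducts, the state is forced to the inverse of the operand; elsewhere the preset state is kept. The following Python sketch models this at the Boolean level only (1 = HI, 0 = LO). The function names are illustrative, and taking the second operand as the preset state for XOR is an assumption consistent with the description of XOR as a conditional inversion.

```python
def masked_invert(mask: int, operand: int, preset: int) -> int:
    """Where the mask conducts (mask = 1) return NOT operand, else keep the preset state."""
    return (1 - operand) if mask else preset


def logic_not(op: int) -> int:
    # Mask programmed to conduct everywhere, so the preset state is irrelevant.
    return masked_invert(1, op, 1)


def logic_xor(op1: int, op2: int) -> int:
    # The first operand programs the mask; otherwise the cell keeps the second operand.
    return masked_invert(op1, op2, op2)


def logic_nand(op1: int, op2: int) -> int:
    # Cell preset HI (unit); inversion takes place only where op1 is HI.
    return masked_invert(op1, op2, 1)


def logic_nor(op1: int, op2: int) -> int:
    # op1 is inverted first, so inversion takes place where op1 is LO; cell preset LO (zero).
    return masked_invert(logic_not(op1), op2, 0)


# Rows are (op1, op2, XOR, NAND, NOR) for all four operand combinations.
truth_table = [(a, b, logic_xor(a, b), logic_nand(a, b), logic_nor(a, b))
               for a in (0, 1) for b in (0, 1)]
```

Enumerating `truth_table` gives a quick check that the three presets (second operand, HI, LO) are what distinguish XOR, NAND and NOR from each other in this model.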
In this implementation, black pixels correspond to logical '0' and are represented by a LO cell state (voltage at VX), and white pixels correspond to logical '1' and are represented by a HI cell state.

4.2. Processing of templates

The first examined template is the Shadow into South-West (SW) direction, which in our case takes the form of

\[
A = \begin{bmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix}, \quad I = 0.5
\tag{3}
\]

Each cell checks the state of the neighbour specified by the non-zero template term. If it is LO (black pixel), the cell also turns black. The Shadow template gives an opportunity to measure the speed of a wave propagating throughout the network. The measurements were conducted with a supply voltage VDD1 varying from 0.55 to 1.2 V. At each VDD1 value we measured the minimum time required for the propagation of a black pixel from one cell through the rest of the array (three cells). Depending on the supply voltage a wave needs from 4 ns at 1.2 V to 78.3 ns at 0.55 V to propagate a distance of one cell (see Table II). The table also shows the measured 0.5 bias currents at the given voltages. The Shadow template evaluation into the SW direction is shown in Figure 9.

Table II. Measured wave propagation speed and 0.5 bias current vs supply voltage.
Supply voltage (V) | Measured propagation time per cell (ns) | Measured 0.5 bias current per cell (µA)
1.2 | 4.0 | 6.58
1.1 | 4.7 | 5.19
1.0 | 5.3 | 3.92
0.9 | 6.3 | 2.78
0.8 | 9.0 | 1.07
0.7 | 16.3 | 0.68
0.6 | 41.7 | 0.35
0.55 | 78.3 | 0.34

Figure 9. Evaluation of the Shadow template into SW direction.

The Hole Filler template causes the white areas surrounded by black pixels to turn black. Therefore, the white holes within black objects are being filled.
In this implementation, it has the form of

\[
A = \begin{bmatrix} 0 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix}, \quad I = 0.5
\tag{4}
\]

The algorithm is slightly more complicated than for Shadow, therefore it is described in more detail here.

(1) Load an image: dbus_r/w
(2) Invert the image: NOT operation as depicted in Section 4.1
(3) Set the transient mask condition: set_mask
(4) Initialize all cells to white: hold1, unit
(5) Separate CX from CS: hold2
(6) Upon the mask condition force VX to VS: tr_mask
(7) Select the A-template evaluation: AorB
(8) Set the border black: border
(9) Evaluate the template: start ...
(10) Invert the image: NOT operation

After the algorithm evaluation is completed the result can be read out from the cell or used in further processing. Figure 10 presents the original image and the measured result of Hole Filler evaluation.

Figure 10. Evaluation of the Hole Filler template.

The Figure Reconstruction template (also known as Selected Object Extraction) extracts from the image the objects marked by the black pixels in the marker image. In other words, the marked objects are preserved and appear in the result, while all the other objects are erased. With the original image written to the mask and the marker as the initial image we evaluate the template (5). The measured example is shown in Figure 11.

\[
A = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix}, \quad I = 0.5
\tag{5}
\]

Figure 11. Evaluation of the Figure Reconstruction template.

The Object Increase, like every other discrete time CNN (DT-CNN) template [9], is evaluated as a B-template. It is expressed as

\[
B = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix}, \quad I = 0.5
\tag{6}
\]

This operation causes black objects in the image to grow by one pixel in every direction.
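The template evaluations in this section can be imitated in software with a thresholded neighbourhood count. The sketch below is a behavioural model under stated assumptions: images are lists of 0/1 flags with 1 marking a black pixel, template entry [kr][kc] is taken to weight the neighbour at row offset kr − 1 and column offset kc − 1, border cells are white unless border_black is set, and the asynchronous propagation of an A-template is approximated by repeated synchronous steps. None of the function names come from the paper.

```python
def tlg_step(black, template, bias, border_black=False):
    """One synchronous update: a cell becomes black when D > bias, where D counts
    the black cells selected by the non-zero template terms."""
    rows, cols = len(black), len(black[0])
    result = []
    for i in range(rows):
        new_row = []
        for j in range(cols):
            d = 0
            for kr in range(3):
                for kc in range(3):
                    if not template[kr][kc]:
                        continue
                    ni, nj = i + kr - 1, j + kc - 1
                    inside = 0 <= ni < rows and 0 <= nj < cols
                    d += black[ni][nj] if inside else int(border_black)
            new_row.append(1 if d > bias else 0)
        result.append(new_row)
    return result


def run_b_template(black, template, bias, border_black=False):
    """B-template: a single discrete-time step computed from the input image."""
    return tlg_step(black, template, bias, border_black)


def run_a_template(black, template, bias, border_black=False, max_steps=1000):
    """A-template: iterate until the image stops changing (synchronous approximation)."""
    current = black
    for _ in range(max_steps):
        nxt = tlg_step(current, template, bias, border_black)
        if nxt == current:
            break
        current = nxt
    return current


OBJECT_INCREASE = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]   # Equation (6)
SHADOW_SW = [[0, 0, 1], [0, 1, 0], [0, 0, 0]]         # Equation (3)

seed = [[0, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
grown = run_b_template(seed, OBJECT_INCREASE, 0.5)

corner = [[0, 0, 0, 1], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
shadow = run_a_template(corner, SHADOW_SW, 0.5)
```

In the first example the single black pixel grows into a 3 × 3 block. In the second, the black pixel in the top-right corner spreads along the anti-diagonal and needs three steps to cross the 4 × 4 array, the same three-cell path that the propagation-time measurement relies on.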
Figure 12 shows the initial image and the measured result.

Figure 12. Evaluation of the Object Increase template.

4.3. Speed of operation and power consumption

Due to the output frequency limitation of the pattern generator, the maximal speed of some operations could not be determined. Namely, the writing to and reading from a local memory, writing the transient mask as well as all the operations containing these tasks could be performed even faster. Therefore, their execution times are marked in Table III as 'less or equal to' the minimum measured value.

Table III. Execution time of basic operations at VDD1 = 1.2 V.
Operation | Execution time
Write to memory | 8 ns
Read from memory | 8 ns
Write the transient mask | 8 ns
Load image from data bus | 61 ns per row
NOT | 60 ns
XOR | 80 ns
NAND | 96 ns
NOR | 160 ns
B-template | 11 ns
A-template | 4 ns per cell

The dynamic power consumption was measured as the current drawn from the power supply during looped evaluation of the B-template:

\[
B = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix}, \quad I = 0.5
\tag{7}
\]

The borders were set black, the template terms were programmed, and the initial image, shown in Figure 13, was stored in one of the local memories. Then, the loop consisting of the following operations was executed at the speed of 25 × 10^6 cycles per second.

(1) Read out the initial image from memory.
(2) Run the template (7).

Figure 13. The initial image and the resulting one obtained in power measurements.

This loop comprises five lines of code. One line is spent for reading out the image from memory, two lines for the template evaluation, and two additional lines to assure the proper sequence of the control signals.
With the initial image as shown in Figure 13, all but one of the cells in the array are forced to change the state at the same time, resulting in a power dissipation of 9.8 µW per cell at the supply voltage VDD1 = 1.2 V.

5. DISCUSSION

Table IV. Chip characteristics.
Technology | 6M-1P 0.18-µm CMOS
Supply voltage | 1.2 V
Control signal voltage | 1.8 V
Array size | 4 × 4
No. of transistors per PE | 64
PE area | 155 µm²
State representation | 1-bit
State dynamics | Inverted Positive-Range High-Gain
I/O | Digital
Weight programmability | 1-bit
Dynamic power per PE | 9.8 µW

Table V. Chip comparison.
 | This design | Reference [3] | Reference [6] | Reference [10] | Reference [13]
Technology | 0.18 µm CMOS 1P-6M | 0.35 µm CMOS 1P-5M | 0.25 µm CMOS 1P-6M | 0.8 µm CMOS 1P-2M | 0.5 µm CMOS 1P-3M
Array size | 4 × 4 | 128 × 128 | 176 × 144 | 32 × 32 | 48 × 48
PE density (PE/mm²) | 6451 | 180 | 3027 | 71 | 295
Supported images | B/W | Grey | B/W | Grey | B/W
Photosensors | No | Yes | No | Yes | No
No. memories per PE | 4 B/W | 2 B/W & 8 Grey | 6 B/W | 8 B/W | 4 B/W
No. transistors per PE | 64 | 198 | 73 | 97 | N/A
Weight distribution | Digital | Analogue | Analogue | Digital | Analogue
Bits per term (A, B, I) | 1, 1, 1∗ | 8, 8, 8 | 6, 6, 9 | 1, 1, 0∗, † | 6, 6, 6
[row label missing in source] | 4 ns | 135 ns | N/A | N/A | 50 ns
Power per PE | 9.8 µW | 180 µW | N/A | N/A | 81 µW
∗ No separate bits for the A and B templates.
† No threshold functionality, only cross connection (no diagonal connection) to the neighbourhood.

The measurement results presented in the previous section confirm that the structure works properly in performing both the local logic operation and the template evaluation. If the supply voltage VDD1 is made variable for template-by-template evaluation, speed and robustness can be effectively traded for power consumption. The external buffers in the bidirectional data bus stopped working properly at 0.95 V and this limited the minimum supply voltage to 0.55 V in the measurements.
Correct results with the Shadow template were obtained at VDD1 as low as 0.55 V, while with the logic operations the limit was 0.6 V. The correct and data-independent results of the more complex algorithms were achieved at a supply voltage of 1.2 V. Thus, this value is listed in Table IV as the supply voltage. Also, the given dynamic power consumption was measured with VDD1 = 1.2 V.

Although the presented cell has the bias term programmable to either 0.5 or 1.5, only operations requiring 0.5 bias were presented. That is because, in the layout drawing process, the bias PMOS transistors were improperly scaled (MP1 with W/L = 0.5/0.3, MP2 with W/L = 0.5/0.6, and MP3 with W/L = 0.5/0.3) according to an older version of the schematic. Thus, the mirrored currents were made smaller (about 3 times smaller at 1.2 V supply voltage). For the case of 0.5 bias, having a lower value of the bias than half of the unit current can actually be seen as a means to improve the robustness [12]. However, for the case of 1.5 bias, the mirrored current should be close to 1.5 times the unit current. Nevertheless, the functionality of the 1.5 bias was verified with the measurements at very low supply voltage (VDD1 = 0.3 V, the mirroring of the bias current works better in the subthreshold region), while the programming and reading out were conducted at VDD1 = 0.8 V. If the PMOS devices in the bias circuit were properly scaled as in the simulations of Section 3.1, the cell size would increase to about 170 µm².

Table V provides a comparison of the presented design and other chips. However, the reported test structure can fairly be compared with the designs of References [6, 13] only, as they implement similar functionality. The chips of References [3, 10] can operate on grey-scale images and have built-in photosensors, and are placed here to give a broader context.

6. CONCLUSION

A hardware realization of a CNN for processing B/W image data was presented.
Both the coefficient circuits and the bias are 1-bit programmable. Therefore, a very compact implementation of the couplings was obtained. Due to the limited programmability, the more complex templates are evaluated as a set of simple subtasks. Since the weights are programmed digitally, writing the template terms is fast, and thus the overall performance stays competitive. The proposed transient mask structure proved to be useful in the implementation of a fixed state map, required in some algorithms, as well as in the evaluation of logic functions. The small cell dimensions allow the implementation of a very large array on a single chip.

ACKNOWLEDGEMENTS

This work was funded by the Academy of Finland in the projects 205443 and 106451.

REFERENCES

1. Chua LO, Yang L. Cellular neural networks: theory. IEEE Transactions on Circuits and Systems 1988; 35:1257–1272.
2. Linan G, Espejo S, Dominguez-Castro R, Rodriguez-Vazquez A. ACE4k: an analogue I/O 64 × 64 visual microprocessor chip with 7-bit analogue accuracy. International Journal of Circuit Theory and Applications 2002; 30(2/3):89–116.
3. Rodriguez-Vazquez A et al. ACE16: the third generation of mixed-signal SIMD-CNN ACE chips toward VSoCs. IEEE Transactions on Circuits and Systems-Part I 2004; 51(5):851–863.
4. Espejo S, Carmona R, Dominguez-Castro R, Rodriguez-Vazquez A. A VLSI oriented continuous-time CNN model. International Journal of Circuit Theory and Applications 1996; 24:341–356.
5. Paasio A, Halonen K. A new cell output nonlinearity for dense cellular nonlinear network integration. IEEE Transactions on Circuits and Systems-Part I 2001; 48(3):272–280.
6. Paasio A, Kananen A, Halonen K, Porra V. A QCIF resolution binary I/O CNN-UM chip. Journal of VLSI Signal Processing 1999; 23:281–290.
7. Paasio A, Laiho M, Kananen A, Halonen K. An analogue array processor hardware realization with multiple new features. Proceedings of the 2002 International Joint Conference on Neural Networks, Honolulu, Hawaii, 2002; 1952–1955.
8. Laiho M, Paasio A, Flak J, Halonen K. Template design for binary-programmable cellular nonlinear networks. IEEE International Symposium on Circuits and Systems, Kobe, Japan, 2005; 3981–3941.
9. Roska T et al. CNN software library version 1.1 [On-line]. Available: http://lab.analogic.sztaki.hu/Candy/csl.html, 2000.
10. Eklund J-E, Svensson C, Astrom A. VLSI implementation of a focal plane image processor - a realization of the near-sensor image processing concept. IEEE Transactions on Very Large Scale Integration Systems 1996; 4(3):322–335.
11. Stoffels A, Roska T, Chua LO. Object-oriented image analysis for very-low-bitrate video-coding systems using the CNN universal machine. International Journal of Circuit Theory and Applications 1997; 25:235–258.
12. Brea V, Laiho M, Paasio A. Robustness improvement in binary cellular non-linear network architectures. Proceedings of the 2005 European Conference on Circuit Theory and Design, Cork, Ireland, 2005; I-149–I-152.
13. Paasio A. Integration of cellular nonlinear network universal machine. Ph.D. Dissertation, Helsinki University of Technology, Espoo, Finland, 1999.

