`
`K. CHEN, A. ASTROM and P.E. DANIELSSON, “PASIC: A Smart Sensor for
`
`Computer Vision,” Pattern Recognition, 1990. Proceedings,”
`
`10th International Conference on pp. 286-291, 16-21 June 1990
`
`TRW Automotive U.S. LLC: EXHIBIT 1041
`PETITION FOR INTER PARTES REVIEW
`OF U.S. PATENT NUMBER 8,599,001
`IPR2015-00436
`
`
`
`PASIC. A Smart Sensor for Computer Vision
`
`Keping Chen
`Anders Astrom and Per-Erik Danielsson
`LSI Design Center
`Department of Electrical Engineering
`Linkoping University, S-581 83 Linkoping, Sweden
`
`Abstract
`
`The PASIC prototype chip contains 256 x 256 photo sensors,
`a linear array of 256 AD-converters, two 256 8-bit shift
`registers, 256 bit-serial ALU, and a 256 x 128 bit-dynamic
`RAM. It is claimed to be a viable architecture for low-level
`vision processing. The processors operate in SIMD-mode
`at 20 MHz. When executing an edge-detection algorithm
`it is shown that the 256 processors are capable of an output
`rate of 3 Mpixel/second.
`
`1. Introduction
`
`The advancement of VLSI technology will inevitably give
`birth to integration of sensors and processors in systems for
`computer vision. The two overriding issues for such
`architectures are:
`
`i) Efficiency when executing a core subset of low
`level algorithms.
`
`the architecture
`ii) Adaptation of
`constraints of VLSI technology.
`
`to various
`
`An early approach in this direction initiated at our university
`is LAPP [l]. This is now a commercially available chip
`which features a linear array of 128 photo sensors,
`thresholding units and simple processors for binary logic.
`
`(Processor, AD-converter, Sensor Integrated
`PASIC
`Circuit) is a prototype for a more general purpuse smart
`sensor. The general floor plan is shown by Fig. 1. Most of
`the area in this prototype is occupied by the photo sensors.
`Their data are read out, one row at a time, over 256 vertical
`lines which connect to a linear array of comparators.
`Together with some digital circuits these comparators form
`a linear array of Analog-to-Digital converters.
`
`The remaining part of the chip is purely digital and contains
`two 8 bit wide 256 stage shift registers, 256 bit-serial
`processors and a 256 x 128 bit dynamic RAM. These parts
`will be described in the following sections. Throughout the
`
`CH2898-5/90/0000/0286$01 .OO 0 1990 IEEE
`
`286
`
`design both analog and digital circuitry has been
`implemented in double metal 1.6 km CMOS technology.
`
`Two 256 8bit bi-directional shift registers
`kt222
`
`256 bit-serial ALUs
`
`22222]
`
`Figure I PASIC floor plan
`
`2. The parallel A/D-converter
`
`A conventional semiconductor video camera has to provide
`for high speed (10 MHz) transportation of tiny charge
`packages from the sensor array to the analog video output.
`In PASIC this transportation problem is greatly alleviated
`since it takes place at much lower rate, one row at a time
`in parallel. See Figure 2. For each new row of analog data
`the 256 comparators are comparing the individual signal
`levels to a common ramp signal generated in synchronism
`with stepping a counter. The eight bits of the counter are
`continuously copied to each of 256 8-bit registers up to the
`
`Authorized licensed use limited to: Lester Kozlowski. Downloaded on June 29,2010 at 17:33:31 UTC from IEEE Xplore. Restrictions apply.
`
`1041-001
`
`
`
`point where the ramp value exceeds the individual signal
`level. The 256 AD-conversions are completed after a full
`ramp generation. With 20 MHz clock rate such an event
`takes place in 256 . 0.05 = 12.8 p.
`
`the system. Putting 256
`organization for the rest of
`processors on a chip immediately requires the processors
`to be bit-serial devices and that they should work in
`SIMD-mode.
`
`The accuracy in the present design is limited to 8 bit/pixel.
`It should be noted, however, that this precision can be
`traded for speed. The A/D-conversion can be speeded up
`by changing the slope of the ramp signal. For instance the
`conversion time for 256 x 256 pixels is reduced from 3276.8
`JLS for 8 bits to 1638.4, 819.2, 409.6, 204.8 and 102.4 KS
`respectively for 7, 6, 5, 4 or 3 bit accuracy. In many vision
`tasks such a programmability for speed-precision trade-off
`is very valuable.
`
`The possibility to increase the signal-to-noise
`lowering the frame rate is self-evident.
`
`ratio by
`
`processor architectures have been
`Many bit-serial
`presented in literature, for instance CLIP[2], DAP[3],
`MPP[4], GAPP[5] and the Connection Machine [6]. The
`details of the GAF'P
`is well known since it has been
`commercially available as a chip. The GAPP processor is
`very simple, having four one-bit registers per chip and logic
`consisting of a full adderlsubtractor. Flexibility and
`generality of the design stems from a rather extensive
`multiplexor system that connects the ALU outputs, the
`register outputs, the memory output and the four NEWS
`neighbors to the memory and the four register inputs. In
`this network several simultaneous bit transfers can take
`place. For instance, one bit can be received from East,
`another transferred Westbound, the sum output stored in
`a memory location, carry saved in the C-register, input bit
`shifted into the communication register, all in the same
`cycle.
`
`This seems very efficient but a closer examination reveals
`that processing speed is nearly always limited by one single
`factor, the memory bandwidth. In a typical two-input
`one-output operation like addition the execution time is
`therefore 3b + 1 cycles where b is the wordlength. We found
`it necessary to design a processor even simpler than the
`GAF'P. In particular, we found it impossible to squeeze a
`GAPP type processor into the narrow 40 pm vertical slot,
`given by the pitch of the sensor columns.
`
`In our VLSI single chip smart sensor, all units, the registers
`in the A/D converter, the bit-serial ALU array, the shift
`registers and the memory cells are hooked to the single-bit
`bus as shown in Figure 2 . All these units communicate with
`the bus bitwise, that is, at one instance only one bit in each
`unit is input/output to/from the bus. The bit-serial ALU and
`the shift registers can be viewed as logic embedded in a
`memory and they appear as memory ports to the bus. The
`architecture where all computation units are memory-like
`and are centered around buses are referred to as a
`memory-bus organized architecture as shown in Figure 4.
`The control and addressing to each unit can be combined
`in a single address decoder broadcasting to all column of
`units. The instruction word (including both address and
`control signals) is 26 bit wide, which can be divided into
`five fields.
`
`287
`
`I
`
`The registers of Figure 2 are actually designed as cells in
`a 8 x 256 bit memory. See Figure 3. The read out to the
`memory bus is controlled by a 3-to-8 decoder.
`From sensor array
`I
`
`Ramp
`gener.
`
`1
`
`I
`
`8
`Figure 2 The parallel AID-converter
`
`From
`comparator
`
`From
`decoder
`
`From
`counter
`
`Memory
`bus
`Figure 3 A bit position in the AID converter registers
`
`3. The memory bus architecture
`
`We believe that convincing arguments for organizing the
`A/D-conversion of an image sensor a linear array were
`given in
`the previous section. By pure geometrical
`constraints
`it
`is then hard to avoid a
`linear array
`
`Authorized licensed use limited to: Lester Kozlowski. Downloaded on June 29,2010 at 17:33:31 UTC from IEEE Xplore. Restrictions apply.
`
`1041-002
`
`
`
`S
`
`AD:
`
`SRA:
`SRB:
`
`&U:
`
`CMA:
`
`4bit A/D converter addredcontrol
`AD=O-7
`0-7th bit to Bus
`AD=8
`data latch
`AD=9
`count
`Sbit shift register 1 address/control
`Sbit shift register 2 addresskontrol
`SR=O
`shift right
`SR=l
`shift left
`SR=2
`data latch
`SR=3
`circle right
`SR=4
`circle left
`SR=5-12
`Bus to 0-7th bit register
`SR=13-20
`0-7th bit register to Bus
`4bit ALU input/output addredcontrol
`BtC to Bus
`ALu=o
`ALu= 1
`BC to Bus
`carry to Bus
`ALu=2
`Bus to C
`ALu=3
`ALu=4
`ACtBC to Bus
`ALu=s
`ACtBC to Bus
`ALU=6
`Bus to B
`B to Bus
`ALu=7
`ALU=8
`Bus to B
`ALu=9
`B@C to Bus
`ALU=lO
`BUS to A
`ALU=11
`BUS to A
`ALU=12
`Bus to A & carry
`feedback
`sum to Bus
`sum to Bus & carry
`feedback
`no operation
`ALU=lS
`8bit memory addresskontrol
`0 to Bus
`MA=O
`MA=l-127
`1-127 bit to Bus
`no operation
`MA=128
`MA=129-256 Bus to 1-127 bit
`
`ALU=13
`ALU=14
`
`-
`
`0
`
`A
`D
`C
`3 -
`4
`
`S
`R
`A
`
`8
`9
`
`S
`R
`B
`-
`13
`14
`
`A
`L
`U
`-
`17
`18
`
`C
`M
`
`A -
`
`25
`
`Figure 4 The architecture and instruction format of PASIC
`
`The key element is the ALU. It has three one-bit registers:
`A, B and C that feed the logic. The logic is basically a full
`adder with SUM and CARRY output, activated by the
`addresses PA = 13, 14 and 2 respectively. In addition,
`however, the ALU-logic is tapped in six other places to
`make six more Boolean functions available for each given
`input A, B, C. With little extra hardware cost these functions
`greatly enhances the overall performance of the processor
`and in effect replaces bit-shuffling over multiplexors in
`GAPP. The number of Boolean functions is further
`extended by the ability to load the A- and B-registers with
`both inverted and non-inverted data from the bus. For
`instance, the CARRY output becomes a Borrow = KB + XC
`+ BC by inserting A := x.
`
`Not shown in Figure 4 is the Global OR function in PASIC.
`It is implemented as a ivired OR combination of the
`C-register output from all 256 processors.
`
`4. Algorithm implementation
`
`Most low-level algorithms for image processing invol
`neighborhood
`operations. The necessary horizont
`neighbor communication is provided by the bi-directior
`shift registers. For grayscale (8-bit) images the obvio
`sequence of events is to load the shift registers over the b
`from the AD-converter or the memory in eight cyck
`After one shift cycle, these 256 x 8 bits are available to t
`neighbors at left or right. Summing two 8-bit pixels frc
`neighboring columns in this manner takes 8 t 1 + 3.8 t
`from
`= 34 cycles where 9 extra cycles stem
`t
`communication. Note, however, that if the result is to
`used in a subsequent neighborhood operation we shol
`store the result into the shift register rather than 1
`memory. The extra delay for neighborhood communicati
`is then virtually eliminated.
`
`The memory in PASIC (present version) is too small
`allow for more sophisticated algorithms. In fact, since
`cannot harbor a full 256 x 256 gray level image we can o
`
`288
`
`Authorized licensed use limited to: Lester Kozlowski. Downloaded on June 29,2010 at 17:33:31 UTC from IEEE Xplore. Restrictions apply.
`
`1041-003
`
`
`
`row-parallel
`bit-serial,
`in
`algorithms
`implement
`pipe-lined versions. If an algorithm is specified using
`intermediate image buffers we have to reformulate the
`algorithm and execute all the stages of the algorithm row
`by row. See Fig. 5. We can hold a limited number of inputs
`and intermediate rows in the memory (128 bits per column).
`However, we have to produce one complete output row
`before we can accept a new row input from the image
`sensor.
`
`Median, 3 x 1 horizontal neighborhood. The result of the
`previous operation was loaded into the shift registers as well
`as to the memory. The computation continues as for the
`vertical neighborhood except for 1 extra shift cycle to bring
`in the left neighbor and, later on, 2 extra cycles to bring
`in the right neighbor.
`Total: 131 + 3 = 134 cycles.
`8bit data input
`
`Sensor array
`
`3x1 median
`
`I I
`
`Processor army
`
`I
`
`I
`
`
`
`thinning 0
`
`+
`output
`
`Figure 6 The edge detection algorithm
`
`Figure 5 Row parallel pipe-lined processing
`
`5. Application example: Edge detection
`
`A rather traditional edge detection procedure is described
`by Figure 6. A cycle count for the various stages of this
`algorithm implemented in PASIC for a 256 x 256 8-bit
`image reveals the following. (See also Figure 4).
`Input of one new row to memory takes 8 cycles.
`
`Median, 1 x 3 "vertical" neighborhood: Three 8 bit
`comparisons are necessary. Each comparison consists of
`one subtraction, 16 cycles, followed by one multiplexing
`operation using the AC + BC logic of the ALU to save the
`larger (and/or the smaller) of two numbers. The first
`multiplexing is a two-way one and takes 33 cycles, the
`following ones are one-way and takes 25 cycles each. The
`whole operation totals 16 + 33 + 2(16 + 25) = 131 cycles.
`
`Smooth is performed by adding two horizontal neighbors
`followed by adding two vertical results. Input data for the
`smoothing stage is already in shift registers. Total execution
`time is 25 cycles for the first 8 bit add operation, 28 cycles
`for the following one which totals 53 cycles.
`
`Derivative magnitude in the x-direction begins with
`subtraction on 10 bit data that takes 31 cycles + 3 extra
`cycles to bring in the two most significant bits, which do
`not fit into the shift register. Then follows an addition with
`vertical neighbor to yield a 12 bit Sobel result f, which takes
`34 cycles. Finally, the magnitude is obtained by inverting
`the 11 LSB bits if the sign is negative. This can be
`implemented in 24 cycles by loading the sign bit in the
`A-register, a zero in the B-register and then streaming in
`the 11 LSB-bits while saving the SUM output from the
`ALU.
`
`To compute the magnitude of the Sobel response in the
`y-direction
`is identical in complexity. The total Sobel
`
`289
`
`I
`
`Authorized licensed use limited to: Lester Kozlowski. Downloaded on June 29,2010 at 17:33:31 UTC from IEEE Xplore. Restrictions apply.
`
`1041-004
`
`
`
`6. Conclusions
`
`to prove (or
`The main objective of the PASIC design is
`smart image
`disprove) the viability of a general purpose
`to put
`the
`sensor. One obvious integration step
`is
`given this, it
`AID-converter on the sensor chip. However,
`is quite natural to also include ALU and memory. To avoid
`high speed transfer of analog data we have employed an
`AID-converter in the form of a linear array of comparators.
`The architecture of the processing part conforms with the
`row parallel output from the AD-converters. We have
`designed an extremely simple but rather efficient processor
`which is excellently suited to the special VLSI constraints
`of the sensor. The pitch in the present version of PASIC
`is 40 pm and it has been possible to fit the AD-converter
`circuitry, the shift register, the processor and the memory
`into this narrow slot. One key factor is the unified structure
`achieved by extending the memory data bus to all the other
`units within the same column.
`
`The actual circuit design has hardly been touched upon in
`this article. However, as is evident by other reports [lo],
`PASIC employs 1.6 pm double metal CMOS technology for
`sensors as well as processors and memories. Furthermore,
`all the four units, sensor, AD-converter, shift register and
`memory, can utilize the same basic cell design shown by
`Figure 3. At the time of this writing most of these parts have
`been tested.
`
`PASIC should be considered as a prototype for a more
`full-fledged
`smart image sensor with higher resolution,
`more memory and control circuitry on-chip. We believe that
`it should be possible to extend the sensor area to 512 x 512
`and the number of processors to 512 by employing
`sub-micron technology in the next version of PASIC. The
`memory size is critical for the scope of the architecture and
`the algorithms. In the present version PASIC is not utilizing
`the highly specialized layout and techniques used in high
`density DRAM’S. Full exploitation of such designs should
`make it possible to store several gray level images on-chip.
`
`The pixel processing rate grows linearly with the size of the
`array. However, since the sensor area grows quadratically
`the frame rate is subject to a linear decrease for upscaled
`versions of PASIC. This problem indicates that future large
`scale smart sensor systems must be programmable at the
`sensor level. The sensor should be able to deliver an image
`of the scene at a variable resolution and/or deliver just a
`smaller window of the full scene (foveation).
`
`the
`clarified by
`further
`is
`computation
`response
`decomposition scheme of Figure 7 (from [ 6 ] ) . Total
`execution time:
`2[34+34+24] = 184 cycles.
`
`pJ
`
`m
`
`- 1 -2 -1
`
`-1 -2 -1
`
`Figure 7 Decomposition of the Sobel operator pair
`
`The gradient magnitude is approximated as the sum lf,l
`+ lfyl. Yielding a 12 bit result from 11 bit input data takes
`37 cycles.
`
`Thresholding with a fix
`to
`is equivalent
`threshold
`subtraction while saving the sign bit of the result. Execution
`time: 25 cycles.
`
`Thinning is applied on the thresholded image. Ideally we
`would like this execution to be data dependent and continue
`iteratively until no further changes. Here we assume that
`all edges after thresholding have a max width of 5 pixels
`and hence two thinning steps are sufficient. Various
`thinning algorithms exist in the literature [SI, [9]. We have
`adopted a four-phase version which executes in 22 cycles
`per phase. Two thinning steps require 2 . 4 . 22 = 176 cycles.
`
`Output of the thinned binary image may take place every
`8th line. Execution time per row is then 118 (8 t 256) = 33
`cycles.
`
`Summary of execution times per line:
`Input from sensor
`Median 1 x 3
`Median 3 x 1
`Smooth
`Derivatives Ifx],
`lfxl t I f Y l
`Thresholding
`Thinning
`output
`
`IfY[
`
`8
`131
`134
`53
`184
`37
`25
`176
`33
`781 cycles
`
`Since there are 256 rows, 50 ns cycle time implies a
`maximum frame rate of
`
`20 . lo6 / 765 . 256 = 102 frameslsecond
`which is equivalent to a processing rate of
`
`20 . l o 6 . 256 / 765 = 3.7 M output pixel/second
`
`290
`
`Authorized licensed use limited to: Lester Kozlowski. Downloaded on June 29,2010 at 17:33:31 UTC from IEEE Xplore. Restrictions apply.
`
`1041-005
`
`
`
`Acknowledgement
`
`This project has been supported by the grant 712-87-02841
`from Swedish Board of Technical Development and by
`CENIIT (CENter for Industrial Information Technology),
`Linkoping University.
`
`References
`[l] R. Forchheimer, A. Odmark, Single chip linear
`array processor, in Applications of digital image
`processing, SPIE, Vol. 397, (1983).
`
`[2] M.J.B. Duff, Architectures of SIMD cellular logic
`image processing arrays in Nato AS1 series, Vol.
`for
`F18, Computer
`architectures
`spatially
`distributed data, ed. by H. Freeman and G.G.
`Picroni, Springer (1985)
`[3] S.F. Reddaway, DAP - a Distributed Processor
`Array, First annual symposium on computer
`architecture, Florida, pp 61-65 (1973).
`[4] K.E. Batcher, Design of a Massively Parallel
`Processor, IEEE Transactions on Comp., Vol.
`C29, pp 836-840 (1980).
`[5] Geometric Arithmetic Parallel Processor, Data sheet
`NCR45CG72, NCR corporation, (1984).
`
`[6] L.W. Tucker and G.G. Robertson, Architecture
`and applications of the Connection Machine, IEEE
`computer, pp 26-38, August 1988.
`
`[7] P-E Danielsson, 0. Seger, Generalized and
`separable Sobel operators, in Machine vision -
`aquiring and interpreting the 3 0 scene, ed. by
`H. Freeman, Academic press (1989).
`
`[8] E.S. Deutsch, Thinning algorithms on rectangular,
`hexagonal and triangular arrays, Com. ACM, Vol.
`15, pp 827-837 (1972).
`
`[9] Q-Z Ye, P-E Danielsson, Inspection of printed
`circuit boards by connectivity preserving shrinking,
`IEEE Trans. PAMI, Vol. 10, pp 737-742 (1988).
`
`[lo] K. Chen, M. Afghani, P-E Danielsson, C.
`Svensson, PASIC: A processor AID-converter-
`sensor integrated circuit, to appear in proceedings
`of ISCAS90.
`
`Authorized licensed use limited to: Lester Kozlowski. Downloaded on June 29,2010 at 17:33:31 UTC from IEEE Xplore. Restrictions apply.
`
`1041-006