`
`Reference 45
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2157, p. 1
`
`
`
`Convolution on Splash 2
`
Nalini K. Ratha, Anil K. Jain & Diane T. Rover*
`Department of Computer Science
`* Department of Electrical Engineering
`Michigan State University
`East Lansing, MI 48824
ratha@cps.msu.edu, jain@cps.msu.edu, rover@ee.msu.edu
`
`Abstract
`
`
Convolution is a fundamental operation in many signal and image processing applications. Since the computation and communication pattern in a convolution operation is regular, a number of special architectures have been designed and implemented for this operator. Von Neumann architectures cannot meet the real-time requirements of applications that use convolution as an intermediate step. We combine the advantages of systolic algorithms with the low cost of developing application-specific designs using field programmable gate arrays (FPGAs) to build a scalable convolver for use in computer vision systems. The performance of the systolic algorithm of Kung et al. [1] is compared theoretically and experimentally with many other convolution algorithms reported in the literature. The implementation of a convolution operation on Splash 2, an attached processor based on Xilinx 4010 FPGAs, is reported with impressive performance gains.
`
`1
`
`Introduction
`
Convolution is an important operator in digital signal and image processing. Many machine vision systems use 2-dimensional convolution for image filtering, edge detection, and template matching. Convolution of two signals f and g is often expressed as h = f * g, where h is the result of applying convolution mask g on input signal f. For 2-dimensional discrete images, convolution can be defined as follows:
h(x, y) = \sum_{i=-\infty}^{\infty} \sum_{j=-\infty}^{\infty} f(x - i, y - j) g(i, j).   (1)
`
For images of finite size (say N x N) and masks of finite size (say k x k), the 2-dimensional convolution of f (image) and g (mask) can be expressed as
`
`0-8186-7086-X/95 $04.00 © 1995 IEEE
`
`204
`
h(x, y) = \sum_{i=0}^{k-1} \sum_{j=0}^{k-1} f(x + i, y + j) g(i, j).   (2)
`
In the second form, it is also known as template matching, or correlation. In general, the image and mask need not contain a square number of elements. Typical uses of convolution in image processing are edge detection using, for example, Sobel and Prewitt masks; computing texture orientation; object location using template matching; and image smoothing using Gaussian masks. The convolution operation described in Eq. (2) can involve integer-valued and real-valued input images and masks. Examples of real-valued masks are Gaussian masks for image smoothing and Gabor filters for computing texture features. Most commonly used masks, such as Laplacian, Sobel, and Prewitt, not only have integer mask values, but mask values that can be expressed as powers of two. In a typical application involving many image processing stages, each stage should be capable of processing real-valued images. Based on this observation, we can form a taxonomy of the convolution operation as shown in Figure 1.

Many approaches to performing convolution using special architectures have been reported in the literature [1, 2, 3]. A very simple distributed approach to convolution is to split the input image into a set of smaller, possibly overlapping, sub-images; the number of sub-images is the same as the number of processing elements (PEs). Each PE produces the result for the sub-image it receives. The spatial support for the convolution mask is provided through overlap at the boundaries or data replication. However, this simplistic approach may not be suitable for all target architectures. For example, if the image is being acquired line by line, this approach may not be efficient, as we have to wait until the whole image is acquired.
In order to understand the advantages and disadvantages of various algorithms reported in the literature, we need to look at the communication and computation pattern in 2-dimensional convolution. The basic convolution operation is shown schematically in Figure 2. Suppose the value of the convolution is desired at the point (x, y). The center of the mask is placed at (x, y). A point-wise inner product of the image pixels and mask values is computed, followed by a reduction sum operation. This computes the output value at (x, y). The reduction operation can also be a prefix sum, although the intermediate results are not directly useful. This basic set of operations is repeated at all possible (x, y) locations.
The sequential version of the convolution algorithm is very simple and is shown in Figure 3. There are four loops in the algorithm, and its overall complexity is O(N^2 k^2). The simple data partitioning approach described above can reduce the total time by a factor equal to the number of available processors, ignoring the extra computations needed in the overlapping areas by each PE.
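The four-loop structure just described can be sketched directly in software. The following Python function is our own illustration (not from the paper); it implements the corner-anchored indexing of Eq. (2) over the "valid" region of the image:

```python
def convolve2d(image, mask):
    """Direct 2-D convolution following Eq. (2), 'valid' region only.

    image: N x N list of lists; mask: k x k list of lists.
    Output is (N-k+1) x (N-k+1); total work is O(N^2 * k^2),
    matching the four nested loops of the sequential algorithm.
    """
    N, k = len(image), len(mask)
    out = []
    for x in range(N - k + 1):
        row = []
        for y in range(N - k + 1):
            s = 0
            for i in range(k):          # point-wise inner product ...
                for j in range(k):
                    s += image[x + i][y + j] * mask[i][j]
            row.append(s)               # ... followed by the reduction sum
        out.append(row)
    return out
```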
The previous work described in the literature can be summarized based on the target architecture on which the convolution operation is implemented.
`
• Systolic: One of the most widely used convolution algorithms is the systolic algorithm by Kung et al. [1]. The algorithm is fairly straightforward and also scalable to higher dimensions using the 1-dimensional convolution algorithm as the building block. In his landmark paper "Why systolic architectures?" [4], Kung described many convolution algorithms on systolic structures. Based on a general inner product computation, Kulkarni and Yen [5] proposed a systolic algorithm for 1-dimensional and 2-dimensional convolution.
`
• Hypercube: Fang et al. [2] have described an O(k^2/p^2 + k log(N/p) + log N log p) algorithm, where 1 <= p <= k, using N^2 k^2 PEs, and an O(N^2 M^2 / L^2) algorithm using L^2 PEs. Using N^2 PEs, Prasanna Kumar and Krishnan [6] proposed an algorithm with the best time complexity of O(N^2/k^2 + log N). With a fixed number of PEs, their time complexity changes to O(k^2 log k + log N).
`
• Mesh: Many researchers [7, 3] have proposed schemes for convolution on mesh-connected architectures. Lee et al. [7] use computation along a Hamiltonian path ending at the center of the convolution mask, called the convolution path. Ranka and Sahni [3] do not broadcast the data values, thereby improving the performance of their algorithm by an order of magnitude.
`
• Pyramid: Pyramid architectures are useful in dealing with multi-resolution images. An O(k^2 + log N) time complexity algorithm is described by Chang et al. [8].
`
• VLSI/ASIC: Most of the approaches recommend the suitability of their algorithm for a VLSI implementation. Chakrabarti and JaJa use a linear array of processors in their algorithm [9]. In [5], convolution is viewed as a generalized inner product and a VLSI implementation for 2-dimensional convolution is described. Ranganathan and Venugopal [10] have described a VLSI architecture for template matching using k^2 PEs, achieving a time complexity of O(N^2/2 + k^2).
In this paper, we describe a convolution algorithm suitable for implementation on field programmable gate arrays (FPGAs). FPGAs have been used in designing many custom computing structures. Recent advances in hardware technology, enabling more logic blocks in FPGAs, make them more suitable for many complex applications needing more logic. Recently, Barros and Akil [11] have used FPGAs for low-level image processing. Athanas et al. [12] describe many image processing functions implemented on Splash 2. We use Splash 2, an attached processor for Sun SPARCstations based on Xilinx 4010 FPGA PEs. Our 2-dimensional convolution algorithm implementation has been designed and synthesized for the Splash 2 architecture.
The rest of the paper is organized as follows. The Splash 2 architecture and software environment are described in Section 2. The convolution algorithm is described in Section 3. In Section 4, the performance of the proposed algorithm is evaluated by comparing its execution time and its complexity against some other algorithms reported in the literature; the performance estimate provided by the computer-aided design (CAD) tools is also compared with the real speed achieved. In Section 5, conclusions and plans for future work are given.
`
`2 Splash 2 Architecture
`
The Splash 2 system consists of an array of Xilinx 4010 FPGAs, improving on the design of Splash 1, which was based on Xilinx 3090s [13]. The Splash 2 system is connected to the Sun host through an interface board that extends the address and data buses. The host can read/write the memories and memory-mapped control registers of Splash 2 via these buses. A detailed description of the system is given in [14]. Each Splash 2 processing board has 16 Xilinx 4010s as PEs (X1 - X16) in addition to a seventeenth Xilinx 4010 (X0) which controls the data flow into the processor board. Each PE has 512 KB of memory. The Sun host can read/write this memory. The PEs are connected through a crossbar that is programmed by X0. There is a 36-bit linear data path (SIMD Bus) running through all the PEs. The PEs can read data either from their respective memory or from any other PE. A broadcast path also exists by suitably programming X0.
The Splash 2 system supports several models of computation, including PEs executing the same instruction on multiple data (SIMD mode) and PEs executing multiple instructions on multiple data (MIMD mode). It can also execute the same or different instructions on single data by receiving data through the global broadcast bus. The most common mode of operation is systolic, in which the SIMD Bus is used for data transfer. Also, the individual memory available with each PE makes it convenient to store temporary results and tables.

To program Splash 2, we need to program each of the PEs (X1 - X16), the crossbar, and the host interface. The crossbar sets the communication paths between PEs. If the crossbar is used, X0 needs to be programmed. The host interface takes care of data transfers in and out of the Splash 2 board. A special library is available for these facilities for VHDL programming, as described in [14].
`
`3 Convolution Algorithm on Splash 2
`
`in general, convolu(cid:173)
`Image processing algorithms,
`tion, in particular, demand high I/O bandwidth. Most
`of the algorithms proposed on special architectures as(cid:173)
`sume that data are already available on the PEs. This,
`in a way, avoids the I/O bandwidth problem of the
`convolution operation. We do not make this assump(cid:173)
`tion. Jonker [15] argues that linear arrays are better
`for image processing algorithms. A linear array of PEs
`operating in a systolic mode offers two advantages: (i)
`systolic arrays can balance I/O with computations,
`(ii)
`the nearest neighbor communication can eliminate the
`need for a global communication facility for some class
`of algorithm. One of the preferred modes of computa(cid:173)
`tions on Splash 2 is the systolic mode. In this mode,
`we assume nothing about availability of data in the
`individual PEs. The computations needed in a PE
`
`are also fairly simple. This helps us in balancing the
`I/O bandwidth requirement and computation require(cid:173)
`ment at a PE. Hence, we prefer a systolic algorithm
`for convolution on Splash 2.
First, we describe the 1-dimensional convolution algorithm. A systolic 1-dimensional convolution is simple. Let us assume that we have k PEs, where k is the size of the mask. Each PE receives from its left neighbor the pixel value and the partial result available so far. The PE multiplies the pixel value with the mask value it handles and adds the partial sum to it. This result and the pixel value are passed to its right neighbor. At the end of the systolic path, we get the convolution result after taking into account the initial latency. A schematic diagram of 1-dimensional convolution on a set of linearly connected PEs is shown in Figure 4(a).
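A cycle-by-cycle software simulation can make this data flow concrete. The sketch below is our own illustration, not the paper's design; in particular, the two-cycle pixel delay between neighboring PEs (so that successive coefficients meet successive pixels) is one common systolic realization and an assumption about Figure 4(a):

```python
def systolic_conv1d(pixels, mask):
    """Simulate a k-PE systolic 1-D convolver.

    PE p holds coefficient mask[p]. Each cycle, PE p adds
    (incoming pixel * mask[p]) to the partial sum arriving from PE p-1;
    the pixel path is delayed two cycles per stage (assumed), the
    partial-sum path one cycle. Output h[n] = sum_p mask[p]*pixels[n-p].
    """
    k = len(mask)

    def x(t):  # pixel entering the array at cycle t (0 outside stream)
        return pixels[t] if 0 <= t < len(pixels) else 0

    psum = [0] * k            # partial-sum register after each PE
    outputs = []
    for t in range(len(pixels) + 2 * k):
        new = [0] * k
        for p in range(k):
            psum_in = psum[p - 1] if p > 0 else 0
            new[p] = psum_in + x(t - 2 * p) * mask[p]
        psum = new
        outputs.append(psum[k - 1])
    # drop the initial latency of k-1 cycles
    return outputs[k - 1 : k - 1 + len(pixels)]
```

With mask `[1, 10]`, the stream `[1, 2, 3]` yields `[1, 12, 23]`, i.e. each output combines the current pixel with ten times the previous one, as the 1-D form of Eq. (1) prescribes.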
The above algorithm assumes that the PEs can do multiplication. In an FPGA-based PE, this is not true: a double precision floating point multiplier needs more logic than is available in a PE. We can instead use the local PE memory to store a multiplication table indexed by the pixel value. This results in an additional delay of one cycle, needed to reference the multiplication result from the memory. This scheme is shown in Figure 4(b).
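The table-lookup idea can be sketched as follows (a hypothetical helper of ours, not the paper's code); an 8-bit pixel needs only 256 entries, comfortably within a PE's 512 KB local memory:

```python
def build_mult_table(coeff, pixel_bits=8):
    """Precompute coeff * v for every possible pixel value v.

    Mirrors storing a per-PE multiplication table in local memory,
    indexed by the pixel value, so a 'multiplication' becomes a
    single memory reference (one extra cycle of latency).
    """
    return [coeff * v for v in range(1 << pixel_bits)]

table = build_mult_table(7)
assert table[13] == 7 * 13   # lookup replaces the multiply
```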
Our 2-dimensional convolution is an extension of the 1-dimensional convolution described above. The basic idea is based on the algorithm proposed by Kung et al. [1]. The k x k mask is extended to a k x N mask with 0's placed at locations where no entry was present. These kN entries are serialized to get a single 1-dimensional mask of kN entries. Now, we can apply the 1-dimensional convolution algorithms outlined above. Note that there are (N - k) locations per row with 0's as their mask value; hence, we can simply have (N - k) stages of shift registers. Secondly, for improper positions of this new kN-element mask, we get values which are not really part of the output; these need to be ignored. Finally, the assumption on the input is that the pixels are being communicated in raster-scan order. Note that we can use either of the 1-dimensional convolution algorithms described above, depending on the values of the convolution mask coefficients. The scheme is shown in Figure 4(c).
Implementation Issues
The general convolution algorithm needs to be tuned for the special hardware being used. For example, we have only 16 PEs, so we need to map virtual PEs to physical PEs. The second issue is the number of shift registers, which depends on the row size of the image. The third issue is the implementation of the multipliers needed by the PEs. The following is a summary of our solutions to these issues.
`
• Large number of mask entries: If the mask size (k) is greater than the number of available PEs, the virtual PEs are mapped to the available PEs. In carrying out the mapping, the timing model between PEs must be satisfied.
`
• Larger image width: The image has to be split into smaller sub-images of some predefined size. To handle this, we develop the 2-dimensional convolution with a fixed width (32 in our case). Larger images are split into sub-images of 32 x 32 pixels, and depending on the mask size, the required rows and columns at the border are copied.
`
• Multiplier implementation:

  - Masks with integer values that are powers of 2: If mask values are of the type 2^p, then the multiplier in each virtual PE can be replaced with simple bit shifters.

  - Integer masks but not powers of 2: An efficient scheme for this is being developed.

  - Real-valued masks: If mask values are normalized (range between -1 and 1), then a suitable scheme for implementation on FPGAs is needed.
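For the power-of-two case in the first sub-item, the shift-based multiplication can be sketched as follows (a hypothetical helper of ours, handling the sign separately from the shift):

```python
def shift_multiply(pixel, coeff):
    """Multiply by a coefficient of the form +/-2^p via shift and negate.

    Sobel/Prewitt-style coefficients such as 1, 2, 4 (and their
    negatives) reduce each virtual PE's multiplier to a bit shifter.
    """
    p = abs(coeff).bit_length() - 1
    assert abs(coeff) == 1 << p, "coefficient must be a power of two"
    prod = pixel << p                  # the shift replaces the multiply
    return -prod if coeff < 0 else prod
```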
`
The Splash 2 PEs carry out two different activities, namely, (i) additions and multiplications, and (ii) shift operations. We have built a different element for each of these activities, namely, (i) a compute element, and (ii) a shift element. The number of compute elements and shift registers is determined by the mask size and the image width. The schematics of both these elements are shown in Figure 5.
`
4 Results

A brief summary of the analysis of several convolution algorithms is given in Table 1, based on the communication facility available on the respective architectures. In a systolic algorithm, the communication overhead is balanced by the computation phase. Moreover, no complex communication facility is needed. In this sense, the systolic algorithm is better in terms of total work done and communication simplicity.

We compare our implementation on Splash 2 with implementations on different platforms in terms of total execution time. The timings for the 3 x 3 convolution-based Sobel edge detector on a 512 x 512 image are shown in Table 2. The basic sequential convolution algorithm (Figure 3) running on different Sun host machines has been timed. In addition, timing on a recently developed i860-based system from Alacron is reported. The timing results on the CM-5 are taken from [16] and are for edge detection using a set of six 5 x 5 convolution masks.

For implementing the convolution algorithm on Splash 2, we have chosen three standard edge detectors used in low-level computer vision algorithms. The filters and their mappings on Splash 2 PEs are shown in Figure 6. These operators have two convolution stages and a final stage to compute the edge magnitude at each pixel by computing the absolute sum of the two convolution output images. Each PE accommodates the required shift registers for that stage. The outputs for the house image using the 3 x 3 Sobel edge detector, 5 x 5 Prewitt edge detector, and 7 x 7 Prewitt edge detector are shown in Figure 7. The timings for these edge detectors on Splash 2 are shown in Table 3.

Our approach for implementing the convolution operation on Splash 2 differs in many ways from the approach taken by Peterson et al. [17]. The main differences are: (i) we are not limited to a fixed mask size of 8 x 8 as in [17]; for smaller masks, Peterson et al. have used the same 8 x 8 masks filled with zeros, which may be due to a hard-coded model of computing on Splash 2; (ii) Peterson et al. use all 16 PEs for implementing their algorithm, whereas we use fewer PEs for smaller mask sizes (k < 8), so in our approach more PEs remain available for implementing other stages of a vision system; and (iii) our mapping on Splash 2 results in a higher performance of nearly 100 frames per second, compared to 15 frames per second reported by Peterson et al. [17].
`5 Conclusions
`
Convolution with integer mask values that are powers of two has been efficiently implemented on Splash 2. The algorithm scales well with mask size. The most general case of real-valued masks could be handled using look-up tables.

We are currently designing a fixed-point multiplier on the chip to achieve an efficient implementation with real-valued masks. Because of the limit on the logic that can be accommodated in a Splash 2 PE, the mapping of virtual PEs to physical PEs will become difficult after a certain point. Moreover, if we have a large number of lookups, then the performance degrades linearly. Manseur [18] has proposed a scheme to decompose large convolution masks into a set of 3 x 3 masks. We will explore this technique further to provide an efficient solution for large mask sizes.
`
Acknowledgments  We would like to thank Duncan Buell, Jeff Arnold, and Brian Schott of the Supercomputing Research Center, Bowie, Maryland, for their help and suggestions. We are thankful to Sea Choi for his help in debugging the VHDL code.
`
`References
`
[1] H. T. Kung, L. M. Ruane, and D. W. L. Yen, "A two-level pipelined systolic array for convolutions," in VLSI Systems and Computations (H. T. Kung, B. Sproull, and G. Steele, Eds.), pp. 255-264, Maryland: Computer Science Press, 1981.

[2] Z. Fang, X. Li, and L. M. Ni, "Parallel algorithms for image template matching on hypercube SIMD computers," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 9, pp. 835-841, November 1987.

[3] S. Ranka and S. Sahni, "Convolution on mesh connected multicomputers," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 12, pp. 315-318, March 1990.

[4] H. T. Kung, "Why systolic architectures?," IEEE Computer, vol. 15, pp. 37-46, January 1982.

[5] A. V. Kulkarni and D. W. L. Yen, "Systolic processing and implementation for signal and image processing," IEEE Trans. on Computers, vol. 31, pp. 1000-1009, October 1982.

[6] V. K. P. Kumar and V. Krishnan, "Efficient parallel algorithm for image template matching on hypercube SIMD machines," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 11, pp. 665-669, June 1989.

[7] S. Y. Lee and J. K. Aggarwal, "Parallel 2-D convolution on a mesh connected array processor," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 9, pp. 590-594, July 1987.

[8] J. H. Chang, O. H. Ibarra, T. Pong, and S. M. Sohn, "Two-dimensional convolution on a pyramid computer," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 10, pp. 590-593, July 1988.

[9] C. Chakrabarti and J. JaJa, "VLSI architectures for template matching and block matching," in Parallel Architectures and Algorithms for Image Understanding (V. K. P. Kumar, Ed.), pp. 3-27, San Diego: Academic Press, 1991.

[10] N. Ranganathan and S. Venugopal, "An efficient VLSI architecture for template matching based on moment preserving pattern matching," in Proc. of 12th Intnl. Association of Pattern Recognition, Jerusalem, pp. 388-390, 1994.

[11] M. A. D. Barros and M. Akil, "Low level image processing operators on FPGA: Implementation examples and performance evaluation," in Proc. of 12th Intnl. Association of Pattern Recognition, Jerusalem, pp. 262-278, 1994.

[12] P. M. Athanas and A. L. Abbott, "Real-time image processing on a custom-computing platform," IEEE Computer, vol. 28, pp. 16-24, February 1995.

[13] J. M. Arnold, D. A. Buell, and E. G. Davis, "Splash 2," in Proceedings 4th Annual ACM Symposium on Parallel Algorithms and Architectures, pp. 316-322, 1992.

[14] D. A. Buell, J. M. Arnold, and W. J. Kleinfelder, Eds., Splash 2: FPGAs for Custom Computing Machines. Los Alamitos: IEEE Computer Society Press, 1995.

[15] P. P. Jonker, "Why linear arrays are better image processors," in Proc. of 12th Intnl. Association of Pattern Recognition, Jerusalem, pp. 334-338, 1994.

[16] V. K. Prasanna, C.-L. Wang, and A. A. Khokhar, "Low level vision processing on connection machine CM-5," in Proc. of Computer Architecture for Machine Perception, Florida, pp. 117-126, 1993.

[17] J. B. Peterson and P. M. Athanas, "Addressing the computational needs of high-speed image processing with a custom computing machine," Journal of VLSI Signal Processing. Under review.

[18] Z. Z. Manseur and D. C. Wilson, "Decomposition methods for convolution operators," CVGIP: Graphical Models and Image Processing, vol. 53, pp. 428-434, September 1991.

[19] SharpImage Software, New York, The HIPS Image Processing Software, 1993.
`
`208
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2157, p. 6
`
`
`
Figure 1: A taxonomy for convolution. [Diagram: the input image is convolved with either integer-valued masks (values that are powers of 2, or general integer values) or real-valued masks.]
`
[Figure 2 diagram: the mask M, with entries M(-k/2,-k/2) through M(k/2,k/2), centered on the image at (x, y) and covering pixels I(x-k/2, y-k/2) through I(x+k/2, y+k/2).]

Figure 2: The convolution operator. I(i,j) is the input image value at pixel (i,j) and M(u,v) is the mask value at (u,v).
`
for i = 1 to N do
    for j = 1 to N do
        sum = 0
        for u = -k/2 to k/2 do
            for v = -k/2 to k/2 do
                sum = sum + I[i+u][j+v] x M[u][v]
        O[i][j] = sum
    end
end

Figure 3: Sequential algorithm for 2-dimensional convolution. I(i,j) and O(i,j) are the input image value and output value, respectively, at pixel (i,j), and M(u,v) is the mask value at (u,v).
`
`209
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2157, p. 7
`
`
`
Figure 4: Systolic schemes for convolution on Splash 2. (a) 1-dimensional convolution; (b) 1-dimensional convolution using memory lookup; (c) 2-dimensional convolution with shift registers and memory lookup.
`
`
Figure 5: Different functional units. (a) Compute elements; (b) Shift registers.
`
Convolution algorithm            Computational Complexity           No. of PEs   Architecture
Ranka & Sahni [3]                O(k^2 x (# of bits in a pixel))    N^2          Mesh
Prasanna Kumar & Krishnan [6]    O(log k)                           N^2 k^2      Hypercube
Chang et al. [8]                 O(k^2 + log N)                     N^2          Pyramid
Kung [1] algorithm               O(N^2)                             k^2          Systolic
Ranganathan et al. [10]          O(N^2 + k^2/2)                     k^2          VLSI

Table 1: Comparative analysis of convolution algorithms.
`
Machine                     Architecture     Time                                             Remarks
SPARCstation 20 Model 40    Von Neumann      3.8 s (elapsed), 1.5 s (user), 0.2 s (system)    Using HIPS [19] function
SPARCstation 10 Model 30    Von Neumann      3.9 s (elapsed), 2.7 s (user), 0.3 s (system)    Using HIPS [19] function
SPARCstation 2              Von Neumann      5.6 s (elapsed), 4.0 s (user), 0.5 s (system)    Using HIPS [19] function
i-860                       Pipelined        65 ms                                            Timings as reported by vendor
Splash 2                    FPGA based       51.9 ms                                          Mask values being powers of 2 (32 x 32 base size)
Splash 2                    FPGA based       13.89 ms                                         Table lookup for multiplication (32 x 32 base size)
CM-5                        MIMD (512 PEs)   40 ms                                            Result reported in [16]*
CM-5                        MIMD (32 PEs)    605.75 ms                                        Result reported in [16]*

Table 2: Timings for 3 x 3 Sobel edge detection for a 512 x 512 image on different platforms. * Results are for an edge detector based on six 5 x 5 convolution masks.

                          Image Size
Mask Size    128 x 128    256 x 256    512 x 512
3 x 3        51.6         206.6        813.3
5 x 5        54.9         216.6        864.9
7 x 7        56.6         228.3        914.9

Table 3: Run times (in milliseconds) of edge detection on Splash 2 for various image sizes and convolution mask sizes.
`
`
Figure 6: Implementation of three edge detection filters on Splash 2. (a) 3 x 3 Sobel filters; (b) 5 x 5 Prewitt filters; (c) 7 x 7 Prewitt filters. X1 - X15 denote the PEs in Splash 2.
`
`212
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2157, p. 10
`
`
`
Figure 7: Results of 2-dimensional convolution. (a) Input image; (b) 3 x 3 Sobel filter output; (c) 5 x 5 Prewitt filter output; (d) 7 x 7 Prewitt filter output.
`
`213
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2157, p. 11
`
`