throbber
Homayoun
`
`Reference 45
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2157, p. 1
`
`

`

`Convolution on Splash 2
`
`N alini K. Ratha, Anil K. Jain & Diane T. Rover*
`Department of Computer Science
`* Department of Electrical Engineering
`Michigan State University
`East Lansing, MI 48824
`ratha@cps.msu.edu,
`jain@cps.msu.edu, rover@ee.msu.edu
`
`Abstract
`
`(image) and g (mask) can be expressed as
`
`Convolution is a fundamental operation in many
`signal and image processing applications. Since the
`computation and communication pattern in a convolu(cid:173)
`tion operation is regular, a number of special architec(cid:173)
`tures have been designed and implemented for this op(cid:173)
`erator. The Von Neumann architectures cannot meet
`the real-time requirements of applications that use con(cid:173)
`volution as an intermediate step. We combine the
`advantages of systolic algorithms with the low cost of
`developing application specific designs using field pro(cid:173)
`grammable gate arrays (FPGAs)
`to build a scalable
`convolver for use in computer vision systems. The
`performance of the systolic algorithm of Kung et al.
`{1] is compared theoretically and experimentally with
`many other convolution algorithms reported in the lit(cid:173)
`erature. The implementation of a convolution opera(cid:173)
`tion on Splash 2, an attached processor based on Xilinx
`4010 FPGAs, is reported with impressive performance
`gains.
`
`1
`
`Introduction
`
`Convolution is an important operator in digital sig(cid:173)
`nal and image processing. Many machine vision sys(cid:173)
`tems use 2-dimensional convolution for image filtering,
`edge detection, and template matching. Convolution
`of two signals f and g is often expressed as h = f * g,
`where h is the result of applying convolution mask g
`on input signal f. For 2-dimensional discrete images,
`convolution can be defined as follows:
`
`h(x,y) = L L f(x-i,y-j)g(i,j).
`
`00
`
`00
`
`(1)
`
`i=-ooj=-oo
`
`For images of finite size (say N x N), and masks of
`finite size (say k x k), 2-dimensional convolution off
`
`0-8186-7086-X/95 $04.00 © 1995 IEEE
`
`204
`
`k-lk-1
`
`h(x, y) =LL f(x + i, y + j)g(i, j).
`
`. (2)
`
`i=O j=O
`
`In the second form, it is also known as template match(cid:173)
`ing, or correlation. The image size and mask size need
`not contain square number of elements in general.
`Typical use of convolution in image processing is for
`edge detection using, for example, Sobel, and Prewitt
`masks; computing texture orientation; object location
`using template matching; and image smoothing using
`Gaussian masks. Convolution operation described in
`Eq. (2) can include integer-valued and real-valued in(cid:173)
`put images and masks. Examples ofreal-valued masks
`are image smoothing using Gaussian masks, and com(cid:173)
`puting texture features using Gabor filters. Most com(cid:173)
`monly used masks such as Laplacian, Sobel, and Pre(cid:173)
`witt have not only integer mask values, but the mask
`values can be expressed as powers of two. In a typical
`application
`involving many image processing stages,
`each stage should be capable of processing real-valued
`images. Based on this observation, we can have a tax(cid:173)
`onomy of convolution operation as shown in Figure 1.
`Many approaches to perform convolution using spe(cid:173)
`cial architectures have been reported
`in the litera(cid:173)
`ture [1, 2, 3]. A very simple distributed approach for
`convolution is to split the input image into a set of
`smaller, possibly overlapping, sub-images;
`the num(cid:173)
`ber of subimages is the same as the number of pro(cid:173)
`cessing elements (PEs). Each PE produces the result
`for the sub-image it receives. The spatial support for
`the convolution mask is provided through the overlap
`at the boundaries or data replication. However, this
`simplistic approach may not be suitable for all target
`architectures. For example, if the image is being ac(cid:173)
`quired line by line, this approach may not be efficient
`as we have to wait till we acquire the whole image.
`In order to understand
`the advantages and disadvan-
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2157, p. 2
`
`

`

`tages of various algorithms reported in the literature,
`we need to look at the communication and computa(cid:173)
`tion pattern
`in 2-dimensional convolution. The basic
`convolution operation
`is shown schematically
`in Fig(cid:173)
`ure 2. Suppose the value of the convolution operation
`is desired at the point (x,y). The center of the mask
`is placed at (x,y). A point-wise inner product of the
`image pixels and mask values is computed, followed by
`a reduction sum operation. This computes the output
`value at (x, y). The reduction operation can also be a
`prefix sum, although the intermediate results are not
`directly useful. This basic set of operations is repeated
`at all possible (x, y) locations.
`The sequential version of convolution algorithm is
`very simple and is shown in Figure 3. There are four
`loops in the algorithm and its overall complexity is
`O(N 2k2). The simple data partitioning approach de(cid:173)
`scribed above can reduce the total time by a factor
`equal to the number of available processors, ignoring
`the extra computations needed in the overlapping ar(cid:173)
`eas by each PE.
`The previous work described in the literature can
`be summarized based on the target architecture on
`which convolution operation is implemented.
`
`• Systolic: One of the most widely used convolu(cid:173)
`tion algorithm is the systolic algorithm by Kung
`et al. [l]. The algorithm is fairly straight forward
`and also scalable to higher dimensions using the
`1-dimensional convolution algorithm as the build(cid:173)
`ing block. In his landmark paper "Why systolic
`[4], Kung described many convo(cid:173)
`architectures?"
`lution algorithms on systolic structures. Based
`on a general inner product computation, Kulka(cid:173)
`rni and Yen [5] proposed a systolic algorithm for
`1-dimensional and 2-dimensional convolution.
`
`Fang et al. [2] have described an
`• Hypercube:
`O(k 2/p 2 + klog(N/p) + logN * logp) algorithm,
`where 1 ~ p ~ k, and using N 2 k2 PEs and
`an O(N 2 M 2 / L 2
`) algorithm using L 2 PEs. Us(cid:173)
`
`ing N 2 PEs, Prasanna Kumar and Krishnan [6]
`proposed an algorithm with the best time com(cid:173)
`plexity of O(N 2 / K 2 + logN). With a fixed num(cid:173)
`ber of PEs,
`their
`time complexity changes to
`O(k 2logk + logN).
`
`• Mesh: Many researchers [7, 3] have proposed
`schemes for convolution on mesh connected ar(cid:173)
`[7] use computation
`chitectures.
`Lee et al.
`along a Hamiltonian path ending at the center
`of the convolution mask, called the convolution
`path. Ranka and Shani [3] do not broadcast
`the
`
`data values, thereby improving the performance
`of their algorithm by an order of magnitude.
`
`Pyramid architectures are useful in
`• Pyramid:
`images. An O(k 2 +
`dealing with multi-resolution
`logN) time complexity algorithm is described by
`Chang et al. [8].

`
`• VLSI/ ASIC: Most of the approaches recommend
`the suitability of their algorithm for a VLSI im(cid:173)
`plementation. Chakrabarti and Jaja use a linear
`array of processors in their algorithm [9]. In [5]
`convolution is viewed as a generalized inner prod(cid:173)
`uct and a VLSI implementation for 2-dimensional
`convolution is described. Ranganathan and Venu(cid:173)
`gopal [10] have described a VLSI architecture for
`template matching using k2 PEs and they achieve
`a time complexity of O(N 2 /2 + K 2 ).
`In this paper, we describe a convolution algorithm
`suitable for implementation on field programmable
`gate arrays (FPGAs).
`FPGAs have been used in
`designing many custom computing structures.
`Re(cid:173)
`cent advances in hardware technology, enabling more
`logic blocks in FPGAs, make them more suitable for
`many complex applications needing more logic. Re(cid:173)
`cently, Barros and Akil [11] have used FPGAs for
`low-level image processing. Athanas et al.
`[12] de(cid:173)
`scribe many image processing functions implemented
`on Splash 2. We use Splash 2, an attached processor
`on Sun SPARCstations based on Xilinx 4010 FPGA
`PEs. Our 2-dimensional convolution algorithm
`im(cid:173)
`plementation has been designed and synthesized for
`Splash 2 architecture.

`The rest of the paper is organized as follows. Splash
`2 architecture and software environment
`is described
`in Section 2. The convolution algorithm is described
`in section 3. The performance of the proposed algo(cid:173)
`rithm is evaluated by comparing
`its execution time
`and its complexity against some other algorithms re(cid:173)
`ported in the literature. The computer aided design
`(CAD) tools provide a performance estimate which is
`also compared with the real speed achieved in section
`4. In Section 5 conclusions and plans for future work
`are given.
`
`2 Splash 2 Architecture
`
`The Splash 2 system consists of an array of Xilinx
`4010 FPGAs, improving on the design of the Splash
`1 hMed on Xilinx 3090s [13]. The Splash 2 system
`is connected
`to the Sun host through an interface
`board that extends the address and data buses. The
`
`205
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2157, p. 3
`
`

`

`to memories and memory-mapped
`host can read/write
`control registers of Splash 2 via these buses. A de(cid:173)
`tailed description of the system is given in [14]. Each
`Splash 2 processing board has 16 Xilinx 4010s as PEs
`(X 1 - X 16 ) in addition to a seventeenth Xilinx 4010
`(Xo) which controls the data flow into the processor
`board. Each PE has 512 KB of memory. The Sun
`host can read/write
`this memory. The PEs. are con(cid:173)
`nected through a crossbar that is programmed by X 0 .
`There is a 36-bit linear data path (SIMD Bus) running
`through all the PEs. The PEs can read data either
`from their respective memory or from any other PE.
`A broadcast path also exists by suitably programming
`Xo.
`Splash 2 system supports several models of com(cid:173)
`putation,
`including PEs executing the same instruc(cid:173)
`tion on multiple data (SIMD mode) and PEs executing
`multiple instructions on multiple data (MIMD mode).
`It can also execute the same or different instructions
`on single data by receiving data
`through
`the global
`broadcast bus. The most common mode of operation
`is systolic in which the SIMD Bus is used for data
`transfer. Also individual memory available with each
`PE makes it convenient to store temporary results and
`tables.
`To program Splash 2, we need to program each of
`the PEs (X1- X 16 ), the crossbar, and the host inter(cid:173)
`face. The crossbar sets the communication paths be(cid:173)
`tween PEs. In case the crossbar is used, Xo needs to
`be programmed. The host interface takes care of data
`transfer's in and out of the Splash 2 board. A spe(cid:173)
`cial library is available for these facilities for VHDL
`programming as described in [14].
`
`3 Convolution Algorithm on Splash 2
`
`in general, convolu(cid:173)
`Image processing algorithms,
`tion, in particular, demand high I/O bandwidth. Most
`of the algorithms proposed on special architectures as(cid:173)
`sume that data are already available on the PEs. This,
`in a way, avoids the I/O bandwidth problem of the
`convolution operation. We do not make this assump(cid:173)
`tion. Jonker [15] argues that linear arrays are better
`for image processing algorithms. A linear array of PEs
`operating in a systolic mode offers two advantages: (i)
`systolic arrays can balance I/O with computations,
`(ii)
`the nearest neighbor communication can eliminate the
`need for a global communication facility for some class
`of algorithm. One of the preferred modes of computa(cid:173)
`tions on Splash 2 is the systolic mode. In this mode,
`we assume nothing about availability of data in the
`individual PEs. The computations needed in a PE
`
`are also fairly simple. This helps us in balancing the
`I/O bandwidth requirement and computation require(cid:173)
`ment at a PE. Hence, we prefer a systolic algorithm
`for convolution on Splash 2.
`First, we describe the I-dimensional convolution al(cid:173)
`gorithm. A systolic I-dimensional convolution is sim(cid:173)
`ple. Let us assume that we have k PEs, where k is
`the size of the mask. Each PE receives from its left
`neighbor the pixel value and the partial result avail(cid:173)
`able so far. The PE multiplies the pixel value with
`the mask value it handles and adds the partial sum
`to it. This result and the pixel value are passed to its
`right neighbor. At the end of the systolic path, we get
`the convolution result after taking into account initial
`latency. A schematic diagram of I-dimensional convo(cid:173)
`lution on a set of PEs connected linearly is shown in
`Figure 4(a).
`the PEs can
`The above algorithm assumes that
`do multiplication.
`In a FPGA-based PE, this is not
`true. A double precision floating point multiplier
`needs more logic than what is available in a PE. We
`can use the local PE memory to store the multipli(cid:173)
`cation table indexed by the pixel value. This results
`in an additional delay of one cycle necessary to refer(cid:173)
`ence the multiplication result from the memory. This
`scheme is shown in Figure 4(b).
`is an extension of
`Our 2-dimensional convolution
`the I-dimensional convolution described above. The
`basic idea is based on the algorithm proposed by Kung
`et al. [1]. The k x k mask is extended
`to a k x N
`mask with O's placed at locations where no entry was
`present. These kN entries are serialized to get a sin(cid:173)
`gle I-dimensional mask of kN entries. Now, we can
`apply the I-dimensional convolution algorithms out(cid:173)
`there are (N - k) locations
`lined above. Note that
`with O's as their mask value. Hence, we can simply
`have (N - k) stages of shift registers. Secondly, for
`improper positions of this new N k-element mask we
`get values which are not really part of the output.
`These need to be ignored. Finally, the assumption on
`the input is that the pixels are being communicated
`in a raster scan order. Note that we can use either
`of the I-dimensional convolution algorithms described
`above depending on the value of the convolution mask
`coefficients. The scheme is shown in Figure 4(c).
`Implementation
`Issues
`The general convolution algorithm needs to be tuned
`for the special hardware being used. For example, we
`have only 16 PEs. We need to map virtual PEs to
`physical PEs. The second issue is the number of shift
`registers which depends on the row size of the im(cid:173)
`age. The third issue is implementation of multipliers
`
`206
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2157, p. 4
`
`

`

`needed by the PEs. Following is a summary of our
`solution to these issues.
`
`• Large number of mask entries: If mask size ( k)
`is greater than the number of available PEs, the
`virtual PEs are mapped to available PEs. In car(cid:173)
`rying out the mapping, the timing model between
`PEs must be satisfied.
`
`• Larger image width: Image has to be split into
`smaller sub-images of some predefined size. In or(cid:173)
`der to handle this, we develop the 2-dimensional
`convolution with a fixed width (32 in our case).
`Larger images are split into sub-images of 32 x
`32 pixels and depending on the mask size the re(cid:173)
`quired rows and columns at the border are copied.
`
`• Multiplier implementation:
`
`-
`
`- Masks with integer values: If mask values
`are of the type 2P, then the multiplier in each
`virtual PE can be replaced with simple bit
`shifters.
`Integer masks but not powers of 2: An effi(cid:173)
`cient scheme for this is being developed.
`- Real valued masks: If mask values are nor(cid:173)
`malized (ranges between -1 to 1), then a suit(cid:173)
`able scheme for implementation on FPGAs
`is needed.
`
`The Splash PEs carry out two different activities,
`namely, (i) additions and multiplications, and (ii) shift
`operations. We have built two different elements for
`each of the activities, namely, (i) compute element,
`and (ii) shift element. The number of compute ele(cid:173)
`ments and shift registers is determined by the mask
`size and image width. The schematics of both these
`elements are shown in Figure 5.
`
`shown in Table 2. The basic sequential convolution
`algorithm (Figure 3) running on different Sun host
`In addition, timing on a
`machines has been timed.
`recently developed i-860 based system from Alacron is
`reported. The timing results on CM-5 are taken from
`[16] and are for edge detection using a set of six 5 x
`5 convolution masks.
`the convolution algorithm on
`For implementing
`Splash 2, we have chosen three standard edge detectors
`used in low-level computer vision algorithms. The fil(cid:173)
`ters and their mappings on Splash 2 PEs are shown in
`Figure 6. These operators have two convolution stages
`and a final stage to compute the edge magnitude at
`each pixel by computing the absolute sum of the two
`convolution output images. Each PE accommodates
`the required shift registers for that stage. The outputs
`for the house image using 3 x 3 Sobel edge detector,
`5 x 5 Prewitt edge detector and 7 x 7 Prewitt edge
`detector are shown in Figure 7. The timings for these
`edge detectors on Splash 2 are shown in Table 3.
`Our approach for implementing convolution oper(cid:173)
`ation on Splash 2 is different in many ways from the
`approach taken by Peterson et al. [17]. The main dif(cid:173)
`ferences are: (i) We are not limited by a fixed mask
`size of 8 x 8 as done in [17]. For smaller masks, Pe(cid:173)
`terson et al. [17] have used the same 8 x 8 masks filled
`with zeros. This may be due to hard-coded model of
`computing on Splash 2; (ii) Peterson et al. use all the
`16 PEs for implementing their algorithm. We use less
`number of PEs for smaller mask sizes (k < 8). There(cid:173)
`fore, in our approach, we will have more PEs available
`for implementing many other stages of a vision sys(cid:173)
`tem; and (iii) Our mapping on Splash 2 results in a
`higher performance of nearly 100 frames per second
`compared to 15 frames per second reported by Peter(cid:173)
`son et al. [17].
`
`4 Results
`
`A brief summary of the analysis of several convolu(cid:173)
`tion algorithms is given in Table 1 based on the com(cid:173)
`munication facility available on the respective archi(cid:173)
`tectures. In a systolic algorithm, the communication
`overhead is balanced by the computation phase. More(cid:173)
`over, no complex communication facility is needed. In
`this sense, systolic algorithm is better in terms of total
`work done and communication simplicity.
`We compare our implementation on Splash 2 with
`implementations on different platforms in terms of to(cid:173)
`tal execution time. The timings for 3 x 3 convolution(cid:173)
`based Sobel edge detector on a 512 x 512 image are
`
`5 Conclusions
`
`Convolution with integer mask values that are pow(cid:173)
`ers of two has been efficiently implemented on Splash
`2. The algorithm scales well with mask size. For the
`most general case of real-valued masks could be han(cid:173)
`dled using look-up tables.
`We are currently designing a fixed point multiplier
`on the chip to achieve an efficient implementation with
`real-valued masks. Because of the limit on logic that
`can be accommodated in a PE in Splash 2, mapping of
`virtual PEs to physical PEs will become difficult after
`a certain point. Moreover, if we have a large number
`of lookups, then the performance degrades linearly.
`Manseur [18] has proposed a scheme to decompose
`
`207
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2157, p. 5
`
`

`

`large sized convolution masks into a set of 3 x 3 masks.
`We will explore this technique further to provide an
`efficient solution to large mask sizes.
`
`Acknowledgments We would like to thank Duncan
`Buell, Jeff Arnold and Brian Schott of Supercomput(cid:173)
`ing Research Center, Bowie, Maryland for their help
`and suggestions. We are thankful to Sea Choi for his
`help in debugging the VHDL code.
`
`References
`
`[1] H. T. Kung, L. M. Ruane, and D. W. L. Yen,
`"A two-level pipelined systolic array for convolu(cid:173)
`tions," in VLSI Systems and Computations (H. T.
`Kung, B. Sproull, and G. Steele, Eds.), pp. 255-
`264, Maryland: Computer Science Press, 1981.
`
`[2] Z. Fang, X. Li, and L. M. Ni, "Parallel algorithms
`for image template matching on hypercube SIMD
`computers,"
`IEEE Trans. on Pattern Analysis
`and Machine Intelligence, vol. 9, pp. 835-841,
`November 1987.
`
`"Convolution on
`[3] S. Ranka and S. Sahni,
`mesh connected multicomputers,"
`IEEE Trans.
`on Pattern Analysis and Machine Intelligence,
`vol. 12, pp. 315-318, March 1990.
`
`[4] H. T. Kung, "Why systolic architectures?," IEEE
`Computer, vol. 15, pp. 37-46, January 1982.
`
`[5] A. V. Kulkarni and D. W. L. Yen, "Systolic pro(cid:173)
`cessing and implementation for signal and image
`processing," IEEE Trans. on Computers, vol. 31,
`pp. 1000-1009, October 1982.
`
`"Efficient
`[6] V. K. P. Kumar and V. Krishnan,
`parallel algorithm for image template matching
`on hypercube SIMD machines,"
`IEEE Trans.
`on Pattern Analysis and Machine Intelligence,
`vol. 11, pp. 665-669, June 1989.
`
`[7] S. Y. Lee and J. K. Aggarwal, "Parallel 2-D con(cid:173)
`volution on a mesh connected array processor,"
`IEEE Trans. on Pattern Analysis and Machine
`Intelligence, vol. 9, pp. 590-594, July 1987.
`
`[8] J. H. Chang, 0. H. Ibarra, T. Pong, and S. M.
`Sohn, "Two-dimensional convolution on a pyra(cid:173)
`mid computer," IEEE Trans. on Pattern Analysis
`and Machine Intelligence, vol. 10, pp. 590-593,
`July 1988.
`
`[9] C. Chakrabarti and J. Jaja, "VLSI architectures
`for template matching and block matching,"
`in
`Parallel architectures and algorithms for image
`understanding (V. K. P. Kumar, Ed.), pp. 3-27,
`San Diego: Academic Press; 1991.
`
`[10] N. Ranganathan and S. Venugopal, "An efficient
`VLSI architecture for template matching based
`on momemt preserving pattern matching,"
`in
`Proc. of 12th Intnl. Association of Pattern Recog(cid:173)
`nition, Jerusalem, pp. 388-390, 1994.
`
`[11] M.A. D. Barros and M. Akil, "Low level image
`processing operators on FPGA: Implementation
`examples and performance evaluation,"
`in Proc.
`of 12th Intnl. Association of Pattern Recognition,
`Jerusalem, pp. 262-278, 1994.
`
`[12] P. M. Athanas and A. L. Abbott, "Real-time im(cid:173)
`age processing on a custom-computing platform,"
`IEEE Computer, vol. 28, pp. 16-24, February
`1995.
`
`[13] J. M. Arnold, D. A. Buell, and E. G. Davis,
`"Splash 2," in Proceedings 4th Annual ACM Sym(cid:173)
`posium on Parallel Algorithms and Architectures,
`pp. 316-322, 1992.
`
`[14] D. A. Buell, J.M. Arnold, and W. J. Kleinfelder,
`Eds., Splash 2: FPGAs for Custom Computing
`Machines. Los Alamitos: IEEE Computer Soci(cid:173)
`ety Press, 1995.
`
`[15] P. P. Jonker, '(Why linear arrays are better image
`processors," in Proc. of 12th Intnl. Association
`of Pattern Recognition, Jerusalem, pp. 334-338,
`1994.
`
`[16) V. K. Prasanna, C.-L. Wang, and A. A. Khokhar,
`"Low level vision processing on connection ma(cid:173)
`chine CM-5," in Proc. of Computer Architecture
`for Machine Perception, Florida, pp. 117-126,
`1993.
`
`[17) J. B. Peterson and P. M. Athanas, ('Addressing
`the computational needs of high-speed image pro(cid:173)
`cessing with a custom computing machine," Jour(cid:173)
`nal of VLSI Signal Processing. Under review.
`
`[18] Z. Z. Manseur and D. C. Wilson, "Decomposi(cid:173)
`tion methods for convolution operators," CVGIP:
`Graphical Models and Image Processing, vol. 53,
`pp. 428-434, September 1991.
`
`[19] Sharplmage Software, New York, The HIPS Im(cid:173)
`age Processing Software, 1993.
`
`208
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2157, p. 6
`
`

`

`Inpu mage
`
`Real-valued A
`
`Integer-valued masks
`
`Real-valued masks
`
`Values are
`powers of2
`
`General integer values
`
`Figure 1: A t!l,xonomy for convolution.
`
`I (x-k/2, y-k/2)
`
`...
`
`I (x+k/2, y-k/2)
`
`M (k/2, -k/2)
`
`M (k/2, -k/2)
`
`.
`.
`
`I (x-k/2, y+k/2)
`
`.
`.
`
`I (x+k/2, y+k/2
`
`(x, y)
`
`...
`
`M (-k/2, k/2)
`
`M (k/2, k/2)
`
`Figure 2: The convolution operator. I(ij) is the input image value at pixel (ij) and M(u,v) is the mask value at
`(u,v).
`
`for i = 1. to N do
`for j = 1 to N do
`sum= 0
`for u = -k/2 to k/2 do
`for v= -k/2 to k/2 do
`sum = sum+I[i+u}{j+v} x M{u}[v]
`O{i}[j} = sum
`
`end
`end.
`
`I(ij) and O(ij) are the input image value and
`Figure 3: Sequential algorithm for 2-dimensional convolution.
`output value, respectively, at pixel (ij) and M(u,v) is the mask value at (u,v).
`
`209
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2157, p. 7
`
`

`

`Psu
`
`pixe
`
`'
`
`'
`
`PE-1
`
`Psum(t-1)
`
`Psum(t-1)
`
`Psum(t-1J
`
`Result
`
`PE-2
`
`;
`pixel_in
`
`pixel_in
`
`PE-3
`
`PE-k
`
`pixel_in
`
`Pixel_in
`
`t-:l
`f--'
`0
`
`Lookup
`Table
`
`.
`
`i
`
`PE-1
`
`Lookup
`Table
`
`:
`
`'
`' PE-1
`
`Psu
`
`pixe
`
`Psu
`
`pixe
`
`(a) 1-D Convolution
`
`Lookup
`Table
`
`t
`
`Psum(t-1)
`
`Psum(t-1J
`
`PE-2
`
`pixel_in
`
`pixel_in
`
`Lookup
`Table
`
`t
`
`PE-3
`
`(b) 1-D convolution with Memory lookup
`
`Lookup
`Table
`
`t
`
`Psum(t-1) -
`~ PE-k
`pixel_in
`
`Lookup
`Table
`
`1
`
`PE-2
`
`/
`
`Psum(t-1J
`
`pixeUn
`
`Lookup
`Table
`
`-',
`
`~
`
`PE-k
`
`pixel_in
`
`Result
`
`PixeUn
`
`Lookup
`Table
`
`N-k SRs
`
`Psum(t-1J
`
`Result
`
`PE-k*k
`pixel_in
`~---....J
`
`PixeUn
`
`(c) 2-D Convolution with shift registers and memory lookup
`
`Figure 4: Systolic schemes for convolution on Splash 2. (a) 1-dimensional convolution; (b) 1-dimensional convolution using memory lookup; (c)
`2-dimensional convolution with shift registers and memory lookup.
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2157, p. 8
`
`

`

`Partial_sumjn
`
`Partlal_sum_in
`
`Partial_sum_out
`
`Pixel_in
`
`PixeLout
`
`(a)
`
`PixeUn
`
`•
`
`~--1---
`
`(b)
`
`Partial_sum_out
`
`Pixel_out
`
`Figure 5: Different functional units. (a) Compute elements; (b) Shift registers.
`
`Computational Complexity No. of PEs Architecture
`Convolution algorithm
`(N2)
`Ranka & Sahni [3] O(k 2 x(# of bits in a pixel)
`Mesh
`Prasanna Kumar & Krishnan [6]
`O(logk)
`Hypercube
`O(N 2k2)
`k2 + logN
`O(N 2)
`Pyramid
`Chang et al. [8]
`Systolic
`O(k2)
`Kung [1] algorithm
`O(N 2
`O(N 2 +k2/2)
`O(k--z)
`VLSI
`Ranganathan et al. [10]
`
`)
`
`Table 1: Comparative analysis of convolution algorithms.
`
`Architecture
`Machine
`SPARCstation 20 Von Neumann
`Model 40
`SPARCstation 10 Von Neumann
`Model 30
`SPARCstation 2 Von Neumann
`
`Time
`3.8 s (elapsed),
`1.5 s (user), 0.2 s (system)
`3.9 s (elapsed),
`2.7 s (user), 0.3 s (system)
`5.6 s (elapsed),
`4.0 s (user), 0.5 s (system)
`51.9 ms
`13.89 ms
`
`Remarks
`Using HIPS [19] function
`
`Using HIPS [19] function
`
`Using HIPS [19] function
`
`CM-5
`CM-5
`
`MIMD (512 PEs)
`MIMD (32 PEs)
`
`40 ms
`605.75 ms
`
`Image Size
`Mask Size 128 X 128 256 X 256 512 X 512
`206.6
`813.3
`51.6
`3 x3
`216.6
`864.9
`54.9
`5 X 5
`228.3
`914.9
`7 X 7
`56.6
`
`Table 3: Run times (in miliseconds) of edge detection on Splash 2 for various image sizes and convolution mask
`sizes.
`
`211
`
`i-860
`Splash 2
`
`Splash 2
`
`Pipelined
`FPGA based
`
`FPGA based
`
`65 ms
`
`Timings as reported by vendor
`Mask values being powers of 2,
`{32 x 32 base size)
`Table lookup for multiplication
`{32 x 32 base size)
`Result reported in [16]*
`Result reported in [16]*
`~
`Table 2: Timings for 3 x 3 Sobel edge detection for a 512 x 512 image on different platforms. * Results are for
`an edge detector based on six 5 x 5 convolution masks.
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2157, p. 9
`
`

`

`x4
`Xs
`( a) Sobel Filter (3 X 3)
`
`Xs
`
`(b) Prewitt Filter (5 X 5)
`
`Xg
`
`X11
`X10
`(C) Prewitt Filter (7 X 7)
`
`Figure 6: Implementation of three edge detection filters on Splash 2. (a) 3 x3 Sobel filters; (b) 5 x 5 Prewitt
`filters; (c) 7 x 7 Prewitt filters; X1 - X15 denote the PEs in Splash 2.
`
`212
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2157, p. 10
`
`

`

`(c)
`
`Figure 7: Results of 2-dimensional convolution. (a) Input image; (b) 3 x 3 Sobel Filter output; (c) 5 x 5 Prewitt
`Filter output; ( d) 7 x 7 Prewitt Filter output.
`
`213
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2157, p. 11
`
`

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket