`The System Architecture and Networking group
`
Accelerating SIFT Feature Extraction with a Vector DSP:
A Feasibility Study
`
`Master Thesis
`
`Arjun Sathyanarayana Prasad
`
`Supervisors:
`Prof. Dr. Ir. C. H. (Kees) Van Berkel
`Ir. D. (David) Van Kampen
`Dr. Ir. W. E. H. (Wim) Kloosterhuis
`Prof. Dr. K. G. W. (Kees) Goossens
`
`This thesis work has been carried out at Ericsson. The content of this report is confidential until August 29, 2015
`
`Bradium Technologies LLC Exhibit 2002
`Microsoft Corp. v. Bradium Tech., IPR2016-00448
`
`Page 1
`
`
`
`
`
`
`
`
`
`Acknowledgments
`I would like to express my deepest gratitude to Professor Dr. Ir. Kees Van Berkel from TU,
`Eindhoven for providing me an opportunity to conduct this thesis work under him. He has been
`motivating and has guided me to carry out this thesis work in a fruitful manner.
`
`I am extremely grateful to David van Kampen from Ericsson for tutoring me throughout this
`thesis work. He has given me valuable suggestions and feedback in each and every stage of my
`work. His willingness to give time so generously is immensely appreciated.
`
`Special thanks to Dr. Ir. Wim Kloosterhuis from Ericsson for advising me in the critical stages of
`the thesis work and providing me the constructive feedback. He has made time to review my
`work even with a busy schedule which is much appreciated.
`
`Lastly, I would also like to extend my thanks to Ericsson and specially the EVP team for being
`helpful and accommodating for this period. This has been a great learning experience.
`
`Arjun Sathyanarayana Prasad
`
`
`
`
`
`
`Table of Contents
`1.
`Introduction ......................................................................................................................................... 5
`1.1. Digital Image ..................................................................................................................................... 6
`1.2. Embedded vector Processor ............................................................................................................... 8
`1.2.1. Implementation of Sobel kernel on EVP ........................................................................................ 9
`1.3. Problem Description ........................................................................................................................ 16
`1.4. Related work .................................................................................................................................... 17
`2. Scale invariant Feature Transform – Algorithm description ....................................................... 18
`2.1. Scale space extrema detection .......................................................................................................... 19
`2.1.1. Difference of Gaussian.............................................................................................................. 19
`2.1.2. Local Extrema detection ........................................................................................................... 21
`2.2. Accurate keypoint localization......................................................................................................... 22
`2.2.1. Peak/Contrast Threshold ........................................................................................................... 23
`2.2.2. Edge Threshold ......................................................................................................................... 25
`2.3. Orientation assignment .................................................................................................................... 26
`2.4. Keypoint Descriptor generation ....................................................................................................... 27
`2.5. Object recognition ............................................................................................................................ 27
`3. Architecture ....................................................................................................................................... 29
`3.1. Experimental setup ........................................................................................................................... 29
`3.2. Data flow .......................................................................................................................................... 30
`4. Mapping SIFT on EVP – Problems and Approaches .................................................................... 32
`4.1. Application ....................................................................................................................................... 32
`4.2. Kernels ............................................................................................................................................. 33
`4.2.1. Up-sampling .............................................................................................................................. 33
`4.2.2. Gaussian blurring ...................................................................................................................... 37
`4.2.3. Difference of Gaussian.............................................................................................................. 51
`4.2.4. Down-sampling ......................................................................................................................... 52
`4.2.5. Extrema detection ..................................................................................................................... 53
`4.2.6. Gradient and orientation assignment ......................................................................................... 57
`4.2.7. Keypoint refining ...................................................................................................................... 59
`5. Results and Observations ................................................................................................................. 61
`5.1 Data-type exploration for precision .................................................................................................. 61
5.2 Performance of Gaussian pyramid ..................................................................................................... 68
`
`
`
`
`
`
5.3 Performance of difference of Gaussian, extrema detection and gradient pyramid ............................ 71
`5.4 Total execution time ......................................................................................................................... 73
`5.5 Benchmarking ................................................................................................................................... 73
`6. Conclusions and Future work .......................................................................................................... 80
`6.1 Conclusions ....................................................................................................................................... 80
`6.2 Future work ....................................................................................................................................... 80
`7. References .......................................................................................................................................... 81
`
`
`
`
`
`
`
`
`
`
`
`
`1. Introduction
Over the past decade we have seen rapid adoption of smartphones among users worldwide. It is estimated
that close to 1 billion smartphones were sold in 2013 alone [1]. This growth of the smartphone business can be
attributed not only to the affordability of these devices but also to tremendous advancements in processor, battery
and memory technologies. These devices can handle tasks that were unimaginable a few years ago, hence the name
smartphone. The number of applications built into these mobile devices is ever increasing, but so is the
computational capability of mobile processors. One survey [2] shows that the majority of users purchase
smartphones based on their ability to support personal applications. The report also indicates that users are most
likely to use their phone for multimedia applications such as playing games, photo editing, streaming audio/video
content and web surfing. Moreover, intriguing applications like augmented reality are just around the corner and
will soon find their way into our lives [3]. Hence, the ability to handle multimedia content is crucial, and not just for
smartphones but also for other devices in use today such as tablets, smart glasses and gaming consoles. However, it
is well known that processing multimedia content is highly resource intensive and has a direct impact on a device's
performance.
`
Of particular relevance to this report are the fields of image analysis and computer vision. Image analysis involves a
series of mathematical transformations performed on digital images to extract meaningful information. One of the
ultimate goals of image analysis is to use computers to emulate human vision: a computer needs to be capable of
learning from its visual inputs, making inferences and taking the necessary actions. Computer vision is a field of
study that encompasses a wide variety of algorithms aimed at mimicking the functionality of the human visual
system. Computer vision and image analysis concepts are applied in numerous areas, spanning from a simple photo
editor to complex augmented reality applications.
`
Sophisticated computer vision applications require performing transformations on images at very high frame rates.
An image consists of a large amount of data (pixels), on the order of hundreds of kilobytes (KB) to a few megabytes
(MB), which has a direct impact on system performance during analysis. Running image analysis algorithms on
general-purpose processors (GPPs) is often not recommended, as they are very inefficient at this task. Especially in
embedded systems with tight timing constraints, it is difficult to achieve real-time performance on resource-intensive
algorithms using only GPPs (CPUs). Hence, different alternatives have been explored in the form of co-processors,
digital signal processors (DSPs), graphics processing units (GPUs), SIMD/MIMD (single instruction multiple data /
multiple instruction multiple data) processors and hardware accelerators, all working in tandem with CPUs to share
the workload. GPUs are by far the most preferred choice in state-of-the-art mobile platforms, while it is no surprise
that custom accelerators are the most computationally efficient alternative. GPUs offer a high degree of flexibility,
which can be exploited to improve the performance of a variety of algorithms, but they are power hungry.
Accelerators, though very efficient, offer the least flexibility and hence are undesirable when standards vary. SIMD
processors occupy a sweet spot in this hierarchy of processing units: they provide a better degree of flexibility than
accelerators, and they can be more computationally efficient than GPUs if the algorithms are vectorizable, which is
often the case for image transformations. Since SIMD processors have architectures similar to GPUs, considering
SIMD processors as a design alternative for image analysis in mobile platforms is highly relevant in this context.
`
In this report we focus on the implementation of a widely popular and robust feature extraction algorithm called the
Scale-Invariant Feature Transform (SIFT) [4], which is applied in a variety of computer vision applications. Some of
the transformations used in SIFT exhibit a high degree of data-level parallelism, and these are mapped onto a SIMD
processor developed by Ericsson called the Embedded Vector Processor (EVP).
`
The structure of this chapter is as follows: in Section 1.1 we briefly discuss the basics of a digital image and its
characteristics. Section 1.2 gives a conceptual description of the architecture of the Embedded Vector Processor
(EVP) from Ericsson, with an illustration of a simple image analysis algorithm mapped on the EVP. In Section 1.3
we formulate the problem statement for our work. In Section 1.4 we briefly discuss related work on SIFT.
`
1.1. Digital Image
A digital image is a numerical representation of a two-dimensional image. There are two types of representation of
digital images: raster images and vector graphics. Digital images are normally associated with raster/bitmap images,
since most imaging devices, displays, etc. are raster devices. A raster image consists of a finite number of elements
called pixels arranged in rows (height) and columns (width). Each pixel has an integer value assigned to it,
indicating the brightness level of a particular color, which can be a component of RGB (red, green and blue) or
grayscale.
The color depth (bit depth) of an image is the number of bits used to indicate the brightness level of a single pixel.
In color images a single pixel can be a group of sub-pixels for the individual color components (RGB), and the color
depth applies to each individual sub-pixel, expressed as bits per channel (BPC), bits per sample (BPS) or bits per
pixel (BPP). In grayscale images, pixels have a single brightness value, and the BPP indicates the number of gray
levels that can be present in the image, ranging between and including white and black. The color depth of an image
specifies the number of distinct colors that can be displayed, but not which colors. For that there is another
parameter, the color gamut, which specifies the subset of colors that can be represented accurately within a given
color space (as perceived by the human eye) or by a certain output device. Figure 1.1, a CIE (International
Commission on Illumination) diagram, illustrates the concept of the color gamut: it shows the full range of colors
perceived by the human eye and the subset of colors that a common HDTV can display. Sometimes, along with the
color components, another channel indicating transparency, called the alpha channel, is also encoded in an image.
Typically, images of 8 BPC color depth with or without an alpha channel are used, since operations on such images
are computationally simpler and the number of colors is sufficient given the sensitivity of the human eye. Table 1.1
lists different variants of commonly used color systems along with their applications.
`
Figure 1.1. Color gamut of an HDTV (CIE diagram). Source: [5]
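The relation between color depth and the number of distinct representable colors is simply a power of two, as the entries in Table 1.1 show. A minimal C sketch (the function name is ours, chosen for illustration):

```c
/* Number of distinct values representable with a given color depth:
   2^depth. E.g. 8 bits give the 256 colors of early displays and
   24 bits the 16,777,216 colors of modern desktops (cf. Table 1.1). */
unsigned long long colors_for_depth(unsigned depth)
{
    return 1ULL << depth;
}
```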
`
Bitmap/raster images are stored in compressed or uncompressed file formats. For display on screen, these images
are rasterized into grids of pixels with the designated color depth. Compression techniques are needed because
storing images in raw raster format requires a lot of memory; the size of an image is directly related to its color
depth and the number of pixels it contains. Compression techniques can be either lossy or lossless.
`
`
`
`
`
`
Color system               Bits per channel (color depth)   Color gamut                           Applications
8-bit true color           3 (R), 3 (G), 2 (B)              256 colors                            Early computers
16-bit direct color        4 (RGB) + 4 alpha                4096 colors + 16-level transparency   Mobile phones,
                           5 (RGB) + 1 alpha                32768 colors + 2-level transparency   small device displays,
                           5 (R) + 6 (G) + 5 (B)            65536 colors                          photographic images
18-bit color               6 BPC (RGB)                      262,144 colors                        LCD displays
24-bit true color          8 BPC (RGB) + alpha (optional)   16,777,216 colors                     Modern desktops
Deep color (30/36/48-bit)  10/12/16 BPC                     Billions of colors                    Graphics cards

Table 1.1. Different color systems, their characteristics and applications
`
The following categories of file formats are commonly used to store images.

Uncompressed - stores images in rasterized format. Image transformations are easy to perform, as each pixel can be
parsed directly. Examples: BMP, PGM, PBM.

Lossless compression - an exact replica of the original raster grid is stored, but with reduced file size. The file size
depends on the actual graphical complexity of the image. This is often advantageous for graphically simple images
(images with continuous regions, such as cartoons, or intermediate edited images), as better compression can be
achieved. Examples: PNG, GIF, TIFF.

Lossy compression - image quality is traded for file size. The algorithms are typically designed to exploit the
limitations of the human eye and discard imperceptible information, so in practice the image may appear perfectly
good even though it is of lower quality. Lossy formats are generally better than lossless compression at achieving
small file sizes and are typically used for images stored on the internet. Examples: JPEG, TIFF, EXIF.

Table 1.2 summarizes different file formats, their characteristics and applications. Most image transformations
require the images to be in uncompressed format. Hence, the test images chosen in this report are in the PGM file
format with 8-bit color depth (grayscale), as SIFT is applied to gray images. The concepts can, however, be extended
to images with different color depths and color systems.
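Since the test images are binary PGM files, a minimal header parser suffices to obtain their dimensions. The sketch below handles the common "P5" layout only; comment lines and other netpbm variants are deliberately omitted, and the function name is our own:

```c
#include <stdio.h>
#include <string.h>

/* Minimal parser for the header of a binary PGM ("P5") file.
   Fills width/height/maxval and returns 1 on success, 0 otherwise.
   PGM comment lines (starting with '#') are not handled in this sketch. */
int read_pgm_header(FILE *f, int *width, int *height, int *maxval)
{
    char magic[3] = {0};
    if (fscanf(f, "%2s", magic) != 1 || strcmp(magic, "P5") != 0)
        return 0;
    if (fscanf(f, "%d %d %d", width, height, maxval) != 3)
        return 0;
    fgetc(f); /* consume the single whitespace before the pixel data */
    return 1;
}
```

After a successful call, width * height bytes of pixel data (for maxval <= 255) follow at the current file position.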
`
File format   Type           Color depth                                  Compression   Use
JPEG          Lossy          8 BPC (RGB, gray)                            ++            Final distribution, internet
PNG           Lossless       8/24/48-bit true color, with/without alpha   +             Intermediate images
GIF           Lossless       8 bit (256 colors)                           +             Animation, logos, cartoons
BMP           Uncompressed   1/4/8/16/24/32 bit                           --            Cameras, photos, image processing
PGM           Uncompressed   8-bit grayscale                              --            Image processing

Table 1.2. File formats, their characteristics and applications
`
1.2. Embedded vector Processor
The Embedded Vector Processor (EVP) is a vector DSP developed by Ericsson, used in Ericsson modems for
baseband processing of mobile standards [6]. Figure 1.2 gives a conceptual overview of its architecture.

The EVP has a VLIW architecture and hence can pack multiple operations into one instruction and execute them in
a single clock cycle on parallel functional units. It is made up of a scalar data computation unit (SDCU) and a vector
data computation unit (VDCU). The SDCU can operate in parallel to the VDCU and performs common scalar
operations. The SIMD (vector) width of the EVP is 256 bits, and vector operations can be performed at granularities
of 8 bits (limited), 16 bits and 32 bits. Thus the number of elements per vector (P) that can be operated on in parallel
is 32, 16 or 8, depending on the granularity. From the implementation point of view, the VALU, VLSU, VMAC,
VBCST and VSHU units of the VDCU are of prime importance and will be discussed in more detail below.
`
Figure 1.2. Architecture of the EVP. Source: [6]
`
• The VDCU consists of 16 general-purpose vector registers shared among all functional units; each vector is
  256 bits wide.
• The vector load and store unit (VLSU) forms the interface between the vector units and the data memory. It
  supports aligned and unaligned loads of vectors from memory. Aligned loads operate on 256-bit boundaries,
  while unaligned loads can load vectors at 8-bit boundaries. Aligned vector loads have a latency of 3 cycles,
  unaligned loads 4 cycles. The initiation interval (cycles between two operations of a given type) for both
  kinds of loads is 1 cycle. The load and store unit internally maintains a fully associative cache for unaligned
  loads and fetches consecutive vectors in a stream. This cache is limited in size and hence often restrictive
  (only a maximum of three data streams are supported efficiently).
• The vector arithmetic and logic unit (VALU) supports basic ALU operations such as addition, subtraction
  and comparison, as well as logical bitwise operations, on 8/16/32-bit data types. These operations have a
  latency and initiation interval of 1 cycle.
• The vector multiply-accumulate unit (VMAC) supports integer/fixed-point and floating-point
  multiplications. The EVP can multiply only 16-bit arguments, with accumulation supported up to 16/32/40
  bits; there is no support for 8-bit granularity. The latency of integer MAC operations is 2 cycles, with an
  initiation interval of 1 cycle. Capabilities such as saturation and rounding are also included.
• The vector shuffle unit (VSHU) is used to organize and shift/rotate elements in a vector. It uses shuffle
  patterns (16 patterns supported) to re-organize values inside a vector. The latency and initiation interval are
  1 cycle each.
• The vector broadcast unit (VBCST) can be used to broadcast a scalar element to all vector data paths.
• Vector mask registers are supported and are used in combination with various operations on the EVP to
  manipulate only parts of vector data. For example, shuffle operations in conjunction with mask registers
  can be used to merge two different vectors. The EVP also contains a logical unit called the vector mask
  arithmetic and logic unit (VMALU) for performing logical operations on vector masks.
• Almost all operations can be used as predicated operations. A predicated operation takes an additional
  parameter of type bool which determines whether the operation is performed or not.
• Extra information: an intra-vector unit (IVU) is present on the EVP which supports operations on data
  elements within a vector. The EVP supports bypassing, so that results of one functional unit can be passed
  directly to other units without a register store in the pipeline stages. An address computation unit (ACU) is
  also present for performing arithmetic operations on addresses. Specialized program control units that run
  in parallel to the VDCU and SDCU support hardware loops, calls and branching.
• Software development is done in a language called EVP-C, which is a superset of ANSI C extended to
  support the specialized functions of the EVP.
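The predicated-operation concept described above can be modeled in scalar C: each vector lane is updated only where a mask bit is set. This is only a sketch of the semantics (the function and the fixed P = 16 are ours, not EVP-C intrinsics):

```c
#define P 16  /* elements per vector at 16-bit granularity */

/* Scalar model of a predicated vector add: lane i of dst takes
   a[i] + b[i] only where mask[i] is set; other lanes are untouched. */
void predicated_add(short dst[P], const short a[P], const short b[P],
                    const unsigned char mask[P])
{
    for (int i = 0; i < P; i++)
        if (mask[i])
            dst[i] = (short)(a[i] + b[i]);
}
```

On the EVP the predicate is applied per instruction in hardware, so no branch is involved; the loop here is purely illustrative.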
`
`
1.2.1. Implementation of Sobel kernel on EVP
To illustrate the mapping of image analysis kernels on the EVP, let us take the example of a simple edge detection
algorithm called the Sobel operator. The Sobel filter consists of two 3x3 kernels, Gx and Gy, convolved with an
image (I) in the X and Y directions as shown below.
`
         [ -1  0  1 ]              [ -1  -2  -1 ]
    Gx = [ -2  0  2 ] * I     Gy = [  0   0   0 ] * I
         [ -1  0  1 ]              [  1   2   1 ]
`
The two kernels compute the gradient variations in their respective directions. These gradients capture sharp
variations in the brightness levels across the image and thus detect edges [6]. Figure 1.3 shows the output of a Sobel
filter applied to a standard test image of size 512x512.

Convolution is performed by moving each of the kernels across the image, one pixel at a time, computed as follows.
Consider the kernel as an overlay centered on a pixel of interest. The gradient is then calculated as the weighted sum
of the pixel of interest and its corresponding 8 neighbors. Figure 1.4 demonstrates this.
`
`Consider “I11” as the pixel of interest. Then the resulting gradients at I11 can be calculated as follows.
`
`Gx11/Gy11 = G00*I00 + G01*I01 + G02*I02 + G10*I10 + G11*I11 + G12*I12 + G20*I20 + G21*I21 + G22*I22
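This per-pixel weighted sum can be sketched in C as below. The row-major layout and the function name are our own choices for illustration; (x, y) is the pixel of interest and must not lie on the image border:

```c
/* Weighted sum of a pixel and its 8 neighbors for one 3x3 kernel,
   i.e. the per-pixel step of the convolution described above.
   img is a row-major grayscale image of the given width. */
int convolve3x3(const unsigned char *img, int width, int x, int y,
                const int k[3][3])
{
    int sum = 0;
    for (int dy = -1; dy <= 1; dy++)        /* kernel rows    */
        for (int dx = -1; dx <= 1; dx++)    /* kernel columns */
            sum += k[dy + 1][dx + 1] * img[(y + dy) * width + (x + dx)];
    return sum;
}
```

Calling this once with Gx and once with Gy at the same pixel yields the two gradients Gx11 and Gy11 of the formula above.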
`
`
`
`Bradium Technologies LLC Exhibit 2002
`Microsoft Corp. v. Bradium Tech., IPR2016-00448
`
`Page 9
`
`
`
The coefficients are integers and are multiplied with pixel values in the range [0:255]. After Gx11 and Gy11 are
calculated, the next pixel, I12, becomes the pixel of interest and the filters are centered on it. The boundary pixels
are not computed on the EVP and are instead set to zero when storing the output image.
`
Figure 1.3. Sobel edge detection performed on the EVP - "lena" standard test image
`
`Figure.1.4. Illustration of convolution using Sobel kernels
`
After calculating the gradients in the two directions, the gradient magnitude at each pixel is calculated using the
following equation:

    G = sqrt(Gx^2 + Gy^2)

For simplicity, an approximate formula for the gradient magnitude is used instead:

    G = |Gx| + |Gy|
`
`
`
`
`
`
The gradient magnitude G is stored as the output image, which contains the edges. The border pixels are filled with
zeroes and are not operated on by the kernels. Although the two kernels Gx and Gy have 18 coefficients in total,
6 of them are zero, so in total the image needs to be convolved with only 12 coefficients.
`
Let us see what sequential pseudo code for applying the Sobel filters would look like.

W -> width of the image
H -> height of the image
I[H][W] -> input image
O[H][W] -> output image

for(i = 1; i < H-1; i++)
{
    for(j = 1; j < W-1; j++)
    {
        X  = -1*I[i-1][j-1];
        X += -2*I[i][j-1];
        X += -1*I[i+1][j-1];
        X +=  1*I[i-1][j+1];   // Convolution with Gx kernel
        X +=  2*I[i][j+1];
        X +=  1*I[i+1][j+1];

        Y  = -1*I[i-1][j-1];
        Y += -2*I[i-1][j];
        Y += -1*I[i-1][j+1];
        Y +=  1*I[i+1][j-1];   // Convolution with Gy kernel
        Y +=  2*I[i+1][j];
        Y +=  1*I[i+1][j+1];

        G = abs(X) + abs(Y);
        G = min(G, 255);       // Saturate
        O[i][j] = G;
    }
}
`
Now let us consider the implementation of the Sobel kernels on a SIMD machine, in this case the EVP. We re-
represent the whole image as an array of vectors, with each vector holding P = 16 elements. This is called
vectorization. For example, for a 512x512 image, each row is divided into 32 vectors of 16 pixels each. Figure 1.5
demonstrates how convolution with the Gx kernel is performed on vectors. In the Y direction, three consecutive
rows are convolved with the corresponding coefficients. In the X direction, shifted versions of the input stream are
obtained and convolved; this is illustrated in the figure for row Ri, and the same applies to rows Ri+1 and Ri+2. The
whole process is repeated in similar fashion for the Gy kernel.

The shifted versions of the input streams are obtained either by using unaligned loads on the EVP or by using
aligned loads in combination with shuffle and rotate operations; Figure 1.6 shows how this is achieved. These
operations give access to the neighbors of each element within a particular vector, so that the convolution output
contains the weighted sum of the relevant neighboring pixel values.
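The aligned-load-plus-merge/rotate trick for forming a shifted stream can be modeled in scalar C. This is only a sketch of the data movement (function and names are ours, not EVP-C intrinsics): two consecutive aligned vectors are merged and rotated so that the result starts `shift` elements into the stream, exactly what one unaligned load would deliver.

```c
#define P 16  /* elements per vector at 16-bit granularity */

/* Scalar model of merge-and-rotate: v0 holds stream elements n..n+P-1
   and v1 holds n+P..n+2P-1; out receives elements n+shift..n+shift+P-1,
   for 0 <= shift <= P. */
void merge_rotate(short out[P], const short v0[P], const short v1[P],
                  int shift)
{
    for (int i = 0; i < P; i++)
        out[i] = (i + shift < P) ? v0[i + shift] : v1[i + shift - P];
}
```

On the EVP this corresponds to one VSHU shuffle over the two registers, so the shifted vectors for the j-1 and j+1 kernel columns cost one operation each.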
`
There are two possible approaches to performing the convolution, which are explained in the following sections.
`
`
`
`
`
`
`Figure.1.5. Illustration of convolution on EVP
`
`
`
`
`
`Figure.1.6. Illustration of aligned and unaligned loads
`
`
`
`
`
`
Approach 1
Let us first consider an implementation on the EVP that is similar to the sequential implementation discussed in the
previous section. The operations are performed on 16 pixels at a time, producing output for all 16 pixels together.
Figure 1.5 illustrates this approach. Below is the pseudo code.
I[H][W/P] -> input array of vectors

for(i = 0; i < H-2; i++)
{
    j = 0;
    I1 = I[i+0][j*P : j*P+P-1];
    I2 = I[i+1][j*P : j*P+P-1];   // Aligned load of first vector in a row
    I3 = I[i+2][j*P : j*P+P-1];

    I11 = I[i+0][(j+1)*P : (j+1)*P+P-1];
    I22 = I[i+1][(j+1)*P : (j+1)*P+P-1];   // Load the next vector in the row
    I33 = I[i+2][(j+1)*P : (j+1)*P+P-1];

    for(j = 0; j < W/P; j++)
    {
        X[0 : P-1]  = (-1) .* I1;
        X[0 : P-1] += (-2) .* I2;   // MAC operations with coefficients
        X[0 : P-1] += (-1) .* I3;

        Y[0 : P-1]  = (-1) .* I1;
        Y[0 : P-1] += (1)  .* I3;

        T1 = (I1[1:P-1], I11[0]);   // Merge and rotate 1 position on VSHU
        T3 = (I3[1:P-1], I33[0]);

        Y[0 : P-1] += (-2) .* T1;
        Y[0 : P-1] += (2)  .* T3;

        T1 = (I1[2:P-1], I11[0], I11[1]);
        T2 = (I2[2:P-1], I22[0], I22[1]);   // Merge and rotate 2 positions
        T3 = (I3[2:P-1], I33[0], I33[1]);

        X[0 : P-1] += (1) .* T1;
        X[0 : P-1] += (2) .* T2;
        X[0 : P-1] += (1) .* T3;

        Y[0 : P-1] += (-1) .* T1;
        Y[0 : P-1] += (1)  .* T3;

        G[0 : P-1] = abs(X[0 : P-1]) + abs(Y[0 : P-1]);   // Final gradient
        G[0 : P-1] = min(G[0 : P-1], bcst(255));          // Clipping/Saturation
        O[i+1][j*P : j*P+P-1] = G[0 : P-1];               // Store

        I1 = I11;
        I2 = I22;   // Use as input vectors for the next iteration
        I3 = I33;

        I11 = I[i+0][(j+2)*P : (j+2)*P+P-1];
        I22 = I[i+1][(j+2)*P : (j+2)*P+P-1];   // Load a new vector
        I33 = I[i+2][(j+2)*P : (j+2)*P+P-1];
    }
}
`
This implementation generates two bogus values at the end of each row. The schedule generated by the compiler for
the innermost loop on the EVP is shown in Figure 1.7. It takes 12 cycles to calculate the output for 16 pixels; the
cycle count in this case is determined by the load on the MAC unit.
`
`Figure.1.7. Schedule of innermost loop for approach 1
`
Approach 2
In the schedule for approach 1 we see that there are 11 MAC operations in 12 cycles. The VMAC thus becomes the
critical resource: it is used in 11 out of 12 cycles to compute the output and therefore determines the performance of
the innermost loop. Since the coefficients of the kernels are simple, many of these MAC operations can be replaced
with plain addition and subtraction operations. This way we can exploit the VLIW architecture of the EVP to reduce
the load on the critical resource by distributing work across different functional units. Below is the pseudo code for
this approach.
I[H][W/P] -> input array of vectors

for(i = 0; i < H-2; i++)
{
    j = 0;
    I1 = I[i+0][j*P : j*P+P-1];   // Aligned load
    I2 = I[i+1][j*P : j*P+P-1];
    I3 = I[i+2][j*P : j*P+P-1];

    I11 = I[i+0][(j+1)*P : (j+1)*P+P-1];
    I22 = I[i+1][(j+1)*P : (j+1)*P+P-1];   // Next vector in the row
    I33 = I[i+2][(j+1)*P : (j+1)*P+P-1];

    Xmul1[0 : P-1] = (2) .* I2;   // Partial values of convolution
    Ymul1[0 : P-1] = (2) .* I1;
    Ymul3[0 : P-1] = (2) .* I3;

    X1[0 : P-1] = I1 + I3 + Xmul1;
    Y1[0 : P-1] = I3 - I1;
    Y2[0 : P-1] = Ymul3 - Ymul1;

    for(j = 0; j < W/P; j++)
    {
        Xmul10[0 : P-1] = (2) .* I22;   // Partial values for the next vector
        Ymul10[0 : P-1] = (2) .* I11;
        Ymul30[0 : P-1] = (2) .* I33;

        X10[0 : P-1] = I11 + I33 + Xmul10;
        Y10[0 : P-1] = I33 - I11;
        Y20[0 : P-1] = Ymul30 - Ymul10;

        T1 = (X1[2:P-1], X10[0], X10[1]);   // Merge two vectors, rotate 2 positions
        X[0 : P-1] = T1 - X1;

        T2 = (Y2[1:P-1], Y20[0]);   // Merge two vectors, rotate 1 position
        Y[0 : P-1] = Y1 + T2;

        T2 = (Y1[2:P-1], Y10[0], Y10[1]);   // Merge two vectors, rotate 2 positions
        Y[0 : P-1] = Y + T2;

        G[0 : P-1] = abs(X[0 : P-1]) + abs(Y[0 : P-1]);
        G[0 : P-1] = min(G[0 : P-1], bcst(255));   // Clipping/Saturation
    }
}