throbber
PROCEEDINGS
`
`32nd Annual International
`Symposium on Microarchitecture
`
`MICRO-32
`
`MEDIATEK, Ex. 1022, Page 1
`IPR2018-00102
`
`

`

`MEDIATEK, EX. 1022, Page 2
`
`IPR2018-00102
`
`MEDIATEK, Ex. 1022, Page 2
`IPR2018-00102
`
`

`

`PROCEEDINGS
`
`32nd Annual International
`Symposium on Microarchitecture
`
`Haifa, Israel
`November 16 - 18, 1999
`
`Sponsored by
`IEEE TC-MARCH
`ACM SIGMICRO
`
`With the generous support of
`Intel, Microsoft, Hewlett-Packard, IBM, SGI
`
`SOCIETY
`
`Los Alamitos, California
`Washmgton
`0 Brussels
`
`0
`
`Tokyo
`
`MEDIATEK, Ex. 1022, Page 3
`IPR2018-00102
`
`

`

`Copyright 0 1999 by The Institute of Electrical and Electronics Engineers, Inc.
`All rights reserved
`
`Copyright and Reprint Permissions: Abstracting is permitted with credit to the source. Libraries
`may photocopy beyond the limits of US copyright law, for private use of patrons, those articles in
`this volume that carry a code at the bottom of the first page, provided that the per-copy fee
`indicated in the code is paid through the Copyright Clearance Center, 222 Rosewood Drive,
`Danvers, MA 01923.
`
`Other copying, reprint, or republication requests should be addressed to: IEEE Copyrights
`Manager, IEEE Service Center, 445 Hoes Lane, P.O. Box 133, Piscataway, NJ 08855- 133 1.
`
`The papers in this book comprise the proceedings of the meeting mentioned on the cover and title
`page. They reflect the authors’ opinions and, in the interests of timely dissemination, are
`published as presented and without change. Their inclusion in this publication does not
`necessarily constitute endorsement by the editors, the IEEE Coniputer Society, or the Institute of
`Electrical and Electronics Engineers, Inc.
`
`IEEE Computer Society Order Number PRO0437
`ISBN 0-7695-0437-X
`ISBN 0-7695-0438-8 (case)
`ISBN 0-7695-0439-6 (microfiche)
`ISSN Number 1072-445 1
`
`Additional copies may be ordered from:
`
`IEEE Computer Society
`Customer Service Center
`10662 Los Vaqueros Circle
`P.O. Box 3014
`Los Alamitos, CA 90720- 13 14
`Tel: + 1-714-821-8380
`Fax: + 1-7 14-82 1-464 1
`E-mail: cs.books@computer.org
`
`IEEE Service Center
`445 Hoes Lane
`P.O. Box I33 I
`Piscataway. NJ 08855- I33 I
`Tel: + 1-732-98 1-0060
`Fax: + 1-732-98 1-9667
`mis.custserv@computer.org
`
`IEEE Computer Society
`AsidPacific Office
`Watanabe Bldg.. 1-4-2
`Minami- Aoyama
`Minato-ku, Tokyo 107-0062
`JAPAN
`Tel: + 8 1-3-3408-3 1 I8
`Fax: + 8 1-3-3408-3553
`tokyo.ofc@computer.org
`
`Editorial production by Bob Werner
`Cover art production by Alex Torres
`Printed in the United States of America by The Printing House
`
`SOCIETY
`
`MEDIATEK, Ex. 1022, Page 4
`IPR2018-00102
`
`

`

`A Superscalar 3D Graphics Engine
`
`Andrew Wolfe and Derek B. Noonburg
`
`S3 Incorporated
`284 1 Mission College Blvd.
`Santa Clara, CA 95052
`{ awolfe,dnoonbur} @s3.com
`
`Abstract
`
`than any
`is increasing faster
`30 graphics performance
`other computing application.
`Almost all PC systems now
`include 30 graphics accelerators for games, CAD, or visu-
`alization applications. Many of the microarchitectural
`techniques
`that have been used to enhance the perfor-
`mance of microprocessors can be applied to graphics sys-
`tems as well. We present an architecture for an out-of-
`ordec superscalar rasterizer for 30 graphics. This allows
`the concurrent execution of multiple graphics primitives
`while maintaining exact sequential semantics. Experimen-
`tal results show 1.5-3.6X speedups on real applications
`using a simple model, similar to the results from many inte-
`ger benchmarks on superscalar processors. Enhanced
`techniques specific to 30 graphics for decomposing large
`triangles and breaking false dependence chains increase
`perfarmance to more than 10x a sequential system.
`
`1. Introduction
`
`introduction of graphical user
`Since the widespread
`interfaces in computers more than 1.5 years ago, special-
`purpose graphics accelerators have been an integral com-
`ponent of desktop computer systems. Recently, 3D appli-
`cations such as computer gaming, CAD, and visualization
`applications have been pushing high-performance
`3D
`acceleration from specialty workstations
`into mainstream
`PCs. The demand for increased 3D performance is cur-
`rently insatiable. Single-chip 3D accelerators today are 50
`times faster than those available 3-4 years ago. Each gen-
`eration substantially
`improves
`the quality of the images
`and yet another factor of 100 still would not produce truly
`realistic scenes.
`
`Like all computing systems, in order to achieve such
`rapid increases in performance, it is necessary to improve
`the microarchitecture of each generation
`to increasingly
`take advantage of parallelism.
`In many ways, the evolution
`of graphics microarchitectures has paralleled the develop-
`ment of microprocessor microarchitectures.
`In the past
`few years, graphics accelerators have used wider data
`words, registers, and buses to increase performance. PC
`graphics accelerators have moved in the past few years
`from 32 to 64 to 12%bit memory buses in much the same
`way that microprocessors grew from 4 to 8 to 16 and even-
`tually to 64 bits. This mechanism has diminishing
`returns,
`especially as applications move to a large number of small
`polygons in each scene.
`Current high-performance microprocessors all use
`instruction-level parallelism
`(ILP) to further increase per-
`formance.
`ILP exploits
`information about dependences
`between instructions
`to allow parallel execution of multiple
`instructions while maintaining
`the execution semantics of a
`sequential program [6]. Many different
`ILP mechanisms
`can be effectively employed to improve performance, how-
`ever dynamically
`scheduled, out-of-order,
`superscalar
`microprocessors are the commercially dominant microar-
`chitecture at the present time. This paper presents an
`approach for using dynamically scheduled, out-of-order,
`superscalar principles
`for increasing the available parallel-
`ism in a graphics accelerator while maintaining sequential
`execution semantics.
`In the abstract, it would seem that graphics, and particu-
`larly 3D graphics, is a massively parallel application
`that
`would not need ILP technology
`for high performance.
`In
`fact, in many simple graphics applications, it would be pos-
`sible to render each pixel independently; however, in prac-
`tice, graphics applications have very similar characteristics
`like
`to traditional sequential programs. Standard APIs
`Direct3D or OpenGL are used to create graphics applica-
`tions.
`These are translated via software drivers
`to a
`
`1072-445l/99 $10.00 0 1999 IEEE
`
`50
`
`MEDIATEK, Ex. 1022, Page 5
`IPR2018-00102
`
`

`

`to be rendered by a graph-
`sequence of graphics primitives
`ics engine that consists of some combination of additional
`hardware and software. The programming model assumes
`that these primitives will be executed sequentially and
`atomically, in much the same manner that it is assumed that
`instructions in a traditional sequential ISA are executed. In
`order to implement a parallel system that executes this
`sequence of primitives,
`the semantics of sequential execu-
`tion must be maintained. Several factors cause depen-
`dences between graphics primitives
`that can prevent
`concurrent execution.
`Z-buffering
`- A realistic 3D rendering systems must
`include hidden surface removal. Objects that are behind
`other objects from the perspective of the viewer should not
`be visible in the final image. A Z-buffer is used to imple-
`ment hidden-surface removal. The Z-buffer stores the dis-
`tance from the viewpoint
`to the currently drawn pixel so
`that when drawing a subsequent pixel, it can be determined
`if the new pixel is in front of or behind the currently drawn
`pixel. A well-implemented Z-buffering algorithm pro-
`duces the same result if the triangles are drawn in a differ-
`ent order, however, it requires an atomic read-modify-write
`of Z-buffer data. If two triangles are drawn concurrently
`and they attempt to concurrently
`read the same Z-buffer
`value, modify it, then write it using common read and write
`operations, incorrect results can be produced. A special
`type of dependence thus exists between any two primitives
`that may access the same Z-buffer value. They can execute
`in either order but not concurrently.
`is an operation that
`Alpha-blending
`- Alpha-blending
`uses a transparency value (alpha) to permit some portion of
`an occluded object to be visible
`through a foreground
`object. Unfortunately,
`the alpha-blending operation
`that
`implements
`transparency is order dependent. Program-
`mers will generate groups of primitives in a known order in
`order to overcome this limitation of alpha blending.
`If the
`rendering engine does not maintain the semantics of the
`sequence, the image will be incorrect.
`Dynamic
`textures, procedural
`textures, environment
`mapping - Realistic 3D graphics map an image (called a
`texture) onto each polygon. Often these textures have lim-
`ited lifetimes. Procedural textures are created on-the-fly by
`program code. Dynamic
`textures are loaded into
`the
`graphics system memory space from some backing store
`for a limited time. Environment mapping is a technique for
`reflections where some 3D objects are drawn then copied
`as an image to be mapped onto a reflective surface. In each
`of these cases, there is a dependence between the primi-
`tives that create the texture and the primitives
`that render
`the polygon that uses the texture.
`2D BLITs
`- It is advantageous to be able to mix 2D
`block copy and drawing operations with 3D rendering.
`If
`
`overlapping 2D objects are read or written out of order, the
`resulting image is incorrect.
`Direct
`frame buffer access - The common graphics
`APIs allow blocks of frame buffer memory to be directly
`read or written by the main processor. This requires that
`precise frame buffer state be available whenever a direct
`access executes. This creates a dependence between previ-
`ous primitives and a direct read or subsequent primitives
`and a direct write.
`As in general-purpose computing systems, it is possible
`to build massively parallel systems that provide excellent
`performance on a limited set of applications that have been
`programmed with parallel execution in mind. However, in
`order to be compatible with a large existing base of soft-
`ware using widely accepted programming
`interfaces and
`programming styles, it is necessary to detect the depen-
`dences between graphics primitives, extract independent
`them
`primitives
`from the instruction stream, and execute
`concurrently.
`In this paper we present a method
`to detect
`and represent dependences between graphics primitives, an
`out-of-order engine to execute multiple
`independent primi-
`tives concurrently, and a datapath architecture that provides
`concurrent resources for parallel execution.
`The following
`section of this paper provides a brief
`introduction
`to hardware-accelerated 3D graphics and the
`current mechanisms for implementing hardware. Section 3
`describes a new out-of-order
`ILP architecture for the ren-
`dering stage of the graphics pipeline and relates the ILP
`principles
`to well-known
`concepts
`in microprocessors.
`Section 4 describes simulation experiments that measure
`the increased parallelism provided by this microarchitec-
`ture under a range of implementation parameters. Section
`5 includes conclusions and suggestions for future investi-
`gations.
`
`2. Background
`
`2.1. 3D pipeline
`
`A 3D application creates a series of frames. Each of the
`objects in a frame is broken down into a collection of trian-
`gles, typically covering the surface of the object. From the
`point of view of a 3D engine, then, a frame consists of a
`collection of triangles. The 3D engine is responsible for
`turning this list into a rendered frame.
`in a 3D
`Each triangle
`is specified by three vertices
`space, one or more surface normals, and a description of
`texture, alpha blending
`how to draw the triangle’s surface:
`parameters, etc. The 3D pipeline (see Figure 1) takes the
`triangles, applies lighting
`to them, transforms them from
`the 3D space used by the application
`into 2D screen coor-
`
`51
`
`MEDIATEK, Ex. 1022, Page 6
`IPR2018-00102
`
`

`

`triangles
`
`rendered frame
`
`Figure 1: The 3D pipeline.
`
`dinates, and draws the appropriate pixels into the frame
`buffer [4,5,9].
`The first pipe stage, the world transform, converts the
`vertices and normals from object space, which may be dif-
`ferent for each object in the scene, to world space, shared
`by the entire scene. This transform consists of a matrix-
`vector multiplication
`for each vertex and each normal.
`The lighting stage takes the triangle’s color and surface
`normal(s) and computes the effect of one or more light
`sources. The result is a color at each vertex.
`The view
`transform converts the vertices from world
`space to camera space, with the viewer (or camera) at the
`origin. This uses a matrix-vector multiplication
`for each
`vertex.
`transform maps vertices from camera
`The projection
`space to the viewport. This includes the perspective trans-
`formation. At this point, the vertices are effectively
`two-
`dimensional, i.e., perspective effects (depth foreshortening)
`have been applied, and so the third (z) coordinate is only
`needed to indicate the relative front-to-back ordering of the
`vertices. Like the other two transforms, projection requires
`one matrix-vector multiplication per vertex.
`The clipping stage clips triangles to a view volume. Tri-
`angles which
`lie entirely off the side of the screen or
`behind the viewer are removed. Triangles which are par-
`tially out of bounds are trimmed, which generally requires
`splitting
`the resulting polygon into multiple triangles.
`The rasterization stage converts a triangle
`into pixels
`and computes the color value at each pixel. This includes
`visible-surface determination
`(dropping pixels which are
`obscured by a triangle closer to the viewer), texture map-
`ping, and alpha blending (transparency effects).
`The current generation of PC graphics accelerators per-
`form only the rasterization stage of the pipeline. The ear-
`lier stages are handled in software, and the transformed and
`lit vertices are then handed off to the hardware. Accord-
`ingly, the 3D rendering engine described in this paper per-
`forms the rasterization stage of the pipeline, and so this
`
`52
`
`span 0
`
`span 1
`
`span 2
`
`span 3
`
`span 4
`
`span 5
`
`span 6
`
`span 7
`
`span 6
`
`span 9
`
`to spans and pixels.
`a triangle
`Figure 2: Converting
`Each horizontal
`line of pixels
`is a span. The gray shad-
`ing shows pixels which are considered part of the tri-
`anale.
`
`part will now be described in more detail. Next generation
`PC 3D hardware will encompass the whole pipeline.
`The rasterizer first converts a triangle
`into a series of
`horizontal spans (see Figure 2). Pixels are centered at inte-
`ger x and y values (in Direct3D; OpenGL uses a different
`convention). One span is generated for each integer y
`value that falls inside the triangle. For each span, the ras-
`terizer computes the two endpoints. This includes interpo-
`lated color
`values and perspective-corrected
`texture
`coordinates. Next,
`the rasterizer generates the series of
`pixels along the span, again interpolating color and texture
`coordinates. Several operations are performed at each
`pixel. First the pixel’s z (depth) value is compared to the
`current z value for that pixel in the frame buffer.
`If the
`comparison indicates that this new pixel is behind the old
`one, the new pixel is discarded.
`If the z test succeeds, the
`new pixel color
`is computed. This can include
`texture
`mapping and alpha blending.
`
`2.2. Rasterizer
`
`the rasterizer can be
`
`The operations performed by
`divided into three categories:
`this includes
`l operations performed once per triangle:
`computing
`interpolation parameters for span endpoint
`generation
`includes
`l operations performed once per span: this
`computing the span endpoints and computing interpo-
`lation parameters for pixel generation
`l operations performed once per pixel: this is where the
`real work needed to compute each pixel is done
`
`MEDIATEK, Ex. 1022, Page 7
`IPR2018-00102
`
`

`

`pixel information
`
`2.3. Previous work
`
`There are many different approaches to high-perfor-
`mance 3D rendering hardware; we list a few here.
`Permedia 3 is a single-chip Direct3D/OpenGL
`rasterizer
`for high-end PCs [ 11.
`The Neon chip contains eight pixel pipes that interface
`to eight off-chip SDRAM channels
`[8]. Several clever
`techniques are used to maximize memory efficiency.
`Sony’s second-generation PlayStation chipset uses a
`MIPS processor with specialized vector units (the Emotion
`Engine) attached to a 16-pipe rasterizer with embedded
`DRAM (the Graphics Synthesizer) [2].
`The SGI InfiniteReality
`uses 12 different ASICs on a
`geometry board, multiple
`rasterizer boards, and a display
`generator board [ 101. Pixel computations are performed by
`up to 320 image engines.
`The PixelFlow architecture consists of a series of “flow
`units”, each of which has a geometry processor and a
`SIMD
`rasterizer board with embedded DRAM
`(“logic-
`enhanced memory”)
`[3]. Each flow unit is responsible for
`a fraction of the primitives;
`the outputs of all of the flow
`units are composited using the Z values.
`The Imagine architecture
`is a VLIW media processor
`with support for a unique “stream processing” model [ 111.
`
`3. Out-of-order
`
`superscalar
`
`rendering engine
`
`3.1. Graphics parallelism
`
`the device drivers,
`The application program, through
`provides a stream of graphics primitives
`to the graphics
`subsystem. These primitives can include 2D operations
`like block fills, lines, or block moves as well as 3D opera-
`tions such as triangles and 3D lines. There may be some
`available parallelism within each primitive, but as 3D
`applications become more sophisticated,
`they include a
`larger number of smaller triangles, each of which has mini-
`mal opportunity
`for parallelism.
`In current applications
`(see Section 4.3), triangles are typically hundreds of pixels.
`As this number decreases, the per-triangle setup time will
`become a larger and larger .component of overall run-time.
`The best mechanism for increasing performance is to exe-
`cute multiple graphics primitives concurrently.
`The transform,
`lighting, and clipping operations (see
`Figure 1) are massively parallel, and their latency is rela-
`tively unimportant. Therefore they can be implemented at
`arbitrarily high performance levels through the use of long
`pipelines or multiple parallel pipelines. The interesting
`parallelism problems exist at the rasterization stage where
`dependences exist. An out-of-order dispatch engine that
`detects dependences and maintains sequential semantics is
`
`memory
`
`‘i
`
`I
`
`alpha blend -1
`
`Figure 3: The pixel
`
`rasteritation
`
`pipeline.
`
`The pixel rasterization pipe is shown in Figure 3. The
`first two stages read the Z value from memory and compare
`it to the pixel’s Z value. If the pixel fails the Z test, i.e., if it
`is behind a previously drawn pixel, then it is dropped from
`the pipe. If it passes, the new Z value is written to memory.
`The next two stages read one or more texels from the tex-
`ture cache, filter them to produce a single texel, and blend
`the texel with
`the pixel color. Next, the alpha value is
`tested and completely
`transparent pixels are dropped.
`Finally, the destination pixel is read, alpha blended with the
`blended pixel/texture color, and written back to memory.
`Several of these pipe stages are optional, depending on
`mode settings for the current triangle. For example, if the
`triangle is opaque (no alpha value), the alpha test, destina-
`tion read, and alpha blend stages are unneeded. A real
`implementation of this pipeline might look somewhat dif-
`ferent, e.g., some of the texture computation might be done
`in parallel with the Z read because of memory latency.
`Assuming
`that sufficient memory bandwidth
`is avail-
`able, this pipeline has a peak throughput of one pixel per
`clock.
`
`53
`
`MEDIATEK, Ex. 1022, Page 8
`IPR2018-00102
`
`

`

`fetch queue
`(unexamined)
`
`1 M 3
`
`2
`
`4
`
`5
`
`a
`
`in flight
`(running)
`
`1
`
`fetch queue
`(unexamined)
`
`4d
`
`4
`
`5
`
`in flight
`(running)
`
`-
`1 Q 2
`
`Figure 4: A stream of primitives
`
`awaiting
`
`dispatch.
`
`used in the rasterizer to extract ILP (where each ‘instruc-
`tion’
`is a graphics primitive)
`from the graphics primitive
`stream. Multiple
`rasterization engines are used to provide
`Since
`execution
`resources for
`the graphics primitives.
`primitives can have long execution
`times, from several
`cycles to several hundred thousand cycles, it is not very
`important
`to dispatch more than one primitive per cycle,
`rather it is important
`to have many primitives executing
`concurrently. Also, since the primitives do many memory
`accesses and require an indeterminate number of cycles to
`execute, it is best to have the rasterization engines run inde-
`pendently, MIMD-style.
`
`3.2. In-order vs. out-of-order dispatch
`
`The basic idea is shown in Figures 4,.5, and 6. The host
`computer
`is submitting a sequential stream of graphics
`primitives
`to the accelerator (Figure 4).
`In the simplest
`implementation, primitives are dispatched from the head of
`the input list when they are ready. With in-order dispatch
`(Figure 5), two primitives have been dispatched to execu-
`tion engines. These primitives are either being executed or
`they are in some queue approved for immediate execution
`but waiting
`for resources. We don’t know or care in what
`order these primitives will begin execution or complete
`execution. The third primitive cannot be approved for exe-
`cution since it overlaps with the currently executing primi-
`tives.
`It will be dispatched when primitives 1 and 2 both
`complete.
`In order to extract additional parallelism, it is preferable
`to allow out of-of-order dispatch [6,7].
`In this case (Figure
`6), stalled operations are placed into central reservation
`stations and other operations
`that can proceed indepen-
`dently are allowed to bypass them.
`In the actual imple-
`mentation,
`all
`incoming
`primitives
`are placed
`in
`reservation stations and the oldest eligible candidate is dis-
`patched each cycle.
`
`lis-
`d
`in-order
`of Figure 4 after
`5: The stream
`Figure
`_
`.
`1 and
`Primitive
`3 is stalled because
`it overlaps
`patch.
`they arrived
`4 and 5 are stalled because
`2. Primitives
`after 3.
`
`fetch queue
`(unexamined)
`
`res. stations
`(waiting)
`
`in flight
`(running)
`
`1 /“:n
`A 4
`
`2
`
`A 5
`
`dis-
`Figure 6: The stream of Figure 4 after out-of-order
`patch. Primitive
`4 has been dispatched
`ahead of prim-
`itive 3.
`
`3.3. Dependence checking
`
`In order to implement an out-of-order dispatch engine,
`mechanisms are required
`for detecting dependences, for
`detecting ready operations, and for tracking completion of
`operations.
`In processors, dependence checking is accom-
`plished by comparing
`the source and destination register
`addresses of a candidate operation
`to those of previous
`operations in the sequential instruction stream. A corollary
`exists for graphics primitives. Any given graphics primi-
`tive has a set of destination pixels that it will write and may
`have one or more sets of source pixels that are read for the
`operation.
`In a typical 3D triangle operation, the destina-
`tion pixels are the ones drawn by the rasterization opera-
`tion and the source pixels may include
`the old pixels
`in
`those locations that will be alpha-blended by the triangle
`and the texture map that will be used to draw the triangle.
`Unlike
`in a processor, a single register number cannot
`describe the source or destination operands and it would be
`
`54
`
`MEDIATEK, Ex. 1022, Page 9
`IPR2018-00102
`
`

`

`bounding box
`
`Figure 7: A triangle’s
`
`bounding
`
`box.
`
`impractical to construct lists of source and destination pix-
`els in hardware and compare them for dependences.
`We resolve this problem by using a conservative, but
`simple-to-implement
`algorithm
`for detecting depen-
`dences. Every dispatched primitive reserves a region of the
`frame buffer corresponding to the bounding box surround-
`ing each of its operands. The sets of pixels that each prim-
`itive
`reads in order to complete are called
`its source
`operands and the bounding boxes surrounding each of the
`source operands are called the Source regions. The set of
`pixels that each primitive writes is called its destination
`operand and the bounding box around it is the destination
`region. Bounding boxes can be computed by taking the
`minimum and maximum values of the x and y coordinates
`of each vertex. See Figure 7.
`left and
`Four numbers, the coordinates of the lower
`upper right comers of the bounding box, can describe each
`region. For two regions RI = [(Umi”t6,i,),(a,,,b,,,)1
`and
`R, = [(X,i,,y,i*).(X,,,,y,,,)1,
`we can determine if
`the
`regions overlap by defining an overlap function:
`
`Graphics primitives can have several source regions as
`well as a destination regions, but for simplicity consider
`the case where there is one of each. To determine whether
`or not a candidate primitive P depends on a previously dis-
`patched primitive D, the function depend(P,D) can be com-
`puted.
`.Sp is the source region of P and Dp is the
`destination region of P. So and DD are the source and des-
`tination regions of D. If depend(P,D) is false then P can be
`dispatched concurrently with D:
`
`depend ( P, D)
`= (Sp n Do) + (Dp n Do) + VP n SD)
`
`= Up n DDWp n DDWp n SD)
`
`this means that for any two concurrently dis-
`Basically
`patched primitives,
`their source regions can overlap but
`neither source region can overlap with
`the destination
`region of the other primitive and the destination regions of
`the two primitives cannot overlap. This is a similar restric-
`tion to that which would be used to determine whether two
`in a processor. The
`instructions could issue concurrently
`first test prevents any RAW hazards caused by reading a
`source pixel prior to it being written by the earlier primi-
`
`tive. The second test prevents any WAW hazards caused
`by writing a pixel from the second primitive prior to writ-
`ing the same pixel from the first. The third test prevents
`any WAR hazards caused by writing a pixel from the sec-
`ond primitive prior to reading the same pixel from the first.
`The WAR hazard is somewhat unique in that it would be
`prevented in any well constructed processor pipeline by
`reading the source operands of the first instruction
`from the
`register file early in the pipeline and retaining a copy of the
`correct value until
`it is consumed in the pipeline.
`It is
`impractical to copy the values of all of the source pixels for
`a graphics primitive, so the source region in memory must
`be reserved until an operation completes, and the WAR test
`must be performed against any subsequently issuing primi-
`tives.
`the out-of-order
`The hardware required to implement
`dispatch mechanism is shown in Figures 8 and 9. A fetch
`unit holds a command queue containing graphics primi-
`tives. As reservation stations in the issue unit become
`available, primitives are passed to the issue unit for depen-
`dence testing and scheduling. The issue unit can issue one
`primitive per cycle to any one of several accelerators for
`rasterization. As each primitive
`is completed, the accelera-
`tor notifies the issue unit so that the primitive can be retired
`The accelerators
`and its reservation stations cleared.
`access memory through a global memory interface unit that
`in practice is likely
`to contain a large switch and many
`independent memory banks to provide adequate band-
`width. For our experiments, we have assumed an ideal
`multi-ported memory.
`The reservation mechanism uses three sets of registers
`to store data about pending and executing primitives. The
`destination
`reservation station
`indexes are the primary
`identifiers
`for a new primitive.
`It is assumed that every
`primitive has exactly one destination region and the bound-
`ing box for that region is stored in the destination reserva-
`tion station for that primitive. A primitive may also have
`zero, one, or two source regions. Source reservation sta-
`tions are allocated in a separate register file for each source
`region. The source register file contains a bounding box
`descriptor for each source region and a tag that is the index
`of the destination reservation station corresponding to each
`source. The candidate buffers hold relevant information
`about unissued primitives. This includes the opcode and
`additional operands for the primitive (such as vertex colors,
`alpha values, etc.) and a dependence vector that describes
`the restrictions on issue. A pointer to the opcode and oper-
`and can be included
`rather than the actual data. Once
`again, a tag is used to specify the address of the destination
`reservation station of the primitive. Every register has a
`valid bit that is used to indicate that the register contains
`live data.
`Ideally,
`the number of destination reservation
`
`55
`
`MEDIATEK, Ex. 1022, Page 10
`IPR2018-00102
`
`

`

`fetch unit
`
`Figure 8: Top-level architecture of the superscalar 3D rendering engine.
`
`candidate buffers
`
`destination
`reservation stations
`
`source
`reservation stations
`
`Figure 9: Registers used in the superscalar
`
`rendering engine.
`
`stations should equal the number of candidate buffers plus
`the number of accelerators.
`A primitive
`is moved into the issue unit from the fetch
`unit each cycle provided that there is an empty destination
`reservation station, an empty candidate buffer, and enough
`empty source reservation stations. Hardware in the fetch
`unit has already decoded the primitive
`type and computed
`the bounding box for each source and destination region.
`These bounding boxes are stored in the source and destina-
`tion reservation stations and the index of the destination
`reservation station is stored in the tag field of the source
`reservation stations and the candidate buffer. A depen-
`dence vector is computed using a variation of the depend
`function extended to support two source regions. The
`hardware to compute this dependence vector is shown in
`Figure 10. The bounding box coordinates (four numbers)
`of the new primitive are driven onto vertical buses for each
`operand. The bounding box coordinates for each valid res-
`ervation station are driven on horizontal buses. At the
`intersections
`that correspond to potential hazards, bound-
`ing box overlap comparators implement
`the overlap func-
`tion described earlier.
`A dependence vector
`is thus
`calculated by computing whether or not the new primitive
`
`from earlier
`tested primitive
`overlaps with each previously
`in the command sequence. Bit k in the dependence vector
`is set if the new primitive must wait for the primitive stored
`in destination
`reservation station k to complete.
`This
`dependence vector
`is stored
`in
`the candidate buffer
`reserved for the new primitive.
`At the same time, the issue unit is testing the existing
`issue candidates to see if any are ready to be issued. If any
`of the accelerators are available, then the issue unit tests all
`of the dependence vectors in the candidate buffers.
`If any
`valid candidate buffer contains a dependence vector of all
`zeros, then it can issue this cycle.
`If more than one candi-
`date can issue, one is selected at random. The candidate is
`issued by transferring
`the primitive data and the tag to an
`accelerator and clearing the valid bit to free the candidate
`buffer.
`in a
`Finally, any accelerators that complete a primitive
`given cycle return the tag of the completed primitive
`to the
`issue unit. The issue unit clears the valid bit from the cor-
`responding destination reservation station.
`It also passes
`the tag to all of the source reservation stations, which com-
`pare it to their own tag and clear their valid bit if the tags
`match. Finally,
`the tag is decoded and the corresponding
`
`56
`
`MEDIATEK, Ex. 1022, Page 11
`IPR2018-00102
`
`

`

`destination
`reservation
`
`stations
`
`new primitive
`
`new dependence
`vector
`
`source
`reservation
`
`stations
`
`box overlap
`
`comparator.
`
`is a bounding
`5
`4.5
`
`Each small circle
`hardware.
`Figure 10: Dependence-checking
`bit in the dependence vector of each candidate buffer is
`cleared. This removes the completed primitive
`from the
`reservation stations and clears any dependences in any
`pending primitives. Once all of the dependence bits in a
`candidate’s dependence vector are cleared, that candidate
`can issue on a subsequent cycle.
`The logic required to implement the i

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket