`
`32nd Annual International
`Symposium on Microarchitecture
`
`MICRO-32
`
`MEDIATEK, Ex. 1022, Page 1
`IPR2018-00102
`
`
`
`MEDIATEK,Ex. 1022, Page 2
`IPR2018-00102
`
`MEDIATEK, Ex. 1022, Page 2
`IPR2018-00102
`
`
`
`PROCEEDINGS
`
`32nd Annual International
`Symposium on Microarchitecture
`
`Haifa, Israel
`November 16 - 18, 1999
`
`Sponsored by
`IEEE TC-MARCH
`ACM SIGMICRO
`
`With the generous support of
`Intel, Microsoft, Hewlett-Packard, IBM, SGI
`
`SOCIETY
`
`Los Alamitos, California
`Washmgton
`0 Brussels
`
`0
`
`Tokyo
`
`MEDIATEK, Ex. 1022, Page 3
`IPR2018-00102
`
`
`
`Copyright 0 1999 by The Institute of Electrical and Electronics Engineers, Inc.
`All rights reserved
`
`Copyright and Reprint Permissions: Abstracting is permitted with credit to the source. Libraries
`may photocopy beyond the limits of US copyright law, for private use of patrons, those articles in
`this volume that carry a code at the bottom of the first page, provided that the per-copy fee
`indicated in the code is paid through the Copyright Clearance Center, 222 Rosewood Drive,
`Danvers, MA 01923.
`
`Other copying, reprint, or republication requests should be addressed to: IEEE Copyrights
`Manager, IEEE Service Center, 445 Hoes Lane, P.O. Box 133, Piscataway, NJ 08855- 133 1.
`
`The papers in this book comprise the proceedings of the meeting mentioned on the cover and title
`page. They reflect the authors’ opinions and, in the interests of timely dissemination, are
`published as presented and without change. Their inclusion in this publication does not
`necessarily constitute endorsement by the editors, the IEEE Coniputer Society, or the Institute of
`Electrical and Electronics Engineers, Inc.
`
`IEEE Computer Society Order Number PRO0437
`ISBN 0-7695-0437-X
`ISBN 0-7695-0438-8 (case)
`ISBN 0-7695-0439-6 (microfiche)
`ISSN Number 1072-445 1
`
`Additional copies may be ordered from:
`
`IEEE Computer Society
`Customer Service Center
`10662 Los Vaqueros Circle
`P.O. Box 3014
`Los Alamitos, CA 90720- 13 14
`Tel: + 1-714-821-8380
`Fax: + 1-7 14-82 1-464 1
`E-mail: cs.books@computer.org
`
`IEEE Service Center
`445 Hoes Lane
`P.O. Box I33 I
`Piscataway. NJ 08855- I33 I
`Tel: + 1-732-98 1-0060
`Fax: + 1-732-98 1-9667
`mis.custserv@computer.org
`
`IEEE Computer Society
`AsidPacific Office
`Watanabe Bldg.. 1-4-2
`Minami- Aoyama
`Minato-ku, Tokyo 107-0062
`JAPAN
`Tel: + 8 1-3-3408-3 1 I8
`Fax: + 8 1-3-3408-3553
`tokyo.ofc@computer.org
`
`Editorial production by Bob Werner
`Cover art production by Alex Torres
`Printed in the United States of America by The Printing House
`
`SOCIETY
`
`MEDIATEK, Ex. 1022, Page 4
`IPR2018-00102
`
`
`
`A Superscalar 3D Graphics Engine
`
`Andrew Wolfe and Derek B. Noonburg
`
`S3 Incorporated
`284 1 Mission College Blvd.
`Santa Clara, CA 95052
`{ awolfe,dnoonbur} @s3.com
`
`Abstract
`
`than any
`is increasing faster
`30 graphics performance
`other computing application.
`Almost all PC systems now
`include 30 graphics accelerators for games, CAD, or visu-
`alization applications. Many of the microarchitectural
`techniques
`that have been used to enhance the perfor-
`mance of microprocessors can be applied to graphics sys-
`tems as well. We present an architecture for an out-of-
`ordec superscalar rasterizer for 30 graphics. This allows
`the concurrent execution of multiple graphics primitives
`while maintaining exact sequential semantics. Experimen-
`tal results show 1.5-3.6X speedups on real applications
`using a simple model, similar to the results from many inte-
`ger benchmarks on superscalar processors. Enhanced
`techniques specific to 30 graphics for decomposing large
`triangles and breaking false dependence chains increase
`perfarmance to more than 10x a sequential system.
`
`1. Introduction
`
`introduction of graphical user
`Since the widespread
`interfaces in computers more than 1.5 years ago, special-
`purpose graphics accelerators have been an integral com-
`ponent of desktop computer systems. Recently, 3D appli-
`cations such as computer gaming, CAD, and visualization
`applications have been pushing high-performance
`3D
`acceleration from specialty workstations
`into mainstream
`PCs. The demand for increased 3D performance is cur-
`rently insatiable. Single-chip 3D accelerators today are 50
`times faster than those available 3-4 years ago. Each gen-
`eration substantially
`improves
`the quality of the images
`and yet another factor of 100 still would not produce truly
`realistic scenes.
`
`Like all computing systems, in order to achieve such
`rapid increases in performance, it is necessary to improve
`the microarchitecture of each generation
`to increasingly
`take advantage of parallelism.
`In many ways, the evolution
`of graphics microarchitectures has paralleled the develop-
`ment of microprocessor microarchitectures.
`In the past
`few years, graphics accelerators have used wider data
`words, registers, and buses to increase performance. PC
`graphics accelerators have moved in the past few years
`from 32 to 64 to 12%bit memory buses in much the same
`way that microprocessors grew from 4 to 8 to 16 and even-
`tually to 64 bits. This mechanism has diminishing
`returns,
`especially as applications move to a large number of small
`polygons in each scene.
`Current high-performance microprocessors all use
`instruction-level parallelism
`(ILP) to further increase per-
`formance.
`ILP exploits
`information about dependences
`between instructions
`to allow parallel execution of multiple
`instructions while maintaining
`the execution semantics of a
`sequential program [6]. Many different
`ILP mechanisms
`can be effectively employed to improve performance, how-
`ever dynamically
`scheduled, out-of-order,
`superscalar
`microprocessors are the commercially dominant microar-
`chitecture at the present time. This paper presents an
`approach for using dynamically scheduled, out-of-order,
`superscalar principles
`for increasing the available parallel-
`ism in a graphics accelerator while maintaining sequential
`execution semantics.
`In the abstract, it would seem that graphics, and particu-
`larly 3D graphics, is a massively parallel application
`that
`would not need ILP technology
`for high performance.
`In
`fact, in many simple graphics applications, it would be pos-
`sible to render each pixel independently; however, in prac-
`tice, graphics applications have very similar characteristics
`like
`to traditional sequential programs. Standard APIs
`Direct3D or OpenGL are used to create graphics applica-
`tions.
`These are translated via software drivers
`to a
`
`1072-445l/99 $10.00 0 1999 IEEE
`
`50
`
`MEDIATEK, Ex. 1022, Page 5
`IPR2018-00102
`
`
`
`to be rendered by a graph-
`sequence of graphics primitives
`ics engine that consists of some combination of additional
`hardware and software. The programming model assumes
`that these primitives will be executed sequentially and
`atomically, in much the same manner that it is assumed that
`instructions in a traditional sequential ISA are executed. In
`order to implement a parallel system that executes this
`sequence of primitives,
`the semantics of sequential execu-
`tion must be maintained. Several factors cause depen-
`dences between graphics primitives
`that can prevent
`concurrent execution.
`Z-buffering
`- A realistic 3D rendering systems must
`include hidden surface removal. Objects that are behind
`other objects from the perspective of the viewer should not
`be visible in the final image. A Z-buffer is used to imple-
`ment hidden-surface removal. The Z-buffer stores the dis-
`tance from the viewpoint
`to the currently drawn pixel so
`that when drawing a subsequent pixel, it can be determined
`if the new pixel is in front of or behind the currently drawn
`pixel. A well-implemented Z-buffering algorithm pro-
`duces the same result if the triangles are drawn in a differ-
`ent order, however, it requires an atomic read-modify-write
`of Z-buffer data. If two triangles are drawn concurrently
`and they attempt to concurrently
`read the same Z-buffer
`value, modify it, then write it using common read and write
`operations, incorrect results can be produced. A special
`type of dependence thus exists between any two primitives
`that may access the same Z-buffer value. They can execute
`in either order but not concurrently.
`is an operation that
`Alpha-blending
`- Alpha-blending
`uses a transparency value (alpha) to permit some portion of
`an occluded object to be visible
`through a foreground
`object. Unfortunately,
`the alpha-blending operation
`that
`implements
`transparency is order dependent. Program-
`mers will generate groups of primitives in a known order in
`order to overcome this limitation of alpha blending.
`If the
`rendering engine does not maintain the semantics of the
`sequence, the image will be incorrect.
`Dynamic
`textures, procedural
`textures, environment
`mapping - Realistic 3D graphics map an image (called a
`texture) onto each polygon. Often these textures have lim-
`ited lifetimes. Procedural textures are created on-the-fly by
`program code. Dynamic
`textures are loaded into
`the
`graphics system memory space from some backing store
`for a limited time. Environment mapping is a technique for
`reflections where some 3D objects are drawn then copied
`as an image to be mapped onto a reflective surface. In each
`of these cases, there is a dependence between the primi-
`tives that create the texture and the primitives
`that render
`the polygon that uses the texture.
`2D BLITs
`- It is advantageous to be able to mix 2D
`block copy and drawing operations with 3D rendering.
`If
`
`overlapping 2D objects are read or written out of order, the
`resulting image is incorrect.
`Direct
`frame buffer access - The common graphics
`APIs allow blocks of frame buffer memory to be directly
`read or written by the main processor. This requires that
`precise frame buffer state be available whenever a direct
`access executes. This creates a dependence between previ-
`ous primitives and a direct read or subsequent primitives
`and a direct write.
`As in general-purpose computing systems, it is possible
`to build massively parallel systems that provide excellent
`performance on a limited set of applications that have been
`programmed with parallel execution in mind. However, in
`order to be compatible with a large existing base of soft-
`ware using widely accepted programming
`interfaces and
`programming styles, it is necessary to detect the depen-
`dences between graphics primitives, extract independent
`them
`primitives
`from the instruction stream, and execute
`concurrently.
`In this paper we present a method
`to detect
`and represent dependences between graphics primitives, an
`out-of-order engine to execute multiple
`independent primi-
`tives concurrently, and a datapath architecture that provides
`concurrent resources for parallel execution.
`The following
`section of this paper provides a brief
`introduction
`to hardware-accelerated 3D graphics and the
`current mechanisms for implementing hardware. Section 3
`describes a new out-of-order
`ILP architecture for the ren-
`dering stage of the graphics pipeline and relates the ILP
`principles
`to well-known
`concepts
`in microprocessors.
`Section 4 describes simulation experiments that measure
`the increased parallelism provided by this microarchitec-
`ture under a range of implementation parameters. Section
`5 includes conclusions and suggestions for future investi-
`gations.
`
`2. Background
`
`2.1. 3D pipeline
`
`A 3D application creates a series of frames. Each of the
`objects in a frame is broken down into a collection of trian-
`gles, typically covering the surface of the object. From the
`point of view of a 3D engine, then, a frame consists of a
`collection of triangles. The 3D engine is responsible for
`turning this list into a rendered frame.
`in a 3D
`Each triangle
`is specified by three vertices
`space, one or more surface normals, and a description of
`texture, alpha blending
`how to draw the triangle’s surface:
`parameters, etc. The 3D pipeline (see Figure 1) takes the
`triangles, applies lighting
`to them, transforms them from
`the 3D space used by the application
`into 2D screen coor-
`
`51
`
`MEDIATEK, Ex. 1022, Page 6
`IPR2018-00102
`
`
`
`triangles
`
`rendered frame
`
`Figure 1: The 3D pipeline.
`
`dinates, and draws the appropriate pixels into the frame
`buffer [4,5,9].
`The first pipe stage, the world transform, converts the
`vertices and normals from object space, which may be dif-
`ferent for each object in the scene, to world space, shared
`by the entire scene. This transform consists of a matrix-
`vector multiplication
`for each vertex and each normal.
`The lighting stage takes the triangle’s color and surface
`normal(s) and computes the effect of one or more light
`sources. The result is a color at each vertex.
`The view
`transform converts the vertices from world
`space to camera space, with the viewer (or camera) at the
`origin. This uses a matrix-vector multiplication
`for each
`vertex.
`transform maps vertices from camera
`The projection
`space to the viewport. This includes the perspective trans-
`formation. At this point, the vertices are effectively
`two-
`dimensional, i.e., perspective effects (depth foreshortening)
`have been applied, and so the third (z) coordinate is only
`needed to indicate the relative front-to-back ordering of the
`vertices. Like the other two transforms, projection requires
`one matrix-vector multiplication per vertex.
`The clipping stage clips triangles to a view volume. Tri-
`angles which
`lie entirely off the side of the screen or
`behind the viewer are removed. Triangles which are par-
`tially out of bounds are trimmed, which generally requires
`splitting
`the resulting polygon into multiple triangles.
`The rasterization stage converts a triangle
`into pixels
`and computes the color value at each pixel. This includes
`visible-surface determination
`(dropping pixels which are
`obscured by a triangle closer to the viewer), texture map-
`ping, and alpha blending (transparency effects).
`The current generation of PC graphics accelerators per-
`form only the rasterization stage of the pipeline. The ear-
`lier stages are handled in software, and the transformed and
`lit vertices are then handed off to the hardware. Accord-
`ingly, the 3D rendering engine described in this paper per-
`forms the rasterization stage of the pipeline, and so this
`
`52
`
`span 0
`
`span 1
`
`span 2
`
`span 3
`
`span 4
`
`span 5
`
`span 6
`
`span 7
`
`span 6
`
`span 9
`
`to spans and pixels.
`a triangle
`Figure 2: Converting
`Each horizontal
`line of pixels
`is a span. The gray shad-
`ing shows pixels which are considered part of the tri-
`anale.
`
`part will now be described in more detail. Next generation
`PC 3D hardware will encompass the whole pipeline.
`The rasterizer first converts a triangle
`into a series of
`horizontal spans (see Figure 2). Pixels are centered at inte-
`ger x and y values (in Direct3D; OpenGL uses a different
`convention). One span is generated for each integer y
`value that falls inside the triangle. For each span, the ras-
`terizer computes the two endpoints. This includes interpo-
`lated color
`values and perspective-corrected
`texture
`coordinates. Next,
`the rasterizer generates the series of
`pixels along the span, again interpolating color and texture
`coordinates. Several operations are performed at each
`pixel. First the pixel’s z (depth) value is compared to the
`current z value for that pixel in the frame buffer.
`If the
`comparison indicates that this new pixel is behind the old
`one, the new pixel is discarded.
`If the z test succeeds, the
`new pixel color
`is computed. This can include
`texture
`mapping and alpha blending.
`
`2.2. Rasterizer
`
`the rasterizer can be
`
`The operations performed by
`divided into three categories:
`this includes
`l operations performed once per triangle:
`computing
`interpolation parameters for span endpoint
`generation
`includes
`l operations performed once per span: this
`computing the span endpoints and computing interpo-
`lation parameters for pixel generation
`l operations performed once per pixel: this is where the
`real work needed to compute each pixel is done
`
`MEDIATEK, Ex. 1022, Page 7
`IPR2018-00102
`
`
`
`pixel information
`
`2.3. Previous work
`
`There are many different approaches to high-perfor-
`mance 3D rendering hardware; we list a few here.
`Permedia 3 is a single-chip Direct3D/OpenGL
`rasterizer
`for high-end PCs [ 11.
`The Neon chip contains eight pixel pipes that interface
`to eight off-chip SDRAM channels
`[8]. Several clever
`techniques are used to maximize memory efficiency.
`Sony’s second-generation PlayStation chipset uses a
`MIPS processor with specialized vector units (the Emotion
`Engine) attached to a 16-pipe rasterizer with embedded
`DRAM (the Graphics Synthesizer) [2].
`The SGI InfiniteReality
`uses 12 different ASICs on a
`geometry board, multiple
`rasterizer boards, and a display
`generator board [ 101. Pixel computations are performed by
`up to 320 image engines.
`The PixelFlow architecture consists of a series of “flow
`units”, each of which has a geometry processor and a
`SIMD
`rasterizer board with embedded DRAM
`(“logic-
`enhanced memory”)
`[3]. Each flow unit is responsible for
`a fraction of the primitives;
`the outputs of all of the flow
`units are composited using the Z values.
`The Imagine architecture
`is a VLIW media processor
`with support for a unique “stream processing” model [ 111.
`
`3. Out-of-order
`
`superscalar
`
`rendering engine
`
`3.1. Graphics parallelism
`
`the device drivers,
`The application program, through
`provides a stream of graphics primitives
`to the graphics
`subsystem. These primitives can include 2D operations
`like block fills, lines, or block moves as well as 3D opera-
`tions such as triangles and 3D lines. There may be some
`available parallelism within each primitive, but as 3D
`applications become more sophisticated,
`they include a
`larger number of smaller triangles, each of which has mini-
`mal opportunity
`for parallelism.
`In current applications
`(see Section 4.3), triangles are typically hundreds of pixels.
`As this number decreases, the per-triangle setup time will
`become a larger and larger .component of overall run-time.
`The best mechanism for increasing performance is to exe-
`cute multiple graphics primitives concurrently.
`The transform,
`lighting, and clipping operations (see
`Figure 1) are massively parallel, and their latency is rela-
`tively unimportant. Therefore they can be implemented at
`arbitrarily high performance levels through the use of long
`pipelines or multiple parallel pipelines. The interesting
`parallelism problems exist at the rasterization stage where
`dependences exist. An out-of-order dispatch engine that
`detects dependences and maintains sequential semantics is
`
`memory
`
`‘i
`
`I
`
`alpha blend -1
`
`Figure 3: The pixel
`
`rasteritation
`
`pipeline.
`
`The pixel rasterization pipe is shown in Figure 3. The
`first two stages read the Z value from memory and compare
`it to the pixel’s Z value. If the pixel fails the Z test, i.e., if it
`is behind a previously drawn pixel, then it is dropped from
`the pipe. If it passes, the new Z value is written to memory.
`The next two stages read one or more texels from the tex-
`ture cache, filter them to produce a single texel, and blend
`the texel with
`the pixel color. Next, the alpha value is
`tested and completely
`transparent pixels are dropped.
`Finally, the destination pixel is read, alpha blended with the
`blended pixel/texture color, and written back to memory.
`Several of these pipe stages are optional, depending on
`mode settings for the current triangle. For example, if the
`triangle is opaque (no alpha value), the alpha test, destina-
`tion read, and alpha blend stages are unneeded. A real
`implementation of this pipeline might look somewhat dif-
`ferent, e.g., some of the texture computation might be done
`in parallel with the Z read because of memory latency.
`Assuming
`that sufficient memory bandwidth
`is avail-
`able, this pipeline has a peak throughput of one pixel per
`clock.
`
`53
`
`MEDIATEK, Ex. 1022, Page 8
`IPR2018-00102
`
`
`
`fetch queue
`(unexamined)
`
`1 M 3
`
`2
`
`4
`
`5
`
`a
`
`in flight
`(running)
`
`1
`
`fetch queue
`(unexamined)
`
`4d
`
`4
`
`5
`
`in flight
`(running)
`
`-
`1 Q 2
`
`Figure 4: A stream of primitives
`
`awaiting
`
`dispatch.
`
`used in the rasterizer to extract ILP (where each ‘instruc-
`tion’
`is a graphics primitive)
`from the graphics primitive
`stream. Multiple
`rasterization engines are used to provide
`Since
`execution
`resources for
`the graphics primitives.
`primitives can have long execution
`times, from several
`cycles to several hundred thousand cycles, it is not very
`important
`to dispatch more than one primitive per cycle,
`rather it is important
`to have many primitives executing
`concurrently. Also, since the primitives do many memory
`accesses and require an indeterminate number of cycles to
`execute, it is best to have the rasterization engines run inde-
`pendently, MIMD-style.
`
`3.2. In-order vs. out-of-order dispatch
`
`The basic idea is shown in Figures 4,.5, and 6. The host
`computer
`is submitting a sequential stream of graphics
`primitives
`to the accelerator (Figure 4).
`In the simplest
`implementation, primitives are dispatched from the head of
`the input list when they are ready. With in-order dispatch
`(Figure 5), two primitives have been dispatched to execu-
`tion engines. These primitives are either being executed or
`they are in some queue approved for immediate execution
`but waiting
`for resources. We don’t know or care in what
`order these primitives will begin execution or complete
`execution. The third primitive cannot be approved for exe-
`cution since it overlaps with the currently executing primi-
`tives.
`It will be dispatched when primitives 1 and 2 both
`complete.
`In order to extract additional parallelism, it is preferable
`to allow out of-of-order dispatch [6,7].
`In this case (Figure
`6), stalled operations are placed into central reservation
`stations and other operations
`that can proceed indepen-
`dently are allowed to bypass them.
`In the actual imple-
`mentation,
`all
`incoming
`primitives
`are placed
`in
`reservation stations and the oldest eligible candidate is dis-
`patched each cycle.
`
`lis-
`d
`in-order
`of Figure 4 after
`5: The stream
`Figure
`_
`.
`1 and
`Primitive
`3 is stalled because
`it overlaps
`patch.
`they arrived
`4 and 5 are stalled because
`2. Primitives
`after 3.
`
`fetch queue
`(unexamined)
`
`res. stations
`(waiting)
`
`in flight
`(running)
`
`1 /“:n
`A 4
`
`2
`
`A 5
`
`dis-
`Figure 6: The stream of Figure 4 after out-of-order
`patch. Primitive
`4 has been dispatched
`ahead of prim-
`itive 3.
`
`3.3. Dependence checking
`
`In order to implement an out-of-order dispatch engine,
`mechanisms are required
`for detecting dependences, for
`detecting ready operations, and for tracking completion of
`operations.
`In processors, dependence checking is accom-
`plished by comparing
`the source and destination register
`addresses of a candidate operation
`to those of previous
`operations in the sequential instruction stream. A corollary
`exists for graphics primitives. Any given graphics primi-
`tive has a set of destination pixels that it will write and may
`have one or more sets of source pixels that are read for the
`operation.
`In a typical 3D triangle operation, the destina-
`tion pixels are the ones drawn by the rasterization opera-
`tion and the source pixels may include
`the old pixels
`in
`those locations that will be alpha-blended by the triangle
`and the texture map that will be used to draw the triangle.
`Unlike
`in a processor, a single register number cannot
`describe the source or destination operands and it would be
`
`54
`
`MEDIATEK, Ex. 1022, Page 9
`IPR2018-00102
`
`
`
`bounding box
`
`Figure 7: A triangle’s
`
`bounding
`
`box.
`
`impractical to construct lists of source and destination pix-
`els in hardware and compare them for dependences.
`We resolve this problem by using a conservative, but
`simple-to-implement
`algorithm
`for detecting depen-
`dences. Every dispatched primitive reserves a region of the
`frame buffer corresponding to the bounding box surround-
`ing each of its operands. The sets of pixels that each prim-
`itive
`reads in order to complete are called
`its source
`operands and the bounding boxes surrounding each of the
`source operands are called the Source regions. The set of
`pixels that each primitive writes is called its destination
`operand and the bounding box around it is the destination
`region. Bounding boxes can be computed by taking the
`minimum and maximum values of the x and y coordinates
`of each vertex. See Figure 7.
`left and
`Four numbers, the coordinates of the lower
`upper right comers of the bounding box, can describe each
`region. For two regions RI = [(Umi”t6,i,),(a,,,b,,,)1
`and
`R, = [(X,i,,y,i*).(X,,,,y,,,)1,
`we can determine if
`the
`regions overlap by defining an overlap function:
`
`Graphics primitives can have several source regions as
`well as a destination regions, but for simplicity consider
`the case where there is one of each. To determine whether
`or not a candidate primitive P depends on a previously dis-
`patched primitive D, the function depend(P,D) can be com-
`puted.
`.Sp is the source region of P and Dp is the
`destination region of P. So and DD are the source and des-
`tination regions of D. If depend(P,D) is false then P can be
`dispatched concurrently with D:
`
`depend ( P, D)
`= (Sp n Do) + (Dp n Do) + VP n SD)
`
`= Up n DDWp n DDWp n SD)
`
`this means that for any two concurrently dis-
`Basically
`patched primitives,
`their source regions can overlap but
`neither source region can overlap with
`the destination
`region of the other primitive and the destination regions of
`the two primitives cannot overlap. This is a similar restric-
`tion to that which would be used to determine whether two
`in a processor. The
`instructions could issue concurrently
`first test prevents any RAW hazards caused by reading a
`source pixel prior to it being written by the earlier primi-
`
`tive. The second test prevents any WAW hazards caused
`by writing a pixel from the second primitive prior to writ-
`ing the same pixel from the first. The third test prevents
`any WAR hazards caused by writing a pixel from the sec-
`ond primitive prior to reading the same pixel from the first.
`The WAR hazard is somewhat unique in that it would be
`prevented in any well constructed processor pipeline by
`reading the source operands of the first instruction
`from the
`register file early in the pipeline and retaining a copy of the
`correct value until
`it is consumed in the pipeline.
`It is
`impractical to copy the values of all of the source pixels for
`a graphics primitive, so the source region in memory must
`be reserved until an operation completes, and the WAR test
`must be performed against any subsequently issuing primi-
`tives.
`the out-of-order
`The hardware required to implement
`dispatch mechanism is shown in Figures 8 and 9. A fetch
`unit holds a command queue containing graphics primi-
`tives. As reservation stations in the issue unit become
`available, primitives are passed to the issue unit for depen-
`dence testing and scheduling. The issue unit can issue one
`primitive per cycle to any one of several accelerators for
`rasterization. As each primitive
`is completed, the accelera-
`tor notifies the issue unit so that the primitive can be retired
`The accelerators
`and its reservation stations cleared.
`access memory through a global memory interface unit that
`in practice is likely
`to contain a large switch and many
`independent memory banks to provide adequate band-
`width. For our experiments, we have assumed an ideal
`multi-ported memory.
`The reservation mechanism uses three sets of registers
`to store data about pending and executing primitives. The
`destination
`reservation station
`indexes are the primary
`identifiers
`for a new primitive.
`It is assumed that every
`primitive has exactly one destination region and the bound-
`ing box for that region is stored in the destination reserva-
`tion station for that primitive. A primitive may also have
`zero, one, or two source regions. Source reservation sta-
`tions are allocated in a separate register file for each source
`region. The source register file contains a bounding box
`descriptor for each source region and a tag that is the index
`of the destination reservation station corresponding to each
`source. The candidate buffers hold relevant information
`about unissued primitives. This includes the opcode and
`additional operands for the primitive (such as vertex colors,
`alpha values, etc.) and a dependence vector that describes
`the restrictions on issue. A pointer to the opcode and oper-
`and can be included
`rather than the actual data. Once
`again, a tag is used to specify the address of the destination
`reservation station of the primitive. Every register has a
`valid bit that is used to indicate that the register contains
`live data.
`Ideally,
`the number of destination reservation
`
`55
`
`MEDIATEK, Ex. 1022, Page 10
`IPR2018-00102
`
`
`
`fetch unit
`
`Figure 8: Top-level architecture of the superscalar 3D rendering engine.
`
`candidate buffers
`
`destination
`reservation stations
`
`source
`reservation stations
`
`Figure 9: Registers used in the superscalar
`
`rendering engine.
`
`stations should equal the number of candidate buffers plus
`the number of accelerators.
`A primitive
`is moved into the issue unit from the fetch
`unit each cycle provided that there is an empty destination
`reservation station, an empty candidate buffer, and enough
`empty source reservation stations. Hardware in the fetch
`unit has already decoded the primitive
`type and computed
`the bounding box for each source and destination region.
`These bounding boxes are stored in the source and destina-
`tion reservation stations and the index of the destination
`reservation station is stored in the tag field of the source
`reservation stations and the candidate buffer. A depen-
`dence vector is computed using a variation of the depend
`function extended to support two source regions. The
`hardware to compute this dependence vector is shown in
`Figure 10. The bounding box coordinates (four numbers)
`of the new primitive are driven onto vertical buses for each
`operand. The bounding box coordinates for each valid res-
`ervation station are driven on horizontal buses. At the
`intersections
`that correspond to potential hazards, bound-
`ing box overlap comparators implement
`the overlap func-
`tion described earlier.
`A dependence vector
`is thus
`calculated by computing whether or not the new primitive
`
`from earlier
`tested primitive
`overlaps with each previously
`in the command sequence. Bit k in the dependence vector
`is set if the new primitive must wait for the primitive stored
`in destination
`reservation station k to complete.
`This
`dependence vector
`is stored
`in
`the candidate buffer
`reserved for the new primitive.
`At the same time, the issue unit is testing the existing
`issue candidates to see if any are ready to be issued. If any
`of the accelerators are available, then the issue unit tests all
`of the dependence vectors in the candidate buffers.
`If any
`valid candidate buffer contains a dependence vector of all
`zeros, then it can issue this cycle.
`If more than one candi-
`date can issue, one is selected at random. The candidate is
`issued by transferring
`the primitive data and the tag to an
`accelerator and clearing the valid bit to free the candidate
`buffer.
`in a
`Finally, any accelerators that complete a primitive
`given cycle return the tag of the completed primitive
`to the
`issue unit. The issue unit clears the valid bit from the cor-
`responding destination reservation station.
`It also passes
`the tag to all of the source reservation stations, which com-
`pare it to their own tag and clear their valid bit if the tags
`match. Finally,
`the tag is decoded and the corresponding
`
`56
`
`MEDIATEK, Ex. 1022, Page 11
`IPR2018-00102
`
`
`
`destination
`reservation
`
`stations
`
`new primitive
`
`new dependence
`vector
`
`source
`reservation
`
`stations
`
`box overlap
`
`comparator.
`
`is a bounding
`5
`4.5
`
`Each small circle
`hardware.
`Figure 10: Dependence-checking
`bit in the dependence vector of each candidate buffer is
`cleared. This removes the completed primitive
`from the
`reservation stations and clears any dependences in any
`pending primitives. Once all of the dependence bits in a
`candidate’s dependence vector are cleared, that candidate
`can issue on a subsequent cycle.
`The logic required to implement the iss