`
`United States Patent
`Wolfe
`
`(10) Patent No.:
`(45) Date of Patent:
`
`US 7,079,133 B2
`Jul.18, 2006
`
`US0070791.33B2
`
`(54) SUPERSCALAR 3D GRAPHICS ENGINE
`(75) Inventor: Andrew Wolfe, Los Gatos, CA (US)
`(73) Assignee: S3 Graphics Co., Ltd., Grand Cayman
`(KN)
`
`- r
`(*) Notice:
`
`Subject to any disclaimer, the term of this
`patent is extended or adjusted under 35
`U.S.C. 154(b) by 0 days.
`
`(21) Appl. No.: 11/069,821
`
`(22) Filed:
`
`Feb. 28, 2005
`
`(65)
`
`Prior Publication Data
`US 2005/O195197 A1
`Sep. 8, 2005
`Related U.S. Application Data
`63) Continuati
`faroplication No. 09/715,701 filed
`(63) Continuation of application No
`, /UT, Illed On
`Nov. 16, 2000.
`(51) Int. Cl.
`(2006.01)
`G06T IS/00
`(2006.01)
`G06T L/20
`(2006.01)
`G06F 5/16
`(52) U.S. Cl. ...................... 345/419; 34.5/502; 34.5/506;
`345/574
`(58) Field of Classification Search ........ 345/418-428,
`345/501-506, 537 538,520, 522,553,574,
`345/545, 530: 382/103, 190; 711/100, 117
`See application file for complete search history.
`References Cited
`
`(56)
`
`U.S. PATENT DOCUMENTS
`S. A
`s
`g NS et al.
`4.967:39. A
`10, 1990
`less al
`5,010,515 A
`4/1991 Torborg, Jr.
`5,325,485 A
`6, 1994 Hockmuth et al.
`5,347,618 A
`9/1994 Akeley
`
`5,357,599 A 10/1994 Luken
`5,392,385 A
`2/1995 Evangelisti
`5,408,605 A
`4/1995 Deering
`5,422,991 A
`6, 1995 Fowler
`5,440,682 A
`8/1995 Deering
`5,452,412 A
`9/1995 Johnson, Jr. et al.
`5.457,775 A 10/1995 Johnson, Jr. et al.
`5,459,835 A 10, 1995 Trevett
`5.499,326 A
`3/1996 Narayanaswami
`5.502.462 A
`3, 1996 Mical et al.
`5,539,911 A
`7/1996 Nguyen et al.
`5,560,032 A
`9/1996 Nguyen et al.
`5,572,634 A * 11/1996 Duluk, Jr. ................... 345/419
`5,572,657 A 11/1996 Pindeo et al.
`(Continued)
`OTHER PUBLICATIONS
`Wolfe, Andrew et al., “A Superscalar 3D Graphics Engine.”
`s All Symposium on Microarchitecture, Nov. 17,
`, Israel.
`Primary Examiner Richard Hjerpe
`y
`Assistant Examiner—Wesner Sajous
`(74) Attorney, Agent, or Firm—Carr & Ferrell LLP
`
`ABSTRACT
`(57)
`A method, apparatus and computer program product for
`parallel execution of primitives in 3D graphics engines. It
`includes detection and preservation of dependences between
`graphics primitives with the ability to execute multiple
`independent primitives concurrently while preserving their
`ordering because the architecture of the graphics engine for
`the present invention further provides concurrent resources
`for parallel execution. In a first preferred embodiment,
`primitives are executed in parallel using an in-order dispatch
`unit capable of detecting dependencies between primitives.
`In a second preferred embodiment, an out-of-order dispatch
`unit is used such that not only are primitives executed
`concurrently; but, the primitives may be executed in any
`order when dependencies are detected.
`
`10 Claims, 8 Drawing Sheets
`
`St.
`
`
`
`S46a
`
`
`
`issue unit
`
`6.a 0
`
`22.2-(2-r
`Z2A gra-
`candidate 5ters
`eSWalt alon
`suetus - stro
`
`accelerator i
`
`acceleraic 8
`
`
`
`
`
`T-cache
`
`T-cache
`
`merriory interface unit (MU}
`
`SS a
`
`cardigate bufers
`
`destratii
`resertail StarS
`
`
`
`Site
`reservator slators
`
`w
`
`tag
`
`rimitive
`
`ce wector
`
`y
`
`trixx
`
`MEDIATEK, Ex. 1023, Page 1
`IPR2018-00102
`
`
`
`US 7,079,133 B2
`Page 2
`
`U.S. PATENT DOCUMENTS
`5,574,835 A 1 1/1996 Duluk, Jr. et al.
`5,574,836 A 1 1/1996 Broemmelsiek
`5,574,847 A 1 1/1996 Eckart et al.
`5,596,686 A
`1/1997 Duluk, Jr.
`5,602,570 A
`2/1997 Capps et al.
`5,606,657 A
`2f1997 Dennison et al.
`5,632,030 A
`5, 1997 Takano et al.
`5,649,173 A
`7, 1997 Lentz
`6,108,014 A
`8/2000 Dye
`
`4/2001 Rosman et al. ............. 345/419
`6,222,550 B1
`3. R S. sing et al.
`- - -
`a
`6,417,848 B1
`7/2002 Battle
`6,552,723 B1
`4/2003 Duluk et al.
`6,611.264 B1
`8/2003 Regan
`2002/0158865 A1 * 10/2002 Dye et al. ................... 345/419
`2004/O130552 A1* 7/2004 Duluk et al. ................ 345,506
`
`* cited by examiner
`
`MEDIATEK, Ex. 1023, Page 2
`IPR2018-00102
`
`
`
`U.S. Patent
`
`Jul.18, 2006
`
`Sheet 1 of 8
`
`US 7,079,133 B2
`
`triangles
`
`
`
`world
`transform
`
`E:
`
`i
`
`i
`
`projection
`transform
`
`Figure
`
`clipping Hrasterization
`/2
`(-e J.
`
`rendered trame
`
`MEDIATEK, Ex. 1023, Page 3
`IPR2018-00102
`
`
`
`U.S. Patent
`
`Jul.18, 2006
`
`Sheet 2 of 8
`
`US 7,079,133 B2
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`bdel
`
`Figure 2:
`
`Span 0
`
`
`
`><><>KR-><><
`GGGEGGG
`as feas
`Eas
`SEA
`><>3.255.22
`V. NARS Ssss
`asygia:NA Y
`
`Span 3
`
`span 4
`span 5
`Spar 8
`
`Spar 7
`-> span 8
`fire
`>>3><-> --><
`tail it is ti -> span 9
`
`
`
`MEDIATEK, Ex. 1023, Page 4
`IPR2018-00102
`
`
`
`U.S. Patent
`
`Jul.18, 2006
`
`Sheet 3 of 8
`
`US 7,079,133 B2
`
`fatch Queue
`(unexamined)
`
`in flight
`(running)
`
`/XA
`
`FIG. 3a
`
`fetch queue
`(unexamined)
`
`4.
`/X
`
`FIG. 3b
`
`ir flight
`(running)
`
`MEDIATEK, Ex. 1023, Page 5
`IPR2018-00102
`
`
`
`U.S. Patent
`
`Jul. 18, 2006
`
`Sheet 4 of 8
`
`US 7,079,133 B2
`
`fetch queue
`(unexamined)
`
`in flight
`(running)
`
`FIG. 4a
`
`letch queue
`(unexamined)
`
`in flight
`(running)
`
`AN
`
`fetch dueue
`(unexamined)
`
`s
`
`FIG. 4b
`
`res. Stations
`(waiting)
`
`in flight
`(running)
`
`4|AA
`A
`
`FIG. 4c
`
`MEDIATEK, Ex. 1023, Page 6
`IPR2018-00102
`
`
`
`U.S. Patent
`
`Jul. 18, 2006
`
`Sheet 5 of8
`
`US 7,079,133 B2
`
`fetch unit
`
`issue unit
`
`Ssa é
`
`memory interface unit (MIU)
`
`ew,
`Soa
`
`candioate buffers
`
`destination
`reservation) Slalons
`
`source
`reservation slavons
`
`
`
`FIG. 5
`
`C22a-nv
`
`MEDIATEK,Ex. 1023, Page 7
`IPR2018-00102
`
`MEDIATEK, Ex. 1023, Page 7
`IPR2018-00102
`
`
`
`U.S. Patent
`
`Jul.18, 2006
`
`Sheet 6 of 8
`
`US 7,079,133 B2
`
`MOVE PLURALITY OF
`PRIMTVES INTO ISSUE UNIT
`
`601
`
`STORE DESTINATION
`REGION AND SOURCE
`REGION BOUNDING BOX
`COORDINATES IN
`RESERVATION STATIONS
`
`602
`
`BEGIN EXECUT ON IN
`PARALLEL OF MULTIPLE
`PRMITIVES BY FIRST
`COMPUTING DEPENDENCE
`VECTOR FOR EACH
`PRIMTVE
`
`603
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`PASS PRMITVE TO
`AVAILABLE GRAPHICS
`ACCEE RATOR/
`RASTERIZER
`
`
`
`F.G. 6a
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`WAIT UNTIL PRMIT VE
`CAUSING
`DEPENDENCY S
`COMPLETED
`PROCESSING AND
`THEN PROCESS NEXT
`PRMITIVE IN
`CANDIDATE BUFFER
`
`605
`
`
`
`COMPLETE
`PROCESSING OF
`PRIMTIVE AND CLEAR
`ENTRIES IN
`RESERVATION
`STATIONS
`
`606
`
`PASS NEXT PRIMTVE
`FROM FETCH UNIT TO
`ISSUE UNIT AND
`SEARCH CAND DATE
`BUFFER FOR NEXT
`ODEST ELGIBLE
`PRMTVE
`
`
`
`607
`
`MEDIATEK, Ex. 1023, Page 8
`IPR2018-00102
`
`
`
`U.S. Patent
`
`Jul.18, 2006
`
`Sheet 7 of 8
`
`US 7,079,133 B2
`
`MOVE PLURALITY OF
`PRMITIVES INTO ISSUE UNIT
`
`610
`
`STORE DESTINATION
`REGION AND SOURCE
`REGION BOUNDING BOX
`COORDNATES IN
`RESERVATION STATIONS
`
`615
`
`BEGIN EXECUTION IN
`PARALLEL OF MULTIPLE
`PRMTVES BY FIRST
`COMPUTING DEPENDENCE
`VECTOR FOR EACH
`PRMITIVE
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`PASS PRIMTIVE TO
`AVAILABLE
`GRAPHCS
`ACCE LERATOR/
`RASTERIZER
`
`
`
`
`
`
`
`
`
`
`
`
`
`F.G. 6b
`
`
`
`
`
`BYPASS PRIMITIVE
`FROM PROCESSSING
`AND CAL CULATE
`DEPENDENCY FOR
`NEXT PRIMITIVE IN
`CAND DATE BUFFER
`
`630
`
`
`
`COMPLETE
`PROCESSING OF
`PRIMITIVE AND CLEAR
`ENTRIES IN
`RESERVATION
`STATIONS
`
`635
`
`PASS NEW PRIMITIVE
`FROM FETCH UNIT TO
`ISSUE UNIT AND
`SEARCH CAND DATE
`BUFFER FOR NEXT
`OLDEST ELGBLE
`PRMITIVE
`640
`
`MEDIATEK, Ex. 1023, Page 9
`IPR2018-00102
`
`
`
`U.S. Patent
`
`Jul.18, 2006
`
`Sheet 8 of 8
`
`US 7,079,133 B2
`
`destination
`reservation statioS
`
`initi
`w primitive
`
`new dependence
`
`
`
`
`
`
`
`
`
`
`
`wector IP
`|
`|| T
`H || ||
`--
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`sorce
`reservation stations
`
`FIG. 7
`
`MEDIATEK, Ex. 1023, Page 10
`IPR2018-00102
`
`
`
`US 7,079,133 B2
`
`1
`SUPERSCALAR 3D GRAPHICS ENGINE
`
`The present application is a continuation and claims the
`priority benefit of U.S. patent application Ser. No. 09/715,
`701 filed Nov. 16, 2000 and entitled “Superscalar 3D Graph-
`ics Engine.” The disclosure of this commonly owned appli-
`cation is incorporated herein by reference.
`
`FIELD OF THE INVENTION
`
`The invention relates to 3D graphics systems. More
`specifically,
`the invention relates to an architecture for
`parallel processing of graphics primitives which allows the
`concurrent execution of multiple graphics primitives while
`maintaining exact sequential semantics.
`
`BACKGROUND OF THE INVENTION
`
`Since the widespread introduction of graphicaluserinter-
`faces in computers more than 15 years ago, special-purpose
`graphics accelerators have been an integral component of
`desktop computer systems. Recently, 3D applications such
`as computer gaming, CAD, and visualization applications
`have been pushing high-performance 3D acceleration from
`specialty work-stations into mainstream PCs. The demand
`for increased 3D performanceis currently insatiable. Single-
`chip 3D accelerators today are 50 times faster than those
`available 3-4 years ago. Each generation substantially
`improvesthe quality of the images and yet another factor of
`100 still would not producetruly realistic scenes.
`Like all computing systems, in order to achieve such rapid
`increases in performance,
`it is necessary to improve the
`microarchitecture of each generation to increasingly take
`advantage of parallelism. Over the past few years, 3D
`graphics engines have moved from 32 to 64 to 128-bit
`memory buses in much the same way that microprocessors
`grew from 4 to 8 to 16 and eventually to 64 bit buslines.
`However,
`this advancement
`in 3D graphics engines has
`diminishing returns, especially as applications move toward
`a larger numberofprimitives for each scene to be displayed.
`Current high-performance microprocessors
`all use
`instruction-level parallelism (ILP) to further increase per-
`formance.
`ILP exploits information about dependences
`between instructions to allow parallel execution of multiple
`instructions while maintaining the execution semantics of a
`sequential program. Manydifferent ILP mechanisms can be
`effectively employed to improve performance, however
`dynamically scheduled, out-of-order, superscalar micropro-
`cessors are the commercially dominant microarchitecture at
`the present time.
`This invention includes an approach for using dynami-
`cally scheduled, out-of-order, superscalar principles for
`increasing the available parallelism in a 3D graphics engine
`while maintaining sequential execution semantics. In the
`abstract, it would seem that graphics, and particularly 3D
`graphics, is a massively parallel application that would not
`need ILP technology for high performance.In fact, in many
`simple graphics applications, it would be possible to render
`each pixel
`independently; however,
`in practice, graphics
`applications have very similar characteristics to traditional
`sequential programs. Standard APIs
`like Direct3D or
`OpenGLare used to create graphics applications. These are
`translated via software drivers to a sequence of graphics
`primitives to be rendered by a graphics engine that consists
`of some combination of additional hardware and software.
`
`The programming model assumesthat these primitives will
`
`20
`
`25
`
`30
`
`35
`
`40
`
`45
`
`50
`
`55
`
`60
`
`65
`
`2
`be executed sequentially and atomically, in much the same
`mannerthat it is assumed that instructions in a traditional
`
`sequential ISA are executed.
`Typically, 3D graphics applications will allow blocks of
`frame buffer memory to be directly read or written by the
`main processor. This requires that precise frame buffer state
`be available whenevera direct access executes. This creates
`
`a dependence between previous primitives and a direct read
`or subsequent primitives and a direct write. As in general-
`purpose computing systems, it is possible to build massively
`parallel systems that provide excellent performance on a
`limited set of applications that have been programmed with
`parallel execution in mind. However, in order to be com-
`patible with a large existing base of software using widely
`accepted programminginterfaces and programmingstyles, it
`is necessary to detect the dependences between graphics
`primitives, extract independent primitives from the instruc-
`tion stream, and execute them concurrently. Therefore, in
`order to implement a parallel system for executing this
`sequence of primitives, the semantics of sequential execu-
`tion must be maintained. More particularly, several factors
`cause dependences between graphics primitives which that
`can prevent concurrentor parallel execution of primitives in
`a 3D graphics engine.
`Z-buffering
`Realistic 3D graphics usually include hidden surface
`removal. More specifically, objects that are behind other
`objects from the perspective of the viewer should not be
`visible in the final image. Typically, a Z-buffer is used to
`implement hidden-surface removal. The Z-buffer stores the
`distance from the viewpoint to a currently drawn pixel so
`that when drawing a new pixel, it can be determined if the
`new pixel is in front of or behind the currently drawn pixel.
`A well-implemented Z-buffering algorithm should produce
`the sameresult even if the triangles are drawn in a different
`order. However, if the primitives are executed concurrently,
`then two triangles may be drawn concurrently. If each
`primitive attempts to concurrently read the same Z-buffer
`value, modify it, and then write the new value to the Z-buffer
`using commonread and write operations, incorrect results
`can be produced. A special type of dependence thus exists
`between any two primitives that must access the same
`Z-buffer value. Although they can execute in either order,
`there is currently no known process for executing these
`primitives concurrently.
`
`Alpha-blending
`Alpha-blending is an operation that uses a transparency
`value (alpha) to permit some portion of an occluded object
`to be visible through a foreground object. Unfortunately, the
`primitives that execute alpha-blending operations to make
`objects appear transparent must be executed in order. The
`foreground and the background must be executed in order to
`make the transparent effect appear correct on the image.
`Accordingly, ifthe 3D graphics engine does not maintain the
`semantics of the sequence in which primitives are executed,
`the image will be incorrect.
`
`Dynamic Textures, Procedural Textures, Environment Map-
`ping
`In another feature of realistic 3D graphics, an image
`(called a texture) can be mapped onto another image; for
`example,
`it may be part of the image to have objects
`reflected off of water displayed in the image. Often these
`textures have limited life-times. Procedural
`textures are
`
`created on-the-fly by program code. Dynamic textures are
`loaded into the graphics system memory space from some
`backing store for a limited time. Environment mapping is a
`MEDIATEK,Ex. 1023, Page 11
`IPR2018-00102
`
`MEDIATEK, Ex. 1023, Page 11
`IPR2018-00102
`
`
`
`US 7,079,133 B2
`
`5
`
`10
`
`15
`
`3
`technique for reflections where the 3D objects which is to be
`reflected is drawn and then copied as an image to be mapped
`onto a reflective surface. In each of these cases, there is a
`dependence between the primitives that create the texture
`and the primitives that render the reflective surface or
`polygon upon which the texture is projected. Once again, if
`the 3D graphics engine does not maintain the semantics of
`the sequence in which primitives are executed, the image
`will be incorrect.
`2D BLITS
`Often in 3D graphics, it is advantageous to be able to mix
`2D block copy and drawing operations with 3D rendering.
`If overlapping 2D objects are read or written out of order, the
`resulting image is incorrect.
`Direct Frame Buffer Access
`Common graphics APIs allow blocks of frame buffer
`memory to be directly read from or written to at the same
`time by the processor. This requires that the precise state of
`the frame buffer be available and known whenever an access
`to the frame buffer memory executes. This creates a depen
`dence between any previously executed primitives and a
`direct read or any future executed primitives and a direct
`write.
`Generally, a 3D application creates a series of frames. A
`25
`3D graphics engine then identifies each of the objects in a
`frame and breaks the surface of the object down into a
`collection of triangles for processing (typically the process
`ing and drawings of the pixels within these triangles are
`represented by a serious of executable instructions which are
`referred to as primitives which are processed individually).
`Each triangle or primitive is specified by three vertices in a
`3D space, one or more surface normals, and a description of
`how to draw the triangle's Surface, i.e. texturing, alpha
`blending parameters, etc. Accordingly, from the point of
`view of a 3D graphics engine, a frame consists of a collec
`tion of triangles or primitives which are all processed and
`executed separately thereby rendering the entire frame or
`image. The 3D graphics engine is responsible for processing
`each triangle or primitive and converting them each into
`pixels, which when displayed render the entire 3D frame.
`FIG. 1 illustrates a block diagram which shows a prior art
`3D processing pipeline resident in a prior art 3D graphics
`engine. Generally, the graphics engine identifies the trian
`gular coordinates for each primitive within the shared world
`space of the entire image, applies lighting to the triangles or
`primitives, transforms each triangle or primitive from the 3D
`space used by the application into 2D screen coordinates,
`and draws the appropriate pixels into the frame buffer
`(applying any shading, Z-buffering, alpha bending etc.).
`Referring now to FIG. 1, and more specifically, a first
`stage in a pipeline is a world transform stage 105, in which
`the graphics engine converts the vertices and normals of the
`triangle from the real world object space, which may be
`different for each object in the scene, to the shared world
`space, which is space shared by all of the objects to be
`rendered in the entire scene. This transform consists of a
`matrix-vector multiplication for each vertex and each nor
`mal. In a second stage of the pipeline, a lighting stage 110.
`the graphics engine takes the triangle's color and Surface
`normal(s) and computes the effect of one or more light
`Sources. The result is a color at each vertex. At the next stage
`in the pipeline, a view transform stage 115, the graphics
`engine converts the vertices from the world space to a
`camera space, with the viewer (or camera) at the center or
`origin and all vertices then mapped relative from that origin.
`Additionally, in the view transform stage 115, the graphics
`
`35
`
`4
`engine applies a matrix-vector multiplication to each vertex
`calculated for the camera space.
`As further shown in FIG. 1, the next stage in the pipeline
`is a projection transform stage 120. At the projection trans
`form stage 120, the graphics engine maps the vertices for the
`camera space to the actual view space. This includes the
`perspective transformation from 3D to 2D. Accordingly, at
`this point in the pipeline, the vertices are effectively two
`dimensional to which perspective effects (i.e., depth fore
`shortening) have been applied. Accordingly, the third (Z)
`coordinate is only needed to indicate the relative front-to
`back ordering of the vertices when the objects are rendered
`or drawn within the view space. Like the other two trans
`form stages in the pipeline, the projection transform stage
`requires the application of a matrix-vector multiplication per
`each vertex. In a clipping stage 125, the graphics engine
`clips the triangles or primitives to fit within the view space.
`Accordingly, the triangles or primitives which lie entirely off
`the side of the screen or behind the viewer are removed.
`Meanwhile, triangles or primitives which are only partially
`out of bounds are trimmed. This generally requires splitting
`the resulting polygon into additional multiple triangles or
`primitives and processing each one of these additional
`triangles or primitives separately. Finally, in a rasterization
`stage 130, the graphics engine converts those triangles to be
`displayed within the view space into pixels and computes
`the color value to be displayed at each pixel. This includes
`visible-surface determination (dropping pixels which are
`obscured by a triangle closer to the viewer), texture map
`ping, and alpha blending (transparency effects).
`FIG. 2 further illustrates how a prior art rasterizer stage
`130 in a 3D graphics engine operates. First, the rasterizer
`calculates the centers for each pixel in the triangle or
`primitive and assigns X and y values to these centers. The
`rasterizer stage then converts each triangle or primitive into
`a series of horizontal spans, with one span generated for
`each integery value that falls inside the triangle. For each
`horizontal span, the rasterizer computes the two endpoints,
`i.e. the points where the horizontal span crosses the edges or
`boundaries of the triangle or primitive. The rasterizer will
`also interpolate color values and perspective-corrected tex
`ture coordinates for the endpoints. Next, the rasterizer
`generates the series of pixels along the span, again interpo
`lating color and texture coordinates for each pixel between
`the two endpoints of the horizontal span. Several operations
`are then performed at each pixel. First each pixel has its Z
`(depth) value compared to the Z (depth) value for the
`currently displayed pixel in the same location. The currently
`displayed pixel has its Z (depth) value stored in a Z buffer. If
`the comparison indicates that this new pixel is behind the old
`one, the new pixel is discarded. If the comparison indicates
`that this new pixel is in front of the old pixel then the Z test
`Succeeds and the new pixel color is computed. This can
`include texture mapping and alpha blending. Accordingly,
`prior art 3D graphics are computed serially because each Z
`(depth) value must be compared to the previously displayed
`Z (depth) value.
`Generally, a prior art 3D graphics engine will serially
`perform these steps on each triangle or primitive one at a
`time, such that the triangles or primitives are processed in an
`orderly fashion one after the other in series. One reason this
`is done is to avoid any dependencies which may occur
`between the primitives or triangles as they are each
`executed. As explained earlier, as each primitive is executed
`for processing, the Z values for each pixel location in the
`triangle or primitive is compared with the Z value previously
`displayed in that same location on the two dimensional
`
`30
`
`40
`
`45
`
`50
`
`55
`
`60
`
`65
`
`MEDIATEK, Ex. 1023, Page 12
`IPR2018-00102
`
`
`
`US 7,079,133 B2
`
`5
`screen in order to determine whether the new pixel should
`overwrite (appear in front of) the old value or be ignored
`(appear behind). If the triangles are not executed in order,
`then the Z value-test results will be faulty.
`However, the present invention is directed toward a
`method, apparatus and computer program product for par
`allel execution of primitives in 3D graphics engines. It
`includes detection and preservation of dependences between
`graphics primitives. Accordingly, the present invention has
`the ability to execute multiple independent primitives con
`currently while preserving their ordering because the archi
`tecture of the graphics engine for the present invention
`further provides concurrent resources for parallel execution.
`In a first preferred embodiment, primitives are executed in
`parallel using an in-order dispatch unit capable of detecting
`dependencies between primitives. In a second preferred
`embodiment, an out-of-order dispatch unit is used such that
`not only are primitives executed concurrently; but, the
`primitives may be executed in any order when dependencies
`are detected.
`
`10
`
`15
`
`6
`multiple instruction/multiple data MIMD fashion, which
`means the graphics engine is capable of processing multiple
`primitives at the same time.
`In Order Dispatch Unit
`FIGS. 3a–b illustrate how a preferred embodiment of a
`graphics engine uses an in-order dispatch system in order to
`dispatch primitives for concurrent execution by two or more
`graphics accelerators in accordance with the present inven
`tion. In this embodiment, each primitive is ranked in order
`in an input list and is dispatched from the head of the input
`list, in order, as an accelerator becomes available. As shown
`in FIG. 3a, a 3D image is broken into a number of triangles
`or primitives, i.e. each triangle is a primitive which must be
`processed for display by the graphics engine. In a typical
`prior art 3D graphics engine, the coordinates for the vertices
`of the primitives are stored in a queue for serial execution/
`processing by the 3D graphics engine. The primitives are
`executed in order serially one at a time from the queue.
`However, in the present invention more than one primitive
`may be executed at a time so the queue has what is referred
`to as an in-order dispatch system.
`Assume in the preferred embodiment illustrated that there
`are two graphics accelerators capable of executing two
`graphics primitives in parallel such that two primitives may
`be dispatched to the two simultaneously executing accelera
`tors/rasterizers in the 3D graphics engine. Accordingly, FIG.
`3b shows primitives 1 and 2 being executed while primitives
`3, 4 and 5 remain in the queue. When using an in-order
`dispatch unit in accordance with a preferred embodiment of
`the present invention, each Subsequent primitive is tested
`against currently executing primitives in order to detect any
`dependencies. Accordingly, in our example, primitive 3 will
`be tested against primitives 1 and 2 for dependencies. Since
`primitive 3 overlaps primitives 1 and 2, primitive 3 cannot
`be executed until primitives 1 and 2 are both complete. This
`is because triangles associated with primitives 1 and 2 are
`both overlapped by the triangle represented by primitive 3.
`Thus, even if primitive 1 completes before primitive 2.
`primitive 3 cannot begin processing/execution until primi
`tive 2 has completed. In this embodiment, the graphics
`engine cannot execute primitive 4 because it includes an
`in-order dispatch system which does not permit out of order
`execution and processing of the primitives. Accordingly, the
`in-order dispatch system prevents the next primitive in the
`list from executing until all previous primitives that overlap
`that primitive have been completed and all primitives which
`are in the queue before that primitive have executed. This
`allows for parallel processing and execuion of primitives
`while preserving the ordering and integrity of the primitives
`thereby insuring that the final graphic is properly drawn.
`Out of Order Dispatch
`FIG. 4a–c illustrate how a preferred embodiment of a
`graphics engine uses an out-of-order dispatch system in
`order to dispatch primitives for concurrent execution by two
`or more graphics accelerators in accordance with the present
`invention. As shown in FIG. 4a, a 3D image is broken into
`a number of triangles, each triangle represented by a primi
`tive. The primitives are stored in a queue for execution/
`processing by the 3D graphics engine.
`Assume in the preferred embodiment illustrated that there
`are two graphics accelerators capable of executing two
`graphics primitives in parallel such that two primitives may
`be dispatched to the two simultaneously executing accelera
`tors/rasterizers in the 3D graphics engine. However, unlike
`the preferred embodiment which utilizes an in-order dis
`patch unit, in the embodiment which utilizes an out of order
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`25
`
`35
`
`FIG. 1 illustrates a block diagram which shows a prior art
`3D processing pipeline resident in a prior art 3D graphics
`engine;
`FIG. 2 illustrates further how a prior art rasterizer stage
`130 in a 3D graphics engine operates;
`FIGS. 3a–b illustrate how a first preferred embodiment of
`30
`a 3D graphics engine uses an in-order dispatch mechanism
`which dispatches each primitive from the head of the input
`list, in order, as an accelerator becomes available;
`FIG. 4a–c illustrate the basic concepts used in the process
`of 3D graphics in accordance with a second preferred
`embodiment of the present invention which uses an out-of
`order dispatch mechanism;
`FIG. 5 illustrates a functional block diagram showing an
`out-of-order dispatch mechanism in accordance with a pre
`ferred embodiment of the present invention;
`40
`FIGS. 6a and 6b illustrate the process executed when
`processing graphics primitives and using either an in-order
`or an out-of-order dispatch mechanism in accordance with a
`preferred embodiment of the present invention; and
`FIG. 7 illustrates a block diagram which shows a pre
`ferred embodiment of the hardware used to compute depen
`dency vectors for each primitive.
`
`45
`
`DETAILED DESCRIPTION OF PREFERRED
`EMBODIMENTS
`
`In the present invention, an out-of-order dispatch 3D
`graphics engine detects dependences and maintains sequen
`tial semantics at the rasterizer stage to extract ILP from the
`graphics primitive stream. Multiple rasterization engines are
`used to provide execution resources for the graphics primi
`tives such that multiple graphics triangles or primitives may
`be executed at the same time. Since primitives can have long
`execution times, from several cycles to several hundred
`thousand cycles, it is not very important to dispatch more
`than one primitive per cycle, although it is possible to do so.
`However, it is important to have many primitives executing
`concurrently. Also, since the primitives do many memory
`accesses and require an indeterminate number of cycles to
`execute, it is best to have the rasterization engines run
`independently. Accordingly, in a preferred embodiment of
`the present invention, the graphics engine operates in a
`
`50
`
`55
`
`60
`
`65
`
`MEDIATEK, Ex. 1023, Page 13
`IPR2018-00102
`
`
`
`US 7,079,133 B2
`
`7
`dispatch unit, the primitives may be executed out-of-order
`since the queue has an out-of-order dispatch system.
`Accordingly, once again assume in the example illustrated
`that two primitives may be dispatched to two simultaneously
`executing accelerators/rasterizers in the 3D graphics engine.
`FIG. 4b shows primitives 1 and 2 being executed while
`primitives 3, 4 and 5 remain in the queue. When using the
`out-of-order dispatch unit in accordance with a preferred
`embodiment of the present invention, each Subsequent
`primitive is tested against currently executing primitives in
`order to detect any dependencies. Accordingly, in our
`example, primitive 3 will be tested against primitives 1 and
`2 for dependencies. Since primitive 3 overlaps primitives 1
`and 2, it cannot be executed until primitives 1 and 2 are both
`complete. Using the out-of-order dispatch unit, if primitive
`1 completes before primitive 2, primitive 3 may be bypassed
`and primitive 4 is then processed/executed. Once primitive
`2 completes processing, then primitive 3 may be executed.
`Accordingly, the out-of-order dispatch system prevents the
`next primitive in the list from executing until all previous
`primitives that overlap that primitive have been completed
`while allowing Subsequent primitives which do not overlap
`to be executed. This allows for parallel processing and
`execution of primitives while preserving the ordering and
`integrity of the primitives thereby insuring that the final
`graphic is properly drawn.
`Therefore, utilizing the preferred embodiment of the
`present invention in which an out-of-order dispatch unit is
`used, although primitive 3 cannot be approved for execution
`since it overlaps with the currently executing primitives, in
`order to extract additional parallelism, it is preferable to
`allow out of-of-order dispatch of other primitives in the
`queue. In this case, stalled operations or primitives which
`cannot be executed are placed into central reservation sta
`tions (i.e. primitive 3) and other operations that can proceed
`independently (i.e. other primitives which can be processed
`immediately since there is no overlap) are allowed to bypass
`them. Accordingly, if either primitive 1 or 2 completes
`before the other, primitive 4 can begin execution, as shown
`in FIG. 4c.
`Dependency Checking
`In both embodiments (i.e. using either an in-order or an
`out-of-order dispatch mechanism), all incoming primitives
`are placed in reservation stations and the oldest eligible
`candidate (i.e., the oldest primitive eligible for execution
`which has no dependencies) is dispatched each cycle as an
`accelerator/rasterizer becomes available. Accordingly, in
`both embodiments, it is necessary to have a mechanism for
`checking the dependency of each incoming p