United States Patent
Leather et al.
Patent No.: US 8,933,945 B2
Date of Patent: Jan. 13, 2015
Inventors: Mark M. Leather, Saratoga, CA (US); Eric Demers, Palo Alto, CA (US)
Assignee: ATI Technologies ULC, Markham, Ontario (CA)
Appl. No.: 10/459,797
Filed: Jun. 12, 2003
Prior Publication Data: US 2004/0100471 A1, May 27, 2004
Related U.S. Application Data: Provisional application No. 60/429,641, filed on Nov. 27, 2002.
Int. Cl.: G06T 1/20; G06F 13/14; G06F 12/02; G06T 11/40; G06T 15/00; G09G 5/36
U.S. Cl.: CPC G06T 11/40 (2013.01); G06T 1/20 (2013.01); G06T 15/005 (2013.01); G09G 5/363 (2013.01); USPC 345/506; 345/519; 345/544
Field of Classification Search: USPC 345/506, 530, 505, 588, 544, 545, 532, 501, 502, 531, 519
A graphics processing circuit includes at least two pipelines operative
to process data in a corresponding set of tiles of a repeating tile
pattern, a respective one of the at least two pipelines operative to
process data in a dedicated tile, wherein the repeating tile pattern
includes a horizontally and vertically repeating pattern of square
regions. A graphics processing method includes receiving vertex data for
a primitive to be rendered; generating pixel data in response to the
vertex data; determining the pixels within a set of tiles of a repeating
tile pattern to be processed by a corresponding one of at least two
graphics pipelines in response to the pixel data, the repeating tile
pattern including a horizontally and vertically repeating pattern of
square regions; and performing pixel operations on the pixels within the
determined set of tiles by the corresponding one of the at least two
graphics pipelines.
21 Claims, 5 Drawing Sheets

[Drawing sheets 1 to 5 omitted. Sheet 1 of 5: FIG. 1 and FIG. 5. Sheet 2 of 5: FIG. 2.]

This application claims the benefit of U.S. Provisional Application
Ser. No. 60/429,641 filed Nov. 27, 2002, entitled “Dividing Work Among
Multiple Graphics Pipelines Using a Super-Tiling Technique”, having as
inventors Mark M. Leather and Eric Demers, and owned by instant assignee.
This is a related application to a co-pending application entitled
“Parallel Pipeline Graphics System”, having Ser. No. 10/724,384, having
Leather et al. as the inventors, filed on Nov. 26, 2003, owned by the
same assignee and hereby incorporated by reference in its entirety.
The present invention generally relates to graphics processing circuitry
and, more particularly, to dividing graphics processing operations among
multiple pipelines.
Computer graphics systems, set top box systems or other graphics
processing systems typically include a host processor, graphics
(including video) processing circuitry, memory (e.g. frame buffer), and
one or more display devices. The host processor may have a graphics
application running thereon, which provides vertex data for a primitive
(e.g. triangle) to be rendered on the one or more display devices to the
graphics processing circuitry. The display device, for example, a CRT
display, includes a plurality of scan lines comprised of a series of
pixels. When appearance attributes (e.g. color, brightness, texture) are
applied to the pixels, an object or scene is presented on the display
device. The graphics processing circuitry receives the vertex data and
generates pixel data including the appearance attributes which may be
presented on the display device according to a particular protocol. The
pixel data is typically stored in the frame buffer in a manner that
corresponds to the pixel's location on the display device.
FIG. 1 illustrates a conventional display device 10, having a screen 12
partitioned into a series of vertical strips 13-18. The strips 13-18 are
typically 1-4 pixels in width. In like manner, the frame buffer of
conventional graphics processing systems is partitioned into a series of
vertical strips having the same screen space width. Alternatively, the
frame buffer and the display device may be partitioned into a series of
horizontal strips. Graphics calculations, for example, lighting, color,
texture and user viewing information, are performed by the graphics
processing circuitry on each of the primitives provided by the host. Once
all calculations have been performed on the primitives, the pixel data
representing the object to be displayed is written into the frame buffer.
Once the graphics calculations have been repeated for all primitives
associated with a specific frame, the data stored in the frame buffer is
rendered to create a video signal that is provided to the display.
The amount of time taken for an entire frame of information to be
calculated and provided to the frame buffer becomes a bottleneck in
graphics systems as the calculations associated with the graphics become
more complicated. Contributing to the increased complexity of the
graphics calculation is the increased need for higher resolution video,
as well as the need for more complicated video, such as 3-D video. The
video image observed by the human eye becomes distorted or choppy when
the amount of time taken to render an entire frame of video exceeds the
amount of time in which the display device must be refreshed with a new
graphic or frame in order to avoid perception by the human eye. To
decrease processing time, graphics processing systems typically divide
primitive processing among several graphics processing circuits where,
for example, one graphics processing circuit is responsible for one
vertical strip (e.g. 13) of the frame while another graphics processing
circuit is responsible for another vertical strip (e.g. 14) of the frame.
In this manner, the pixel data is provided to the frame buffer within the
required refresh time.
Load balancing is a significant drawback associated with the partitioning
systems as described above. Load balancing problems occur, for example,
when all of the primitives 20-23 of a particular object or scene are
located in one strip (e.g. strip 13) as illustrated in FIG. 1. When this
occurs, only the graphics processing circuit responsible for strip 13 is
actively processing primitives; the remaining graphics processing
circuits are idle. This results in a significant waste of computing
resources as at most only half of the graphics processing circuits are
operating. Consequently, graphics processing system performance is
decreased as the system is only operating at a maximum of fifty percent
capacity.
Changing the width of the strips has been employed to counter the system
performance problems. However, when the width of a strip is increased,
the load balancing problem is exacerbated as more primitives are located
within a single strip, thereby increasing the processing required of the
graphics processing circuit responsible for that strip, while the
remaining graphics processing circuits remain idle. When the width of the
strip is decreased (e.g. four bits to two bits), cache (e.g. texture
cache) efficiency is decreased as the number of cache lines employed in
transferring data is reduced in proportion to the decreased width of the
strip. In either case, graphics processing system performance is still
decreased due to the idle graphics processing circuits.
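As a rough illustration of the load balancing problem described above, the following Python sketch counts how many pixels of a rectangular region fall to each graphics processing circuit under a vertical strip partition. The round-robin assignment of strips to circuits, the function names, and the example dimensions are assumptions made for illustration; they are not taken from the patent.

```python
def strip_owner(x, strip_width, num_circuits):
    """Index of the circuit owning the vertical strip that contains column x.

    Assumption: strips are handed out to circuits in round-robin order.
    """
    return (x // strip_width) % num_circuits


def pixels_per_circuit(x0, y0, x1, y1, strip_width, num_circuits):
    """Count pixels of the rectangle [x0, x1) x [y0, y1) owned by each circuit."""
    counts = {circuit: 0 for circuit in range(num_circuits)}
    for x in range(x0, x1):
        # Every pixel in a column belongs to the same strip, hence the same circuit.
        counts[strip_owner(x, strip_width, num_circuits)] += y1 - y0
    return counts


# An object that fits inside a single 4-pixel-wide strip: one circuit does all the work.
narrow = pixels_per_circuit(0, 0, 4, 100, strip_width=4, num_circuits=2)

# An object spanning many strips: the work is shared evenly.
wide = pixels_per_circuit(0, 0, 64, 100, strip_width=4, num_circuits=2)

print(narrow)  # {0: 400, 1: 0}
print(wide)    # {0: 3200, 1: 3200}
```

In the narrow case the second circuit receives no pixels at all, which corresponds to the idle circuits the text describes.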
Frame based subdivision has been used to overcome the performance
problems associated with conventional partitioning systems. In frame
based subdivision, each graphics processor is responsible for processing
an entire frame, not strips within the same frame. The graphics
processors then alternate frames. However, frame subdivision introduces
one or more frames of latency between the user and the screen, which is
unacceptable in real-time interactive environments, for example,
providing graphics for a flight simulator application.
The present invention and the related advantages and benefits provided
thereby, will be best appreciated and understood upon review of the
following detailed description of a preferred embodiment, taken in
conjunction with the following drawings, where like numerals represent
like elements, in which:
FIG. 1 is a schematic block diagram of a conventional display partitioned
into several vertical strips;
FIG. 2 is a schematic block diagram of a graphics processing system
employing an exemplary multi-pipeline graphics processing circuit
according to one embodiment of the present invention;
FIG. 3 is a schematic block diagram of a memory partitioned into an
exemplary super-tile pattern according to the present invention;
FIG. 4 is a schematic block diagram of a memory partitioned into a
super-tile pattern according to an alternate embodiment of the present
invention;
FIG. 5 is a schematic block diagram of an exemplary multi-pipeline
graphics processing circuit used in a multi-processor configuration
according to an alternate embodiment of the present invention;
FIG. 6 is a flow chart of the operations performed by the graphics
processing circuit according to the present invention;
FIG. 7 is a diagram illustrating a polygon bounding box used to determine
whether a polygon fits in a tile or super tile; and
FIG. 8 is a schematic block diagram of an exemplary multi-pipeline
graphics processing circuit used in a multi-processor configuration
according to an alternate embodiment of the present invention.
A multi-pipeline graphics processing circuit includes at least two
pipelines operative to process data in a corresponding tile of a
repeating tile pattern, a respective one of the at least two pipelines is
operative to process data in a dedicated tile, wherein the repeating tile
pattern includes a horizontally and vertically repeating pattern of
square regions. The multi-pipeline graphics processing circuit may be
coupled to a frame buffer that is subdivided into a replicating pattern
of square regions (e.g. tiles), where each region is processed by a
corresponding one of the at least two pipelines such that load balancing
and texture cache utilization are enhanced.
A multi-pipeline graphics processing method includes receiving vertex
data for a primitive to be rendered, generating pixel data in response to
the vertex data, determining the pixels within a set of tiles of a
repeating tile pattern to be processed by a corresponding one of at least
two graphics pipelines in response to the pixel data, the repeating tile
pattern including a horizontally and vertically repeating pattern of
square regions, and performing pixel operations on the pixels within the
determined set of tiles by the corresponding one of the at least two
graphics pipelines. An exemplary embodiment of the present invention will
now be described with reference to FIGS. 2-6.
FIG. 2 is a schematic block diagram of an exemplary graphics processing
system 30 employing an example of a multi-pipeline graphics processing
circuit 34 according to one embodiment of the present invention. The
graphics processing system 30 can be implemented with a single graphics
processing circuit 34 or with two or more graphics processing circuits
34, 54. The components and corresponding functionality of the graphics
processing circuits 34, 54 are substantially the same. Therefore, only
the structure and operation of graphics processing circuit 34 will be
described in detail. An alternate embodiment, employing both graphics
processing circuits 34 and 54 will be discussed in greater detail below
with reference to FIGS. 4-5.
Graphics data 31, for example, vertex data of a primitive (e.g. triangle)
80 (FIG. 3) is transmitted as a series of strips to the graphics
processing circuit 34. As used herein, graphics data 31 can also include
video data or a combination of video data and graphics data. The graphics
processing circuit 34 is preferably a portion of a stand-alone graphics
processor chip or may also be integrated with a host processor or other
circuit, if desired, or part of a larger system. The graphics data 31 is
provided by a host (not shown). The host may be a system processor (not
shown) or a graphics application running on the system processor. In an
alternate embodiment, an Accelerated Graphics Port (AGP) 32 or other
suitable port receives the graphics data 31 from the host and provides
the graphics data 31 to the graphics processing circuit 34 for further
processing.
The graphics processing circuit 34 includes a first graphics pipeline 101
operative to process graphics data in a first set of tiles as discussed
in greater detail below. The first pipeline 101 includes front end
circuitry 35, a scan converter 37, and back end circuitry 39. The
graphics processing circuit 34 also includes a second graphics pipeline
102, operative to process graphics data in a second set of tiles as
discussed in greater detail below. The first graphics pipeline 101 and
the second graphics pipeline 102 operate independently of one another.
The second graphics pipeline 102 includes the front end circuitry 35, a
scan converter 40, and back end circuitry 42. Thus, the graphics
processing circuit 34 of the present invention is configured as a
multi-pipeline circuit, where the back end circuitry 39 of the first
graphics pipeline 101 and the back end circuitry 42 of the second
graphics pipeline 102 share the front end circuitry 35, in that the first
and second graphics pipelines 101 and 102 receive the same pixel data 36
provided by the front end circuitry 35. Alternatively, the back end
circuitry 39 of the first graphics pipeline 101 and the back end
circuitry 42 of the second pipeline 102 may be coupled to separate front
end circuits. Additionally, it will be appreciated that a single graphics
processing circuit can be configured in similar fashion to include more
than two graphics pipelines. The illustrated graphics processing circuit
34 has the first and second pipelines 101-102 present on the same chip.
However, in alternate embodiments, the first and second graphics
pipelines 101-102 may be present on multiple chips interconnected by
suitable communication circuitry or a communication path, for example, a
synchronization signal or data bus interconnecting the respective memory
controllers.
The front end circuitry 35 may include, for example, a vertex shader, set
up circuitry, rasterizer or other suitable circuitry operative to receive
the primitive data 31 and generate pixel data 36 to be further processed
by the back end circuitry 39 and 42, respectively. The front end
circuitry 35 generates the pixel data 36 by performing, for example,
clipping, lighting, spatial transformations, matrix operations and
rasterizing operations on the primitive data 31. The pixel data 36 is
then transmitted to the respective scan converters 37 and 40 of the two
graphics pipelines 101-102.
The scan converter 37 of the first graphics pipeline 101 receives the
pixel data 36 and sequentially provides the position (e.g. x, y)
coordinates 60 in screen space of the pixels to be processed by the back
end circuitry 39 by determining or identifying those pixels of the
primitive, for example, the pixels within portions 81-82 of the triangle
80 (FIG. 3) that intersect the tile or set of tiles that the back end
circuitry 39 is responsible for processing. The particular tile(s) that
the back end circuitry 39 is responsible for is determined based on the
tile identification data present on the pixel identification line 38 of
the scan converter 37. The pixel identification line 38 is illustrated as
being hard wired to ground. Thus, the tile identification data
corresponds to a logical zero. This corresponds to the back end circuitry
39 being responsible for processing the tiles labeled “A” (e.g. 72 and
75) in FIG. 3. Although the pixel identification line 38 is illustrated
as being hard wired to a fixed value, it is to be understood and
appreciated that the tile identification data can be programmable data,
for example, from a suitable driver, and such a configuration is
contemplated by the present invention and is within the spirit and scope
of the instant disclosure.
Back end circuitry 39 may include, for example, pixel shaders, blending
circuits, z-buffers or any other circuitry for performing pixel
appearance attribute operations (e.g. color, texture blending,
z-buffering) on those pixels located, for example, in tiles 72, 75
(FIG. 3) corresponding to the position coordinates 60 provided by the
scan converter 37. The processed pixel data 43 is then transmitted to
graphics memory 48 via memory controller 46 for storage therein at
locations corresponding to the position coordinates 60.
The scan converter 40 of the second graphics pipeline 102 receives the
pixel data 36 and sequentially provides position (e.g. x, y) coordinates
61 in screen space of the pixels to be processed by the back end
circuitry 42 by determining those pixels of the primitive, for example,
the pixels within portions 83-84 of the triangle 80 (FIG. 3) that
intersect the tiles that the back end circuitry 42 is responsible for
processing. Back end circuitry 42 tile responsibility is determined based
on the tile identification data present on the pixel identification line
41 of the scan converter 40. The pixel identification line 41 is
illustrated as being hard wired to VCC; thus, the tile identification
data corresponds to a logical one. This corresponds to the back end
circuitry 42 being responsible for processing the tiles labeled “B” (e.g.
73-74) in FIG. 3. Although the pixel identification line 41 is
illustrated as being hard wired to a fixed value, it is to be understood
and appreciated that the tile identification data can be programmable
data, for example, from a suitable driver, and such a configuration is
contemplated by the present invention and is within the spirit and scope
of the instant disclosure.
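The selection performed by the two scan converters can be sketched in Python as a filter over pixel positions. This is a minimal sketch under stated assumptions: the “A” and “B” tiles are taken to alternate in a checkerboard (consistent with tiles 72 and 75 being “A” and tiles 73 and 74 being “B”, but not confirmed by the text), the 16x16 tile size comes from the description of FIG. 3, and the names tile_id and scan_convert are hypothetical.

```python
TILE_SIZE = 16  # each tile is described as a 16x16 pixel array


def tile_id(x, y, tile_size=TILE_SIZE):
    """Tile identification for the tile containing pixel (x, y).

    Assumption: a checkerboard layout, 0 for tiles labeled "A" and
    1 for tiles labeled "B".
    """
    return ((x // tile_size) + (y // tile_size)) % 2


def scan_convert(pixels, pipeline_tile_id):
    """Yield only the positions whose tile matches this pipeline's tile ID.

    pipeline_tile_id plays the role of the value on the pixel
    identification line (0 for line 38, 1 for line 41).
    """
    for x, y in pixels:
        if tile_id(x, y) == pipeline_tile_id:
            yield (x, y)


# Both pipelines receive the same pixel data; each keeps only its own tiles.
pixels = [(0, 0), (15, 15), (16, 0), (0, 16), (16, 16), (31, 17)]
pipeline_a = list(scan_convert(pixels, 0))
pipeline_b = list(scan_convert(pixels, 1))

print(pipeline_a)  # [(0, 0), (15, 15), (16, 16), (31, 17)]
print(pipeline_b)  # [(16, 0), (0, 16)]
```

Because the two filters partition the input, every pixel is processed by exactly one back end, which mirrors the two pipelines operating independently on disjoint sets of tiles.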
Back end circuitry 42 may include, for example, pixel shaders, blending
circuits, z-buffers or any suitable circuitry for performing pixel
appearance attribute operations on those pixels located, for example, in
tiles 73 and 74 (FIG. 3) corresponding to the position coordinates 61
provided by the scan converter 40. The processed pixel data 44 is then
transmitted to the graphics memory 48, via memory controller 46, for
storage therein at locations corresponding to the position coordinates
61.
The memory controller 46 is operative to transmit and receive the
processed pixel data 43-44 from the back end circuitry 39 and 42;
transmit and retrieve pixel data 49 from the graphics memory 48; and in a
single circuit implementation, transmit pixel data 50 for presentation on
a suitable display 51. The display 51 may be a monitor, a CRT, a high
definition television (HDTV) or any other device or combination thereof.
Graphics memory 48 may include, for example, a frame buffer that also
stores one or more texture maps. Referring to FIG. 3, the frame buffer
portion of the graphics memory 48 is partitioned in a repeating tile
pattern of horizontal and vertical square regions or tiles 72-75, where
the regions 72-75 provide a two dimensional partitioning of the frame
buffer portion of the memory 48. Each tile is implemented as a 16x16
pixel array. The repeating tile pattern of the frame buffer 48
corresponds to the partitioning of the corresponding display 51 (FIG. 2).
When rendering a primitive (e.g. triangle) 80, the first graphics
pipeline 101 processes only those pixels in portions 81, 82 of the
primitive 80 that intersect tiles labeled “A”, for example, 72 and 75, as
the back end circuitry 39 is responsible for the processing of tiles
corresponding to tile identification 0 present on pixel identification
line 38 (FIG. 2). In corresponding fashion, the second graphics pipeline
102 processes only those pixels in portions 83, 84 of the primitive 80
that intersect tiles labeled “B”, for example 73-74, as the back end
circuitry 42 (FIG. 2) is responsible for the processing of tiles
corresponding to tile identification 1 present on pixel identification
line 41 (FIG. 2).
By configuring the frame buffer 48 according to the present invention, as
the primitive data 31 is typically written in strips, the tiles (e.g. 72
and 75) being processed by the first graphics pipeline 101 and the tiles
(e.g. 73 and 74) being processed by the second graphics pipeline 102 will
be substantially equal in size, notwithstanding the primitive 80
orientation. Thus, the amounts of processing performed by the first
graphics pipeline 101 and the second graphics pipeline 102, respectively,
are substantially equal, thereby effectively eliminating the load balance
problems exhibited by conventional techniques.
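The balancing effect can be checked numerically for one example primitive. The Python sketch below rasterizes a triangle with a simple edge-function coverage test sampled at pixel centers and counts the covered pixels that land in “A” tiles and in “B” tiles. The checkerboard layout, the coverage rule, the triangle, and all function names are assumptions chosen for illustration; the patent does not specify a rasterization method.

```python
TILE_SIZE = 16  # tile size given in the description of FIG. 3


def tile_id(x, y):
    """Assumed checkerboard: 0 for "A" tiles, 1 for "B" tiles."""
    return ((x // TILE_SIZE) + (y // TILE_SIZE)) % 2


def edge(ax, ay, bx, by, px, py):
    """Signed area test: which side of the edge a->b the point p lies on."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)


def covers(tri, px, py):
    """True if point (px, py) lies inside or on the triangle, either winding."""
    (ax, ay), (bx, by), (cx, cy) = tri
    w0 = edge(ax, ay, bx, by, px, py)
    w1 = edge(bx, by, cx, cy, px, py)
    w2 = edge(cx, cy, ax, ay, px, py)
    return (w0 >= 0 and w1 >= 0 and w2 >= 0) or (w0 <= 0 and w1 <= 0 and w2 <= 0)


def pixels_per_pipeline(tri):
    """Count covered pixels per tile ID, scanning the triangle's bounding box."""
    xs = [v[0] for v in tri]
    ys = [v[1] for v in tri]
    counts = [0, 0]
    for y in range(min(ys), max(ys) + 1):
        for x in range(min(xs), max(xs) + 1):
            if covers(tri, x + 0.5, y + 0.5):  # sample at the pixel center
                counts[tile_id(x, y)] += 1
    return counts


triangle = ((0, 0), (128, 0), (0, 128))
counts = pixels_per_pipeline(triangle)
print(counts)  # [4096, 4160]
```

For this triangle the two pipelines receive 4096 and 4160 pixels, a difference of under two percent. A primitive small enough to sit inside a single tile would still go to one pipeline only, so the result should be read as an example for primitives that span many tiles.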
FIG. 4 is a schematic block diagram of a frame buffer 68 partitioned into
a super-tile pattern according to an alternate embodiment of the present
invention. Such a partitioning would be used, for example, in conjunction
with a multi-processor implementation to be discussed below with
reference to FIG. 5. As illustrated, the frame buffer 68 is partitioned
into a repeating tile pattern where the tiles, for example, 92-99 that
form the repeating tile pattern are the responsibility of and processed
by a corresponding one of the graphics pipelines provided by the
multi-processor implementation.
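One way to picture a super-tile partition shared by four pipelines is sketched below in Python. The arrangement of tiles 92-99 in FIG. 4 cannot be recovered from the text, so the 2x2 block used here, the 16x16 tile size carried over from FIG. 3, and the function name pipeline_for_pixel are all assumptions; the values 101, 102, 201 and 202 are simply the reference numerals of the four pipelines.

```python
TILE_SIZE = 16  # assumed to match the 16x16 tiles described for FIG. 3

# Hypothetical 2x2 super-tile: one tile per pipeline, repeated across the
# frame buffer horizontally and vertically.
SUPER_TILE = (
    (101, 102),  # even tile rows: pipelines of graphics processing circuit 34
    (201, 202),  # odd tile rows: pipelines of graphics processing circuit 54
)


def pipeline_for_pixel(x, y):
    """Reference numeral of the pipeline assumed to own pixel (x, y)."""
    tile_x = x // TILE_SIZE
    tile_y = y // TILE_SIZE
    return SUPER_TILE[tile_y % 2][tile_x % 2]


owners = [pipeline_for_pixel(x, y) for (x, y) in [(0, 0), (16, 0), (0, 16), (16, 16), (32, 32)]]
print(owners)  # [101, 102, 201, 202, 101]
```

Any other repeating arrangement would work the same way; only the lookup table changes.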
FIG. 5 is a schematic block diagram of a graphics processing circuit 54
which may be coupled with the graphics processing circuit 34 (FIG. 2),
for example, by the AGP 32 or other suitable port, to form one embodiment
of a multi-processor implementation. The graphics processing circuit 54
is preferably a portion of a stand-alone graphics processor chip or may
also be integrated with a host processor or other circuit, if desired, or
part of a larger system. The multi-processor implementation exhibits an
increased fill rate of, for example, 9.6 billion pixels/sec with a
triangle rate of 300 million triangles/sec. This represents a tremendous
performance increase as compared to conventional graphics processing
systems. The triangle rate is defined as the number of triangles the
graphics processing circuit can generate per second. The fill rate is
defined as the number of pixels the graphics processing circuit can
render per second.
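To put the quoted rates in perspective, the short Python calculation below converts them into per-refresh budgets. The 1600x1200 resolution and the 60 Hz refresh rate are illustrative assumptions and do not appear in the patent; only the fill rate and triangle rate are taken from the text.

```python
FILL_RATE = 9_600_000_000    # pixels per second, from the text
TRIANGLE_RATE = 300_000_000  # triangles per second, from the text

# Assumed display parameters, for illustration only.
WIDTH, HEIGHT = 1600, 1200
REFRESH_HZ = 60

pixels_per_screen = WIDTH * HEIGHT
full_screen_fills_per_second = FILL_RATE // pixels_per_screen
pixels_per_refresh = FILL_RATE // REFRESH_HZ
triangles_per_refresh = TRIANGLE_RATE // REFRESH_HZ

print(full_screen_fills_per_second)  # 5000
print(pixels_per_refresh)            # 160000000
print(triangles_per_refresh)         # 5000000
```

Under these assumptions the quoted fill rate is enough to write every pixel of the screen roughly 83 times per refresh.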
Referring briefly to FIG. 2, in the multi-processor implementation,
processed pixel data 52 from the graphics processing circuit 34 is
provided as a first of two inputs to a high speed switch 70. The second
input to the high speed switch 70 is the processed pixel data 55 from the
graphics processing circuit 54. The high speed switch 70 has a switching
frequency (f) sufficient to provide the pixel information 71 to a
suitable display device without any detectable latency.
Returning to FIG. 5, the graphics processing circuit 54 includes a third
graphics pipeline 201 operative to process graphics data in a third set
of tiles. The third graphics pipeline 201 includes front end circuitry
135, which may be the front end circuitry 35 discussed with reference to
FIG. 2, a scan converter 137 and back end circuitry 139. The graphics
processing circuit 54 also includes a fourth graphics pipeline 202,
operative to process graphics data in a fourth set of tiles. The fourth
graphics pipeline 202 includes the front end circuitry 135, a scan
converter 140 and back end circuitry 142. The third graphics pipeline 201
and the fourth graphics pipeline 202 also operate independently of one
another. Thus, the graphics processing circuit 54 is configured as a
multi-pipeline circuit, where the back end circuitry 139 of the third
graphics pipeline 201 and the back end circuitry 142 of the fourth
graphics pipeline 202 share the front end circuitry 135, in that the
respective back end circuitry 139 and 142 receives the same pixel data
from the front end circuitry 135. As illustrated, the components of the
third and fourth graphics pipelines are present on a single chip.
Additionally, the back end circuitry 139 and the back end circuitry 142
may be configured to share the front end circuitry 35 of the graphics
processing circuit 34. Alternatively, the third and fourth graphics
pipelines may be configured to be on multiple chips interconnected by a
communication path, for example, a synchronization signal or data bus.
The front end circuitry 135 may include, for example, a vertex shader,
set up circuitry, rasterizer or other suitable circuitry operative to
receive the primitive data 31 from the AGP 32 and generate pixel data 136
to be processed by the third graphics pipeline 201 and fourth graphics
pipeline 202, respectively. The front end circuitry 135 generates the
pixel data 136 by performing, for example, clipping, lighting, spatial
transformations, matrix operations, rasterization or any suitable
primitive operations or combination thereof on the primitive data 31. The
pixel data 136 is then transmitted to the respective scan converters 137
and 140 of the two graphics pipelines 201-202.
The scan converter 137 of the third graphics pipeline 201 receives the
pixel data 136 and sequentially provides the position (e.g. x, y)
coordinates 160 in screen space of the pixels to be processed by the back
end circuitry 139, based on the tile identification data pre

