`
`3 0 05 0
`a -H L
`e 13 xj
`0 05 -H Qj
`U U _J >
`■H 'H
`111 I—< Ll Qj
`lu acn u
`ixi au Q)
`^ 05 D q:
`
`MEDIATEK, Ex. 1021, Page 1
`IPR2018-00101
`
`
`
`CompuhrGrapHic^
`
`AND/applications
`
`f
`
`IEEE Computer Society
`10662 Los Vaqueros Circle
`PO Box 3014
`Los Alamitos, CA 90720-1264
`
`Editor in Chief
`Peter R. Wilson
`
`Submissions: Please submit six copies of all articles and
`proposals for special issues to Peter R. Wilson, editor in chief,
`NIST, Bldg. 220, Rra. A127, Gaithersburg, MD 20899, e-mail
`pwilson@cme.nist.gov. All submissions are subject to editing
`for style, clarity, and space considerations. An author’s guide
`is available from the publications office or by e-mail from
`nhays@computer.org.
`
`Editorial Board
`
`Frank W. Bliss
`Jack E. Bresenham
`Thomas A. DeFanti
`John C. Dill
`Rae A. Earnshaw
`Jose L. Encarnacao
`Nick England
`Andrew Glassner
`Bertram Herzog
`Tosiyasu L. Kunii
`Carl Machover
`Gregory M. Nielson
`Michael J. Potel
`Michael L. Rhodes
`Philip K. Robertson
`Lawrence J. Rosenblum
`
`EDS/GM Corp.
`Consultant
`U. of Illinois at Chicago
`Simon Eraser Univ.
`University of Leeds
`Technical U. of Darmstadt
`U. of North Carolina
`Xerox PARC
`Consultant
`U. of Aizu
`Machover Associates
`Arizona State U.
`Taligent
`Toshiba America MRI
`CSIRO, Australia
`Office of Naval Research
`
`Managing Editor
`Staff Editor
`Assistant Editor
`Publisher
`Assistant Publisher
`Editorial Dir., Magazines
`Advertising Manager
`Advertising Coordinator
`Member/Circ. Promos Mgr.
`Editorial Secretary
`Art Direction
`
`Nancy Hays
`Linda World
`Karen Whitehouse
`True Seaborn
`Matt Loeb
`Marilyn Potes
`Heidi Rex
`Marian Tibayan
`John Gill
`Carole Danner
`Joe Daigle
`
`Publications Board
`
`Barry W. Johnson (chair),
`Jon T. Butler, James J. Farrell III,
`Tadao Ichikawa, Mary Jane Irwin, Raymond Miller,
`Michael C. Mulder, Theo Pavlidis,
`John P. Riganati, Harold Stone, Ronald D. Williams
`
`Editorial: IEEE Computer Graphics and Applications serves
`both users and designers of graphics hardware, software, and
`systems. Its readers work in industry, business, the arts, and
`universities.Unless otherwise stated, bylined articles and
`departments, as well as descriptions of products and services,
`reflect the author’s or firm’s opinion; inclusion in this
`publication does not necessarily constitute endorsement by the
`IEEE or the Computer Society.
`
`Copyright and reprint permission: Abstracting is permitted
`with credit to the source. Libraries are permitted to photocopy
`beyond the limits of US copyright law for private use of
`patrons those articles that carry a code at the bottom of the
`first page, provided the per-copy fee indicated in the code is
`paid through the Copyright Clearance Center, 29 Congress St.,
`Salem, MA 01970. Instructors are permitted to photocopy
`isolated articles for noncommercial classroom use without fee.
`For other copying, reprint, or republication permission, write
`to Permissions Editor, IEEE Computer Graphics and
`Applications, 10662 Los Vaqueros Circle, PO Box 3014, Los
`Alamitos, CA 90720-1264. All rights reserved. Copyright ©
`1994 by the Institute of Electrical and Electronics Engineers,
`Inc.
`
`Circulation: IEEE Computer Graphics and Applications (ISSN
`0272-1716) is published bimonthly by the IEEE Computer
`Society, 10662 Los Vaqueros Circle, PO Box 3014, Los
`Alamitos, CA 90720-1264; (714) 821-8380. IEEE Computer
`Society Headquarters: 1730 Massachusetts Ave., Washington,
`DC 20036-1903. IEEE Headquarters: 345 East 47th St., New
`York, NY 10017-2394. Annual subscription: $29 in addition to
`any IEEE group or society dues. Members of other technical
`organizations: $53. Nonmember subscription rates are
`available on request. Back issues: members, $10; nonmembers,
`$20. This journal is also available in microfiche form.
`
`Postmaster: Send address changes and undelivered copies to
`IEEE Computer Graphics and Applications, IEEE Computer
`Society, 10662 Los Vaqueros Circle, PO Box 3014, Los
`Alamitos, CA 90720-1264. Second class postage is paid at New
`York, NY, and at additional mailing offices. Canadian GST
`#125634188. Printed in USA.
`
`Moving?
`
`PLEASE NOTIFY Mail to
`US 4 WEEKS
`IEEE Computer Society
`IN ADVANCE
`10662 Los Vaqueros Circle
`PO Box 3014
`Los Alamitos, CA 90720-1264
`• This notice of address change will apply to all IEEE publications to
`which you subscribe.
`• List your new address below.
`• If you have a question about your subscription, place your address label
`here and clip this form to your letter.
`ATTACH
`LABEL
`HERE
`
`Magazine Operations Committee
`
`Name (please print)
`
`James J. Farrel III (chair),
`Valdis Berzins, B. Chandrasekaran,
`Carl Chang, Dante Del Corso, Ramesh Jain,
`John A.N. Lee, Ted Lewis, Michael J. Quinn,
`Ahmed Sameh, Ken Wagner, Peter R. Wilson
`
`New address
`
`City
`
`State/Country
`
`^ ZIP
`
`MEDIATEK, Ex. 1021, Page 2
`IPR2018-00101
`
`
`
`SHORT NOTE
`
`Bump Shading for Volume Textures
`Nelson L. Max and Barry G. Becker
`Bump mapping works well for parameterized surfaces. Here we interpret the 3D texture as a
`displacement function to be added to the surface position.
`
`PARALLEL RENDERING
`
`21
`23
`
`33
`
`41
`
`49
`
`59
`
`69
`
`Guest Editors’ Introduction: Recent Developments in Parallel Rendering
`Scott Whitman, Charles D. Hansen, and Thomas W. Crockett
`
`A Sorting Classification of Parallel Rendering
`Steven Molnar, Michael Cox, David Ellsworth, and Henry Fuchs
`This conceptual model defines three broad sorting classes that illuminate the trade-offs
`between different parallel rendering approaches.
`
`A New Algorithm for Interactive Graphics on Multicomputers
`David A. Ellsworth
`This polygon rendering algorithm uses a new load-balancing method that exploits the frame-
`to-frame coherence of interactive applications.
`
`Dynamic Load Balancing for Parallel Polygon Rendering
`Scott Whitman
`A new graphics renderer incorporates novel partitioning methods for efficient execution on
`a parallel computer. It requires little overhead, executes efficiently, and demands minimal
`processor synchronization.
`
`Communication Costs for Parallel Volume-Rendering Algorithms
`Ulrich Neumann
`Analyzing the intrinsic communication costs of three classes of parallel volume-rendering
`algorithms yielded estimates of how varying system and data sizes would affect a system s
`communication time.
`
`Parallel Volume Rendering Using Binary-Swap Compositing
`Kwan-Liu Ma, James S. Painter, Charles D. Hansen, and Michael F. Krogh
`This algorithm distributes data and computations to individual processing nodes for
`rendering subimages, then composites the final image with a method that uses all nodes
`at all times.
`
`Distributed Data and Control for Ray Tracing in Parallel
`Didier Badouel, Kadi Bouatouch, and Thierry Priol
`Distributed-memory parallel computers offer high performance for ray tracing when
`supplied with shared virtual memory. They also avoid complex message-passing techniques.
`
`July 1994
`
`Vol. 14 No. 4 (ISSN 0272-1716)
`
`Published by IEEE Computer Society
`
`MEDIATEK, Ex. 1021, Page 3
`IPR2018-00101
`
`
`
`'1
`
`DEPARTMENTS
`4 Letters to the Editor
`Fanning the Flames for Fortran
`5 About the Cover
`Building Castles Out of Pixels
`8 Displays on Display
`Mark of the Hand—by Mouse
`12 Visualization Blackboard
`The Cosmic Worm
`15 Applications
`Decriminalizing the Fingerprint
`78 Jim Blinn’s Corner
`Quantization Error and Dithering
`83 Graphic News
`Is There Life After Commodore?;
`Campuses Chosen for New Media Centers
`DOD Funds Flat-Panel Display R&D
`86 Conferences
`The Future of VR;
`State of VR Art in Medicine
`89 Calendar
`92 New Products
`96 Advertiser/Product Index
`CIII Final Focus
`
`Change of address form, p. 1
`Computer Society Information, p. 17
`Reader Service Card, p. 96a & b
`
`MEDIATEK, Ex. 1021, Page 4
`IPR2018-00101
`
`
`
`A Sorting Classification of
`Parallel Rendering
`
`Steven Molnar
`University of North Carolina at Chapel Hill
`Michael Cox
`Princeton University
`David Ellsworth and Henry Fuchs
`University of North Carolina at Chapel Hill
`
`This conceptual model
`defines three broad sorting
`classes that illuminate the
`trade-offs between
`different parallel
`rendering approaches.
`
`Graphics rendering is notoriously compute intensive, par
`
`rasterization are performed in parallel. This classification
`ticularly for realistic images and fast updates. Demanding
`scheme supports the analysis of computational and communi
`applications, such as scientific visualization, CAD, vehicle sim
`cation costs, and encompasses the bulk of current and proposed
`ulation, and virtual reality, can require hundreds of Mflops of
`highly parallel renderers—both hardware and software.
`performance and gigabytes per second of memory bandwidth—
`We begin by reviewing the standard feed-forward rendering
`far beyond the capabilities of a single processor. For these rea
`pipeline, showing how different ways of parallelizing it lead to
`sons, parallelism has become a crucial tool for building
`three classes of rendering algorithms. Next, we consider each of
`high-performance graphics systems, whether special-purpose
`these classes in detail, analyzing their aggregate processing and
`hardware systems or software systems for general-purpose mul
`communication costs, possible variations, and constraints they
`ticomputers.
`may impose on rendering applications. Finally, we use these
`We can employ parallelism of various types at many levels.
`analyses to compare the classes and identify when each is likely
`For example, functional parallelism (pipelining) can speed crit
`to be preferable.
`ical calculations, and data parallelism can allow multiple results
`to be computed at once. Common data-parallel approaches are
`by object (object parallelism) and by pixel or portion of the
`screen (pixel or image parallelism).
`Several taxonomies of parallel rendering algorithms have been
`proposed.’ '* These taxonomies help in classifying and under
`standing systems but do not lend themselves easily to compari
`son or analysis. Some rendering systems have been analyzed in
`isolation.”'’ However, these analyses tend to focus on unique sys
`tem attributes and make comparisons between systems difficult.
`In this article, we describe a classification scheme that we be
`lieve provides a more structured framework for reasoning about
`parallel rendering. The scheme is based on where the sort from
`object coordinates to screen coordinates occurs, which we be
`lieve is fundamental whenever both geometry processing and
`
`About the title photograph
`The photograph above shows the simulation results of
`communications traffic between 36 sort-first processors in
`a 2D mesh rendering the National Computer Graphics
`Association's "rotating head" Picture-Level Benchmark
`{GPC Quarterly Report, Vol. 2, No. 4, 1992).
`Arrow color indicates the number of primitives
`transferred between processors for the 4.5 degree
`rotation of the head model between these two successive
`frames. Range is 0 (black) to 800 (white) using a heated-
`object spectrum.
`
`July 1994
`
`0272-17-16/94/$4.00 © 1994 IEEE
`
`23
`
`MEDIATEK, Ex. 1021, Page 5
`IPR2018-00101
`
`
`
`Parallel Rendering
`
`Graphics database traversal
`
`i
`
`G
`
`G
`
`G
`
`G
`
`Geometry
`processing
`
`V
`
`R
`
`R
`
`R
`
`R
`
`Rasterization
`
`Display
`
`Parallel rendering as
`a sorting problem
`Figure 1 shows a simplified version of the standard, feed
`forward rendering pipeline, adapted for parallel rendering. It
`consists of two principal parts: geometry processing (transfor
`mation, clipping, lighting, and so on) and rasterization (scan-
`conversion, shading, and visibility determination). In this article,
`we target rendering rates high enough that both geometry pro
`cessing and rasterization must be performed in parallel. We say
`such systems are fully parallel.
`Geometry processing usually is parallelized by assigning each
`processor a subset of the primitives (objects) in the scene. Ras
`terization usually is parallelized by assigning each processor a
`portion of the pixel calculations.
`The essence of the rendering task is to calculate the effect of
`each primitive on each pixel. Due to the arbitrary nature of the
`modeling and viewing transformations, a primitive can fall any
`where on (or off) the screen. Thus, we can view rendering as a
`problem of sorting primitives to the screen, as noted by Suther
`land, Sproull, and Schumacher in their seminal paper on visible-
`surface algorithms.* For fully parallel renderers, this sort
`involves a redistribution of data between processors, because re
`sponsibility for primitives and pixels is distributed.
`The location of this sort largely determines the structure of
`the resulting parallel rendering system. Understanding the va
`riety of system structures possible within the constraints of this
`distributed sort and realizable with available computational re
`sources is the main challenge for designers of fully parallel ren
`dering systems.
`The sort can, in general, take place anywhere in the render
`ing pipeline: during geometry processing (sort-first), between
`geometry processing and rasterization (sort-middle), or during
`rasterization (sort-last). Sort-first means redistributing “raw”
`primitives—before their screen-space parameters are known.
`Sort-middle means redistributing screen-space primitives. Sort-
`last means redistributing pixels, samples, or pixel fragments.
`Each of these choices leads to a separate class of parallel ren
`dering algorithms with distinct properties. We describe the
`classes briefly now and examine them in more detail later.
`
`Sort-first
`The aim in sort-first is to distribute primitives early in the
`rendering pipeline—during geometry processing—to proces
`sors that can do the remaining rendering calculations (Figure 2).
`We can do this by dividing the screen into disjoint regions and
`making processors (called renderers) responsible for all ren
`dering calculations that affect their respective screen regions.
`Initially, we assign primitives to renderers in some arbitrary
`
`Figure 1. Graphics pipeline in a fully parallel rendering system.
`Processors G perform geometry processing. Processors R perform
`rasterization.
`
`Figure 2. Sort-first redistributes raw primitives
`during geometry processing.
`
`Graphics database
`(arbitrarily partitioned)
`
`(Pre-transform)
`
`Geometry
`processing
`
`Rasterization
`
`fashion. When rendering begins, each renderer does enough
`transformation to determine into which region(s) each primitive
`falls, generally computing the screen-space bounding box of
`the primitive. We call this pre-transformation, and it may or
`may not involve actually transforming the primitive. In some
`cases, primitives will fall into the screen regions of renderers
`other than the one on which they reside. These primitives must
`then be redistributed over an interconnect network to the
`appropriate renderer (or renderers), which then perform the re
`mainder of the geometry-processing and rasterization calcula
`tions for these primitives.
`This redistribution of primitives at the beginning of the ren
`dering process is the distinguishing feature of sort-first. It clearly
`involves overhead, since it performs some geometry processing
`for some primitives on the wrong renderer. The results of these
`calculations must either be sent to or recomputed on the new
`renderer (s).
`Sort-first is the least explored of the three classes. To our
`knowledge, no one has built a sort-first system. Although sort-
`first may seem impractical at first, we will see later that it can re
`quire much less communication bandwidth than the other
`approaches, if primitives are tessellated or if frame-to-frame
`coherence can be exploited. We will discuss its potential ad
`vantages and disadvantages in more detail later.
`
`Sort-middle
`In sort-middle, primitives are redistributed in the middle of
`the rendering pipeline—between geometry processing and ras
`terization. Primitives at this point have been transformed into
`screen coordinates and are ready for rasterization. Since many
`systems perform geometry processing and rasterization on sep
`arate processors, this is a natural place to break the pipeline.
`In a sort-middle system, geometry processors are assigned,
`arbitrary subsets of the primitives to be displayed; rasterizers are
`assigned a portion of the display screen (generally a contiguous
`region of pixels, as in sort-first). The two processor groups can
`be separate sets of processors, or they can time-share the same
`physical processors.
`During each frame, geometry processors transform, light.
`
`24
`
`IEEE Computer Graphics and Applications
`
`MEDIATEK, Ex. 1021, Page 6
`IPR2018-00101
`
`
`
`A Sorting Classification of Parallel Rendering
`
`Figure 3. Sort-middle redistributes screen-space primitives between
`geometry processing and rasterization.
`
`Figure 4. Sort-last redistributes pixels, samples,
`or pixel fragments during rasterization.
`
`Graphics database
`(arbitrarily partitioned)
`
`and perform other geometric operations on their portion of the
`primitives, and classify them with respect to screen region
`boundaries. Then they transmit all these screen-space primi
`tives to the appropriate rasterizer or rasterizers, as shown in
`Figure 3.
`Sort-middle is general and straightforward. It has been the
`most common approach to date for both hardware”' and soft
`ware'’'’'^ parallel rendering systems. We will examine its ad
`vantages and disadvantages in more detail later.
`
`Sort-last
`Sort-last defers sorting until the end of the rendering
`pipeline—after primitives have been rasterized into pixels, sam
`ples, or pixel fragments. Sort-last systems assign each processor
`(renderer) arbitrary subsets of the primitives. Each renderer
`computes pixel values for its subset, no matter where they fall
`in the screen. Renderers then transmit these pixels over an in
`terconnect network to compositing processors, which resolve
`the visibility of pixels from each renderer (Figure 4).
`In sort-last, renderers operate independently until the visi
`bility stage—the very end of the rendering pipeline. The inter
`connect network, however, must handle all the pixel data
`generated on all the renderers. For interactive or real-time ap
`plications rendering high-quality images, this can result in very
`high data rates.
`Sort-last can be done in at least two ways. One approach,
`which we call SL-sparse, minimizes communication by dis
`tributing only those pixels actually produced by rasterization.
`The second approach, called SL-full, stores and transfers a full
`image from each renderer. Both methods have advantages, as
`we will see later.
`Sort-last systems have existed in various forms for more than
`20 years. The 1967 GE NASA II flight simulator used a simple
`version of SL-full that assigned a processor to each primitive.'^
`Since then, several primitive-per-processor’ '"' and multiple-
`primitive-per-processor"’'"’ SL-full systems have been proposed.
`Several recent commercial systems have used the SL-sparse
`approach.
`
`Processing and
`communication model
`We now analyze each of the three rendering methods in more
`detail. The aim is to build a quantitative model of their pro
`cessing and communication costs to use as a basis for compar
`ing them. We first consider the processing required to render on
`a uniprocessor. Then we evaluate the additional processing and
`communication requirements of the parallel methods. We focus
`
`Terms used in the analysis
`
`N
`
`dr. 3d
`
`Or.O,
`
`frJd
`
`T
`S
`c
`
`The resolution of the screen, in pixels.
`The number of processors in the multiprocessor.
`The number of raw and display primitives,
`respectively.
`The average size, in pixels, of a raw and display
`primitive.
`The average number of screen regions that raw
`and display primitives overlap.
`The fraction of pre-tessellation and post
`tessellation geometry processing that is
`duplicated when a raw primitive overlaps
`multiple screen regions.
`The tessellation ratio, njn^.
`The number of samples per pixel.
`The fraction of raw primitives that must be
`redistributed each frame when sort-first takes
`advantage of frame-to-frame coherence.
`
`on the inherent overhead in the methods and do not specifi
`cally model factors such as load balancing, buffering, and trans
`port delay. Such factors affect performance significantly, but
`depend on the detailed system implementation and the graph
`ics scene itself. We will, however, diseuss these factors (partic
`ularly load balancing) where appropriate.
`See the sidebar for definitions of terms used in the analyses.
`
`Uniprocessor pipeline
`For the analysis that follows, we refine the rendering pipeline
`as shown in Figure 5 on the next page. First, some rendering
`systems tessellate primitives to generate higher quality images
`(RenderMan is one widely used example). Tessellation decom
`poses larger primitives into smaller ones, typically into polygons
`or polygonal meshes. Not all rendering packages perform tes
`sellation, but we include it in the pipeline because it can greatly
`expand the number of primitives that need to be displayed.
`
`July 1994
`
`25
`
`MEDIATEK, Ex. 1021, Page 7
`IPR2018-00101
`
`
`
`Parallel Rendering
`
`Pipeline stage
`
`Processing cost
`
`Geometry:
`Pre-tessellation
`Post-tessellation
`Pixel rendering
`
`Visibility
`
`Of «geom-pre-tess» -i-
`Pel «geom-post-tess»
`«rend-setup» +
`«rend»
`«comp»
`
`Second, we break rasterization into two stages called pixel
`rendering (computing pixel values) and visibility (determining
`which pixels are visible), since some algorithms perform these
`on separate processors. Finally, we do not explicitly mention
`shading, which can be a major consumer of processing cycles,
`but can occur almost anywhere in the pipeline. Shading should
`be considered a part of geometry processing or of pixel ren
`dering, as appropriate.
`We assume that we are rendering a data set containing raw
`primitives with average size a,. We will call primitives that re
`sult from tessellation display primitives. If T is the tessellation
`ratio, there are = Tn^ display primitives, with average size
`= aJT. If there is no tessellation, 7 = 1,
`and = a,. We
`assume that we are rendering an image containing A pixels and
`that we are to compute S samples per pixel. For simplicity, we
`assume that all primitives are potentially visible (that is, lie
`within the viewing frustum).
`
`Processing costs
`Figure 5 lists the processing costs for each stage of the unipro
`cessor rendering pipeline. First, the raw primitives are pro
`cessed by the stages of the geometry pipeline up to tessellation.
`The cost for this is «geom-pre-tess» (“«...»” denotes cost,
`most naturally in units of time). The resulting display prim
`itives are then processed by the geometry pipeline stages fol
`lowing tessellation, at a cost of «geom-post-tess».
`Rasterization follows. Pixel rendering is the first stage and con
`sists of two parts: setup and per-sample rendering operations,
`whose costs are «rend-setup» and n^a^S «rend», respectively.
`Finally, the cost of compositing is n^a^S «comp», where «comp»
`is the cost of compositing one sample (generally a z comparison
`and conditional write). There is no data redistribution or “sort”
`in the uniprocessor model.
`Sort-first analysis
`We begin our analysis with sort-first. Figure 6 shows its ren
`dering pipeline and what it costs beyond uniprocessor render
`ing. Our analysis assumes that primitives are redistributed as
`early in the rendering pipeline as possible and that renderers dis
`card transformed data when they send a primitive to a new ren-
`derer. (An alternative is to send the transformed data. Although
`we will not focus on this alternative, its analysis is similar to the
`case we present here.)
`
`Processing and communication costs
`The first steps in sort-first are to “pre-transform” the raw
`primitives so that their screen extents are known and to classify
`them with respect to screen-region buckets. Each bucket be
`longs to a region, and a primitive falls in several buckets when
`
`Figure 5. Rendering pipeline and processing costs in a uniprocessor
`implementation.
`
`Figure 6. Sort-first processing and communication overhead
`(communication costs indicated by box).
`
`Pipeline stage
`
`SF overhead
`
`Pre-transform
`Bucketization
`Redistribution
`Geometry:
`Pre-tessellation
`Post-tessellation
`Pixel rendering
`Visibility
`
`«pre-xform»
`nPr «bucket^>
`enPr «prim;»
`
`nr(Or-1)/r«geom-pre-tess» +
`nJ^Ocj-^) frf«geom-post-tess»
`1) «rend-setup» +
`—
`
`it overlaps several regions. We define an overlap factor O^,
`which is the average number of regions a raw primitive overlaps.
`If the cost to precompute a primitive’s screen coordinates is
`«pre-xform» and the cost to put the primitive in each of its buck
`ets is «bucket,», then the overhead for these stages is n, «pre-
`xform» and «bucket,.».
`Next, primitives on the wrong renderer must be distributed
`to the correct renderer(s). The number of primitives redis
`tributed depends on the application and whether frame-to-
`frame coherence is employed. For example, if we are rendering
`a single frame, almost all the primitives must be sent. However,
`if we render multiple frames in an animation or real-time ap
`plication and the scene does not change much between frames,
`we may need to send only a small fraction of the primitives.
`(Note that this type of coherence is only available in sort-first,
`as it is the only technique that distributes raw primitives. The
`other methods distribute data that has undergone view-depen-
`dent processing.)
`To take this application-dependent behavior into account,
`we define the parameter c, the fraction of primitives that must
`be redistributed. The total network bandwidth required is then
`cn,0, «primr», where «prim,.» is the size of the data structure rep
`resenting a raw primitive. We will not explicitly tally the pro
`cessing cost of communication, but note that it is proportional
`to this term.
`After redistribution, sort-first algorithms may accrue other
`overhead that a uniprocessor pipeline would not. If a raw prim
`itive falls in exactly one bucket, then it undergoes geometry
`processing exactly once, and the costs are the same as they
`would be on a uniprocessor. If the primitive overlaps more than
`one region, however, there will be duplication of effort. Each
`additional processor responsible for a given raw primitive must
`duplicate some fraction of pre-tessellation geometry pra-
`cessing and some fraction/^ of post-tessellation geometry pro
`cessing. These fractions will depend on the algorithms chosen
`for geometry processing. For example, explicit clipping to region
`boundaries requires duplication of effort, while implementa
`tions that avoid clipping avoid this extra work. From these con
`siderations, the additional cost of geometry processing for the
`
`26
`
`IEEE Computer Graphics and Applications
`
`MEDIATEK, Ex. 1021, Page 8
`IPR2018-00101
`
`
`
`A Sorting Classification of Parallel Rendering
`
`Figure 7. Sort-middle processing and communication overhead
`(communication costs indicated by box).
`
`Pipeline stage
`
`SM overhead
`
`Geometry
`Bucketization
`Redistribution
`Pixel rendering
`Visibility
`
`_
`«bucket(j»
`ndOd«prim£^>
`nJ^Ofd-^) «rend-setup»
`—
`
`/I, raw primitives is - 1)/, «geom-pre-tess». For the Tn^
`display primitives, it is n^O^- \ )f,, «geom-post-tess».
`Finally, each display primitive that overlaps several screen re
`gions requires pixel rendering setup for each region. This adds
`processing overhead of
`1) «rend-setup».
`
`Frame-to-frame coherence
`If we are rendering a single frame, most primitives will be on
`the wrong renderer and will require redistribution—unless we
`are very lucky. Hence, c will be close to 1. However, if we ren
`der several related frames in sequence, we may be able to take
`advantage of frame-to-frame coherence to reduce overhead.
`To do this, processors that render a primitive during one frame
`must retain that primitive for the next frame. If there is signifi
`cant spatial coherence between frames, fewer primitives will
`require redistribution during the next frame and c will be close
`to 0. Of course, the application must support retained-mode
`(display-list) rendering, and bookkeeping is needed to track
`primitive ownership.
`To get an estimate of c, we have analyzed traces from two ren
`dering sequences: the rotating head model in the NCGA Graph
`ics Performance Committee’s Picture-Level Benchmark^” (see
`the title-page photo) and an architectural walkthrough of a
`building interior seen through a head-mounted display. In both
`cases, we assumed a display size of 1,280 x 1,024 pixels and a re
`gion size of 64 x 64 pixels. In the Picture-Level Benchmark, the
`head contains 60,000 triangles and rotates 4.5 degrees each
`frame. Here, c varied between 0.13 and 0.16. The architectural
`model contained 64,796 triangles. During a 60-second explo
`ration of a room within the model, c varied between 0.0 and
`0.64 with a mean value of 0.15. Thus, we see that c can be quite
`small in practice, making sort-first appealing for applications
`with substantial coherence.
`
`Tessellation and oversampling
`If a system employs tessellation and oversampling, then each
`raw primitive generates T display primitives, and each of these
`in turn generates samples. This means that for each raw
`primitive redistributed by sort-first, sort-middle must redis
`tribute Tdisplay primitives and sort-last must redistribute Ta^S
`samples. If T and S are large, it may make the most sense to re
`distribute raw primitives, since there are far fewer of them. We
`will see after considering sort-middle and sort-last that sort-
`first has the lowest cost under these circumstances.
`
`Load balancing
`Sort-first is susceptible to several types of load imbalance (as
`are the other two approaches). The most obvious is the initial dis
`tribution of primitives across processors. Even if equal numbers
`of primitives are assigned to each processor, primitives may re
`quire different amounts of work, so load imbalances can result.
`
`A second type of load imbalance arises from the distribution
`of primitives over the screen. In sort-first, some regions of the
`screen will very likely receive many more primitives than oth
`ers. Also, some primitives may take longer to process than oth
`ers. A way to combat this is to make regions smaller and make
`each processor responsible for more than one region. This can
`be done statically, which is simple, but might not achieve an
`optimal load distribution for any given frame. It can also be
`done dynamically, which increases algorithm complexity but
`may achieve better results. When using frame-to-frame coher
`ence, we must constrain dynamic load balancing so that each re
`gion is assigned to the same processor in successive frames.
`Also, dividing the screen more finely to improve the load bal
`ance tends to increase c.
`
`Advantages and disadvantages
`In summary, sort-first has two advantages:
`
`• Communication requirements are low when the tessellation
`ratio and the degree of oversampling are high, or when frame-
`to-frame coherence can be exploited.
`• Processing nodes implement the entire rendering pipeline for
`a portion of the screen.
`
`Its disadvantages, however, are
`
`• Susceptibility to load imbalance. Primitives may clump into re
`gions, concentrating the work on a few renderers.
`• Necessity of retained mode and complex data handling code
`to take advantage of frame-to-frame coherence.
`
`Sort-middle analysis
`Sort-middle algorithms are in some sense the most natural. In
`contrast to sort-first, they redistribute primitives after geome
`try processing, that is, after screen coordinates have already
`been computed. Figure 7 shows the sort-middle rendering
`pipeline and its additional costs.
`
`Processing and communication costs
`In sort-middle, the cost of geometry processing is the same as
`on a uniprocessor. Sort-middle algorithms first accrue overhead
`when they redistribute display primitives. Each of the prim
`itives must be placed into buckets at a total cost of
`«buckeL». The bandwidth required to send these buckets is
`^dOd «prim^,», and the processing cost of communication is pro
`portional. After redistribution, display primitives that overlap
`more than one region require extra pixel rendering setup. The
`cost for this is (O^-1) times the cost on a uniprocessor. Visibility
`calculations cost the same in sort-middle as they would on a
`uniprocessor.
`
`July 1994
`
`V
`
`27
`
`MEDIATEK, Ex. 1021, Page 9
`IPR2018-00101
`
`
`
`Parallel Rendering
`
`Figure 8. Sort-last processing and communication overhead
`(communication costs indicated by box).
`
`Pipeline stage
`
`SL-sparse overhead SL-full overhead
`
`Geometry
`Pixel rendering
`Redistribution
`Visibility
`
`— —
`
`«sample»
`—
`
`NAS «sample»
`NAS «comp»
`
`From Figure 7 we can see that the overhead for sort-middle
`depends on the number of display primitives and on the dis
`play-primitive overlap factor