`MEDIATEK, Ex. 1013, Page 1
`IPR2018-00101
`
`
`
`
`
PROCEEDINGS

Annual Conference Series 2001

SIGGRAPH 2001
Conference Proceedings
August 12-17, 2001
Papers Chair: Eugene Fiume

A Publication of ACM SIGGRAPH

Sponsored by the ACM's Special
Interest Group on Computer
Graphics
`
`
`
2001: EXPLORE INTERACTION
AND DIGITAL IMAGES
`
`
`
`
SIGGRAPH 2001, Los Angeles, California, August 12-17, 2001

The Association for Computing Machinery, Inc.
1515 Broadway
New York, New York 10036

Copyright © 2001 by the Association for Computing Machinery, Inc. (ACM). Permission to make digital or hard copies of
portions of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyright for
components of this work owned by others than ACM must be honored. Abstracting with credit is permitted.

To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Request permission to republish from: Publications Department, ACM, Inc. Fax +1-212-869-0481 or e-mail
permissions@acm.org.

For other copying of articles that carry a code at the bottom of the first or last page, copying is permitted provided that the
per-copy fee indicated in the code is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923.

Notice to Past Authors of ACM-Published Articles
ACM intends to create a complete electronic archive of all articles and/or other material previously published by ACM. If you
have written a work that was previously published by ACM in any journal or conference proceedings prior to 1978, or any
SIG newsletter at any time, and you do NOT want this work to appear in the ACM Digital Library, please inform
permissions@acm.org, stating the title of the work, the author(s), and where and when published.
`
ACM ISBN: 1-58113-374-X

Additional copies may be ordered prepaid from:

ACM Order Department
P.O. Box 11405
Church Street Station
New York, NY 10286-1405

Phone: 1-800-342-6626
(USA and Canada)
+1-212-626-0500
(All other countries)
Fax: +1-212-944-1318

E-mail: acmhelp@acm.org

ACM Order Number: 428010

Printed in the USA
`
`
Computer Graphics Proceedings, Annual Conference Series, 2001

A User-Programmable Vertex Engine

Erik Lindholm       Mark J. Kilgard     Henry Moreton
erikl@nvidia.com    mjk@nvidia.com      moreton@nvidia.com

NVIDIA Corporation
`
performance has driven, and been driven by, increasingly rich
graphics APIs. The motivation behind the creation of the
user-programmable geometry engine described in this paper is
twofold. First, the increasing configurability required by
continually evolving graphics APIs requires a programmable device
to support the combinatorial explosion of mode combinations.
Second, high-performance programmability is an end unto itself.
Given the right programming model, with a sufficient degree of
target processor independence, the need for rapidly evolving
graphics APIs is reduced, and an opportunity is created for
inventiveness unconstrained by fixed-function, modally configured
APIs and hardware. Further, compatibility across hardware
generations and platforms will increase the lifespan and utility of
programs written for geometry processors.

The programming model and design of the geometry engine in the
GeForce3 was guided by several factors: commodity pricing, design
time, area, legacy performance, programmable performance,
programmability, and platform independence. Ultimately, all of
these influence the commercial viability of the design. Design time
obviously determines time to market. Area is directly linked to
product cost. Previously existing applications must exhibit higher
performance on new products. There can only be a slight performance
penalty paid for taking advantage of programmability. To gain
acceptance, the engine must be easy to program. Finally, to promote
adoption across vendors, a standard interface is required and thus
the functionality cannot be too tightly coupled to a specific
hardware implementation; for example, CPU implementations must be
viable.
We provide a taxonomic description of previous programmable
graphics processors, comparing them to our device. We show how the
programming model can be effectively supported by a custom
processor design. We describe how a programmable processing element
can be incorporated into an existing graphics API. Finally, we
illustrate how the programming model and interface may be used to
efficiently implement complex custom effects.
2 PREVIOUS WORK
Geometric calculations have been accelerated for over 30 years,
starting with early flight simulators. Among the best known is the
Geometry Engine [5]. A system was built from 12 instances of the
GE, coupled with a raster subsystem built out of AMD 2903s. The GE
was fabricated using a 3 µm feature size and housed in a 40-pin
package. The GeForce3 GPU is manufactured using a 0.18 µm process
with a ~550-pin package. So while available logic has increased by
a factor of 300, the relative amount of available bandwidth has
only increased by a factor of 14. Note that increases in clock
frequency cancel in this relative measure. We provide these numbers
simply to illustrate that the problem is continually evolving, and
that the natural amount of computation performed by the GPU today
is far more than was performed in years past, and probably a
fraction of what will be appropriate tomorrow.
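
The two scaling factors quoted above can be checked with a short calculation. The Python sketch below takes the feature sizes as 3 µm and 0.18 µm and the pin counts as 40 and roughly 550. It assumes, as a simplification that the paper does not state, that available logic scales with the inverse square of the feature size and that available bandwidth scales with the pin count; die area and per-pin signalling rate are ignored.

```python
# Rough check of the logic and bandwidth scaling figures.
# Assumption (not from the paper): logic ~ 1 / feature_size**2,
# bandwidth ~ pin count.
GE_FEATURE_UM = 3.0          # Geometry Engine feature size
GEFORCE3_FEATURE_UM = 0.18   # GeForce3 process
GE_PINS = 40
GEFORCE3_PINS = 550          # the text says "~550"

logic_ratio = (GE_FEATURE_UM / GEFORCE3_FEATURE_UM) ** 2
bandwidth_ratio = GEFORCE3_PINS / GE_PINS

print(f"logic ratio     ~ {logic_ratio:.0f}")
print(f"bandwidth ratio ~ {bandwidth_ratio:.2f}")
```

Under these assumptions the ratios come out near 278 and 13.75, which is in line with the rounded factors of 300 and 14.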
`
The various products and technologies applied to performing the
standard geometry processing tasks can be categorized by a small
number of attributes: technology, arrangement, and
programmability. The technology is one of ASIC, DSP, RISC
`
`
ABSTRACT
In this paper we describe the design, programming interface, and
implementation of a very efficient user-programmable vertex
engine. The vertex engine of NVIDIA's GeForce3 GPU evolved from a
highly tuned fixed-function pipeline requiring considerable
knowledge to program. Programs operate only on a stream of
independent vertices traversing the pipe. Embedded in the broader
fixed function pipeline, our approach preserves parallelism
sacrificed by previous approaches. The programmer is presented
with a straightforward programming model, which is supported by
transparent multi-threading and bypassing to preserve parallelism
and performance.

In the remainder of the paper we discuss the motivation behind our
design and contrast it with previous work. We present the
programming model, the instruction set selection process, and
details of the hardware implementation. Finally, we discuss
important API design issues encountered when creating an interface
to such a device. We close with thoughts about the future of
programmable graphics devices.

Keywords
Graphics Hardware, Graphics Systems.
`
`
`
`
`
`
1 INTRODUCTION

[Figure 1 block diagram; legible block labels: Host Interface, Primitive Assembly/Setup, Raster/Texture, Framebuffer Interface]
Figure 1: Graphics Processing Unit (GPU)
`
Dramatic increases in the computational power of graphics
processing units (GPUs, Figure 1) have been fueled both by
innovation and the continuing improvement in process technologies.
The need for increased
`
Permission to make digital or hard copies of all or part of this
work for personal or classroom use is granted without fee provided
that copies are not made or distributed for profit or commercial
advantage and that copies bear this notice and the full citation on
the first page. To copy otherwise, to republish, to post on servers
or to redistribute to lists, requires prior specific permission
and/or a fee.

SIGGRAPH 2001, 12-17 August 2001, Los Angeles, CA.
© 2001 ACM 1-58113-374-X/01/08...$5.00
`
`
`
CPU, and CPU extensions. Arrangement refers to the approaches to
exploiting parallelism, such as SIMD or MIMD. Each system's
programmability may be characterized by whether it was intended
for end-user programming, and the relative ease with which it was
programmed.

The only non-parallel implementation, the Stellar GS1000 [4], used
a supercomputer-like vector processor, and was driven by hand-coded
assembly for critical paths.

Pixar's CHAP [17] and the Ikonas [7] are early examples of
fine-grain SIMD processors, based on the AMD 2903, user
micro-codable by skilled programmers. These machines operated in
parallel on pixel and vertex components. The only coarse-grained
SIMD implementation of which we are aware is the geometry
subsystem of the Indigo Extreme [11]. It was implemented using a
hand micro-coded ASIC. The Indigo processed eight triangles in
parallel, stalling if any of the group were clipped, or otherwise
required branching.

Following the original Geometry Engine, the IRIS GT [3] and The
Pixel Machine [24] were the only machines to arrange floating
point DSPs in pipeline fashion. As has been observed by many, the
slowest processor in the pipeline gated these machines'
performance. Since it was only practical to distribute the geometry
tasks statically, the pipelines were inefficient for certain
workloads.

MIMD machines dominate the history of geometry processors. In each
case the individual processors operated on single triangles. The
Raster Tech GX4000 [26],[27] was the earliest example, followed by
Pixel-Planes 5 [10], the DN10000VS [15], Pixel Flow [19], and the
RealityEngine [2]. The GX4000 used a Weitek floating point DSP,
while all but one of the remaining machines used the i860XP [13], a
64-bit microprocessor. The last of the MIMD geometry subsystems was
the InfiniteReality [23], using a custom micro-coded ASIC built to
exceed the performance available in third party processors. The
InfiniteReality's processor was micro-coded in SIMD fashion within
each of the processors in a MIMD array of configurable size.
`
Alternatives to the above large high-performance machines are the
processor extensions, all of which exploit fine-grained SIMD
parallelism similar to the CHAP and Ikonas. Each of these exploits
the existing resources and clock rate of a general purpose CPU to
deliver high performance. MIPS-3D ASE [18] and 3DNow! [1] perform
paired single SIMD floating point operations. Intel's SSE
instructions [14] express 4-wide SIMD processing. Motorola's
AltiVec [9] delivers the full 4-wide SIMD performance. Sony's
Emotion Engine [16] has two 4-wide SIMD processors. The first is
interfaced to the main CPU as a coprocessor, executing instructions
directly from the application's instruction stream. The second
processor is more loosely coupled, running loaded subroutines,
typically performing standard geometry processing tasks.

In all cases, experts were required to very carefully craft
assembly code to achieve processor performance approaching
theoretical peaks. Close attention to pipeline latency, hazards,
and stall conditions was necessary to produce good results. While
compilers were generally available, generated code was typically of
inadequate performance.

In contrast to virtually all of these systems, our geometry engine
only exposes the programmability of a small part of the larger
geometry pipeline. Tasks such as vertex load/store, format
conversion, primitive assembly, clipping, and triangle setup occur
completely in parallel, in pipeline fashion. We use 4-wide
fine-grained SIMD floating point to provide the necessary
performance, and run multiple execution threads to maintain
efficiency and provide a very simple programming model.
`
3 PROGRAMMING MODEL
In this section we describe our programming model for geometry
processing and discuss the design in the areas of input, output,
data path, and instruction set selection. We include the rationale
for choices made in the design process.

3.1 Vertex Processing
There were two main possibilities for processing the vertex
stream: as independent vertices or as part of a geometric
primitive, for example a triangle. The advantage of primitive-level
information is enabling operations such as culling, reducing
processing time. However, we determined that the increases in
complexity and loss of parallelism in the primitive processing
model did not justify the perceived benefits. We chose an
independent vertex program model to exploit the parallel nature of
the task, and greatly simplify the resulting programming task.
We preserved the latter stages of the fixed function programming
model, there being no benefit to their programmability. In fact,
incorrect clipping could freeze a hardware rasterizer. As such we
leave frustum clipping, perspective divide, and viewport scale and
bias to subsequent implementation-specific processing. The
programming model is capable of expressing everything in the fixed
function pipeline except user clip planes. We instead recommend
encoding plane distances into texture coordinates and using
fragment level operations to implement this functionality.
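
The clip-plane recommendation above can be illustrated with a small Python sketch. It assumes the plane distance is a 4-component dot product of the vertex position with the plane coefficients, written into one texture coordinate component so that a later fragment-level test can reject fragments where the interpolated value is negative. The helper names and the sample plane are hypothetical and are not taken from the paper.

```python
def dp4(a, b):
    """4-component dot product, as a DP4-style instruction would compute."""
    return sum(x * y for x, y in zip(a, b))


def clip_distance_texcoord(position, plane):
    """Signed plane distance term for one user clip plane (hypothetical helper).

    position: homogeneous vertex position [x, y, z, w]
    plane:    plane coefficients [a, b, c, d]
    The result would be stored in a texture coordinate component.
    """
    return dp4(position, plane)


# Made-up plane that keeps the half-space y >= 2.
plane = [0.0, 1.0, 0.0, -2.0]
inside = clip_distance_texcoord([1.0, 3.0, 0.0, 1.0], plane)
outside = clip_distance_texcoord([1.0, 0.5, 0.0, 1.0], plane)
print(inside, outside)
```

A positive value marks a vertex on the kept side of the plane and a negative value marks one on the clipped side; the per-fragment rejection itself is outside the scope of this sketch.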
3.2 Precision and Data Type
IEEE single precision floating point has been used for many years
as the standard precision for 3D transformations, and to keep the
model simple it was adopted as the only data type. The common data
in 3D graphics are 3 and 4 component vectors, for example position,
normal, texture coordinates and colors. The basic data type is
therefore the quad-float vector written as [x,y,z,w].
`
3.3 Scalar and Vector Handling
It was critical to deal efficiently with scalar packing/extraction
and vector data in this design since the 3D transform pipeline
mixes these operations. Two simple concepts can resolve this:
1. On input, vectors can have their components arbitrarily
rearranged/replicated (swizzled).

2. Any operation generating a scalar must generate that scalar
replicated across all components, and output writes have a
component write mask.

A scalar value in a vector register can be replicated into a vector
through (1), and then stored again as a scalar through (2).
Swizzling is very useful for doing cross products efficiently,
where the source vectors need to be rotated. Another use is
converting constants such as [-1,0,1,2] into others such as
[0,0,1,0] or [-1,-1,-1,1].
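
The two concepts can be sketched in a few lines of Python. The helper names `swizzle` and `masked_write` are hypothetical, and the source constant is assumed to be [-1, 0, 1, 2]; the sketch only models the data movement, not the hardware.

```python
INDEX = {"x": 0, "y": 1, "z": 2, "w": 3}


def swizzle(vec, pattern):
    """Rearrange or replicate components, e.g. swizzle(v, "wwww")."""
    return [vec[INDEX[c]] for c in pattern]


def masked_write(dest, value, mask="xyzw"):
    """Return dest with only the masked components replaced by value."""
    out = list(dest)
    for c in mask:
        out[INDEX[c]] = value[INDEX[c]]
    return out


# Concept (1): replicate the scalar held in r1.w across a vector.
# Concept (2): store it back as a scalar by writing only r0.x.
r1 = [5.0, 6.0, 7.0, 8.0]
r0 = masked_write([0.0, 0.0, 0.0, 0.0], swizzle(r1, "wwww"), "x")

# Derive new constants from one stored constant by swizzling alone.
const = [-1.0, 0.0, 1.0, 2.0]
unit_z = swizzle(const, "yyzy")
signs = swizzle(const, "xxxz")
print(r0, unit_z, signs)
```

One stored constant therefore serves several purposes, which matters when only one constant may be read per instruction.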
`
3.4 Program Model
The program model is illustrated in Figure 2. The current vertex
attributes are available in the input (source) registers, and the
processed vertex is written into the output (destination)
registers. The constant bank holds transform and light parameters,
and the register file (R) holds temporary results. A function unit
(F) implements the instruction set.

Making the vertex source read-only by the vertex program, and the
destination write-only, recognizes the streaming nature of the
design and simplifies implementation.
`
`
(clamped, only valid for points). Having a fog output permits more
general fog effects than using the position's z or w values, and is
interpolated before use as a distance in the standard fog
equations. We allow for up to eight texture coordinate sets that
can be used for traditional texturing as well as more novel effects
in combination with GeForce3's texture shader and register
combiners per-fragment functionality [20]. Texture coordinates are
assumed to be full precision and range, as well as perspective
correct when used in pixel programs.
All instruction writes have an optional 4-component write mask.

Table 1: Output Attributes

All vertex output registers are initialized to (0.0,0.0,0.0,1.0) at
the start of a vertex program. Subsequent writes then apply the
output write mask to update the selected components. This avoids
any problems with undefined outputs, and having to verify raster
subsystem input options.

3.7 Instruction Set
The instruction set consists of 17 operations. These can be divided
into vector, scalar, and miscellaneous operations. We discuss the
instructions selected after explaining the constraints we chose to
impose.
`
`Figure 2: Program Model
`
`
`
3.5 Input Attributes
There are 16 quad-float vertex source attribute registers. Fixed
function mode typically requires a position, normal, two colors, up
to eight texture coordinate sets, skin weights, fog, and point
size. These are sent from the host in many formats including
shorts, integers, and floats, with conversion to floating point
done before the data is accessed. Unspecified attribute components
default to 0.0 for the second and third components, and 1.0 for the
fourth. The attributes are all persistent, that is, they retain
their data until they are changed by subsequent API calls. They are
addressed from 0 to 15. An API write to attribute 0 (the vertex
position when in fixed function mode) will invoke the vertex
program. Only one vertex attribute may be read per instruction.
`
`
To hold constants such as matrices, light positions, and plane
equations that are used in typical vertex programs, there is a
constant bank of 96 quad-floats. It may only be loaded before
vertices are processed (for example outside of Begin/End). The size
was chosen based on fixed function memory usage, and to allow a
reasonably large set of matrices for indexed skinning. As with
source attributes, only one constant may be read by one
instruction. The program may not write to constants because it
would create a dependency between vertices, forcing serialization
and causing a serious performance impact.
There is one integer address register that may be loaded using an
instruction (ARL). This address register allows for indexed
constant reads, with out-of-range reads returning the
(0.0,0.0,0.0,0.0) vector.

The register file is 12 quad-floats in size and allows three reads
and one write per instruction. The size was chosen to allow
reasonably simple modular code design, where some of the registers
would be used for storage of variables across multiple modules. All
registers are initialized to (0.0,0.0,0.0,0.0) per vertex.
A vector read may be sourced as multiple operands, and
independently swizzled/negated each time; see Figure 2. Since any
source can be negated there is no need for a subtract instruction.
`
3.6 Output Attributes
Because vertex program outputs merge back into the fixed function
pipeline at the homogeneous clip space point, there is a standard
set of output attributes. Position is used for clipping. Vertex
color components are automatically clamped to the range 0.0 to
1.0. There is also a fog distance, and point size output
`
`
`
3.7.1 No Branching
The fixed function transform paths in OpenGL [25] and Direct3D [6]
are both controlled by global state that does not depend on the
actual data supplied with each vertex. This allows for driver
optimizations at the time the first vertex is supplied by the
application since all subsequent vertices (until a new state
change) can then share this carefully optimized path. The result is
a code segment that removes state checking and branching. It is
therefore possible to support the full fixed function transform
path (at least to homogeneous clip space) without branching. The
decision was therefore made to not support branching, keeping the
hardware as simple as possible. Also, late binding changes in
control flow disrupt pipeline efficiency. Simple if/then/else
evaluation is still supported through sum-of-products using 1.0 and
0.0, which can be generated with SLT and SGE.
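
A branch-free if/then/else of this kind can be sketched in Python as follows. The functions `slt` and `sge` model per-component set-on-less-than and set-on-greater-or-equal; `select_lt` is a hypothetical helper name, and the sample values are made up.

```python
def slt(a, b):
    """Per component: 1.0 where a < b, otherwise 0.0."""
    return [1.0 if x < y else 0.0 for x, y in zip(a, b)]


def sge(a, b):
    """Per component: 1.0 where a >= b, otherwise 0.0."""
    return [1.0 if x >= y else 0.0 for x, y in zip(a, b)]


def select_lt(a, b, then_v, else_v):
    """result = then_v where a < b, else else_v, as a sum of products."""
    take_then = slt(a, b)
    take_else = sge(a, b)
    return [t * x + e * y
            for t, e, x, y in zip(take_then, take_else, then_v, else_v)]


result = select_lt([0.2, 0.9, 0.5, 0.0], [0.5] * 4, [10.0] * 4, [20.0] * 4)
print(result)
```

Both candidate values are always computed and the masks pick one of them, so the instruction stream is identical for every vertex.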
3.7.2 Constant Latency
One instruction set constraint we imposed was that our hardware
implementation must issue any instruction per clock and execute
`
option. We also wanted an accurate power function conforming to
the OpenGL model, hence known approximations would not suffice. It
is possible to implement the LIT instruction with about 10 other
instructions, but the performance loss is extreme.
`1‘
`
The LOG base 2 instruction returns an output accurate to about 11
mantissa bits as well as two partial results: the exponent and
mantissa of the source scalar. A more accurate user-programmed
approximation based on the limited range mantissa can be computed,
with the result added to the exponent. The EXP base 2 instruction
also returns an output accurate to about 11 mantissa bits as well
as two partial results: two raised to the power of floor(source),
and fraction(source). A more accurate user-programmed approximation
based on the limited range fraction can be computed, with the
result multiplied by the power output. The precision of these
instructions was based on the desired 8-bit color accuracy of the
specular LIT operation. It takes about 10 instructions to achieve
full accuracy LOG and EXP evaluation.
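
The role of the partial results can be sketched in Python. The sketch assumes the mantissa is normalized to the range [1, 2), and it uses the library functions `math.log2` and `2.0 ** x` as stand-ins for the limited-range polynomial approximations a vertex program would actually evaluate; all function names are hypothetical.

```python
import math


def log_partials(x):
    """Return (exponent, mantissa) with x = mantissa * 2**exponent.

    Assumption: the mantissa lies in [1, 2).
    """
    m, e = math.frexp(x)          # frexp gives m in [0.5, 1)
    return float(e - 1), m * 2.0


def exp_partials(x):
    """Return (2**floor(x), fraction(x))."""
    f = math.floor(x)
    return 2.0 ** f, x - f


def refined_log2(x):
    exponent, mantissa = log_partials(x)
    # Stand-in for a polynomial valid only on [1, 2).
    return exponent + math.log2(mantissa)


def refined_exp2(x):
    power, fraction = exp_partials(x)
    # Stand-in for a polynomial valid only on [0, 1).
    return power * 2.0 ** fraction


print(log_partials(10.0), exp_partials(3.25))
print(refined_log2(10.0), refined_exp2(3.25))
```

Because the refinement only has to cover a narrow input range, a short polynomial is enough, which is why the partial results are worth returning alongside the approximate answer.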
`
The MIN and MAX operations allow for clamping and absolute value
computations (MAX of source and -source). Related to these are the
SLT and SGE instructions that return 1.0 if the component
comparison is true and 0.0 if false.

The ARL instruction was added to allow support of vertex-indexed
constant access such as a matrix or plane equation. It converts a
floating-point scalar into a signed integer, which can be used as
an offset into the constant memory. Out-of-range reads from
constant memory return (0.0,0.0,0.0,0.0).
`
Sources are negated by prefixing a - sign, and can be swizzled via
four optional subscripts that describe the component rearrangement
desired. For example:
    MOV R0, -R1.wyzy;

moves the negated w component of register R1 into the x component
of register R0, moves the negated y and z components across, and
uses the negated y component again to place into the R0 w
component.

The destination of an instruction has an optional write mask of the
desired xyzw components to be written. For example:
    ADD R0.xw, R1, R2;

updates the x and w components of R0 with the sum of R1 and R2.
`
`
`
4 HARDWARE IMPLEMENTATION
4.1 Overview
The hardware implementation of vertex programs is divided into two
main blocks: the vertex attribute buffer (VAB) and the floating
point core.

[Figure 3 block diagram; legible block labels: Vertex In, Vector FP Core, Vertex Out]
Figure 3: Hardware Units

The VAB is responsible for vertex attribute persistence, and the
floating-point core processes the instruction set.
`
`
all instructions with the same latency, limiting the complexity of
any instruction. This improves programmability and simplifies the
hardware. All operands are immediately available, limiting the size
of register and memory banks.
3.7.3 Instruction Set Rationale
Since we wanted to use the same instruction set for vertex programs
and fixed function (non-programmable) mode, we started by analyzing
the fixed function implementation of a previous architecture. We
found that the equivalents of the MOV, MUL, ADD, and MAD
instructions were used about 50% of the time, and that the DP3 and
DP4 equivalents were used about 40% of the time. We support dot
products for their coding convenience, and also because as the
number of cycles spent on a vertex decreases over architectural
generations, it becomes more important to have powerful concise
instructions. Cross products are also important, and they can be
done via an efficient MUL/MAD sequence with source vector
rotations. For example, R1 = R0 x R2 is done as:
    MUL R1, R0.zxyw, R2.yzxw;
    MAD R1, R0.yzxw, R2.zxyw, -R1;
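
The two-instruction cross product can be checked by emulating it in Python. The sketch assumes the sequence is a MUL of R0.zxyw with R2.yzxw followed by a MAD of R0.yzxw with R2.zxyw plus the negated first result; the helper functions are hypothetical stand-ins for the instructions.

```python
INDEX = {"x": 0, "y": 1, "z": 2, "w": 3}


def swizzle(vec, pattern):
    return [vec[INDEX[c]] for c in pattern]


def mul(a, b):
    return [x * y for x, y in zip(a, b)]


def mad(a, b, c):
    """Multiply-add: a * b + c per component."""
    return [x * y + z for x, y, z in zip(a, b, c)]


def neg(a):
    return [-x for x in a]


def cross_mul_mad(r0, r2):
    """R1 = R0 x R2 using one MUL and one MAD with rotated sources."""
    r1 = mul(swizzle(r0, "zxyw"), swizzle(r2, "yzxw"))
    r1 = mad(swizzle(r0, "yzxw"), swizzle(r2, "zxyw"), neg(r1))
    return r1


a = [1.0, 2.0, 3.0, 0.0]
b = [4.0, 5.0, 6.0, 0.0]
print(cross_mul_mad(a, b))
```

Swizzling and negation are applied as the sources are read, so the rotations cost no extra instructions.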
`
We support reciprocal (RCP) instead of division due to the constant
latency restriction. The RCP instruction is also scalar since the
main use of it is in the perspective division of w in homogeneous
clip space (done after the vertex program), which involves the
multiply of the (x,y,z) vector with the scalar 1/w.

The reciprocal square root (RSQ) is mainly used in normalizing
vectors to be used in lighting equations. The typical sequence is a
DP3 to find the vector length squared, an RSQ to get the reciprocal
length, and a MUL to normalize the vector. It is very convenient to
use the vector w component for storing the length squared and
reciprocal length values. RSQ is also a scalar operator.
To avoid problems with vector lengths of 0.0 causing RSQ to return
infinity, we mandated that 0.0 times anything be 0.0. This is also
useful in conditional evaluation when multiplying by 0.0. Another
mandate is that 1.0 times anything be the same value.
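
The normalization sequence and the zero-multiply rule can be sketched in Python. Standard floating point gives NaN for 0.0 times infinity, so the sketch uses a hypothetical `hw_mul` helper to model the mandated behaviour; `rsq` returning infinity for an input of 0.0 is taken from the text, and everything else is an illustrative assumption.

```python
import math


def hw_mul(a, b):
    """Multiply with the mandated rule: 0.0 times anything is 0.0."""
    if a == 0.0 or b == 0.0:
        return 0.0
    return a * b


def dp3(a, b):
    """3-component dot product built on hw_mul."""
    return sum(hw_mul(x, y) for x, y in zip(a[:3], b[:3]))


def rsq(x):
    """Reciprocal square root; infinity for a zero input."""
    return math.inf if x == 0.0 else 1.0 / math.sqrt(x)


def normalize(v):
    length_squared = dp3(v, v)       # DP3
    inv_length = rsq(length_squared) # RSQ
    return [hw_mul(c, inv_length) for c in v[:3]]  # MUL


print(normalize([3.0, 0.0, 4.0]))
print(normalize([0.0, 0.0, 0.0]))
```

With the zero-multiply rule a zero-length vector normalizes to zeros rather than NaN, so a degenerate normal cannot poison later lighting terms.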
`
A major exception to our goal of similar performance in fixed
function and program mode involved lighting. The previous
architecture design has a separate hard-wired lighting engine.
Since it was too hard to expose this engine in program mode, the
decision was made to turn it off when running vertex programs.
Fixed function performance with heavy lighting can therefore be
twice as fast as a comparable vertex program. To alleviate this
problem, two instructions were included: DST and LIT. The DST
instruction assists in constructing attenuation factors of the
form:
    (K0,K1,K2) · (1,d,d*d) = K0 + K1*d + K2*d*d
where d is some distance. Since d*d and 1/d are natural byproducts
of the vector normalization process, these values are input as
(NA,d*d,d*d,NA) and (NA,1/d,NA,1/d) to DST, which then returns the
(1,d,d*d,1/d) vector. The last 1/d term can be used with a DP4
operation if desired.
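
A small Python sketch shows how DST could assemble the distance vector. The component selection used here, (1, src0.y * src1.y, src0.z, src1.w), is inferred from the inputs and output described in the text and should be read as an assumption; the attenuation coefficients are made up.

```python
def dst(src0, src1):
    """Distance vector (1, d, d*d, 1/d) from the two inputs described above.

    Assumed component selection: (1, src0.y * src1.y, src0.z, src1.w).
    """
    return [1.0, src0[1] * src1[1], src0[2], src1[3]]


def dp3(a, b):
    return sum(x * y for x, y in zip(a[:3], b[:3]))


d = 2.0
NA = 0.0  # "not applicable" slots; their values are ignored by dst
dist = dst([NA, d * d, d * d, NA], [NA, 1.0 / d, NA, 1.0 / d])

k = [1.0, 0.5, 0.25]        # hypothetical K0, K1, K2
attenuation_sum = dp3(k, dist)  # K0 + K1*d + K2*d*d
print(dist, attenuation_sum)
```

The distance d is recovered as (d*d) * (1/d), so no square root is needed beyond the one already performed for normalization.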
The LIT instruction does the fairly complex ambient, diffuse, and
specular calculations with clamping based on N·L, N·H, and the
power p. The calculations are:
    Output.x = 1.0;                      // ambient
    Output.y = max(N·L, 0.0);            // diffuse
    Output.z = 0.0;                      // specular
    if (N·L > 0.0 && p == 0.0)
        Output.z = 1.0;
    else if (N·L > 0.0 && N·H > 0.0)
        Output.z = (N·H)^p;
    Output.w = 1.0;
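
The LIT calculation can be written as a short Python function. The sketch evaluates the specular power as 2 raised to p * log2(N·H), which is one way to model a log, multiply, and exp sequence; the function name and argument names are hypothetical.

```python
import math


def lit(n_dot_l, n_dot_h, power):
    """Return [ambient, diffuse, specular, 1.0] lighting coefficients."""
    out = [1.0, max(n_dot_l, 0.0), 0.0, 1.0]
    if n_dot_l > 0.0 and power == 0.0:
        out[2] = 1.0
    elif n_dot_l > 0.0 and n_dot_h > 0.0:
        # Power function modelled as exp2(power * log2(base)).
        out[2] = 2.0 ** (power * math.log2(n_dot_h))
    return out


print(lit(0.5, 0.5, 2.0))    # lit surface with a specular term
print(lit(-0.2, 0.9, 8.0))   # surface facing away from the light
```

When N·L is not positive the specular term stays at 0.0, so highlights never appear on surfaces that face away from the light.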
`
Since LIT implements the specular power function via use of a log,
multiply, and exp sequence, we also decided to expose the LOG and
EXP instructions. Since the power is a variable in the LIT source,
a table needing a pre-known specular power was not an
4.2 Attribute Input
Vertex attributes are converted to floating point representation
before arriving at the VAB, which has room for the 16 vertex
attributes. The contents of each address default to
(0.0,0.0,0.0,1.0)
`
`
when an attribute write arrives, and are then overwritten by the
valid data components. This is required since the API allows for
sending fewer than four components; defaulting the remainder saves
bandwidth into the GPU.
`
`
`
`Figure 4: VAB
`
The VAB drains into a number of input buffers (IB) that are used to
feed the floating-point core in a round-robin fashion. Dirty bits
are maintained in the VAB so that only changed attributes are
updated when the same buffer is again the drain target. The
transfer of a vertex is triggered by a write to address 0,
corresponding to the vertex position in fixed function mode. To
prevent bubbles during simultaneous loading and draining of the
VAB, incoming writes may push out the contents of the target
address, superseding a default drain sequence.
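
The VAB behaviour described above can be modelled with a toy Python class. This is a sketch under stated assumptions, not the hardware design: the number of input buffers, the use of one dirty bit per buffer and attribute, and the `copies` counter are all hypothetical choices made for illustration. The write-pushes-out optimization is not modelled.

```python
class VAB:
    """Toy model of the vertex attribute buffer with 16 persistent slots."""

    def __init__(self, num_input_buffers=3):
        self.attrs = [[0.0, 0.0, 0.0, 1.0] for _ in range(16)]
        self.buffers = [[list(a) for a in self.attrs]
                        for _ in range(num_input_buffers)]
        # Assumed layout: one dirty bit per (input buffer, attribute).
        self.dirty = [[False] * 16 for _ in range(num_input_buffers)]
        self.target = 0   # next input buffer in round-robin order
        self.copies = 0   # attribute transfers performed so far

    def write(self, index, components):
        """Store 1 to 4 components; missing ones take the defaults."""
        value = [0.0, 0.0, 0.0, 1.0]
        value[:len(components)] = components
        self.attrs[index] = value
        for bits in self.dirty:
            bits[index] = True
        if index == 0:        # a write to address 0 launches the vertex
            self._drain()

    def _drain(self):
        buf, bits = self.buffers[self.target], self.dirty[self.target]
        for i in range(16):
            if bits[i]:       # copy only attributes changed for this buffer
                buf[i] = list(self.attrs[i])
                bits[i] = False
                self.copies += 1
        self.target = (self.target + 1) % len(self.buffers)


vab = VAB(num_input_buffers=2)
vab.write(3, [1.0, 0.5])          # a two-component attribute
vab.write(0, [1.0, 2.0, 3.0])     # position: drains into buffer 0
vab.write(0, [4.0, 5.0, 6.0])     # drains into buffer 1
vab.write(0, [7.0, 8.0, 9.0])     # back to buffer 0: only position is dirty
print(vab.copies)
```

In this model the third vertex transfers a single attribute, because attribute 3 was already copied into buffer 0 and has not changed since.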
`
4.3 The Floating-Point Core
The floating-point core is a multi-threaded vector processor
operating on quad-float data. Vertex data is read from the input
buffers and transformed into the output buffers (OB). The latency
of the vector and special function units are equal and multiple
vertex threads are used to hide this latency.
The SIMD Vector Unit is responsible for the MOV, MUL, ADD, MAD,
DP3, DP4, DST, MIN, MAX, SLT, and SGE operations. The Special
Function Unit is responsible for the RCP, RSQ, LOG, EXP, and LIT
operations.
`
`
`
`
`Figure 5: Floating Point Core
`
The Vector Unit floating-point precision is approximately IEEE.
There is no support for denormalized numbers or exceptions, and
rounding is always towards negative infinity. The hardware outputs
0.0 for a multiply with any source of 0.0, including 0.0*infinity
and 0.0*NaN. The Special Function Unit calculates the RCP and RSQ
functions to within about 1.5 bits of IEEE precision using two-pass
Newton-Raphson iteration from a seed table. While lighting may
suffice with a lower precision RSQ, texture and position evaluation
can require much higher precision. It was not felt necessary to
provide a low-precision RSQ option.
The hardware accepts one instruction per clock and fully implements
all instruction set input/output options with no performance
penalty. All input vectors are available with no latency.
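
Two-pass Newton-Raphson refinement from a coarse seed can be sketched in Python. The iteration formulas are the textbook ones for the reciprocal and the reciprocal square root; the seed values passed in below are made-up stand-ins for seed table entries, and nothing here reflects the actual table size or bit widths of the hardware.

```python
def rcp_newton(a, seed, passes=2):
    """Refine an estimate of 1/a with x <- x * (2 - a * x)."""
    x = seed
    for _ in range(passes):
        x = x * (2.0 - a * x)
    return x


def rsq_newton(a, seed, passes=2):
    """Refine an estimate of 1/sqrt(a) with x <- x * (1.5 - 0.5 * a * x * x)."""
    x = seed
    for _ in range(passes):
        x = x * (1.5 - 0.5 * a * x * x)
    return x


# Seeds with an error of a few percent, as a small lookup table might give.
print(rcp_newton(3.0, 0.33), 1.0 / 3.0)
print(rsq_newton(4.0, 0.49), 0.5)
```

Each pass roughly squares the relative error, which is why a small seed table followed by two passes is enough to approach single precision.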
`
5 PROGRAMMING INTERFACES
Given the predominance of OpenGL and Direct3D, the integration of
programmable geometry into these 3D programming interfaces is vital
to its widespread availability and quick adoption. The discussion
below concentrates on how we integrated programmable geometry into
OpenGL through an extension named NV_vertex_program. Where Direct3D
makes alternative design choices, such choices are noted.
`
5.1 Design Goals
1. Backward compatibility. Existing OpenGL applications unaware of
programmable geometry should operate unchanged.

2. Ease of adoption. It should be relatively straightforward to
integrate programmable geometry into an existing application
without overhauling the way in which vertex data is presented to
OpenGL. Moreover, applications should be able to mix existing fixed
function vertex processing with programmable geometry.

3. Forward focus. In our view, programmable geometry frees
programmers from existing API conventions of what a "vertex normal"
or a "light direction" is; the vertex program supplies these
semantic connections, transcending per-vertex attributes and
vertex-related naming. By not constraining programmable geometry to
existing conventions, we hope this will encourage novel
applications for programmable geometry, including automatic
generation of vertex programs by higher-level software [22].

4. Preparation to expose future programmability. We believe that
other functionality beyond vertex processing in OpenGL's dataflow
will eventually be programmable as well. The programming interface
should be amenable to exposing other types of programmability.

5. Well-defined execution environment. Preliminary feedback from
developers and our own thinking convinced us that an unconstrained
execution environment for programmable geometry would lead to
headaches for developers. Unlike textures that can usually be
down-sampled if too large, vertex programs that require more
instructions, registers, or other resources that are not available
on a given implementation cannot be easily simplified to cope with
implementation limitations. For this reason, we chose to require a
strict, well-defined execution environment.
`
5.2 Programming Model
NV_vertex_program augments OpenGL vertex processing with a new mode
known as vertex program mode. Initially, vertex program mode is
disabled. When disabled, vertices are transformed by OpenGL's
conventional vertex-processing functionality, consisting of
coordinate transformation, vertex lighting, texture coordinate
generation, and user-defined clip planes.
`
`
Vertex program state affects the OpenGL dataflow only when vertex
program mode is enabled, so vertex program mode being initially
disabled ensures backward compatibility.
Vertex program mode is enabled as follows:

    glEnable(GL_VERTEX_PROGRAM_NV);

When enabled, a glVertex command (or equivalent) initiates vertex
program execution. The current vertex program processes the current
16 vertex attributes and 96 program parameters as described in
Section 3.5. At vertex program completion, the vertex result
registers contain a transformed vertex that is further processed to
screen space and forwarded to primitive assembly.
`
`5.2.1 Vertex Program Objects
Multiple vertex programs are managed via program objects, but
`there is a single current vertex program that i