COMPUTER GRAPHICS PROCEEDINGS

A Publication of ACM SIGGRAPH

Sponsored by the ACM's Special Interest Group on Computer Graphics

`MEDIATEK, Ex. 1013, Page 1
`IPR2018-00101
`
`
`
`
`
`Annual Conference Series 2001
`SIGGRAPH 2001
`Conference Proceedings
`August 12-17, 2001
`Papers Chair: Eugene Fiume
`
`SIGGRAPH 2001, Los Angeles, California, August 12-17, 2001
`
`The Association for Computing Machinery, Inc.
`1515 Broadway
`New York, New York 10036
`
Copyright © 2001 by the Association for Computing Machinery, Inc (ACM). Permission to make digital or hard copies of portions of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted.

To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permission to republish from: Publications Department, ACM, Inc. Fax +1-212-869-0481 or e-mail permissions@acm.org.

For other copying of articles that carry a code at the bottom of the first or last page, copying is permitted provided that the per-copy fee indicated in the code is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923.

Notice to Past Authors of ACM-Published Articles
ACM intends to create a complete electronic archive of all articles and/or other material previously published by ACM. If you have written a work that was previously published by ACM in any journal or conference proceedings prior to 1978, or any SIG newsletter at any time, and you do NOT want this work to appear in the ACM Digital Library, please inform permissions@acm.org, stating the title of the work, the author(s), and where and when published.

ACM ISBN: 1-58113-374-X

Additional copies may be ordered prepaid from:
ACM Order Department
P.O. Box 11405
Church Street Station
New York, NY 10286-1405

Phone: 1-800-342-6626 (USA and Canada)
Phone: +1-212-626-0500 (All other countries)
Fax: +1-212-944-1318
E-mail: acmhelp@acm.org

ACM Order Number: 428010

Printed in the USA
`
`
`Computer Graphics Proceedings, Annual Conference Series, 2001
`
A User-Programmable Vertex Engine

Erik Lindholm
erikl@nvidia.com

Mark J Kilgard
mjk@nvidia.com

Henry Moreton
moreton@nvidia.com

NVIDIA Corporation

ABSTRACT
In this paper we describe the design, programming interface, and implementation of a very efficient user-programmable vertex engine. The vertex engine of NVIDIA's GeForce3 GPU evolved from a highly tuned fixed-function pipeline requiring considerable knowledge to program. Programs operate only on a stream of independent vertices traversing the pipe. Embedded in the broader fixed function pipeline, our approach preserves parallelism sacrificed by previous approaches. The programmer is presented with a straightforward programming model, which is supported by transparent multi-threading and bypassing to preserve parallelism.

In the remainder of the paper we discuss the motivation behind our design and contrast it with previous work. We present the programming model, the instruction set selection process, and details of the hardware implementation. Finally, we discuss important API design issues encountered when creating an interface to such a device. We close with thoughts about the future of programmable graphics devices.

Keywords
Graphics Hardware, Graphics Systems.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ACM SIGGRAPH 2001, 12-17 August 2001, Los Angeles, CA, USA
© 2001 ACM 1-58113-374-X/01/08...$5.00

1 INTRODUCTION
Dramatic increases in the computational power of graphics processing units (GPUs, Figure 1) have been fueled both by innovation and the continuing improvement in semiconductor process technologies. The need for increased performance has driven, and been driven by, increasingly rich graphics APIs. The motivation behind the creation of the user-programmable geometry engine described in this paper is twofold. First, the increasing configurability required by continually evolving graphics APIs requires a programmable device to manage the combinatorial explosion of mode combinations. Second, high-performance programmability is an end unto itself. Given the right programming model, with a sufficient degree of target processor independence, the need for rapidly evolving graphics APIs is reduced, and an opportunity is created for inventiveness unconstrained by fixed-function, modally configured APIs and hardware. Further, compatibility across hardware generations and platforms will increase the lifespan and utility of programs written for geometry processors.

The programming model and design of the geometry engine in the GeForce3 was guided by several factors: commodity pricing, design time, area, legacy performance, programmable performance, programmability, and platform independence. Ultimately, all of these influence the commercial viability of the design. Design time obviously determines time to market. Area is directly linked to product cost. Previously existing applications must exhibit higher performance on new products. There can only be a slight performance penalty paid for taking advantage of programmability. To gain acceptance, the engine must be easy to program. Finally, to promote adoption across vendors, a standard interface is required and thus the functionality cannot be too tightly coupled to a specific hardware implementation; for example, CPU implementations must be viable.

Figure 1: Graphics Processing Unit (GPU). Labels recoverable from the figure: Host Interface, Geometry, Primitive Assembly/Setup, Raster/Texture, Framebuffer Interface, vertex, fragment.

We provide a taxonomic description of previous programmable graphics processors, comparing them to our device. We show how the programming model can be effectively supported by a custom processor design. We describe how a programmable processing element can be incorporated into an existing graphics API. Finally, we illustrate how the programming model and interface may be used to efficiently implement complex custom effects.

2 PREVIOUS WORK
Geometric calculations have been accelerated for over 30 years, starting with early flight simulators. Among the best known is the Geometry Engine [5]. A system was built from 12 instances of the GE, coupled with a raster subsystem built out of AMD2903s. The GE was fabricated using a 3 µm feature size and housed in a 40-pin package. The GeForce3 GPU is manufactured using a 0.18 µm process with a ~550-pin package. So while available logic has increased by a factor of 300, the relative amount of available bandwidth has only increased by a factor of 14. Note that increases in clock frequency cancel in this relative measure. We provide these numbers simply to illustrate that the problem is continually evolving, and that the natural amount of computation performed by the GPU today is far more than was performed in years past, and probably a fraction of what will be appropriate tomorrow.
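The factors of 300 and 14 quoted for the Geometry Engine and the GeForce3 in Section 2 can be roughly reproduced from the stated feature sizes and pin counts. The sketch below is only a back-of-the-envelope check: it assumes the Geometry Engine feature size reads 3 µm, that available logic scales with the inverse square of feature size at equal die area, and that relative bandwidth scales with pin count. The function and constant names are illustrative, not from the paper.

```python
# Figures taken from the text; the scaling rules are simplifying assumptions.
GE_FEATURE_UM = 3.0     # Geometry Engine feature size in micrometres (assumed reading)
GF3_FEATURE_UM = 0.18   # GeForce3 process in micrometres
GE_PINS = 40            # Geometry Engine package
GF3_PINS = 550          # GeForce3 "~550-pin package"


def logic_ratio(old_um: float, new_um: float) -> float:
    """Relative logic per unit area, assuming density ~ 1 / feature_size**2."""
    return (old_um / new_um) ** 2


def bandwidth_ratio(old_pins: int, new_pins: int) -> float:
    """Relative off-chip bandwidth, assuming it is proportional to pin count."""
    return new_pins / old_pins


if __name__ == "__main__":
    print(f"logic:     ~{logic_ratio(GE_FEATURE_UM, GF3_FEATURE_UM):.0f}x")
    print(f"bandwidth: ~{bandwidth_ratio(GE_PINS, GF3_PINS):.1f}x")
```

Under these assumptions the ratios come out near 280 and 14, in line with the rounded figures in the text. Clock frequency is left out on purpose because, as the text notes, it cancels in the relative measure.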
`
The various products and technologies applied to performing the standard geometry processing tasks can be categorized by a small number of attributes: technology, arrangement, and programmability. The technology is one of ASIC, DSP, RISC CPU, and CPU extensions. Arrangement refers to the approaches to exploiting parallelism, such as SIMD or MIMD. Each system's programmability may be characterized by whether it was intended for end-user programming, and the relative ease with which it was programmed.

The only non-parallel implementation, the Stellar GS1000 [4], used a supercomputer-like vector processor, and was driven by hand-coded assembly for critical paths.

Pixar's CHAP [17] and the Ikonas [7] are early examples of fine-grain SIMD processors, based on the AMD2903, user micro-codable by skilled programmers. These machines operated in parallel on pixel and vertex components. The only coarse-grained SIMD implementation of which we are aware is the geometry subsystem of the Indigo Extreme [11]. It was implemented using a hand micro-coded ASIC. The Indigo processed eight triangles in parallel, stalling if any of the group were clipped, or otherwise required branching.

Following the original Geometry Engine, the IRIS GT [3] and The Pixel Machine [24] were the only machines to arrange floating point DSPs in pipeline fashion. As has been observed by many, the slowest processor in the pipeline gated these machines' performance. Since it was only practical to distribute the geometry tasks statically, the pipelines were inefficient for certain workloads.

MIMD machines dominate the history of geometry processors. In each case the individual processors operated on single triangles. The Raster Tech GX4000 [26],[27] was the earliest example, followed by Pixel-Planes 5 [10], the DN10000VS [15], PixelFlow [19], and the RealityEngine [2]. The GX4000 used a Weitek floating point DSP, while all but one of the remaining machines used the i860XP [13], a 64-bit microprocessor. The last of the MIMD geometry subsystems was the InfiniteReality [23], using a custom micro-coded ASIC built to exceed the performance available in third party processors. The InfiniteReality's processor was micro-coded in SIMD fashion within each of the processors in a MIMD array of configurable size.

Alternatives to the above large high-performance machines are the processor extensions, all of which exploit fine-grained SIMD parallelism similar to the CHAP and Ikonas. Each of these exploits the existing resources and clock rate of a general purpose CPU to deliver high performance. MIPS-3D ASE [18] and 3DNow! [1] perform paired single SIMD floating point operations. Intel's SSE instructions [14] express 4-wide SIMD processing. Motorola's AltiVec [9] delivers the full 4-wide SIMD performance. Sony's Emotion Engine [16] has two 4-wide SIMD processors. The first is interfaced to the main CPU as a coprocessor, executing instructions directly from the application's instruction stream. The second processor is more loosely coupled, running loaded subroutines, typically performing standard geometry processing tasks.

In all cases, experts were required to very carefully craft assembly code to achieve processor performance approaching theoretical peaks. Close attention to pipeline latency, hazards, and stall conditions was necessary to produce good results. While compilers were generally available, generated code was typically of inadequate performance.

In contrast to virtually all of these systems, our geometry engine only exposes the programmability of a small part of the larger geometry pipeline. Tasks such as vertex load and store, format conversion, primitive assembly, clipping, and triangle setup occur completely in parallel, in pipeline fashion. We use 4-wide fine-grained SIMD floating point to provide the necessary performance, and run multiple execution threads to maintain efficiency and provide a very simple programming model.

3 PROGRAMMING MODEL
In this section we describe our programming model for geometry processing and discuss the design in the areas of input, output, data path, and instruction set selection. We include the rationale for choices made in the design process.

3.1 Vertex Processing
There were two main possibilities for processing the vertex stream: as independent vertices or as part of a geometric primitive, for example a triangle. The advantage of primitive-level information is enabling operations such as culling, reducing processing time. However, we determined that the increased complexity and loss of parallelism in the primitive program model did not justify the perceived benefits. We chose an independent vertex program model to exploit the parallel nature of the task, and greatly simplify the resulting programming task.

We preserved the latter stages of the fixed function programming model, there being no benefit to their programmability. In fact, incorrect clipping could freeze a hardware rasterizer. As such we leave frustum clipping, perspective divide, and viewport scale and bias to subsequent implementation-specific processing. The programming model is capable of expressing everything in the fixed function pipeline except user clip planes. We instead recommend encoding plane distances into texture coordinates and using fragment level operations to implement this functionality.

3.2 Precision and Data Type
IEEE single precision floating point has been used for many years as the standard precision for 3D transformations, and to keep the model simple it was adopted as the only data type. The common data in 3D graphics are 3 and 4 component vectors, for example position, normal, texture coordinates and colors. The basic data type is therefore the quad-float vector written as (x,y,z,w).

3.3 Scalar and Vector Handling
It was critical to deal efficiently with scalar packing/extraction and vector data in this design since the 3D transform pipeline mixes these operations. Two simple concepts can resolve this:
1. On input, vectors can have their components arbitrarily rearranged/replicated (swizzled).
2. Any operation generating a scalar must generate that scalar replicated across all components, and output writes have a component write mask.
A scalar value in a vector register can be replicated into a vector through (1), and then stored again as a scalar through (2). Swizzling is very useful for doing cross products efficiently, where the source vectors need to be rotated. Another use is converting constants such as [-1,0,1,2] into others such as [0,0,1,0] or [-1,-1,-1,1].

3.4 Program Model
The program model is illustrated in Figure 2. The current vertex attributes are available in the input (source) registers, and the processed vertex is written into the output (destination) registers. The constant bank holds transform and light parameters, and the register file (R) holds temporary results. A function unit (F) implements the instruction set.

Making the vertex source read-only by the vertex program, and the destination write-only, recognizes the streaming nature of the design and simplifies implementation.
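The two rules of Section 3.3 (input swizzling, and scalar results replicated across components and stored through a write mask) can be sketched in a few lines. This is a minimal illustration on Python lists, not a model of the hardware; `swizzle` and `write_masked` are hypothetical helper names introduced here.

```python
COMPONENT_INDEX = {"x": 0, "y": 1, "z": 2, "w": 3}


def swizzle(vec, pattern):
    """Rearrange or replicate the components of a quad-float, e.g. "yyzy"."""
    return [vec[COMPONENT_INDEX[c]] for c in pattern]


def write_masked(dst, result, mask="xyzw"):
    """Return dst with only the components named in mask replaced by result."""
    out = list(dst)
    for c in mask:
        i = COMPONENT_INDEX[c]
        out[i] = result[i]
    return out


if __name__ == "__main__":
    # Converting the constant [-1,0,1,2] into the other constants named in the text.
    c = [-1.0, 0.0, 1.0, 2.0]
    print(swizzle(c, "yyzy"))   # [0.0, 0.0, 1.0, 0.0]
    print(swizzle(c, "xxxz"))   # [-1.0, -1.0, -1.0, 1.0]

    # A scalar held in w is replicated by a swizzle, then stored into y only.
    r0 = [1.0, 2.0, 3.0, 4.0]
    replicated = swizzle([0.0, 0.0, 0.0, 8.0], "wwww")
    print(write_masked(r0, replicated, "y"))   # [1.0, 8.0, 3.0, 4.0]
```

The constant example shows why one well-chosen constant register can stand in for several: each needed variant is just a different swizzle of the same four values.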
`
Figure 2: Program Model (recoverable figure label: Destination (write-only))
3.5 Input Attributes
There are 16 quad-float vertex source attribute registers. Fixed function mode typically requires a position, normal, two colors, up to eight texture coordinate sets, skin weights, fog, and point size. These are sent from the host in many formats including shorts, integers, and floats, with conversion to floating point done before the data is accessed. Unspecified attribute components default to 0.0 for the second and third components, and 1.0 for the fourth. The attributes are all persistent, that is they retain their data until they are changed by subsequent API calls. They are addressed from 0 to 15. An API write to attribute 0 (the vertex position when in fixed function mode) will invoke the vertex program. Only one vertex attribute may be read per program instruction.
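The attribute rules above (missing components defaulted, values persistent between writes, a write to attribute 0 launching the program) can be sketched as follows. The class and function names are hypothetical, and initialising every register to (0,0,0,1) is an assumption made for the sketch.

```python
DEFAULTS = (0.0, 0.0, 0.0, 1.0)
NUM_ATTRIBUTES = 16


def expand_attribute(components):
    """Pad 1 to 4 supplied components to a quad-float using the stated defaults."""
    comps = [float(c) for c in components]
    if not 1 <= len(comps) <= 4:
        raise ValueError("an attribute write carries 1 to 4 components")
    return comps + list(DEFAULTS[len(comps):])


class VertexAttributes:
    """Persistent attribute registers; a write to attribute 0 emits a vertex."""

    def __init__(self):
        # Assumption: registers start out holding the default quad-float.
        self.regs = [list(DEFAULTS) for _ in range(NUM_ATTRIBUTES)]

    def write(self, index, components):
        self.regs[index] = expand_attribute(components)
        if index == 0:
            # Snapshot of all 16 attributes handed to the vertex program.
            return [list(r) for r in self.regs]
        return None


if __name__ == "__main__":
    attrs = VertexAttributes()
    attrs.write(3, [1.0, 0.0, 0.0])          # e.g. a colour, sent once
    v0 = attrs.write(0, [1.0, 2.0, 3.0])     # position write launches vertex 0
    v1 = attrs.write(0, [4.0, 5.0, 6.0])     # colour persists into vertex 1
    print(v0[0], v1[0], v1[3])
```

Because attributes persist, an application only resends the values that change from vertex to vertex, with the position write acting as the trigger.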
To hold constants such as matrices, light positions, and plane coefficients that are used in typical vertex programs, there is a constant bank of 96 quad-floats. It may only be loaded before vertices are processed (for example outside of Begin/End). The size was chosen based on fixed function memory usage, and to allow a reasonably large set of matrices for indexed skinning. As with source attributes, only one constant may be read by one program instruction. The program may not write to constants because it would create a dependency between vertices, forcing serialization causing a serious performance impact.

There is also one integer address register that may be loaded using an instruction (ARL). This address register allows for indexed constant reads, with out-of-range reads returning (0,0,0,0).

The read/write register file is 12 quad-floats in size and allows three reads and one write per instruction. The size was chosen to allow reasonably simple modular code design, where some of the registers would be used for storage of variables across multiple modules. All registers are initialized to (0,0,0,0) per vertex.

Any vector read may be sourced as multiple operands, and individually swizzled/negated each time; see Figure 2. Since any source can be negated, there is no need for a subtract instruction.
`
`
3.6 Output Attributes
Because vertex program outputs merge back into the fixed function pipeline at the homogeneous clip space point, there is a standard mapping of output attributes. Position is used for clipping. Vertex color output components are automatically clamped to the range 0.0 to 1.0. There is also a fog distance, and point size output (clamped, only valid for points). Having a fog output permits more general fog effects than using the position's z or w values, and is interpolated before use as a distance in the standard fog equations. We allow for up to eight texture coordinate sets that can be used for traditional texturing as well as more novel effects in combination with GeForce3's texture shader and register combiners per-fragment functionality [20]. Texture coordinates are assumed to be full precision and range, as well as perspective correct when used in pixel programs.

All instruction writes have an optional 4-component write mask.

Table 1: Output Attributes

All vertex output registers are initialized to (0.0,0.0,0.0,1.0) at the start of a vertex program. Subsequent writes then apply the output write mask to update the selected components. This avoids any problems with undefined outputs, and having to verify raster subsystem input options.
`
3.7 Instruction Set
The instruction set consists of 17 operations. These can be divided into vector, scalar, and miscellaneous operations. We discuss the instructions selected after explaining the constraints we chose to impose.

Table 2: Instruction Set
`
3.7.1 No Branching
The fixed function transform paths in OpenGL® [25] and Direct3D™ [6] are both controlled by global state that does not depend on the actual data supplied with each vertex. This allows for driver optimizations at the time the first vertex is supplied by the application, since all subsequent vertices (until a new state change) can then share this carefully optimized path. The result is a code segment that removes state checking and branching. It is therefore possible to support the full fixed function transform path (at least to homogeneous clip space) without branching. The decision was therefore made to not support branching, keeping the hardware as simple as possible. Also, late binding changes in control flow disrupt pipeline efficiency. Simple if/then/else evaluation is still supported through sum-of-products using 1.0 and 0.0, which can be generated with SLT and SGE.
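The branch-free if/then/else described above can be sketched as a sum of products. The example below emulates the per-component compare instructions SLT and SGE on Python lists; `select_lt` is a hypothetical helper name, and the sketch ignores NaN inputs.

```python
def slt(a, b):
    """SLT: 1.0 where a < b, else 0.0, per component."""
    return [1.0 if x < y else 0.0 for x, y in zip(a, b)]


def sge(a, b):
    """SGE: 1.0 where a >= b, else 0.0, per component."""
    return [1.0 if x >= y else 0.0 for x, y in zip(a, b)]


def select_lt(a, b, if_true, if_false):
    """Per component: if_true where a < b, if_false otherwise, with no branch."""
    t = slt(a, b)   # 1.0 where the condition holds
    f = sge(a, b)   # 1.0 where it does not (complement of t)
    # Sum of products: t * if_true + f * if_false (a MUL followed by a MAD).
    return [ti * x + fi * y for ti, fi, x, y in zip(t, f, if_true, if_false)]


if __name__ == "__main__":
    a = [1.0, 5.0, 3.0, 0.0]
    b = [2.0, 2.0, 3.0, 1.0]
    print(select_lt(a, b, [10.0] * 4, [20.0] * 4))   # [10.0, 20.0, 20.0, 10.0]
```

Both arms are always evaluated, so this pattern suits short selections. The cost is fixed and every vertex follows the same instruction sequence.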
3.7.2 Constant Latency
One instruction set constraint we imposed was that our hardware implementation must issue any instruction per clock and execute all instructions with the same latency, limiting the complexity of any instruction. This improves programmability and simplifies the hardware. All operands are immediately available, limiting the size of register and memory banks.
3.7.3 Instruction Set Rationale
Since we wanted to use the same instruction set for vertex programs and fixed function (non-programmable) mode, we started by analyzing the fixed function implementation of a previous architecture. We found that the equivalents of the MOV, MUL, ADD, and MAD instructions were used about 50% of the time, and that the DP3 and DP4 equivalents were used about 40% of the time. We support dot products for their coding convenience, and also because as the number of cycles spent on a vertex decreases over architectural generations, it becomes more important to have powerful concise instructions. Cross products are also important, and they can be done via an efficient MUL, MAD sequence with source vector rotations. For example, R1 = R0×R2 is done as:
    MUL R1, R0.zxyw, R2.yzxw;
    MAD R1, R0.yzxw, R2.zxyw, -R1;
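The two-instruction cross product above can be checked by emulating MUL, MAD, swizzling, and source negation on Python lists. This is a sketch for verifying the swizzle patterns only; the helper names are introduced here and are not part of the instruction set.

```python
COMPONENT_INDEX = {"x": 0, "y": 1, "z": 2, "w": 3}


def swz(vec, pattern):
    """Source swizzle, e.g. swz(r0, "zxyw")."""
    return [vec[COMPONENT_INDEX[c]] for c in pattern]


def neg(vec):
    """Source negation (the "-" prefix)."""
    return [-c for c in vec]


def mul(a, b):
    return [x * y for x, y in zip(a, b)]


def mad(a, b, c):
    return [x * y + z for x, y, z in zip(a, b, c)]


def cross_via_mul_mad(r0, r2):
    """R1 = R0 x R2 using the MUL/MAD sequence from the text."""
    r1 = mul(swz(r0, "zxyw"), swz(r2, "yzxw"))            # MUL R1, R0.zxyw, R2.yzxw
    r1 = mad(swz(r0, "yzxw"), swz(r2, "zxyw"), neg(r1))   # MAD R1, R0.yzxw, R2.zxyw, -R1
    return r1


if __name__ == "__main__":
    print(cross_via_mul_mad([1.0, 2.0, 3.0, 0.0], [4.0, 5.0, 6.0, 0.0]))
    # [-3.0, 6.0, -3.0, 0.0]
```

The w component works out to w0*w2 - w0*w2, so it is zero for finite inputs and the result can be used directly as a direction vector.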
`
We support reciprocal (RCP) instead of division due to the constant latency restriction. The RCP instruction is also scalar since the main use of it is in the perspective division of w in homogeneous clip space (done after the vertex program), which involves the multiply of the (x,y,z) vector with the scalar 1/w.

The reciprocal square root (RSQ) is mainly used in normalizing vectors to be used in lighting equations. The typical sequence is a DP3 to find the vector length squared, an RSQ to get the reciprocal length, and a MUL to normalize the vector. It is very convenient to use the vector w component for storing the length squared and reciprocal length values. RSQ is also a scalar operator.

To avoid problems with vector lengths of 0.0 causing RSQ to return infinity, we mandated that 0.0 times anything be 0.0. This is also useful in conditional evaluation when multiplying by 0.0. Another mandate is that 1.0 times anything be the same value.
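The DP3, RSQ, MUL normalization sequence and the effect of the "0.0 times anything is 0.0" mandate can be sketched as below. The functions are simplified stand-ins written for this example, not a description of the hardware arithmetic; keeping the reciprocal length in w follows the convenience the text mentions.

```python
import math


def gpu_mul(a, b):
    """Multiply with the mandated rule that 0.0 times anything is 0.0."""
    if a == 0.0 or b == 0.0:
        return 0.0
    return a * b


def dp3(a, b):
    """Three-component dot product."""
    return sum(gpu_mul(x, y) for x, y in zip(a[:3], b[:3]))


def rsq(x):
    """Reciprocal square root; infinity for a 0.0 input, as the text describes."""
    return math.inf if x == 0.0 else 1.0 / math.sqrt(x)


def normalize(v):
    """DP3, RSQ, MUL sequence; the reciprocal length is kept in w."""
    length_sq = dp3(v, v)      # DP3
    inv_len = rsq(length_sq)   # RSQ
    return [gpu_mul(c, inv_len) for c in v[:3]] + [inv_len]   # MUL


if __name__ == "__main__":
    print(normalize([3.0, 4.0, 0.0, 0.0])[:3])   # approximately [0.6, 0.8, 0.0]
    print(normalize([0.0, 0.0, 0.0, 0.0])[:3])   # [0.0, 0.0, 0.0] rather than NaN
```

With plain IEEE multiplication the zero-length case would produce 0.0 times infinity, which is NaN. The mandated rule instead yields a zero vector, which downstream lighting can tolerate.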
A major exception to our goal of similar performance in fixed function and program mode involved lighting. The previous architecture design has a separate hard-wired lighting engine. Since it was too hard to expose this engine in program mode, the decision was made to turn it off when running vertex programs. Fixed function performance with heavy lighting can therefore be twice as fast as a comparable vertex program. To alleviate this problem, two instructions were included: DST and LIT. The DST instruction assists in constructing attenuation factors of the form:
    (K0,K1,K2)·(1,d,d*d) = K0 + K1*d + K2*d*d
where d is some distance. Since d*d and 1/d are natural byproducts of the vector normalization process, these values are input as (NA,d*d,d*d,NA) and (NA,1/d,NA,1/d) to DST, which then returns the (1,d,d*d,1/d) vector. The last 1/d term can be used with a DP4 operation if desired.
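A small sketch of how DST feeds the attenuation polynomial, using only the input and output vectors stated in the text. The function names and the placeholder value 0.0 for the "NA" components are choices made for this example.

```python
def dst(src0, src1):
    """DST as described: (NA,d*d,d*d,NA) and (NA,1/d,NA,1/d) -> (1, d, d*d, 1/d)."""
    return [1.0, src0[1] * src1[1], src0[2], src1[3]]


def attenuation(k, d):
    """Evaluate K0 + K1*d + K2*d*d from the DST result with a 3-component dot product."""
    d_sq = d * d        # byproduct of normalization (DP3)
    inv_d = 1.0 / d     # byproduct of normalization (RSQ); d must be non-zero
    vec = dst([0.0, d_sq, d_sq, 0.0], [0.0, inv_d, 0.0, inv_d])
    return sum(ki * vi for ki, vi in zip(k, vec[:3]))


if __name__ == "__main__":
    print(dst([0.0, 4.0, 4.0, 0.0], [0.0, 0.5, 0.0, 0.5]))   # [1.0, 2.0, 4.0, 0.5]
    print(attenuation((1.0, 0.5, 0.25), 2.0))                 # 3.0
```

Note that d itself is never computed with a square root: it falls out as (d*d) times (1/d) inside DST.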
The LIT instruction does the fairly complex ambient, diffuse, and specular calculations with clamping based on N·L, N·H, and the power p. The calculations are:
    Output.x = 1.0;                       // ambient
    Output.y = max(N·L, 0.0);             // diffuse
    Output.z = 0.0;                       // specular
    if (N·L > 0.0 && p == 0.0)
        Output.z = 1.0;
    else if (N·L > 0.0 && N·H > 0.0)
        Output.z = (N·H)^p;
    Output.w = 1.0;

Since LIT implements the specular power function via use of a log, multiply, and exp sequence, we also decided to expose the LOG and EXP instructions. Since the power is a variable in the LIT source, a table needing a pre-known specular power was not an option. We also wanted an accurate power function conforming to the cos^n model; hence known approximations would not suffice. It is possible to implement the LIT instruction with about 10 other instructions, but the performance loss is extreme.
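The LIT calculation can be written out as a short function. This is one reading of the listing above, with the diffuse term assumed to be N·L clamped at zero, and the specular power computed through a log2, multiply, exp2 sequence as the text describes. The function name and argument order are illustrative.

```python
import math


def lit(n_dot_l, n_dot_h, p):
    """Return (ambient, diffuse, specular, 1.0) coefficients as a quad-float."""
    out = [1.0, max(n_dot_l, 0.0), 0.0, 1.0]   # ambient, clamped diffuse (assumed)
    if n_dot_l > 0.0 and p == 0.0:
        out[2] = 1.0
    elif n_dot_l > 0.0 and n_dot_h > 0.0:
        # (N.H)^p via log, multiply, exp, mirroring the hardware sequence.
        out[2] = 2.0 ** (p * math.log2(n_dot_h))
    return out


if __name__ == "__main__":
    print(lit(0.5, 0.5, 2.0))    # [1.0, 0.5, 0.25, 1.0]
    print(lit(-0.1, 0.9, 8.0))   # [1.0, 0.0, 0.0, 1.0]  (surface faces away)
```

The explicit p == 0.0 case avoids evaluating the logarithm when N·H is zero or negative, where the power sequence would not be defined.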
The LOG base 2 instruction returns an output accurate to about 11 mantissa bits as well as two partial results: the exponent and mantissa of the source scalar. A more accurate user programmed approximation based on the limited range mantissa can be done with the result added to the exponent. The EXP base 2 instruction also returns an output accurate to about 11 mantissa bits as well as two partial results: two raised to the power of floor(source), and fraction(source). A more accurate user program approximation based on the limited range fraction can be done with the result multiplied by the power output. The precision of these instructions was based on the desired 8-bit color precision of the specular LIT operation. It takes about 10 instructions to achieve full accuracy LOG and EXP evaluation.
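The partial results of LOG and EXP, and how a user program can build a more accurate result from them, can be sketched as follows. The mantissa range [1, 2) is an assumption of this sketch, and `math.log2` and `2.0 ** f` stand in for the limited-range polynomial a real vertex program would evaluate.

```python
import math


def log_partials(x):
    """Exponent and mantissa of a positive scalar, with x = mantissa * 2**exponent."""
    m, e = math.frexp(x)          # x = m * 2**e with 0.5 <= m < 1
    return float(e - 1), m * 2.0  # rescaled so the mantissa lies in [1, 2)


def exp_partials(x):
    """2**floor(x) and the fractional part of x."""
    f = math.floor(x)
    return 2.0 ** f, x - f


def log2_refined(x, approx=math.log2):
    """Limited-range approximation on the mantissa, added to the exponent."""
    exponent, mantissa = log_partials(x)
    return exponent + approx(mantissa)


def exp2_refined(x, approx=lambda frac: 2.0 ** frac):
    """Limited-range approximation on the fraction, multiplied by the power output."""
    power, frac = exp_partials(x)
    return power * approx(frac)


if __name__ == "__main__":
    print(log_partials(10.0))   # (3.0, 1.25)
    print(exp_partials(2.5))    # (4.0, 0.5)
    print(log2_refined(10.0), exp2_refined(2.5))
```

Splitting the argument this way means the approximation only has to be accurate over one octave, which is what makes a short polynomial sufficient.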
`
The MIN and MAX operations allow for clamping and absolute value computations (MAX of source and -source). Related to these are the SLT and SGE instructions that return 1.0 if the component compare is true and 0.0 if false.

The ARL instruction was added to allow support of vertex specific constant access such as a matrix or plane equation. It converts a floating-point scalar into a signed integer, which can be used as an offset into the constant memory. Out-of-range reads from constant memory return (0,0,0,0).
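Indexed constant access through the address register can be sketched as below. The text does not say how ARL rounds, so the use of floor here is an assumption, and the function names and the base-plus-offset form are illustrative.

```python
import math

CONSTANT_BANK_SIZE = 96
ZERO = [0.0, 0.0, 0.0, 0.0]


def arl(scalar):
    """Load the address register from a float (rounding by floor is assumed)."""
    return math.floor(scalar)


def read_constant(constants, base, address):
    """Read constants[base + address]; out-of-range reads return (0,0,0,0)."""
    index = base + address
    if 0 <= index < len(constants):
        return list(constants[index])
    return list(ZERO)


if __name__ == "__main__":
    # Fill the bank so that entry i holds (i, i, i, i), for easy inspection.
    bank = [[float(i)] * 4 for i in range(CONSTANT_BANK_SIZE)]
    a0 = arl(8.7)                        # e.g. a per-vertex matrix index
    print(read_constant(bank, 4, a0))    # [12.0, 12.0, 12.0, 12.0]
    print(read_constant(bank, 90, a0))   # [0.0, 0.0, 0.0, 0.0]  (index 98 is out of range)
```

Returning zeros for out-of-range reads keeps a bad per-vertex index from reading unrelated state, at the cost of silently producing a degenerate result.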
Sources are negated by prefixing a “-” sign, and can be swizzled via four optional subscripts that describe the component rearrangement desired. For example:
    MOV R0, -R1.wyzy;
moves the negated w component of register R1 into the x component of register R0, moves the negated y and z components across, and uses the negated y component again to place into the R0 w component.

The destination of an instruction has an optional write mask of the desired xyzw components to be written. For example:
    ADD R0.xw, R1, R2;
updates the x and w components of R0 with the sum of R1 and R2.

`
4 HARDWARE IMPLEMENTATION
4.1 Overview
The hardware implementation of vertex programs is divided into two main blocks: the vertex attribute buffer (VAB) and the floating point core.

Figure 3: Hardware Units (recoverable block labels: Vertex In, Vector FP Core, Vertex Out)
The VAB is responsible for vertex attribute persistence, and the floating-point core processes the instruction set.

4.2 Attribute Input
Vertex attributes are converted to floating point representation before arriving at the VAB, which has room for the 16 input attributes. The contents of each address default to (0.0,0.0,0.0,1.0) when an attribute write arrives, and are then overwritten by the valid data components. This is required since the API allows for sending less than four components; defaulting the remainder saves bandwidth into the GPU.

Figure 4: VAB

The VAB drains into a number of input buffers (IB) that are used to feed the floating-point core in a round-robin fashion. Dirty bits are maintained in the VAB so that only changed attributes are updated when the same buffer is again the drain target. The transfer of a vertex is triggered by a write to address 0, corresponding to the vertex position in fixed function mode. To prevent bubbles during simultaneous loading and draining of the VAB, incoming writes may push out the contents of the target address, superseding a default drain sequence.

4.3 The Floating-Point Core
The floating-point core is a multi-threaded vector processor operating on quad-float data. Vertex data is read from the input buffers and transformed into the output buffers (OB). The latency of the vector and special function units are equal and multiple vertex threads are used to hide this latency.

The SIMD Vector Unit is responsible for the MOV, MUL, ADD, MAD, DP3, DP4, DST, MIN, MAX, SLT, and SGE operations. The Special Function Unit is responsible for the RCP, RSQ, LOG, EXP, and LIT operations.

Figure 5: Floating Point Core

The Vector Unit floating-point precision is approximately IEEE. There is no support for de-normalized numbers or exceptions, and rounding is always towards negative infinity. The hardware outputs 0.0 for a multiply with any source of 0.0, including 0.0*infinity and 0.0*NaN. The Special Function Unit calculates the RCP and RSQ functions to within about 1.5 bits of IEEE precision using two-pass Newton-Raphson iteration from a seed table. While lighting may suffice with a lower precision RSQ, texture and position evaluation can require much higher precision. It was not felt necessary to provide a low-precision RSQ option.

The hardware accepts one instruction per clock and fully implements all instruction set input/output options with no performance penalty. All input vectors are available with no latency.

5 PROGRAMMING INTERFACES
Given the predominance of OpenGL and Direct3D, the integration of programmable geometry into these 3D programming interfaces is vital to its widespread availability and quick adoption. The discussion below concentrates on how we integrated programmable geometry into OpenGL through an extension named NV_vertex_program. Where Direct3D makes alternative design choices, such choices are noted.

5.1 Design Goals
1. Backward compatibility. Existing OpenGL applications unaware of programmable geometry should operate unchanged.
2. Ease of adoption. It should be relatively straightforward to integrate programmable geometry into an existing application without overhauling the way in which vertex data is presented to OpenGL. Moreover, applications should be able to mix existing fixed function vertex processing with programmable geometry.
3. Forward focus. In our view, programmable geometry frees programmers from existing API conventions of what a “vertex normal” or a “light direction” is; the vertex program supplies these semantic connections, transcending per-vertex attributes and vertex-related naming. By not constraining programmable geometry to existing conventions, we hope this will encourage novel applications for programmable geometry, including automatic generation of vertex programs by higher-level software [22].
4. Preparation to expose future programmability. We believe that other functionality beyond vertex processing in OpenGL's dataflow will eventually be programmable as well. The programming interface should be amenable to exposing other types of programmability.
5. Well-defined execution environment. Preliminary feedback from developers and our own thinking convinced us that an unconstrained execution environment for programmable geometry would lead to frustration for developers. Unlike textures that can usually be down-sampled if too large, vertex programs that require more instructions, registers, or other resources that are not available on a given implementation cannot be easily simplified to cope with implementation limitations. For this reason, we chose to require a strict, well-defined execution environment.

5.2 Programming Model
NV_vertex_program augments OpenGL vertex processing with a new mode known as vertex program mode. Initially, vertex program mode is disabled. When disabled, vertices are transformed by OpenGL's conventional vertex-processing functionality, consisting of coordinate transformation, vertex lighting, texture coordinate generation, and user-defined clip planes.

`
Vertex program state affects the OpenGL dataflow only when vertex program mode is enabled, so vertex program mode being initially disabled ensures backward compatibility.

Vertex program mode is enabled as follows:
    glEnable(GL_VERTEX_PROGRAM_NV);

[Table fragment: 15 | Texture coord 7 | glMultiTexCoord (GL_TEXTURE7 ...)]

[...] typically dwarfs the overhead involved in string parsing for program loading.

We expect that most vertex programs will be written in a human-readable form. Building the parser for program strings into OpenGL eliminates the potential for bugs due to errors in translation to byte-code. Other approaches such as a
When enabled, a glVertex command (or equivalent) initiates
`glNewProgram/glEndProgram approach similar to display
`vertex program