`COMPUTER GRAPHICS -PROCEEDINGS-
`A Publication of ACM SIGGRAPH
`
`Sponsored by the ACM's Special
`Interest Group on Computer
`Graphics
`MEDIATEK, Ex. 1013, Page 1
`IPR2018-00101
`
`

`

`
`
`Annual Conference Series 2001
`SIGGRAPH 2001
`Conference Proceedings
`August 12-17, 2001
`Papers Chair: Eugene Fiume
`
`
`
`

`

`SIGGRAPH 2001, Los Angeles, California, August 12-17, 2001
`
`The Association for Computing Machinery, Inc.
`1515 Broadway
`New York, New York 10036
`
`Copyright © 2001 by the Association for Computing Machinery, Inc. (ACM). Permission to make digital or hard copies of
`portions of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed
`for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyright for
`components of this work owned by others than ACM must be honored. Abstracting with credit is permitted.
`
`To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
`Request permission to republish from: Publications Department, ACM, Inc. Fax +1-212-869-0481 or e-mail
`permissions@acm.org.
`
`For other copying of articles that carry a code at the bottom of the first or last page, copying is permitted provided that the
`per-copy fee indicated in the code is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923.
`
`Notice to Past Authors of ACM-Published Articles
`ACM intends to create a complete electronic archive of all articles and/or other material previously published by ACM. If you
`have written a work that was previously published by ACM in any journal or conference proceedings prior to 1978, or any
`SIG newsletter at any time, and you do NOT want this work to appear in the ACM Digital Library, please inform
`permissions@acm.org, stating the title of the work, the author(s), and where and when published.
`
`ACM ISBN: 1-58113-374-X
`Additional copies may be ordered prepaid from:
`ACM Order Department
`P.O. Box 11405
`Church Street Station
`New York, NY 10286-1405
`Phone: 1-800-342-6626 (USA and Canada)
`+1-212-626-0500 (All other countries)
`Fax: +1-212-944-1318
`E-mail: acmhelp@acm.org
`
`ACM Order Number: 428010
`
`Printed in the USA
`
`
`

`

`
`Computer Graphics Proceedings, Annual Conference Series, 2001
`
`This material may be protected by Copyright law (Title 17 U.S. Code)
`
`
`A User-Programmable Vertex Engine
`
`Erik Lindholm
`erikl@nvidia.com
`
`Mark J Kilgard
`mjk@nvidia.com
`
`Henry Moreton
`moreton@nvidia.com
`
`NVIDIA Corporation
`
`ABSTRACT
`In this paper we describe the design, programming interface, and
`implementation of a very efficient user-programmable vertex
`engine. The vertex engine of NVIDIA’s GeForce3 GPU evolved
`from a highly tuned fixed-function pipeline requiring considerable
`knowledge to program. Programs operate only on a stream of
`independent vertices traversing the pipe. Embedded in the broader
`fixed function pipeline, our approach preserves parallelism
`sacrificed by previous approaches. The programmer is presented
`with a straightforward programming model, which is supported by
`transparent multi-threading and bypassing to preserve parallelism.
`
`In the remainder of the paper we discuss the motivation behind
`our design and contrast it with previous work. We present the
`programming model, the instruction set selection process, and
`details of the hardware implementation. Finally, we discuss
`important API design issues encountered when creating an
`interface to such a device. We close with thoughts about the
`future of programmable graphics devices.
`
`Keywords
`Graphics Hardware, Graphics Systems.
`
`Permission to make digital or hard copies of all or part of this
`work for personal or classroom use is granted without fee
`provided that copies are not made or distributed for profit or
`commercial advantage and that copies bear this notice and the
`full citation on the first page. To copy otherwise, to republish,
`to post on servers or to redistribute to lists, requires prior
`specific permission and/or a fee.
`SIGGRAPH 2001, 12-17 August 2001, Los Angeles, CA.
`© 2001 ACM 1-58113-374-X/01/08...$5.00
`
`1 INTRODUCTION
`The dramatic increases in the computational power of graphics
`processing units (GPUs, Figure 1) have been fueled both by
`innovation and the continuing improvement in semiconductor
`process technologies. The need for increased
`performance has driven, and been driven by, increasingly rich
`graphics APIs. The motivation behind the creation of the user-
`programmable geometry engine described in this paper is
`twofold. First, the increasing configurability required by continually
`evolving graphics APIs requires a programmable device to
`[...] the combinatorial explosion of mode combinations.
`Second, high-performance programmability is an end unto itself.
`Given the right programming model, with a sufficient degree of
`target processor independence, the need for rapidly evolving
`graphics APIs is reduced, and an opportunity is created for
`inventiveness unconstrained by fixed-function, modally
`configured APIs and hardware. Further, compatibility across
`hardware generations and platforms will increase the lifespan and
`utility of programs written for geometry processors.
`
`Figure 1: Graphics Processing Unit (GPU) [diagram not reproduced; stages shown: Host Interface, Geometry, Primitive Assembly/Setup, Raster/Texture, Framebuffer Interface, with "vertex" and "fragment" labels]
`
`The programming model and design of the geometry engine in the
`GeForce3 was guided by several factors: commodity pricing,
`design time, area, legacy performance, programmable
`performance, programmability, and platform independence.
`Ultimately, all of these influence the commercial viability of the
`design. Design time obviously determines time to market. Area is
`directly linked to product cost. Previously existing applications
`must exhibit higher performance on new products. There can only
`be a slight performance penalty paid for taking advantage of
`programmability. To gain acceptance, the engine must be easy to
`program. Finally, to promote adoption across vendors, a standard
`interface is required and thus the functionality cannot be too
`tightly coupled to a specific hardware implementation; for
`example, CPU implementations must be viable.
`
`We provide a taxonomic description of previous programmable
`graphics processors, comparing them to our device. We show how
`the programming model can be effectively supported by a custom
`processor design. We describe how a programmable processing
`element can be incorporated into an existing graphics API.
`Finally, we illustrate how the programming model and interface
`may be used to efficiently implement complex custom effects.
`
`2 PREVIOUS WORK
`Geometric calculations have been accelerated for over 30 years,
`starting with early flight simulators. Among the best known is the
`Geometry Engine [5]. A system was built from 12 instances of the
`GE, coupled with a raster subsystem built out of AMD2903s. The
`GE was fabricated using a 3µm feature size and housed in a 40-
`pin package. The GeForce3 GPU is manufactured using a 0.18µm
`process with a ~550-pin package. So while available logic has
`increased by a factor of 300, the relative amount of available
`bandwidth has only increased by a factor of 14. Note that
`increases in clock frequency cancel in this relative measure. We
`provide these numbers simply to illustrate that the problem is
`continually evolving, and that the natural amount of computation
`performed by the GPU today is far more than was performed in
`years past, and probably a fraction of what will be appropriate
`tomorrow.
`
`The various products and technologies applied to performing the
`standard geometry processing tasks can be categorized by a small
`number of attributes: technology, arrangement, and
`programmability. The technology is one of ASIC, DSP, RISC
`
`
`CPU, and CPU extensions. Arrangement refers to the approaches
`to exploiting parallelism, such as SIMD or MIMD. Each system's
`programmability may be characterized by whether they were
`intended for end-user programming, and the relative ease with
`which they were programmed.
`The only non-parallel implementation, the Stellar GS1000 [4],
`used a supercomputer-like vector processor, and was driven by
`hand-coded assembly for critical paths.
`Pixar’s CHAP [17] and the Ikonas [7] are early examples of fine-
`grain SIMD processors, based on the AMD2903, user micro-
`codable by skilled programmers. These machines operated in
`parallel on pixel and vertex components. The only coarse-grained
`SIMD implementation of which we are aware is the geometry
`subsystem of the Indigo Extreme [11]. It was implemented using a
`hand micro-coded ASIC. The Indigo processed eight triangles in
`parallel, stalling if any of the group were clipped, or otherwise
`required branching.
`Following the original Geometry Engine, the IRIS GT [3] and
`The Pixel Machine [24] were the only machines to arrange
`floating point DSPs in pipeline fashion. As has been observed by
`many, the slowest processor in the pipeline gated these machines’
`performance. Since it was only practical to distribute the geometry
`tasks statically, the pipelines were inefficient for certain
`workloads.
`
`MIMD machines dominate the history of geometry processors. In
`each case the individual processors operated on single triangles.
`The Raster Tech GX4000 [26],[27] was the earliest example,
`followed by Pixel-Planes 5 [10], the DN10000VS [15], Pixel
`Flow [19], and the RealityEngine [2]. The GX4000 used a Weitek
`floating point DSP, while all but one of the remaining machines
`used the i860XP [13], a 64-bit microprocessor. The last of the
`MIMD geometry subsystems was the InfiniteReality [23], using a
`custom micro-coded ASIC built to exceed the performance
`available in third party processors. The InfiniteReality’s processor
`was micro-coded in SIMD fashion within each of the processors
`in a MIMD array of configurable size.
`Alternatives to the above large high-performance machines are the
`processor extensions, all of which exploit fine-grained SIMD
`parallelism similar to the CHAP and Ikonas. Each of these
`exploits the existing resources and clock rate of a general purpose
`CPU to deliver high performance. MIPS-3D ASE [18] and
`3DNow! [1] perform paired single SIMD floating point
`operations. Intel’s SSE instructions [14] express 4-wide SIMD
`processing. Motorola's AltiVec [9] delivers the full 4-wide SIMD
`performance. Sony’s Emotion Engine [16] has two 4-wide SIMD
`processors. The first is interfaced to the main CPU as a
`coprocessor, executing instructions directly from the application's
`instruction stream. The second processor is more loosely coupled,
`running loaded subroutines, typically performing standard
`geometry processing tasks.
`In all cases, experts were required to very carefully craft assembly
`code to achieve processor performance approaching theoretical
`peaks. Close attention to pipeline latency, hazards, and stall
`conditions was necessary to produce good results. While
`compilers were generally available, generated code was typically
`of inadequate performance.
`In contrast to virtually all of these systems, our geometry engine
`only exposes the programmability of a small part of the larger
`geometry pipeline. Tasks such as vertex load&store, format
`conversion, primitive assembly, clipping, and triangle setup occur
`completely in parallel, in pipeline fashion. We use 4-wide fine-
`grained SIMD floating point to provide the necessary
`performance, and run multiple execution threads to maintain
`efficiency and provide a very simple programming model.
`
`3 PROGRAMMING MODEL
`In this section we describe our programming model for geometry
`processing and discuss the design in the areas of input, output,
`data path, and instruction set selection. We include the rationale
`for choices made in the design process.
`3.1 Vertex Processing
`There were two main possibilities for processing the vertex
`stream: as independent vertices or as part of a geometric
`primitive, for example a triangle. The advantage of primitive-level
`information is enabling operations such as culling, reducing
`processing time. However, we determined that the increased
`complexity and loss of parallelism in the primitive
`model did not justify the perceived benefits. We chose an
`independent vertex program model to exploit the parallel nature
`of the task, and greatly simplify the resulting programming task.
`We preserved the latter stages of the fixed function programming
`model, there being no benefit to their programmability. In fact,
`incorrect clipping could freeze a hardware rasterizer. As such we
`leave frustum clipping, perspective divide, and viewport scale and
`bias to subsequent implementation-specific processing. The
`programming model is capable of expressing everything in the
`fixed function pipeline except user clip planes. We instead
`recommend encoding plane distances into texture coordinates and
`using fragment level operations to implement this functionality.
`3.2 Precision and Data Type
`IEEE single precision floating point has been used for many years
`as the standard precision for 3D transformations and to keep the
`model simple it was adopted as the only data type. The common
`data in 3D graphics are 3 and 4 component vectors, for example
`position, normal, texture coordinates and colors. The basic data
`type is therefore the quad-float vector written as (x,y,z,w).
`3.3 Scalar and Vector Handling
`It was critical to deal efficiently with scalar packing/extraction
`and vector data in this design since the 3D transform pipeline
`mixes these operations. Two simple concepts can resolve this:
`1. On input, vectors can have their components arbitrarily
`rearranged/replicated (swizzled).
`2. Any operation generating a scalar must generate that scalar
`replicated across all components, and output writes have a
`component write mask.
`A scalar value in a vector register can be replicated into a vector
`through (1), and then stored again as a scalar through (2).
`Swizzling is very useful for doing cross products efficiently,
`where the source vectors need to be rotated. Another use is
`converting constants such as [-1,0,1,2] into others such as
`[0,0,1,0] or [-1,-1,-1,1].
`3.4 Program Model
`The program model is illustrated in Figure 2. The current vertex
`attributes are available in the input (source) registers, and the
`processed vertex is written into the output (destination) registers.
`The constant bank holds transform and light parameters, and the
`register file (R) holds temporary results. A function unit (F)
`implements the instruction set.
`Making the vertex source read-only by the vertex program, and
`the destination write-only recognizes the streaming nature of the
`design and simplifies implementation.
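The two concepts from Section 3.3 (source swizzling and destination write masks) can be illustrated with a small software model. This is a minimal sketch, not the hardware's implementation: registers are modeled as 4-element Python lists, and the helper names `swizzle` and `write` are hypothetical.

```python
# Illustrative model only: a register is a list [x, y, z, w].
# The names INDEX, swizzle and write are assumptions made for this sketch.
INDEX = {"x": 0, "y": 1, "z": 2, "w": 3}

def swizzle(reg, pattern="xyzw", negate=False):
    """Rearrange or replicate the components of a source register on read."""
    sign = -1.0 if negate else 1.0
    return [sign * reg[INDEX[c]] for c in pattern]

def write(dest, value, mask="xyzw"):
    """Return a copy of dest in which only the masked components are replaced."""
    out = list(dest)
    for c in mask:
        out[INDEX[c]] = value[INDEX[c]]
    return out

# Converting the constant [-1,0,1,2] into the other constants named in the text.
c = [-1.0, 0.0, 1.0, 2.0]
print(swizzle(c, "yyzy"))  # [0.0, 0.0, 1.0, 0.0]
print(swizzle(c, "xxxz"))  # [-1.0, -1.0, -1.0, 1.0]

# A scalar held in w is replicated on read (1) and stored back as a scalar (2).
r0 = [1.0, 2.0, 3.0, 4.0]
replicated = swizzle(r0, "wwww")
r1 = write([0.0, 0.0, 0.0, 0.0], replicated, mask="x")
```

Replicating on read and masking on write keeps every instruction a full 4-wide operation, which is why no separate scalar pack or extract instructions are needed in this model.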
`
`
`
`Figure 2: Program Model [diagram not reproduced; surviving label: Destination (write-only)]
`Input Attributes
`There are 16 quad-float vertex source attribute registers. Fixed
`function mode typically requires a position, normal, two colors,
`up to eight texture coordinate sets, skin weights, fog, and point
`size. These are sent from the host in many formats including
`shorts, integers, and floats, with conversion to floating
`point done before the data is accessed. Unspecified attribute
`components default to 0.0 for the second and third components,
`and 1.0 for the fourth. The attributes are all persistent, that is they
`retain their data until they are changed by subsequent API calls,
`and are addressed from 0 to 15. An API write to attribute 0 (the
`vertex position when in fixed function mode) will invoke the
`vertex program. Only one vertex attribute may be read per
`program instruction.
`
`To hold constants such as matrices, light positions, and plane
`coefficients that are used in typical vertex programs, there is a
`constant bank of 96 quad-floats. It may only be loaded before
`vertices are processed (for example outside of Begin/End). The
`size was chosen based on fixed function memory usage, and to
`allow a reasonably large set of matrices for indexed skinning. As
`with source attributes, only one constant may be read by one
`program instruction. The program may not write to constants
`because it would create a dependency between vertices, forcing
`serialization causing a serious performance impact.
`
`There is also one integer address register that may be loaded using
`an instruction (ARL). This address register allows for indexed
`constant reads with out-of-range reads returning the (0,0,0,0)
`vector.
`
`The read/write register file is 12 quad-floats in size and allows
`three reads and one write per instruction. The size was chosen to
`allow reasonably simple modular code design, where some of the
`registers would be used for storage of variables across multiple
`modules. All registers are initialized to (0,0,0,0) per vertex.
`
`Any vector read may be sourced as multiple operands, and
`individually swizzled/negated each time; see Figure 2. Since any
`source can be negated, there is no need for a subtract instruction.
`
`
`Output Attributes
`
`Vertex program outputs merge back into the fixed function
`pipeline at the homogeneous clip space point, so there is a standard
`mapping of output attributes. Position is used for clipping. Vertex
`color output components are automatically clamped to the range
`0.0 to 1.0. There is also a fog distance, and point size output
`(clamped, only valid for points). Having a fog output permits
`more general fog effects than using the position’s z or w values,
`and is interpolated before use as a distance in the standard fog
`equations. We allow for up to eight texture coordinate sets that
`can be used for traditional texturing as well as more novel effects
`in combination with GeForce3’s texture shader and register
`combiners per-fragment functionality [20]. Texture coordinates
`are assumed to be full precision and range, as well as perspective
`correct when used in pixel programs.
`
`All instruction writes have an optional 4-component write mask.
`
`Table 1: Output Attributes
`
`All vertex output registers are initialized to (0.0,0.0,0.0,1.0) at the
`start of a vertex program. Subsequent writes then apply the output
`write mask to update the selected components. This avoids any
`problems with undefined outputs, and having to verify raster
`subsystem input options.
`
`3.7 Instruction Set
`The instruction set consists of 17 operations. These can be
`divided into vector, scalar, and miscellaneous operations. We
`discuss the instructions selected after explaining the constraints
`we chose to impose.
`
`Table 2: Instruction Set
`
`3.7.1 No Branching
`The fixed function transform paths in OpenGL® [25] and
`Direct3D™ [6] are both controlled by global state that does not
`depend on the actual data supplied with each vertex. This allows
`for driver optimizations at the time the first vertex is supplied by
`the application since all subsequent vertices (until a new state
`change) can then share this carefully optimized path. The result is
`a code segment that removes state checking and branching. It is
`therefore possible to support the full fixed function transform path
`(at least to homogeneous clip space) without branching. The
`decision was therefore made to not support branching, keeping
`the hardware as simple as possible. Also, late binding changes in
`control flow disrupt pipeline efficiency. Simple if/then/else
`evaluation is still supported through sum-of-products using 1.0
`and 0.0, which can be generated with SLT and SGE.
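The sum-of-products idea can be sketched in software as follows. This is an illustrative emulation, assuming per-component comparison semantics as described later for SLT and SGE; the function `select_lt` is a hypothetical helper, not an instruction of the engine.

```python
# Sketch of branch-free selection using 1.0/0.0 condition values.
# slt/sge emulate the per-component compare instructions; select_lt is a
# hypothetical helper showing the sum-of-products pattern.

def slt(a, b):
    """Per component: 1.0 if a < b, otherwise 0.0."""
    return [1.0 if x < y else 0.0 for x, y in zip(a, b)]

def sge(a, b):
    """Per component: 1.0 if a >= b, otherwise 0.0."""
    return [1.0 if x >= y else 0.0 for x, y in zip(a, b)]

def select_lt(a, b, then_val, else_val):
    """Compute (a < b) ? then_val : else_val per component as a sum of products."""
    t = slt(a, b)  # 1.0 where the condition holds
    e = sge(a, b)  # 1.0 where it does not
    return [ti * tv + ei * ev
            for ti, tv, ei, ev in zip(t, then_val, e, else_val)]

result = select_lt([1.0, 5.0, 2.0, 2.0], [2.0, 2.0, 2.0, 3.0],
                   [10.0] * 4, [20.0] * 4)
print(result)  # [10.0, 20.0, 20.0, 10.0]
```

Both alternatives are always evaluated and the condition values merely weight them, so the instruction stream is identical for every vertex.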
`3.7.2 Constant Latency
`One instruction set constraint we imposed was that our hardware
`implementation must issue any instruction per clock and execute
`
`
`all instructions with the same latency, limiting the complexity of
`any instruction. This improves programmability and simplifies the
`hardware. All operands are immediately available, limiting the
`size of register and memory banks.
`3.7.3 Instruction Set Rationale
`Since we wanted to use the same instruction set for vertex
`programs and fixed function (non-programmable) mode, we
`started by analyzing the fixed function implementation of a
`previous architecture. We found that the equivalents of the MOV,
`MUL, ADD, and MAD instructions were used about 50% of the time,
`and that the DP3 and DP4 equivalents were used about 40% of the
`time. We support dot products for their coding convenience, and
`also because as the number of cycles spent on a vertex decreases
`over architectural generations, it becomes more important to have
`powerful concise instructions. Cross products are also important,
`and they can be done via an efficient MUL, MAD sequence with
`source vector rotations. For example, R1 = R0×R2 is done as:
`MUL R1, R0.zxyw, R2.yzxw;
`MAD R1, R0.yzxw, R2.zxyw, -R1;
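The two-instruction cross product can be checked with a small emulation. The sketch below models registers as 4-element lists; the helpers `swz`, `mul` and `mad` are illustrative stand-ins for the swizzle and the MUL and MAD instructions, not the hardware's behavior in every detail.

```python
# Emulates the MUL/MAD cross-product sequence on [x, y, z, w] lists.
# Helper names are assumptions made for this sketch.
INDEX = {"x": 0, "y": 1, "z": 2, "w": 3}

def swz(reg, pattern):
    """Read a register with its components rearranged."""
    return [reg[INDEX[c]] for c in pattern]

def mul(a, b):
    return [x * y for x, y in zip(a, b)]

def mad(a, b, c):
    return [x * y + z for x, y, z in zip(a, b, c)]

def cross_via_mul_mad(r0, r2):
    # MUL R1, R0.zxyw, R2.yzxw;
    r1 = mul(swz(r0, "zxyw"), swz(r2, "yzxw"))
    # MAD R1, R0.yzxw, R2.zxyw, -R1;
    r1 = mad(swz(r0, "yzxw"), swz(r2, "zxyw"), [-v for v in r1])
    return r1

print(cross_via_mul_mad([1.0, 2.0, 3.0, 0.0], [4.0, 5.0, 6.0, 0.0]))
# [-3.0, 6.0, -3.0, 0.0]
```

The w component comes out as R0.w*R2.w - R0.w*R2.w, so it is zero whenever that product is finite.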
`
`We support reciprocal (RCP) instead of division due to the constant
`latency restriction. The RCP instruction is also scalar since the
`main use of it is in the perspective division of w in homogeneous
`clip space (done after the vertex program) which involves the
`multiply of the (x,y,z) vector with the scalar 1/w.
`The reciprocal square root (RSQ) is mainly used in normalizing
`vectors to be used in lighting equations. The typical sequence is a
`DP3 to find the vector length squared, a RSQ to get the reciprocal
`length, and a MUL to normalize the vector. It is very convenient to
`use the vector w component for storing the length squared and
`reciprocal length values. RSQ is also a scalar operator.
`To avoid problems with vector lengths of 0.0 causing RSQ to return
`infinity, we mandated that 0.0 times anything be 0.0. This is also
`useful in conditional evaluation when multiplying by 0.0. Another
`mandate is that 1.0 times anything be the same value.
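The DP3, RSQ, MUL normalization sequence and the "0.0 times anything is 0.0" rule can be sketched as below. This is an illustrative model: `hw_mul` and `rsq` are hypothetical names, and `rsq` here simply returns infinity for a zero input as the text describes. Ordinary IEEE arithmetic would give NaN for 0.0 times infinity, which is the case the mandate avoids.

```python
import math

def hw_mul(a, b):
    """Multiply with the rule that 0.0 times anything (even inf or NaN) is 0.0."""
    if a == 0.0 or b == 0.0:
        return 0.0
    return a * b

def rsq(x):
    """Reciprocal square root; infinity for 0.0. Negative inputs are not handled in this sketch."""
    return math.inf if x == 0.0 else 1.0 / math.sqrt(x)

def normalize(v):
    x, y, z = v
    len_sq = hw_mul(x, x) + hw_mul(y, y) + hw_mul(z, z)  # DP3: length squared
    inv_len = rsq(len_sq)                                 # RSQ: reciprocal length
    return [hw_mul(c, inv_len) for c in v]                # MUL: scale the vector

print(normalize([0.0, 0.0, 0.0]))  # [0.0, 0.0, 0.0] rather than NaNs
```

With plain floating-point multiplication the zero vector would normalize to NaN in every component; with the mandated rule it stays a zero vector.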
`A major exception to our goal of similar performance in fixed
`function and program mode involved lighting. The previous
`architecture design has a separate hard-wired lighting engine.
`Since it was too hard to expose this engine in program mode, the
`decision was made to turn it off when running vertex programs.
`Fixed function performance with heavy lighting can therefore be
`twice as fast as a comparable vertex program. To alleviate this
`problem, two instructions were included: DST and LIT. The DST
`instruction assists in constructing attenuation factors of the form:
`(K0,K1,K2)·(1,d,d*d) = K0 + K1*d + K2*d*d
`where d is some distance. Since d*d and 1/d are natural
`byproducts of the vector normalization process, these values are
`input as (NA,d*d,d*d,NA) and (NA,1/d,NA,1/d) to DST, which
`then returns the (1,d,d*d,1/d) vector. The last 1/d term can be
`used with a DP4 operation if desired.
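A short sketch shows how the DST result feeds the attenuation polynomial. The component mapping used in `dst` below is one mapping consistent with the inputs and outputs stated in the text (y is the product of the two y components, z is taken from the first source, w from the second); treat it as an assumption rather than a statement of the hardware's definition. The `attenuation` helper is hypothetical.

```python
def dst(src0, src1):
    """src0 = (NA, d*d, d*d, NA), src1 = (NA, 1/d, NA, 1/d) -> (1, d, d*d, 1/d)."""
    return [1.0, src0[1] * src1[1], src0[2], src1[3]]

def attenuation(k, d):
    """Evaluate K0 + K1*d + K2*d*d as a 3-component dot product with the DST result."""
    d_sq = d * d
    inv_d = 1.0 / d
    v = dst([0.0, d_sq, d_sq, 0.0], [0.0, inv_d, 0.0, inv_d])
    return k[0] * v[0] + k[1] * v[1] + k[2] * v[2]  # DP3 with (K0, K1, K2)

print(dst([0.0, 4.0, 4.0, 0.0], [0.0, 0.5, 0.0, 0.5]))  # [1.0, 2.0, 4.0, 0.5]
print(attenuation((1.0, 0.5, 0.25), 2.0))               # 3.0
```

Note that d itself is recovered as (d*d)*(1/d), so no square root is needed beyond the RSQ already used for normalization.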
`The LIT instruction does the fairly complex ambient, diffuse, and
`specular calculations with clamping based on N·L, N·H, and the
`power p. The calculations are:
`Output.x = 1.0;                          // ambient
`Output.y = max(N·L, 0.0);                // diffuse
`Output.z = 0.0;                          // specular
`if (N·L > 0.0 && p == 0.0)
`    Output.z = 1.0;
`else if (N·L > 0.0 && N·H > 0.0)
`    Output.z = (N·H)^p;
`Output.w = 1.0;
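The LIT calculation listed above translates directly into a small function. This is a sketch of the listed logic only, assuming the dot products N·L and N·H and the power p are passed in as plain floats; the parameter names are illustrative.

```python
def lit(n_dot_l, n_dot_h, p):
    """Return [ambient, diffuse, specular, 1.0] following the calculation in the text."""
    out = [1.0, max(n_dot_l, 0.0), 0.0, 1.0]
    if n_dot_l > 0.0 and p == 0.0:
        out[2] = 1.0
    elif n_dot_l > 0.0 and n_dot_h > 0.0:
        out[2] = n_dot_h ** p
    return out

print(lit(0.5, 0.5, 2.0))   # [1.0, 0.5, 0.25, 1.0]
print(lit(-0.2, 0.9, 8.0))  # [1.0, 0.0, 0.0, 1.0]
```

A surface facing away from the light (N·L not positive) receives neither diffuse nor specular contribution, which is the clamping the text refers to.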
`
`Since LIT implements the specular power function via use of a
`log, multiply, and exp sequence, we also decided to expose the
`LOG and EXP instructions. Since the power is a variable in the LIT
`source, a table needing a pre-known specular power was not an
`option. We also wanted an accurate power function conforming to
`the cos^n model; hence known approximations would not suffice. It
`is possible to implement the LIT instruction with about 10 other
`instructions, but the performance loss is extreme.
`The LOG base 2 instruction returns an output accurate to about 11
`mantissa bits as well as two partial results: the exponent and
`mantissa of the source scalar. A more accurate user programmed
`approximation based on the limited range mantissa can be done
`with the result added to the exponent. The EXP base 2 instruction
`also returns an output accurate to about 11 mantissa bits as well as
`two partial results: two raised to the power of floor(source) and
`fraction(source). A more accurate user program
`approximation based on the limited range fraction can be done
`with the result multiplied by the power output. The precision of
`these instructions was based on the desired 8-bit color precision
`of the specular LIT operation. It takes about 10 instructions to
`achieve full accuracy LOG and EXP evaluation.
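The role of the partial results can be sketched in software. The functions below are illustrative: they use Python's `math.frexp` and `math.floor` to obtain the exponent/mantissa and floor/fraction splits, and they use `math.log2` and `2.0 ** fraction` as stand-ins for the polynomial a user program would evaluate over the limited range. The function names and the mantissa range [1, 2) are assumptions of this sketch.

```python
import math

def log_partials(x):
    """Split x > 0 into (exponent, mantissa) with x == mantissa * 2**exponent, 1 <= mantissa < 2."""
    m, e = math.frexp(x)          # x == m * 2**e with 0.5 <= m < 1
    return float(e - 1), m * 2.0

def refined_log2(x):
    exponent, mantissa = log_partials(x)
    # Stand-in for a user polynomial on [1, 2); its result is added to the exponent.
    return exponent + math.log2(mantissa)

def exp_partials(x):
    """Split x into (2**floor(x), fraction(x))."""
    f = math.floor(x)
    return 2.0 ** f, x - f

def refined_exp2(x):
    power, fraction = exp_partials(x)
    # Stand-in for a user polynomial on [0, 1); its result is multiplied by the power.
    return power * (2.0 ** fraction)

print(log_partials(10.0))  # (3.0, 1.25)
print(exp_partials(2.5))   # (4.0, 0.5)
```

Because the mantissa and the fraction each lie in a narrow fixed interval, a short polynomial over that interval is enough to reach higher accuracy than the roughly 11-bit direct outputs.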
`
`The MIN and MAX operations allow for clamping and absolute value
`computations (MAX of source and -source). Related to these are the
`SLT and SGE instructions that return 1.0 if the component comparison
`is true and 0.0 if false.
`
`The ARL instruction was added to allow support of vertex specific
`constant access such as a matrix or plane equation. It converts a
`floating-point scalar into a signed integer, which can be used as
`an offset into the constant memory. Out-of-range reads from
`constant memory return (0,0,0,0).
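Indexed constant access through the address register can be sketched as follows. The text does not state how the float is rounded to an integer; rounding toward negative infinity (`math.floor`) is an assumption of this sketch, as are the function names and the base-plus-offset addressing form.

```python
import math

NUM_CONSTANTS = 96  # size of the constant bank given in the text

def arl(scalar):
    """Convert a float to a signed integer address (floor rounding is an assumption)."""
    return math.floor(scalar)

def read_constant(constants, base, address):
    """Read constants[base + address]; out-of-range reads return (0,0,0,0)."""
    index = base + address
    if 0 <= index < len(constants):
        return constants[index]
    return [0.0, 0.0, 0.0, 0.0]

# Example: each constant i holds (i, i, i, i) so the selected row is easy to see.
bank = [[float(i)] * 4 for i in range(NUM_CONSTANTS)]
print(read_constant(bank, 4, arl(2.7)))    # [6.0, 6.0, 6.0, 6.0]
print(read_constant(bank, 90, arl(10.0)))  # [0.0, 0.0, 0.0, 0.0]
```

Returning a defined zero vector for out-of-range addresses means a bad per-vertex index cannot read arbitrary memory or fault the pipeline.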
`Sources are negated by prefixing a “-” sign, and can be swizzled
`via four optional subscripts that describe the component
`rearrangement desired. For example:
`MOV R0, -R1.wyzy;
`moves the negated w component of register R1 into the x
`component of register R0, moves the negated y and z components
`across, and uses the negated y component again to place into the
`R0 w component.
`The destination of an instruction has an optional write mask of the
`desired xyzw components to be written. For example:
`ADD R0.xw, R1, R2;
`updates the x and w components of R0 with the sum of R1 and R2.
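The operand notation in the two examples above (an optional "-" prefix, a four-letter swizzle on sources, a write mask on destinations) can be made concrete with a tiny parser. This is a hypothetical sketch covering only temporary registers written as R followed by a number; requiring mask letters in x, y, z, w order is an assumption, not something the text specifies.

```python
import re

# Patterns for this sketch only: "-R1.wyzy" style sources, "R0.xw" style destinations.
SOURCE_RE = re.compile(r"(-?)(R\d+)(?:\.([xyzw]{4}))?")
DEST_RE = re.compile(r"(R\d+)(?:\.(?=[xyzw])(x?y?z?w?))?")

def parse_source(text):
    """Return (negate, register, swizzle); the swizzle defaults to 'xyzw'."""
    m = SOURCE_RE.fullmatch(text.strip())
    if m is None:
        raise ValueError(f"bad source operand: {text!r}")
    return m.group(1) == "-", m.group(2), m.group(3) or "xyzw"

def parse_dest(text):
    """Return (register, write_mask); the mask defaults to all four components."""
    m = DEST_RE.fullmatch(text.strip())
    if m is None:
        raise ValueError(f"bad destination operand: {text!r}")
    return m.group(1), m.group(2) or "xyzw"

print(parse_source("-R1.wyzy"))  # (True, 'R1', 'wyzy')
print(parse_dest("R0.xw"))       # ('R0', 'xw')
```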
`4 HARDWARE IMPLEMENTATION
`4.1 Overview
`The hardware implementation of vertex programs is divided into
`two main blocks: the vertex attribute buffer (VAB) and the
`floating point core.
`
`Figure 3: Hardware Units [diagram not reproduced; surviving labels: Vertex In, Vector FP Core, Vertex Out]
`The VAB is responsible for vertex attribute persistence, and the
`floating-point core processes the instruction set.
`4.2 Attribute Input
`Vertex attributes are converted to floating point representation
`before arriving at the VAB, which has room for the 16 input
`attributes. The contents of each address default to (0.0,0.0,0.0,1.0)
`as an attribute write arrives, and are then overwritten by the valid
`data components. This is required since the API allows for
`sending fewer than four components; defaulting the remainder saves
`bandwidth into the GPU.
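The attribute behavior described here (defaulting on write, persistence between vertices, and a write to address 0 launching a vertex) can be sketched with a small class. This is an illustrative software model, not the VAB's hardware design; the class name, the `launched` list, and the snapshot-on-launch behavior are assumptions of the sketch.

```python
NUM_ATTRIBUTES = 16
DEFAULT = (0.0, 0.0, 0.0, 1.0)

class VertexAttributeBuffer:
    """Toy model of attribute persistence and defaulting."""

    def __init__(self):
        self.attributes = [list(DEFAULT) for _ in range(NUM_ATTRIBUTES)]
        self.launched = []  # snapshots handed to the (not modeled) vertex program

    def write(self, address, components):
        if not 0 <= address < NUM_ATTRIBUTES:
            raise IndexError("attribute address out of range")
        if not 1 <= len(components) <= 4:
            raise ValueError("expected one to four components")
        value = list(DEFAULT)              # default first ...
        for i, c in enumerate(components):
            value[i] = float(c)            # ... then overwrite the valid components
        self.attributes[address] = value
        if address == 0:                   # a write to address 0 launches a vertex
            self.launched.append([list(a) for a in self.attributes])

vab = VertexAttributeBuffer()
vab.write(3, [0.5, 0.25])       # e.g. a two-component attribute
vab.write(0, [1.0, 2.0, 3.0])   # position: triggers the first vertex
vab.write(0, [4.0, 5.0, 6.0])   # second vertex reuses the persistent attribute 3
print(vab.launched[1][3])       # [0.5, 0.25, 0.0, 1.0]
```

Because attributes persist, an application only needs to resend the values that change from one vertex to the next.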
`
`
`0.0%infinity and 0.0*NaN. The Special Function Unit calculates
`the rcp and RsQ functions to within about 1.5 bits of IEEE
`precision using two-pass Newton-Raphsoniteration from a seed
`table. While lighting may suffice with a lower precision Rs9,
`texture and position evaluation can require much higher precision.
`It was not felt necessary to provide a low-precision RsQ option.
`The hardware accepts one instruction per clock and fully
`implements all
`instruction set
`input/output options with no
`performance penalty. All
`input vectors are available with no
`latency.
`5 PROGRAMMING INTERFACES
`the
`Given the predominance of OpenGL and Direct3D,
`3D
`integration
`of
`programmable
`geometry
`into
`these
`programming interfaces is vital to its widespread availability and
`quick adoption. The discussion below concentrates on how we
`integrated programmable geometry into OpenGL through an
`extension named NV_vertex_program. Where Direct3D makes
`alternative design choices, such choices are noted.
`5.1 Design Goals
`Existing OpenGL applications
`1. Backward compatibility,
`unaware of programmable
`geometry
`should
`operate
`unchanged.
`It should be relatively straightforward to
`Ease of adoption.
`integrate
`programmable
`geometry
`into
`an
`existing
`application without overhauling the way in which vertex data
`is presented to OpenGL. Moreover, applications should be
`able to mix existing fixed function vertex processing with
`programmable geometry.
`Forward focus,
`In our view, programmable geometry frees
`programmers from existing API conventions of what a
`“vertex normal” or a “light direction” is; the vertex program
`supplies these semantic connections, transcending per-vertex
`attributes and vertex-related naming. By not constraining
`programmable geometry to existing conventions, we hope
`this will encourage novel applications for programmable
`geometry, including automatic generation of vertex programs
`by higher-level software [22].
`Preparation to expose future programmability. We believe
`that other
`functionality beyond vertex processing in
`OpenGL’s dataflow will eventually be programmable as
`well. The programming interface should be amenable to
`exposing other types of programmability.
`5. Well-defined execution environment. Preliminary feedback
`from developers and our own thinking convinced us that an
`unconstrained execution environment
`for programmable
`geometry would lead to frustration for developers. Unlike
`textures that can usually be down-sampled if too large,
`vertex programs that require more instructions, registers, or
`other
`resources
`that
`are not
`available
`on
`a_ given
`implementation cannot be easily simplified to cope with
`implementation limitations. For this reason, we chose to
`require a strict, well-defined execution environment.
`5.2 Programming Model
`NV_vertex_program augments OpenGL vertex processing with a
`new mode known as vertex program mode.
`Initially, vertex
`program mode is disabled. When disabled, vertices are
`transformed
`by OpenGL’s
`conventional
`vertex-processing
`functionality, consisting of coordinate transformation, vertex
`lighting,
`texture coordinate generation, and user-defined clip
`planes.
`
`2.
`
`3.
`
`4,
`
`MEDIATEK,Ex. 1013, Page 8
`IPR2018-00101.
`
`|
`
`Figure 4: VAB
`
`The VABdrains into a number ofinput buffers (IB) that are used
`"to feed the floating-point core in a round-robin fashion. Dirty bits
`are maintained in the VAB so that only changed attributes are
`updated when the same buffer is again the drain target. The
`transfer of a vertex is triggered by a write to address 0,
`corresponding to the vertex position in fixed function mode. To
`_ prevent bubbles during simultaneous loading and draining of the
`VAB, incoming writes may push outthe contents of the target
`address, superceding a default drain sequence.
`4.3 The Floating-Point Core
`The floating-point core is a multi-threaded vector processor operating
`on quad-float data. Vertex data is read from the input buffers and
`transformed into the output buffers (OB). The latencies of the vector
`and special function units are equal, and multiple vertex threads are
`used to hide this latency.
`The SIMD Vector Unit is responsible for the MOV, MUL, ADD, MAD, DP3,
`DP4, DST, MIN, MAX, SLT, and SGE operations. The Special Function Unit
`is responsible for the RCP, RSQ, LOG, EXP, and LIT operations.
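For readers unfamiliar with these mnemonics, the arithmetic of a few of them can be sketched in C on a four-float register. The `Vec4` type and `vp_*` function names are illustrative assumptions, the component-selection argument of `vp_rcp` stands in for the instruction's scalar operand, and ordinary host floating point is used, so the hardware's precision and rounding behavior are not modeled.

```c
#include <assert.h>

typedef struct { float c[4]; } Vec4;  /* one quad-float register */

/* MAD (vector unit): component-wise a * b + d. */
static Vec4 vp_mad(Vec4 a, Vec4 b, Vec4 d)
{
    Vec4 r;
    for (int i = 0; i < 4; i++)
        r.c[i] = a.c[i] * b.c[i] + d.c[i];
    return r;
}

/* DP4 (vector unit): four-component dot product, with the scalar
 * result replicated to every component. */
static Vec4 vp_dp4(Vec4 a, Vec4 b)
{
    float s = 0.0f;
    for (int i = 0; i < 4; i++)
        s += a.c[i] * b.c[i];
    Vec4 r = {{ s, s, s, s }};
    return r;
}

/* SLT (vector unit): per component, 1.0 where a < b, else 0.0. */
static Vec4 vp_slt(Vec4 a, Vec4 b)
{
    Vec4 r;
    for (int i = 0; i < 4; i++)
        r.c[i] = (a.c[i] < b.c[i]) ? 1.0f : 0.0f;
    return r;
}

/* SGE (vector unit): per component, 1.0 where a >= b, else 0.0. */
static Vec4 vp_sge(Vec4 a, Vec4 b)
{
    Vec4 r;
    for (int i = 0; i < 4; i++)
        r.c[i] = (a.c[i] >= b.c[i]) ? 1.0f : 0.0f;
    return r;
}

/* RCP (special function unit): reciprocal of one selected component,
 * replicated to every component. */
static Vec4 vp_rcp(Vec4 a, int comp)
{
    assert(comp >= 0 && comp < 4);
    float s = 1.0f / a.c[comp];
    Vec4 r = {{ s, s, s, s }};
    return r;
}
```

SLT and SGE produce 0.0 or 1.0 rather than a flag, so their results can be fed into MUL or MAD to select between values without branching.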
`
`Figure 5: Floating Point Core
`The Vector Unit floating-point precision is approximately IEEE. There
`is no support for de-normalized numbers or exceptions, and rounding is
`always towards negative infinity. The hardware outputs 0.0 for a
`multiply with any source of 0.0, including
`
`
`Vertex program state affects the OpenGL dataflow only when vertex
`program mode is enabled, so vertex program mode being initially
`disabled ensures backward compatibility.
`Vertex program mode is enabled as follows:
`
`glEnable(GL_VERTEX_PROGRAM_NV);
`
`When enabled, a glVertex command (or equivalent) initiates vertex
`program
`
`15 | Texture coord 7 | glMultiTexCoordARB(GL_TEXTURE7_ARB, ...)
`
`...ning, typically dwarfs the overhead involved in string parsing for
`program loading.
`
`We expect that most vertex programs will be written in a
`human-readable form. Building the parser for program strings into
`OpenGL eliminates the potential for bugs due to errors in translation
`to byte-code. Other approaches such as a glNewProgram/glEndProgram
`approach similar to display
