Reference 10

PATENT OWNER DIRECTSTREAM, LLC
EX. 2111, p. 1
A PRACTITIONER'S GUIDE TO
ADJUSTED PEAK PERFORMANCE

U.S. Department of Commerce
Bureau of Industry and Security

ACKNOWLEDGEMENT

The Department of Commerce would like to acknowledge the Information Systems Technical Advisory Committee (ISTAC), which developed the Adjusted Peak Performance formula, supported the adoption of this formula by the Wassenaar Arrangement, prepared the initial drafts of this document, and recommended that it be published by the Department.

The ISTAC is a government-sponsored technical advisory committee made up of industry and government representatives and administered by the Department of Commerce. The ISTAC advises the U.S. Government on U.S. export control matters as authorized under the Export Administration Act.

Note

Please note that this document makes use of various proprietary trademarks and trade names (hereinafter "Proprietary Marks") as a means of identifying relevant products, systems, and vendors. Use of these Proprietary Marks is for descriptive and identification purposes only. Any third-party use of these Proprietary Marks may require permission from their respective owners, as well as appropriate use of the ® and/or ™ symbols.

A PRACTITIONER'S GUIDE TO
ADJUSTED PEAK PERFORMANCE

US Dept. of Commerce, BIS, Information Systems Technical Advisory Committee

December 2006
BACKGROUND

On April 24, 2006 the US Department of Commerce implemented a new formula for calculating the performance of digital computers, replacing the Composite Theoretical Performance (CTP) formula, measured in Millions of Theoretical Operations per Second (MTOPS), with the Adjusted Peak Performance (APP) formula, measured in Weighted TeraFLOPS (WT).

The APP formula, like the CTP formula it replaced, was designed to determine computer performance for export control purposes. The CTP formula, implemented in 1990, could no longer keep up with advances in microprocessor technology and computer architecture, and was therefore losing relevance in meeting national security objectives. The APP formula, derived from existing industry standards, is a more accurate differentiator between high-end, special-order, high-performance computers (HPCs), such as vector supercomputers, and commodity off-the-shelf systems.

The APP formula restored the credibility of HPC controls by focusing them on systems at the high end of industry capability. The applications run on these systems demand exceptional floating-point performance. HPCs used for national security applications include vector supercomputers, massively-parallel processor systems, and proprietary cluster architectures.

This practitioner's guide is written as an aid to calculating the WT values of HPCs. A similar guide, A Practitioner's Guide to Composite Theoretical Performance, was published in November 1991 to accompany the implementation of the CTP formula. Like its predecessor, this practitioner's guide recognizes that a rating system for export control of computers must be easy to complete, independent of software, subject to governmental audit, and capable of producing a single rating number for a given computer.

APP is simple, can usually be calculated from publicly available vendor literature, does not require actual benchmarks, and provides a reasonable degree of accuracy in ranking HPCs. Like CTP, it produces a peak number which can be thought of as a "not to exceed" value, independent of memory and I/O considerations. The only thing that matters is the computer's ability to produce 64-bit or larger floating-point arithmetic results per unit time. While the formula is new, many of the notes are either unchanged or adapted from CTP to APP. This allows exporters to follow familiar rules in determining APP values and classifying computers. APP separates computers into two broad categories by applying a simple weighting factor equal to 0.9 for vector computers and 0.3 for scalar (non-vector) computers.
RATIONALE FOR CHANGING FROM CTP TO APP

From August 1991 until April 2006 computer exports were evaluated on the basis of their performance, calculated in MTOPS using the CTP formula. This method worked effectively for more than a decade and applied to a broad range of computer products and architectures. The designers of the CTP formula anticipated a period in the evolution of processor and system design in which performance increased dramatically and Moore's Law allowed designs to become quite complex. This period of innovation saw clock frequencies increase by nearly two orders of magnitude and gate counts increase by at least three orders of magnitude. Multi-chip processors became single-chip microprocessors, and then multiple processors appeared on a single die.

As the technology and products evolved, experts agreed that the existing formula no longer correctly rank-ordered real computational value. Specifically, the ratings of systems built from commodity scalar processors were significantly overstated relative to those of true vector supercomputers. In addition, the CTP method had become increasingly difficult to calculate. In some cases its calculation required access to proprietary details of a given computer's design, and even then experts had difficulty agreeing on the correct value. As a result, in 2005 the US and Japan proposed, and the members of the international export control regime, the Wassenaar Arrangement, unanimously accepted, a new metric: Adjusted Peak Performance.

Performance continues to be the defining attribute in selecting some computers for export controls. HPC application performance varies widely from system to system, and some applications are better suited to a particular architecture than others. In some cases compiler efficiency and software tuning can double the performance of an application. No simple metric that does not require benchmarking can be expected to account for these effects, but the objective of APP is to reach a reasonable level of accuracy while maintaining fairness.

Vector systems continue to be the acknowledged leaders in providing the highest efficiency on the broadest range of applications. APP applies a weighting factor of 0.9 to them based on the observed percentage of peak performance achieved on applications of interest.

The next page is taken from the US Export Administration Regulations, Category 4 of the Commerce Control List (Supplement No. 1 to Part 774 of the EAR), and contains the APP formula for calculating the WT values of digital computers. Following that is a Q&A section and, finally, a number of examples of the APP formula calculated solely on the basis of publicly available information for representative computers.

TECHNICAL NOTE ON "ADJUSTED PEAK PERFORMANCE" ("APP")
APP is an adjusted peak rate at which "digital computers" perform 64-bit or larger floating point additions and multiplications.

Abbreviations used in this Technical Note:

  n    number of processors in the "digital computer"
  i    processor number (i, ..., n)
  ti   processor cycle time (ti = 1/Fi)
  Fi   processor frequency
  Ri   peak floating point calculating rate
  Wi   architecture adjustment factor

APP is expressed in Weighted TeraFLOPS (WT), in units of 10^12 adjusted floating point operations per second.

Outline of "APP" calculation method

1. For each processor i, determine the peak number of 64-bit or larger floating-point operations, FPOi, performed per cycle for each processor in the "digital computer".

Note: In determining FPO, include only 64-bit or larger floating point additions and/or multiplications. All floating point operations must be expressed in operations per processor cycle; operations requiring multiple cycles may be expressed in fractional results per cycle. For processors not capable of performing calculations on floating-point operands of 64 bits or more, the effective calculating rate R is zero.

2. Calculate the floating point rate R for each processor: Ri = FPOi/ti.

3. Calculate APP as APP = W1 x R1 + W2 x R2 + ... + Wn x Rn.

4. For "vector processors", Wi = 0.9. For non-"vector processors", Wi = 0.3.

Note 1: For processors that perform compound operations in a cycle, such as an addition and multiplication, each operation is counted.

Note 2: For a pipelined processor the effective calculating rate R is the faster of the pipelined rate, once the pipeline is full, or the non-pipelined rate.

Note 3: The calculating rate R of each contributing processor is to be calculated at its maximum value theoretically possible before the "APP" of the combination is derived. Simultaneous operations are assumed to exist when the computer manufacturer claims concurrent, parallel, or simultaneous operation or execution in a manual or brochure for the computer.

Note 4: Do not include processors that are limited to input/output and peripheral functions (e.g., disk drive, communication and video display) when calculating APP.

Note 5: APP values are not to be calculated for processor combinations (inter)connected by "Local Area Networks", Wide Area Networks, I/O shared connections/devices, I/O controllers and any communication interconnection implemented by "software".

Note 6: APP values must be calculated for 1) processor combinations containing processors specially designed to enhance performance by aggregation, operating simultaneously and sharing memory; or 2) multiple memory/processor combinations operating simultaneously utilizing specially designed hardware.

Note 7: A "vector processor" is defined as a processor with built-in instructions that perform multiple calculations on floating-point vectors (one-dimensional arrays of 64-bit or larger numbers) simultaneously, having at least 2 vector functional units and at least 8 vector registers of at least 64 elements each.

CONSIDERATIONS IN APPLYING THE APP FORMULA

Obviously, the first determination an exporter must make is whether the computer is capable of performing 64-bit or larger floating-point arithmetic. If it is not, the WT value is zero.

A point of clarification has to do with the terminology used in computers: microprocessor, processor, core. Rather than attempting to precisely define the meaning and usage of each of these terms, one should note that in calculating APP it doesn't matter what names are used for the computational facilities. In common usage, "core" refers to the smallest complete computational element in a digital computer that is visible to software. Many microprocessors consist of a single core; newer microprocessors contain two or more cores. For the purpose of this guide, and in order to be internally consistent, the term "processor" will be used to refer to a single core, regardless of how it is packaged or how many are contained on a single semiconductor chip.

Q1: Is there a simple way to express the APP formula?

A1: Yes. For the majority of computer systems today APP is simply the peak double-precision floating-point capacity of a computer (the sum of all processors to be aggregated) multiplied by either 0.9 for vector processors or 0.3 for everything else. The execution rate (in TeraFLOPS) for each processor in the computer is:

Frequency (in GHz) x Number of 64-bit (or larger) floating-point results per cycle x 10^-3
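The A1 rule can be sketched as a short calculation. The function name and its interface below are illustrative conveniences, not part of the regulation:

```python
def app_wt(freq_ghz, fpo_per_cycle, n_processors, vector=False):
    """Adjusted Peak Performance in Weighted TeraFLOPS (WT).

    freq_ghz      -- processor clock frequency in GHz
    fpo_per_cycle -- 64-bit (or larger) floating-point results per cycle
    n_processors  -- processors to aggregate (per Notes 5 and 6)
    vector        -- True only for "vector processors" as defined in Note 7
    """
    w = 0.9 if vector else 0.3           # architecture adjustment factor
    r = freq_ghz * fpo_per_cycle * 1e-3  # per-processor rate in TeraFLOPS
    return w * r * n_processors

# A 2.6 GHz scalar processor producing 2 operations per cycle:
print(app_wt(2.6, 2, 1))  # ≈ 0.00156 WT
```

The same function reproduces the worked examples later in this guide by varying only the frequency, operations per cycle, processor count, and vector flag.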
Q2: How does one account for multi-core microprocessors?

A2: For the purposes of calculating APP, it does not matter whether a computer uses processors comprised of multiple chips, a single chip, or a fraction of a chip. One simply adds up the contributions of all the processors in the computer, independent of how they are packaged, and applies the appropriate weighting factor to arrive at the APP value.

Q3: How does one account for multi-threaded processors?

A3: Multi-threaded processors are treated no differently than single-threaded processors. The processor has a peak floating-point performance based on the number of 64-bit floating-point results per unit time that the execution unit(s) produces. Multi-threading merely allows the processor to achieve a higher percentage of peak performance by increasing the utilization of the floating-point hardware.

Q4: The microprocessors in a specific digital computer contain vector-like multimedia extensions. Does this mean they are to be considered vector processors in computing APP?

A4: Note 7 in the APP method defines the capabilities that are required to be considered a vector processor: built-in vector instructions, at least 2 vector functional units, and at least 8 vector registers of at least 64 elements each. The multimedia extensions currently found in high-volume microprocessors fall short of this mark and therefore would not be considered vector processors.
Q5: How does one rate the APP of a computer which has the ability to increase its performance upon paying the manufacturer an additional fee?

A5: Assuming that only the manufacturer or its authorized agent can increase the performance of the computer, the APP is evaluated on an as-shipped basis, not on the optional ability to go faster when enabled by the manufacturer. The system has a specific WT value when initially exported to the customer. If at some later date the performance is boosted, the WT would be recalculated based on the additional capability.

Q6: Some computers are made from a larger number of processors than the advertised count. Should these additional processors be counted in determining the APP?

A6: APP rates the performance of a computer based on the hardware facilities which are not reserved for spares (in the event of a failure) or support functions such as the service processor or I/O processors. If some processors are "hidden" from the programmer and therefore unable to contribute floating-point results to the application, then they are not counted in APP.

Q7: When should the performance of nodes in a cluster be aggregated in determining the APP?

A7: Consistent with past practice in applying CTP, APP Note 5 is interpreted to mean that common, industry-standard I/O-attached networks such as Ethernet, Fibre Channel, InfiniBand, Myrinet, PCI, and RapidIO do not require aggregation. The APP of an Ethernet-based cluster is the APP of a single (the fastest) node. Note 6 is interpreted to mean that clusters based on proprietary, high-performance networks such as SGI's NUMAlink and IBM's HPS do require aggregation of all nodes in calculating APP.

Q8: How is the floating-point calculating rate determined?

A8: What matters is the effective (i.e., visible to the programmer) rate at which results are produced. Intermediate results, speculative execution results, and other transitory effects are not counted towards APP. For example, some architectures are described as performing three simultaneous floating-point instructions per cycle. But if one of them is either a Load or a Store, it doesn't count as an arithmetic operation. Such an architecture would actually produce two floating-point results per cycle, and the APP would be based on this rate.

Q9: How are non-homogeneous computers evaluated with respect to APP?

A9: For systems comprising both scalar and vector floating-point capabilities, the weighted contributions of each should be added to determine the APP. In cases where they cannot be effectively mixed, the APP is the greater of the scalar or vector performance.

Q10: What about computers with different modes?

A10: When a processor or computer is capable of operating in different modes, the APP is computed on the basis of the mode which produces the highest WT number.
Q11: How should APP be calculated for "reconfigurable" computers, typically those based on Field Programmable Gate Arrays (FPGAs) or incorporating FPGAs along with conventional processors?

A11: FPGA-based computers represent an unusual case where proprietary vendor information may be required to establish an APP value. At present some FPGAs contain embedded microprocessors, but these do not have the ability to perform double-precision floating-point arithmetic (and are typically used for housekeeping and other non-performance-critical functions). Instead, in order to have a non-zero WT, FPGAs must implement floating-point logic in their cells, slices, or logic blocks. In some cases the manufacturer may supply libraries to program the FPGAs, and in other cases the FPGA functionality is determined by the user. In any event, the WT is likely to be based on the manufacturer's experience in compiling for the FPGA target.

EXAMPLES

Commodity Cluster

The most common commodity clusters are made from 1U rack-mount, dual-socket servers interconnected via an inexpensive network such as Gigabit Ethernet. The example depicted in Figure 1 uses dual-core AMD Opteron™ microprocessors, such that each node is a 4-way SMP server made from two microprocessor chips. (Note: Each "core" is a processor; "dual core" implies two processors per microprocessor chip.) Each processor (core) has a pipelined floating-point unit capable of executing one double-precision (64-bit) fused Multiply-Add instruction per cycle. Thus, the Opteron processors achieve a rate of 2 operations per cycle per processor. The processors are not "vector processors" as defined in Note 7. This is a homogeneous system: all the processors in the system are of the same type and performance (thus, they all have the same architecture adjustment factor, W, of 0.3). By application of Note 5, the performance of only the four processors within a single node is aggregated, and therefore the APP of the entire cluster is independent of the number of nodes.

Figure 1. Simple Commodity Cluster (4-way Nodes 0 through n connected by an Ethernet switch)
Basic data:

Processor frequency (clock speed) F = 2.6 GHz
Processor cycle time (1/F) t = 384.615 ps
Floating-point operations per cycle FPO = 2
Architecture adjustment factor W = 0.3

Calculations:

Floating-point rate (for a single processor) R = 2/384.615 = 0.0052 TF
Alternatively (F_GHz * FPO * 10^-3) R = 2.6 * 2 * 10^-3 = 0.0052 TF

APP (for a single processor) = 0.3 * 0.0052 = 0.00156 WT
APP (for a 4-processor node) = 0.3 * 0.0052 * 4 = 0.00624 WT
APP (any number of nodes) = 0.3 * 0.0052 * 4 = 0.00624 WT
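As a cross-check, the commodity-cluster figures can be reproduced with a few lines of arithmetic (the variable names are ours, not the guide's):

```python
# Per-processor rate: 2.6 GHz Opteron core, one fused multiply-add
# (2 counted operations) per cycle.
r = 2.6 * 2 * 1e-3     # 0.0052 TF per processor
w = 0.3                # non-vector architecture adjustment factor

app_processor = w * r  # single processor
app_node = w * r * 4   # one 4-way node

# Note 5: Ethernet does not trigger aggregation, so the cluster's APP
# equals the APP of a single node regardless of the node count.
app_cluster = app_node
print(round(app_processor, 5), round(app_node, 5))  # ≈ 0.00156 0.00624
```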

Cray XT3

The Cray XT3 is a massively parallel processor (MPP) system consisting of hundreds to many thousands of commodity AMD Opteron single-core microprocessors connected by a proprietary 3D torus network. Each processor has a pipelined floating-point unit capable of executing one double-precision (64-bit) fused Multiply-Add instruction per cycle. Thus, the Opterons achieve a rate of 2 operations per cycle per processor. The processors are not "vector processors" as defined in Note 7. The XT3 is a homogeneous system: all the processors in the system are of the same type and performance (thus, they all have the same architecture adjustment factor, W). By application of Note 6, the performance of all the processors is aggregated on the basis of the specially-designed MPP network.

Figure 2 shows a single processing element with one Opteron processor connected to the chip that performs the 3D Torus interconnection via 6 high-speed links. Figure 3 illustrates how a small, 3x3x3-element Torus is arranged: each node communicates directly with its 6 nearest neighbors. Sample calculations are given for large systems employing 1024 and 4096 processing elements.
Figure 2. XT3 Processing Element (Opteron processor, memory, and network hub with 6 interconnect links)

Figure 3. 3x3x3 Torus
Basic data:

Processor frequency (clock speed) F = 2.6 GHz
Processor cycle time (1/F) t = 384.615 ps
Floating-point operations per cycle FPO = 2
Architecture adjustment factor W = 0.3

Calculations:

Floating-point rate (for a single processor) R = 2/384.615 = 0.0052 TF
Alternatively (F_GHz * FPO * 10^-3) R = 2.6 * 2 * 10^-3 = 0.0052 TF

APP (for a single processor) = 0.3 * 0.0052 = 0.00156 WT
APP (for 1024 processors) = 0.3 * 0.0052 * 1024 = 1.59744 WT
APP (for 4096 processors) = 0.3 * 0.0052 * 4096 = 6.38976 WT
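The XT3 uses the same Opteron core as the commodity cluster; the difference is entirely in the interconnect rule. A sketch of the Note 5 versus Note 6 contrast (variable names are ours):

```python
r = 2.6 * 2 * 1e-3   # Opteron rate, TF per processor
w = 0.3              # non-vector weighting factor

# Commodity cluster on Ethernet (Note 5): only one 4-way node counts.
app_ethernet_cluster = w * r * 4

# Cray XT3 on its proprietary 3D torus (Note 6): every processor counts.
app_xt3_1024 = w * r * 1024
app_xt3_4096 = w * r * 4096
print(app_ethernet_cluster, app_xt3_1024, app_xt3_4096)
```

The same silicon thus rates 0.00624 WT in an Ethernet cluster of any size but 6.38976 WT in a 4096-processor XT3.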

Cray X1E

The Cray X1E is the latest in a long line of Cray vector supercomputers. From an entry-level 16-processor configuration the architecture scales all the way to 8192 processors. Each MSP processor has 8 vector floating-point pipelines and delivers 18 GFLOPS (listed on the manufacturer's data sheet) at a clock frequency of 1.13 GHz. A vector pipe executes one multiply and one add per cycle. Thus, the X1E achieves a rate of 16 operations per cycle per MSP processor. The processors do meet the definition of a "vector processor" in Note 7, as they have the necessary vector instruction set, 8 pipes, and 128 vector registers of 64 words apiece. The X1E is a homogeneous system: all the processors in the system are of the same type and performance (thus, they all have the same architecture adjustment factor, W = 0.9). By application of Note 6, the performance of all the processors is aggregated on the basis of a specially-designed network (a set of parallel 2D Torus interconnects). Examples are given for a 16-processor (single-node) system and a 1024-processor (64-node) system, counting only the contribution of the vector units.
Basic data:

Processor frequency (clock speed) F = 1.13 GHz
Processor cycle time (1/F) t = 885 ps
Floating-point operations per cycle FPO = 16
Architecture adjustment factor W = 0.9

Calculations:

Floating-point rate (for a single processor) R = 16/885 = 0.018 TF
Alternatively (F_GHz * FPO * 10^-3) R = 1.13 * 16 * 10^-3 = 0.018 TF

APP (for a single processor) = 0.9 * 0.018 = 0.0163 WT
APP (for 16 processors) = 0.9 * 0.018 * 16 = 0.26 WT
APP (for 1024 processors) = 0.9 * 0.018 * 1024 = 16.6 WT
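The X1E is the only machine among these examples that qualifies for the 0.9 vector weighting. A sketch of the arithmetic (variable names are ours):

```python
# 8 vector pipes x (1 multiply + 1 add) per cycle = 16 ops/cycle.
r = 1.13 * 16 * 1e-3  # 0.01808 TF per MSP processor
w = 0.9               # "vector processor" weighting (Note 7 is met)

app_16 = w * r * 16       # one 16-processor node
app_1024 = w * r * 1024   # 64 nodes; 16.662528 WT at full precision
# (the table's 16.6 WT comes from rounding R to 0.018 TF first)
print(app_16, app_1024)
```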
Dell™ PowerEdge 1855

The Dell™ PowerEdge™ 1855 features up to 10 blade servers in a 7U enclosure. Individual blades utilize single- and dual-core Intel® Xeon™ microprocessors with two sockets per blade. Blades are connected together using InfiniBand and 1000BaseT Ethernet links. Each processor (core) has a pipelined floating-point unit capable of executing two double-precision (64-bit) multiply or add instructions per cycle. The processors are not "vector processors" as defined in Note 7. Processor cores support multi-threading, but as noted in Q3, this does not affect the calculation of APP. By application of Note 5, performance is not aggregated beyond the two or four processors contained on a single blade. Single-core microprocessors operate at up to 3.8 GHz; dual-core microprocessors operate at 2.8 GHz.

Basic data (single processor core per socket):

Processor frequency (clock speed) F = 3.8 GHz
Processor cycle time (1/F) t = 263 ps
Floating-point operations per cycle FPO = 2
Architecture adjustment factor W = 0.3

Calculations:

Floating-point rate (for a single processor) R = 2/263 = 0.0076 TF
Alternatively (F_GHz * FPO * 10^-3) R = 3.8 * 2 * 10^-3 = 0.0076 TF

APP (for a single processor) = 0.3 * 0.0076 = 0.00228 WT

Basic data (two processor cores per socket):

Processor frequency (clock speed) F = 2.8 GHz
Processor cycle time (1/F) t = 357 ps
Floating-point operations per cycle FPO = 2
Architecture adjustment factor W = 0.3

Calculations:

Floating-point rate (for a single processor) R = 2/357 = 0.0056 TF
Alternatively (F_GHz * FPO * 10^-3) R = 2.8 * 2 * 10^-3 = 0.0056 TF

APP (for a single processor) = 0.3 * 0.0056 = 0.00168 WT

HP 9000 Superdome & Integrity Superdome

The HP 9000 Superdome is a large non-uniform memory access (NUMA) system consisting of 4 to 128 HP PA8900 processors connected by a proprietary internal crossbar switch. (While not important for this calculation, two processors are packaged on a common PA8900 silicon die.) Each processor is a conventional PA8900 RISC microprocessor with two pipelined floating-point units. Each floating-point unit is capable of executing one double-precision (64-bit) fused Multiply-Add instruction per cycle. Thus, two operations per cycle times two floating-point units per processor yield a rate of 4 operations per cycle per processor. The processors are not "vector processors" as defined in Note 7. The HP 9000 Superdome is a homogeneous system: all the processors in the system are of the same type and performance (thus, they all have the same architecture adjustment factor, W). By application of Note 6, the performance of all the processors is aggregated on the basis of the specially-designed NUMA network. The faster of two available processor options is shown below, with examples of a 4-processor and a 128-processor configuration.

Basic data:

Processor frequency (clock speed) F = 1.1 GHz
Processor cycle time (1/F) t = 909 ps
Floating-point operations per cycle FPO = 4
Architecture adjustment factor W = 0.3

Calculations:

Floating-point rate (for a single processor) R = 4/909 = 0.0044 TF
Alternatively (F_GHz * FPO * 10^-3) R = 1.1 * 4 * 10^-3 = 0.0044 TF

APP (for a single processor) = 0.3 * 0.0044 = 0.00132 WT
APP (for 4 processors) = 0.3 * 0.0044 * 4 = 0.00528 WT
APP (for 128 processors) = 0.3 * 0.0044 * 128 = 0.1690 WT
HP Integrity Superdome (homogeneous and mixed processors)

The HP Integrity Superdome is a large non-uniform memory access (NUMA) system consisting of 2 to 128 Intel Itanium 2 processors connected by a proprietary internal crossbar switch. Each processor has two pipelined floating-point units, each capable of executing one double-precision (64-bit) fused Multiply-Add instruction per cycle. Thus, the Itanium 2 processors achieve a rate of 4 operations per cycle per processor. The processors are not "vector processors" as defined in Note 7. In the HP Integrity Superdome all the processors have the same architecture adjustment factor, W. By application of Note 6, the performance of all the processors is aggregated on the basis of the specially-designed NUMA network. The faster of two available processor options is shown below, with examples of a 4-processor and a 128-processor configuration.

The HP Integrity can have a homogeneous or mixed processor architecture configuration. For example, you can have a 64-processor system with a total of 64 Itanium 2 processors at 1.6 GHz. Or you can mix the processors and have, for example, 32 Itanium 2 processors at 1.6 GHz plus 32 PA8900 RISC processors at 1.1 GHz.
Basic data for Itanium 2 portion:

Processor frequency (clock speed) F = 1.6 GHz
Processor cycle time (1/F) t = 625 ps
Floating-point operations per cycle FPO = 4
Architecture adjustment factor W = 0.3

Basic data for PA8900 portion:

Processor frequency (clock speed) F = 1.1 GHz
Processor cycle time (1/F) t = 909 ps
Floating-point operations per cycle FPO = 4
Architecture adjustment factor W = 0.3

Calculations:

Itanium 2 calculation:

Floating-point rate (for a single processor) R = 4/625 = 0.0064 TF
Alternatively (F_GHz * FPO * 10^-3) R = 1.6 * 4 * 10^-3 = 0.0064 TF

APP (for 32 processors) = 0.3 * 0.0064 * 32 = 0.0614 WT

PA8900 RISC calculation:

Floating-point rate (for a single processor) R = 4/909 = 0.0044 TF
Alternatively (F_GHz * FPO * 10^-3) R = 1.1 * 4 * 10^-3 = 0.0044 TF

APP (for 32 processors) = 0.3 * 0.0044 * 32 = 0.0422 WT

Mixed (combining Itanium 2 and PA8900) calculation:

Total APP = (0.3 * 0.0064 * 32) + (0.3 * 0.0044 * 32)
          = 0.0614 + 0.0422 = 0.1036 WT
(32 Itanium 2 processors @ 1.6 GHz) + (32 PA8900 processors @ 1.1 GHz)
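Per Q9, a mixed-architecture system sums the weighted contribution of each processor type. A sketch of that sum (variable names are ours):

```python
# Weighted contribution of each processor type, then the sum (Q9/A9).
itanium2 = 0.3 * (1.6 * 4 * 1e-3) * 32  # 0.06144 WT
pa8900   = 0.3 * (1.1 * 4 * 1e-3) * 32  # 0.04224 WT
total = itanium2 + pa8900               # 0.10368 WT at full precision
# (the worked example rounds each term first: 0.0614 + 0.0422 = 0.1036 WT)
print(total)
```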
HP zx6000

The HP zx6000 is a workstation/cluster computer utilizing the Intel® Itanium® 2 processor. Two-processor nodes are connected together using Myrinet 2000 links. Each processor has two pipelined floating-point units, each capable of executing one double-precision (64-bit) fused Multiply-Add instruction per cycle. Thus, the Itanium 2 processors achieve a rate of 4 operations per cycle per processor. The processors are not "vector processors" as defined in Note 7. By application of Note 5, performance is not aggregated beyond the two processors contained in a single node.

Basic data:

Processor frequency (clock speed) F = 1.5 GHz
Processor cycle time (1/F) t = 667 ps
Floating-point operations per cycle FPO = 4
Architecture adjustment factor W = 0.3

Calculations:

Floating-point rate (for a single processor) R = 4/667 = 0.006 TF
Alternatively (F_GHz * FPO * 10^-3) R = 1.5 * 4 * 10^-3 = 0.006 TF

APP (for a single processor) = 0.3 * 0.006 = 0.0018 WT
APP (for a 2-processor node, and for the cluster as a whole) = 0.3 * 0.006 * 2 = 0.0036 WT
IBM BlueGene/L

BlueGene/L is a massively parallel processor (MPP) computer consisting of many thousands of processors connected by a proprietary 3D torus network. Each processor is a conventional PowerPC 440 RISC microprocessor core with two pipelined floating-point units. (While not important for this calculation, two processors (cores) are packaged on a common silicon die along with the network hub, as shown in Figure 4.) Each floating-point unit is capable of executing one double-precision (64-bit) fused Multiply-Add instruction per cycle. Thus, two operations per cycle times two floating-point units per processor yield a rate of 4 operations per cycle per processor. The processors are not "vector processors" as defined in Note 7. BlueGene/L is a homogeneous system: all the processors in the system are of the same type and performance (thus, they all have the same architecture adjustment factor, W). By application of Note 6, the performance of all the processors is aggregated on the basis of the specially-designed MPP network. Examples are given for 1024 and 4096 processor systems.

Figure 4. BlueGene/L node (two PPC440 cores with L2/L3 caches, memory, and network hub with interconnect links)

Basic data:

Processor frequency (clock speed) F = 700 MHz
Processor cycle time (1/F) t = 1,428.571 ps
Floating-point operations per cycle FPO = 4
Architecture adjustment factor W = 0.3

Calculations:

Floating-point rate (for a single processor) R = 4/1428.571 = 0.0028 TF
Alternatively (F_GHz * FPO * 10^-3) R = 0.7 * 4 * 10^-3 = 0.0028 TF

APP (for a single processor) = 0.3 * 0.0028 = 0.00084 WT
APP (for 1024 processors) = 0.3 * 0.0028 * 1024 = 0.86016 WT
APP (for 4096 processors) = 0.3 * 0.0028 * 4096 = 3.44064 WT
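BlueGene/L's clock is quoted in MHz, so the only extra step in the arithmetic is the unit conversion. A sketch (variable names are ours):

```python
f_ghz = 700 / 1000   # 700 MHz expressed in GHz
r = f_ghz * 4 * 1e-3 # 0.0028 TF per PowerPC 440 core
w = 0.3              # non-vector weighting factor

app_1024 = w * r * 1024  # 0.86016 WT
app_4096 = w * r * 4096  # 3.44064 WT
print(app_1024, app_4096)
```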
