global data referencing is taking place, although performance tails off here as well. Unfortunately, the amount of testing done using the old operating system was limited, so additional results could not be obtained. It is clear from these results, though, that the operating system in a shared memory multiprocessor has a significant impact on overall performance. We feel confident that the latest version of the GP1000 operating system is better geared to the current machine and does indeed provide exceptional performance.

[Figure 6.13: Comparison of old OS vs. new OS for mountain image, rectangular region UD. Time in seconds vs. number of processors (12-96).]

[Figure 6.14: Comparison of old OS vs. new OS for mountain image, rectangular region LC. Time in seconds vs. number of processors (12-96).]

6.2.2. Comparison of Architectural Differences

In addition to the impact of the operating system, other factors can affect overall algorithmic performance. For instance, one would like to compare what would happen if a faster CPU or a faster switch node were employed in the machine. BBN has continually updated the Butterfly family of machines, from the Butterfly 1, which used MC68000 processors with 1 megabyte of memory per board, to the current generation GP1000, which uses the MC68020 with 4 megabytes of memory per board. We were not able to test the algorithms on the original Butterfly, but we were able to test them on the next generation BBN multiprocessor, the TC2000. The TC2000 is similar in design to the GP1000, but there are significant differences. Table 6.1 shows the difference in processor characteristics, while table 6.2 shows the difference in memory characteristics for the GP1000 and TC2000. In general, the primary differences between the two machines are the faster CPU in the TC2000 and a change in the basic switch node component from a 4 x 4 crossbar to an 8 x 8 crossbar.

Table 6.1: Comparison of BBN multiprocessor CPU characteristics

    Machine   CPU       Clock Speed   MIPS   MFLOPS
    GP1000    MC68020   16 MHz        2.5    0.6
    TC2000    MC88100   20 MHz        19     20

The faster CPU in the TC2000 necessitates a faster switch with increased path width, and an 8 x 8 crossbar switch component solves this problem. One impact of the increased size of the crossbar switch is that fewer wires are needed between the switch columns in the interconnection network. The 8 x 8 crossbar is more costly to produce than the 4 x 4, but it allows eight simultaneous messages to be output, whereas a 4 x 4 supports only four at a time.

Table 6.2: Comparison of BBN multiprocessor memory characteristics

    Machine   Cache   Memory per Board   Switch Speed   Path Width   Basic Switch Node
    GP1000    no      4 MB               8 MHz          4 bit        4 x 4
    TC2000    yes     4 or 16 MB         38 MHz         8 bit        8 x 8

From the programmer's point of view, the TC2000 is functionally the same as the GP1000. There are several small differences regarding communication, however. The GP1000 supports the block transfer mechanism in hardware, whereby a path is held open long enough for 256-byte messages to go from one board to another. In the TC2000, this operation is supported through software emulation rather than hardware implementation. The TC2000 does contain a memory cache, which allows data to be allocated as cachable or non-cachable. Although using the cache significantly enhances performance, judicious management of this memory is required by the programmer, since no cache coherence scheme is supported. The primary goal here is to compare the different algorithms under different CPU and switch characteristics, so the algorithms were not modified to take advantage of the cache.

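To make the block transfer mechanism concrete, the following C fragment sketches the kind of copy loop that the LC scheme depends on; the names (fetch_block, remote_base, local_buf) are ours for illustration and are not taken from the BBN libraries. On the GP1000, a 256-byte transfer of this sort is carried out by the switch hardware over a single held-open path; on the TC2000, the equivalent copy is performed in software, so each word may contend for a switch path separately.

    #include <stddef.h>
    #include <stdint.h>

    #define BLOCK_BYTES 256  /* GP1000 hardware block-transfer size */

    /* Copy one block of shared scene data from a remote memory board
     * into this processor's local memory (the LC referencing scheme).
     * On the TC2000 this loop is what the software emulation amounts
     * to: one remote reference per word instead of one held-open
     * path per block. */
    static void fetch_block(const volatile uint32_t *remote_base,
                            uint32_t *local_buf)
    {
        for (size_t i = 0; i < BLOCK_BYTES / sizeof(uint32_t); i++)
            local_buf[i] = remote_base[i];  /* remote read */
    }
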
The results, including times for the setup phase from the front end plus the tiling time, are shown in figures 6.15, 6.16, 6.17, and 6.18. A thorough analysis of the scan line algorithm was deemed unnecessary on the TC2000 due to the performance limitations noticed on the GP1000. It is, however, included for comparison purposes in the next section of this chapter.

These graphs indicate similar performance in the algorithms when compared to the previous graphs for the GP1000. The only problem with this comparison is that the results on the TC2000 were limited for most of the tests to a maximum of 48 processors, while with the GP1000, 96 processors were consistently available.2

2. We have included some data obtained on the TC2000 at 96 processors in table 6.3. In general, though, due to the other users on the machine, only 48 processors were used for most of the tests.

TC2000 Tiling + Setup Time Comparisons

[Figure 6.15: TC2000 algorithm comparison, stegosaurus image. Tiling plus setup time vs. number of processors (16-48) for the rectangular region (LC, UD), task adaptive (LC, UD), and top-down algorithms.]

[Figure 6.17: TC2000 algorithm comparison, tree image. Same axes and algorithms as figure 6.15.]

[Figure 6.16: TC2000 algorithm comparison, Laser image. Same axes and algorithms as figure 6.15.]

[Figure 6.18: TC2000 algorithm comparison, mountain image. Same axes and algorithms as figure 6.15.]

In order to allow a fair comparison between the two machines, the speedup was computed for each of the algorithms at 96 processors. The task adaptive algorithm is used for this comparison, and the results are shown in table 6.3.

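As a reminder of how these quantities are defined, the C sketch below computes speedup and efficiency from measured wall-clock times; the timing values in main are hypothetical placeholders, not measurements from this chapter.

    #include <stdio.h>

    /* Speedup and efficiency as used throughout this chapter:
     *   speedup    S(P) = T(1) / T(P)
     *   efficiency E(P) = S(P) / P
     * The last row of table 6.3 is simply T_GP1000(96) / T_TC2000(96). */
    static double speedup(double t1, double tp)  { return t1 / tp; }
    static double efficiency(double t1, double tp, int p)
    {
        return speedup(t1, tp) / (double)p;
    }

    int main(void)
    {
        double t1 = 290.0, t96 = 4.1;  /* hypothetical times, seconds */
        printf("S = %.1f  E = %.2f\n",
               speedup(t1, t96), efficiency(t1, t96, 96));
        return 0;
    }
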
Table 6.3: GP1000 and TC2000 speedup and time ratio comparison using 96 processors

    Machine                      Stegosaurus   Laser   Tree   Mountain
    GP1000 Speedup               70.3          73.3    68.1   82.1
    TC2000 Speedup               61.3          59.3    56.4   70.6
    Ratio of Execution Times
    at P = 96: GP1000/TC2000     8.6           8.7     8.0    8.8

As can be seen from the table, the TC2000 exhibits slightly reduced speedup when compared to the GP1000 on 96 processors for most of the images. This could be caused by a number of factors, ranging from the amount of work per task to the processor-to-switch speed ratio. The last row in the table gives the ratio of the parallel execution time on the GP1000 divided by that on the TC2000. From this data, it appears that on 96 processors, the TC2000 is approximately 8.5 times faster than the GP1000 for this problem.

6.2.3. Relationship of Machine Parameters to Performance

In this section, we evaluate the various overheads on both machines to see how they differ. The comparison involves examining the total processor-time space and comparing the results on the two machines. Here, the overheads are evaluated with respect to P, and comparison values are shown to the right of each graph for the overhead percentages at 48 processors. Also, the speedup is given at each processor configuration. All of the algorithms are compared on the GP1000 and the TC2000 for the Laser image as a representative example. Due to the volume of data and the CPU time involved in the tests, only one image was used for comparison. Different results would be obtained for the different test images, but the main interest here was to evaluate the trend in performance and to directly compare the percentages on various processor configurations.

6.2.3.1. Comparison of Overheads

Figures 6.19 through 6.28 provide a direct comparison of the overhead factors for all of the algorithms. The graphs include the total processor-time space for each particular processor configuration, with the overheads clearly marked as percentages. Although the results were measured up to 96 processors for the GP1000, the overhead values given beside each graph are for 48 processors, so that they can be compared directly to the values for the TC2000.

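The overhead percentages quoted beside the graphs can be read as fractions of the total processor-time space P x T(P). The following C sketch shows that accounting; the structure and names are illustrative stand-ins for the instrumentation in the actual system, and the numbers in main are hypothetical.

    #include <stdio.h>

    /* Decompose the processor-time space P*T(P) into useful work plus
     * the measured overhead categories, reporting each category as a
     * percentage of the whole. All inputs are in processor-seconds
     * except t_par, which is the parallel wall-clock time. */
    struct overheads {
        double latency, code_mod, contention, load_imbal, comm, synch;
    };

    static void report(int p, double t_par, double t_useful,
                       const struct overheads *ov)
    {
        double space = (double)p * t_par;   /* total processor-time */
        printf("useful work     %5.1f%%\n", 100.0 * t_useful / space);
        printf("latency         %5.1f%%\n", 100.0 * ov->latency / space);
        printf("code mod.       %5.1f%%\n", 100.0 * ov->code_mod / space);
        printf("contention      %5.1f%%\n", 100.0 * ov->contention / space);
        printf("load imbalance  %5.1f%%\n", 100.0 * ov->load_imbal / space);
        printf("comm. + synch.  %5.1f%%\n",
               100.0 * (ov->comm + ov->synch) / space);
    }

    int main(void)
    {
        /* Hypothetical 48-processor measurement, not from the book. */
        struct overheads ov = { 3.1, 4.4, 6.9, 2.2, 0.4, 0.1 };
        report(48, 1.44, 52.0, &ov);
        return 0;
    }
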
6.2.3.2. Analysis

These results present a number of interesting phenomena not noticed in any previous graphs. In the parallel scan line algorithm, the latency and code modification overheads constitute almost the same overhead percentage regardless of the processor configuration on both machines. This makes sense, since the number of tasks in this algorithm is constant regardless of the number of processors. In the other cases, since the total number of tasks increases with the number of processors, the overhead effects increase as well. In some cases, the load imbalance may go down at some point, but this may be due to an increase in another factor, as explained next.

With the exception of the task adaptive algorithm, the load balancing is better on the TC2000 than on the GP1000. On the other hand, the network contention, code modification, and latency/communication overheads are significantly worse. It seems that the increased delays due to communication overheads and contention help to even out the load in the algorithms on the TC2000 (recall that load balancing cannot be measured independently from other factors). Since these overheads are larger on the TC2000 than on the GP1000, they contribute to an increase in the average task execution time. This changes the load balancing, since it is based on dynamic scheduling of the tasks as well as their execution time.

In the case of the task adaptive algorithm, the load balancing is a direct result of dynamic task partitioning, and it is possible that the tasks cannot be partitioned near the end of the computation due to the imposed threshold. This effect may be more pronounced on the TC2000 than on the GP1000 due to the difference in the synchronization and communication mechanisms.

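A sketch of the partitioning step being described: when an idle processor requests work, a busy processor splits off part of its remaining region only if the work left is above the threshold. The region representation and the area-based threshold test below are illustrative stand-ins for the actual task adaptive implementation; near the end of the computation the test fails, and the refused requests appear as load imbalance.

    #include <stdbool.h>

    struct region { int x0, y0, x1, y1; };  /* unfinished screen area */

    /* Split the donor's remaining region along its wider axis, giving
     * half to the requesting processor. Refuse if the remaining work
     * is below the imposed threshold: the region is no longer worth
     * the cost of repartitioning, so the requester stays idle. */
    static bool try_split(struct region *donor, struct region *given,
                          int threshold)
    {
        int w = donor->x1 - donor->x0;
        int h = donor->y1 - donor->y0;
        if (w * h < threshold)
            return false;               /* too little work left */
        *given = *donor;
        if (w >= h) {
            int mid = donor->x0 + w / 2;
            given->x0 = mid;            /* requester takes one half */
            donor->x1 = mid;
        } else {
            int mid = donor->y0 + h / 2;
            given->y0 = mid;
            donor->y1 = mid;
        }
        return true;
    }
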
Comparison of Overhead Factors, GP1000 vs. TC2000, Laser Image, Scan Line Algorithm

[Figure 6.19: GP1000, scan line algorithm, UD, overhead comparison. Cumulative time vs. number of processors (8-96), broken into sequential, load imbalance, contention, code modification, and latency components; at 48 processors: latency 4.6%, code modification 7.5%, contention 10.0%, load imbalance 6.3%.]

[Figure 6.20: TC2000, scan line algorithm, UD, overhead comparison. At 48 processors: latency 9.0%, code modification 13.0%, contention 20.5%.]

Comparison of Overhead Factors, GP1000 vs. TC2000, Laser Image, Rectangular Region Algorithm, UD Scheme

[Figure 6.21: GP1000, rectangular region algorithm, UD, overhead comparison. Cumulative time vs. number of processors (8-96).]

[Figure 6.22: TC2000, rectangular region algorithm, UD, overhead comparison. At 48 processors: code modification 12.1%, latency 5.7%, contention 11.9%, load imbalance 6.6%.]

Comparison of Overhead Factors, GP1000 vs. TC2000, Laser Image, Rectangular Region Algorithm, LC Scheme

[Figure 6.23: GP1000, rectangular region algorithm, LC, overhead comparison. At 48 processors: code modification 4.4%, communication 0.04%, contention 1.2%, load imbalance 7.9%.]

[Figure 6.24: TC2000, rectangular region algorithm, LC, overhead comparison. At 48 processors: code modification 2.2%, communication 0.28%, contention 5.7%, load imbalance 6.5%.]

Comparison of Overhead Factors, GP1000 vs. TC2000, Laser Image, Top-Down Algorithm

[Figure 6.25: GP1000, top-down algorithm, LC, overhead comparison. At 48 processors: code modification 3.5%, communication 0.03%, contention 4.8%.]

[Figure 6.26: TC2000, top-down algorithm, LC, overhead comparison. At 48 processors: code modification 2.3%, communication 0.35%, contention 14.5%, load imbalance 3.0%.]

Comparison of Overhead Factors, GP1000 vs. TC2000, Laser Image, Task Adaptive Algorithm

[Figure 6.27: GP1000, task adaptive algorithm, LC, overhead comparison. At 48 processors: synchronization 0.23%, code modification 1.6%, communication 0.14%, contention 3.0%, load imbalance 6.0%.]

[Figure 6.28: TC2000, task adaptive algorithm, LC, overhead comparison. At 48 processors: synchronization 1.8%, code modification 1.3%, communication 1.0%, contention 11.3%, load imbalance 8.5%.]

The general network contention increase on the TC2000 over the GP1000 can probably be attributed to these factors:

1. The increase in switch speed from the GP1000 to the TC2000 does not match the corresponding increase in processor speed; therefore, more collisions in the network switch are likely to occur.
2. The hardware support for block transfers in the GP1000 is not available in the TC2000 (this was used in the LC scheme).

It seems likely that the contention increase is caused by a combination of both of these factors, especially for the LC scheme, which uses the block transfer mechanism extensively. Although the TC2000 contains a cache as well as support for fine grained interleaved shared memory, neither of these characteristics is used in the implementation of the algorithms. The cache does not affect performance in the LC schemes, however, since the remote data is copied to local memory, where it is then cached as local data. In addition, the fine grain interleaving is not needed and would not provide better performance anyway, since the shared data is already scattered among the memory modules. In the next section, the effect of enhancing the characteristics of computer graphics scenes is analyzed to determine the overall performance differential.

6.3. Scene Characteristics

One of the reasons different images are used for performance comparisons throughout this book is that it is desirable to be able to generalize these results to all computer generated scenes. Of course, this is an impossible task, since there are always pathological cases one cannot predict. In the experiments, four scenes were used which have different characteristics in screen area projection, number of data elements, and depth complexity. In this section, these same four scenes are analyzed, along with several new ones which have added scene complexity in one form or another. These break down into two categories: image complexity and object complexity.

In his thesis, Whelan analyzes several scenes which vary in complexity in terms of both of the above categories. His conclusions merely represent what most researchers intuitively realize, but they are worth repeating here:

1. Scenes are not usually composed of uniformly distributed polygons.
2. As the number of polygons increases, their size generally decreases. Some scenes may have a few large polygons which take up a significant portion of the drawn area on the screen.
3. Most polygons have few edges.
4. The aspect ratio of polygons is non-uniform, although some scenes seem to be oriented towards a particular direction.
5. The depth complexity of most pixels is fairly small (less than six), although some scenes obviously violate this rule.

It is not necessary to repeat Whelan's analysis for the scenes used here, since it does not categorize scenes in terms of their difficulty in the various rendering stages. His analysis does point out that non-uniformity in scenes is the norm, so a parallel graphics algorithm must take this into account and perform well under various circumstances. The algorithms presented in this book are general purpose and are designed to handle various input scenes rather than a specific type. One cannot categorically draw a relationship between a given algorithm and, say, the depth complexity of an image. Even if this were the case, would that information provide anything useful to the user community? In general, the algorithm should perform well on all imagery and should have the capability of handling pathological cases with some efficiency. This is more useful than algorithm analysis based upon depth complexity, polygon area coverage, or some other factor.

In the following sections, the performance of the different algorithms is compared under an increase in image complexity in the first subsection and an increase in object complexity in the second.

6.3.1. Image Complexity

Image complexity refers to the addition of features to a scene to produce a higher quality image. Such additional features can include rendering at higher resolution, advanced anti-aliasing, texture mapping, shadow generation, and bump mapping, to name a few. Texturing, shadowing, and bump mapping have not been implemented here, since these features require careful planning in order to be implemented efficiently in parallel. This is primarily due to the additional memory required for each of these features, plus the desire to avoid contention for this memory. Anti-aliasing has already been incorporated into this algorithm, and all of the data presented so far includes this feature. Therefore, increasing spatial resolution is concentrated on here, and a comparison of the other features is left for future work.

All of the previous scenes are re-tested at double resolution (1280 x 968) to evaluate the performance of the different algorithms under these conditions. The first set of graphs involves a comparison similar to the others in this chapter, in which the setup phase as well as the tiling time is taken into account. The speedup and efficiency are given in the second set of graphs.

As a representative example of the times for the high resolution computations, the results for the mountain image are shown in figures 6.29 and 6.30; the graphs for all the images are included in the appendix in figures A21 through A28. These graphs are zoomed in to show more detail at the higher processor counts. The speedup for the mountain image on the GP1000 using the task adaptive algorithm is shown in figure 6.31, and the efficiency for this image in figure 6.32. Since communication is the same as before but there is an increase in work due to the increased resolution, the algorithms are more efficient. Table 6.4 shows the speedup and efficiency for each of the images when calculated at high resolution on the GP1000, versus normal resolution, using the task adaptive (LC) algorithm.

Table 6.4: Tiling section comparison of speedup and efficiency for normal resolution images vs. high resolution images on GP1000, 96 processors

    Images (#polygons)   Normal Res.   High Res.   Normal Res.   High Res.
                         Speedup       Speedup     Efficiency    Efficiency
    Stegosaurus (9K)     71.4          86.5        0.74          0.90
    Laser (46K)          73.6          80.0        0.77          0.83
    Tree (106K)          70.5          84.6        0.73          0.88
    Mountain (131K)      82.2          87.6        0.86          0.91

The data from the table indicates that the speedup varies widely among the images, but the single common result is that the additional work in the high resolution images provides better speedup and efficiency than in the normal resolution case. It is likely that the reason for the improved speedup is that the ratio of work to communication time has increased, thus reducing the network contention percentage.

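A small model consistent with this explanation, assuming (purely for illustration) that the tiling work scales with the number of pixels while the communication overhead stays fixed:

    #include <stdio.h>

    /* Model: T(P) = work/P + overhead, where the overhead term
     * (communication, contention) does not grow with resolution.
     * Quadrupling the pixel count quadruples the work and leaves
     * the overhead unchanged, so efficiency rises. The numbers
     * below are hypothetical, not measurements from this book. */
    int main(void)
    {
        double work = 280.0, overhead = 0.45;  /* seconds */
        int p = 96;
        for (int scale = 1; scale <= 4; scale *= 4) {
            double w = work * scale;
            double tp = w / p + overhead;
            printf("work %6.0f s -> efficiency %.2f\n",
                   w, (w / tp) / p);
        }
        return 0;
    }
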
High Resolution Tiling + FE Comparison

[Figure 6.29: Rectangular vs. task adaptive on GP1000, mountain image, high-res. Time vs. number of processors (64-96) for the rectangular region (LC, UD) and task adaptive (LC, UD) algorithms.]

[Figure 6.30: Rectangular vs. task adaptive on TC2000, mountain image, high-res. Time vs. number of processors (16-48) for the same algorithms.]

Speedup and Efficiency of High Resolution Image

[Figure 6.31: Speedup for high-res mountain image, GP1000. Measured speedup vs. ideal speedup over 0-96 processors.]

[Figure 6.32: Efficiency for high-res mountain image, GP1000. Efficiency vs. number of processors over 0-96.]

This is logical, since the amount of network traffic has not changed, but the amount of work has increased due to the increase in spatial resolution. In the next section, the algorithms are compared on a different set of images involving much higher numbers of polygons to see the expected times on these datasets.

6.3.2. Object Complexity

One of the goals of this work was to present solutions for developing a highly efficient parallel rendering algorithm which allows extremely fast computation of complex imagery. To this end, new datasets are evaluated which contain a considerably greater number of polygons. Examples of two highly complex datasets are the rings image in color plate 5 and the dense tree image shown in color plate 6. The dense tree image has more polygons than the previously evaluated tree image, particularly in the twigs. Due to the amount of memory required for the tree dataset, it is not possible to evaluate speedup and efficiency, since not enough physical memory was available even locally. All of the images are evaluated based on their total time, as well as the number of polygons rendered per second. Hardware manufacturers typically quote a figure of polygons per second in evaluating hardware Z-buffer graphics workstations. A typical example of this type of machine is the HP-320/SRX. Some of the images which were evaluated previously have been rendered on this machine by Eric Haines, who developed the SPD database from which the tree, mountain, and rings datasets were extracted. Table 6.5 shows a comparison of all the images and the effective number of polygons per second achieved. These results were obtained using the task adaptive algorithm on 96 processors of a BBN TC2000.

Table 6.5: Effective rendering rate and speedup using 96 processors on BBN TC2000, task adaptive algorithm

    Images        #Polygons   Time (sec.)   Polygons/Second   Speedup
    Stegosaurus   9,639       1.14          8,455             61.3
    Laser         46,393      2.02          22,967            59.3
    Tree          106,289     2.94          32,907            56.4
    Mountain      131,072     4.12          29,857            70.6
    Rings         567,841     9.90          52,239            85.6
    Dense Tree    851,288     8.81          80,690            58.9

In comparison to the values in the table, Haines ran some of the same tests on the HP-320/SRX [Hain87b]. In rendering the dense tree, 4835 polygons/second was achieved. Using a denser version of the rings image (874K polygons) than employed here, the HP-320/SRX achieved 4819 polygons/second. For comparison, a current example of a state of the art graphics superworkstation is the Silicon Graphics Iris 4D VGX [Haeb90]. The manufacturer quotes a figure of 750,000 Gouraud-shaded triangles per second for this machine.

This value for the hardware performance is based on Gouraud shaded polygons which are not anti-aliased. Our data is for Phong shaded polygons with specular highlights and stochastic sampled anti-aliasing. Although it is difficult to compare exactly, the addition of Phong shading typically might cost 3 times as much as Gouraud shading, in addition to the anti-aliasing cost, which is about 4 times as much for stochastic sampling as it is for a Z-buffer.

Also, the amount of physical memory required to support over 500,000 polygons is most likely not available within a graphics workstation, and this will almost certainly slow down the hardware. Nevertheless, the results given here are significantly faster than a slightly older generation graphics workstation and might compare favorably with a current generation machine. When the fact that the rendering is done in software rather than hardware is added, the benefit is that much greater, since the software version allows much more flexibility in its use. This is elaborated upon in the next chapter.

6.4. Conclusions

Based on all of the data reported in this chapter, it seems clear that the task adaptive algorithm utilizing the LC memory referencing scheme provides the best performance for all types of imagery among the methods implemented. Besides performance, other advantages of the task adaptive approach are:

1. It is unnecessary to determine an initial optimal granularity ratio for this algorithm. The number of areas chosen initially corresponds to the number of processors in the system, and high performance is achieved regardless of machine configuration. In the other algorithms, this ratio must be derived for each image independently for the best performance.

2. The load balancing in the task adaptive approach is completely dynamic, based on the amount of work left. Because of this, worst case scenarios, such as all of the data being located in one portion of the screen, can be handled effectively, while the other algorithms will not perform nearly as well in this type of situation.

3. This approach has minimal time cost in the setup phase of the front end, since the number of areas is very small initially. Hence, this time is much less than in the other algorithms.

In summary, the first section of this chapter presents an analysis of several different memory referencing strategies. Based on this theoretical analysis, the Uniformly Distributed and Locally Cached schemes are shown to differ by only several tenths of a second. A description of the implementation of both of these schemes is then presented in detail. After comparing the results of the implementations, it is clear that the LC scheme combined with the task adaptive decomposition method results in the best performance for all the test imagery. The setup time from the front end for each approach is included in the timings, in order to allow a fair comparison and substantiate this fact. The theoretical analysis of the memory reference strategies indicated that there would be only a small difference between the UD and LC schemes. Instead, the difference is much larger, due to the fact that network contention is not accounted for in the theoretical analysis. This contention is an important degradation factor, and it is significantly smaller in the LC scheme than in the UD scheme.

The second section involves a comparison of the different parallel algorithms with respect to changes in the underlying machine parameters. It is shown that a change in the GP1000 operating system which allows parallel page faults markedly improved performance over the previous version of the operating system. When using the next generation of the BBN multiprocessor, known as the TC2000, the faster CPU and network switch improve performance in all cases by almost an order of magnitude over the GP1000. The task adaptive algorithm using the LC scheme proves to be the best performer on this new machine as well. A comparison of overhead factors between the two machines reveals, however, that network contention plays a more significant role in degrading performance on the TC2000 than on the GP1000 when measured at 48 processors. A possible explanation is that the interconnection network speed increase from the GP1000 to the TC2000 does not match the
corresponding increase in CPU performance. As a result, tasks execute faster, but the communication of data is more frequent, leading to a greater possibility of a blocked path in the network. The TC2000 offers enhancements to memory referencing (such as a hardware cache) that are not incorporated into the implemented algorithms. It is possible that if these were utilized, the effects of network contention could be reduced.

The third section in this chapter involves an analysis of the effect of increasing image and object complexity in the test scenes. Re-evaluating the performance of the algorithms at a resolution of 1280 x 968 on the mountain image reveals a speedup of 87.6 on 96 processors using the task adaptive algorithm on the GP1000. In addition, using a highly complex scene with over 800,000 data elements, an effective polygon rendering rate of over 80,000 anti-aliased Phong shaded polygons per second is achieved on 96 processors of a TC2000. This is most likely the fastest rendering realized to this point using a software algorithm on a general purpose MIMD architecture.

7

Conclusion

This book has primarily concentrated on the development and analysis of various approaches to tiling three-dimensional computer generated scenes on a multiprocessor. In doing so, we have presented the following:

1. A categorization of possible parallel approaches to graphics rendering into a taxonomy according to graphical task decomposition.
2. A number of methods which incorporate parallelism in all aspects of a graphics rendering program.
3. A quantitative analysis of various degradation factors encountered in a multiprocessor graphics display algorithm implementation.
4. The development of general task partitioning and memory referencing strategies which may be used in other graphics rendering algorithms, as well as in non-graphics applications.

These will be described in detail in the following sections.

7.1. Summary

In the past, research in the area of para
