processors. The key to solving this problem is to reduce the number of remote memory requests. As mentioned previously, it is likely that remote references became blocked in the switch because the tasks assigned to processors involved adjacent scan lines. If the scan lines were assigned in some haphazard order, this effect might be alleviated, since the data being referenced would not be the same. On the other hand, with a large number of processors such as 96, the contention probably could not be reduced significantly. The dataset size can also affect network contention, since larger datasets require more references. The combination of these two factors suggests that this overhead cannot be reduced in the general case unless a radical change in the memory referencing scheme is applied. If coherence can be maximized without sacrificing load balancing, and adequate parallelism can still be maintained, some of these factors can be reduced significantly. The algorithm described in the next section is a logical extension of this one and attempts to achieve these goals.
5.1.2. Rectangular Region Decomposition (UD Scheme)
The rectangular region decomposition algorithm is a generalization of the scan line decomposition algorithm. Instead of using single scan lines as wide as the screen for each task, a small group of contiguous scan lines is designated as a single task.¹ This idea was first suggested by Kaplan and Greenberg [Kapl79], who implemented both the Watkins and the Warnock algorithm [Roge85] as two alternative rectangular approaches. In their Warnock implementation, equal-size areas were initially assigned to all processors. The Warnock cull was then used within each single task.

¹As a reference note, from now on, when we refer to "scan lines" in the context of an algorithm such as this, we mean a scan line within the width of the area.
5.1.2.1. Coherence Effect on Rectangular Regions
Presumably, the reason the Warnock algorithm was chosen for a rectangular area partition is the area coherence exploited by that algorithm. In the case here, however, a scan line Z-buffer is the basis algorithm for each single area task, since this algorithm is inherently faster than Warnock's. Kaplan and Greenberg chose full scan line width tasks (of multiple contiguous scan lines) for their Watkins approach. The probable reason for this was to capitalize on
the vertical span coherence exploited by that algorithm. Hu and Foley showed that this type of contiguous approach did not provide sufficient load balancing in comparison to the dynamic single scan line method presented in the previous subsection. Since a scan line Z-buffer is used here, and not a Watkins spanning scan line algorithm, as the sequential task basis, there are no spans to update in the vertical dimension. Therefore, the aspect ratio (width versus height) of the rectangular areas chosen as single tasks is open to further analysis. The type of coherence lost when a region is started anew is in the vertical direction. For the first scan line of a region, all of the polygons which started above the area but are still relevant to it need to be initialized. Figure 5.5 illustrates those polygons.

Figure 5.5: Polygons relevant to an area but starting above it
This initialization is not necessary if vertical coherence is maintained. On the other hand, some initialization is also required for polygons which have their top vertex in the region to the immediate left or right of the current region. This initialization is likewise unnecessary if horizontal coherence is maintained.

Another loss due to coherence in the horizontal direction is the update of the edge interpolation parameters for those edges which intersect the region adjacent to the current one on its left. A multiply and an add operation are required for each interpolation parameter update, whereas an add operation is all that is needed if horizontal coherence is maintained. Note that all of these factors would change if clipping were implemented at the region level, but this is not done in the implementations described here. The losses due to coherence are further illustrated in figure 5.6.
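To make the cost difference concrete, the following C sketch contrasts the two update paths for a single edge interpolation parameter. It is illustrative only; the structure and names are assumptions, not the implementation described in the text.

    /* Incremental edge interpolation: cost of losing horizontal
     * coherence. Illustrative sketch; names and layout are assumed. */
    typedef struct {
        double value; /* current parameter value (e.g., 1/z, color) */
        double delta; /* per-step increment along the edge */
    } EdgeParam;

    /* With horizontal coherence: one add per step. */
    static void step_with_coherence(EdgeParam *p) {
        p->value += p->delta;
    }

    /* Without horizontal coherence: the edge enters the region
     * part-way along, so the parameter must be re-derived from the
     * edge start, costing a multiply and an add per parameter. */
    static void init_without_coherence(EdgeParam *p, double start, int steps) {
        p->value = start + (double)steps * p->delta; /* multiply + add */
    }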
For the purposes of this discussion, it is assumed that for a particular polygon and a square region, the loss in time due to horizontal
coherence and vertical coherence is nearly the same. In the vertical case, the initialization overhead is fairly costly and occurs for all polygons above the area but relevant to it. The horizontal initialization cost applies only to polygons whose top vertex enters the left or right adjacent region (not counting screen boundaries, since clipping occurs there). The other horizontal coherence loss, at the scan line level, is small in comparison, since it involves only the left adjacent region and is of complexity O(e · h · T_mult), where e is the number of edges in the left region and the current region, h is the height of each of these edges within the left region, and T_mult is the time for a multiply instruction. In any case, because of the wide variation in polygon size and count in currently used imagery, any inaccuracy introduced if these assumptions do not hold exactly should be minor.
5.1.2.2. Aspect Ratio Choice
From figure 5.6 it can be seen that coherence is lost only along the perimeter of the region, particularly along the left and top sides. If the regions are tall and narrow, horizontal coherence is completely lost, but little vertical coherence is lost, as shown in figure 5.7. The opposite is true for wide regions, as shown in figure 5.8. For nearly square regions, some degree of coherence is lost in both directions, as shown in figure 5.9. The total perimeter of a rectangular region is a function of its aspect ratio and is given by equation 5.4, where b is the length of the base and h is the height.

    perimeter = 2 · (b + h)    (5.4)

Figure 5.6: Loss due to coherence in the vertical and horizontal directions (1: vertical loss, initialization; 2: horizontal loss, initialization; 3: horizontal loss, scan line update)

Thus, if the aspect ratio of a given region is 1:3, the perimeter is 2 · (1 + 3) · x, or on the order of 8x. If the aspect ratio is 1:1, the perimeter is 2 · (1 + 1) · x, or 4x. It therefore seems logical that the configuration which yields the least perimeter for the regions will also have the least loss due to coherence. A few simple tests were done on a test image with a nearly uniform polygon distribution across the scene to verify these assertions. The results are given in table 5.1.
Table 5.1: Aspect ratio (width:height) time comparison in seconds

Ratio   3:1    2:1    1:1    1:2    1:3
Time    25.8   22.7   21.8   21.9   33.6
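As a quick check of equation 5.4, the following C sketch (a hypothetical helper, not code from the text) computes the perimeter of a region of fixed area for several aspect ratios, confirming that 1:1 minimizes it:

    #include <math.h>
    #include <stdio.h>

    /* Perimeter of a rectangle of area `area` with aspect ratio w:h
     * (equation 5.4: perimeter = 2 * (b + h)). Hypothetical helper. */
    static double perimeter_for_ratio(double area, double w, double h) {
        double base   = sqrt(area * w / h); /* b such that b * (b*h/w) = area */
        double height = area / base;
        return 2.0 * (base + height);
    }

    int main(void) {
        double area = 9.0; /* arbitrary fixed region area */
        printf("1:3 -> %.2f\n", perimeter_for_ratio(area, 1, 3)); /* ~13.86 */
        printf("1:1 -> %.2f\n", perimeter_for_ratio(area, 1, 1)); /* 12.00, smallest */
        printf("3:1 -> %.2f\n", perimeter_for_ratio(area, 3, 1)); /* ~13.86 */
        return 0;
    }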
Based on the analysis and the experimental data, a 1:1 aspect ratio appears to be the most suitable choice for high performance in a rectangular region decomposition. This verifies Whelan's experimental tests, in which he compared vertical, horizontal, and rectangular decompositions. This method is illustrated in figure 5.9 and also in color plate 2. The same shared memory referencing method is used here as in the previous algorithm, namely the uniformly distributed (UD) scheme.
5.1.2.3. Granularity Ratio Comparison
This algorithm requires a methodology for determining the granularity ratio R to be used for a given image or a given number of processors (recall that R = #tasks/P). It seems clear that a ratio of R = 1 would not produce adequate load balancing for images whose data is not uniformly distributed across the scene. Some questions need to be answered in order to determine what this ratio should be, and whether it should change depending on the image or the number of processors. Other researchers in the past (notably Fuchs and Johnson [Fuch79]) have used static assignment of tasks onto processors in an attempt to develop an evenly load balanced system. While static assignment may be preferable in some instances, it is not necessary here, since the Uniform System on the Butterfly implements a dynamic scheduler with a small amount of overhead. In the algorithm used here, a large number of easily determined tasks are executed in a dynamic fashion.

The dynamic method reduces load imbalance, and since the tasks are small at the end of the computation, very little work is left to perform. The number of tasks created strongly influences the degree of success of the load balancing. A sketch of this style of dynamic task assignment follows.
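The following C sketch illustrates the idea of dynamic self-scheduling over R · P region tasks drawn from a shared counter. It is a generic illustration only; the Uniform System's actual task-generator interface is not reproduced here, and the names and atomic primitive are assumptions.

    #include <stdatomic.h>

    /* Generic dynamic self-scheduling sketch: R * P region tasks are
     * drawn from a shared counter until none remain. Not the Uniform
     * System API. */
    #define P 96          /* number of processors         */
    #define R 24          /* granularity ratio (#tasks/P) */
    #define NUM_TASKS (R * P)

    static atomic_int next_task = 0;

    extern void render_region(int task_id);  /* assumed tiling routine */

    /* Each processor runs this loop; fast finishers simply grab more
     * work, so load imbalance is confined to the last few tasks. */
    void worker(void) {
        for (;;) {
            int id = atomic_fetch_add(&next_task, 1);
            if (id >= NUM_TASKS)
                break;
            render_region(id);
        }
    }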
Figure 5.7: Vertical subdivision

Figure 5.8: Horizontal subdivision

Figure 5.9: Rectangular mesh subdivision
On the other hand, if too many tasks are assigned, load imbalance is reduced, but the overhead of assigning more tasks introduces additional loss of coherence, communication, and network contention. Therefore, to choose the appropriate granularity ratio for a particular combination of image and number of processors, a series of experiments was designed to evaluate the performance of particular granularity ratios R. These are shown in figures A.1, A.2, A.3, and A.4 in the appendix. The data for the mountain image is given in figure 5.10 as a representative example. From the graphs, it seems clear that ratios anywhere in the range from 16 to 1 up to 28 to 1 are suitable for most of the imagery. A compromise ratio of 24 to 1 was chosen as the single value to be used in the main timing experiments.
One important note here is that the value of R determined above was evaluated on 96 processors, since this was the maximum number of processors used. Due to increases in communication and network contention as the number of processors increases, this value of R may be higher than desirable if more processors are to be used. Either educated guesses or empirical testing similar to the above would be required for a given processor configuration, especially if a different machine architecture were to be used. It is not reasonable to attempt to determine this ratio mathematically a priori, since the overhead factors are too hard to predict for a given image. The times for all images using R = 24 and a 1:1 aspect ratio for the regions are shown in figure 5.11. The speedup graph can be seen in figure 5.12.

Figure 5.10: Granularity ratio comparison for the mountain image, UD scheme (cumulative time vs. granularity ratio from 2 to 40; overhead components: sequential, load imbalance, contention, code modification, communication)
Although an aspect ratio of 1:1 for the regions was desired for all values of P, this could not be achieved for every combination of R and P. Therefore, an aspect ratio as close to 1:1 as possible was used when needed (all regions have the same aspect ratio for a given value of P), as the sketch below illustrates.
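A minimal way to realize this choice, shown here as an assumed helper rather than the book's code, is to factor the task count into a grid whose cells are as close to square as the factorization allows:

    #include <math.h>

    /* Split the screen into cols x rows = num_tasks regions whose
     * aspect ratio is as near 1:1 as the factorization of num_tasks
     * allows. Assumed illustrative helper. */
    void choose_grid(int num_tasks, int screen_w, int screen_h,
                     int *cols, int *rows) {
        double best = 1e30;
        *cols = num_tasks;
        *rows = 1;
        for (int c = 1; c <= num_tasks; c++) {
            if (num_tasks % c != 0)
                continue;
            int r = num_tasks / c;
            double w = (double)screen_w / c;   /* region width  */
            double h = (double)screen_h / r;   /* region height */
            /* distance of width:height from 1:1, in log space */
            double skew = fabs(log(w / h));
            if (skew < best) {
                best = skew;
                *cols = c;
                *rows = r;
            }
        }
    }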
We will now analyze this algorithm with regard to the issues identified previously to determine the various overhead percentages.
5.1.2.4. Scheduling (0.004% - 0.013%)
In this decomposition method, it is necessary to schedule all regions as separate tasks, but the number of regions varies as the number of processors is increased. To determine the maximum number of regions to be scheduled, we used the granularity ratio (R) of 24 to 1 on 96 processors, which results in a total of 2,304 tasks to schedule. The time to run a task consisting solely of the background color for one of these areas has been measured as T_back = 2.3 msec. The reason this is larger than in the scan line decomposition case is that a separate block transfer is necessary for each scan line in the area, and each block transfer requires a setup cost. Since 2.3 msec is exactly the overhead time to schedule 96 tasks on 96 processors (T_sched), scheduling will not become a bottleneck in this algorithm. The overhead due to scheduling is then obtained by plugging the values for this algorithm into equation 4.5, as shown below. This time represents a percentage overhead for the different images ranging from 0.004% for the mountain image to 0.013% for the stegosaurus image.

    Scheduling % = [((95 · 96)/2) · 24 µsec + (2,304 − 96) · 24 µsec] / (T_p · 96) · 100 %    (5.5)
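Plugging in the numbers, a small C fragment evaluates equation 5.5. This is illustrative only; the per-image parallel times T_p from the experiments are not reproduced here, so the value used below is hypothetical.

    #include <stdio.h>

    /* Evaluate equation 5.5 for a given parallel run time Tp (seconds).
     * Illustrative only; the measured per-image Tp values are not
     * reproduced here. */
    double scheduling_percent(double tp_seconds) {
        const double usec = 1e-6;
        double startup = (95.0 * 96.0 / 2.0) * 24.0 * usec; /* ramp-up cost   */
        double rest    = (2304.0 - 96.0) * 24.0 * usec;     /* remaining tasks */
        return (startup + rest) / (tp_seconds * 96.0) * 100.0;
    }

    int main(void) {
        /* Example: a hypothetical Tp of 20 s gives roughly 0.008%. */
        printf("%.4f%%\n", scheduling_percent(20.0));
        return 0;
    }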
5.1.2.5. Memory Latency (1.4% - 3.6%)
In this algorithm, memory latency is measured the same way as in the previous method, namely by counting the number of remote references and using equation 4.6 to determine the overall percentage. Fewer references to the shared data are needed than in the previous approach, due to the coherence maintained within a region. Consequently, the overall latency is reduced.
Figure 5.11: Rectangular region performance (UD scheme): time in seconds versus number of processors, log-log scale
Figure 5.12: Speedup of rectangular region decomposition (UD scheme): speedup versus number of processors, with ideal linear speedup shown for comparison
`
The percentage time of latency for this algorithm for the different test images ranges from 1.4% for the stegosaurus image to 3.6% for the mountain image. The latency increases with dataset size, which is the same phenomenon observed previously.

Most remote references occur for the first scan line of a region, since it is necessary to retrieve the data from remote memory and store it in local data structures. Once this is accomplished, the majority of references are local, the exception being a small amount of remote referencing required in the anti-aliasing portion of the code (this would be the same for each of these algorithms). The remote referencing in the anti-aliasing section stems from the need to obtain the plane equation of each polygon for stochastic sampling purposes. Since the previous algorithm incorporates no vertical scan line coherence, all the data for each scan line must be referenced remotely. The rectangular region partitioning scheme capitalizes on coherence within the region, so that remote referencing is reduced.
5.1.2.6. Network Contention (5.6% - 33.1%)
The network contention is calculated in the same manner for this algorithm as it was for the last one. Using this technique, the percentage effect of network contention for the various test images varies from 5.6% for the tree image to 33.1% for the stegosaurus image.

The contention in this algorithm, as compared to the parallel scan line approach, is worse for the stegosaurus and Laser images but improved for the tree and mountain images. It is difficult to speculate on the reason for this without further image analysis. Regardless, one can see that even with a reduced number of references (as compared to the parallel scan line approach), contention is still a major degradation factor. Most of the increase in network contention occurs as the number of processors is increased from 64 to 96, indicating that the switch network becomes overloaded with requests somewhere in this range.
5.1.2.7. Load Imbalance (4.3% - 11.5%)
The granularity ratio in this algorithm provides a better load balanced system than the last algorithm, although network contention increases the execution time, and this varies the overall finishing times. The load imbalance percentages measured at 96 processors for the different images vary from 4.3% for the mountain image to 11.5% for the tree image. When compared to the previous algorithm, the load imbalance overhead is lower for this algorithm, with the exception of the tree test image. The granularity ratio comparison in figure A.3 for this image indicates that load balancing is not particularly good at any value of R and in fact gets worse after R = 24.

In general, though, this algorithm yields better load balancing than the scan line approach, since the granularity ratio provides enough tasks to minimize the load imbalance over a wide range of processor configurations.
5.1.2.8. Code Modification (7.9% - 9.6%)
This algorithm has a different amount of coherence overhead than the scan line algorithm, since rectangular regions are generated as tasks. Due to the rectangular nature of the regions, coherence is taken advantage of in both the vertical and horizontal directions within a single task. On the other hand, the lack of vertical scan line coherence at the beginning of an area results in extra work to start the first scan line of a region. In addition, the lack of horizontal coherence at the boundary to the left incurs the overhead of interpolating parameters for polygons which extend beyond this boundary.

The code modification overhead is measured as before, using equation 4.8. Based on the measured values, the overhead percentages vary from 7.9% for the mountain image to 9.6% for the tree image. Considering that many more tasks are used in this scheme than in the parallel scan line approach (2,304 vs. 484), this overhead factor does not seem out of line in comparison.
5.1.2.9. Explanation of Results
The rectangular region decomposition scheme achieves reduced overheads primarily in memory latency, and to some degree in network contention and load balancing, in comparison to the parallel scan line algorithm. The reduction in latency is due to the fact that most remote referencing occurs for the first scan line of an area, and this effect is reduced in the rectangular region algorithm. This suggests that the rectangular region decomposition algorithm will perform better than the scan line algorithm in the general case, given its performance advantages in the tests presented. The scan line algorithm exhibits poor scalability as P is increased, since load balancing suffers as the number of scan lines approaches the number of processors. The rectangular region algorithm uses a fixed granularity ratio, which allows better load balancing as the number of processors is increased; thus its scalability is superior to the parallel scan line approach.
It is important to note that this algorithm still has its share of problems. Network contention still represents a significant overhead, and the code modification overhead is not reduced in comparison to the scan line approach. A comparison of the major degradation factors is given in figure 5.13.

Figure 5.13: Degradation factors for rectangular region decomposition (UD scheme, P = 96): percentage breakdown per image into contention, code modification, load imbalance, latency, and usable time

Since the number of remote references is reduced in this algorithm but contention was not significantly reduced, another tactic is necessary to solve this problem. Consequently, it is necessary to implement a different memory referencing scheme, one designed to reduce the network contention noted in parallel implementations. This memory referencing strategy is referred to as the locally cached (LC) scheme and is described in the next section.
5.1.3. Rectangular Region Decomposition (LC Scheme)
The algorithm described here is implemented exactly the same as the last one, the only exception being the remote memory referencing strategy. A brief description of this strategy, denoted the Locally Cached or LC scheme, follows. Instead of referencing globally
shared data remotely, the data is cached into the local memory module prior to referencing, allowing it to be accessed quickly. Although others have used an elaborate software caching mechanism for computer graphics rendering ([Gree89] and [Bado90]), we rely on the fact that the exact data needed for a given task can be copied directly to each processor prior to the computation of a given region. If a data element is relevant to more than one processor, it is copied to all processors which would reference it, incurring a space penalty. The extra memory required due to duplication is shown in the appendix in figures A.9, A.10, A.11, and A.12. These figures show the duplication of data caused by copying data elements, as a function of the total number of regions. Although the extra memory required is wasteful, a tradeoff of space versus time is necessary to achieve faster memory referencing than the previous Uniformly Distributed (UD) approach. The cost of non-local memory access is eliminated by block transferring data from its global storage location to the local memory of the processor(s) that need it. Details of the LC scheme are given in chapter 6.
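In outline, the per-task referencing pattern looks something like the following C sketch. The structures and the block-transfer call are assumptions for illustration; the Butterfly's actual block-transfer primitive and the book's data structures are not reproduced here.

    #include <stddef.h>
    #include <string.h>

    /* Locally Cached (LC) referencing, in outline. Assumed types and a
     * stand-in block-transfer routine. */
    typedef struct { double plane[4]; /* ... per-polygon data ... */ } Polygon;

    /* Stand-in for a hardware block transfer from a remote (global)
     * memory module into this processor's local memory. */
    static void block_transfer(void *local, const void *remote, size_t bytes) {
        memcpy(local, remote, bytes); /* placeholder for the real primitive */
    }

    void process_region(const Polygon *global_polys, const int *ids, int n,
                        Polygon *local_cache) {
        /* 1. Copy exactly the polygons relevant to this region, once,
         *    before tiling begins; duplication across regions costs space. */
        for (int i = 0; i < n; i++)
            block_transfer(&local_cache[i], &global_polys[ids[i]],
                           sizeof(Polygon));

        /* 2. All subsequent references during tiling are local, so no
         *    remote latency or switch contention is incurred here. */
        /* tile_region(local_cache, n); ... assumed tiling routine */
    }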
5.1.3.1. Granularity Ratio
The granularity ratio R was re-evaluated for this algorithm, since a different memory referencing scheme is used. The ratio was tested at values of 2, 4, 8, 12, 16, 20, 24, 28, 32, 36, and 40 using the maximum configuration of 96 processors. Figure 5.14 shows the comparison for the Laser image as an example; in the appendix, figures A.5, A.6, A.7, and A.8 show the data for all the images. The downward slope of the curves is primarily due to a reduction in load imbalance as a higher granularity ratio is used. The curves then turn upwards after a point, since the other overhead sources introduce more cost at the higher ratios. The minimum point on the curve is the optimal granularity ratio (R) to use for a particular scene. Each scene exhibits different characteristics which affect the choice of this optimal R, so a compromise must be made so that a single value of R may be used in the general case.
In this case, the choice of a good ratio spans a broader range than in the UD scheme, because communication and contention costs are lower than in the previous method. It appears that a good choice for R can lie anywhere in the range from 12 to 1 up to 32 to 1. Since R = 24 was chosen for the previous algorithm, and the performance at that ratio with this scheme is nearly optimal in most cases, this value will again be used. This ratio should provide good results for most imagery, given this machine configuration. Graphs for the time and speedup of this algorithm with the LC memory referencing scheme are given in figures 5.15 and 5.16.

The overhead factors for the rectangular region decomposition are now discussed, using the LC memory referencing scheme evaluated at 96 processors.
5.1.3.2. Scheduling (0.004% - 0.017%)
The time to run a background task (T_back) in this scheme is the same as in the previous one, since the only difference between the two is the memory referencing, which does not affect the background task. This algorithm is faster than the previous one, so the overhead percentage is slightly higher. Equation 5.5 is again used to evaluate the overhead due to scheduling. Based on this equation, the scheduling overhead varied from 0.004% for the mountain image to 0.017% for the stegosaurus image.
Figure 5.14: Granularity ratio comparison for the Laser image, LC scheme (cumulative time vs. granularity ratio from 2 to 40; components: sequential, communication, load imbalance, code modification, contention)
Figure 5.15: Tiling time for rectangular region partitioning (LC scheme): time in seconds versus number of processors, log-log scale
Figure 5.16: Speedup for rectangular region partitioning (LC scheme): speedup versus number of processors for the test images, with ideal linear speedup shown

5.1.3.3. Communication Overhead (0.03% - 0.05%)
Instead of latency due to remote referencing as in the UD case, communication in this algorithm occurs in blocks, resulting in a different overhead factor. This communication overhead is not present in the UD referencing method. Recall that this overhead is measured by counting the total number of bytes transferred in the system during the computation. Using equation 4.7 from the previous chapter, the communication overhead ranges from 0.03% for the stegosaurus image up to 0.05% for the mountain image. These values represent a significant drop in the amount of time spent transferring data in the system. Since the data is copied into local memory, all subsequent references occur locally. This means that the total amount of data transferred is also reduced in comparison to the UD referencing scheme. A sketch of this byte-count instrumentation follows.
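Measuring this overhead amounts to accumulating a per-processor byte count at each block transfer, roughly as in this C sketch (an assumed instrumentation wrapper, not the book's code):

    #include <stddef.h>

    /* Assumed instrumentation wrapper: every block transfer adds its
     * size to a per-processor tally, summed after the run to compute
     * the communication overhead percentage. */
    #define MAX_PROCS 96

    static size_t bytes_transferred[MAX_PROCS];

    extern void block_transfer(void *local, const void *remote, size_t bytes);
    extern int my_proc_id(void);

    void counted_block_transfer(void *local, const void *remote, size_t bytes) {
        block_transfer(local, remote, bytes);
        bytes_transferred[my_proc_id()] += bytes;
    }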
5.1.3.4. Network Contention (3.1% - 16.3%)
Although some overhead is necessary to set up the blocks of data to be transferred in this algorithm, the deficit is more than made up for by a reduction in network contention when compared to the UD scheme. The calculated network contention overhead varies from 3.1% for the tree image to 16.3% for the stegosaurus image. The contention in this scheme is significantly less than in the previous one. This indicates that the locally cached memory referencing scheme does in fact reduce the number of messages in the system, which in turn reduces the chances of a blocked switch node.
5.1.3.5. Load Imbalance (4.5% - 11.1%)
The load imbalance in this algorithm is measured as before, using equation 4.11. The overhead percentages for load imbalance vary from 4.5% for the mountain image to 11.1% for the tree image. These values are nearly the same as those from the previous algorithm, which is to be expected, since both use the same partitioning method.
5.1.3.6. Code Modification (5.4% - 6.4%)
The code modification overhead using the LC scheme is less than in the UD scheme in all cases except the tree image. The measured overhead ranges from 5.4% for the mountain image to 6.4% for the stegosaurus image. The probable reason for the difference is that communication is not completely factored out of the measurement method. Recall that the measurement technique for this overhead involves timing the program running on a single processor, using MIN memory modules. The UD scheme involves remote referencing to these memory modules, while the LC scheme does not. Although the communication cost is factored out of the measured time by counting the number of remote references or bytes transferred, respectively, it is impossible to factor out the system overheads. Since the LC scheme is unlikely to include these to the degree that the UD scheme does, due to its method of memory allocation and deallocation, the resulting code modification figure is generally lower here.
5.1.3.7. Explanation of Results
Latency is no longer a factor with this memory referencing scheme, and although communication overhead is introduced, it is minimal. The change in memory referencing scheme also affects the overall code modification, as reported above. The load imbalance is nearly the same as in the previous algorithm, with the slight difference due to the effect of reduced contention in this algorithm. The chart in figure 5.17 indicates the overheads for the various images.
Figure 5.17: Degradation factors for rectangular region decomposition (LC scheme, P = 96): percentage breakdown per image into contention, communication, code modification, load imbalance, and usable time
As one can see from the chart, network contention is still a problem, although it is significantly reduced in comparison to the UD scheme, especially for the more complex imagery. Load imbalance is also a problem, although the algorithm should scale up fairly well, especially when one takes into account the reduced contention of this memory referencing strategy. In the next section, we introduce another partitioning scheme which achieves even better load balancing.
5.2. Data Adaptive Partitioning Scheme
In a data adaptive algorithm, load balancing is achieved by constructing tasks which are estimated to take nearly the same amount of time. With image space partitioning in a parallel graphics rendering program, tasks can be determined based on the location of data within the image. If the task work can be accurately predicted by a heuristic, then the granularity ratio R can be reduced, resulting in less communication and scheduling. In fact, if the adaptation can produce tasks of exactly the same size in terms of work, R can be reduced to 1. It is generally not possible to pick a very accurate heuristic, since factors such as depth complexity, polygon area, and anti-aliasing all affect the time it takes to render a pixel, and pre-processing of the data cannot take all of these factors into account without itself requiring too much time. Following is a brief description of several algorithms which fall under the data adaptive category.
Whelan [Whel85] uses a data adaptive approach in his Median Cut algorithm, although his application was a hardware architecture. His primary motivation was to reduce the scheduling overhead associated with the type of dynamic task assignment used in the algorithms discussed thus far. This is not necessary in a software multiprocessor approach, since the Uniform System provides scheduling with very small overhead. Whelan's approach partitions tasks so that each task contains the same number of polygons. He uses the centroids of the polygons to determine their screen space locations; however, extensive sorting is necessary to determine where to place the screen space partitions. His algorithm provides excellent load balancing, but the overhead cost of creating the areas outweighs the benefit of adaptive partitioning. Roble's [Robl88] approach is another data adaptive method which also uses polygon location as a heuristic for determining tasks. His approach involves a large amount of communication prior to the tiling phase, and thus exhibits too much overhead as well.
Although there are many different decomposition methods that fall under the data adaptive heading, one algorithm was chosen as a representative example for implementation. The goal here was to eliminate the excess overhead associated with this type of approach. This algorithm is described next.
5.2.1. Top-down Decomposition
A partitioning scheme similar to Whelan's Median Cut algorithm is used which takes comparatively less time to determine the task partitions. This scheme is based solely on the number of data elements in a region, regardless of the location of their centroids. The heuristic in this algorithm is based on the assumption that the number of polygons in a region is linearly related to the time it takes to tile that region. Using this simple heuristic, good load balancing can be achieved with a small overhead, as the sketch below suggests. The LC memory re
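As a rough illustration of the polygon-count heuristic (a sketch under stated assumptions, not the author's implementation), a region can be split recursively, always cutting so that each half holds about the same number of polygons, until the desired number of tasks is reached:

    /* Recursive top-down split by polygon count: each cut divides the
     * current region so the two halves hold roughly equal numbers of
     * polygons. Counting and representation details are assumed. */
    typedef struct { int x0, y0, x1, y1; } Region;

    /* Assumed helper: number of polygons overlapping a region. */
    extern int polys_in(Region r);

    /* Emit `tasks` regions covering `r`, balanced by polygon count. */
    void split_region(Region r, int tasks, void (*emit)(Region)) {
        if (tasks <= 1) {
            emit(r);
            return;
        }
        int wide = (r.x1 - r.x0) >= (r.y1 - r.y0);
        Region a = r, b = r;
        /* Slide the cut along the longer axis until the halves balance. */
        if (wide) {
            int cut = r.x0 + 1;
            while (cut < r.x1 - 1) {
                a.x1 = cut; b.x0 = cut;
                if (polys_in(a) * 2 >= polys_in(r)) break;
                cut++;
            }
            a.x1 = cut; b.x0 = cut;
        } else {
            int cut = r.y0 + 1;
            while (cut < r.y1 - 1) {
                a.y1 = cut; b.y0 = cut;
                if (polys_in(a) * 2 >= polys_in(r)) break;
                cut++;
            }
            a.y1 = cut; b.y0 = cut;
        }
        split_region(a, tasks / 2, emit);
        split_region(b, tasks - tasks / 2, emit);
    }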
