Figure 5.11: Rectangular region performance (UD scheme)

Figure 5.12: Speedup of rectangular region decomposition (UD scheme)

The latency overhead for this algorithm ranges from 1.4% for the stegosaurus image to 3.6% for the mountain image across the different test images. The latency increases with dataset size, the same phenomenon observed previously. Most remote references occur for the first scan line of a region, since the data must be retrieved from remote memory and stored in local data structures. Once this is accomplished, the majority of references are local; the exception is a small amount of remote referencing required in the anti-aliasing portion of the code (which is the same for each of these algorithms). The remote referencing in the anti-aliasing section stems from the need to obtain the plane equation of each polygon for stochastic sampling purposes. Since the previous algorithm incorporates no vertical scan line coherence, all the data for each scan line must be referenced remotely. The rectangular region partitioning scheme capitalizes on coherence within the region, so remote referencing is reduced.

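To make the reference pattern above concrete, the following is a minimal sketch; the data layout, names, and the blanket copy of all polygons are assumptions for illustration, not the renderer's actual code. It shows why, under the UD scheme, remote traffic concentrates on a region's first scan line:

```c
/* Sketch of the UD-scheme reference pattern for one region -- illustrative
 * only.  The point: the first scan line pulls polygon records across the
 * switch; later lines reuse the local copies (vertical coherence). */
#include <stdlib.h>

typedef struct { float plane[4]; } Poly;

/* Stand-in for globally shared, uniformly distributed polygon data
 * (physically spread across remote MIN memory modules). */
static Poly global_polys[1024];
static int  global_count = 1024;

static void RenderSpan(const Poly *p, int y) { (void)p; (void)y; /* tile pixels */ }

static void TileRegion(int y0, int y1)
{
    /* First scan line: each access to global_polys[] is a remote reference,
     * so copy what the region needs into local data structures once. */
    Poly *local = malloc(global_count * sizeof(Poly));
    for (int i = 0; i < global_count; i++)
        local[i] = global_polys[i];          /* remote read -> local store */

    /* Remaining scan lines: references are local, except for the occasional
     * remote fetch of a plane equation in the anti-aliasing code. */
    for (int y = y0; y <= y1; y++)
        for (int i = 0; i < global_count; i++)
            RenderSpan(&local[i], y);

    free(local);
}
```
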
5.1.2.6. Network Contention (5.6% - 33.1%)

The network contention is calculated in the same manner for this algorithm as it was for the last one. Using this technique, the percentage effect of network contention for the various test images varies from 5.6% for the tree image to 33.1% for the stegosaurus image.

The contention in this algorithm, as compared to the parallel scan line approach, is worse for the stegosaurus and Laser images but improved for the tree and mountain images. It is difficult to speculate as to the reason for this without further image analysis. Regardless, one can see that even with a reduced number of references (as compared to the parallel scan line approach), contention is still a major degradation factor. Most of the increase in network contention occurs as the number of processors is increased from 64 to 96, indicating that the switch network becomes overloaded with requests somewhere in this range.

5.1.2.7. Load Imbalance (4.3% - 11.5%)

The granularity ratio in this algorithm provides a better load-balanced system than the last algorithm, although network contention increases the execution time and this varies the overall finishing times. The load imbalance percentages measured at 96 processors for the different images vary from 4.3% for the mountain image to 11.5% for the tree image. When compared to the previous algorithm, the load imbalance overhead is less for this algorithm, with the exception of the tree test image. The granularity ratio comparison in figure A.3 for this image indicates that load balancing is not particularly good at any value of R and in fact gets worse after R = 24. In general, though, this algorithm yields better load balancing than the scan line approach since the granularity ratio provides enough tasks to minimize the load imbalance over a wide range of processor configurations.

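Equation 4.11 is defined in chapter 4 and is not reproduced in this excerpt. Purely as a hedged sketch of what a load imbalance percentage of this kind typically measures (not necessarily the author's exact formulation), with T_i the tiling time of processor i and P = 96:

\[
\text{Load imbalance \%} \;=\; \frac{\max_i T_i \;-\; \frac{1}{P}\sum_{i=1}^{P} T_i}{\max_i T_i} \times 100\%
\]
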
5.1.2.8. Code Modification (7.9% - 9.6%)

This algorithm has a different amount of coherence overhead than the scan line algorithm since rectangular regions are generated as tasks. Due to the rectangular nature of the regions, coherence is taken advantage of in both the vertical and horizontal directions within a single task. On the other hand, the lack of vertical scan line coherence at the beginning of an area results in extra work required to start the first scan line of a region. In addition, the lack of horizontal coherence at the left boundary causes an overhead of interpolating parameters for polygons which extend beyond this boundary.

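The left-boundary cost can be pictured with a small forward-differencing sketch; the structure and names are hypothetical rather than the renderer's actual code, but they show the extra per-polygon work when a span starts outside the region being tiled:

```c
/* Sketch: advancing a polygon's scan line interpolants to a region's left edge.
 * Hypothetical data layout; illustrates the horizontal-coherence loss only. */
typedef struct {
    float z, dz_dx;      /* depth and its per-pixel increment           */
    float u, du_dx;      /* one shading/texture parameter, likewise     */
} SpanInterp;

/* The span normally starts at x_start; if the region begins at region_x0 >
 * x_start, each parameter must be stepped forward before any pixel in the
 * region is tiled -- work a full-width scan line version would not need. */
static void ClipSpanToRegion(SpanInterp *s, int x_start, int region_x0)
{
    if (region_x0 > x_start) {
        int skip = region_x0 - x_start;   /* pixels outside this task's region */
        s->z += skip * s->dz_dx;
        s->u += skip * s->du_dx;
    }
}
```
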
The code modification overhead is measured as before, using equation 4.8. Based on the measured values, the overhead percentages vary from 7.9% for the mountain image to 9.6% for the tree image. Considering the fact that there are many more tasks used in this scheme versus the parallel scan line approach (2,304 vs. 484), this overhead factor does not seem out of line in comparison.

5.1.2.9. Explanation of Results

The rectangular region decomposition scheme achieves reduced overheads primarily in memory latency and, to some degree, in network contention and load balancing, in comparison to the parallel scan line algorithm. The reduction in latency is due to the fact that most remote referencing occurs for the first scan line of an area, and this effect is reduced in the rectangular region algorithm. This suggests that the rectangular region decomposition algorithm will perform better than the scan line algorithm in the general case due to its performance advantages in the tests given. The scan line algorithm exhibits poor scalability as P is increased, since load balancing will suffer as the number of scan lines approaches the number of processors. The rectangular region algorithm uses a fixed granularity ratio which allows better load balancing as the number of processors is increased; thus its scalability is superior to the parallel scan line approach.

It is important to note that this algorithm still has its share of problems. Network contention still represents a significant overhead. The code modification overhead is not reduced in this algorithm in comparison to the scan line approach. A comparison of the major degradation factors is given in figure 5.13.

Since the number of remote references is reduced in this algorithm but contention is not significantly reduced, another tactic is necessary to solve this problem. Consequently, it is necessary to implement a different memory referencing scheme that is designed to reduce the network contention noted in parallel implementations. This memory referencing strategy is referred to as the locally cached (LC) scheme and is described in the next section.

Figure 5.13: Degradation factors for rectangular region decomposition (UD Scheme, P = 96)

5.1.3. Rectangular Region Decomposition (LC Scheme)

The algorithm described here is implemented exactly the same as the last one, with the only exception being the remote memory referencing strategy. A brief description of this strategy, denoted the Locally Cached or LC scheme, follows. Instead of referencing globally shared data remotely, the data is cached into the local memory module prior to referencing, thus allowing it to be accessed quickly. Although others have used an elaborate software caching mechanism for computer graphics rendering ([Gree89] and [Bado90]), we rely on the fact that the exact data needed for a given task can be copied directly to each processor prior to the computation of a given region. If a data element is relevant to more than one processor, it is copied to all processors which would reference it, incurring a space penalty. The extra memory required due to duplication is shown in the appendix in figures A.9, A.10, A.11, and A.12. These figures show the duplication of data caused by copying data elements, as a function of the total number of regions. Although the extra memory required is wasteful, a tradeoff of space versus time is necessary to achieve faster memory referencing than the previous Uniformly Distributed (UD) approach. The cost of non-local memory access is eliminated by block transferring data from its global storage location to the local memory of the processor(s) that need it. Details of the LC scheme are given in chapter 6.

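The following is a minimal sketch of the LC idea for one rectangular region. It is not the GP1000/Uniform System code; the transfer routine, types, and names are assumptions made for illustration. The point is that one block copy moves a region's polygon list into local memory, after which all tiling references are local:

```c
/* Sketch of the Locally Cached (LC) referencing idea -- illustrative only.
 * BlockTransfer() stands in for whatever block-move primitive the machine
 * provides; it is not a real Uniform System call. */
#include <stdlib.h>
#include <string.h>

typedef struct { float plane[4]; float verts[4][3]; } Poly;

typedef struct {            /* per-region entry in globally shared memory    */
    Poly *polys;            /* polygons whose bounding boxes touch the region */
    int   count;
} RegionBucket;

/* Pretend block-transfer: copies 'bytes' from global to local memory. */
static void BlockTransfer(void *local_dst, const void *global_src, size_t bytes)
{
    memcpy(local_dst, global_src, bytes);   /* one message path, one transfer */
}

/* Called by a processor just before it tiles region 'r'. */
static Poly *CacheRegionLocally(const RegionBucket *r, int *count_out)
{
    Poly *local = malloc(r->count * sizeof(Poly));   /* space-for-time tradeoff: */
    BlockTransfer(local, r->polys,                   /* the same polygon may be  */
                  r->count * sizeof(Poly));          /* copied to several nodes  */
    *count_out = r->count;
    return local;           /* every subsequent reference during tiling is local */
}
```

Under the UD scheme, the tiling loop would instead dereference the globally distributed records directly, paying switch latency (and risking contention) on every access; the LC scheme trades the duplicated local storage described above for that latency.
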
5.1.3.1. Granularity Ratio

The granularity ratio R was re-evaluated for this algorithm to see what a good ratio would be, since a different memory referencing scheme is used. This ratio was tested at values of 2, 4, 8, 12, 16, 20, 24, 28, 32, 36, and 40 using the maximum configuration of 96 processors. Figure 5.14 shows the comparison for the Laser image as an example. In the appendix, figures A.5, A.6, A.7, and A.8 show the data for all the images. The downward slope of the curves is primarily due to a reduction in load imbalance as a higher granularity ratio is used. The curves then turn upwards after a point, since the other overhead factors introduce more cost at the higher ratios. The minimum point on the curve is the optimal granularity ratio (R) to use for a particular scene. Each scene exhibits different characteristics which affect the choice of this optimal R, so a compromise must be made so that a single value of R may be used in the general case.

In this case, the choice of a good ratio spans a broader range than in the UD scheme. The reason for this is the reduction in communication and contention costs versus the previous method. It appears that a good choice for R can be anywhere in the range from 12 to 1 up to 32 to 1. Since R = 24 was chosen for the previous algorithm and the performance for that ratio with this scheme is nearly optimal in most cases, this value will again be used. This ratio should provide good results for most imagery, given this machine configuration. Graphs for the time and speedup of this algorithm with the LC memory referencing scheme are given in figures 5.15 and 5.16.

The overhead factors for the rectangular region decomposition are now discussed, using the LC memory referencing scheme evaluated at 96 processors.

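As a hedged illustration of the kind of sweep described here (this is not the author's test harness; RunTiling is a synthetic placeholder for one full tiling run with #regions = R * P), selecting R amounts to timing one run per candidate ratio and taking the minimum:

```c
/* Sketch: empirical selection of the granularity ratio R at P = 96. */
#include <float.h>
#include <stdio.h>

static double RunTiling(int num_regions)
{
    /* Placeholder U-shaped curve: the imbalance term falls with more regions,
     * while per-task overheads (scheduling, coherence loss) grow. */
    double R = num_regions / 96.0;
    return 10.0 / R + 0.05 * R;
}

int main(void)
{
    const int P = 96;
    const int ratios[] = { 2, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40 };
    const int n = (int)(sizeof ratios / sizeof ratios[0]);
    double best_time = DBL_MAX;
    int best_R = 0;

    for (int i = 0; i < n; i++) {
        double t = RunTiling(ratios[i] * P);      /* one run per candidate R */
        printf("R = %2d  ->  %.3f s\n", ratios[i], t);
        if (t < best_time) { best_time = t; best_R = ratios[i]; }
    }
    printf("minimum at R = %d\n", best_R);
    return 0;
}
```
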
5.1.3.2. Scheduling (0.004% - 0.017%)

The time to run a background task in this scheme is the same as in the previous one, since the only difference between the two is the memory referencing, which does not affect the background task. This algorithm is faster than the previous one, so the overhead percentage is slightly higher. Equation 5.5 is again used for evaluating the overhead due to scheduling for this algorithm. Based on this equation, the overhead due to scheduling varied from 0.004% for the mountain image to 0.017% for the stegosaurus image.

Figure 5.14: Comparison of ratios for Laser image (cumulative time vs. granularity ratio)

Figure 5.15: Tiling time for rectangular region partitioning (LC)

Figure 5.16: Speedup for rectangular region partitioning (LC Scheme)

5.1.3.3. Communication Overhead (0.03% - 0.05%)

Instead of latency due to remote referencing as in the UD case, communication occurs in blocks in this algorithm, resulting in a different overhead factor. This communication overhead is not present in the UD referencing method. Recall that this overhead is measured by a count of the total number of bytes transferred in the system during the computation. Using equation 4.7, given in the previous chapter, the communication overhead ranges from 0.03% for the stegosaurus image up to 0.05% for the mountain image. One can see that these values represent a significant drop in the amount of time necessary to transfer data in the system. Since the data is copied into local memory, all subsequent references occur locally. This means that the total amount of data transferred is also reduced in comparison to the UD referencing scheme.

5.1.3.4. Network Contention (3.1% - 16.3%)

Although there is some overhead necessary to set up the blocks of data to be transferred in this algorithm, the deficit is more than made up for by a reduction in network contention when compared to the UD scheme. The calculated network contention overhead varies from 3.1% for the tree image to 16.3% for the stegosaurus image. The contention in this scheme is significantly less than in the previous one. This indicates that the locally cached memory referencing scheme does in fact reduce the number of messages in the system, which results in reduced chances for a blocked switch node.

5.1.3.5. Load Imbalance (4.5% - 11.1%)

The load imbalance in this algorithm is measured the same as before, using equation 4.11. The overhead percentages for load imbalance vary from 4.5% for the mountain image to 11.1% for the tree image. These values are nearly the same as those from the previous algorithm, which is to be expected since they both use the same partitioning method.

5.1.3.6. Code Modification (5.4% - 6.4%)

The code modification overhead using the LC scheme is less than in the UD scheme in all cases except the tree image. The measured overhead ranges from 5.4% for the mountain image to 6.4% for the stegosaurus image. The probable reason for the difference is that communication is not completely factored out of the measurement method. Recall that the measurement technique used for this overhead involves timing the program running on a single processor, using MIN memory modules. The UD scheme involves remote referencing to these memory modules, while the LC scheme does not. Although the communication cost is factored out of the measured time by counting the number of remote references or bytes transferred, respectively, it is impossible to factor out the system overheads. Since the LC scheme is not likely to incur these to the degree that the UD scheme does, due to its method of memory allocation and deallocation, the resultant code modification figure is generally lower here.

5.1.3.7. Explanation of Results

Latency is no longer a factor using this memory referencing scheme, and although communication overhead is introduced, it is minimal. The change in memory referencing scheme also affects the overall code modification, as reported above.

The load imbalance is nearly the same as in the previous algorithm, with the slight difference due to the effect of reduced contention in this algorithm. The chart in figure 5.17 indicates the overheads for the various images.

Figure 5.17: Degradation factors for rectangular region decomposition (LC Scheme, P = 96)

As one can see from the chart, network contention is still a problem, although it is significantly reduced in comparison to the UD scheme, especially for the more complex imagery. Load imbalance is also a problem, although the algorithm should scale up fairly well, especially when one takes into account the reduced contention using this memory referencing strategy.

In the next section, we introduce another partitioning scheme which achieves even better load balancing.

5.2. Data Adaptive Partitioning Scheme

In a data adaptive algorithm, load balancing is achieved by constructing tasks which are estimated to take nearly the same amount of time. By using image space partitioning in a parallel graphics rendering program, tasks can be determined based on the location of data within the image. If the task work can be accurately predicted by using a heuristic, then the granularity ratio R can be reduced, resulting in less communication and scheduling. In fact, if the adaptation can produce tasks of exactly the same size in terms of work, R can be reduced to 1. It is not generally possible to pick a very accurate heuristic, since factors such as depth complexity, polygon area, and anti-aliasing all affect the time it takes to render a pixel. Pre-processing of the data cannot take all of these factors into account; otherwise it would require too much time. Following is a brief description of several algorithms which fall under the data adaptive category.

Whelan [Whel85] uses a data adaptive approach in his Median Cut algorithm, although his application was for a hardware architecture. His primary motivation was to reduce the scheduling overhead associated with the type of dynamic task assignment used in the algorithms discussed thus far. This is not necessary in a software multiprocessor approach, since the Uniform System provides scheduling with a very small overhead. Whelan's approach involves task partitioning so that each task contains the same number of polygons. He uses the centroids of the polygons to determine their screen space location; however, extensive sorting is necessary to determine the locations to place the screen space partitions. His algorithm provides excellent load balancing, but the overhead cost of creating the areas outweighs the benefit of adaptive partitioning.

Roble's [Robl88] approach is another data adaptive method which also uses polygon location as a heuristic for determining tasks. His approach involves a large amount of communication prior to the tiling phase, and thus exhibits too much overhead as well.

Although there are many different decomposition methods that fall under the data adaptive category, one algorithm was chosen as a representative example for implementation. The goal here was to eliminate the excess overhead associated with this type of approach. This algorithm is described next.

5.2.1. Top-down Decomposition

A partitioning scheme similar to Whelan's Median Cut algorithm is used which takes comparatively less time to determine the task partitions. This scheme is based solely on the number of data elements in a region, regardless of the location of their centroids. The heuristic in this algorithm is based on the assumption that the number of polygons in a region is linearly related to the time it takes to tile that region. Using this simple heuristic, good load balancing can be achieved with a small overhead. The LC memory referencing scheme is used in the implementation of this algorithm based on the results shown in the previous section. The implementation is described below.

A 2D mesh is created as in the rectangular region decomposition, but this time the mesh is four times as dense (i.e., #regions = R × P × 4). Polygons are placed into the mesh during the front end portion of the program as before, based on their screen space bounding boxes. Prior to tiling, adjacent mesh regions are combined hierarchically and a sum of the combined regions is stored in a tree data structure. This process is repeated until the entire screen is contained in a single region. The result is a data structure consisting of a hierarchical binary tree of counts referring to the number of data elements in each area.

After the tree is created, it is traversed in top-down fashion and the area with the most polygons at a given point is split into its two components. This process is repeated by considering all areas created thus far, splitting the one with the next most polygons. The splitting process stops when the desired number of tasks has been reached. A count of the number of polygons in each small area is used, so it is not necessary to sequentially go through the entire list of potential polygons to determine which polygons are relevant to each area at this time. The limiting factor in the splitting is the leaf level, which is why a fairly dense mesh is created at the beginning. An example of this type of decomposition is illustrated graphically in figure 5.18 and also in color plate 3.

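The splitting pass described above can be sketched as follows. This is a minimal illustration, not the actual implementation: the node layout, the flat areas[] array, and the linear scan for the largest area are simplifying assumptions.

```c
/* Sketch of the top-down splitting pass -- illustrative only.  The real code
 * builds its tree of counts from the dense screen mesh; here the tree is
 * assumed to already exist. */
#include <stddef.h>

typedef struct Node {
    int count;                /* polygons whose bounding boxes touch this area */
    struct Node *child[2];    /* NULL for leaf cells of the dense mesh         */
} Node;

/* Split the area with the most polygons until 'tasks_wanted' areas exist
 * (simplification: stop if the largest remaining area is already a leaf). */
static int SplitTopDown(Node *root, Node **areas, int tasks_wanted)
{
    int n = 0;
    areas[n++] = root;                          /* start with the whole screen */
    while (n < tasks_wanted) {
        int big = 0;
        for (int i = 1; i < n; i++)             /* area with the largest count */
            if (areas[i]->count > areas[big]->count) big = i;
        if (areas[big]->child[0] == NULL)       /* leaf level limits splitting */
            break;
        Node *victim = areas[big];
        areas[big] = victim->child[0];          /* replace it with its two     */
        areas[n++] = victim->child[1];          /* components                  */
    }
    return n;                                   /* number of task areas made   */
}
```

Because each step only compares the stored counts, no polygon lists are touched during splitting, which is what keeps the pre-tiling overhead small compared with sorting all polygon centroids.
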
After the tree has been traversed, each of the regions is available for rendering in parallel. Some computational overhead exists for this scheme prior to the tiling phase, but fewer tasks are created than in the previous rectangular region approach. Figure 5.19 shows the performance for the various images using the data adaptive approach, with a value of R = 10. This value of R was determined empirically, similar to the methods used previously. It is less than the value needed for the rectangular region scheme for good load balancing. A perfect match would result in a ratio of R = 1, but that situation is almost impossible to achieve using a heuristic which has minimal overhead cost. The relative speedup for the top-down scheme is shown in figure 5.20.

The time to build the tree data structure is not included in these timings since it is not part of the tiling section of the program. This time is fairly small anyway, and it is included in the overall algorithm comparison presented in chapter 6. We now analyze the top-down decomposition method with regard to the possible overhead factors.

Figure 5.18: Top-down partitioning scheme

5.2.1.1. Scheduling (0.003% - 0.01%)

This partitioning scheme uses regions that are not the same size, so each background task does not take the same time. The areas consist of groups of scan lines as before, but the number of scan lines and their size differ.

The average time to render the different background areas was measured for the different images. The results were fairly consistent, with an average background task time of 4.48 msec.

Figure 5.19: Top-down decomposition performance

Figure 5.20: Speedup of top-down decomposition method

This is more than Tsched (2.3 msec), which is the time it takes to schedule 96 tasks, so no bottleneck will occur due to scheduling. As before, the scheduling overhead is determined by plugging the values for this algorithm into equation 4.5, as shown in equation 5.6. Using this equation for 96 processors, the scheduling overhead for the test images ranges from 0.003% for the mountain image to 0.01% for the stegosaurus image.

\[
\text{Scheduling \%} \;=\; \frac{(96 - 1)\cdot 24\,\mu\text{sec} \;+\; (960 - 96)\cdot 24\,\mu\text{sec}}{T_{96}\cdot 96} \times 100\% \qquad (5.6)
\]

5.2.1.2. Communication Overhead (0.02% - 0.04%)

The communication overhead is measured the same way as in the previous algorithm, by determining the number of bytes transferred in the system and using equation 4.7 to calculate the overhead. The values vary from 0.02% for the stegosaurus image to 0.04% for the mountain image. The communication overhead percentage in this algorithm is slightly less than in the rectangular region (LC) method since there are fewer areas.

5.2.1.3. Network Contention (11.8% - 34.9%)

Unfortunately, network contention is a significant factor in this algorithm, even more so than in the previous one. The network contention overhead ranges from 11.8% for the mountain image to 34.9% for the stegosaurus image. The reason for this increase in network contention is given here.

As was explained at the beginning of this section, a dense 2D mesh is created, from which small regions are clustered together to form tasks. The LC scheme requires communication from each of these small regions which form the larger clusters in order to obtain the data necessary for rendering a particular task area. Figure 5.21 illustrates this situation.

In order to render the cluster composed of sub-regions 1, 2, 3, and 4, it is necessary to retrieve the polygons from these sub-regions. This requires a block transfer from each of the sub-regions, whereas the rectangular region algorithm requires only one block transfer for the entire region. There may be even more than four sub-regions which are part of a larger cluster. Although the total amount of data is not large (evident from the communication factor given previously), the number of messages is higher than in the rectangular region algorithm due to this copying from sub-regions. In addition, the frequency of these communications is greater since they proceed one right after another. The block transfer mechanism in the GP1000 which is utilized in the LC scheme holds a message path open for as long as it is needed to transfer the data. Therefore, more collisions are likely to occur in this algorithm due to the increased number of messages required, resulting in high network contention.

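A short sketch of this gathering step makes the message-count difference explicit (the types and the BlockTransfer stand-in are illustrative, as before, not real system calls):

```c
/* Sketch: gathering a clustered task's data under the LC scheme.
 * One BlockTransfer per sub-region, so a cluster of k sub-regions opens
 * k separate switch paths (vs. one for a plain rectangular region task). */
#include <stdlib.h>
#include <string.h>

typedef struct { float plane[4]; } Poly;
typedef struct { Poly *polys; int count; } RegionBucket;   /* in shared memory */

static void BlockTransfer(void *dst, const void *src, size_t n) { memcpy(dst, src, n); }

static int GatherCluster(const RegionBucket *sub, int k, Poly *local, int capacity)
{
    int total = 0;
    for (int i = 0; i < k; i++) {                    /* k messages, back to back */
        if (total + sub[i].count > capacity) break;  /* illustrative bound check */
        BlockTransfer(local + total, sub[i].polys, sub[i].count * sizeof(Poly));
        total += sub[i].count;                       /* duplicates across sub-   */
    }                                                /* regions are not detected */
    return total;
}
```
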
5.2.1.4. Load Imbalance (1.5% - 6.9%)

The goal of better load balancing was achieved in this algorithm, using a smaller granularity ratio than the rectangular region approach. The percentage overhead for load imbalance varies from 1.5% for the stegosaurus and mountain images to 6.9% for the Laser image. This algorithm achieves better load balancing than the previous algorithm, with minimal expense required to build the hierarchical tree data structure. It therefore overcomes the limitation noticed in Whelan's and Roble's algorithms, which also used a data adaptive scheme. More details on the overhead time required for the tree construction are given in chapter 6.

5.2.1.5. Code Modification (2.5% - 3.3%)

The overhead due to code modification is much smaller than in the rectangular region approach. This overhead ranges from 2.5% for the mountain image to 3.3% for the Laser image. The reason for the reduction is that there are fewer total tasks and each task area is larger, reducing the overall coherence loss.

Figure 5.21: Block transfer of data from sub-regions for top-down decomposition

Looking back at figure 5.21, it can be seen that a number of polygons are likely to cross several sub-regions while being contained entirely within the main region to be rendered. Unfortunately, short of a direct comparison of all polygons, there is no way to detect whether a given sub-region is sending the same polygon as another sub-region, due to the usage of the LC memory referencing scheme. If a polygon is sent from two or more different sub-regions as a result of its overlapping these regions, that polygon is rendered more than once. This is a direct function of the duplication factor for the given mesh size. The overhead of this occurrence is difficult to determine, since not all polygons which are duplicated are rendered more than once, only those that are duplicated across sub-regions and are part of the same higher region. This duplication of rendering is included in the code modification overhead given previously.

5.2.1.6. Explanation of Results

The goal of the data adaptive top-down scheme is to maintain good load balancing. The implementation here achieves this goal, but due to the method of data transfer required by the LC scheme, additional contention is introduced. There is also the additional cost of constructing the tree data structure, but this cost is offset by the reduction in the number of regions, resulting in reduced code modification overhead. The times for the tree building are not included here since this chapter deals with a comparison of the algorithms' tiling sections, but they are given in the next chapter. The chart in figure 5.22 shows the overhead comparison for the various images.

Figure 5.22: Degradation factors for top-down decomposition (P = 96)

It can be seen that all of the overhead factors have been reduced compared to the previous approaches, with the exception of network contention. This algorithm requires a dense mesh to be created for determination of the regions. As P is increased, the mesh will need to be even denser, and this may result in even higher network contention overhead and duplication of polygons. As a result, this algorithm may not exhibit good scalability for very dense meshes.

It might be possible to create the mesh in some other manner which does not result in as much overhead, but other methods were not explored here. For example, if one were to try to determine the clusters from the top down, a pseudo-parallel method could be used whereby tasks are spawned off according to the level of the tree traversed. A large amount of synchronization would be necessary to implement this technique, and the result might involve more overhead than in the current implementation. One of the problems with the algorithms discussed thus far is that they rely on a good choice for the granularity ratio. Unfortunately, empirical testing must be employed to determine the best value for a given situation. In fact, it is possible that the value might need to be changed when the number of processors is increased significantly beyond 96. The next section covers an algorithm that does not rely on a predetermined granularity ratio, but instead achieves load balancing by dynamically partitioning existing tasks into smaller ones when a processor needs work.

5.3. Task Adaptive Partitioning Scheme

The task adaptive methodology relies on an algorithm's capability to dynamically partition tasks as the program is running. If tasks cannot be partitioned in this way, then the algorithm is not well suited for dynamic task splitting. Fortunately, the serial scan line Z-buffer algorithm upon which these parallel algorithms are based consists of independent regions, and there is no required order of execution between these regions. The task adaptive algorithm consists of the following steps:

1. When a processor needs work (call this processor P_s), it searches among the other processors for the one which has the most work left to do (call this processor P_max).
2. P_s then sets a lock preventing any other processor from splitting P_max.
3. P_s partitions P_max's work into two segments; the first segment goes to P_max and the second segment goes to P_s.
4. P_s then copies the data necessary for it to work on the second segment.
5. P_s unsets the lock and starts doing its work.

This task adaptive scheme could be tacked onto any of the previous algorithms, so that additional load balancing would be ensured toward the end of the computation. For the implementation here, the rectangular region decomposition scheme was chosen as the basis of the parallel algorithm, since it is fairly simple to work with in developing the heuristic for step 1. A description of this parallel algorithm is given next.

Instead of attempting to choose an optimal granularity ratio, the number of areas is initially set equal to the number of processors (R = 1). When a processor has finished computing its area, it executes steps 1 through 5 above. In order to do this, it was necessary to come up with a method for determining the amount of work a given processor has left to do. Since all of the areas are the same size, the number of scan lines left to render in a particular area is used as an indication of how much work there is left on a given processor. This proceeds as follows.

During the tiling portion of the computation, each processor updates a shared variable corresponding to the number of scan lines it has left to compute. P_s quickly runs through these variables checking for the processor that has the maximum number of scan lines left. Once it finds the processor with the most scan lines left (P_max), P_s proceeds to split P_max as is shown in figure 5.23. Color plate 4 shows an illustration of this process after completion. P_max is not interrupted during this time.

The splitting mechanism prevents a race condition from occurring if several processors attempt to split the same region simultaneously or, alternatively, if P_max attempts to work on a portion of its region which is about to be split. The first case is solved by using a test-and-lock methodology in which a splitting processor checks to see whether P_max is currently being split and, if so, finds another processor to split. The second case is solved by updating a shared variable which P_max checks to determine the last scan line it should calculate. Neither case requires P_max to be interrupted from its work, thus avoiding any synchronization delay.

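As a rough sketch of this protocol, using C11 atomics in place of whatever primitives the GP1000 implementation actually uses (names, layout, and the threshold constant are assumptions throughout):

```c
/* Sketch of the task-adaptive split (steps 1-5) -- illustrative only. */
#include <stdatomic.h>
#include <stdbool.h>

#define P 96
#define SPLIT_THRESHOLD 2        /* tunable; see the threshold discussion in the text */

/* Shared state, one entry per processor's current region. */
static atomic_int  lines_left[P];   /* scan lines remaining to render           */
static atomic_int  last_line[P];    /* last scan line the owner should render   */
static atomic_bool being_split[P];  /* true while someone is splitting a region */

/* Called by processor 'self' (P_s) when it runs out of work.  On success,
 * [*y0, *y1] is the stolen second segment of P_max's region. */
static bool StealWork(int self, int *y0, int *y1)
{
    for (;;) {
        /* Step 1: find P_max, the processor with the most scan lines left. */
        int max = -1, most = SPLIT_THRESHOLD;
        for (int p = 0; p < P; p++) {
            int n = atomic_load(&lines_left[p]);
            if (p != self && n > most) { most = n; max = p; }
        }
        if (max < 0)
            return false;                       /* nothing worth splitting      */

        /* Step 2: lock P_max so no other processor splits it concurrently. */
        if (atomic_exchange(&being_split[max], true))
            continue;                           /* lost the race; look again    */

        /* Step 3: split the remaining scan lines into two segments. */
        int end   = atomic_load(&last_line[max]);
        int left  = atomic_load(&lines_left[max]);
        int split = end - left / 2;             /* P_max keeps the first half   */
        atomic_store(&last_line[max], split);   /* P_max checks this as it runs */
        atomic_fetch_sub(&lines_left[max], end - split);

        *y0 = split + 1;                        /* Step 4 happens in the caller: */
        *y1 = end;                              /* block-copy (LC) the data for  */
                                                /* these scan lines              */
        /* Step 5: unlock and go render the stolen segment. */
        atomic_store(&being_split[max], false);
        return true;
    }
}
```

P_max never blocks: it simply re-reads last_line[] as it finishes each scan line, which is the shared-variable check described above.
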
A threshold must be chosen which limits partitioning of tasks when the cost of the actual partition exceeds the cost of running the task serially. Through empirical testing, it was determined that partitioning a task with only two scan lines left does in fact yield good per

