machine rather than simulated, it is also possible to determine quantitatively how well each implementation has succeeded.

2.3. Conclusions
In the first section of this chapter, a number of criteria are given to evaluate the different parallel decompositions which are presented in the second section. These criteria for evaluation include granularity, type of parallelism, use of coherence, load balancing characteristics, methods of data access, and scalability.

In the second section, a number of past as well as yet untested parallel approaches are presented and categorized into a taxonomy. The image space pixel decompositions based on areas of pixels seem to hold the most promise for high performance on an MIMD architecture. In chapter 5, implementations of these approaches are described in an effort to show conclusively whether one technique is optimal. Since most of the past work has involved simulations and not multiprocessor implementations, these implementations allow us to compare different task decompositions and memory referencing strategies on an equal basis for the first time.
3. Issues in Parallel Algorithm Development
In this chapter, different advanced parallel computer architectures are compared according to their suitability for implementation of a graphics display algorithm.

In the first section, the architectures are presented and analyzed with regard to the development of a parallel graphics rendering algorithm. Although SIMD architectures (single instruction, multiple data path) have been used for graphics applications in the past, this mode of operation generally requires task execution in lockstep fashion. Work done at the pixel level can be accomplished in this manner, but higher level tasks are not well suited to this type of parallel approach.
The architectures investigated here are restricted to MIMD machines (multiple instruction, multiple data path), since different tasks can proceed simultaneously under separate control flow. These machines can be configured with small to large processor counts, allowing flexibility in performance versus cost. An inexpensive MIMD architecture can be obtained for as little as $10,000, while extremely high performance machines can cost several million dollars. The algorithms presented here are designed to be useful on a small system containing only 2 processors, as well as on a large system of 100 or more processors.
The second section involves a comparison of the two main architectural choices in MIMD hardware. These two methods, message passing and shared memory, are analyzed with regard to a parallel graphics implementation. Finally, we describe the programming environment specific to the BBN Butterfly multiprocessor, since this machine was chosen for the implementation comparisons here.
3.1. Architectural Choices
There are currently a number of commercial message passing architectures (Intel iPSC, NCube, Inmos transputer based machines) and shared memory multiprocessors (Sequent Balance, Encore Multimax, Alliant FX/2800, and BBN Butterfly[1]) on the market. Coarse grained MIMD architectures such as those offered by Cray, Convex, Silicon Graphics, and others allow only limited parallelism. The issues in developing programs for the latter machines are not as pronounced as for the previously mentioned machines due to their small processor counts (typically 2 to 8 processors).
In this section, we describe a few specific commercial machines to illustrate the differences among the classes of architectures. These differences allow us to evaluate how well the various types of computer image generation algorithms can be expected to run on a variety of parallel computers.

The first subsection describes the impact of using conventional MIMD hardware to perform computer graphics rendering and gives the issues by which the architectures are evaluated. The large amount of memory used in a graphics application, as well as the high volume of data movement involved, influences the choice of an architecture suitable for this type of application.
Distributed memory architectures are useful for applications where the data can be partitioned initially among the nodes[2] of the system with little communication thereafter. These types of machines typically have a high message passing cost, and any time spent communicating is time wasted from computation. The Intel hypercube family of machines are examples of distributed memory architectures in which the processors are connected in a hypercube topology. Although hypercube connections are a common design, Intel is also experimenting with a mesh design on what is currently reported to be the fastest computer in the world: the prototype Intel Touchstone machine installed at the California Institute of Technology.

[1] BBN Advanced Computers, Inc. is no longer marketing the Butterfly, although it is still being supported.
[2] The terms processors and nodes are used interchangeably here.
The BBN Butterfly contains physically distributed memory, but it is classified as a shared memory architecture since remote memory can be logically shared. The Encore Multimax is an example of a global shared memory architecture which uses caching to speed up references to the memory modules. These machines are analyzed as representative examples of their genre in the second subsection.
3.1.1. Impact of Graphics Rendering on System Requirements

Utilizing a multiprocessor for an application such as computer graphics rendering imposes certain demands on the system that other applications might not introduce. The large amount of data movement in this type of algorithm presents unusual problems for certain types of architectures. Following are two characteristics of graphics display algorithms which can affect the performance of the computer architecture used to implement the software algorithm.
3.1.1.1. Image Quality

In this book, we are primarily interested in trying to achieve good performance when rendering highly complex images with a graphics display algorithm on a parallel processor. This increase in image complexity can arise from several factors given in [Whit89]:

Anti-aliasing. This is a correction mechanism for the typically inadequate sampling of high frequencies in a computer generated scene. More information about the geometric structure of the scene must be available to do anti-aliasing, and this affects the size of the data structures and the amount of information which must be shared by each processor. In general, polygon fragment or sub-pixel information must be stored to perform anti-aliasing.
Mapping. The data structures required for texture, bump, and reflection mapping are very large and must be shared by a large number of processors, which increases the communication between processors. These mapping operations enhance the quality of the image by simulating different types of surface attributes for the objects in the scene.

Shadows. Some shadow casting algorithms require data structures as large as those required for texture, bump, and reflection mapping, with the same resultant communication problems. If a Z-buffer shadow algorithm is used, the visibility calculations are repeated for each light source, which adds time complexity [Will78]. The shadow volumes technique requires additional geometric data, which adds to the memory requirement [Crow77].

Resolution. Instead of generating 640 x 484[3] images, 1280 x 968 or even greater pixel resolution is desired to enhance image quality.

Data Elements. An increase in the number of geometric elements provides enhanced realism in the scene.
Anti-aliasing, mapping, and shadowing increase the complexity of the rendering calculations. Increased resolution raises the number of rendering calculations necessary since more pixels are displayed. Shadow casting and higher resolution increase the overall realism in the scene description by providing more detail. Each factor which increases image quality introduces distinct problems into the parallelization process. The random access memory referencing patterns associated with mapping and shadow casting can degrade performance significantly if this data is not managed effectively. The implementation of these factors in a parallel environment is strongly dependent on the decomposition and general memory referencing scheme chosen. Because this book focuses primarily on the analysis of decomposition and memory referencing strategies, the emphasis here is on anti-aliasing, resolution, and the number of data elements, leaving mapping and shadowing to be analyzed in a future work.

Both greater resolution in image size and larger datasets require more memory to be available in the system. If the frame buffer is stored internally in RAM and resolution is increased from 640 x 484 to 1280 x 968, the memory storage requirements quadruple. If a Z-buffer [Catm74] or A-buffer [Carp84] hidden surface algorithm is used, even more memory is needed due to the additional data stored per pixel. For a 1280 x 968 image maintaining 4 bytes for red, green, blue, and coverage in addition to 4 bytes for the z-value, 8 megabytes of memory is needed. This memory needs to be accessible by all of the processors.

[3] This refers to a display resolution of 640 pixels across by 484 pixels down the screen, which is standard video resolution.
An increase in the number of geometric elements also requires a corresponding linear increase in the memory required. If we assume that the elements used are quadrilateral polygons, each polygon requires 12 bytes to store each point, 12 bytes to store each normal, and 10 bytes (minimally) to store the connectivity information. This adds up to 106 bytes per polygon, although we can in general assume that each normal and point are shared by 4 other polygons. This results in 32 bytes per polygon. Based on these values, the amount of memory can be determined for different levels of image quality, as shown next.

The following are scenarios for memory usage based on image quality. Memory requirements for each image quality level are variable within a certain range, depending on the features included and the algorithm chosen.
Case 1. A "low quality" image generated today might involve 1,000 to 10,000 geometric elements at a resolution of 640 x 484 (standard video resolution). Memory requirement for data: 32K up to 320K.

Case 2. A "normal" image would involve 10,000 to 70,000 geometric elements and include at least anti-aliasing and possibly additional visual effects. Resolution would be 640 x 484 up to 1280 x 968. Memory requirement for data: 320K up to 2.2 megabytes.

Case 3. A "high quality" image would involve 70,000 up to 1,000,000 geometric elements and include anti-aliasing and one or more visual effects. Resolution would be 1280 x 968 up to 4K x 4K. Memory requirement for data: 2.2 megabytes up to 32 megabytes.
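To make the arithmetic behind these cases explicit, the short C program below (a sketch of our own, not part of any rendering system) reproduces the data-memory figures from the 32-bytes-per-polygon estimate given above.

    #include <stdio.h>

    /* Per-polygon data estimate from the text: roughly 32 bytes per
       quadrilateral once points and normals are shared with
       neighboring polygons. */
    #define BYTES_PER_POLYGON 32.0

    int main(void)
    {
        long cases[3][2] = {
            {  1000,   10000 },   /* Case 1: "low quality"  */
            { 10000,   70000 },   /* Case 2: "normal"       */
            { 70000, 1000000 }    /* Case 3: "high quality" */
        };

        for (int i = 0; i < 3; i++) {
            double lo = cases[i][0] * BYTES_PER_POLYGON / 1.0e6;
            double hi = cases[i][1] * BYTES_PER_POLYGON / 1.0e6;
            printf("Case %d: %.2f to %.2f megabytes of polygon data\n",
                   i + 1, lo, hi);
        }
        return 0;
    }

Running it prints 0.03 to 0.32, 0.32 to 2.24, and 2.24 to 32.00 megabytes, matching the ranges listed above.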
These estimates do not include storage for maintaining edge lists and interpolation values, in addition to the space required for advanced features such as anti-aliasing. Frame buffer or Z-buffer storage in RAM will make matters worse at all levels. The actual memory requirements are at least double and possibly quadruple the values given previously for the data storage. Access to all of this data remotely on a shared memory machine will likely cause problems with excessive usage of the interconnection network unless the data is carefully managed. On message passing architectures, access to remote data requires knowledge of the location of the data and a complex mechanism for distributing it among the processors in the system. In addition, the time to pass the data back and forth in a message passing machine will degrade performance. One of the reasons to use a multiprocessor for graphics rendering is to handle large and complicated scenes. Therefore, the memory requirements and manipulation of data in the system need to be carefully evaluated to reduce the overhead effects.
3.1.1.2. I/O

Some advanced architectures use parallel disk setups such as the data vault mechanism in the Connection Machine. In general, though, most multiprocessors employ only a single disk for I/O. The large scene descriptions required in various applications may take several seconds up to many minutes to read from disk. This can only be accomplished sequentially on a conventional von Neumann architecture. In a parallel machine, however, this presents a performance bottleneck if only one processor interacts with the disk. One can see that it is desirable to exploit some type of parallelism in disk usage. At the end of the graphics computation, an image is sent out either to an external frame buffer or to the disk for storage. This operation should also be parallelized. The assumption in this book is that only a single disk is available for I/O operations, so parallelism may take the form of pipelining. For machines which allow parallel I/O, performance will greatly benefit if this feature is exploited.

As was stated in chapter 1, the I/O phases of a display algorithm are not the focus of this book. Too often, though, computer graphics specialists have ignored this portion of the program in their parallel algorithms. The I/O can directly affect how the algorithm is structured, and optimization might not be feasible within the context of any given parallel tiling algorithm. An example parallel implementation of the I/O operations is illustrated later in this text.
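The pipelining mentioned above can be sketched with standard C and POSIX threads (our own illustration, not the Butterfly environment described later): one thread reads the next block of the scene description from the single disk while the main thread processes the block just read. The routine parse_scene_block named in the comment is a hypothetical stand-in for the sequential parsing work.

    #include <pthread.h>
    #include <stdio.h>

    #define BLOCK 65536

    /* Two-stage pipeline state: a reader thread fills a double buffer
       from the single disk while the consumer processes the previous
       block. */
    struct pipe_state {
        FILE *fp;
        char buf[2][BLOCK];
        size_t len[2];
        int ready[2];            /* 1 when buf[i] holds unprocessed data */
        pthread_mutex_t lock;
        pthread_cond_t cond;
    };

    static void *reader(void *arg)
    {
        struct pipe_state *s = arg;
        int which = 0;
        for (;;) {
            size_t n = fread(s->buf[which], 1, BLOCK, s->fp);
            pthread_mutex_lock(&s->lock);
            while (s->ready[which])          /* wait until buffer is free */
                pthread_cond_wait(&s->cond, &s->lock);
            s->len[which] = n;
            s->ready[which] = 1;
            pthread_cond_broadcast(&s->cond);
            pthread_mutex_unlock(&s->lock);
            if (n == 0)                      /* end of file */
                return NULL;
            which ^= 1;
        }
    }

    int main(int argc, char **argv)
    {
        struct pipe_state s;
        pthread_t tid;
        int which = 0;

        if (argc < 2 || (s.fp = fopen(argv[1], "rb")) == NULL) {
            fprintf(stderr, "usage: pipeio scenefile\n");
            return 1;
        }
        s.ready[0] = s.ready[1] = 0;
        pthread_mutex_init(&s.lock, NULL);
        pthread_cond_init(&s.cond, NULL);
        pthread_create(&tid, NULL, reader, &s);

        for (;;) {                           /* consume blocks in order */
            size_t n;
            pthread_mutex_lock(&s.lock);
            while (!s.ready[which])
                pthread_cond_wait(&s.cond, &s.lock);
            n = s.len[which];
            pthread_mutex_unlock(&s.lock);

            if (n == 0)
                break;
            /* parse_scene_block(s.buf[which], n);  hypothetical; this
               work overlaps with the reader fetching the next block */

            pthread_mutex_lock(&s.lock);
            s.ready[which] = 0;              /* return buffer to reader */
            pthread_cond_broadcast(&s.cond);
            pthread_mutex_unlock(&s.lock);
            which ^= 1;
        }
        pthread_join(tid, NULL);
        fclose(s.fp);
        return 0;
    }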
3.1.2. Message Passing

In a message passing architecture, all of the processors are connected by an interconnection network through which messages are passed. This is the only form of communication between processors because they do not share any memory. The Intel iPSC is an example of this type of architecture. The processor nodes in the iPSC are connected in a hypercube fashion whereby, in a fully configured system containing 2^n nodes, each processor can communicate with the n other nodes whose addresses differ from its own by exactly one bit. The more recent versions of the Intel hypercube use a type of message passing mechanism called "wormhole" routing [Nuge88]. This routing mechanism allows processors to communicate with each other directly even if they are not directly connected. A path which is held open for the entire length of the message is created from the source to the destination. This type of path is only created for sufficiently long messages, to prevent unfair use of the interconnection network.

The programmer's burden of "mapping" a parallel algorithm onto the hypercube is lessened since optimizing the algorithm for nearest neighbor communication is no longer a necessary consideration for this architecture. The issue of multiple messages contending for portions of the same path still exists in this scheme, since the wormhole is maintained as long as a message is being transferred. If another route is not available, any other messages vying for a portion of this path must wait. In addition to the hypercube connections between nodes, there is another processing node, called the cube manager, which is connected to all the processors via an ethernet link. A three-dimensional hypercube is illustrated in figure 3.1.

Figure 3.1: Example of hypercube architecture (8 nodes)
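The addressing rule just described is easy to express in code. The short C sketch below (ours, not from the text) prints the n neighbors of a node in a 2^n-node hypercube; each neighbor's address differs from the node's address in exactly one bit.

    #include <stdio.h>

    /* Print the direct neighbors of a node in an n-dimensional hypercube.
       Two nodes are connected exactly when their addresses differ in one
       bit position. */
    static void print_neighbors(unsigned node, unsigned n)
    {
        for (unsigned bit = 0; bit < n; bit++)
            printf("node %u <-> node %u\n", node, node ^ (1u << bit));
    }

    int main(void)
    {
        /* The 3-dimensional (8-node) cube of figure 3.1: node 000
           connects to nodes 001, 010, and 100. */
        print_neighbors(0, 3);
        return 0;
    }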
Some choices for algorithm decomposition favor one type of architecture over another solely because of the method of memory distribution. Much work has been directed at the problem of mapping algorithms to architectures [Berm87], but no single methodology is optimal across different architectural models and algorithmic paradigms. Since the optimal mapping problem is NP-complete in most realistic settings [Chen88], heuristics are usually used. Sadayappan and Ercal [Sada87] have developed a technique in which data locality is exploited by a nearest neighbor mapping to reduce communication costs in a mesh architecture. This approach would be applicable to a graphics display application when it is to be implemented on such a machine.
The Intel hypercube has several limitations which make this machine a less likely candidate for a graphics image generation implementation. This type of architecture is primarily intended for problems which exhibit high parallelism and high computation costs, but little data movement. The cost of sending messages between the cube manager and each processor, as well as between the processors themselves, is very high and is only reduced when a wormhole is used. A message must contain more than 256 bytes in order for wormhole routing to occur. The low bandwidth of the ethernet and the high set-up time for messages allow only limited dynamic load balancing to be accomplished, since data movement is costly. For a graphics display algorithm, the cost of propagating part or all of the database to the nodes is high, as is the cost of retrieving the rendered pixels. Parallel I/O cannot be achieved in this machine since only the cube manager processor can access the disk. The Intel hypercube was not designed for high data movement applications, and therefore the bottleneck created by the cube manager for disk I/O, as well as the cost of message passing, limits the use of this machine for graphics algorithms.

A reasonable solution might involve an initial communication in which the data is scattered throughout the processors, with successive communication involving very small amounts of data. This would be viable in a graphics context in which the computing cost overwhelmingly outweighs the communication cost, such as in a ray tracing program. Badouel [Bado90] has used a ray tracing algorithm with good success on the iPSC by employing a caching scheme for subsequent communication after the initial data distribution.
3.1.3. Shared Memory

There are a number of distinct interconnection network strategies for connecting processors to a global memory. Next we describe two types of shared memory multiprocessor architectures: bus-based tightly coupled shared memory and multistage switch-based shared memory.
3.1.3.1. Bus-based Shared Memory

The shared memory paradigm allows the programmer to think in terms of parallel tasks, rather than assigning tasks to processors as in a message passing design. The Encore Multimax is an example of a shared memory multiprocessor which uses a single bus for processor to memory communication. Other bus-based systems may use multiple buses for faster communication and fewer conflicts. The Multimax can contain up to 20 processors and from 32 to 128 megabytes of memory. A drawing of this type of architecture is given in figure 3.2, where P indicates a processor.

The Encore's primary limitation is the fixed bandwidth of the bus, which restricts the number of processors that can be used efficiently. This does not allow testing of the algorithm on large scale processor configurations. Another factor weighing against this type of architecture is the notion of a single contiguous memory shared by all processors. Processors do not have any local memory to access, and therefore contention for the bus can become a performance degradation factor even when referencing data that does not need to be shared. A memory cache provides a form of local access and alleviates this bottleneck somewhat. For some algorithmic choices, and for initial implementation and debugging, bus-based architectures could be a good choice for a graphics image generation algorithm, but they do not provide the scalable performance that is necessary to achieve fast processing of large graphics databases.
Figure 3.2: An example of a bus-based multiprocessor architecture
3.1.3.2. Switch-based Shared Memory

The BBN family of multiprocessors (which includes the Butterfly GP1000 and TC2000) are shared memory multiprocessors which utilize a complex interconnection network to connect processors to shared memory modules. For the purposes of this discussion, we restrict ourselves to analyzing the architecture of the Butterfly GP1000. The Butterfly GP1000 is a scalable multiprocessor which can be configured with 1 to 256 processors, each containing 4 megabytes of memory. The network in this machine is built up from a basic 4 x 4 crossbar switch, which allows simultaneous communication between processors and memory as long as no two processors try to communicate with the same memory module at the same time. This switch is shown in figure 3.3. A description of the advantages of this type of switch over a bus-based interconnection network is given in [BBN84].
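To make the crossbar's conflict rule concrete, the small C sketch below (our own illustration, not drawn from [BBN84]) checks whether a set of processor-to-memory requests can all be routed simultaneously through a 4 x 4 switch; a conflict exists exactly when two processors address the same memory module in the same cycle.

    #include <stdio.h>

    #define PORTS 4   /* 4 x 4 crossbar: 4 processors, 4 memory modules */

    /* Return 1 if every processor requests a distinct memory module, so
       all transfers can proceed at once; return 0 if there is contention. */
    static int conflict_free(const int request[PORTS])
    {
        int claimed[PORTS] = { 0 };
        for (int p = 0; p < PORTS; p++) {
            if (claimed[request[p]])
                return 0;        /* two processors want the same module */
            claimed[request[p]] = 1;
        }
        return 1;
    }

    int main(void)
    {
        int ok[PORTS]    = { 0, 1, 2, 3 };  /* P1..P4 -> M1..M4: no conflict */
        int clash[PORTS] = { 0, 1, 1, 3 };  /* two processors both want M2   */

        printf("%d %d\n", conflict_free(ok), conflict_free(clash));
        return 0;
    }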
The memory in this machine can be logically shared, but it is physically distributed across a multi-stage network switch. The processors each have access to their own local memory, as well as to remote memory modules, by making references across the switch. This puts the Butterfly into the NUMA (non-uniform memory access) class of shared memory multiprocessors, since local data access is faster than remote data access. Other NUMA machines include the Cedar project [Kuck86] from the University of Illinois as well as the NYU Ultracomputer [Gott83]. The software interface provided for the programmer on the Butterfly makes this remote referencing transparent. The interconnection network provides very high performance, with a 32 megabit/second communication bandwidth. This network is illustrated in figure 3.4.

There is no notion of a single main processor in the Butterfly, although one of the processors is connected to the multibus and serves as the processor through which I/O is accomplished. This could create a possible I/O bottleneck similar to the one stated previously for the Intel iPSC. In the case of the Butterfly, however, each processor can access the disk transparently through the I/O processor, whereas the cube manager is the only processor which can access the disk in the iPSC. This transparent disk access allows any processor to perform a disk read or write operation, although semaphores may be required to prevent interference. The I/O bottleneck still exists in the GP1000, but it is easier to program the reading in of data in the GP1000 than in the iPSC.
Figure 3.3: Connection of processors to memory with a 4 x 4 crossbar switch
Figure 3.4: Multi-stage interconnection network in the BBN Butterfly (each processor is connected through a multi-stage interconnect to remote memory)
3.2. Comparison of MIMD Methodologies

In comparing shared memory and message passing architectures for a graphics display algorithm, several items are worth considering. One important consideration when choosing an architecture for implementation is that the amount of memory to be managed when generating high quality imagery can grow to be very large. Based on the algorithmic requirements given previously, it is desirable to develop parallel graphics algorithms on a machine with the following characteristics:

1. Ease of programming
2. A high performance interconnection network
3. Scalability to large processor configurations
First, the shared memory programming paradigm is generally considered easier than message passing for the programmer to work with in developing code for an MIMD computer. The reason is that data which all processors need to access can be stored once in the system, in logically shared memory. It does not need to be copied to each processor, which would otherwise waste time and/or space. Nor does the actual physical memory module location need to be specified by the programmer and sent to each processor. This is handled by the operating system (or hardware) as if the collection of memory modules were in one global address space (in the case of the Butterfly, for instance).
Secondly, the interconnection network performance in shared memory architectures is generally better suited for frequent communication of very small to very large packets of data. This is not usually the case for message passing machines. On the other hand, recent work described in [Nitz91] indicates that it is feasible to simulate a shared memory environment on a distributed memory parallel computer. In the future, this type of architecture and programming methodology might provide the type of scalable performance for which a graphics application would be well suited. Message passing architectures would be well suited to complex image generation algorithms such as ray tracing or radiosity, because the cost of image generation is amortized when the task time is large in comparison to the data transfer time.

The performance of the Butterfly multistage interconnection network is much higher than that of the interconnection network in the iPSC, although the former is more costly. In addition, the Butterfly network can scale to large processor configurations, and can therefore provide better performance than a bus-based shared memory system.
Since the Encore Multimax and other bus-based systems are limited in the number of processors they can support, low or normal quality image generation would fit well onto that type of architecture. Higher quality imagery demands more compute power, as well as more memory, than is normally available on bus architectures. One exception is the Alliant FX/2800, which uses up to 28 Intel i860 processors. The number of processors cannot be increased beyond this, but very high performance has been obtained on this system. Still, the Butterfly TC2000 is a faster version of the GP1000 and does provide the scalability necessary to render extremely large datasets. If judicious distribution of graphics data is used, memory latency can be reduced to a negligible overhead.
Although all of these architectures are suitable for graphics algorithms, the Butterfly environment provides a stronger case for high performance. In summary, with regard to the issues discussed previously, the BBN Butterfly is the best example of a shared memory multiprocessor which meets these requirements. The distributed memory modules within the Butterfly allow the programmer to take advantage of local memory access while a global view of shared memory is provided. The performance of the interconnection network and memory modules is better than that of a bus-based system such as the Encore Multimax, and the number of processors that the machine is capable of supporting allows massive parallelism.
The BBN Butterfly GP1000 at The Ohio State University's Computer and Information Science Department was used for the primary development and debugging of the algorithms presented in this book. This machine had only 10 processors, so it was desirable to test the programs on a larger machine. The Naval Research Lab, the Georgia Institute of Technology, and Michigan State University all provided access to machines with larger configurations for this purpose. Final testing was done on the Butterfly GP1000 located at the headquarters of BBN Advanced Computers, Inc. This machine contains 107 processors, but we limited testing of the programs to 96 processors. Under all circumstances, this machine was not being used by others at the time of testing, so no other processes interfered with the timings. The BBN TC2000 is their next generation multiprocessor, which contains numerous enhancements to the design used in the GP1000. For some of the tests given in chapter 5, access was provided to a 47 processor TC2000 at Argonne National Laboratory, as well as to a 128 processor TC2000 at Lawrence Livermore National Laboratory. As a note to the reader, for further reference in the rest of this book, the term processor is used to denote a processor-memory module on the Butterfly.
3.3. The BBN Programming Environment

An algorithm can be designed to take advantage of a particular machine's characteristics to enhance overall performance. The algorithms illustrated here were designed for an MIMD architecture and optimized for implementation on the BBN Butterfly. Ideally, one would like to design program code so that it will run unmodified on a variety of parallel architectures. Although some parallel environments are available on a wide variety of machines (notably Linda [Ahuh86]), we did not have access to these programming tools.

The parallel programming paradigms available on the Butterfly at Ohio State (where the code was originally developed) are BBN's Uniform System and a version of C-Threads originally developed at Carnegie-Mellon University [Coop87]. Additionally, Lawrence Livermore provides a split-join programming model called PCP, but it is currently only supported on the TC2000 [Gord91], not the GP1000.
The Uniform System approach is a BBN-specific parallel programming scheme [BBN89a]. This method uses the concept of generators, which spawn off parallel tasks in a number of different ways selectable by the programmer. The second paradigm is C-Threads, which was ported to the Butterfly at The Ohio State University [Sami89]. The Uniform System was used for implementation purposes since it is supported by BBN and the program code would be able to run unmodified on the TC2000 as well.

The Uniform System parallel programming paradigm is designed to allow the programmer to develop parallel applications which are insensitive to the actual number of processors in the system. The program code does not need to be modified, nor special cases taken into account, when it is run under different processor configurations. This also allows for debugging and testing on a small number of processors and later using all of the processors in the system for timing measurements. The Uniform System is a library of routines that the user links into a C or Fortran program. In the case of the algorithms presented here, all of the code is written in C. The Uniform System can basically be divided into two sections: the shared memory portion and the task assignment portion. In addition, special routines are available to handle atomic operations, locks for synchronization, spin waits, and other configuration operators. Memory can be allocated in the system as:

1. Local variables
2. Global variables
3. Dynamic storage
4. Shared dynamic storage
5. Copied storage
Local, global, and dynamically allocated memory are treated the same as in any normal C program, with each processor having its own copy of a variable. Processor i and processor j may both reference a variable l, but its value is different on each processor since each has its own local copy.

Shared dynamic storage allows one to create space for data which can be stored somewhere in global memory. The data is available to each processor, but is physically stored in only one processor's memory.
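To give a flavor of this style, the sketch below allocates a shared array once and lets a generator hand out per-scanline tasks to all available processors. It is our own illustration: the routines InitializeUs, UsAlloc, Share, and GenOnI follow our reading of the Uniform System library [BBN89a], and their exact signatures, like the shade_scanline routine, should be treated as assumptions rather than as a definitive reference.

    /* Sketch of the Uniform System style: shared storage plus a task
       generator.  Routine names and signatures are assumed from the
       Uniform System documentation [BBN89a]; shade_scanline is a
       hypothetical rendering routine. */
    #include <us.h>          /* Uniform System library header (BBN) */
    #include <stdio.h>

    #define YRES 968

    static int *row_done;    /* pointer to shared storage, one flag per row */

    /* Task procedure: the generator calls this with a row index from
       whichever processor picks up the task next. */
    static void render_row(int arg, int y)
    {
        /* shade_scanline(y);   hypothetical per-scanline rendering work */
        row_done[y] = 1;
    }

    int main(void)
    {
        InitializeUs();                         /* start the Uniform System */

        row_done = (int *) UsAlloc(YRES * sizeof(int));   /* shared storage */
        Share(&row_done);                       /* make the pointer visible */
                                                /* on every processor       */

        GenOnI(render_row, YRES);               /* generate YRES tasks over */
                                                /* all available processors */

        printf("rendered %d scanlines\n", YRES);
        return 0;
    }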
