ECEN 6253 Advanced Digital Computer Design

Multiprocessors

We can think of multiprocessors as a way to increase the parallelism of the uniprocessor systems we have been studying. Scalar processors have a single pipeline. Superscalar processors have multiple parallel execution pipelines, but share common instruction fetch and decode stages. Multiprocessors have multiple complete parallel pipelines, where each pipeline has its own fetch and decode stages.

Anyone who has used the internet has experienced a system with multiple processors. The internet consists of massive numbers of independent processors that are loosely coupled through their network connections. To each processor, the network connection looks like another relatively slow I/O device. In particular, each processor has its own memory, which is not shared with other processors.

Multiprocessors are much more tightly coupled. Memory (and I/O devices) are shared via a local interconnection network. Each processor has access to its own memory and to all the memory of the other processors. This requires a much higher speed interconnection that is capable of keeping up with memory data transfer rates. The two basic organizations of shared memory multiprocessors are shown in fig. 9-4, p. 424.

Memory becomes a common resource that must be shared between execution threads running simultaneously (really simultaneously, not time shared) on different processors in the multiprocessor system. The use of memory by different threads running on the same or different processors must be properly synchronized using the techniques of the previous section, as sketched below. In this manner, the design goals (middle of p. 423) for multiprocessors can be achieved.

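As a concrete illustration (a minimal sketch, not from the original notes; it assumes Python's standard threading module), two threads incrementing a shared counter must serialize their read-modify-write sequences with a lock, or updates can be lost:

    import threading

    counter = 0
    lock = threading.Lock()

    def worker(iterations):
        global counter
        for _ in range(iterations):
            # Without the lock, the load-increment-store sequence from two
            # threads can interleave and lose updates.
            with lock:
                counter += 1

    threads = [threading.Thread(target=worker, args=(100_000,)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(counter)   # 200000 with the lock; can be less without it

The lock here plays the role of the synchronization primitives from the previous section; the hardware support described there is what makes acquiring and releasing it correct on a real multiprocessor.
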
Cache Coherence. The unit latency requirement is usually satisfied by providing each processor with its own local cache, as seen in fig. 9-4, p. 424. As long as cache miss rates are low, the unit latency goal will be (almost) met.

Separate local caches make it difficult for each processor to have a coherent view of memory, one that includes the effects of both its own memory writes and the memory writes of other processors. Coherence protocols are necessary, as shown in fig. 9-5, p. 425.

An update protocol requires the transfer of the write data (and write address) to all the processor caches that have a copy of the data. Since the data is transferred through the interconnection network anyway, it might as well be stored in main memory as well. This is very similar to a write-through strategy for a uniprocessor cache. Write-through (and therefore update protocols) is known to have high memory bandwidth requirements, since each write (by every processor) must be transmitted on the interconnection network. For this reason, update protocols are not used in modern designs.

An invalidate protocol requires the transfer of only the write address to all processor caches that have a copy of the data. Furthermore, once the copies of the data in other
caches have been invalidated (removed from the cache), the original processor has an exclusive copy of the data. It can continue to read or write this location without notifying the other processors. This is very similar to a write-back strategy for a uniprocessor cache. Write-back (and therefore invalidate protocols) is known to have lower memory bandwidth requirements, since most data transfers take place to the local cache without using the interconnection network at all. Once in the exclusive state, the interconnection network is used only when

1. the local cache must write back (evict) the data to make room for new data locations in the cache, or
2. other processors want to read or write the exclusive location.

The invalidate protocol is usually implemented with the four states shown in fig. 9-6, p. 428. Note the following points; a small code sketch follows the list.

1. Each cache block has its own state, which is maintained by the local cache controller in response to both local processor memory operations and bus operations from remote processors.
2. Bus read, bus write and bus upgrade correspond to remote read, remote write and remote upgrade (invalidate), respectively.
3. Bus upgrade is usually called bus invalidate, since this is the message that invalidates caches in the shared state.
4. A cache miss (tag mismatch) on a read or write is treated the same as a local read or write to an invalid state.
5. The Modified state is also exclusive. Only one cache can be in the exclusive or modified state for a particular memory block.
6. A local evict does not have an effect on other caches.

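To make these notes concrete, here is a minimal Python rendering of a four-state invalidate protocol. This is a reconstruction under the usual Invalid/Shared/Exclusive/Modified interpretation of fig. 9-6, not the figure itself; the event names and table entries are my own:

    I, S, E, M = "Invalid", "Shared", "Exclusive", "Modified"

    def local_transition(state, op, others_have_copy):
        """Next state for a local processor operation (note 1)."""
        if op == "read":
            if state == I:               # read miss: issue a bus read (note 4)
                return S if others_have_copy else E
            return state                 # read hit in S, E or M: no bus traffic
        if op == "write":
            if state == S:               # issue bus upgrade (invalidate), note 3
                return M
            if state == I:               # write miss: issue a bus write (note 4)
                return M
            return M                     # E -> M silently; M stays M (note 5)
        if op == "evict":                # note 6: no effect on other caches;
            return I                     # a Modified block is written back first

    # Snooped bus events from remote processors (note 2).
    BUS_TRANSITIONS = {
        (S, "bus_read"): S,
        (E, "bus_read"): S,              # exclusivity is lost
        (M, "bus_read"): S,              # the dirty block is written back first
        (S, "bus_write"): I,
        (E, "bus_write"): I,
        (M, "bus_write"): I,             # write back, then invalidate
        (S, "bus_upgrade"): I,           # the invalidate message for shared copies
    }

A table lookup per event is essentially what the local cache controller implements in hardware.
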
Snooping. The simplest way to implement cache coherence invalidate protocols is to have all processors share a common address bus. The common address bus can be a single physical bus, or it might be a collection of physical busses in a more complicated interconnection topology. Each processor monitors (snoops) the addresses that other processors place on the bus. When an address on the address bus matches a location in the local cache, the appropriate action is taken according to the coherence protocol.

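The snoop check itself is just an address decode followed by a tag compare. The fragment below is a hypothetical sketch (the field widths and the direct-mapped tag store are assumptions, not from the notes):

    BLOCK_BITS = 6      # assumed 64-byte blocks
    INDEX_BITS = 10     # assumed 1024-set cache

    def snoop_hit(local_tags, bus_address):
        """True when a snooped bus address matches a block in the local cache."""
        index = (bus_address >> BLOCK_BITS) & ((1 << INDEX_BITS) - 1)
        tag = bus_address >> (BLOCK_BITS + INDEX_BITS)
        return local_tags.get(index) == tag   # on a hit, apply the protocol action

    # Example: a tag store with one cached block at index 3, tag 0x12.
    tags = {3: 0x12}
    print(snoop_hit(tags, (0x12 << 16) | (3 << 6)))   # True: tag and index match
    print(snoop_hit(tags, (0x13 << 16) | (3 << 6)))   # False: different tag
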
State transitions in fig. 9-6 do not occur until the local cache controller gains access to the shared address bus. This prevents problems such as two cache controllers in the shared state invalidating each other. This means that the real cache controller has intermediate states in which it waits for bus access.

The cache must respond to the local processor as any normal write-back cache would. In addition, snooping requires a cache operation (tag match) every time any processor uses the shared address bus. This normally requires a multiported cache to avoid slowing down the local processor.

Snooping has been used successfully in commercial systems with up to 64 processors. For larger multiprocessor systems, the bandwidth requirements on the shared address bus make snooping impractical. Each processor puts addresses on the shared address bus at the rate shown in eq. 9-1, p. 429. The total bus traffic, eq. 9-2, is proportional to the number of processors. Since only one processor can put an address on the shared bus at any given time, it takes longer and longer for messages on the bus to get through, as most processors spend their time waiting to get control of the bus. The waiting time appears as increased memory latency, which increases the effective cache miss penalty.

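To see how quickly the shared bus saturates, here is a back-of-the-envelope calculation. Eqs. 9-1 and 9-2 are not reproduced in these notes, so this sketch assumes the usual form: each processor generates bus addresses at roughly (references per second) x (miss ratio), and total traffic is N times that. All numbers are made up for illustration:

    # Hypothetical numbers, for illustration only.
    refs_per_sec = 500e6     # memory references per processor per second
    miss_ratio = 0.02        # fraction of references that must use the bus
    bus_capacity = 100e6     # addresses per second the shared bus can carry

    per_proc = refs_per_sec * miss_ratio     # eq. 9-1 style: 10e6 addresses/s
    for n in (4, 8, 16, 32, 64):
        total = n * per_proc                 # eq. 9-2 style: proportional to N
        print(n, total / bus_capacity)       # fraction of bus capacity consumed

With these particular numbers the bus saturates at N = 10 processors; beyond that, the queueing delay appears as extra miss penalty, exactly as described above.
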
Limitations of Snooping. The bandwidth of a single shared address bus can limit the performance of a multiprocessor system with large numbers of processors. Suppose we add more interprocessor communication busses. The best we can do is a dedicated bus between each pair of processors, as in the figure below.

[Figure: six processors, proc 1 through proc 6, with a dedicated bus between every pair of processors.]

This is called a "fully connected" network. If there are N processors, then the total number of busses is

(N - 1) + (N - 2) + ... + 2 + 1 + 0 = N(N - 1)/2

The best we can do is therefore ~N²/2 times the communication bandwidth of a single bus.

If cache coherence is implemented with snooping, the bus read, bus write and bus upgrade (invalidate) messages must be broadcast to all processors. Each of the N processors will send N - 1 messages (one to each of the other processors), resulting in a total average message count ~N². Clearly, adding more busses does not help if the cache coherence protocol requires broadcasting messages to all processors. We need a coherence scheme that takes advantage of the fact that invalidate protocols do not need to broadcast messages.

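A quick computation makes the mismatch concrete; this illustrative sketch compares the number of busses the fully connected network provides against the number of messages broadcasting generates:

    for n in (4, 16, 64, 256):
        busses = n * (n - 1) // 2     # one dedicated bus per processor pair
        messages = n * (n - 1)        # each processor broadcasts to N - 1 others
        print(n, busses, messages)    # messages are always 2x the busses

Both quantities grow as ~N², so the extra wiring is consumed by the broadcasts and the load per bus never goes down.
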
Directory Based Cache Coherence. Suppose we add extra bits (the directory) to main memory to keep track of which processor caches have copies of each memory block. We can also keep track of the status of each main memory block with a directory controller, similar to the cache controller of each processor. The organization of main memory with a directory is as follows.

address -> | processor list | status | main memory data block |

The processor list must be compact to keep from increasing the memory size too much. Usually, a single bit per processor is used in the list to indicate whether or not that processor has a copy. Even so, the number of bits in the list can get excessive when there are hundreds of processors.

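To put numbers on "excessive", here is an illustrative overhead calculation (the 64-byte block size and the bit-vector-plus-status format are assumptions, not from the notes):

    block_bytes = 64               # assumed cache block size
    data_bits = block_bytes * 8    # 512 data bits per memory block

    for n_procs in (16, 64, 256, 1024):
        dir_bits = n_procs + 2     # one presence bit per processor + 2 status bits
        print(n_procs, dir_bits, f"{dir_bits / data_bits:.1%}")

At 16 processors the directory adds about 3.5% to the memory size; at 1024 processors it would be twice the size of the data it describes.
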
The status bits of the directory represent one of the following four states.

• Uncached (U) - none of the processor caches has a copy. The main memory data block is valid.
• Shared (S) - each processor in the list has a copy in the shared state. The main memory data block is valid.
• Exclusive (E) - one and only one processor has a copy in the exclusive (unmodified) state. The main memory block is valid.
• Modified (M) - one and only one processor has a copy in the modified state. The main memory block is invalid.

Observe that:

1. The cache controller responds in the same way as in fig. 9-6, p. 428. The difference is that the bus read, bus write and bus upgrade (invalidate) messages are sent to and come from the directory controller, not the other cache controllers.
2. A bus read request to the directory for a block in the M state requires the directory to send a message to the single processor cache that has a valid copy of the data. This extra "dirty miss" latency can be a problem for some database applications.
3. Other bus read requests can be handled by the directory itself (rather than loading down a processor cache controller), since the main memory data block is valid.
4. A bus write or bus upgrade (invalidate) request requires messages to be sent only to the processors in the list (no broadcasting). A sketch of this request handling follows the list.

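As an illustration of points 1-4 above, a directory controller's request handling might look like the following hypothetical sketch (the entry layout and function names are mine, not from the notes or any particular machine):

    def send(processor, message):              # stand-in for the interconnect
        print(f"to processor {processor}: {message}")

    # Hypothetical directory entry: a state plus a presence list, modeled
    # here as a set of processor ids for readability.
    def handle_bus_read(entry, requester):
        state, sharers = entry["state"], entry["sharers"]
        if state == "M":
            (owner,) = sharers                 # the only valid copy
            send(owner, "write back block")    # extra dirty-miss latency (point 2)
            entry["state"] = "S"
        elif state == "E":
            (owner,) = sharers
            send(owner, "downgrade to shared")
            entry["state"] = "S"
        else:                                  # U or S: memory copy is valid, so
            entry["state"] = "S"               # the directory answers by itself (point 3)
        sharers.add(requester)
        send(requester, "data from memory")

    def handle_bus_upgrade(entry, requester):
        # Invalidate only the caches in the list -- no broadcast (point 4).
        for p in entry["sharers"] - {requester}:
            send(p, "invalidate block")
        entry["state"], entry["sharers"] = "M", {requester}
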
Since all bus read, write and invalidate operations must be handled by the directory, contention for the directory can easily become a problem. For this reason, most directory schemes spread the main memory locations between the processor nodes (the NUMA architecture, fig. 9-4, p. 424), with a separate directory controller for each processor.

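The notes do not spell out how locations are spread across the nodes; one common approach (an assumption here, not from the text) is to interleave blocks across nodes by address, so every block has a fixed home node whose directory controller owns its entry:

    BLOCK_BYTES = 64     # assumed block size
    NUM_NODES = 16       # assumed node count

    def home_node(address):
        # Consecutive blocks land on consecutive nodes, spreading
        # directory traffic evenly across all directory controllers.
        return (address // BLOCK_BYTES) % NUM_NODES

    print(home_node(0x0000))   # node 0
    print(home_node(0x0040))   # node 1
    print(home_node(0x0400))   # node 0 again, 16 blocks later
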