[Figure Ov.27 plot area: panels "Cache Accesses (per 10 ms) and System CPI," "Cache Power (per 10 ms) and Cumulative Energy," and "DRAM & Disk Power/Cumulative Energy/Access Numbers" for gzip, 512 MB; x-axis: time (s).]
FIGURE Ov.27: Full execution of gzip. The figure shows the entire run of gzip. System configuration is a 2-GHz Pentium processor with 512 MB of DDR2-533 FB-DIMM main memory and a 12k-RPM disk drive with built-in disk cache. The figure shows the interaction between all components of the memory system, including the L1 instruction and data caches, the unified L2 cache, the DRAM system, and the disk drive. All graphs use the same x-axis, which represents execution time in seconds. The x-axis does not start at zero; the measurements exclude system boot time, invocation of the shell, etc. Each data point represents aggregated (not sampled) activity within a 10-ms epoch. The CPI graph shows two system CPI values: one is the average CPI for each 10-ms epoch, and the other is the cumulative average CPI. A duration with no CPI data point indicates that no instructions were executed due to I/O latency. During such a window the CPI is essentially infinite, and thus it is possible for the cumulative average to range higher than the displayed instantaneous CPI. Note that the CPI, the DRAM accesses, and the disk accesses are plotted on log scales.
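The two CPI curves described in the caption can be derived directly from per-epoch aggregates; a minimal sketch follows (the epoch representation is an assumption, not the tool chain used to produce the figure):

```python
def cpi_curves(epochs):
    """epochs: list of (cycles, instructions) pairs aggregated over each 10-ms epoch."""
    per_epoch, cumulative = [], []
    total_cycles = total_insts = 0
    for cycles, insts in epochs:
        total_cycles += cycles
        total_insts += insts
        # No instructions retired (CPU stalled on I/O): the per-epoch CPI is
        # effectively infinite, so no point is plotted for that epoch.
        per_epoch.append(cycles / insts if insts else float("inf"))
        cumulative.append(total_cycles / total_insts if total_insts else float("inf"))
    return per_epoch, cumulative
```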
more compute-intensive phase in which the CPI asymptotes down to the theoretical sustainable performance, the single-digit values that architecture research typically reports. By the end of execution, the total energy consumed in the FB-DIMM DRAM system (half a kilojoule) almost equals the energy consumed by the disk, and it is twice that of the L1 data cache, L1 instruction cache, and unified L2 cache combined.

Currently, there is substantial work in both industry and academia to address the latter issue, with much of it focusing on access scheduling, architecture improvements, and data migration. To complement this work, we look at a wide range of organizational approaches, i.e., attacking the problem from a parameter point of view rather than a system-redesign, component-redesign, or new-proposed-mechanism point of view, and we find significant synergy between the disk cache and the memory system. Choices in the disk-side cache affect both system-level performance and system-level (in particular, DRAM-subsystem-level) energy consumption. Though disk-side caches have been proposed and studied, their effect upon total system behavior (execution time, CPI, or total memory-system power, including the effects of the operating system) is as yet unreported. For example, Zhu and Hu [2002] evaluate the disk's built-in cache using both real and synthetic workloads and report the results in terms of average response time. Smith [1985a and b] evaluates a disk cache mechanism on a disk-cache simulator, using real traces collected on IBM mainframes, and reports the results in terms of miss rate. Huh and Chang [2003] evaluate their RAID controller cache organization with a synthetic trace. Varma and Jacobson [1998] and Solworth and Orji [1990] evaluate destaging algorithms and write caches, respectively, with synthetic workloads. This study represents the first time that the effects of the disk-side cache can be viewed at a system level (considering both application and operating-system effects) and compared directly to all the other components of the memory system.

We use a full-system, execution-based simulator combining Bochs [Bochs 2006], Wattch [Brooks et al. 2000], CACTI [Wilton & Jouppi 1994], DRAMsim [Wang et al. 2005, September], and DiskSim [Ganger et al. 2006]. It boots the RedHat Linux 6.0 kernel and can therefore capture all application behavior and all operating-system behavior, including I/O activity, disk-block buffering, system-call overhead, and virtual-memory overhead such as translation, table walking, and page swapping. We investigate the disk-side cache in both single-disk and RAID-5 organizations. Cache parameters include size, organization, whether the cache supports write caching, and whether it prefetches read blocks. Additional parameters include disk rotational speed and DRAM-system capacity.
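As a rough illustration of the design space just described (not the simulators' actual configuration interface; the field names and value lists below are assumptions), the sweep can be thought of as a cross product of a handful of parameters:

```python
# Hypothetical sketch of the parameter space described above; names and values
# are illustrative, not taken from Bochs/DRAMsim/DiskSim.
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Config:
    dram_mb: int          # DRAM-system capacity
    disk_rpm: int         # disk rotational speed
    disks: int            # 1 = single disk, 8 = RAID-5
    cache_segments: int   # disk-cache organization (0 = no disk cache)
    write_caching: bool   # write buffering enabled?
    prefetch: bool        # prefetch read blocks?

def sweep():
    for dram, rpm, nd, segs, wc, pf in product(
            (128, 256, 512), (5400, 12000, 20000), (1, 8),
            (0, 1, 2, 16, 24), (False, True), (False, True)):
        yield Config(dram, rpm, nd, segs, wc, pf)
```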
We find a complex trade-off between the disk cache, the DRAM system, and disk parameters such as rotational speed. The disk cache, particularly its write-buffering feature, is a very powerful tool that enables significant savings in both energy and execution time. This is important because, though the cache's support for write buffering is often enabled in desktop operating systems (e.g., Windows and some but not all flavors of Unix/Linux [Ng 2006]), it is typically disabled in enterprise computing applications [Ng 2006], and these are the applications most likely to use FB-DIMMs [Haas & Vogt 2005]. We find substantial improvement between existing implementations and an ideal write buffer (i.e., this is a limit study). In particular, the disk cache's write-buffering ability can reduce the total energy consumption of the memory system (including caches, DRAMs, and disks) by nearly a factor of two, while sacrificing only a small amount of performance.

Ov.4.2 Fully Buffered DIMM: Basics
The relation between a traditional organization and an FB-DIMM organization is shown in Figure Ov.28, which motivates the design in terms of a graphics-card organization. The first two drawings show a multi-drop DRAM bus next to a DRAM-bus organization typical of graphics cards, which use point-to-point soldered connections between the DRAM and the memory controller to achieve higher speeds. This arrangement is used in FB-DIMM.
[Figure Ov.28 diagram: three organizations shown side by side: the traditional (JEDEC) organization (multi-drop bus, DIMMs, package pins), the graphics-card organization (point-to-point connections), and the Fully Buffered DIMM organization, in which the memory controller connects over northbound and southbound channels to AMB-equipped DIMMs (DIMM 0, DIMM 1, ... up to 8 modules of DDRx SDRAM devices).]
FIGURE Ov.28: FB-DIMM and its motivation. The first two pictures compare the memory organizations of a JEDEC SDRAM system and a graphics card. Above each design is its side profile, indicating potential impedance mismatches (sources of reflections). The organization on the far right shows how the FB-DIMM takes the graphics-card organization as its de facto DIMM. In the FB-DIMM organization, there are no multi-drop busses; DIMM-to-DIMM connections are point to point. The memory controller is connected to the nearest AMB via two unidirectional links. The AMB is, in turn, connected to its southern neighbor via the same two links.
A slave memory controller has been added onto each DIMM, and all connections in the system are point to point. A narrow, high-speed channel connects the master memory controller to the DIMM-level memory controllers (called Advanced Memory Buffers, or AMBs). Since each DIMM-to-DIMM connection is a point-to-point connection, a channel becomes a de facto multi-hop store-and-forward network. The FB-DIMM architecture limits the channel length to eight DIMMs, and the narrower inter-module bus requires roughly one-third as many pins as a traditional organization. As a result, an FB-DIMM organization can handle roughly 24 times the storage capacity of a single-DIMM DDR3-based system, without sacrificing any bandwidth and even leaving headroom for increased intra-module bandwidth.
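One way to read the 24x figure (an interpretation of the numbers above, not spelled out in the text): if each FB-DIMM channel needs roughly a third of the pins of a traditional channel, three FB-DIMM channels fit in the same pin budget, and each supports eight DIMMs.

```python
# Back-of-the-envelope reading of the "roughly 24 times" capacity claim above.
# The absolute pin count is an assumption; only the ratios matter.
traditional_channel_pins = 240                       # assumed DDRx channel pin budget
fbdimm_channel_pins = traditional_channel_pins // 3  # "roughly one-third as many pins"
dimms_per_fbdimm_channel = 8                         # channel-length limit stated above

channels_in_same_pin_budget = traditional_channel_pins // fbdimm_channel_pins  # 3
capacity_vs_single_dimm = channels_in_same_pin_budget * dimms_per_fbdimm_channel
print(capacity_vs_single_dimm)  # 24
```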
The AMB acts like a pass-through switch, directly forwarding the requests it receives from the controller to successive DIMMs and forwarding frames from southerly DIMMs to northerly DIMMs or the memory controller. All frames are processed to determine whether the data and commands are for the local DIMM. The FB-DIMM system uses a serial packet-based protocol to communicate between the memory controller and the DIMMs. Frames may contain data and/or commands. Commands include DRAM commands such as row activate (RAS), column read (CAS), and refresh (REF), as well as channel commands such as writes to configuration registers, synchronization commands, etc. Frame scheduling is performed exclusively by the memory controller. The AMB only converts the serial protocol to DDRx-based commands without implementing any scheduling functionality.

The AMB is connected to the memory controller and/or adjacent DIMMs via unidirectional links: the southbound channel, which transmits both data
and commands, and the northbound channel, which transmits data and status information. The southbound and northbound datapaths are 10 bits and 14 bits wide, respectively. The FB-DIMM channel clock operates at six times the speed of the DIMM clock; i.e., the link speed is 4 Gbps for a 667-Mbps DDRx system. Frames on the north- and southbound channels require 12 transfers (6 FB-DIMM channel clock cycles) for transmission. This 6:1 ratio ensures that the FB-DIMM frame rate matches the DRAM command clock rate.

Southbound frames comprise both data and commands and are 120 bits long; northbound frames are data only and are 168 bits long. In addition to the data and command information, the frames also carry header information and a frame CRC (cyclic redundancy check) checksum that is used to check for transmission errors. A northbound read-data frame transports 18 bytes of data in 6 FB-DIMM clocks, or 1 DIMM clock. A DDRx system can burst back the same amount of data to the memory controller in two successive beats lasting one full DRAM clock cycle. Thus, the read bandwidth of an FB-DIMM system is the same as that of a single channel of a DDRx system. Due to the narrower southbound channel, the write bandwidth in FB-DIMM systems is one-half that available in a DDRx system. However, because reads and writes travel on separate channels, the total bandwidth available in an FB-DIMM system is 1.5 times that of a DDRx system.
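The following back-of-the-envelope calculation reproduces those ratios. The frame sizes and the 6:1 clock ratio come from the text; the 667-Mbps part and the 72-bit (ECC) DIMM datapath assumed for the absolute GB/s figures are my assumptions:

```python
# Rough bandwidth arithmetic for the FB-DIMM channel described above.
data_rate_mbps   = 667                       # assumed DDRx-667 part
dimm_clock_mhz   = data_rate_mbps / 2        # DDR: two data beats per DIMM clock
fbdimm_clock_mhz = 6 * dimm_clock_mhz        # 6:1 channel-to-DIMM clock ratio (~2 GHz)
link_gbps        = 2 * fbdimm_clock_mhz / 1000   # 12 transfers per 6 clocks -> ~4 Gbps/lane

read_bytes_per_dimm_clock  = 18              # one northbound data frame per DIMM clock
write_bytes_per_dimm_clock = 18 / 2          # narrower southbound channel: half the data

read_bw  = read_bytes_per_dimm_clock  * dimm_clock_mhz / 1000   # ~6.0 GB/s
write_bw = write_bytes_per_dimm_clock * dimm_clock_mhz / 1000   # ~3.0 GB/s
print(link_gbps, read_bw, write_bw, (read_bw + write_bw) / read_bw)   # ..., ..., ..., 1.5
```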
Figure Ov.29 shows the processing of a read transaction in an FB-DIMM system. Initially, a command frame is used to transmit a command that will perform row activation. The AMB translates the request and relays it to the DIMM. The memory controller schedules the CAS command in a following frame. The AMB relays the CAS command to the DRAM devices, which burst the data back to the AMB. The AMB bundles two consecutive bursts of data into a single northbound frame and transmits it to the memory controller. In this example, we assume a burst length of four, corresponding to two FB-DIMM data frames. Note that although the figures do not identify parameters like t_CAS, t_RCD, and t_CWD, the memory controller must ensure that these constraints are met.
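A minimal sketch of that flow, in DIMM-clock ticks (the timing values are placeholders, not DDR2-533 datasheet numbers; enforcing them is the controller's job, as stated above):

```python
T_RCD = 4   # assumed ACT-to-CAS delay, in DIMM clocks
T_CAS = 4   # assumed CAS-to-first-data delay, in DIMM clocks

def read_transaction(start=0, burst_length=4):
    """Frame schedule for one read with burst length 4 (two northbound frames)."""
    events = [(start, "southbound frame: ACT (row activate)")]
    cas = start + T_RCD
    events.append((cas, "southbound frame: CAS (column read)"))
    first_data = cas + T_CAS
    for i in range(burst_length // 2):          # two data beats bundled per frame
        events.append((first_data + i, f"northbound frame: beats {2*i} and {2*i + 1}"))
    return events

for tick, frame in read_transaction():
    print(tick, frame)
```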
[Figure Ov.29 timing diagram: FB-DIMM clock, southbound bus (RAS and CAS frames), DIMM clock, DIMM command bus, DIMM data bus, and northbound bus (data frames, e.g., D2-D3).]
FIGURE Ov.29: Read transaction in an FB-DIMM system. The figure shows how a read transaction is performed in an FB-DIMM system. The FB-DIMM serial busses are clocked at six times the speed of the DIMM busses. Each FB-DIMM frame on the southbound bus takes six FB-DIMM clock periods to transmit. On the northbound bus, a frame comprises two DDRx data bursts.
The primary dissipater of power in an FB-DIMM channel is the AMB, and its power depends on its position within the channel. The AMB nearest to the memory controller must handle its own traffic and repeat all packets to and from all downstream AMBs, and so it dissipates the most power. The AMB in a DDR2-533 FB-DIMM dissipates 6 W, and the figure is currently 10 W for 800-Mbps DDR2 [Staktek 2006]. Even if one averages out the activity on the AMBs in a long channel, the eight AMBs in a single 800-Mbps channel can easily dissipate 50 W. Note that this number is for the AMBs only; it does not include power dissipated by the DRAM devices.
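A quick sanity check of that channel-level estimate (the per-AMB activity factor is an assumption; the 10-W figure is the one cited above):

```python
amb_peak_w = 10.0       # per-AMB dissipation for 800-Mbps DDR2 [Staktek 2006]
ambs_per_channel = 8    # FB-DIMM channel-length limit
avg_activity = 0.65     # assumed average duty factor along the daisy chain

print(ambs_per_channel * amb_peak_w * avg_activity)   # ~52 W, in line with "easily 50 W"
```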

Ov.4.3 Disk Caches: Basics

Today's disk drives all come with a built-in cache as part of the drive controller electronics, ranging in size from 512 KB for the micro-drive to 16 MB for the largest server drives. Figure Ov.30 shows the cache and its place within a system. The earliest drives had no cache memory, as they had little control electronics. As the control of data transfer migrated from the host-side control logic to the drive's own controller, a small amount of memory was needed to act as a speed-matching buffer, because the disk's media data rate is different from that of the interface. Buffering is also needed because when the head is in position and ready to do a data transfer, the host or the interface may be busy and not ready to receive read data. DRAM is usually used as this buffer memory.

In a system, the host typically has some memory dedicated to caching disk data, and if a drive is attached to the host via some external controller, that controller also typically has a cache. Both the system cache and the external cache are much larger than the disk drive's internal cache. Hence, for most workloads, the drive's cache is not likely to see many reuse cache hits. However, the disk-side cache is very effective in opportunistically prefetching data, as only the controller inside the drive knows the state the drive is in and when and how it can prefetch without adding any cost in time. Finally, the drive needs cache memory if it is to support write caching/buffering.
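To make the prefetching point concrete, here is an illustrative sketch (the track model and names are assumptions, not a real drive's firmware): the drive can keep reading sequentially into its buffer after servicing a request, because the head is already positioned and the media is spinning past it anyway.

```python
def opportunistic_prefetch(last_serviced_lba, track_end_lba, buffer_free_blocks):
    """LBAs worth reading into the disk cache for free while the head stays on this track."""
    end = min(track_end_lba, last_serviced_lba + buffer_free_blocks)
    return list(range(last_serviced_lba + 1, end + 1))

# Example: after servicing LBA 1000 on a track ending at LBA 1063 with room for
# 32 blocks, the drive reads ahead through LBA 1032 at no extra mechanical cost.
print(opportunistic_prefetch(1000, 1063, 32))
```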

[Figure Ov.30 diagram: applications (web server, etc.) above the operating system (kernel, modules, buffer cache) above the hardware; cached data is held (a) in the operating system's buffer cache and (b) in the drive's disk cache.]
FIGURE Ov.30: Buffer caches and disk caches. Disk blocks are cached in several places, including (a) the operating system's buffer cache in main memory and (b), on the disk, in another DRAM buffer, called a disk cache.
With write caching, the drive controller services a write request by transferring the write data from the host to the drive's cache memory and then reports back to the host that the write is "done," even though the data has not yet been written to the disk media (data not yet written out to disk is referred to as dirty). Thus, the service time for a cached write is about the same as that for a read cache hit, involving only some drive controller overhead and electronic data transfer time but no mechanical time. Clearly, write caching does not depend on having the right content in the cache memory to work, unlike read caching. Write caching will always work, i.e., a write command will always be a cache hit, as long as there is available space in the cache memory. When the cache becomes full, some or all of the dirty data are written out to the disk media to free up space. This process is commonly referred to as destage.

Ideally, destage should be done while the drive is idle so that it does not affect the servicing of read requests. However, this may not always be possible. The drive may be operating in a high-usage system with little idle time, or the writes may arrive in bursts which quickly fill up the limited memory space of the cache. When destage must take place while the drive is busy, that activity adds to the drive's load at the time, and the user will notice longer response times. Instead of providing the full benefit of cache hits, write caching in this case merely delays the disk writes.
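A minimal sketch of this write-cache/destage behavior (the capacity, block granularity, and forced-destage policy are illustrative assumptions):

```python
from collections import deque

class DiskWriteCache:
    def __init__(self, capacity_blocks=1024):
        self.capacity = capacity_blocks
        self.dirty = deque()     # blocks acknowledged to the host but not yet on media

    def write(self, block):
        """Acknowledge immediately if the block fits; otherwise destage first (competing with reads)."""
        if len(self.dirty) >= self.capacity:
            self.destage(self.capacity // 2)
        self.dirty.append(block)
        return "done"            # no mechanical time on the acknowledged write path

    def destage(self, n=None):
        """Flush dirty blocks to the media; ideally called only while the drive is idle."""
        n = len(self.dirty) if n is None else n
        for _ in range(min(n, len(self.dirty))):
            self.dirty.popleft()  # stands in for the actual media write
```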
Zhu and Hu [2002] have suggested that large disk built-in caches will not significantly benefit overall system performance, because all modern operating systems already use large file-system caches to cache reads and writes. As suggested by Przybylski [1990], the reference stream missing a first-level cache and being handled by a second-level cache tends to exhibit relatively low locality. In a real system, the reference stream reaching the disk has already missed the operating system's buffer cache, so the locality in that stream tends to be low. Our simulation captures all of this activity: in our experiments, we investigate the disk cache including the full effects of the operating system's file-system caching.

Ov.4.4 Experimental Results

Figure Ov.27 showed the execution of the GZIP benchmark with a moderate-sized FB-DIMM DRAM system: half a gigabyte of storage. At 512 MB, there is no page swapping for this application. When the storage size is cut in half to 256 MB, page swapping begins but does not affect the execution time significantly. When the storage size is cut to one-quarter of its original size (128 MB), the page swapping is significant enough to slow the application down by an order of magnitude. This represents the hard type of decision that a memory-systems designer has to face: if one can reduce power dissipation by cutting the amount of storage and feel a negligible impact on performance, then one has too much storage to begin with.

Figure Ov.31 shows the behavior of the system when storage is cut to 128 MB. Note that all aspects of system behavior have degraded: execution time is longer, and the system consumes more energy. Though the DRAM system's energy has decreased from 440 J to just under 410 J, the execution time has increased from 67 to 170 seconds, the total cache energy has increased from 275 to 450 J, the disk energy has increased from 540 to 1635 J, and the total energy has doubled from 1260 to 2515 J. This is the result of swapping activity: not enough to bring the system to its knees, but enough to be relatively painful.
We noticed that the disk subsystem exhibits the same sort of behavior observed in a microprocessor's load/store queue: reads are often stalled waiting for writes to finish, despite the fact that the disk has a 4-MB read/write cache on board. The disk's cache is typically organized to prioritize prefetch activity over write activity, because this tends to give the best performance results and because write buffering is often disabled by the operating system. The solution to the write-stall problem in microprocessors has been to use write buffers; we therefore modified DiskSim to implement an ideal write buffer on the disk side that would not interfere with the disk cache. Figure Ov.32 indicates that the size of the cache seems to make little difference to the behavior of the system; what matters is that a cache is present. Thus, we should not expect read performance to suddenly increase as a result of moving writes into a separate write buffer.
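A sketch of the "ideal write buffer" abstraction used in this limit study (this is the idea, not DiskSim's actual code; the request and disk interfaces are assumed):

```python
class IdealWriteBuffer:
    """Writes are absorbed instantly and never occupy the drive, so reads proceed
    as if no write traffic existed."""
    def __init__(self, disk):
        self.disk = disk                # assumed to expose service_read(request) -> latency
        self.pending_writes = []

    def submit(self, request):
        if request.is_write:            # assumed request attribute
            self.pending_writes.append(request)
            return 0.0                  # ideal: zero apparent latency, no interference
        return self.disk.service_read(request)   # reads never wait behind buffered writes

    def drain_when_idle(self):
        # Destage only during idle periods; in the ideal model this costs the
        # foreground workload nothing.
        self.pending_writes.clear()
```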
[Figure Ov.31 plot area: panels "Cache Accesses (per 10 ms) and System CPI," "Cache Power (per 10 ms) and Cumulative Energy," and "DRAM & Disk Power/Cumulative Energy/Access Numbers" for gzip, 128 MB; x-axis: time (s).]
FIGURE Ov.31: Full execution of GZIP, 128 MB DRAM. The figure shows the entire run of GZIP. System configuration is a 2-GHz Pentium processor with 128 MB of FB-DIMM main memory and a 12k-RPM disk drive with built-in disk cache. The figure shows the interaction between all components of the memory system, including the L1 instruction cache, the L1 data cache, the unified L2 cache, the DRAM system, and the disk drive. All graphs use the same x-axis, which represents the execution time in seconds. The x-axis does not start at zero; the measurements exclude system boot time, invocation of the shell, etc. Each data point represents aggregated (not sampled) activity within a 10-ms epoch. The CPI graph shows two system CPI values: one is the average CPI for each 10-ms epoch; the other is the cumulative average CPI. A duration with no CPI data point indicates that no instructions were executed due to I/O latency. The application is run in single-user mode, as is common for SPEC measurements; therefore, disk delay shows up as stall time. Note that the CPI, the DRAM accesses, and the disk accesses are plotted on log scales.
[Figure Ov.32 bar chart: CPI (0 to 1100) for benchmarks ammp, bzip2, gcc, gzip, mcf, mgrid, parser, twolf, and vortex; five bars per benchmark: no cache, 1x512 sectors (256 KB total), 2x512 sectors (512 KB total), 16x512 sectors (4 MB total), and 24x512 sectors (6 MB total).]
FIGURE Ov.32: The effects of disk cache size, varying the number of segments. The figure shows the effects of a different number of segments with the same segment size in the disk cache. The system configuration is 128 MB of DDR SDRAM with a 12k-RPM disk. There are five bars for each benchmark: (1) no cache, (2) 1 segment of 512 sectors, (3) 2 segments of 512 sectors, (4) 16 segments of 512 sectors, and (5) 24 segments of 512 sectors. Note that the CPI values are for the disk-intensive portion of application execution, not the CPU-intensive portion (which could otherwise blur distinctions).

Figure Ov.33 shows the behavior of the system with 128 MB and an ideal write buffer. As mentioned, the performance increase and energy decrease are due to the writes being buffered, allowing read requests to progress. Execution time is 75 seconds (compared to 67 seconds for a 512-MB system), and total energy is 1100 J (compared to 1260 J for a 512-MB system). For comparison, to show the effect of faster read and write throughput, Figure Ov.34 shows the behavior of the system with 128 MB and an 8-disk RAID-5 system. Execution time is 115 seconds, and energy consumption is 8.5 kJ. RAID achieves part of the same performance effect as write buffering by improving write time, thereby freeing up read bandwidth sooner. However, the benefit comes at a significant cost in energy.
Table Ov.5 gives breakdowns for gzip in tabular form, and the graphs beneath the table give the breakdowns for gzip, bzip2, and ammp in graphical form and for a wider range of parameters (different disk RPMs). The applications all demonstrate the same trends: cutting down the energy of a 512-MB system by reducing the memory to 128 MB causes both the performance and the energy to get worse. Performance degrades by a factor of 5-10; energy increases by 1.5× to 10×. Ideal write buffering can give the best of both worlds (the performance of a large memory system and the energy consumption of a small memory system), and its benefit is independent of the disk's RPM. Using a RAID system does not gain a significant performance improvement, but it consumes energy in proportion to the number of disks. Note, however, that this is a uniprocessor model running in single-user mode, so RAID is not expected to shine.
Figure Ov.35 shows the effects of disk caching and prefetching on both single-disk and RAID systems. In RAID systems, disk caching alone has only marginal effects on both the CPI and the disk's average response time. However, disk caching with prefetching has significant benefits. In a slow disk system (i.e., 5400 RPM), RAID has more tangible benefits over a non-RAID system. Nevertheless, the combination of RAID, the disk cache, and fast disks can improve overall performance by up to a factor of 10.
[Figure Ov.33 plot area: panels "Cache Accesses (per 10 ms) and System CPI," "Cache Power (per 10 ms) and Cumulative Energy," and "DRAM & Disk Power/Cumulative Energy/Access Numbers" for gzip, 128 MB + write buffer; x-axis: time (s).]
FIGURE Ov.33: Full execution of GZIP, 128 MB DRAM and ideal write buffer. The figure shows the entire run of GZIP. System configuration is a 2-GHz Pentium processor with 128 MB of FB-DIMM main memory and a 12k-RPM disk drive with built-in disk cache. The figure shows the interaction between all components of the memory system, including the L1 instruction cache, the L1 data cache, the unified L2 cache, the DRAM system, and the disk drive. All graphs use the same x-axis, which represents the execution time in seconds. The x-axis does not start at zero; the measurements exclude system boot time, invocation of the shell, etc. Each data point represents aggregated (not sampled) activity within a 10-ms epoch. The CPI graph shows two system CPI values: one is the average CPI for each 10-ms epoch; the other is the cumulative average CPI. A duration with no CPI data point indicates that no instructions were executed due to I/O latency. The application is run in single-user mode, as is common for SPEC measurements; therefore, disk delay shows up as stall time. Note that the CPI, the DRAM accesses, and the disk accesses are plotted on log scales.
[Figure Ov.34 plot area: panels "Cache Accesses (per 10 ms) and System CPI," "Cache Power (per 10 ms) and Cumulative Energy," and "DRAM & Disk Power/Cumulative Energy/Access Numbers" for gzip, 128 MB + 8 disks; x-axis: time (s).]
FIGURE Ov.34: Full execution of GZIP, 128 MB DRAM and RAID-5 disk system. The figure shows the entire run of GZIP. System configuration is a 2-GHz Pentium processor with 128 MB of FB-DIMM main memory and a RAID-5 system of eight 12k-RPM disk drives with built-in disk cache. The figure shows the interaction between all components of the memory system, including the L1 instruction cache, the L1 data cache, the unified L2 cache, the DRAM system, and the disk drive. All graphs use the same x-axis, which represents the execution time in seconds. The x-axis does not start at zero; the measurements exclude system boot time, invocation of the shell, etc. Each data point represents aggregated (not sampled) activity within a 10-ms epoch. The CPI graph shows two system CPI values: one is the average CPI for each 10-ms epoch; the other is the cumulative average CPI. A duration with no CPI data point indicates that no instructions were executed due to I/O latency. The application is run in single-user mode, as is common for SPEC measurements; therefore, disk delay shows up as stall time. Note that the CPI, the DRAM accesses, and the disk accesses are plotted on log scales.
TABLE Ov.5 Execution time and energy breakdowns for GZIP and BZIP2

GZIP configuration    Exec. time (s)   L1-I (J)   L1-D (J)   L2 (J)   DRAM (J)   Disk (J)   Total (J)
512 MB-12k                  66.8          129.4      122.1     25.4      440.8      544.1     1261.8
128 MB-12k                 169.3          176.5      216.4     67.7      419.6     1635.4     2515.6
128 MB-12k-WB                75.8          133.4      130.2     28.7      179.9      622.5     1094.7
128 MB-12k-RAID             113.9          151        165.5     44.8      277.8     7830       8469.1
[Three graphs beneath the table, each titled "System Breakdown Energy and Execution Time": AMMP (32 MB per rank; 1 rank or 4 ranks; disk RPM 5k, 12k, 20k; 1 disk or 8 RAID disks), BZIP2 (128 MB per rank; 1 rank or 4 ranks; disk RPM 5k, 12k, 20k; 1 disk or 8 RAID disks), and GZIP (128 MB per rank; 1 rank or 4 ranks; disk RPM 5k, 12k, 20k; 1 disk or 8 RAID disks); y-axis up to 8000.]

For the average response time, even though the write response time in a RAID system is much higher than the write response time in a single-disk system, this trend does not translate directly into overall performance. The write response time in a RAID system is higher due to parity calculations, especially for benchmarks with small writes. Despite the improvement in performance, care must be taken in applying RAID, because RAID increases energy proportionally to the number of disks.

Perhaps the most interesting result in Figure Ov.35 is that the CPI values (top graph) track the disk's average read response time (bottom graph) and not the disk's average response time (which includes both reads and writes, also in the bottom graph). This observation holds true both for read-dominated applications and for applications with significant write activity (as gzip and bzip2 are). The reason this is interesting is that the disk community tends to report performance numbers in terms of average response time rather than average read response time, presumably believing the former to be a better indicator of system-level performance than the latter. Our results suggest that the disk community would be better served by continuing to model the effects of write traffic (as it affects read latency) but reporting performance as the average read response time.
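The distinction between the two metrics is simple to state precisely; a minimal illustration follows (the request representation is assumed):

```python
def response_time_metrics(completed):
    """completed: iterable of (is_write, response_time_ms) tuples for finished requests."""
    all_times  = [t for _, t in completed]
    read_times = [t for is_write, t in completed if not is_write]
    avg_all  = sum(all_times)  / len(all_times)  if all_times  else 0.0
    avg_read = sum(read_times) / len(read_times) if read_times else 0.0
    return avg_all, avg_read

# With write buffering, writes complete almost instantly, so the overall average
# can look good (or, under RAID parity updates, bad) without saying much about
# the read latency that actually stalls the processor.
```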

Ov.4.5 Conclusions

We find that the disk cache can be an effective tool for improving performance at the system level. There is a significant interplay between the DRAM system and the disk's ability to buffer writes and prefetch reads. An ideal write buffer homed within the disk has the potential to move write traffic out of the way and begin working on read requests far sooner, with the result that a system can be made to perform nearly as well as one with four times the amount of main memory, but with roughly half the energy consumption of the configuration with more main memory.
[Figure Ov.35 plot area: bzip2, 112 MB, with 1, 4, or 8 disks; bars for no cache/no prefetch, cache/no prefetch, and cache with prefetch at 5k, 12k, and 20k RPM. Top graph: CPI on a linear scale (roughly 1000-6000). Bottom graph: average write, overall, and read response times on a log scale (10^1 to 10^5).]
FIGURE Ov.35: The effects of disk prefetching. The experiment identifies the effects of prefetching and caching in the disk cache. The configuration is 112 MB of DDR SDRAM running bzip2. The three bars in each group represent a single-disk system, a 4-disk RAID-5 system, and an 8-disk RAID-5 system. The top graph shows the CPI of each configuration, and the bottom graph shows the average response time of the disk requests. Note that the CPI axis is on a linear scale, but the disk average response time axis is on a log scale. The height of each bar in the average response time graph is the absolute value.
This is extremely important, because FB-DIMM systems are likely to have significant power-dissipation problems, and because of this they will run at the cutting edge of the storage-performance trade-off. Administrators will configure these systems to use the least amount of storage needed to achieve the desired performance, and thus a simple reduction in FB-DIMM storage will result in an unacceptable hit to performance. We have shown that an ideal write buffer in the disk system will solve this problem, transparently to the operating system.

Ov.5 What to Expect

What are the more important architecture-level issues in store for these technologies? On what problems should a designer concentrate?

For caches and SRAMs in particular, power dissipation and reliability are primary issues. A rule of thumb is that SRAMs typically account for at least one-third of the power dissipated by microprocessors, and the reliability of SRAM is the worst of the three technologies.

For DRAMs, power dissipation is becoming an issue with the high I/O speeds expected of future systems. The FB-DIMM, the only proposed architecture seriously being considered for adoption that would solve the capacity-scaling problem facing DRAM systems, dissipates roughly two orders of magnitude more power than a traditional organization (due to an order of magnitude higher per-DIMM power dissipation and the ability to put an order of magnitude more DIMMs into a system).

For disks, miniaturization and the development of heuristics for control are the primary considerations, but a related issue is the reduction of power dissipation in the drive's electronics and mechanisms. Another point is that some time this year the industry will see the first generation of hybrid disk drives: those with flash memory to do write caching. Initially, hybrid drives will be available only for mobile applications. One reason for a hybrid drive is to be able to have a di
