[Figure panels for Figure Ov.27: "Cache Accesses (per 10 ms) and System CPI," "Cache Power (per 10 ms) and Cumulative Energy," and "DRAM & Disk Power/Cumulative Energy/Access Numbers," all for gzip with 512 MB; each x-axis shows execution time in seconds.]
`
FIGURE Ov.27: Full execution of Gzip. The figure shows the entire run of gzip. System configuration is a 2-GHz Pentium processor with 512 MB of DDR2-533 FB-DIMM main memory and a 12k-RPM disk drive with built-in disk cache. The figure shows the interaction between all components of the memory system, including the L1 instruction and data caches, the unified L2 cache, the DRAM system, and the disk drive. All graphs use the same x-axis, which represents execution time in seconds. The x-axis does not start at zero; the measurements exclude system boot time, invocation of the shell, etc. Each data point represents aggregated (not sampled) activity within a 10-ms epoch. The CPI graph shows two system CPI values: one is the average CPI for each 10-ms epoch, and the other is the cumulative average CPI. A duration with no CPI data point indicates that no instructions were executed due to I/O latency. During such a window the CPI is essentially infinite, and thus, it is possible for the cumulative average to range higher than the displayed instantaneous CPI. Note that the CPI, the DRAM accesses, and the disk accesses are plotted on log scales.
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2141, p. 76
`
`
`
`Chapter Overview ON MEMORY SYSTEMS AND THEIR DESIGN 43
`
`more compute-intensive phase in which the
`CPI asymptotes down to the theoretical sus-
`tainable performance, the single-digit values
`that architecture research typically reports.
`By the end of execution, the total energy
`consumed in the FB-DIMM DRAM system
(half a kilojoule) almost equals the energy
consumed by the disk, and
`it is twice that of the L1 data cache, L1
`instruction cache, and unified L2 cache
`combined.
`
`Currently, there is substantial work happening
`in both industry and academia to address the lat-
`ter issue, with much of the work focusing on access
`scheduling, architecture improvements, and data
`migration. To complement this work, we look at
`a wide range of organizational approaches, i.e.,
`attacking the problem from a parameter point of
`view rather than a system-redesign, component-
`redesign, or new-proposed-mechanism point of
`view, and find significant synergy between the disk
`cache and the memory system. Choices in the disk-
`side cache affect both system-level performance and
`system-level (in particular, DRAM-subsystem-level)
`energy consumption. Though disk-side caches have
`been proposed and studied, their effect upon the total
`system behavior, namely execution time or CPI or
`total memory-system power including the effects of
`the operating system, is as yet unreported. For exam-
`ple, Zhu and Hu [2002] evaluate disk built-in cache
`using both real and synthetic workloads and report
`the results in terms of average response time. Smith
`[1985a and b] evaluates a disk cache mechanism
`with real traces collected in real IBM mainframes
`on a disk cache simulator and reports the results in
`terms of miss rate. Huh and Chang [2003] evaluate
`their RAID controller cache organization with a syn-
thetic trace. Varma and Jacobson [1998] and Solworth
`and Orji [1990] evaluate destaging algorithms and
`write caches, respectively, with synthetic workloads.
`This study represents the first time that the effects of
`the disk-side cache can be viewed at a system level
`(considering both application and operating-system
`effects) and compared directly to all the other com-
`ponents of the memory system.
`
`We use a full-system, execution-based simulator
`combining Bochs [Bochs 2006], Wattch [Brooks et al.
`2000], CACTI [Wilton & Jouppi 1994], DRAMsim [Wang
`et al. 2005, September], and DiskSim [Ganger et al.
`2006]. It boots the RedHat Linux 6.0 kernel and there-
`fore can capture all application behavior, and all operat-
`ing-system behavior, including I/O activity, disk-block
`buffering, system-call overhead, and virtual memory
`overhead such as translation, table walking, and page
`swapping. We investigate the disk-side cache in both
`single-disk and RAID-5 organizations. Cache parame-
`ters include size, organization, whether the cache sup-
`ports write caching or not, and whether it prefetches
`read blocks or not. Additional parameters include disk
`rotational speed and DRAM-system capacity.
`We find a complex trade-off between the disk
`cache, the DRAM system, and disk parameters like
`rotational speed. The disk cache, particularly its
`write-buffering feature, represents a very powerful
`tool enabling significant savings in both energy and
`execution time. This is important because, though the
`cache’s support for write buffering is often enabled in
`desktop operating systems (e.g., Windows and some
`but not all flavors of Unix/Linux [Ng 2006]), it is typi-
`cally disabled in enterprise computing applications
`[Ng 2006], and these are the applications most likely
`to use FB-DIMMs [Haas & Vogt 2005]. We find sub-
`stantial improvement between existing implemen-
`tations and an ideal write buffer (i.e., this is a limit
study). In particular, the disk cache’s write-buffering
ability can reduce the total energy consumption of
the memory system (including caches, DRAMs, and
disks) by nearly a factor of two, while sacrificing only
a small amount of performance.
`
Ov.4.2 Fully Buffered DIMM: Basics
`The relation between a traditional organiza-
`tion and a FB-DIMM organization is shown in Fig-
`ure Ov.28, which motivates the design in terms of a
`graphics-card organization. The first two drawings
`show a multi-drop DRAM bus next to a DRAM bus
`organization typical of graphics cards, which use
`point-to-point soldered connections between the
`DRAM and memory controller to achieve higher
`speeds. This arrangement is used in FB-DIMM.
`
`
`
`
`
[Figure Ov.28 diagrams: "Traditional (JEDEC) Organization" (DIMMs on a multi-drop bus, package pins, memory controller), "Graphics-Card Organization" (DRAM connected point to point to the controller), and "Fully Buffered DIMM Organization" (memory controller, northbound and southbound channels, an AMB controller on each of DIMM 0 and DIMM 1, DDRx SDRAM devices, up to 8 modules).]
`
FIGURE Ov.28: FB-DIMM and its motivation. The first two pictures compare the memory organizations of a JEDEC SDRAM system
`and a graphics card. Above each design is its side-profile, indicating potential impedance mismatches (sources of reflections). The
`organization on the far right shows how the FB-DIMM takes the graphics-card organization as its de facto DIMM. In the FB-DIMM
`organization, there are no multi-drop busses; DIMM-to-DIMM connections are point to point. The memory controller is connected
`to the nearest AMB via two unidirectional links. The AMB is, in turn, connected to its southern neighbor via the same two links.
`
`A slave memory controller has been added onto each
`DIMM, and all connections in the system are point
`to point. A narrow, high-speed channel connects the
`master memory controller to the DIMM-level mem-
`ory controllers (called Advanced Memory Buffers or
`AMBs). Since each DIMM-to-DIMM connection is
`a point-to-point connection, a channel becomes a
de facto multi-hop store-and-forward network. The
`FB-DIMM architecture limits the channel length to
`eight DIMMs, and the narrower inter-module bus
`requires roughly one-third as many pins as a tradi-
`tional organization. As a result, an FB-DIMM orga-
`nization can handle roughly 24 times the storage
`capacity of a single-DIMM DDR3-based system,
`without sacrificing any bandwidth and even leaving
`headroom for increased intra-module bandwidth.
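The capacity arithmetic behind that claim can be sanity-checked with a short sketch; only the ratios come from the text, and the rounding of the pin fraction is an assumption:

```python
# FB-DIMM capacity arithmetic: ~1/3 the pins per channel means roughly
# three FB-DIMM channels fit in one traditional channel's pin budget,
# and each FB-DIMM channel holds up to 8 DIMMs instead of 1.
FBDIMM_PIN_FRACTION = 1 / 3        # from the text: ~one-third as many pins
MAX_DIMMS_PER_CHANNEL = 8          # architectural channel-length limit

channels_in_same_pins = round(1 / FBDIMM_PIN_FRACTION)            # ~3 channels
capacity_multiple = channels_in_same_pins * MAX_DIMMS_PER_CHANNEL # 3 x 8 = 24
```

This reproduces the roughly 24x capacity figure relative to a single-DIMM DDRx-based system.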
`The AMB acts like a pass-through switch, directly
`forwarding the requests it receives from the controller
`
to successive DIMMs and forwarding frames from
`southerly DIMMs to northerly DIMMs or the mem-
`ory controller. All frames are processed to determine
`whether the data and commands are for the local
`DIMM. The FB-DIMM system uses a serial packet-
`based protocol to communicate between the memory
`controller and the DIMMs. Frames may contain data
`and/or commands. Commands include DRAM com-
`mands such as row activate (RAS), column read (CAS),
`refresh (REF) and so on, as well as channel commands
`such as write to configuration registers, synchroniza-
`tion commands, etc. Frame scheduling is performed
`exclusively by the memory controller. The AMB only
`converts the serial protocol to DDRx-based commands
`without implementing any scheduling functionality.
`The AMB is connected to the memory control-
`ler and/or adjacent DIMMs via unidirectional links:
`the southbound channel which transmits both data
`
`
`
`
`
`and commands and the northbound channel which
`transmits data and status information. The south-
`bound and northbound datapaths are 10 bits and
`14 bits wide, respectively. The FB-DIMM channel
`clock operates at six times the speed of the DIMM
`clock; i.e., the link speed is 4 Gbps for a 667-Mbps
DDRx system. Frames on the north- and southbound
`channel require 12 transfers (6 FB-DIMM channel
`clock cycles) for transmission. This 6:1 ratio ensures
`that the FB-DIMM frame rate matches the DRAM
`command clock rate.
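The 6:1 clocking arithmetic can be checked numerically, assuming (as is conventional) that both the DDRx bus and the FB-DIMM links transfer on both clock edges:

```python
# Clocking arithmetic for a 667-Mbps DDRx system on an FB-DIMM channel.
dimm_data_rate_mbps = 667
dimm_clock_mhz = dimm_data_rate_mbps / 2        # DDR: 2 transfers per DIMM clock
channel_clock_mhz = 6 * dimm_clock_mhz          # FB-DIMM clock is 6x the DIMM clock
link_speed_gbps = 2 * channel_clock_mhz / 1000  # links double-pumped -> ~4 Gbps

# A frame is 12 transfers = 6 channel clocks = exactly one DIMM clock,
# so one frame can be delivered per DRAM command slot.
frame_time_in_dimm_clocks = (12 / 2) / 6
```

The frame time of exactly one DIMM clock is what makes the FB-DIMM frame rate match the DRAM command clock rate.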
`Southbound frames comprise both data and
`commands and are 120 bits long; northbound
`frames are data only and are 168 bits long. In addi-
`tion to the data and command information, the
`frames also carry header information and a frame
CRC (cyclic redundancy check) checksum that is
`used to check for transmission errors. A north-
`bound read-data frame transports 18 bytes of data
in 6 FB-DIMM clocks or 1 DIMM clock. A DDRx sys-
`tem can burst back the same amount of data to the
`memory controller in two successive beats lasting
`an entire DRAM clock cycle. Thus, the read band-
`
`width of an FB-DIMM system is the same as that of
a single channel of a DDRx system. Due to the nar-
rower southbound channel, the write bandwidth
in FB-DIMM systems is one-half that available in a
DDRx system. However, this makes the total band-
width available in an FB-DIMM system 1.5 times
that of a DDRx system.
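The bandwidth comparison works out as follows; the 72-bit, ECC-protected DDRx data bus is an assumption used only to make the byte counts concrete:

```python
# Read bandwidth: a northbound frame carries 18 bytes per DIMM clock,
# which equals what a 72-bit DDRx bus bursts in two beats (one DIMM clock).
ddrx_bus_bits = 72                               # assumed: 64 data bits + 8 ECC bits
ddrx_bytes_per_clock = ddrx_bus_bits * 2 // 8    # two beats per DIMM clock -> 18 B
northbound_bytes_per_clock = 18                  # read-data frame payload

read_bw_ratio = northbound_bytes_per_clock / ddrx_bytes_per_clock  # 1.0
write_bw_ratio = 0.5           # narrower southbound channel, shared with commands
total_bw_ratio = read_bw_ratio + write_bw_ratio  # 1.5x a single DDRx channel
```

Because reads and writes travel on separate unidirectional links, the two ratios add, giving the 1.5x aggregate figure.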
`Figure Ov.29 shows the processing of a read trans-
`action in an FB-DIMM system. Initially, a command
`frame is used to transmit a command that will per-
`form row activation. The AMB translates the request
`and relays it to the DIMM. The memory controller
`schedules the CAS command in a following frame.
`The AMB relays the CAS command to the DRAM
`devices which burst the data back to the AMB. The
`AMB bundles two consecutive bursts of data into
`a single northbound frame and transmits it to the
`memory controller. In this example, we assume a
`burst length of four corresponding to two FB-DIMM
`data frames. Note that although the figures do not
identify parameters like t_CAS, t_RCD, and t_CWD,
`the memory controller must ensure that these con-
`straints are met.
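The frame sequence of that read transaction can be sketched as follows; this shows frame ordering only, and deliberately omits the timing constraints (t_RCD, t_CAS, etc.) that a real memory controller must enforce:

```python
def fbdimm_read_frames(burst_length=4):
    """Ordered (channel, contents) frames for one FB-DIMM read transaction."""
    frames = [
        ("southbound", "RAS"),   # command frame: row activate, relayed by the AMB
        ("southbound", "CAS"),   # later command frame: column read
    ]
    # The AMB bundles two consecutive DDRx data bursts into one northbound
    # frame, so a burst length of 4 produces two data frames.
    frames += [("northbound", f"DATA{i}") for i in range(burst_length // 2)]
    return frames
```

For the burst length of four used in the example, `fbdimm_read_frames()` yields two southbound command frames followed by two northbound data frames.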
`
[Figure Ov.29 timing diagram: FB-DIMM clock, southbound bus (RAS frame, then CAS frame), DIMM clock, DIMM command bus, DIMM data bus, and northbound bus carrying the D0-D3 data bursts back as frames.]
`
FIGURE Ov.29: Read transaction in an FB-DIMM system. The figure shows how a read transaction is performed in an FB-DIMM
`system. The FB-DIMM serial busses are clocked at six times the DIMM busses. Each FB-DIMM frame on the southbound bus takes
`six FB-DIMM clock periods to transmit. On the northbound bus a frame comprises two DDRx data bursts.
`
`
`
`
`
`The primary dissipater of power in an FB-DIMM
`channel is the AMB, and its power depends on its
`position within the channel. The AMB nearest to
`the memory controller must handle its own traffic
`and repeat all packets to and from all downstream
`AMBs, and this dissipates the most power. The AMB
`in DDR2-533 FB-DIMM dissipates 6 W, and it is cur-
`rently 10 W for 800 Mbps DDR2 [Staktek 2006]. Even
`if one averages out the activity on the AMB in a long
`channel, the eight AMBs in a single 800-Mbps chan-
`nel can easily dissipate 50 W. Note that this number
`is for the AMBs only; it does not include power dis-
`sipated by the DRAM devices.
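A rough bound on the AMB-only channel power, using the per-AMB figures above (the averaging of downstream activity is left qualitative, as in the text):

```python
# Worst case: all eight AMBs in an 800-Mbps channel repeating full traffic.
amb_watts = 10                     # per-AMB dissipation at 800 Mbps (from the text)
ambs_per_channel = 8
worst_case_watts = amb_watts * ambs_per_channel   # 80 W upper bound

# Downstream AMBs see progressively less traffic, so the averaged figure is
# lower; the text's ~50 W estimate sits comfortably within this bound.
```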
`
Ov.4.3 Disk Caches: Basics
`Today’s disk drives all come with a built-in cache
`as part of the drive controller electronics, ranging in
`size from 512 KB for the micro-drive to 16 MB for the
`largest server drives. Figure Ov.30 shows the cache
`and its place within a system. The earliest drives
`had no cache memory, as they had little control
`electronics. As the control of data transfer migrated
`
`from the host-side control logic to the drive’s own
`controller, a small amount of memory was needed
`to act as a speed-matching buffer, because the disk’s
`media data rate is different from that of the inter-
`face. Buffering is also needed because when the
`head is at a position ready to do data transfer, the
`host or the interface may be busy and not ready to
`receive read data. DRAM is usually used as this buf-
`fer memory.
`In a system, the host typically has some memory
`dedicated for caching disk data, and if a drive is
`attached to the host via some external controller, that
`controller also typically has a cache. Both the system
`cache and the external cache are much larger than
`the disk drive’s internal cache. Hence, for most work-
`loads, the drive’s cache is not likely to see too many
`reuse cache hits. However, the disk-side cache is very
`effective in opportunistically prefetching data, as
`only the controller inside the drive knows the state
`the drive is in and when and how it can prefetch
`without adding any cost in time. Finally, the drive
`needs cache memory if it is to support write cach-
`ing/buffering.
`
[Figure Ov.30 diagram: applications (web server, etc.) atop the operating system (kernel, buffer cache) atop the hardware; (a) cached data held in the OS buffer cache and (b) cached data held in the drive's own disk cache.]
`
`FIGURE Ov.30: Buffer caches and disk caches. Disk blocks are cached in several places, including (a) the operating system’s
`buffer cache in main memory and (b), on the disk, in another DRAM buffer, called a disk cache.
`
`
`
`
`
`With write caching, the drive controller services a
`write request by transferring the write data from the
`host to the drive’s cache memory and then reports
`back to the host that the write is "done," even
`though the data has not yet been written to the disk
`media (data not yet written out to disk is referred to
`as dirty). Thus, the service time for a cached write is
`about the same as that for a read cache hit, involving
`only some drive controller overhead and electronic
`data transfer time but no mechanical time. Clearly,
`write caching does not need to depend on having
`the right content in the cache memory for it to work,
`unlike read caching. Write caching will always work,
`i.e., a write command will always be a cache hit, as
`long as there is available space in the cache mem-
`ory. When the cache becomes full, some or all of the
`dirty data are written out to the disk media to free
`up space. This process is commonly referred to as
`destage.
`Ideally, destage should be done while the drive
`is idle so that it does not affect the servicing of read
requests. However, this may not always be possible.
The drive may be operating in a high-usage system
with little idle time, or writes may arrive in bursts
that quickly fill the limited memory space of the
cache. When destage must take place while the drive
is busy, such activity adds to the load of the drive at
that time, and a user will notice a longer response
`time for his requests. Instead of providing the full
`benefit of cache hits, write caching in this case merely
`delays the disk writes.
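The write-caching and destage behavior just described can be illustrated with a toy model; this is a sketch of the general mechanism, not the controller algorithm of any particular drive:

```python
from collections import deque

class DiskWriteCache:
    """Toy disk-side write cache: writes acknowledge immediately,
    dirty blocks reach the media only on destage."""

    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.dirty = deque()      # dirty blocks awaiting destage
        self.media_writes = 0     # actual writes to the platters

    def write(self, block):
        """Service a host write: report 'done' once the block is cached."""
        if len(self.dirty) >= self.capacity:
            self.destage()        # cache full: forced destage (delays reads)
        self.dirty.append(block)
        return "done"             # acknowledged before reaching the media

    def destage(self, max_blocks=None):
        """Flush dirty blocks to the media, ideally while the drive is idle."""
        n = len(self.dirty) if max_blocks is None else min(max_blocks, len(self.dirty))
        for _ in range(n):
            self.dirty.popleft()
            self.media_writes += 1

cache = DiskWriteCache(capacity_blocks=4)
for blk in range(6):              # six writes into a four-block cache
    cache.write(blk)
```

After six writes, the fifth write forces a destage of four blocks, and two dirty blocks remain cached — illustrating how bursty writes turn "free" cache hits into deferred media traffic.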
`Zhu and Hu [2002] have suggested that large
`disk built-in caches will not significantly benefit
`the overall system performance because all mod-
`ern operating systems already use large file system
`caches to cache reads and writes. As suggested by
`Przybylski [1990], the reference stream missing a
`first-level cache and being handled by a second-
`level cache tends to exhibit relatively low locality. In
`a real system, the reference stream to the disk sys-
`tem has missed the operating system’s buffer cache,
`and the locality in the stream tends to be low. Thus,
`our simulation captures all of this activity. In our
`experiments, we investigate the disk cache, includ-
`ing the full effects of the operating system’s file-sys-
`tem caching.
`
Ov.4.4 Experimental Results
`Figure Ov.27 showed the execution of the GZIP
`benchmark with a moderate-sized FB-DIMM DRAM
`system: half a gigabyte of storage. At 512 MB, there is no
`page swapping for this application. When the storage
`size is cut in half to 256 MB, page swapping begins but
does not affect the execution time significantly. When
`the storage size is cut to one-quarter of its original size
`(128 MB), the page swapping is significant enough
`to slow the application down by an order of magni-
`tude. This represents the hard type of decision that a
`memory-systems designer would have to face: if one
`can reduce power dissipation by cutting the amount
`of storage and feel negligible impact on performance,
`then one has too much storage to begin with.
`Figure Ov.31 shows the behavior of the system
`when storage is cut to 128 MB. Note that all aspects
`of system behavior have degraded; execution time
`is longer, and the system consumes more energy.
`Though the DRAM system’s energy has decreased
`from 440 J to just under 410 J, the execution time has
`increased from 67 to 170 seconds, the total cache
`energy has increased from 275 to 450 J, the disk energy
`has increased from 540 to 1635 J, and the total energy
`has doubled from 1260 to 2515 J. This is the result of
`swapping activity--not enough to bring the system to
`its knees, but enough to be relatively painful.
`We noticed that there exists in the disk subsystem
`the same sort of activity observed in a microproces-
`sor’s load/store queue: reads are often stalled waiting
`for writes to finish, despite the fact that the disk has
`a 4-MB read/write cache on board. The disk’s cache
`is typically organized to prioritize prefetch activity
`over write activity because this tends to give the best
`performance results and because the write buffering
`is often disabled by the operating system. The solu-
`tion to the write-stall problem in microprocessors
`has been to use write buffers; we therefore modified
`DiskSim to implement an ideal write buffer on the
`disk side that would not interfere with the disk cache.
`Figure Ov.32 indicates that the size of the cache seems
`to make little difference to the behavior of the system.
`The important thing is that a cache is present. Thus,
`we should not expect read performance to suddenly
`increase as a result of moving writes into a separate
`write buffer.
`
`
`
`
`
[Figure Ov.31 panels: "Cache Accesses (per 10 ms) and System CPI," "Cache Power (per 10 ms) and Cumulative Energy," and "DRAM & Disk Power/Cumulative Energy/Access Numbers," all for gzip with 128 MB; each x-axis shows execution time in seconds.]
`
FIGURE Ov.31: Full execution of GZIP, 128 MB DRAM. The figure shows the entire run of gzip. System configuration is a 2-GHz Pentium processor with 128 MB of FB-DIMM main memory and a 12k-RPM disk drive with built-in disk cache. The figure shows the interaction between all components of the memory system, including the L1 instruction cache, the L1 data cache, the unified L2 cache, the DRAM system, and the disk drive. All graphs use the same x-axis, which represents the execution time in seconds. The x-axis does not start at zero; the measurements exclude system boot time, invocation of the shell, etc. Each data point represents aggregated (not sampled) activity within a 10-ms epoch. The CPI graph shows two system CPI values: one is the average CPI for each 10-ms epoch; the other is the cumulative average CPI. A duration with no CPI data point indicates that no instructions were executed due to I/O latency. The application is run in single-user mode, as is common for SPEC measurements; therefore, disk delay shows up as stall time. Note that the CPI, the DRAM accesses, and the disk accesses are plotted on log scales.
`
`
`
`
`
[Figure Ov.32 bar chart, y-axis 0-1100: five bars per benchmark (no cache; 1x512 sectors, 256 KB total; 2x512 sectors, 512 KB total; 16x512 sectors, 4 MB total; 24x512 sectors, 6 MB total) for ammp, bzip2, gcc, gzip, mcf, mgrid, parser, twolf, and vortex.]
`
FIGURE Ov.32: The effects of disk cache size by varying the number of segments. The figure shows the effects of a different number of segments with the same segment size in the disk cache. The system configuration is 128 MB of DDR SDRAM with a 12k-RPM disk. There are five bars for each benchmark: (1) no cache, (2) 1 segment of 512 sectors, (3) 2 segments of 512 sectors each, (4) 16 segments of 512 sectors each, and (5) 24 segments of 512 sectors each. Note that the CPI values are for the disk-intensive portion of application execution, not the CPU-intensive portion (which could otherwise blur distinctions).
`
Figure Ov.33 shows the behavior of the system with
128 MB and an ideal write buffer. As mentioned, the
performance increase and energy decrease are due to
the writes being buffered, allowing read requests to
progress. Execution time is 75 seconds (compared
to 67 seconds for a 512-MB system), and total energy
is 1100 J (compared to 1260 J for a 512-MB system).
For comparison, to show the effect of faster read and
write throughput, Figure Ov.34 shows the behavior of
the system with 128 MB and an 8-disk RAID-5 system.
Execution time is 115 seconds, and energy consump-
tion is 8.5 kJ. This achieves part of the same perfor-
mance effect as write buffering by improving write
time, thereby freeing up read bandwidth sooner. How-
ever, the benefit comes at a significant cost in energy.
Table Ov.5 gives breakdowns for gzip in tabu-
lar form, and the graphs beneath the table give the
breakdowns for gzip, bzip2, and ammp in graphical
form and for a wider range of parameters (different
disk RPMs). The applications all demonstrate the
same trend: cutting the energy of a 512-MB system
by reducing the memory to 128 MB causes both the
performance and the energy to get worse. Perfor-
mance degrades by a factor of 5-10; energy increases
by 1.5× to 10×. Ideal write buffer-
`ing can give the best of both worlds (performance of
`a large memory system and energy consumption of
`a small memory system), and its benefit is indepen-
`dent of the disk’s RPM. Using a RAID system does
`not gain significant performance improvement, but
`it consumes energy proportionally to the number
`of disks. Note, however, that this is a uniprocessor
`model running in single-user mode, so RAID is not
`expected to shine.
`Figure Ov.35 shows the effects of disk caching
`and prefetching on both single-disk and RAID sys-
`tems. In RAID systems, disk caching has only mar-
`ginal effects to both the CPI and the disk average
`response time. However, disk caching with prefetch-
`ing has significant benefits. In a slow disk system (i.e.,
`5400 RPM), RAID has more tangible benefits over a
`non-RAID system. Nevertheless, the combination of
`using RAID, disk cache, and fast disks can improve
`the overall performance up to a factor of 10. For the
`
`
`
`
`
[Figure Ov.33 panels: "Cache Accesses (per 10 ms) and System CPI," "Cache Power (per 10 ms) and Cumulative Energy," and "DRAM & Disk Power/Cumulative Energy/Access Numbers," all for gzip with 128 MB plus write buffer; each x-axis shows execution time in seconds.]
`
FIGURE Ov.33: Full execution of GZIP, 128 MB DRAM and ideal write buffer. The figure shows the entire run of gzip. System configuration is a 2-GHz Pentium processor with 128 MB of FB-DIMM main memory and a 12k-RPM disk drive with built-in disk cache. The figure shows the interaction between all components of the memory system, including the L1 instruction cache, the L1 data cache, the unified L2 cache, the DRAM system, and the disk drive. All graphs use the same x-axis, which represents the execution time in seconds. The x-axis does not start at zero; the measurements exclude system boot time, invocation of the shell, etc. Each data point represents aggregated (not sampled) activity within a 10-ms epoch. The CPI graph shows two system CPI values: one is the average CPI for each 10-ms epoch; the other is the cumulative average CPI. A duration with no CPI data point indicates that no instructions were executed due to I/O latency. The application is run in single-user mode, as is common for SPEC measurements; therefore, disk delay shows up as stall time. Note that the CPI, the DRAM accesses, and the disk accesses are plotted on log scales.
`
`
`
`
`
[Figure Ov.34 panels: "Cache Accesses (per 10 ms) and System CPI," "Cache Power (per 10 ms) and Cumulative Energy," and "DRAM & Disk Power/Cumulative Energy/Access Numbers," all for gzip with 128 MB plus an 8-disk RAID-5 system; each x-axis shows execution time in seconds.]
`
FIGURE Ov.34: Full execution of GZIP, 128 MB DRAM and RAID-5 disk system. The figure shows the entire run of gzip. System configuration is a 2-GHz Pentium processor with 128 MB of FB-DIMM main memory and a RAID-5 system of eight 12k-RPM disk drives with built-in disk cache. The figure shows the interaction between all components of the memory system, including the L1 instruction cache, the L1 data cache, the unified L2 cache, the DRAM system, and the disk drive. All graphs use the same x-axis, which represents the execution time in seconds. The x-axis does not start at zero; the measurements exclude system boot time, invocation of the shell, etc. Each data point represents aggregated (not sampled) activity within a 10-ms epoch. The CPI graph shows two system CPI values: one is the average CPI for each 10-ms epoch; the other is the cumulative average CPI. A duration with no CPI data point indicates that no instructions were executed due to I/O latency. The application is run in single-user mode, as is common for SPEC measurements; therefore, disk delay shows up as stall time. Note that the CPI, the DRAM accesses, and the disk accesses are plotted on log scales.
`
`
`
`
`
TABLE Ov.5: Execution time and energy breakdowns for GZIP and BZIP2

GZIP
Configuration      Time (s)   L1-I (J)   L1-D (J)   L2 (J)   DRAM (J)   Disk (J)   Total (J)
512 MB-12 K        66.8       129.4      122.1      25.4     440.8      544.1      1261.8
128 MB-12 K        169.3      176.5      216.4      67.7     419.6      1635.4     2515.6
128 MB-12 K-WB     75.8       133.4      130.2      28.7     179.9      622.5      1094.7
128 MB-12 K-RAID   113.9      151.0      165.5      44.8     277.8      7830.0     8469.1

[BZIP2 rows not recoverable from the scan.]
`
[Graphs beneath the table: "System Breakdown Energy and Execution Time" for AMMP (32 MB per rank), BZIP2 (128 MB per rank), and GZIP (128 MB per rank), each with 1 rank or 4 ranks, disk RPM of 5k, 12k, or 20k, and 1 disk or 8 RAID disks.]
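The component energies in Table Ov.5 can be cross-checked against the reported totals (which of the two L1 columns is instruction vs. data is inferred from the chapter's usual ordering; the arithmetic does not depend on it):

```python
# Each row of Table Ov.5: component energies in joules
# (L1-I, L1-D, L2, DRAM, disk) and the reported total.
table_ov5 = {
    "512 MB-12 K":      ([129.4, 122.1, 25.4, 440.8, 544.1], 1261.8),
    "128 MB-12 K":      ([176.5, 216.4, 67.7, 419.6, 1635.4], 2515.6),
    "128 MB-12 K-WB":   ([133.4, 130.2, 28.7, 179.9, 622.5], 1094.7),
    "128 MB-12 K-RAID": ([151.0, 165.5, 44.8, 277.8, 7830.0], 8469.1),
}
for config, (components, total) in table_ov5.items():
    # Components should sum to the reported total within rounding error.
    assert abs(sum(components) - total) < 0.05, config
```

Every row checks out, and the sums also match the prose (e.g., total cache energy of roughly 275 J rising to roughly 450 J when memory is cut to 128 MB).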
`
average response time, even though the write
response time in a RAID system is much higher than
that in a single-disk system, this trend does not
translate directly into overall performance. The write
response time in a RAID system is higher due to
parity calculations, especially for benchmarks with
small writes. Despite the improvement in perfor-
mance, care must be taken in applying RAID, because
RAID increases the energy proportionally to the
number of disks.
`Perhaps the most interesting result in Figure Ov.35 is
`that the CPI values (top graph) track the disk’s average
`read response time (bottom graph) and not the disk’s
`average response time (which includes both reads and
`writes, also bottom graph). This observation holds true
`for both read-dominated applications and applications
`with significant write activity (as are gzip and bzip2).
`The reason this is interesting is that the disk commu-
`nity tends to report performance numbers in terms of
`average response time and not average read response
`
`time, presumably believing the former to be a better
`indicator of system-level performance than the latter.
Our results suggest that the disk community would
be better served by continuing to model the effects of
write traffic (as it affects read latency) while reporting
performance as the average read response time.
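A toy example of why the overall average hides read stalls when writes are cached (the numbers below are illustrative, not measurements from the text):

```python
# With write caching, writes acknowledge almost immediately, so they drag the
# overall average response time down even while reads queue behind destage.
read_ms = [12.0, 15.0, 11.0]          # read response times (what the CPI tracks)
write_ms = [0.2, 0.2, 0.2, 0.2]       # cached writes: controller overhead only

avg_read = sum(read_ms) / len(read_ms)
avg_overall = (sum(read_ms) + sum(write_ms)) / (len(read_ms) + len(write_ms))
# avg_overall looks far healthier than avg_read, yet execution time
# follows avg_read, as Figure Ov.35 shows.
```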
`
Ov.4.5 Conclusions
`We find that the disk cache can be an effective tool
`for improving performance at the system level. There
`is a significant interplay between the DRAM system
`and the disk’s ability to buffer writes and prefetch
`reads. An ideal write buffer homed within the disk has
`the potential to move write traffic out of the way and
`begin working on read requests far sooner, with the
`result that a system can be made to perform nearly
`as well as one with four times the amount of main
`memory, but with roughly half the energy consump-
`tion of the configuration with more main memory.
`
`
`
`
`
[Figure Ov.35 panels for bzip2, 112 MB: the top graph plots CPI (linear scale, 0-6000+) for 1-disk, 4-disk, and 8-disk systems under no cache/no prefetch, cache/no prefetch, and cache-with-prefetch at 5k, 12k, and 20k RPM; the bottom graph plots average write, overall, and read response times on a log scale (10^1-10^5).]
`
FIGURE Ov.35: The effects of disk prefetching. The experiment tries to identify the effects of prefetching and caching in the disk cache. The configuration is 112 MB of DDR SDRAM running bzip2. The three bars in each group represent a single-disk system, a 4-disk RAID-5 system, and an 8-disk RAID-5 system. The upper graph shows the CPI of each configuration, and the lower graph shows the average response time of the disk requests. Note that the CPI axis is in linear scale, but the disk average-response-time axis is in log scale. The height of each bar in the average response time graph is the absolute value.
`
`
`
`
`
`This is extremely important, because FB-DIMM
`systems are likely to have significant power-dissipa-
`tion problems, and because of this they will run at
`the cutting edge of the storage-performance trade-
`off. Administrators will configure these systems to
`use the least amount of storage available to achieve
`the desired performance, and thus a simple reduc-
`tion in FB-DIMM storage will result in an unaccept-
`able hit to performance. We have shown that an ideal
`write buffer in the disk system will solve this problem,
`transparently to the operating system.
`
`Ov.5 What to [xpect
`What are the more important architecture-level
`issues in store for these technologies? On what prob-
`lems should a designer concentrate?
`For caches and SRAMs in particular, power dis-
`sipation and reliability are primary issues. A rule of
`thumb is that SRAMs typically account for at least
`one-third of the power dissipated by microproces-
`sors, and the reliability for SRAM is the worst of the
`three technologies.
`For DRAMs, power dissipation is becoming an
`issue with the high I/O speeds expected of future sys-
`tems. The FB-DIMM, the only proposed architecture
`seriously being considered for adoption that would
`solve the capacity-scaling problem facing DRAM
`systems, dissipates roughly two orders of magnitude
`more power than a traditional organization (due to
`an order of magnitude higher per DIMM power dis-
`sipation and the ability to put an order of magnitude
`more DIMMs into a system).
`For disks, miniaturization and development of
heuristics for control are the primary consider-
ations, but a related issue is the reduction of power
`dissipation in the drive’s electronics and mechanisms.
`Another point is that some time this year, the indus-
`try will be seeing the first generation of hybrid disk
`drives: those with flash memory to do write caching.
`Initially, hybrid drives will be available only for mobile
`applications. One reason for a hybrid drive is to be
`able to have a di