Fully-Buffered DIMM Memory Architectures: Understanding Mechanisms, Overheads and Scaling

Brinda Ganesh†, Aamer Jaleel‡, David Wang†, and Bruce Jacob†
†University of Maryland, College Park, MD
‡VSSAD, Intel Corporation, Hudson, MA
{brinda, blj}@eng.umd.edu
Abstract

Performance gains in memory have traditionally been obtained by increasing memory bus widths and speeds. The diminishing returns of such techniques have led to the proposal of an alternate architecture, the Fully-Buffered DIMM. This new standard replaces the conventional memory bus with a narrow, high-speed interface between the memory controller and the DIMMs. This paper examines how traditional DDRx-based memory controller policies for scheduling and row buffer management perform on a Fully-Buffered DIMM memory architecture. The split-bus architecture used by FBDIMM systems results in an average improvement of 7% in latency and 10% in bandwidth at higher utilizations. On the other hand, at lower utilizations, the increased cost of serialization results in a degradation in latency and bandwidth of 25% and 10% respectively. The split-bus architecture also makes system performance sensitive to the ratio of read and write traffic in the workload. In larger configurations, we found that FBDIMM system performance was more sensitive to usage of the FBDIMM links than to DRAM bank availability. In general, FBDIMM performance is similar to that of DDRx systems, and provides better performance characteristics at higher utilization, making it a relatively inexpensive mechanism for scaling capacity at higher bandwidth requirements. The mechanism is also largely insensitive to scheduling policies, provided certain ground rules are obeyed.
1. Introduction

The growing size of application working sets, the popularity of software-based multimedia and graphics workloads, and the increased use of speculative techniques have contributed to the rise in bandwidth and capacity requirements of computer memory sub-systems. Capacity needs have been met by building increasingly dense DRAM chips (the 2Gb DRAM chip is expected to hit the market in 2007), and bandwidth needs have been met by scaling front-side bus data rates and using wide, parallel buses. However, the traditional bus organization has reached a point where it fails to scale well into the future.

The electrical constraints of high-speed parallel buses complicate bus scaling in terms of loads, speeds, and widths [1]. Consequently, the maximum number of DIMMs per channel in successive DRAM generations has been decreasing. SDRAM channels have supported up to 8 DIMMs of memory, some types of DDR channels support only 4 DIMMs, DDR2 channels support only 2 DIMMs, and DDR3 channels are expected to support only a single DIMM. In addition, the serpentine routing required for electrical path-length matching of the data wires becomes challenging as bus widths increase. For the bus widths of today, the motherboard area occupied by a single channel is significant, complicating the task of adding capacity by increasing the number of channels. To address these scalability issues, an alternate DRAM technology, the Fully Buffered Dual Inline Memory Module (FBDIMM) [2], has been introduced.
The FBDIMM memory architecture replaces the shared parallel interface between the memory controller and DRAM chips with a point-to-point serial interface between the memory controller and an intermediate buffer, the Advanced Memory Buffer (AMB). The on-DIMM interface between the AMB and the DRAM modules is identical to that seen in DDR2 or DDR3 systems, which we shall refer to as DDRx systems. The serial interface is split into two uni-directional buses, one for read traffic (the northbound channel) and another for write and command traffic (the southbound channel), as shown in Fig 1. FBDIMM adopts a packet-based protocol that bundles commands and data into frames that are transmitted on the channel and then converted to the DDRx protocol by the AMB.

The FBDIMM's unique interface raises the question of whether existing memory controller policies will continue to be relevant to future memory architectures. In particular we ask the following questions:
• How do DDRx and FBDIMM systems compare with respect to latency and bandwidth?
• What is the most important factor that impacts the performance of FBDIMM systems?
• How do FBDIMM systems respond to choices made for parameters like row buffer management policy, scheduling policy and topology?
Our study indicates that an FBDIMM system provides significant capacity increases over a comparable DDRx system at relatively little cost for systems with high bandwidth requirements. However, at lower utilization requirements, the FBDIMM protocol results in an average latency degradation of 25%. We found the extra cost of serialization is also apparent in
configurations with shallow channels, i.e., 1-2 ranks per channel, but can be successfully hidden in systems with deeper channels by taking advantage of the available DIMM-level parallelism.

The sharing of write data and command bandwidth is a significant bottleneck in these systems. An average of 20-30% of the southbound channel bandwidth is wasted due to the inability of the memory controller to fully utilize the available command slots in a southbound frame. These factors together contribute to the queueing delays for a read transaction in the system.

More interestingly, we found that the general performance characteristics and trends seen for applications executing on the FBDIMM architecture were similar in many ways to those seen in DDRx systems. We found that many of the memory controller policies that were most effective in DDRx systems were also effective in an FBDIMM system.

This paper is organized as follows. Section 2 describes the FBDIMM memory architecture and protocol. Section 3 describes the related work. Section 4 describes the experimental framework and methodology. Section 5 presents detailed results and analysis of the observed behavior. Section 6 summarizes the final conclusions of the paper.
2. Background

2.1 FBDIMM Overview

As DRAM technology has advanced, channel speed has improved at the expense of channel capacity; consequently channel capacity has dropped from 8 DIMMs to a single DIMM per channel. This is a significant limitation; for a server designer it is critical, as servers typically depend on memory capacity for their performance. There is an obvious dilemma: future designers would like both increased channel capacity and increased data rates, but how can one provide both?

For every DRAM generation, graphics cards use the same DRAM technology as is found in commodity DIMMs, but operate their DRAMs at significantly higher speeds by avoiding multi-drop connections. The trend is clear: to improve bandwidth, one must reduce the number of impedance discontinuities on the bus.

This is achieved in FBDIMM systems by replacing the multi-drop bus with point-to-point connections. Fig 1 shows the altered memory interface, with a narrow, high-speed channel connecting the master memory controller to the DIMM-level memory controllers (called Advanced Memory Buffers, or AMBs). Since each DIMM-to-DIMM connection is a point-to-point connection, a channel becomes a de facto multi-hop store-and-forward network. The FBDIMM architecture limits the channel length to 8 DIMMs, and the narrower inter-module bus requires roughly one third as many pins as a traditional organization. As a result, an FBDIMM organization can handle roughly 24 times the storage capacity of a single-DIMM DDR3-based system, without sacrificing any bandwidth and even leaving headroom for increased intra-module bandwidth.
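As a rough illustration of where a 24x capacity figure can come from (a back-of-the-envelope sketch of our own; the pin counts are illustrative assumptions, not values from the paper): if an FBDIMM channel needs about one third the pins of a parallel DDRx channel, a fixed pin budget hosts roughly three times as many channels, and each FBDIMM channel holds 8 DIMMs instead of 1.

```python
# Back-of-the-envelope sketch of FBDIMM capacity scaling relative to a
# single-DIMM DDR3 channel. The pin counts are illustrative assumptions.
DDRX_CHANNEL_PINS = 240                          # assumed pins for one parallel DDRx channel
FBDIMM_CHANNEL_PINS = DDRX_CHANNEL_PINS // 3     # "roughly one third as many pins"

DIMMS_PER_DDR3_CHANNEL = 1      # DDR3 expected to support a single DIMM per channel
DIMMS_PER_FBDIMM_CHANNEL = 8    # FBDIMM channel length limit

# For the same motherboard pin budget, how many channels of each kind fit?
pin_budget = DDRX_CHANNEL_PINS                   # budget of exactly one DDRx channel
ddrx_channels = pin_budget // DDRX_CHANNEL_PINS
fbd_channels = pin_budget // FBDIMM_CHANNEL_PINS

ddrx_dimms = ddrx_channels * DIMMS_PER_DDR3_CHANNEL
fbd_dimms = fbd_channels * DIMMS_PER_FBDIMM_CHANNEL

print(f"DDR3 DIMMs in budget:   {ddrx_dimms}")                       # 1
print(f"FBDIMM DIMMs in budget: {fbd_dimms}")                        # 3 * 8 = 24
print(f"Capacity ratio:         {fbd_dimms / ddrx_dimms:.0f}x")      # 24x
```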
The AMB acts like a pass-through switch, directly forwarding the requests it receives from the controller to successive DIMMs and forwarding frames from southerly DIMMs to northerly DIMMs or the memory controller. All frames are processed to determine whether the data and commands are for the local DIMM.
2.2 FBDIMM protocol

The FBDIMM system uses a serial packet-based protocol to communicate between the memory controller and the DIMMs. Frames may contain data and/or commands. Commands include DRAM commands such as row activate (RAS), column read (CAS), refresh (REF) and so on, as well as channel commands such as writes to configuration registers, synchronization commands, etc. Frame scheduling is performed exclusively by the memory controller. The AMB only converts the serial protocol to DDRx-based commands without implementing any scheduling functionality.

The AMB is connected to the memory controller and/or adjacent DIMMs via serial uni-directional links: the southbound channel, which transmits both data and commands, and the northbound channel, which transmits data and status information. The southbound and northbound data paths are 10 bits and 14 bits wide respectively. The channel is dual-edge clocked, i.e., data is transmitted on both clock edges. The FBDIMM channel clock operates at 6 times the speed of the DIMM clock, i.e., the link speed is 4 Gbps for a 667 Mbps DDRx system. Frames on the northbound and southbound channels require 12 transfers (6 FBDIMM channel clock cycles) for transmission. This 6:1 ratio ensures that the FBDIMM frame rate matches the DRAM command clock rate.

Southbound frames comprise both data and commands and are 120 bits long; data-only northbound frames are 168 bits long. In addition to the data and command information, the frames also carry header information and a frame CRC checksum that is used to check for transmission errors.
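The frame sizes follow directly from the link widths, the dual-edge clocking, and the 12-transfer frame length. The short sketch below is ours, not from the paper; its only inputs are the figures quoted above for the 667 Mbps DDRx example.

```python
# Sketch of the FBDIMM link-rate and frame-size relationships described above.
dimm_data_rate_mbps = 667       # DDRx data rate (dual-edge signalling)
clock_ratio = 6                 # FBDIMM channel clock = 6 x DIMM clock
transfers_per_frame = 12        # one frame = 12 transfers = 6 channel clocks
southbound_width_bits = 10
northbound_width_bits = 14

# Per-lane link speed: one transfer per channel clock edge, so the signalling
# rate is the DDRx data rate multiplied by the 6:1 clock ratio.
link_speed_gbps = dimm_data_rate_mbps * clock_ratio / 1000.0
print(f"link speed per lane : {link_speed_gbps:.0f} Gbps")                            # ~4 Gbps

# Frame sizes: lane width x transfers per frame.
print(f"southbound frame    : {southbound_width_bits * transfers_per_frame} bits")    # 120
print(f"northbound frame    : {northbound_width_bits * transfers_per_frame} bits")    # 168

# One frame takes 12 transfers = 6 channel clocks = 1 DIMM clock, so the frame
# rate matches the DRAM command clock rate, as stated in the text.
```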
Figure 1. FBDIMM Memory System. In the FBDIMM organization there are no multi-drop busses; DIMM-to-DIMM connections are point-to-point. The memory controller is connected to the nearest AMB via two uni-directional links: a 14-bit northbound channel and a 10-bit southbound channel. Each AMB is in turn connected to its southern neighbor via the same two links, for up to 8 modules per channel.

The types of frames defined by the protocol [3] are illustrated in Figure 2. They are described below.
Figure 2. FBDIMM Frames. The figure illustrates the various types of FBDIMM frames used for data and command transmission: (a) Command Frame, (b) Southbound Data Frame, (c) Northbound Data Frame. Each frame occupies 6 FBDIMM clocks, i.e., 1 DDRx DIMM clock.
Command Frame: comprises up to three commands (Fig 2(a)). Each command in the frame is sent to a separate DIMM on the southbound channel. In the absence of available commands to occupy a complete frame, no-ops are used to pad the frame.

Command + Write Data Frame: carries 72 bits of write data (8 bytes of data and 1 byte of ECC) and one command on the southbound channel (Fig 2(b)). Multiple such frames are required to transmit an entire data block to the DIMM. The AMB of the addressed DIMM buffers the write data as required.

Data Frame: comprises 144 bits, i.e., 16 bytes of data and 2 bytes of ECC information (Fig 2(c)). Each frame is filled by the data transferred from the DRAM in two beats, or a single DIMM clock cycle. This is a northbound frame and is used to transfer read data.

A northbound data frame transports 18 bytes of data in 6 FBDIMM clocks, or 1 DIMM clock. A DDRx system can burst back the same amount of data to the memory controller in two successive beats lasting an entire DRAM clock cycle. Thus, the read bandwidth of an FBDIMM system is the same as that of a single channel of a DDRx system. A southbound frame, on the other hand, transports half the amount of data, 9 bytes as compared to 18 bytes, in the same duration. Thus, the FBDIMM southbound channel transports half the number of bytes that a DDRx channel can burst write in 1 DIMM clock, making the write bandwidth in FBDIMM systems one half of that available in a DDRx system. This makes the total bandwidth available in an FBDIMM system 1.5 times that in a DDRx system.
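To make the 1.5x figure concrete, the following sketch (ours; the peak-rate numbers are illustrative assumptions for a 667 Mbps part with an 8-byte data bus) compares peak per-channel read, write, and total bandwidth.

```python
# Sketch of the peak-bandwidth comparison described above. Assumed part:
# 667 Mbps DDRx with an 8-byte-wide data bus (illustrative values only).
dimm_clock_mhz = 333.5            # 667 Mbps dual-edge => ~333.5 MHz DIMM clock
ddrx_bytes_per_dimm_clock = 16    # 8-byte bus, 2 beats per DIMM clock

# DDRx: one bidirectional data bus, shared by reads and writes.
ddrx_peak = ddrx_bytes_per_dimm_clock * dimm_clock_mhz * 1e6 / 1e9        # GB/s

# FBDIMM: each DIMM clock, the northbound link delivers one data frame
# (16 data bytes + 2 ECC) while the southbound link delivers half as much
# write data (8 data bytes + 1 ECC), and the two links run concurrently.
fbd_read_peak = 16 * dimm_clock_mhz * 1e6 / 1e9    # same data payload as DDRx
fbd_write_peak = 8 * dimm_clock_mhz * 1e6 / 1e9    # half the DDRx payload

print(f"DDRx peak (read OR write): {ddrx_peak:.1f} GB/s")
print(f"FBDIMM read peak         : {fbd_read_peak:.1f} GB/s")
print(f"FBDIMM write peak        : {fbd_write_peak:.1f} GB/s")
print(f"FBDIMM total / DDRx      : {(fbd_read_peak + fbd_write_peak) / ddrx_peak:.2f}x")  # 1.50x
```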
The FBDIMM has two operational modes: fixed latency mode and variable latency mode. In fixed latency mode, the round-trip latency of frames is set to that of the furthest DIMM; hence, systems with deeper channels have larger average latencies. In variable latency mode, the round-trip latency from each DIMM depends on its distance from the memory controller.
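A minimal sketch of the difference between the two modes, assuming an illustrative per-hop AMB forwarding delay and base access cost (the paper does not give these numbers):

```python
# Minimal sketch of fixed vs. variable latency modes. BASE_NS and HOP_NS are
# illustrative assumptions, not values from the FBDIMM specification.
BASE_NS = 20.0      # assumed controller + DRAM access component (ns)
HOP_NS = 2.0        # assumed AMB pass-through delay per DIMM hop, each way (ns)

def round_trip_ns(dimm_index: int, channel_depth: int, mode: str) -> float:
    """Round-trip frame latency to DIMM `dimm_index` (0 = nearest to controller)."""
    if mode == "fixed":
        hops = channel_depth - 1   # every DIMM is padded to the furthest DIMM's latency
    else:                          # "variable"
        hops = dimm_index          # latency depends on the DIMM's distance
    return BASE_NS + 2 * HOP_NS * hops

depth = 8
for mode in ("fixed", "variable"):
    lat = [round_trip_ns(i, depth, mode) for i in range(depth)]
    print(f"{mode:>8}: average over {depth} DIMMs = {sum(lat) / depth:.1f} ns")
```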
Figure 3 shows the processing of a read transaction in an FBDIMM system. Initially a command frame is used to transmit the command that performs row activation. The AMB translates the request and relays it to the DIMM. The memory controller schedules the CAS command in a following frame. The AMB relays the CAS command to the DRAM devices, which burst the data back to the AMB. The AMB bundles two consecutive bursts of data into a single northbound frame and transmits it to the memory controller. In this example, we assume a burst length of 4, corresponding to 2 FBDIMM data frames. Note that although the figures do not identify parameters like t_CAS, t_RCD and t_CWD [4, 5], the memory controller must ensure that these constraints are met.
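The sketch below walks through the same read transaction at frame granularity, in DIMM clocks. The timing parameters are illustrative assumptions (the paper defers to [4, 5] for real values); the point is only the ordering of frames the controller must respect.

```python
# Frame-level walk-through of the read transaction of Figure 3, in DIMM clocks.
# T_RCD and T_CAS are illustrative assumptions, not values from [4, 5].
T_RCD = 4          # assumed RAS-to-CAS delay (DIMM clocks)
T_CAS = 4          # assumed CAS latency (DIMM clocks)
BURST_LENGTH = 4   # beats; the AMB packs 2 beats into one northbound frame

# Each southbound or northbound frame occupies exactly one DIMM clock.
ras_frame = 0                         # southbound frame carrying the RAS
cas_frame = ras_frame + T_RCD         # earliest frame that may carry the CAS
first_data_beat = cas_frame + T_CAS   # DRAM starts bursting data to the AMB
data_frames = BURST_LENGTH // 2       # 2 northbound frames for a burst of 4

northbound = [(first_data_beat + 1 + i, f"D{2 * i}-D{2 * i + 1}")
              for i in range(data_frames)]   # +1: AMB buffers, then forwards

print(f"southbound: RAS @ clock {ras_frame}, CAS @ clock {cas_frame}")
for clock, beats in northbound:
    print(f"northbound: frame with beats {beats} @ clock {clock}")
```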
Figure 4 shows the processing of a memory write transaction in an FBDIMM system. The write data requires twice the number of FBDIMM frames as the read data. All read and write data are buffered in the AMB before being relayed on, as indicated by the staggered timing of data on the respective buses. The CAS command is transmitted in the same frame as the last chunk of data.
Figure 3. Read Transaction in FBDIMM system. The figure shows how a read transaction is performed in an FBDIMM system. The FBDIMM serial buses are clocked at 6 times the DIMM buses. Each FBDIMM frame on the southbound bus takes 6 FBDIMM clock periods to transmit. On the northbound bus a frame comprises two DDRx data bursts. (The figure traces the RAS and CAS frames on the southbound bus, the corresponding commands and data beats D0-D3 on the DIMM command and data buses, and the resulting D0-D1 and D2-D3 frames on the northbound bus.)
3. Related Work

Burger et al. [6] demonstrated that many high-performance mechanisms lower latency by increasing bandwidth demands. Cuppu et al. [7] demonstrated that memory manufacturers have successfully scaled the data-rates of DRAM chips by employing pipelining but have not reduced the latency of DRAM device operations. In their follow-on paper [8] they showed that system bus configuration choices significantly impact overall system performance.

There have been several studies of memory controller policies that examine how to lower latency while simultaneously improving bandwidth utilization. Rixner et al. [9] studied how re-ordering accesses at the controller to increase row buffer hits can be used to lower latency and improve bandwidth for media processing streams. Rixner [10] studied how such re-ordering benefitted from the presence of SRAM caches on the DRAM for web servers. Natarajan et al. [11] studied how memory controller policies for row buffer management and command scheduling impact latency and sustained bandwidth.

Lin et al. [12] studied how memory-controller-based prefetching can lower the system latency. Zhu et al. [13] studied how awareness of the resource usage of threads in an SMT could be used to prioritize memory requests. Zhu et al. [14] studied how split-transaction support would lower the latency of individual operations.

Intelligent static address mapping techniques have been used by Lin et al. [12] and Zhang et al. [15] to lower memory latency by increasing row-buffer hits. The Impulse group at the University of Utah [16] proposed adding an additional layer of address mapping in the memory controller to reduce memory latency by mapping non-adjacent data to the same cacheline and thus increasing cacheline sub-block usage.

This paper studies how various memory controller policies perform on the fully-buffered DIMM system. By doing a detailed analysis of the performance characteristics of DDRx-specific scheduling policies and row buffer management policies on the FBDIMM memory system, the bottlenecks in these systems and their origin are identified. To the best of our knowledge,
this is the first study to examine the FBDIMM in this detail.
4. Experimental Framework

This study uses DRAMsim [4], a stand-alone memory system simulator. DRAMsim provides a detailed execution-driven model of a fully-buffered DIMM memory system. The simulator also supports the variation of memory system parameters of interest, including scheduling policies, memory configuration (i.e., number of ranks and channels), address mapping policy, etc.

We studied four different DRAM types: a conventional DDR2 system with a data rate of 667 Mbps, a DDR3 system with a data rate of 1333 Mbps, and their corresponding FBDIMM systems: an FBDIMM organization with DDR2 on the DIMM and another with DDR3 on the DIMM. FBDIMM modules are modelled using the parameters available for a 4GB module [3]; DDR2 devices are modelled as 1Gb parts with 8 banks of memory [5]; DDR3 parameters were based on the proposed JEDEC standard [17]. A combination of multi-program and multi-threaded workloads is used. We used application memory traces as inputs.

•SVM-RFE (Support Vector Machines Recursive Feature Elimination) [18] is used to eliminate gene redundancy from a given input data set to provide compact gene subsets. This is a read-intensive workload (reads comprise approximately 90% of the traffic). It is part of the BioParallel suite [24].
•SPEC-MIXED comprises 16 independent SPEC [19] applications: AMMP, applu, art, gcc, lucas, mgrid, swim, vortex, wupwise, bzip2, galgel, gzip, mcf, parser, twolf and vpr. The combined workload has a read to write traffic ratio of approximately 2:1.
•UA is the Unstructured Adaptive benchmark, part of the NAS benchmark suite, NPB 3.1 OMP [20]. This workload has a read to write traffic ratio of approximately 3:2.
•ART is the OpenMP version of the SPEC 2000 ART benchmark, part of the SPEC OMP 2001 suite [21]. This has a read to write ratio of around 2:1.
Figure 4. Write Transaction in FBDIMM system. The figure shows how a write transaction is performed in an FBDIMM system. The FBDIMM serial buses are clocked at 6 times the DIMM buses. Each FBDIMM frame on the southbound bus takes 6 FBDIMM clock periods to transmit. A 32 byte cacheline takes 8 frames to be transmitted to the DIMM. (The figure traces the RAS, the write-data frames DD0-DD3 and the CAS on the southbound bus, and the corresponding commands and data beats D0-D3 on the DIMM command and data buses.)
Figure 5. Latency and Bandwidth characteristics for different DRAM technologies. The figure shows the best latency and bandwidth characteristics for the SPEC-Mixed workload in an open page system, for (a) DDR2, (b) DDR3, (c) FBD-DDR2 and (d) FBD-DDR3. Systems with identical numbers of DIMMs are grouped together on the x-axis (1 to 64 total ranks). Within a group, bars on the left represent topologies with fewer channels (e.g., 1C8R through 8C1R for the 8-rank group). The left y-axis depicts latency in nanoseconds, broken down into components (transaction queue, default latency, DIMM, north link, south link, and their overlaps); the right y-axis plots bandwidth in Gbps. Note that the lines represent bandwidth and the bars represent latency.
All multi-threaded traces were gathered using CMP$im [22], a tool that models the cache hierarchy. Traces were gathered using 16-way CMP processor models with 8 MB of last-level cache. SPEC benchmark traces were gathered using the Alpha 21264 simulator sim-alpha [23], with a 1 MB last-level cache and an in-order core.

Several configuration parameters that were varied in this study include:

Number of ranks and their configuration. FBDIMM systems support six channels of memory, each with up to eight DIMMs. This study expands that slightly, investigating a configuration space of 1-64 ranks, with ranks organized in 1-8 channels of 1-8 DIMMs each. Though many of these configurations are not relevant for DDRx systems, we study them to understand the impact of serialization on performance. We study configurations with one rank per DIMM.

Row-buffer management policies. We study both open-page and closed-page systems. Note that the address mapping policy is selected so as to reduce bank conflicts in the system. The address mapping policies sdram_close_page_map and sdram_hiperf_map [4] provide the required mapping for closed page and open page systems respectively.

Scheduling policies. Several DDRx DRAM command scheduling policies were studied. These include:
•Greedy: The memory controller issues a command as soon as it is ready to go. Priority is given to DRAM commands associated with older transactions.
•Open Bank First (OBF): used only for open-page systems. Here priority is given to all transactions which are issued to a currently open bank.
•Most pending: the "most pending" scheduling policy proposed by Rixner [9], where priority is given to DRAM commands to banks which have the maximum number of pending transactions.
•Least pending: the "least pending" scheduling policy proposed by Rixner [9]. Commands belonging to transactions to banks with the least number of outstanding transactions are given priority.
•Read-Instruction-Fetch-First (RIFF): prioritizes memory read transactions (data and instruction fetch) over memory write transactions.

5. Results

We study two critical aspects of memory system performance: the average latency of a memory read operation, and the sustained bandwidth.

5.1 The Bottom Line

Figure 5 compares the behavior of the different DRAM systems studied. The graphs show the variation in latency and bandwidth with system configuration for the SPEC-Mixed workload executing in an open page system. In the graphs, systems with different topologies but identical numbers of ranks are grouped together on the x-axis. Within a group, the bars on the left represent topologies with more ranks per channel; as one moves from left to right, the same ranks are distributed across more channels. For example, for the group of configurations with a total of 8 ranks of memory, the
leftmost or first bar has the 8 ranks all on 1 channel, the next bar has 2 channels each with 4 ranks, the third bar has 4 channels with 2 ranks each, and the rightmost in the group has the 8 ranks distributed across 8 channels, i.e., 8 channels with 1 rank each. A more detailed explanation of the latency components will be given in later sections.

The variation in latency and bandwidth follows the same general trend regardless of DRAM technology. This observation is true for all the benchmarks studied. At high system utilizations, FBDIMM systems have average improvements in bandwidth and latency of 10% and 7% respectively, while at lower utilizations DDRx systems have average latency and bandwidth improvements of 25% and 10%. Overall, though, FBDIMM systems have an average latency which is 15% higher than that of DDRx systems, but an average of 3% better bandwidth. Moving from a DDR2-based system to a DDR3 system improves latency by 25-50% and bandwidth by 20-90%.

Figure 6. Latency and Bandwidth characteristics for different benchmarks. The figure shows the best latency and bandwidth characteristics for a DDR2 closed page system, for (a) ART, (b) SPEC-Mixed, (c) SVM-RFE and (d) UA. Systems with identical numbers of DIMMs are grouped together on the x-axis. Within a group, the bars on the left represent topologies with fewer channels. The left y-axis plots latency in nanoseconds, while the right y-axis plots bandwidth in Gbps. Note that the lines represent bandwidth and the bars represent latency.

Figure 6 compares the behavior of the different workloads studied. All data is plotted for a DDR2 closed page system (not an open page system as in the earlier figure) but is fairly representative of other DRAM technologies as well. The variation in latency and bandwidth is different for each workload. The performance of SPEC-Mixed improves until a configuration with 16 ranks, after which it flattens out. ART and UA see significant latency reductions and marginal bandwidth improvements as channel count is increased from 1 to 4. By contrast, SVM-RFE, which has the lowest memory request rate of all the applications, does not benefit from increasing the number of ranks.
Figure 7 shows the latency and bandwidth characteristics for the SPEC-MIXED and ART workloads in a closed page system. The relative performance of an FBDIMM system with respect to its DDRx counterpart is a strong function of the system utilization. At low system utilizations the additional serialization cost of an FBDIMM system is exposed, leading to higher latency values. At high system utilizations, as in the single channel cases, an FBDIMM system has lower latency than a DDRx system due to the ability of the protocol to allow a larger number of in-flight transactions. Increasing the channel depth typically increases the latency for FBDIMM systems while DDRx systems experience a slight drop. For DDRx systems the increase in ranks essentially comes for free because the DDRx protocols do not provide for arbitrary-length turnaround times. For FBDIMM systems with fewer channels the latency of the system drops with increasing channel depths. By being able to simultaneously schedule transactions to different DIMMs and transmit data to and from the memory controller, FBDIMM systems are able to offset the increased cost of serialization. Further, moving the parallel bus off the motherboard and onto the DIMM helps hide the overhead of bus turn-around time. In more lightly loaded systems, like some of the multi-channel configurations, the lower number of in-flight transactions in a given channel means that FBDIMMs are unable to hide the increased transmission costs.

The bandwidth characteristics of an FBDIMM system are better than those of the corresponding DDRx system for the heterogeneous workload, SPEC-Mixed (Fig 7d). SPEC-Mixed has concurrent read and write traffic and is thereby able to better utilize the additional system bandwidth available in an FBDIMM system.
Figure 7. Latency and Bandwidth characteristics for different DRAM technologies. The figures plot the best observed latency and bandwidth characteristics for the various DRAM technologies studied (DDR2, FBD-DDR2, DDR3, FBD-DDR3) for different system sizes, for ART and SPEC-Mixed: (a) ART latency, (b) SPEC-MIXED latency, (c) ART bandwidth, (d) SPEC-MIXED bandwidth. The y-axis of figures (a) and (b) shows the average latency in nanoseconds; that of figures (c) and (d) shows the observed bandwidth in Gbps. The x-axis shows the various system topologies grouped according to channel count (1, 2, 4 or 8 channels); within each channel group, points to the left represent configurations with fewer ranks (1, 2, 4 or 8 ranks).
The improvements in bandwidth are greater in busier configurations, i.e., those with fewer channels. Busier workloads tend to use both FBD channels simultaneously, whereas reads and writes have to take turns in a DDRx system. As the channel count is increased and channel utilization gets lower, the gains in bandwidth diminish until they vanish completely. This is especially true for DDR3-based systems due to their significantly higher data rates and consequently lower channel utilization. In the case of the homogeneous workloads, e.g. ART (Fig 7c), which have lower request rates, varying the number of channels and ranks does not have a significant impact.

The RIFF scheduling policy, which prioritizes reads over writes, has the lowest read latency for both DDRx and FBDIMM systems. No single scheduling policy is seen to be most effective in improving the bandwidth characteristics of the system.
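As a concrete illustration of the kind of read-over-write prioritization RIFF performs, here is a minimal sketch of such a selection step (our own simplification; the actual controller model is DRAMsim's, and the transaction fields below are assumptions, not its real data structures):

```python
# Minimal sketch of a RIFF-style (Read-and-Instruction-Fetch-First) selection
# step: among ready transactions, reads and instruction fetches are preferred
# over writes, with ties broken by age (oldest first).
from dataclasses import dataclass

@dataclass
class Transaction:
    arrival_cycle: int
    kind: str           # "READ", "IFETCH", or "WRITE" (assumed labels)

def riff_select(ready):
    """Pick the next transaction under a RIFF policy, or None if the queue is empty."""
    if not ready:
        return None
    # Sort key: writes last (False < True), then oldest first.
    return min(ready, key=lambda t: (t.kind == "WRITE", t.arrival_cycle))

# Example: the younger read is scheduled ahead of the older write.
queue = [Transaction(100, "WRITE"), Transaction(105, "READ")]
print(riff_select(queue))   # Transaction(arrival_cycle=105, kind='READ')
```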
5.2 Latency

In this section, we look at the trends in latency, and the factors impacting them, in further detail. The overall behavior of the contributors to read latency is characteristic of the system rather than of a particular scheduling policy or row buffer management policy. Hence, for this discussion we use a single scheduling policy: RIFF in the case of closed page systems and OBF in the case of open page systems.

Figure 8 shows the average read latency divided into queueing delay overhead and transaction processing overhead. The queueing delay component refers to the duration the transaction waited in the queue for one or more resources to become available. The causes of queueing delay include memory controller request queue availability, south link availability, DIMM availability (including on-DIMM command and data buses and bank conflicts), and north link availability. Note that the overlap of queueing delay is monitored for all components except the memory controller request queue factor. The default latency cost is the cost associated with making a read request in an unloaded channel.
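The decomposition can be thought of as the accounting sketched below (our own illustration of the bookkeeping; the component names mirror the figure legend, but the numbers and the overlap handling are assumptions):

```python
# Sketch of the read-latency decomposition described above: total latency is
# the default (unloaded-channel) cost plus per-resource queueing delays.
# The numbers are made up; only the bookkeeping structure is the point.
read_latency_components_ns = {
    "transaction_queue": 6.0,   # wait for a memory-controller request-queue slot
    "south_link":        9.0,   # wait for a southbound command/data frame slot
    "dimm":             14.0,   # wait for banks / on-DIMM command and data buses
    "north_link":        4.0,   # wait for a northbound data frame slot
    "default_latency":  55.0,   # cost of a read in an otherwise unloaded channel
}

# Cycles in which a transaction is blocked on more than one of the link/DIMM
# resources at once are counted only once (tracked here as a single overlap term).
overlap_ns = 5.0

total_ns = sum(read_latency_components_ns.values()) - overlap_ns
queueing_ns = total_ns - read_latency_components_ns["default_latency"]
print(f"average read latency : {total_ns:.1f} ns")
print(f"  of which queueing  : {queueing_ns:.1f} ns")
```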
The queueing delay experienced by a read transaction is rarely due to any single factor, but usually due to a combination of factors. Changing the system configuration, be it by adding more ranks to the channel or by increasing the number of channels in the system, changes the availability of all the resources to different degrees. Thus, the latency trends due to changes in a configuration are affected by how exactly these individual queueing delays change.

Typically, single-rank configurations have higher latencies due to an insufficient number of memory banks to distribute requests to. In such systems the DIMM is the dominant system bottleneck and can contribute as much as 50% of the overall transaction queueing delay. Adding more ranks to these systems helps reduce the DIMM-based queueing delays. The reductions are 20-60% when going from a one-rank to a two-rank channel. For 4-8 rank channels, the DIMM-based queueing delay is typically only 10% of that in the single-rank channel.

In an FBDIMM system, the sharing of the southbound bus by command and write data results in a significant queueing delay associ
