362 I CHAPTER 22 THE PENTIUM II PROCESSOR

FIGURE 22-4 Instruction decoder.
Performance Considerations

The only thing you have to worry about, as far as the decoders are concerned, is to apply the 4:1:1 template as often as you can. With the 4:1:1 template, when you schedule your instructions, you arrange them such that the first instruction breaks down to four or fewer micro-ops, and the following two instructions break down to one micro-op each. By repeating this template, you guarantee maximum decoder efficiency. You can easily apply the template with the help of VTune's static analyzer, described in the previous chapter. Ideally, the Pentium II processor can decode three instructions every clock cycle. In reality, however, you never sustain this throughput, because you cannot always apply the 4:1:1 template, or because the decoder stalls from branch mispredictions or RAT stalls.

Use the event counters to measure the efficiency of the instruction decoder as follows:

Instructions decoded per clock = Inst_Decoded / Clock Cycles
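As a rough sanity check, the grouping behavior of the 4:1:1 template can be sketched in Python. This is a simplification we made up for illustration; the real decoders have additional constraints (instruction length limits, for example):

```python
# Toy model of the Pentium II decoders: one complex decoder (up to 4
# micro-ops) followed by two simple decoders (1 micro-op each) per clock.
def decode_cycles(uops_per_insn):
    cycles = 0
    i = 0
    n = len(uops_per_insn)
    while i < n:
        cycles += 1
        if uops_per_insn[i] > 4:
            # Assume an instruction longer than 4 micro-ops occupies the
            # decode cycle alone (a simplifying assumption of this sketch).
            i += 1
            continue
        i += 1                       # complex decoder takes this instruction
        for _ in range(2):           # the two simple decoders
            if i < n and uops_per_insn[i] == 1:
                i += 1
    return cycles

# A perfect 4:1:1 stream decodes three instructions per cycle...
assert decode_cycles([4, 1, 1, 4, 1, 1]) == 2
# ...while a 2-micro-op instruction in a "simple" slot breaks the group and
# has to wait for the complex decoder in the next cycle.
assert decode_cycles([1, 2, 1, 1]) == 2
```

The second assertion shows why the template matters: the same four instructions would decode in a single extra cycle once the 2-micro-op instruction is scheduled first in its group.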
22.6 Register Alias Table Unit

22.6.1 Operational Overview

Internally, the Pentium II processor has forty virtual registers, which are used to hold intermediate calculation results. When a new micro-op is decoded, the Register Alias Table (RAT) unit renames the IA register (eax, ebx, and so forth) to one of the virtual registers. At any given instance, an IA register could be mapped to one or more virtual registers.

How does it work? Consider the following sequence of instructions and their related micro-ops. (Notice that the listed micro-ops are just mnemonics that we made up to illustrate the point.)

The RAT aliases each of the IA registers (eax, ecx, and edx) to one of the forty internal virtual registers vr0, vr1, and so forth. Notice that the RAT assigns a new virtual
TABLE 22-5 IA Instructions and Their Related Micro-ops

1. mov eax, Mem       uLoad  vr0:eax, Mem
2. add edx, eax       uAdd   vr1:edx, vr0:eax
3. mov eax, 12        uLoad  vr2:eax, 12
4. add eax, ecx       uAdd   vr2:eax, vr3:ecx
5. add ecx, edx       uAdd   vr3:ecx, vr1:edx
register for the same IA register only when the IA register is loaded with a new value. If the register is only read from, the last virtual register is used. In our example, the eax register is assigned a new virtual register in both instructions 1 and 3, since both instructions load a new value into eax. But in instruction 5 the RAT uses the same virtual register, vr1:edx, since the instruction does not load a new value into edx; it is a source operand.

Now, let's see what happens to the micro-ops from Table 22-5 once they're handed to the execution unit:
In clock 1 the execution unit executes micro-ops 1 and 3—in two different execution ports. Even though both micro-ops write to the same IA register, eax, the processor executes the opcodes at the same time, since they write to two different virtual registers.

In clock 2 the execution unit stalls on micro-op 2 because of the dependency on the vr0:eax register from micro-op 1. But micro-op 4 is ready to execute, so it does—assuming that vr3:ecx is ready.

Since micro-op 5 depends on the result of micro-op 2, it can only execute after micro-op 2 executes. Micro-op 2 executes whenever vr0:eax gets its value from memory. Meanwhile, the execution unit processes other micro-ops that are ready and waiting in the ROB.

So why is register renaming useful? Consider the third micro-op, uLoad vr2:eax, 12. Without register renaming, the micro-op has to wait for the first two micro-ops to execute before it can execute; of course, micro-op 4 has to wait as well. With register renaming, micro-ops 3 and 4 were able to execute while the processor was loading data from memory.
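The renaming rule (a fresh virtual register whenever a register is loaded with a new value, reuse of the last one when it is only read) can be illustrated with a toy renamer. This is our own illustration, not Intel's implementation; for brevity, source operands are shown by virtual register name only:

```python
# Toy RAT: map IA registers to virtual registers vr0, vr1, ...
import itertools

IA_REGS = {"eax", "ebx", "ecx", "edx"}

def rename(instructions):
    fresh = itertools.count()
    alias = {}                           # IA register -> current virtual reg
    def current(reg):                    # reads reuse the last assigned vr
        if reg not in alias:
            alias[reg] = "vr%d" % next(fresh)
        return alias[reg]
    renamed = []
    for op, dest, src in instructions:
        src_v = current(src) if src in IA_REGS else src
        if op == "mov":                  # dest is loaded with a new value,
            alias[dest] = "vr%d" % next(fresh)   # so it gets a fresh vr
        renamed.append((op, "%s:%s" % (current(dest), dest), src_v))
    return renamed
```

Running it on the five instructions of Table 22-5 reproduces the vr0 through vr3 assignments shown there, including the reuse of vr1:edx in instruction 5.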
22.6.2 Performance Considerations
The RAT is affected by one of the major performance bottlenecks in the Pentium II processor—partial register stalls. You'll typically notice such stalls when you run Pentium-optimized code on the Pentium II processor. Eliminating partial register stalls is one of the most obvious and most rewarding optimizations you can achieve on the Pentium II processor.

Partial stalls occur when an instruction that writes to an 8- or 16-bit register (al, ah, ax) is followed by an instruction that reads a larger set of that same register (eax). For example, the Pentium Pro will suffer a partial stall if you write to the al or ah register and then read the ax or eax register.

Notice that partial stalls can still occur even if the second instruction does not immediately follow the first instruction. Since partial register stalls could last for more than 7 cycles, on average, you can avoid partial stalls if you separate the two instructions in question by a minimum of 7 cycles. Or you can fix them.

The Pentium II processor implements special cases to eliminate partial stalls in order to simplify the blending of code across processors. To eliminate a partial stall, you insert a SUB or XOR instruction in front of the original instruction to clear out the larger register. Figure 22-5 shows all the possible partial register stalls and which flavor of the XOR or SUB instructions you can use to eliminate such stalls.
FIGURE 22-5 How to eliminate partial register stalls in the Pentium II processor. (The figure is a table: for each combination of a small register you write and a larger register you then read, it lists the instruction to insert before the first instruction, such as xor ah, ah; xor ax, ax; or xor eax, eax, or the equivalent sub forms; writing to eax itself causes no partial stall.)
In the three examples, we've added the XOR or SUB instructions in front of the original code in order to eliminate partial register stalls.

    xor eax, eax
    mov ax, mem16
    read eax

    sub ax, ax
    mov al, mem8
    read ax

    xor ah, ah
    mov al, mem8
    read ax

You can use VTune's static analyzer to easily detect partial register stalls in your code. You can also use the Partial_Rat_Stalls event counter to measure the number of cycles wasted by partial register stalls.
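The pattern a static analyzer looks for can be approximated with a simple scan. This helper is our own sketch, not a VTune feature, and the containment map is abbreviated to the registers discussed above:

```python
# Flag any write to a small register that is followed, within a window of
# instructions, by a read of a larger register that contains it.
CONTAINS = {"eax": {"al", "ah", "ax"}, "ax": {"al", "ah"}}

def find_partial_stalls(code, window=7):
    """code is a list of (register_written, register_read) pairs."""
    stalls = []
    for i, (written, _) in enumerate(code):
        for j in range(i + 1, min(i + 1 + window, len(code))):
            read = code[j][1]
            if read in CONTAINS and written in CONTAINS[read]:
                stalls.append((i, j))    # (writer index, reader index)
                break
    return stalls

# mov ax, mem16 ; read eax  -> stall between instructions 0 and 1
assert find_partial_stalls([("ax", None), (None, "eax")]) == [(0, 1)]
# xor eax, eax ; read eax   -> no stall: the full register was written
assert find_partial_stalls([("eax", None), (None, "eax")]) == []
```

The window parameter mirrors the guideline above: pairs separated by at least 7 instructions are assumed to be far enough apart not to stall.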
22.7 Reorder Buffer and Execution Units

22.7.1 Operational Overview

The Reorder Buffer (ROB, a.k.a. Reservation Station) is at the heart of the out-of-order execution of the Pentium II processor. The ROB can receive up to three micro-ops from the RAT and can retire up to three micro-ops in one clock cycle. It can hold a maximum of forty micro-ops at any given time. (See Figure 22-6.)
FIGURE 22-6 Pentium II processor Reorder Buffer and the execution units (ports 0-4). (The figure shows the Register Alias Table feeding the ROB, which dispatches micro-ops to the execution ports: integer, floating-point, and MMX units, a load unit, and address generation/store units.)
The Pentium II processor implements a data flow machine, which leads to the out-of-order execution. In a data flow machine, the order of execution of micro-ops is determined solely by the readiness of their data, not by the order in which they entered the ROB. Let's see how this model works.

    1. load   R4, [R1]    4
    2. shiftL R4, 2       1
    3. move   R2, R3      1
    4. shiftL R2, 2       1
    5. add    R2, R3      1

Consider the coined pseudo-code fragment above. Assume that only one instruction can execute at a time, and that each takes the number of cycles listed to the right. In a sequential (in-order) processor, it takes the code fragment 8 clocks to execute.

Now, consider a data flow machine where instructions execute based on the availability of their data, not on the order in which they appeared. Let's examine what happens every clock cycle:
1. The first instruction starts to execute immediately.
2. The second instruction stalls for the next 3 clocks in the ROB because it needs the value of R4 to execute. Instead, instruction 3 executes (no data dependency).
3. Instruction 4 executes.
4. Instruction 5 executes. Also, the R4 value becomes valid.
5. Instruction 2 is now ready to execute, so it does.

As you can see, with out-of-order execution it only takes 5 clocks to execute, compared to 8 clocks for the sequential execution model. Even though the micro-ops were executed out of order, the final results are exactly the same because they are written out in the order they came in.
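The clock-by-clock walk above can be reproduced with a small simulation. This is our own model of the two execution styles, not the actual hardware scheduler:

```python
# Each entry is (dest, sources, latency); this is the five-instruction
# fragment from the text.
code = [
    ("R4", ["R1"], 4),        # 1. load   R4, [R1]
    ("R4", ["R4"], 1),        # 2. shiftL R4, 2
    ("R2", ["R3"], 1),        # 3. move   R2, R3
    ("R2", ["R2"], 1),        # 4. shiftL R2, 2
    ("R2", ["R2", "R3"], 1),  # 5. add    R2, R3
]

def in_order_clocks(code):
    # A sequential machine simply serializes the latencies.
    return sum(lat for _, _, lat in code)

def data_flow_clocks(code):
    # One instruction issues per clock; the oldest instruction whose source
    # values are already valid goes first, regardless of program order.
    last_writer, deps = {}, []
    for i, (dest, srcs, _) in enumerate(code):
        deps.append([last_writer[s] for s in srcs if s in last_writer])
        last_writer[dest] = i
    done, issued, clock = {}, set(), 0
    while len(issued) < len(code):
        clock += 1
        for i, (_, _, lat) in enumerate(code):
            if i in issued:
                continue
            if all(d in done and done[d] < clock for d in deps[i]):
                issued.add(i)
                done[i] = clock + lat - 1   # clock when the result is valid
                break
    return clock

assert in_order_clocks(code) == 8
assert data_flow_clocks(code) == 5
```

The simulation issues instructions 1, 3, 4, 5, 2 in clocks 1 through 5, exactly the order the text walks through.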
22.7.2 Performance Considerations

As a programmer, you do not have direct control over the operation of the ROB and the execution unit. But you can affect its behavior indirectly, based on your understanding of the internal architecture. Here are a few guidelines that could help you maximize the number of executed micro-ops every clock cycle.

■ Blend your instruction types. The execution unit has five execution ports that can execute up to five micro-ops in 1 clock cycle. To maximize this number, you should use a mix of instructions as much as possible. Avoid clumping the same kind of operations together (back-to-back loads, stores, ALUs).

■ Minimize mispredicted branches and partial stalls. Both of these are detrimental to the performance of the ROB and the execution units.
■ Keep your data in the L1 cache. This allows the load port (2) to bring in the data as fast as possible and, in turn, avoids data dependency stalls among micro-ops.

22.8 Retirement Unit

The retirement unit accepts up to three micro-ops in 1 clock cycle. It commits the final results to the IA registers or to memory. The retirement unit guarantees that the micro-ops are retired in the order in which they came into the ROB. There is almost nothing that you can do to affect the performance of the retirement unit.
22.9 Rendering Our Sprite on the Pentium II

Now that we know what's important to the Pentium II processor, let's see if our favorite sprite has any problems when it runs on it. This time, however, we'll use VTune to do the analysis.

Figure 22-7 shows the MMX sprite code analyzed for the Pentium II processor using VTune. Notice that, rather than showing the U/V pairing
(The VTune listing shows the sprite's MMX loop: the movq loads, the pcmpeqb/pand/pandn/por blend, the movq QWORD PTR [edi-8], mm1 store (2 micro-ops), dec ecx, and jnz main+6, each annotated with a decode-group bracket and a micro-op count.)

FIGURE 22-7 MMX sprite analysis for the Pentium II processor.
columns, VTune shows a "decoder group" column and a micro-op count column. The decoder group column, indicated by the curly bracket "{," indicates when two or three instructions are decoded simultaneously because they adhere to the 4:1:1 decoder template (refer to section 22.5 for more details). In the "µops" column, VTune shows the number of micro-ops that are generated when the instruction is decoded.

In the figure, notice that the highlighted instruction dec ecx was decoded by itself because the instruction sequence does not adhere to the 4:1:1 decoder template. The problem is caused because the movq [edi-8], mm1 consists of two micro-ops and, thus, has to be the first instruction in a decoder group sequence.

You can easily optimize the code for the Pentium II processor by switching the two instructions. In this case, the movq [edi-8], mm1 will be decoded by the complex decoder, and the following two instructions are decoded by the two simple decoders. Figure 22-8 shows the results of optimizing our sprite. Note the differences in line 21 in the number of micro-ops and the improvement gained.
Top of loop:

    movq    mm0, QWORD PTR [esi]
    movq    mm1, QWORD PTR [edi]
    movq    mm2, mm3
    pcmpeqb mm2, mm0
    pand    mm1, mm2
    pandn   mm2, mm0
    por     mm1, mm2
    add     edi, 8
    add     esi, 8
    movq    QWORD PTR [edi-8], mm1    ; 2 micro-ops
    dec     ecx                       ; 1
    jnz     main+6                    ; 1

FIGURE 22-8 MMX sprite optimized for the Pentium II processor.
VTune also warns you about partial register stalls, which are very useful to remove. Typically, you can remove partial register stalls with little or no impact on performance on the Pentium processor.

In the fetch unit section, we recommended that you align loops on a 16-byte boundary. Notice, however, in Figure 22-8, we did not bother to apply our own recommendation: the top of our loop, "main+6:," is not aligned on a 16-byte boundary. Why not? The purpose of that rule was to assure that the decoder would have three instructions to decode when it jumps to the top of the loop; with luck, the three instructions follow the 4:1:1 rule. If you examine the first three instructions in the loop, you'll notice that they fit within a 16-byte block, 0x00 to 0x0F. And since the fetch unit forwards 16 bytes at a time to the decoder, the decoder will have three instructions to decode in these 16 bytes.
22.10 Speed Up Graphics Writes with Write Combining

22.10.1 Operational Overview

By the time the Pentium II processor is in the mainstream market, software-only 3D games and high-resolution MPEG2 video will be widely available. Unfortunately, one of the greatest bottlenecks for these applications is the access speed to graphics memory. A typical software-only MPEG2 player consumes up to 30 percent of the CPU writing to video memory.

The Pentium II processor implements the Write Combining (WC) memory type5 in order to accelerate CPU writes to the video frame buffer. The 32-byte WC buffer delays writes on their way to a WC memory region, so applications can write 32 bytes of data to the WC buffer before it bursts them to their final destination. The 32-byte burst writes are faster than individual byte or DWORD writes, and they consume less bandwidth on the system bus.

Typically, the video driver or the BIOS sets up the frame buffer to be WC (similar to the way it is set up now as uncached memory). As usual, you can use DirectDraw to retrieve the address of the frame buffer. Therefore, there is no change required from an application point of view (well, you might want to read on).

5. Memory type: these include cached, uncached, WC, and other memory types.
Let's have a closer look at WC and determine how it enhances graphics application performance.

Assume that you are writing a 320 × 240 image to a WC frame buffer as shown in Figure 22-9. Typically, you would write the pixels from left to right, sequentially, one pixel at a time. For the sake of simplicity, also assume that the address of the frame buffer is aligned on a 32-byte boundary.

When you write the first 32 bytes of line 1 to the frame buffer, those 32 bytes actually end up in the WC buffer rather than in video memory. Once you write byte 33 to the frame buffer, the WC buffer bursts its contents (the first 32 bytes) to video memory and captures the thirty-third byte instead. Similarly, the next 31 bytes are held in the WC buffer until the sixty-fifth byte is written out. The same process repeats for every package of 32 bytes of data aligned on a 32-byte boundary.

So what about the last 32 bytes in the image? How are they flushed out? They are eventually flushed out when you write somewhere else in the video buffer (for example, when you write out the next frame) or when a task switch occurs. Actually, there are plenty of circumstances that cause the WC buffer to be flushed out:
FIGURE 22-9 WC frame buffer. (The figure shows image lines being written through the 32-byte WC buffer on their way to graphics frame buffer memory.)
■ Any L1 uncached memory loads or stores (L1 cached loads and stores do not flush the WC buffer).

■ Any WC memory loads, or WC stores to an address that does not map into the current WC buffer.

■ I/O reads or writes.

■ Context switches, interrupts, IRET, CPUID, locked instructions, and the WBINVD instruction.

Notice that the Pentium II processor generates a 32-byte burst write only if the WC buffer is completely full. Otherwise, it performs multiple smaller writes to the WC region. These multiple writes are still faster than writing to an uncached frame buffer.
22.10.2 Performance Considerations

In short, WC could enhance your graphics performance if you write your data sequentially to the frame buffer. We have listed the following guidelines to remind you of what you should consider when you optimize for a WC frame buffer.

■ Always write sequentially to the frame buffer in order to gain performance from 32-byte WC bursts.

■ Avoid writing to the frame buffer vertically. For example, if you write to the first pixel in line 1 and then line 2, the second write does not map to the current WC buffer, so the WC buffer (holding only 1 byte) is flushed out. The same thing happens when you write to lines 3, 4, and so forth.
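A toy model of the WC buffer makes the cost of vertical writes concrete. This is our own sketch: it models only the "write outside the current 32-byte block" flush condition and assumes the 320-byte line pitch of the example image:

```python
# Count 32-byte bursts versus partial flushes for a sequence of byte writes.
def wc_traffic(addresses):
    bursts = partial_flushes = 0
    current_block, filled = None, set()
    for a in addresses:
        block = a // 32
        if block != current_block and filled:
            partial_flushes += 1          # buffer flushed before it is full
            filled = set()
        current_block = block
        filled.add(a % 32)
        if len(filled) == 32:             # full buffer -> one burst write
            bursts += 1
            current_block, filled = None, set()
    return bursts, partial_flushes

pitch = 320                               # bytes per line of a 320 x 240 image
# One full line written left to right: ten 32-byte bursts, no partial flushes.
assert wc_traffic(range(320)) == (10, 0)
# One byte per line written down a column: every write flushes a 1-byte buffer.
assert wc_traffic([line * pitch for line in range(10)]) == (0, 9)
```

Sequential writes turn every 32 bytes into a single burst; the vertical pattern never fills the buffer at all, which is exactly the behavior the guideline warns against.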
WHAT HAVE YOU LEARNED?

Now you know about the internal units of the Pentium II processor. More importantly, you know what matters to these units so you can get the best performance for your application. As a last reminder:

■ Maximize your code execution from the L1 cache.
■ Use the new instructions to minimize branches and mispredicted branches.
■ Avoid partial stalls. They are deadly.
■ Use VTune to analyze performance.
■ Use a mix of instructions (loads, stores, ALUs, MMX, and so forth) and apply the 4:1:1 decoder template.
■ Use Write Combining to blast your video images to the screen.
■ Read the next chapter to familiarize yourself with memory optimization issues.
CHAPTER 23

Memory Optimization: Know Your Data

WHY READ THIS CHAPTER?

Throughout this section, we've stressed again and again that you should "know your data": know where it is coming from and know where it is going. We've also stressed that the optimizations for the internal components of the processor are mostly useful if the code or data is already in the L1 cache. It's a nice premise, but that's not always the case.

In this chapter we'll talk about

■ how the data behaves away from home: in the L2 cache or main memory;
■ how the data moves between the L1, L2, and main memory, and what affects the movement of data;
■ how to bring the data into the L1 cache and keep it there as long as it's needed; and
■ as an added bonus, accesses to video memory, so you can understand how to write effectively to video memory.

As you know, multimedia applications deal with a huge amount of data that changes continuously from one second to the next. For example, a typical MPEG2 clip1 has 30 fps with a frame size of 704 × 480 pixels per frame at an average of 12 bits per pixel. Moreover, since MPEG2 uses bidirectional frame prediction, the size of the working data set2 is typically three to four times the size of one frame. Taking all of this into account, you can calculate the size of the working data set for an MPEG2 decoder as follows:

Data Set Size = (4 frames × (704 × 480 pixels) × 12 bits/pixel) / (8 bits/byte) ≈ 1.9 MB

All of these bits definitely do not fit in the L1 cache or even in the L2 cache: the L1 cache is 8 or 16K, and the L2 cache ranges between 256 and 512K. Therefore, at any given moment, the majority of the data resides in main memory rather than in the caches.

1. MPEG2 is a high-resolution motion video compression algorithm.
2. The working data set refers to the maximum size of data that is used by the application at any given moment.
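The arithmetic works out as a quick check of the formula above:

```python
# Working-set size: four 704 x 480 frames at an average of 12 bits per pixel.
frames, width, height, bits_per_pixel = 4, 704, 480, 12
data_set_bytes = frames * width * height * bits_per_pixel // 8
assert data_set_bytes == 2027520
assert round(data_set_bytes / (1024 * 1024), 1) == 1.9   # ~1.9 MB
```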
The main purpose of this chapter is to emphasize that memory access can be very costly, in terms of clock cycles, and to highlight certain access patterns that are more efficient than others. We'll also point out the differences between the various flavors of the Pentium and Pentium Pro processors with regard to cache and memory behavior. We'll top the chapter off with a brief discussion about accessing video memory.
23.1 Overview of the Memory Subsystem

23.1.1 Architectural Overview

Figure 23-1 shows a simplistic diagram of the memory subsystem for computers with the Pentium II processor. Notice that the L1 code and data caches are internal to the processor and run at the same speed as the core engine. The L2 cache resides on a dedicated L2 bus, external to the processor, and runs at one-half to one-third the speed of the processor.3 The memory subsystem is connected to the PCI chip set, which connects the processor to main memory, the PCI bus, and other peripheral devices.
FIGURE 23-1 Memory architecture of a system with the Pentium II processor. (The figure shows a 233-MHz core with its L1 caches, a dedicated bus to the L2 cache, and the 440FX PCISet connecting the processor to main memory and the PCI bus.)

3. The fraction of the bus speed depends on the type of L2 cache used and the speed of the processor.
The PCI chip set is the glue logic among the processor, memory, DMA, and the PCI and AGP4 buses. It manages and controls the traffic between the processor and all of these devices. A dedicated bus connects the system memory to the PCISet. The PCI bus connects the PCISet to I/O adapters, such as graphics, sound, and network cards. The AGP bus is a specialized graphics bus that was designed with 3D acceleration in mind; notice that the 440LX PCISet is the first chip set with the AGP bus.
23.1.2 Memory Pages and Memory Access Patterns

We've mentioned, throughout this section, that the L1 and L2 caches are divided into 32-byte cache lines, which represent the least amount of data that can be transferred between the L1 cache and main memory. For the curious only: you can find out more about the internal architecture of the caches from the Intel manuals (things like two-way and four-way set associativity, and so forth).

Internally, the system memory is divided into smaller units called memory pages. Memory pages are typically 2K in size and are aligned on a 2K boundary. The only reason we're talking about memory pages here is that, because of the design of DRAM chips, certain memory access patterns are more efficient than others. From the discussion that follows, you need to come out with one thing: consecutive accesses within the same memory page are more efficient than consecutive accesses that cross multiple memory pages.
In this discussion, we're assuming that the processor missed both the L1 and L2 caches and that it is now fetching data from main memory. As we mentioned earlier, the processor fetches an entire cache line at a time from main memory and writes it out to the cache. Since the processor has a 64-bit data bus, it can fetch an entire cache line with four bus transactions.

Now, when the processor requests data from main memory, the memory page where the data exists is first "opened"—this is done in the hardware—and then the data is retrieved. Once the page is open, it takes less time to read or write other data in the same page. Typically, the data sheet for the memory chip specifies how long it takes to open the page and perform the first read, and how long it takes to perform subsequent reads once the page is open.
4. The Accelerated Graphics Port (AGP) is a specialized graphics bus designed with 3D rendering in mind.
For example, the data sheet of an Enhanced Data Out (EDO) memory chip specifies the sequence {10-2-2-2}{3-2-2-2}, where the numbers represent clock cycles. Each curly bracket indicates four bus cycles of 64 bits each—that's one cache line. The first sequence, {10-2-2-2}, specifies the timing if the page is first opened and accessed four times. The second sequence, {3-2-2-2}, specifies the timing if the page was already open and accessed four additional times—that means you did not access any other memory page in between. The last sequence repeats as long as you access memory within the same page. One last thing: only one memory page can be open at any given moment.
The data sheet we have been discussing relates to a memory bus running at 66 MHz. Now, if we express the timing in terms of processor speed, say for a 233-MHz processor, it becomes {35-7-7-7}{11-7-7-7} in processor clocks.

Whenever your application jumps to another memory page, the current open page is first closed before the new page is opened. As a result, it takes an additional 24 processor clocks to switch between memory pages—that's a lot of processor clocks to waste. So what can you do about it? Maybe nothing! Maybe a lot! The whole point is that you should try to organize your memory footprint in such a way that you bring the data from main memory to the L1 cache in the most efficient manner. For example, if you know that most of your data resides in main memory (as with MPEG2), you might try to arrange the data in a smarter fashion such that you can burst it to the L1 cache faster.
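In processor clocks, the page-switch penalty quoted above falls out of the timing sequences directly:

```python
# EDO timings in 233-MHz processor clocks: {35-7-7-7} for a cache line that
# opens a new page, {11-7-7-7} for a line read from the already-open page.
page_miss_line = 35 + 7 + 7 + 7   # first line after a page switch
page_hit_line = 11 + 7 + 7 + 7    # subsequent lines in the same page
assert page_miss_line == 56
assert page_hit_line == 32
assert page_miss_line - page_hit_line == 24   # extra clocks per page switch
```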
In MPEG2's motion compensation,5 for example, you typically access three reference frame buffers and write the output to a fourth buffer or directly to the screen. Typically, when the buffers are allocated, they are allocated in a contiguous fashion, separately, as shown in Figure 23-2. With the allocation scheme shown in Figure 23-2a, when you access the three frames, you'll definitely cross memory page boundaries and thus reduce the overall application performance. Now, if you interleave the frames on a line-by-line boundary, as shown in Figure 23-2b, you'll have a better chance of accessing the three frames from the same memory page, and thus increase memory access efficiency.

5. Motion compensation is used when inter-frame decoding is used.
YUV12 => (16Y + 4U + 4V) × 8 bits / 16 pixels = 12 bits/pixel
YUV9  => (16Y + 1U + 1V) × 8 bits / 16 pixels = 9 bits/pixel

FIGURE 23-2 MPEG2 frame buffer allocation strategy.
23.1.3 Memory Timing

To complete the picture, let's look at a comparison of the L1 and L2 caches and system memory.
TABLE 23-1 Memory Architecture and Timing for a System Using the Pentium II Processor and EDO Memory

              Native timing           Processor clocks (233 MHz)   Clocks per cache line (bandwidth)
L1 cache      {1-1-1-1}               {1-1-1-1}                    4 (1864 MB/second)
L2 cache      {5-1-1-1}               {10-2-2-2}                   16 (466 MB/second)
EDO memory    {10-2-2-2}{3-2-2-2}     {35-7-7-7}{11-7-7-7}         56 (133 MB/second), 32 (233 MB/second)
SDRAM         {11-1-1-1}{2-1-1-1}     {39-4-4-4}{7-4-4-4}          51 (146 MB/second), 19 (392 MB/second)

Access timing for main memory depends on the type of PCISet and memory used in the system (available types include EDO, FPRAM, and SDRAM6). SDRAM offers the best access timing because it has a lower repetition7 rate, {11-1-1-1}{2-1-1-1}, relative to EDO's {10-2-2-2}{3-2-2-2}. But SDRAMs are only supported on systems with the PCISet 440LX chip set or later.

6. EDO: Enhanced Data Out; FPRAM: FastPage RAM; SDRAM: Synchronous DRAM.
7. Repetition rate: the timing for fetching the last 3 quad words in a cache line.
From the CPU's point of view, notice that the total number of clocks spent accessing main memory depends on the speed of the processor. Faster processors actually wait more clocks for memory than do slower processors. For example, if a memory chip takes one microsecond to respond, a processor running at 233 MHz waits 233 clocks before it receives the data, and a 200-MHz processor waits 200 clocks before it gets the same data. Even though both processors waited the same physical time, 1 microsecond, the faster processor ticked more clocks in that time—and thus it is losing more clocks that could be spent doing something more useful.
23.1.4 Performance Considerations

The Pentium II processor includes event counters that can help you understand the memory footprint of your application. Notice that even though some of these counters are not 100 percent accurate, they can give you a good indication of your application's cache and memory behavior.
TABLE 23-2 Pentium II Processor Cache and Bus Performance Event Counters

DATA_MemRef      All memory accesses, including reads and writes to any memory type
L2_LD, L2_ST     Number of data loads/stores that miss the L1 data cache and are issued to the L2 cache
L2_LD_IFetch     All instruction and data load requests that miss the L1 cache and are issued to the L2 cache
L2_Rqsts         All L2 requests, including data loads/stores, instruction fetches, and locked accesses
BUS_TranAny      Number of all transactions on the bus
BUS_Tran_BRD     Number of data cache line reads from the bus
BUS_Trans_WB     Number of cache lines evicted from the L2 cache because of conflict with another cache line
BUS_BrdyClocks   Number of clocks when the bus is not idle
Assuming that you can quantify the amount of data that you read and write in a portion of your application, you can derive the following formulas:

L1 Data Miss Ratio = (L2_LD + L2_ST) / Total Mem Ref
Since L1 cache misses generate L2 cache accesses, we are using the L2 event counters to quantify the L1 data miss ratio rather than using the DCU (L1) event counters.

L2 Data Read Miss Ratio = (BUS_TranBRD - BUS_TranIFetch) / Total Mem Ref
% L2 Data Requests = (L2_LD + L2_ST) / L2_Rqsts

The L2 data read miss ratio represents the number of cache line reads or writes that missed the L2 cache and caused a line to be brought in from memory. The L2 cache holds both instructions and data. The % L2 data requests represents the percentage of L2 accesses that are data accesses only.
Bus Utilization = BUS_BrdyClocks / Total Clocks

% Bus Data Reads = (BUS_TranBRD - BUS_TranIFetch) / BUS_TranAny

The bus utilization indicates how often the bus is busy moving data around (not idle). This includes all bus transactions, whether they come from the CPU or from another bus master, DMA, or another processor.

The % bus data reads represents the percentage of the bus transactions used for data reads.
`
23.2 Architectural Differences among the Pentium and
Pentium Pro Processors

To optimize your application for multiple IA processors, you need to pay
attention to some of the architectural differences between the Pentium and
Pentium Pro processors. For example, there are differences in the behavior
of the cache subsystem and in the organization of the write buffers. These
architectural differences affect the way you should proceed in optimizing
your memory accesses.
380 | CHAPTER 23 MEMORY OPTIMIZATION: KNOW YOUR DATA
23.2.1 Architectural Cache Differences
Watch out if only a small portion of the cache line is touched or the write
stride is greater than a cache line.
On Pentium processors, when you write to an address that is not present in
the L1 cache, the data is written directly to the L2 cache without touching
the L1 cache. If the data is not present in the L2 cache either, it is
written directly to system memory without touching the L2 cache. This is
known as a Read Allocate cache.
On Pentium Pro processors, if the processor encounters a cache write miss,
it first bursts the entire cache line into the L1 cache from main memory or
the L2 cache, and then writes the data to the L1. This is known as a Write
Allocate on a Write Miss. This behavior is typically advantageous, since
sequential stores to the same cache line hit the L1 cache and are therefore
faster, unlike on the Pentium processor, where they would be written
through. In addition, when the stores are committed to main memory or the
L2 cache, they are committed in one 32-byte burst write, which is faster
than individual memory writes and thus reduces overall bus utilization.
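As a sketch of why write allocate helps sequential stores (our own illustration, not code from the book, with a hypothetical fill_line helper), consider initializing one 32-byte line with eight 4-byte stores:

```c
/* Eight sequential 4-byte stores cover one 32-byte cache line. On a
 * Pentium Pro class processor, the first store misses, write allocate
 * brings the whole line into the L1 cache, and the remaining seven stores
 * hit L1; the dirty line is later written back in a single 32-byte burst.
 * On a Pentium, each of the eight stores is written through past the L1
 * cache individually. */
#define CACHE_LINE_BYTES 32                    /* line size on these parts */
#define WORDS_PER_LINE   (CACHE_LINE_BYTES / 4)

void fill_line(unsigned int *line, unsigned int value)
{
    int i;
    for (i = 0; i < WORDS_PER_LINE; i++)
        line[i] = value;  /* stores 1..7 hit L1 after store 0 allocates */
}
```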
The Pentium Pro processor implements a nonblocking cache, whereas the
Pentium processor implements a blocking cache. When the Pentium processor
encounters a cache miss, subsequent memory accesses stall until the miss is
satisfied; the Pentium Pro processor can continue to service other loads
and stores while a miss is outstanding.
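The difference can be sketched in C (our own illustration): array traversal exposes independent load addresses that a nonblocking cache can overlap, while pointer chasing makes each load address depend on the previous miss completing.

```c
#include <stddef.h>

struct node { struct node *next; int value; };

/* Dependent chain: each load address is unknown until the previous load
 * (possibly a cache miss) returns, so the misses serialize on any
 * processor; a nonblocking cache cannot overlap them. */
int sum_list(const struct node *n)
{
    int sum = 0;
    while (n != NULL) {
        sum += n->value;
        n = n->next;  /* next address depends on this load completing */
    }
    return sum;
}

/* Independent accesses: the addresses a[0], a[1], ... are all known up
 * front, so a nonblocking cache can have several misses in flight at
 * once while execution continues. */
int sum_array(const int *a, size_t n)
{
    int sum = 0;
    size_t i;
    for (i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```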
