`
FIGURE 22-4 Instruction decoder.
`Performance Considerations
`
The only thing that you have to worry about, as far as the decoders are concerned, is to apply the 4:1:1 template as often as you can. With the 4:1:1 template, when you schedule your instructions, you need to arrange them such that the first instruction breaks down to four or fewer micro-ops, and the following two instructions break down to one micro-op each. By repeating this template, you guarantee maximum decoder efficiency. You can easily apply the template with the help of VTune's static analyzer described in the previous chapter. Ideally, the Pentium II processor can decode three instructions every clock cycle. In reality, however, you never sustain this throughput because you cannot always apply the 4:1:1 template or because the decoder stalls from branch misprediction or RAT stalls.
`
Use the event counters to measure the efficiency of the instruction decoder as follows:

Instructions Decoded per Clock = Inst_Decoded / Clock Cycles
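The 4:1:1 grouping rule is easy to model in code. The sketch below is our own illustration (not a VTune feature): it estimates how many decode cycles a stream of instructions needs, given only the number of micro-ops each instruction decodes into, assuming decoder 0 takes up to four micro-ops and the two simple decoders take one each.

```python
def decode_cycles(uop_counts):
    """Estimate decode cycles for an instruction stream.

    uop_counts: micro-ops per instruction, in program order.
    Decoder 0 accepts an instruction of up to 4 micro-ops; decoders 1
    and 2 accept only single-micro-op instructions. A multi-micro-op
    instruction that is not first in its group starts a new group.
    """
    cycles, slot = 0, 0            # slot: next decoder to fill (0, 1, or 2)
    for uops in uop_counts:
        if slot == 0:              # decoder 0 opens a new group
            cycles += 1
            slot = 1
        elif uops == 1:            # a simple decoder can take it
            slot = 0 if slot == 2 else slot + 1
        else:                      # needs decoder 0: close the group early
            cycles += 1
            slot = 1
    return cycles

# Two 4:1:1 groups decode in two cycles ...
print(decode_cycles([4, 1, 1, 4, 1, 1]))   # 2
# ... but a 2-micro-op instruction in the middle splits the group
print(decode_cycles([1, 2, 1]))            # 2
```

Feeding it the micro-op counts of a candidate schedule shows immediately whether a reordering saves decode cycles.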
`
22.6 Register Alias Table Unit

22.6.1 Operational Overview
`
Internally, the Pentium II processor has forty virtual registers, which are used to hold the intermediate calculation results. When a new micro-op is decoded, the Register Alias Table (RAT) unit renames the IA register (eax, ebx, and so forth) to one of the virtual registers. At any given instant an IA register could be mapped to one or more virtual registers.
`
`How does it work? Consider the following sequence of instructions and
`their related micro-ops. (Notice that the listed micro-ops are just mnemon-
`ics that we made up to illustrate the point.)
`
The RAT aliases each of the IA registers, eax, ecx, and edx, to one of forty internal virtual registers, vr0, vr1, and so forth. Notice that the RAT assigns a new virtual
`
`376
`
`
`
`are con-
`l:1:1
`e them
`s, and
`
`‘Y
`y. You
`er
`can
`
`l IICVCT ,
`
`emplate
`ills.
`
`decoder
`
`REGISTER ALIAS TABLE UNIT I 353
`
TABLE 22-5 IA Instructions and Their Related Micro-ops

    mov eax, Mem        uLoad vr0:eax, Mem
    add edx, eax        uAdd  vr1:edx, vr0:eax
    mov eax, 12         uLoad vr2:eax, 12
    add eax, ecx        uAdd  vr2:eax, vr3:ecx
    add ecx, edx        uAdd  vr3:ecx, vr1:edx
`
register for the same IA register only when the IA instruction is loaded with a new value. If the register is only read from, the last virtual register is used. In our example, the eax register is assigned a new virtual register in both instructions 1 and 3 since both instructions load a new value into eax. But in instruction 5 the RAT uses the same virtual vr1:edx register since the instruction does not load a new value into edx; it is a source operand.
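The renaming rule can be captured in a few lines. The class below is our own toy model (not the actual RAT hardware): an instruction that loads a new value allocates a fresh virtual register, while reads and read-modify-write destinations reuse the current mapping. Running the five instructions from Table 22-5 through it reproduces the table's assignments.

```python
class RAT:
    """Toy register alias table: maps IA registers to virtual registers."""
    def __init__(self):
        self.map = {}
        self.next_vr = 0

    def rename(self, reg, new_value=False):
        # Allocate a fresh virtual register only when the instruction
        # loads a new value (or the register has no mapping yet).
        if new_value or reg not in self.map:
            self.map[reg] = f"vr{self.next_vr}"
            self.next_vr += 1
        return self.map[reg]

rat = RAT()
# mov eax, Mem -> uLoad vr0:eax, Mem
print(rat.rename("eax", new_value=True))        # vr0
# add edx, eax -> uAdd vr1:edx, vr0:eax
print(rat.rename("edx"), rat.rename("eax"))     # vr1 vr0
# mov eax, 12  -> uLoad vr2:eax, 12
print(rat.rename("eax", new_value=True))        # vr2
# add eax, ecx -> uAdd vr2:eax, vr3:ecx
print(rat.rename("eax"), rat.rename("ecx"))     # vr2 vr3
# add ecx, edx -> uAdd vr3:ecx, vr1:edx
print(rat.rename("ecx"), rat.rename("edx"))     # vr3 vr1
```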
`
Now, let's see what happens to the micro-ops from Table 22-5 once they're handed to the execution unit:

In clock 1 the execution unit executes micro-ops 1 and 3 in two different execution ports. Even though both micro-ops write to the same IA register, eax, the processor executes the opcodes at the same time since they write to two different virtual registers.

In clock 2 the execution unit stalls on micro-op 2 because of the dependency on the vr0:eax register from micro-op 1. But micro-op 4 is ready to execute, so it does, assuming that vr3:ecx is ready.

Since micro-op 5 depends on the result of micro-op 2, it can only execute after micro-op 2 executes. Micro-op 2 executes whenever vr0:eax gets its value from memory. Meanwhile, the execution unit processes other micro-ops that are ready and waiting in the ROB.

So why is register renaming useful? Consider the third micro-op, uLoad vr2:eax, 12. Without register renaming, the micro-op has to wait for the first two micro-ops to execute before it can execute; of course, micro-op 4 has to wait as well. With register renaming, micro-ops 3 and 4 were able to execute while the processor was loading data from memory.
`
`
`22.6.2
`
`Performance Considerations
`
The RAT is affected by one of the major performance bottlenecks in the Pentium II processor: partial register stalls. You'll typically notice such stalls when you run Pentium-optimized code on the Pentium II processor. Eliminating partial register stalls is one of the most obvious and most rewarding optimizations you can achieve on the Pentium II processor.

Partial stalls occur when an instruction that writes to an 8- or 16-bit register (al, ah, ax) is followed by an instruction that reads a larger set of that same register (eax). For example, the Pentium Pro will suffer a partial stall if you write to the al or ah register and then read the ax or eax register.

Notice that partial stalls can still occur even if the second instruction does not immediately follow the first instruction. Since partial register stalls can last for more than 7 cycles, on average, you can avoid them by separating the two instructions in question by a minimum of 7 cycles. Or you can fix them.
`
The Pentium II processor implements special cases to eliminate partial stalls in order to simplify the blending of code across processors. In order to eliminate partial stalls, you must insert a SUB or XOR instruction in front of the original instruction to clear out the larger register. Figure 22-5 shows all the possible partial register stalls and which flavor of the XOR or SUB instructions you can use to eliminate such stalls.
`
[Figure 22-5 is a matrix: for each small register you write (al, ah, or ax) and each larger register you then read (ax or eax), it lists the instruction you must insert before the first instruction in order to eliminate the partial stall (for example, xor ah, ah; xor ax, ax; or xor eax, eax), or "No Partial Stall" where none occurs.]

FIGURE 22-5 How to eliminate partial register stalls in the Pentium II processor.
`
`378
`
`
`
`REORDER BUFFER AND EXECUTION UNITS I 355
`
In the three examples we've added the XOR or SUB instructions in front of the original code in order to eliminate partial register stalls.

xor eax, eax
mov ax, mem16
read eax

sub ax, ax
mov al, mem8
read ax

xor ah, ah
mov al, mem8
read ax
`
You can use VTune's static analyzer to easily detect partial register stalls in your code. You can also use the Partial_Rat_Stalls event counter to measure the number of cycles wasted by partial register stalls.
`
22.7 Reorder Buffer and Execution Units
`
`22.7.1
`
`Operational Overview
`The Reorder Buffer (ROB, a.k.a. Reservation Station) is at the heart of the
`out-of-order execution of the Pentium II processor. The ROB can receive
`up to three micro»ops from the RAT and can retire up to three micro-ops in
`one clock cycle. It can hold a maximum of forty micro-ops at any given
`time. (See Figure 22-6.)
`
[Figure 22-6 diagrams the RAT feeding the ROB, which dispatches micro-ops to the execution ports: load, store, integer ALU, floating-point, MMX, and address generation units.]

FIGURE 22-6 Pentium II processor Reorder Buffer and the execution units (ports 0-4).
`
`379
`
`
`
`366 I CHAPTER 22 THE PENTIUM ll PROCESSOR
`
`
`
The Pentium II processor implements a data flow machine, which leads to the out-of-order execution. In a data flow machine, the order of execution of micro-ops is determined solely by the readiness of their data, not by the order in which they entered the ROB. Let's see how this model works.
`
1. load   R4, [R1]    ; 4 cycles
2. shiftl R4, 2       ; 1 cycle
3. move   R2, R3      ; 1 cycle
4. shiftl R2, 2       ; 1 cycle
5. add    R2, R3      ; 1 cycle
`
Consider the coined pseudo-code fragment above. Assume that only one instruction can execute at a time and that each instruction takes the number of cycles listed next to it. In a sequential (in-order) processor, it takes the code fragment 8 clocks to execute.
`
Now, consider a data flow machine where instructions execute based on the availability of their data, not on the order in which they appeared. Let's examine what happens every clock cycle:
`
`22
`
`1. The first instruction starts to execute immediately.
`2. The second instruction stalls for the next 3 clocks in the ROB because it
`
`
`
`22
`
`needs the value of R4 to execute. Instead, instruction 3 executes (no data
`dependency).
`3. Instruction 4 executes.
`
`4. Instruction 5 executes. Also, R4 value becomes valid.
`
`5. Instruction 2 is now ready to execute, so it does.
`
`As you can see, with out-of-order execution, it only takes 5 clocks to exe-
`cute compared to 8 clocks for the sequential execution model. Even though
`the micro-ops were executed out-of-order, the final results are exactly the
`same because they are written out in the order they came in.
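The 5-versus-8 clock count can be reproduced with a small scheduling model. The code below is our own sketch, assuming one instruction issues per clock and a result is usable the clock after its producer finishes; it is not how the ROB is actually built.

```python
# (dest, sources, latency) for the five pseudo-instructions above
program = [
    ("R4", ["R1"], 4),        # 1. load   R4, [R1]
    ("R4", ["R4"], 1),        # 2. shiftl R4, 2
    ("R2", ["R3"], 1),        # 3. move   R2, R3
    ("R2", ["R2"], 1),        # 4. shiftl R2, 2
    ("R2", ["R2", "R3"], 1),  # 5. add    R2, R3
]

def in_order_clocks(prog):
    # Sequential model: each instruction waits for the previous one.
    return sum(lat for _, _, lat in prog)

def data_flow_clocks(prog):
    # Each clock, issue the oldest instruction whose inputs are ready.
    deps, last_writer = [], {}
    for dest, srcs, _ in prog:
        deps.append([last_writer[s] for s in srcs if s in last_writer])
        last_writer[dest] = len(deps) - 1
    finish = [None] * len(prog)
    clock = 0
    while None in finish:
        clock += 1
        for i, (_, _, lat) in enumerate(prog):
            ready = all(finish[j] is not None and finish[j] < clock
                        for j in deps[i])
            if finish[i] is None and ready:
                finish[i] = clock + lat - 1
                break
    return max(finish)

print(in_order_clocks(program))    # 8
print(data_flow_clocks(program))   # 5
```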
`
`22.7.2
`
`Performance Considerations
`
`As a programmer, you do not have direct control over the operation of the
`ROB and the execution unit. But you can affect its behavior indirectly based
`on your understanding of the internal architecture. Here are a few guide-
`lines that could help you maximize the number of executed micro-ops
`every clock cycle.
`
- Blend your instruction types. The execution unit has five execution ports that can execute up to five micro-ops in 1 clock cycle. To maximize this number, you should use a mix of instructions as much as possible. Avoid clumping the same kind of operations together (back-to-back loads, stores, ALUs).

- Minimize mispredicted branches and partial stalls. Both of these are detrimental to the performance of the ROB and the execution units.
`
`
`
`380
`
`
`
`RENDERING OUR SPRITE ON THE PENTIUM H I 357
`
`
`
- Keep your data in the L1 cache. This allows the load port (port 2) to bring in the data as fast as possible and in turn avoids data dependency stalls among micro-ops.
`
22.8 Retirement Unit

The retirement unit accepts up to three micro-ops in 1 clock cycle. It commits the final results to the IA registers or to memory. The retirement unit guarantees that the micro-ops are retired in the order in which they came into the ROB. There is almost nothing that you can do to affect the performance of the retirement unit.
`
22.9 Rendering Our Sprite on the Pentium II

Now that we know what's important to the Pentium II processor, let's see if our favorite sprite has any problems when it runs on it. This time, however, we'll use VTune to do the analysis.
`
Figure 22-7 shows the MMX sprite code analyzed for the Pentium II processor using VTune. Notice that, rather than showing the U/V pairing

FIGURE 22-7 MMX sprite analysis for the Pentium II processor.
`
`
`
`ads to
`‘utjon
`by the
`
`>33
`‘lg
`t 8
`n
`
`to
`
`the
`.
`'
`(amine
`
`;e it
`data
`
`exe-
`
`iough
`y the
`
`of the
`, based
`lide_
`PS
`
`'7.e this
`Avoid
`
`stores,
`
`e detri—
`
`381
`
`
`
`368 I CHAPTER 22 THE PENTIUM II PROCESSOR
`
columns, VTune shows a "decoder group" column and a micro-op count column. The decoder group column, indicated by the curly bracket "{," indicates when two or three instructions are decoded simultaneously because they adhere to the 4:1:1 decoder template (refer to section 22.5 for more details). In the "uops" column, VTune shows the number of micro-ops that are generated when the instruction is decoded.

In the figure, notice that the highlighted instruction dec ecx was decoded by itself because the instruction sequence does not adhere to the 4:1:1 decoder template. The problem is caused because the movq [edi-8], mm1 consists of two micro-ops and, thus, has to be the first instruction in a decoder group sequence.
`
You can easily optimize the code for the Pentium II processor by switching the two instructions. In this case, the movq [edi-8], mm1 will be decoded by the complex decoder, and the following two instructions are decoded by the two simple decoders. Figure 22-8 shows the results of optimizing our sprite. Note the differences in line 21 in the number of micro-ops and the improvement gained.
`
FIGURE 22-8 MMX sprite optimized for the Pentium II processor.
`
`382
`
`
`
`
`
`SPEED UP GRAPHICS WRITES WITH WRITE COMBINING I 359
`
`iunt
`
`{)))
`
`1.5 for
`.icro-
`
`coded
`1
`
`,mm1
`a
`
`ching
`
`am.
`
`opti-
`liCI'0—
`
`kfi
`\*2~wgj\
`NW
`\-~~,:
`MM\ 1,’
`\J
`
VTune also warns you about partial register stalls, which are well worth removing. Typically, you can remove partial register stalls with little or no impact on performance on the Pentium processor.
`
`
`
In the fetch unit section, we recommended that you align loops on a 16-byte boundary. Notice, however, in Figure 22-8, we did not bother to apply our own recommendation: the top of our loop, "main+6:," is not aligned on a 16-byte boundary. Why not? The purpose of that rule was to assure that the decoder would have three instructions to decode when it jumps to the top of the loop; with luck, the three instructions follow the 4:1:1 rule. If you examine the first three instructions in the loop, you'll notice that they fit within a 16-byte block, 0x00 to 0x0F. And since the fetch unit forwards 16 bytes at a time to the decoder, the decoder will have three instructions to decode in these 16 bytes.
`
`
`22.10 Speed Up Graphics Writes with Write Combining
`
`22.10.1
`
`Operational Overview
`
By the time the Pentium II processor is in the mainstream market, software-only 3D games and high-resolution MPEG2 video will be widely available. Unfortunately, one of the greatest bottlenecks for these applications is the access speed to graphics memory. A typical software-only MPEG2 player consumes up to 30 percent of the CPU writing to video memory.

The Pentium II processor implements the Write Combining (WC) memory type5 in order to accelerate CPU writes to the video frame buffer. A 32-byte buffer delays writes on their way to a WC memory region, so applications can write 32 bytes of data to the WC buffer before it bursts them to their final destination. The 32-byte burst writes are faster than individual byte or DWORD writes, and they consume less bandwidth from the system bus.
`
`.
`1
`l
`
`i
`
Typically, the video driver or the BIOS sets up the frame buffer to be WC (similar to the way it is set up now as uncached memory). As usual, you can use DirectDraw to retrieve the address of the frame buffer. Therefore, there is no change required from an application point of view (well, you might want to read on).
`
`5. Memory type: These include cached, uncached, WC, and other memory types.
`
`
Let's have a closer look at WC and determine how it enhances graphics application performance.

Assume that you are writing a 320 x 240 image to a WC frame buffer as shown in Figure 22-9. Typically, you would write the pixels from left to right, sequentially, one pixel at a time. For the sake of simplicity, also assume that the address of the frame buffer is aligned on a 32-byte boundary.
`
When you write the first 32 bytes of line 1 to the frame buffer, those 32 bytes actually end up in the WC buffer rather than in video memory. Once you write byte 33 to the frame buffer, the WC buffer bursts its contents (the first 32 bytes) to video memory and captures the thirty-third byte instead. Similarly, the next 31 bytes are held in the WC buffer until the sixty-fifth byte is written out. The same process repeats for every package of 32 bytes of data aligned on a 32-byte boundary.

So what about the last 32 bytes in the image? How are they flushed out? They are eventually flushed out when you write somewhere else in the video buffer (for example, when you write out the next frame) or when a task switch occurs. Actually, there are plenty of circumstances that cause the WC buffer to be flushed out:
`
FIGURE 22-9 WC frame buffer.
`
`384
`
`
`
`
`
`SPEED UP GRAPHICS WRITES WITH WRITE COMBINING I 371
`
`
`
`
`
`CS
`
`as
`
`i right,
`3 that
`
`32
`Once
`
`ts (the
`Ltead.
`ifth
`
`bytes
`
`1en a
`
`se the
`
- Any L1 uncached memory loads or stores (L1 cached loads and stores do not flush the WC buffer).
- Any WC memory loads or WC stores to an address that does not map into the current WC buffer.
- I/O reads or writes.
- Context switches, interrupts, IRET, CPUID, locked instructions, and WBINVD instructions.
`
Notice that the Pentium II processor generates a 32-byte burst write only if the WC buffer is completely full. Otherwise, it performs multiple smaller writes to the WC region. These multiple writes are still faster than writing to an uncached frame buffer.
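The behavior described above is easy to model. The sketch below is our own toy model of a single 32-byte WC buffer (the real hardware flushes when the next write falls outside the window; bursting immediately when the buffer fills gives the same counts), comparing row-major and column-major writes to a hypothetical frame buffer with a 320-byte pitch.

```python
class WCBuffer:
    """Toy model of one 32-byte write-combining buffer."""
    def __init__(self):
        self.base = None          # current 32-byte-aligned window
        self.valid = set()        # byte offsets written in the window
        self.bursts = 0           # full 32-byte burst writes
        self.partial_flushes = 0  # slower partial flushes

    def write(self, addr):
        window = addr & ~31
        if self.base is not None and window != self.base:
            self.flush()          # write outside the window: flush first
        self.base = window
        self.valid.add(addr - window)
        if len(self.valid) == 32: # buffer full: goes out as one burst
            self.flush()

    def flush(self):
        if self.base is None:
            return
        if len(self.valid) == 32:
            self.bursts += 1
        else:
            self.partial_flushes += 1
        self.base, self.valid = None, set()

PITCH = 320  # hypothetical bytes per scan line

# Sequential (row-major) writes: nothing but full 32-byte bursts
wc = WCBuffer()
for y in range(4):
    for x in range(320):
        wc.write(y * PITCH + x)
wc.flush()
assert wc.bursts == 40 and wc.partial_flushes == 0

# Vertical (column-major) writes: every write flushes a 1-byte buffer
wc = WCBuffer()
for x in range(320):
    for y in range(4):
        wc.write(y * PITCH + x)
wc.flush()
assert wc.bursts == 0 and wc.partial_flushes == 1280
```

The same 1,280 bytes cost 40 bursts when written sequentially but 1,280 partial flushes when written vertically, which is the point of the guidelines below.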
`
`22.10.2
`
`Performance Considerations
`
`in short, your WC could enhance your graphics performance if you write
`your data sequentially to the frame buffer. We have listed the following
`guidelines to remind you of what you should consider when you optimize
`for a WC frame buffer.
`
- Always write sequentially to the frame buffer in order to gain performance from 32-byte WC bursts.

- Avoid writing to the frame buffer vertically. For example, if you write to the first pixel in line 1 and then to the first pixel in line 2, the second write does not map to the current WC buffer, so the WC buffer (holding only 1 byte) is flushed out. The same thing happens when you write to lines 3, 4, and so forth.
`
WHAT HAVE YOU LEARNED?

Now you know about the internal units of the Pentium II processor. More importantly, you know what matters to these units so you can get the best performance for your application. As a last reminder:

- Maximize your code execution from the L1 cache.
- Use the new instructions to minimize branches and mispredicted branches.
- Avoid partial stalls. They are deadly.
- Use VTune to analyze performance.
- Use a mix of instructions (loads, stores, ALUs, MMX, and so forth) and apply the 4:1:1 decoder template.
- Use Write Combining to blast your video images to the screen.
- Read the next chapter to familiarize yourself with memory optimization issues.
`
`385
`
`
`
`CHAPTER 23
`
`um
`
`Memory Optimization:
`Know Your Data
`
Throughout this section, we've stressed again and again that you should "know your data," know where it is coming from and know where it is going. We've also stressed that the optimizations for the internal components of the processor are mostly useful if the code or data is already in the L1 cache. It's a nice premise, but that's not always the case.

In this chapter we'll talk about

- how the data behaves away from home: in the L2 cache or main memory;
- how the data moves between the L1, L2, and main memory and what affects the movement of data;
- how to bring the data into the L1 cache and keep it there as long as it's needed; and
- as an added bonus, accesses to video memory, so you can understand how to write effectively to video memory.
`
As you know, multimedia applications deal with a huge amount of data that changes continuously from one second to the next. For example, a typical MPEG2 clip has 30 fps with a frame size of 704 x 480 pixels per frame at an average of 12 bits per pixel.1 Moreover, since MPEG2 uses bidirectional frame prediction, the size of the working data set2 is typically three to four

1. MPEG2 is a high-resolution motion video compression algorithm.
2. The working data set refers to the maximum size of data that is used by the application at any given moment.
`
`
times the size of one frame. Taking all of this into account, you can calculate the size of the working data set for an MPEG2 decoder as follows:

Data Set Size = (4 frames * 704 * 480 pixels * 12 bits/pixel) / (8 bits/byte) ≈ 1.9 MB
`
All of these bits definitely do not fit in the L1 cache or even in the L2 cache; the L1 cache is 8 or 16K, and the L2 cache ranges between 256 and 512K. Therefore, at any given moment, the majority of the data resides in main memory rather than in the caches.
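As a quick check of the arithmetic, a few lines of Python (our own illustration) reproduce the working-set figure:

```python
frames = 4                 # bidirectional prediction keeps about 4 frames live
width, height = 704, 480   # MPEG2 frame size in pixels
bits_per_pixel = 12

data_set_bytes = frames * width * height * bits_per_pixel // 8
print(data_set_bytes)                    # 2027520 bytes
print(round(data_set_bytes / 2**20, 1))  # 1.9 (MB)
```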
`
`The main purpose of this chapter is to emphasize that memory access can
`be very costly, in terms of clock cycles, and to highlight certain access pat-
`terns that are more efficient than others. We’ll also point out the differences
`between the various flavors of the Pentium and Pentium Pro processors
`with regards to cache and memory behavior. We’ll top the chapter off with
`a brief discussion about accessing video memory.
`
23.1 Overview of the Memory Subsystem

23.1.1 Architectural Overview
`
Figure 23-1 shows a simplistic diagram of the memory subsystem for computers with the Pentium II processor. Notice that the L1 code and data caches are internal to the processor and run at the same speed as the core engine. The L2 cache resides on a dedicated L2 bus, external to the processor, and runs at one half to one third the speed of the processor.3 The memory subsystem is connected to the PCI chip set, which connects the processor to main memory, the PCI bus, and other peripheral devices.
`
FIGURE 23-1 Memory architecture of a system with the Pentium II processor.
`
`3. The fraction of the bus speed depends on the type of L2 cache used and the speed of the processor.
`
`387
`
`
`
`ulate
`
`he~—
`
`3.lI‘l
`
`can
`
`pat-
`nces
`rs
`
`with
`
`COITI‘
`
`OI‘€
`CBS‘
`CH1-
`
`8550!.
`
`S501‘.
`
`OVERVIEW OF THE MEMORY SUBSYSTEM I 375
`
`The PCI chip set is the glue logic between the processor, memory, DMA,
`and the PCI and AGP4 buses. It manages and controls the traffic between
`the processor and all of these devices. A dedicated bus connects the system
`memory to the PCISet. The PCI bus connects the PCISet to I/O adapters,
`such as graphics, sound, and network cards. The AGP bus is a specialized
`graphics bus that was designed with 3D acceleration in mind; notice that
`the 440LX PCISet is the first chip set with the AGP bus.
`
`23.1.2
`
`Memory Pages and Memory Access Patterns
`
We've mentioned, throughout this section, that the L1 and L2 caches are divided into 32-byte cache lines, which represent the least amount of data that can be transferred between the L1 cache and main memory. For the curious only: you can find out more about the internal architecture of the caches from the Intel manuals (things like two-way and four-way set associative, and so forth).

Internally, the system memory is divided into smaller units called memory pages. Memory pages are typically 2K in size and are aligned on a 2K boundary. The only reason we're talking about memory pages here is that, because of the design of DRAM chips, certain memory access patterns are more efficient than others. In the discussion that follows, you need to come out with one thing: consecutive accesses within the same memory page are more efficient than consecutive accesses that cross multiple memory pages.
`
In this discussion, we're assuming that the processor missed both the L1 and L2 caches and that it is now fetching data from main memory. As we mentioned earlier, the processor fetches an entire cache line at a time from main memory and writes it out to the cache. Since the processor has a 64-bit data bus, it can fetch an entire cache line with four bus transactions.

Now, when the processor requests data from main memory, the memory page where the data exists is first "opened" (this is done in the hardware) and then the data is retrieved. Once the page is open, it takes less time to read or write other data to the same page. Typically, the data sheet for the memory chip specifies how long it takes to open the page and perform the first read, and how long it takes to perform subsequent reads once the page is open.
`
`4. The Accelerated Graphics Port (AGP) is a specialized graphics bus designed with 3D rendering in
`mind.
`
`
For example, the data sheet of an Enhanced Data Out (EDO) memory chip specifies the sequence {10-2-2-2}{3-2-2-2}, where the numbers represent clock cycles. Each curly bracket indicates four bus cycles of 64 bits each; that's one cache line. The first sequence, {10-2-2-2}, specifies the timing if the page is first opened and accessed four times. The second sequence, {3-2-2-2}, specifies the timing if the page was already open and accessed four additional times, that is, you did not access any other memory page in between. The last sequence repeats as long as you access memory within the same page. One last thing: only one memory page can be open at any given moment.

The data sheet we have been discussing relates to a memory bus running at 66 MHz. Now, if we look at another processing speed, say a 233-MHz processor, the timing becomes {35-7-7-7}{11-7-7-7} in processor clocks.
`
`Whenever your application jumps to another memory page, the current
`open page is first closed before opening the new page. As a result, it takes an
`additional 24 processor clocks to switch between memory pages—that’s a
`lot of processor clocks to waste. So what can you do about it? Maybe noth-
`ing! Maybe a lot! The whole point is that you should try to organize your
`memory footprint in such a way that you bring the data from main mem-
`ory to the L1 cache in the most efficient manner. For example, if you know
`that most of your data resides in main memory, for example, MPEG2, you
`might try to arrange the data in a smarter fashion such that you can burst it
`to the L1 cache faster.
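To see what the page-crossing penalty means per cache line, here is a small calculation (our own sketch) using the 233-MHz EDO timings quoted above:

```python
# Processor-clock timings for fetching one 32-byte cache line (four
# 64-bit bus transfers) from EDO memory at 233 MHz, from the text:
page_miss = (35, 7, 7, 7)   # the page had to be opened first
page_hit  = (11, 7, 7, 7)   # the page was already open

line_miss = sum(page_miss)  # 56 clocks per cache line
line_hit  = sum(page_hit)   # 32 clocks per cache line
print(line_miss, line_hit, line_miss - line_hit)   # 56 32 24

# Fetching 8 cache lines that all sit in one page, versus crossing
# into a different page on every line:
print(line_miss + 7 * line_hit)   # 280 clocks, same page
print(8 * line_miss)              # 448 clocks, page crossed every line
```

This is the arithmetic behind interleaving the MPEG2 reference frames described next: the more consecutive accesses stay in one open page, the closer you get to the {11-7-7-7} timing.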
`
In MPEG2's motion compensation,5 for example, you typically access three reference frame buffers and write the output to a fourth buffer or directly to the screen. Typically, when the buffers are allocated, they are allocated in a contiguous fashion, separately, as shown in Figure 23-2. With the allocation scheme shown in Figure 23-2a, when you access the three frames, you'll definitely cross memory page boundaries and thus reduce the overall application performance. Now, if you interleave the frames on a line-by-line boundary, as shown in Figure 23-2b, you'll have a better chance of accessing the three frames from the same memory page, and thus increasing memory access efficiency.
`
`5. Motion Compensation is used when inter-frame decoding is used.
`
`389
`
`
`
`
`
`OVERVIEW OF THE MEMORY SUBSYSTEM I 377
`
YUV12 => (16Y + 4U + 4V) x 8 bits / 16 pixels = 12 bits/pixel
YUV9  => (16Y + 1U + 1V) x 8 bits / 16 pixels = 9 bits/pixel

FIGURE 23-2 MPEG2 frame buffer allocation strategy.
`
`23.1.3
`
`Memory Timing
`
`To complete the picture, let’s look at a comparison of the L1 and L2 caches
`and system memory.
`
TABLE 23-1 Memory Architecture and Timing for a System Using the Pentium II Processor and EDO Memory

                Bus clocks              Processor clocks          Clocks per cache line (bandwidth)
L1 cache        {1-1-1-1}               {1-1-1-1}                 4 (1864 MB/second)
L2 cache        {5-1-1-1}               {10-2-2-2}                16 (466 MB/second)
EDO memory      {10-2-2-2} {3-2-2-2}    {35-7-7-7} {11-7-7-7}     56 (133 MB/second), 32 (233 MB/second)
SDRAM           {11-1-1-1} {2-1-1-1}    {39-4-4-4} {7-4-4-4}      51 (146 MB/second), 19 (392 MB/second)
`
Access timing for main memory depends on the type of PCISet and memory used in the system (available types include EDO, FPRAM, SDRAM6). SDRAM offers the best access timing because it has a lower repetition7 rate, {11-1-1-1}{2-1-1-1}, relative to EDO, {10-2-2-2}{3-2-2-2}. But SDRAMs are only supported on systems with the PCISet 440LX chip set or later.
`
6. EDO: Enhanced Data Out; FPRAM: Fast Page RAM; SDRAM: Synchronous DRAM.
7. Repetition rate: the timing for fetching the last 3 quad words in a cache line.
`
`
From the CPU point of view, notice that the total number of clocks spent accessing main memory depends on the speed of the processor. Faster processors actually wait more clocks for memory than do slower processors. For example, if a memory chip takes one microsecond to respond, a processor running at 233 MHz waits 233 clocks before it receives the data, and a 200 MHz processor waits 200 clocks before it gets the same data. Even though both processors waited the same physical time, 1 microsecond, the faster processor ticked more clocks in that time, and thus it is losing more clocks that could be spent doing something more useful.
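The point is just latency times clock rate; a one-line sketch of our own makes it concrete (the one-microsecond latency is an example figure, not a specific chip's spec):

```python
def clocks_waited(cpu_mhz, latency_us):
    """Clocks a processor ticks while waiting out a fixed memory latency."""
    return int(cpu_mhz * latency_us)   # MHz * microseconds = clocks

print(clocks_waited(233, 1.0))   # 233
print(clocks_waited(200, 1.0))   # 200
```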
`
`23.1.4
`
`Performance Considerations
`
`The Pentium II processor includes event counters that can help you under-
`stand the memory footprints of your application. Notice that even though
`some of these counters are not 100 percent accurate, they can give you a
`good indication of your application cache and memory behavior.
`
TABLE 23-2 Pentium II Processor Cache and Bus Performance Event Counters

DATA_MemRef       All memory accesses, including reads and writes, to any memory type
L2_LD, L2_ST      Number of data loads/stores that miss the L1 data cache and are issued to the L2 cache
L2_LD_Ifetch      All instruction and data load requests that miss the L1 cache and are issued to the L2 cache
L2_Rqsts          All L2 requests including data loads/stores, instruction fetches, and locked accesses
BUS_TranAny       Number of all transactions on the bus
BUS_Tran_BRD      Number of data cache line reads from the bus
BUS_Trans_WB      Number of cache lines evicted from the L2 cache because of conflict with another cache line
BUS_BrdyClocks    Number of clocks when the bus is not idle
`
Assuming that you can quantify the amount of data that you read and write in a portion of your application, you can derive the following formulas:

L1 Data Miss Ratio = (L2_LD + L2_ST) / Total Mem Ref
`
`391
`
`
`
`
`
`ARCHITECTURAL DIFFERENCES AMONG THE PENTIUM AND PENTIUM PRO PROCESSORS I 379
`
Since L1 cache misses generate L2 cache accesses, we are using the L2 event counters to quantify the L1 data miss ratio rather than using the DCU (L1) event counters.

L2 Data Read Miss Ratio = (BUS_TranBRD - BUS_TranIFetch) / Total Mem Ref

% L2 Data Requests = (L2_LD + L2_ST) / L2_Rqsts

The L2 data read miss ratio represents the number of cache line reads or writes that missed the L2 cache and caused a line to be brought in from memory. L2 holds both instructions and data. The % L2 Data Requests represents the percentage of data accesses only from L2.
`
Bus Utilization = BUS_BrdyClocks / Total Clocks

% Bus Data Reads = (BUS_TranBRD - BUS_TranIFetch) / BUS_TranAny

The Bus Utilization indicates how often the bus is busy moving data around (not idle). This includes all bus transactions, whether from the CPU or from another bus master, DMA, or another processor.

The % Bus Data Reads represents the percentage of the bus used for data reads.
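To make the bookkeeping concrete, here is a short sketch of the formulas in code. Every counter value below is invented for illustration, and we treat Total Mem Ref as the DATA_MemRef counter and Total Clocks as a separately measured clock count:

```python
# Invented counter readings, for illustration only
c = {
    "DATA_MemRef":    1_000_000,   # Total Mem Ref
    "L2_LD":             40_000,
    "L2_ST":             10_000,
    "BUS_TranBRD":       12_000,
    "BUS_TranIFetch":     2_000,
    "BUS_TranAny":       20_000,
    "BUS_BrdyClocks":   150_000,
    "Total_Clocks":   2_000_000,
}

l1_data_miss_ratio = (c["L2_LD"] + c["L2_ST"]) / c["DATA_MemRef"]
l2_data_read_miss  = (c["BUS_TranBRD"] - c["BUS_TranIFetch"]) / c["DATA_MemRef"]
bus_utilization    = c["BUS_BrdyClocks"] / c["Total_Clocks"]
bus_data_reads     = (c["BUS_TranBRD"] - c["BUS_TranIFetch"]) / c["BUS_TranAny"]

print(l1_data_miss_ratio)   # 0.05
print(l2_data_read_miss)    # 0.01
print(bus_utilization)      # 0.075
print(bus_data_reads)       # 0.5
```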
`
`
23.2 Architectural Differences among the Pentium and Pentium Pro Processors

To optimize your application for multiple IA processors, you need to pay attention to some of the architectural differences between the Pentium and Pentium Pro processors. For example, there are differences in the behavior of the cache subsystem and the organization of the write buffers. These architectural differences affect the way you should proceed in optimizing your memory.
`
`
`23.2.1
`
`Architectural Cache Differences
`
Watch out! If only a small portion of a cache line is touched or written, the entire cache line is still fetched.
`
On Pentium processors, when you write to an address in memory that does not exist in the L1 cache, the data is written directly to the L2 cache without touching the L1 cache. If the data does not exist in the L2 cache, the data is written directly to system memory without touching the L2 cache. This is known as a Read Allocate Cache.
`
On Pentium Pro processors, if the processor encounters a cache write miss, it first bursts the entire cache line to the L1 cache from main memory or the L2 cache, and then writes the data to the L1. This is known as a Write Allocate on a Write Cache Miss. This behavior is typically advantageous since sequential stores in the same cache line are faster because they hit the L1 cache, unlike the Pentium processor where they'll be written through. In addition, when the stores are committed to main memory or the L2 cache, they are committed in one 32-byte burst write, which is faster than individual memory writes, thus reducing overall bus utilization.
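A crude way to see the benefit is to count bus operations for sequential DWORD stores that miss the cache. The model below is our own simplification (it ignores write buffers and per-operation cycle costs, and assumes the stores fall in whole 32-byte lines):

```python
LINE = 32   # cache line size in bytes
STORE = 4   # one DWORD store

def pentium_bus_ops(n_stores):
    # Read-allocate cache: every missing store is written through
    # to memory individually.
    return n_stores

def ppro_bus_ops(n_stores):
    # Write allocate: one line fill coming in per touched line, plus
    # one 32-byte burst write-back going out per line.
    lines = (n_stores * STORE + LINE - 1) // LINE
    return 2 * lines

print(pentium_bus_ops(8))  # 8 individual write-through operations
print(ppro_bus_ops(8))     # 2 (one fill + one burst) for one full line
```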
`
The Pentium Pro processor implements a nonblocking cache compared to the Pentium processor, which implements a blocking cache. When the Pentium processor encountered a