`Block Sizes and Fetch Strategies
`
`Steven Przybylski
`
`MIPS Computer Systems,
`928 Arques Ave, Sunnyvale, CA 94086
`
`Abstract
`
This paper explores the interactions between a cache's block size, fetch size and fetch policy from the perspective of maximizing system-level performance. It has been previously noted that given a simple fetch strategy the performance optimal block size is almost always four or eight words [10]. If there is even a small cycle time penalty associated with either longer blocks or fetches, then the performance-optimal size is appreciably reduced.
`
In split cache organizations, where the fetch and block sizes of instruction and data caches are all independent design variables, instruction cache block size and fetch size should be the same. For the workload and write-back write policy used in this trace-driven simulation study, the instruction cache block size should be about a factor of two greater than the data cache fetch size, which in turn should be equal to or double the data cache block size. The simplest fetch strategy of fetching only on a miss and stalling the CPU until the fetch is complete works well. Complicated fetch strategies do not produce the performance improvements indicated by accompanying reductions in miss ratios because of limited memory resources and a strong temporal clustering of cache misses. For the environments simulated here, the most effective fetch strategy improved performance by between 1.7% and 4.5% over the simplest strategy described above.
`
1. Introduction

Over the last several years, designers have become increasingly concerned with the growing gap between the cycle times of dynamic RAMs and microprocessor CPUs. This concern has spurred interest in increasing cache block sizes to minimize the impact of a relatively long main memory latency. Though there have been several studies published to date that explore the characteristics of caches with longer block sizes [2, 12, 15, 16, 18, 20, 21], the studies have generally failed to explore the performance implications of some of the more significant design decisions.
`
CH2887-8/90/0000/0160$01.00 © 1990 IEEE
`
`
This paper expands on a section in a previous work [10]. In that article, the dependency of performance on the block size was graphed as a function of the main memory characteristics. The conclusion was that each memory system, as characterized by its latency and transfer rate, implies a performance-optimal block size that is independent of the CPU characteristics, including the cycle time. That analysis was limited by underlying assumptions of the study. In particular, exactly one block was fetched from main memory on a cache miss, the CPU was stalled until the fetch was complete, and there was no assumed cycle time degradation with increasing block size. In this paper we will relax each of these constraints in turn and thereby examine their impact on the best block size and the resulting performance level.
In recognition of the fact that the ultimate measure of a memory hierarchy is the performance obtained by the computer of which it is a part, execution time is the primary metric used in this paper to compare different cache organizations and fetch policies. The use of this metric is especially important in the study of policies that rely on parallelism between CPU execution and memory activity to presumably improve performance. Miss ratios alone are inadequate for judging their effectiveness because not all misses are created equal. In the presence of a complicated fetch strategy individual misses can have widely varying contributions to the total cycle count, depending on the exact state of the machine at the time. As a result, the relationship between the number of misses and the execution time is not necessarily straightforward.
In this study, different system configurations are compared using a trace-driven simulator which accurately keeps track of activity at all levels in the memory hierarchy during every machine cycle.
After describing the terminology and experimental method in Sections 2 and 3, we reproduce in Section 4 the results from the previous paper as a baseline for comparison. In Section 5, we examine the effects of a cycle time penalty for long block sizes. If the additional multiplexors and control needed to implement a longer block size degrades the cycle time of the CPU from the simple, single word block size case, then there clearly will be an impact on the tradeoff between latency and transfer period that determines the optimal block size.

DOCKET NO.: IPR-5463750

TI-EXHIBIT 1012
An important realization is that the amount of data that is fetched from memory need not be equal to a block. It can, in fact, be either more or less. Machines that prefetch instructions are fetching several blocks per cache miss. Section 6 asks the question: how are the optimal block size and fetch size related as a function of the memory characteristics? In contrast, Section 7 asks the question: what is the benefit of more complicated fetch strategies that try to overlap CPU execution with main memory fetches? Finally, Section 8 summarizes and concludes the paper.
`
`2. Terminology
The literature on caches is confused with respect to the definitions of some important terms. The following definitions are unambiguous, consistent with the majority of the literature and appropriate for the discussion of block sizes and fetch strategies.

Block:
A block is the unit of data in the cache associated with a tag. An n-way set associative cache with m sets has n×m blocks. Block sizes are always binary values - that is, powers of two - and typically range from one word to 128 words.

Sub-block:
A sub-block is the unit of data that is associated with a valid bit. Sub-blocks are smaller than or equal to blocks, and are generally either one byte, one word or one block long. In write-back caches, dirty bits are typically also associated with sub-blocks.
Fetch Size:
The fetch size is the quantity of data fetched from memory at one time. The only hard constraint is that it be an integral number of sub-blocks, though in reality it is typically a binary number of blocks. That the most common fetch size is one block contributes to the common usage of block size for fetch size and the blurred distinction between the two.
Fetch Policy:
The fetch policy is the algorithm used to determine when a main memory read is to occur, which words are to be fetched, what order they will be returned in, and at what point the CPU will be allowed to continue execution if it is stalled. Designs that prefetch on a cache miss have a fetch size greater than one block, and resume execution as soon as the missed sub-block or block is returned from memory.
(Read) Cache Miss Ratio:
Most researchers define the cache miss ratio as the ratio of read misses and write misses to the number of memory references. However, it is widely recognized that reads (instruction fetches and loads) and writes (stores) have greatly differing frequencies, policies, and hit and miss penalties [7, 17]. In the light of the shift that is occurring in analysis of memory hierarchies away from miss ratios and towards overall system-level performance, we define the read cache miss ratio and write cache miss ratio separately. When used without qualification, miss ratio refers to the read miss ratio.
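The counts these definitions imply can be made concrete in a few lines of Python. This is a hypothetical helper, not anything from the paper; the 64KB direct-mapped, 4-word-block, one-word sub-block numbers match the base system described in Section 3:

```python
def cache_geometry(cache_bytes, assoc, block_words, subblock_words=1, word_bytes=4):
    """Counts implied by the terminology: an n-way cache with m sets has n*m blocks."""
    block_bytes = block_words * word_bytes
    blocks = cache_bytes // block_bytes                     # n * m
    sets = blocks // assoc                                  # m
    valid_bits = blocks * (block_words // subblock_words)   # one valid bit per sub-block
    return {"blocks": blocks, "sets": sets, "valid_bits": valid_bits}

# 64KB direct-mapped cache, 4-word blocks, one-word sub-blocks
g = cache_geometry(64 * 1024, assoc=1, block_words=4)
```

For the base caches this gives 4096 blocks (and sets, since the cache is direct-mapped), each carrying four valid bits.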
`
`3. Experimental Method
All the results presented in this paper were generated with a trace-driven simulator stimulated by eight large address traces. This section briefly describes the simulator and the traces. Both are characterized at great length in other documents [11, 10]. The first of these publications also includes a detailed analysis supporting the credibility of conclusions based on this method.
A proper study of memory hierarchies in which the metric for comparison is execution time requires a simulator that accurately accounts for time at all levels of the hierarchy. Specifically, the simulator used here allows the user to specify the duration of reads and writes to the tag and data portion of caches and write buffers, as well as the minimum separation between operations. In addition, latency and transfer delays between levels of the hierarchy can be varied to allow for complete and realistic system models. A wide variety of fetch strategies and an equally broad spectrum of write strategies are provided for.
The second prerequisite is a set of system models to be used in the simulations. The cache design space is incredibly diverse. Including the often overlooked temporal parameters, there are literally tens of independent design variables per level in a memory hierarchy. Since it is impossible to explore the entire design space in any one study, we limit ourselves by fixing most of the parameters at common values that are consistent with the ranges of the free variables of the experiment. For instance, we have restricted ourselves to a single-level, split instruction/data cache. A realistic base system was chosen as the reference. In each of the experiments to follow, three or more design parameters are changed simultaneously over a portion of the design space while all of the others remain fixed.
The base system is characteristic of a system built around a 25MHz RISC CMOS processor. Throughout this study, we will uniformly assume that the caches determine the CPU cycle time. The single level of caching contains 64KB in each of the instruction and data caches. Both are direct-mapped and have default block and fetch sizes of 4 words. The sub-block size is uniformly one word. The data cache is a write-back cache with a four entry write buffer, each entry being one block wide. The datapath into and out of the write buffer is equal in width to the backplane. Both ports cycle at the CPU cycle rate, and they can operate simultaneously. Writes into the cache take two cycles, and the write miss policy is to fill the sub-block written to with the written data but not to fetch the rest of the block from main memory. With these choices, both write hits and misses are relatively quick. An aggressive write policy minimizes the write's contribution to the overall execution time and allows us to concentrate on the read effects.
The caches are connected by a one word wide bus capable of block transfers at a rate of one word per cycle. The cache read miss penalty is one cycle to get the address to the memory, several cycles of memory latency, then several cycles to transfer the data back into the cache. The main memory is characterized by a 180ns combined decode and access time, plus a 120ns gap between operations. The default fetch strategy is to stall the CPU on a read miss until the entire fetch is complete and the cache is loaded (demand fetch only [4]). During the loading of the cache, the requested datum is shunted off into a buffer and passed to the CPU on the last cycle of the memory transfer so that no additional overhead is needed to get the datum out of the cache.
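The read miss timing just described can be tallied with a small sketch. This is my own illustration, not the simulator's actual accounting; the 40ns cycle follows from the 25MHz clock, and the fetch is assumed to be the default 4 words at one word per cycle:

```python
import math

def read_miss_penalty_cycles(latency_ns, cycle_ns, fetch_words, words_per_cycle):
    """One cycle to present the address, then memory latency, then the transfer."""
    address = 1
    latency = math.ceil(latency_ns / cycle_ns)        # 180ns access at 40ns/cycle -> 5 cycles
    transfer = math.ceil(fetch_words / words_per_cycle)
    return address + latency + transfer

# Base system: 180ns access, 40ns CPU cycle, 4-word fetch, one word per cycle
penalty = read_miss_penalty_cycles(180, 40, 4, 1)
```

Under these assumptions a default read miss costs 10 CPU cycles; the 120ns gap between memory operations would only matter for back-to-back misses and is omitted here.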
The first four of the eight traces used to stimulate this model are drawn from the VAX architecture. They are long concatenations of several 400,000 reference snapshots taken on a VAX 8200. They include operating system references and true multiprogramming activity.

The traces were preprocessed to collapse adjacent byte references to the code segment into long word (32b) references, and to transform quad word references into the appropriate sequence of long word references. The warm-start boundary is set to 450,000 references for all four traces. All addresses are virtual addresses. The generation and detailed characteristics of these traces are documented elsewhere [1, 2].
The second four traces are derived from the MIPS R2000 architecture. They are interleaved uniprocess, virtual address traces of a variety of optimized C programs. Though they do not include any operating system references, the individual uniprocess traces were randomly interleaved with the same distribution of context switches as were observed in the VAX traces. Each uniprocess trace was captured from the middle of the program's execution so that its steady state behaviour would be measured. However, the interleaved traces are prepended by references, also appropriately interleaved, to all the data and code that the individual programs touched prior to the beginning of tracing. In this way, the caches are completely loaded at the warm-start boundary. At that point, the caches' contents are identical to what would be observed if all the programs were traced from their beginnings. Except for the lack of operating systems references, miss activity observed over the final one million references of each trace is identical to that which would be observed in a real system running these programs interleaved in that way. This is true regardless of the cache size or other characteristics. Unlike in most simulation studies, the validity of the miss rates and performance statistics generated from these traces is not limited to cache sizes smaller than the touched data set, but instead extends to much larger cache sizes. Furthermore, the size of the traces implies that the 95% confidence interval around the measured miss ratio as an estimate of the real miss ratio is less than plus or minus 0.01%, even for cache sizes greater than 2MB [8].
`
Though there are significant quantitative differences in the behaviour of the two sets of traces, qualitatively they are very similar. By using the geometric mean of the results from all eight traces, the goal is to produce results that are more generally applicable. The preparation and characteristics of all the traces, and the differences between the two sets, are described in detail elsewhere [11].
`
`4. Equal Block Size and Fetch Size
`
`basis I'm
`We will begin by rel)rodricing as
`;I
`compiirison the rcsiilts prcseiitcd previously [ I O l . 111 this
`sectioii we assiinie that the I'ctch size is equal to tlic block
`size, aiid that the fetch stl-iltcgy is the very simple OW
`described above. Given tlicsc assuiiiplioos, we ;ittempt to
`tlcteniiine the best choice for the block size as ii tiinctioii
`01' the memory antl cachc ch;iracteristics.
The fetch size is unique among the organizational parameters in that it affects the performance of the computer system through two related mechanisms. The most obvious is through the miss ratio: as the fetch size increases, the miss ratio decreases due to the spatial locality of programs. The probability that an item will be referenced soon decreases with the distance from the most recent reference. With increasing fetch size the mean utility of the words being fetched also drops. When the mean utility of the fetched words drops below the mean utility of the words that are being replaced, the miss ratio starts to rise again. The mean utility of data in the cache is primarily a function of the cache size: smaller caches only contain things that were used recently and so have a high likelihood of being used again. For each cache size, there is an optimum fetch size that minimizes the miss ratio.
There is another secondary effect on the average utility of data in the cache that is dependent on the block size: if the cache size is constant, then as the block size increases the number of blocks decreases. As the number of tags is reduced there is less opportunity to hold a lot of widely distant data in the cache. Also, with longer blocks, a write miss knocks out of the cache more potentially useful data than in the case of shorter blocks. As a result, increasing the block size along with the fetch size reduces the fetch size at which the miss ratio is minimized.
The second mechanism through which the fetch size affects performance is via the miss penalty. A longer fetch takes more time, and increases the cache miss penalty. Thus for small fetch sizes, the two effects are in opposition: increasing the fetch size decreases the miss ratio but increases the miss penalty. The point at which those two effects are balanced defines the minimum execution time as a function of the fetch size. This size is inevitably less than the fetch size that minimizes the miss ratio. This effect is illustrated in Figure 4-1. For the model simulated here - the default 64KB caches, and a 260ns latency, one word per cycle memory system - the load miss ratio is minimized by a fetch and block size of 32 words, while the instruction cache miss ratio is minimized by a fetch/block size of somewhat more than 64 words. In contrast, the fetch/block size that minimized the execution time was 8 words.
`
Block Size Dependencies
(1.0 W per cycle, 260 ns latency)
Figure 4-1
Smith has pointed out that if the cache miss penalty is expressed as the sum of a latency (la) and a transfer time - the ratio of the fetch size and the transfer rate (FS/tr) - then the fetch size that minimizes the mean read time - the product of the cache miss penalty and the cache miss ratio - is only dependent on the product of the latency and transfer rates (la×tr), and not on either independently [21]. To first order, the fetch size that minimizes the mean read time also minimizes the execution time. This sole dependence on the memory speed product is verified experimentally by simulating a variety of memory configurations and plotting the performance optimal block size as a function of the product.
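Smith's observation is easy to reproduce with a toy model (my own sketch, not the paper's simulator): assume a power-law miss ratio MR(FS) = c·FS^(-a) and minimize the mean read time (la + FS/tr)·MR(FS). The optimum then lands at the same fetch size for any latency and transfer rate pair with the same la×tr product:

```python
def mean_read_time(fs, la, tr, a=0.5, c=1.0):
    """Miss penalty (la + FS/tr) times a toy power-law miss ratio c * FS**-a."""
    return (la + fs / tr) * c * fs ** -a

def best_fetch_size(la, tr, sizes=range(1, 65)):
    return min(sizes, key=lambda fs: mean_read_time(fs, la, tr))

# Two memory systems with the same la*tr product (10 cycle-words):
fs1 = best_fetch_size(la=10, tr=1)   # high latency, slow transfer
fs2 = best_fetch_size(la=5, tr=2)    # half the latency, twice the rate
```

For this assumed miss-ratio exponent the analytic optimum is FS* = (a/(1-a))·la·tr, so both configurations pick the same fetch size even though their latencies differ by a factor of two.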
Figure 4-2 shows the aggregate relative execution time as a function of the block size of both caches and of the memory characteristics. In this experiment, and in those that follow, the transfer rate is varied from one word per CPU cycle to four words per cycle. For the cycle time chosen for the base system these translate to bandwidths of 100MBps and 400MBps respectively. The latency is similarly varied from 100ns to 420ns. Thus we have a large spectrum of memory systems experimentally represented.

For each pair of memory characteristics, the performance curve has a minimum at the optimum block/fetch size. Fitting a parabola through the lowest three points approximates that non-integral block size and estimates the best performance level attainable with that hypothetical memory system. These minima are plotted in Figure 4-3. They are connected with a lattice that joins points with the same transfer rate and latency respectively. The conclusion that we can draw from this is that the optimal block size for a split I/D system with identical block and fetch sizes for both caches is typically either four or eight words. If the memory system is particularly unbalanced - that is, with both the transfer rate and latency being either high or low - then the optimum could be either two or sixteen words. More significantly, since the performance curves of Figure 4-2 are very flat near their minima, selecting the block size incorrectly by a factor of two - four versus eight or vice versa - does not have a very great performance impact.

Execution Time vs Memory Parameters
Figure 4-2

Optimum Block Size
Figure 4-3
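The parabola fit used to locate these non-integral minima can be sketched as follows. This is an illustrative reconstruction, not the paper's actual fitting code, and the sample points are synthetic:

```python
def parabola_vertex(p1, p2, p3):
    """Vertex of the unique parabola y = a*x^2 + b*x + c through three points."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    denom = (x1 - x2) * (x1 - x3) * (x2 - x3)
    a = (x3 * (y2 - y1) + x2 * (y1 - y3) + x1 * (y3 - y2)) / denom
    b = (x3**2 * (y1 - y2) + x2**2 * (y3 - y1) + x1**2 * (y2 - y3)) / denom
    c = (x2 * x3 * (x2 - x3) * y1 + x3 * x1 * (x3 - x1) * y2
         + x1 * x2 * (x1 - x2) * y3) / denom
    xv = -b / (2 * a)
    return xv, a * xv**2 + b * xv + c

# Synthetic (block size, execution time) points bracketing a minimum at 8 words
bs_opt, t_opt = parabola_vertex((4, 17.0), (8, 1.0), (16, 65.0))
```

The vertex gives both the non-integral optimum block size and the estimated best attainable performance; presumably the paper applies this to the three lowest measured points of each curve.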
Smith's assertion is directly verified in Figure 4-4, in which the optimal block/fetch sizes are plotted against the product of the latency and transfer rate. Indeed, as expected, the four line segments, which correspond to the different transfer rates, fall on top of each other, indicating that the optimal block size is a function of the product of the two variables and not of either independently. The dotted line indicates the block size for each memory speed product (la×tr) at which the transfer time equals the latency period. It is interesting to note that for memory systems that favour fast transfers (large la, tr and la×tr) it is best to spend more time waiting on the memory latency. For the reverse situation, when the latency is low with respect to the transfer rate (small la, tr and la×tr), then it is best to spend more than half the time in the transfers.
Optimum Block Size versus Speed Ratio
Figure 4-4

For larger caches, the web of points in Figure 4-3 shifts to the right somewhat. Figure 4-5 shows the optimal block sizes for 2MB of cache split equally between the instruction and data caches. One can see that the dependence of the optimal block size on the la×tr product remains, and for any particular product the optimal binary sized block size can be larger than in the 128KB total cache size case by up to a factor of two. For caches smaller than 128KB, the web of points does not shift appreciably to the left and so the optimal block sizes remain essentially unchanged from Figure 4-3.

Optimal Block Size versus Memory Speed Product: 2MB Cache
Figure 4-5
`
`5. Effects of Cycle Time Penalties
There are two main motivations for changing the metric for evaluating caches from miss ratios to execution time. The first is, of course, relevancy: the system designer, the cache design engineer and their managers are ultimately much more interested in the system's performance than in the cache's miss ratio. The second, more subtle, reason is to expose to the designers the tradeoffs that inherently exist between the temporal and organizational design variables. In this section we examine one of those tradeoffs: the link that exists between the CPU cycle time and the block size. A simple, first order, analysis shows that if the CPU cycle time is not a function of the block size, as in the previous section, then the performance optimal block size is independent of the value of the cycle time. In this section we relax this assumption and explore the behaviour of the performance optimal block size in the situation where there is a dependency between the cycle time and the block size. Again, we will be assuming that the block and fetch sizes of both caches are all identical.
We will examine this tradeoff in very general terms. Instead of assuming a fixed cycle time as a function of the block size, we will assume a small proportional increase in the cycle time for each doubling of the block size. Specifically, we will start by assuming a one percent degradation per doubling: that is a 40ns cycle time for a block size of one, a 40.4ns cycle time for a block size of two, a 40.804ns cycle time for a block size of four, and so on. This level of degradation is consistent with increasing the fan-in of a multiplexor between the CPU and the caches, or additional loading on cache address lines due to the additional width of the caches' data array.
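The compounding degradation is straightforward to model. A minimal sketch of the stated assumption, using the 40ns base cycle of the 25MHz system:

```python
def cycle_time_ns(block_words, penalty_per_doubling=0.01, base_ns=40.0):
    """Cycle time grows by a fixed fraction for each doubling of the block size."""
    doublings = block_words.bit_length() - 1   # block sizes are powers of two
    return base_ns * (1 + penalty_per_doubling) ** doublings

times = [cycle_time_ns(bs) for bs in (1, 2, 4)]  # 40, 40.4, 40.804 ns
```

Because the penalty compounds per doubling, a 64-word block already runs about 6% slower than the single-word case at the 1% rate, which is why the optimum shifts toward shorter blocks below.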
Figure 5-1 shows the characteristic web of points for this scenario. The unshaded symbols show the reference scenario of no cycle time penalty. For most of the memory systems considered, a two to three percent performance degradation occurs due to an effect of this magnitude. The maximum performance degradation of 4.2% occurs at the extreme right hand corner of the web, while the minimum (0.7%) occurs at the left hand corner. Interestingly, the cycle time degradation causes a shift in the web of points that is roughly equivalent to a change in the transfer rate. This is intuitively explainable in terms of equivalent influences. Both effects have an increasingly pronounced negative impact on performance as the block size increases.
`
`QJ 1.7
`1.6
`
`8 1.3
`1.2
`
`I
`2
`
`I
`4
`
`I
`8
`
`I
`1'6
`
`m 1.0
`g 09
`
`I
`1
`
`I
`I
`1 8
`32
`64
`Block Size (W)
`optimal Block Size: 7% Cycle Time Degradation
`Figure 5-1
Figure 5-2 shows the corresponding graph of the optimal block size as a function of the memory speed product. It shows two effects: first, there is a small but noticeable decrease in the optimal block size. Only in a few cases, though, is the binary optimal block size changed. Second, the optimal block size's independence from the individual latency and transfer rate parameters is lost: the line segments no longer line up exactly on top of each other.

Optimal Block Size versus Memory Speed Product: 1% Cycle Time Degradation
Figure 5-2
When the percentage degradation per doubling is increased to five percent, both these effects are dramatically increased. The performance loss over the default scenario averages around 10% with a range of 1.2% to 15%. Figure 5-3 shows the choice of the optimal block size in this case. The penalty for large block sizes is so extreme here that the optimal block size never exceeds four words.

Optimal Block Size versus Memory Speed Product: 5% Cycle Time Degradation
Figure 5-3
Of course, the reality of cache design is dramatically different from this very simplified model. The relationship between a potential CPU's cycle time and the block/fetch size is not well behaved or even monotonic. The selection of different RAMs and multiplexors with different access times for different cache organizations will affect the cycle time in subtle ways. Therefore, the moral of this section is that the designer needs to be aware of the general influence on the optimal block size that results from any additional complexity involved in implementing long block and fetch sizes.
`
`6. Different Block and Fetch Sizes
It is becoming increasingly common to have a fetch size that is larger than the block size [5]. This is typically done in conjunction with a more complicated fetch strategy that allows for early continuation of execution as soon as the requested block is received. We, however, will investigate these two decisions separately in this and the subsequent sections. In this section, the block size and fetch size are allowed to vary independently, but fetches still occur only on misses, and execution is still suspended until the entire fetch is complete.
Formally, the desire to free both the block and fetch size variables comes from the realization that the mean read time, the product of the miss penalty and the miss rate ( (la + FS/tr) × MR(BS, FS) ), is dependent on the two variables separately. In particular, the miss penalty is only dependent on the fetch size and the memory characteristics. A designer might therefore reasonably select the fetch size that is most closely suited to the memory system and then select the block size that minimizes the miss ratio given that fetch size.
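That two-step selection can be sketched with a toy miss-ratio surface. This is entirely illustrative; the MR function below is an assumed shape, not the paper's measured data:

```python
def mean_read_time(bs, fs, la, tr, mr):
    """Miss penalty depends only on the fetch size; the miss ratio on both sizes."""
    return (la + fs / tr) * mr(bs, fs)

def choose_sizes(la, tr, mr, sizes=(1, 2, 4, 8, 16, 32, 64)):
    # Step 1: fetch size best suited to the memory, evaluated with BS = FS.
    fs = min(sizes, key=lambda f: mean_read_time(f, f, la, tr, mr))
    # Step 2: block size minimizing the miss ratio given that fetch size.
    bs = min(sizes, key=lambda b: mr(b, fs))
    return bs, fs

# Toy surface: spatial locality helps with FS, fewer tags hurt with BS.
toy_mr = lambda bs, fs: fs ** -0.5 * (1 + 0.01 * bs)
bs, fs = choose_sizes(la=10, tr=1, mr=toy_mr)
```

With this assumed surface the procedure picks a fetch size of several blocks and the smallest block size; as the rest of the section shows, on the real traces the gain over simply setting BS = FS turns out to be small.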
Recalling the discussion of the influences on the miss ratio in Section 4, we note that as the block size increases there is a slight negative impact on the miss rate because of the reduction in the number of tags. For the default cache size and associativity, and a fetch size of sixteen words, the overall miss ratio increases from 0.0087 to 0.0090 as the block size increases from one word to sixteen words. All of that change occurs in the data cache, since there is no interference between reads and writes in the instruction cache: once a block is loaded by a fetch it must remain in the cache until a cache miss to a conflicting block forces it and all other blocks loaded at the same time to be replaced. In this case, the data cache load miss ratio went from 0.0161 to 0.0173, while the instruction fetch miss ratio was constant at 0.0062.

For block sizes greater than the fetch size this degradation in the miss ratio with increasing block size becomes more dramatic and affects both the instruction and data caches. By the time the block size reaches 64 words, the instruction and load miss ratios have reached 0.0084 and 0.0277 respectively, for a combined read miss ratio of 0.013.
Thus, given this fetch strategy, there is no intrinsic advantage to having a block size greater than the fetch size. For the instruction cache, there is no advantage to having a block size smaller than the fetch size. For the data cache, there is a miss ratio preference for a block size that is as small as practical.
Figure 6-1 shows the results of an experiment in which both the block and fetch sizes were varied over the range 1 word to 64 words. For each set of memory characteristics, a two dimensional array of execution times was the result. By fitting parabolas first in one direction and then the other, the optimal non-binary fetch and block size could be estimated along with the best case performance. The figure shows the web of the optimal fetch sizes and performance levels. Again, the hollow symbols indicate the base level case of equal block and fetch sizes.
`
`8 1.3
`& 1.2
`2 1.1
`m 1.0 <
`g 09
`
`I
`1
`
`I
`2
`
`I
`4
`
`I
`8
`
`I
`1'6
`
`I
`I
`1 8
`32
`64
`Fetch Size (w)
`
`Optimal Fetch Size
`Figure 6-1
The most significant observation is that there is only a small increase in the optimal fetch size, and a negligible improvement in the performance level (less than 0.6%). Clearly, selecting equal block and fetch sizes is a very good choice. Figure 6-2 shows the optimal block and fetch size for each memory speed product. The dashed lines again represent the baseline case of enforced equal block and fetch sizes. Freeing up the block and fetch sizes has allowed the fetch size to increase fractionally and the block size to decrease slightly. When restricted to binary sized blocks and fetches, the best case is occasionally equal blocks and fetches, and occasionally fetches of two blocks.
`
Optimal Block and Fetch Sizes versus Memory Speed Product
Figure 6-2
The question that then arises is why the optimal block size was not significantly smaller than in the default scenario. The answer is write effects. Even though we have selected a write policy that minimizes the performance impa