The Performance Impact of Block Sizes and Fetch Strategies

Steven Przybylski

MIPS Computer Systems,
928 Arques Ave, Sunnyvale, CA 94086
Abstract

This paper explores the interactions between a cache's block size, fetch size and fetch policy from the perspective of maximizing system-level performance. It has been previously noted that, given a simple fetch strategy, the performance-optimal block size is almost always four or eight words [10]. If there is even a small cycle time penalty associated with either longer blocks or fetches, then the performance-optimal size is appreciably reduced. In split cache organizations, where the fetch and block sizes of the instruction and data caches are all independent design variables, the instruction cache block size and fetch size should be the same. For the workload and write-back write policy used in this trace-driven simulation study, the instruction cache block size should be about a factor of two greater than the data cache fetch size, which in turn should be equal to or double the data cache block size. The simplest fetch strategy, fetching only on a miss and stalling the CPU until the fetch is complete, works well. Complicated fetch strategies do not produce the performance improvements indicated by accompanying reductions in miss ratios because of limited memory resources and a strong temporal clustering of cache misses. For the environments simulated here, the most effective fetch strategy improved performance by between 1.7% and 4.5% over the simplest strategy described above.
1. Introduction
Over the last several years, designers have become increasingly concerned with the growing gap between the cycle times of dynamic RAMs and microprocessor CPUs. This concern has spurred interest in increasing cache block sizes to minimize the impact of a relatively long main memory latency. Though there have been several studies published to date that explore the characteristics of caches with longer block sizes [2, 12, 15, 16, 18, 20, 21], the studies have generally failed to explore the performance implications of some of the more significant design decisions.

CH2887-8/90/0000/0160$01.00 © 1990 IEEE

This paper expands on a section in a previous work [10]. In that article, the dependency of performance on the block size was graphed as a function of the main memory characteristics. The conclusion was that each memory system, as characterized by its latency and transfer rate, implies a performance-optimal block size that is independent of the CPU characteristics, including the cycle time. That analysis was limited by underlying assumptions of the study. In particular, exactly one block was fetched from main memory on a cache miss, the CPU was stalled until the fetch was complete, and there was no assumed cycle time degradation with increasing block size. In this paper we will relax each of these constraints in turn and thereby examine their impact on the best block size and the resulting performance level.
In recognition of the fact that the ultimate measure of a memory hierarchy is the performance obtained by the computer of which it is a part, execution time is the primary metric used in this paper to compare different cache organizations and fetch policies. The use of this metric is especially important in the study of policies that rely on parallelism between CPU execution and memory activity to presumably improve performance. Miss ratios alone are inadequate for judging their effectiveness because not all misses are created equal. In the presence of a complicated fetch strategy, individual misses can have widely varying contributions to the total cycle count, depending on the exact state of the machine at the time. As a result, the relationship between the number of misses and the execution time is not necessarily straightforward.
In this study, different system configurations are compared using a trace-driven simulator which accurately keeps track of activity at all levels in the memory hierarchy during every machine cycle.
After describing the terminology and experimental method in Sections 2 and 3, we reproduce in Section 4 the results from the previous paper as a baseline for comparison. In Section 5, we examine the effects of a cycle time penalty for long block sizes. If the additional multiplexors and control needed to implement a longer block size degrade the cycle time of the CPU from the simple, single word block size case, then there clearly will be an impact on the tradeoff between latency and transfer

DOCKET NO.: IPR-5463750

TI-EXHIBIT 1012

period that determines the optimal block size.
An important realization is that the amount of data that is fetched from memory need not be equal to a block. It can, in fact, be either more or less. Machines that prefetch instructions are fetching several blocks per cache miss. Section 6 asks the question: how are the optimal block size and fetch size related as a function of the memory characteristics? In contrast, Section 7 asks the question: what is the benefit of more complicated fetch strategies that try to overlap CPU execution with main memory fetches? Finally, Section 8 summarizes and concludes the paper.

2. Terminology
The literature on caches is confused with respect to the definitions of some important terms. The following definitions are unambiguous, consistent with the majority of the literature, and appropriate for the discussion of block sizes and fetch strategies.
Block:
A block is the unit of data in the cache associated with a tag. An n-way set associative cache with m sets has n×m blocks. Block sizes are always binary values (that is, powers of two) and typically range from one word to 128 words.
Sub-block:
A sub-block is the unit of data that is associated with a valid bit. Sub-blocks are smaller than or equal to blocks, and are generally either one byte, one word or one block long. In write-back caches, dirty bits are typically also associated with sub-blocks.
Fetch Size:
The fetch size is the quantity of data fetched from memory at one time. The only hard constraint is that it be an integral number of sub-blocks, though in reality it is typically a binary number of blocks. That the most common fetch size is one block contributes to the common usage of block size for fetch size and the blurred distinction between the two.
Fetch Policy:
The fetch policy is the algorithm used to determine when a main memory read is to occur, which words are to be fetched, what order they will be returned in, and at what point the CPU will be allowed to continue execution if it is stalled. Designs that prefetch on a cache miss have a fetch size greater than one block, and resume execution as soon as the missed sub-block or block is returned from memory.
(Read) Cache Miss Ratio:
Most researchers define the cache miss ratio as the ratio of read misses and write misses to the number of memory references. However, it is widely recognized that reads (instruction fetches and loads) and writes (stores) have greatly differing frequencies, policies, and hit and miss penalties [7, 17]. In the light of the shift that is occurring in the analysis of memory hierarchies away from miss ratios and towards overall system-level performance, we define the read cache miss ratio and write cache miss ratio separately. When used without qualification, miss ratio refers to the read miss ratio.
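The counting relationship in the block definition above can be sketched directly. This is an illustration only; the function name and the 4-byte word size are assumptions, not something the paper specifies in code:

```python
def cache_geometry(cache_bytes, block_words, assoc, word_bytes=4):
    """An n-way set associative cache with m sets has n x m blocks.

    Returns (blocks, sets). Block sizes are powers of two, so for
    well-formed parameters the integer divisions are exact.
    """
    block_bytes = block_words * word_bytes
    blocks = cache_bytes // block_bytes
    sets = blocks // assoc
    return blocks, sets

# The base system used later in the paper: a 64KB direct-mapped
# (1-way) cache with 4-word (16-byte) blocks.
blocks, sets = cache_geometry(64 * 1024, 4, 1)
```

For that configuration the cache holds 4096 blocks in 4096 sets; a hypothetical 2-way version of the same cache would hold the same 4096 blocks in 2048 sets.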
3. Experimental Method
All the results presented in this paper were generated with a trace-driven simulator stimulated by eight large address traces. This section briefly describes the simulator and the traces. Both are characterized at great length in other documents [11, 10]. The first of these publications also includes a detailed analysis supporting the credibility of conclusions based on this method.
A proper study of memory hierarchies in which the metric for comparison is execution time requires a simulator that accurately accounts for time at all levels of the hierarchy. Specifically, the simulator used here allows the user to specify the duration of reads and writes to the tag and data portions of caches and write buffers, as well as the minimum separation between operations. In addition, latency and transfer delays between levels of the hierarchy can be varied to allow for complete and realistic system models. A wide variety of fetch strategies and an equally broad spectrum of write strategies are provided for.
The second prerequisite is a set of system models to be used in the simulations. The cache design space is incredibly diverse. Including the often overlooked temporal parameters, there are literally tens of independent design variables per level in a memory hierarchy. Since it is impossible to explore the entire design space in any one study, we limit ourselves by fixing most of the parameters at common values that are consistent with the ranges of the free variables of the experiment. For instance, we have restricted ourselves to a single-level, split instruction/data cache. A realistic base system was chosen as the reference. In each of the experiments to follow, three or more design parameters are changed simultaneously over a portion of the design space while all of the others remain fixed.
The base system is characteristic of a system built around a 25MHz RISC CMOS processor. Throughout this study, we will uniformly assume that the caches determine the CPU cycle time. The single level of caching contains 64KB in each of the instruction and data caches. Both are direct-mapped and have default block and fetch sizes of 4 words. The sub-block size is uniformly one word. The data cache is a write-back cache with a four entry write buffer, each entry being one block wide. The datapath into and out of the write buffer is equal in width to the backplane. Both ports cycle at the CPU cycle rate, and they can operate simultaneously. Writes into the cache take two cycles, and the write miss policy is to fill the sub-block written to with the written data but not to fetch the rest of the block from main memory. With these choices, both write hits and misses are relatively quick. An aggressive write policy minimizes the write's

contribution to the overall execution time and allows us to concentrate on the read effects.
The caches are connected by a one word wide bus capable of block transfers at a rate of one word per cycle. The cache read miss penalty is one cycle to get the address to the memory, several cycles of memory latency, then several cycles to transfer the data back into the cache. The main memory is characterized by a 180ns combined decode and access time, plus a 120ns gap between operations. The default fetch strategy is to stall the CPU on a read miss until the entire fetch is complete and the cache is loaded (demand fetch only [4]). During the loading of the cache, the requested datum is shunted off into a buffer and passed to the CPU on the last cycle of the memory transfer so that no additional overhead is needed to get the datum out of the cache.
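The read-miss timing just described can be sketched as a simple cycle-count model. This is an illustrative reading of the prose, not the simulator's actual code: the function name is invented, the 120ns inter-operation gap is ignored, and the latency is simply rounded up to whole CPU cycles:

```python
import math

def read_miss_penalty_cycles(fetch_words, latency_ns,
                             cycle_ns=40.0, bus_words_per_cycle=1):
    """One cycle to present the address to memory, the memory latency
    rounded up to whole CPU cycles, then the transfer of the fetched
    words over the one-word-wide bus."""
    address_cycles = 1
    latency_cycles = math.ceil(latency_ns / cycle_ns)
    transfer_cycles = math.ceil(fetch_words / bus_words_per_cycle)
    return address_cycles + latency_cycles + transfer_cycles

# Base system: 4-word fetch, 180 ns access memory, 40 ns (25MHz)
# CPU cycle, one word transferred per cycle.
penalty = read_miss_penalty_cycles(4, 180.0)
```

Under these assumptions the base-system read miss costs 1 + 5 + 4 = 10 CPU cycles, and each doubling of the fetch size adds one bus cycle per extra word.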
The first four of the eight traces used to stimulate this model are drawn from the VAX architecture. They are long concatenations of several 400,000 reference snapshots taken on a VAX 8200. They include operating system references and true multiprogramming activity.
The traces were preprocessed to collapse adjacent byte references to the code segment into long word (32b) references, and to transform quad word references into the appropriate sequence of long word references. The warm-start boundary is set to 450,000 references for all four traces. All addresses are virtual addresses. The generation and detailed characteristics of these traces are documented elsewhere [1, 2].
The second four traces are derived from the MIPS R2000 architecture. They are interleaved uniprocess, virtual address traces of a variety of optimized C programs. Though they do not include any operating system references, the individual uniprocess traces were randomly interleaved with the same distribution of context switches as was observed in the VAX traces. Each uniprocess trace was captured from the middle of the program's execution so that its steady state behaviour would be measured. However, the interleaved traces are prepended by references, also appropriately interleaved, to all the data and code that the individual programs touched prior to the beginning of tracing. In this way, the caches are completely loaded at the warm-start boundary. At that point, the caches' contents are identical to what would be observed if all the programs were traced from their beginnings. Except for the lack of operating system references, miss activity observed over the final one million references of each trace is identical to that which would be observed in a real system running these programs interleaved in that way. This is true regardless of the cache size or other characteristics. Unlike in most simulation studies, the validity of the miss rates and performance statistics generated from these traces is not limited to cache sizes smaller than the touched data set, but instead extends to much larger cache sizes. Furthermore, the size of the traces implies that the 95% confidence interval around the measured miss ratio, as an estimate of the real miss ratio, is less than plus or minus 0.01%, even for cache sizes greater than 2MB [8].
Though there are significant quantitative differences in the behaviour of the two sets of traces, qualitatively they are very similar. By using the geometric mean of the results from all eight traces, the goal is to produce results that are more generally applicable. The preparation and characteristics of all the traces, and the differences between the two sets, are described in detail elsewhere [11].
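The geometric mean used for this aggregation has the convenient property, for relative execution times, that the choice of baseline does not change rankings. A minimal sketch; the per-trace numbers below are invented purely for illustration:

```python
import math

def geometric_mean(values):
    """n-th root of the product of n values, computed in log space
    for numerical stability."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical relative execution times for eight traces.
relative_times = [1.02, 0.98, 1.10, 1.05, 0.95, 1.00, 1.08, 0.97]
aggregate = geometric_mean(relative_times)
```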
4. Equal Block Size and Fetch Size

We will begin by reproducing, as a basis for comparison, the results presented previously [10]. In this section we assume that the fetch size is equal to the block size, and that the fetch strategy is the very simple one described above. Given these assumptions, we attempt to determine the best choice for the block size as a function of the memory and cache characteristics.
The fetch size is unique among the organizational parameters in that it affects the performance of the computer system through two related mechanisms. The most obvious is through the miss ratio: as the fetch size increases, the miss ratio decreases due to the spatial locality of programs. The probability that an item will be referenced soon decreases with the distance from the most recent reference. With increasing fetch size the mean utility of the words being fetched also drops. When the mean utility of the fetched words drops below the mean utility of the words that are being replaced, the miss ratio starts to rise again. The mean utility of data in the cache is primarily a function of the cache size: smaller caches only contain things that were used recently and so have a high likelihood of being used again. For each cache size, there is an optimum fetch size that minimizes the miss ratio.
There is another, secondary effect on the average utility of data in the cache that is dependent on the block size: if the cache size is constant, then as the block size increases the number of blocks decreases. As the number of tags is reduced there is less opportunity to hold a lot of widely distant data in the cache. Also, with longer blocks, a write miss knocks out of the cache more potentially useful data than in the case of shorter blocks. As a result, increasing the block size along with the fetch size reduces the fetch size at which the miss ratio is minimized.
The second mechanism through which the fetch size affects performance is via the miss penalty. A longer fetch takes more time, and increases the cache miss penalty. Thus for small fetch sizes, the two effects are in opposition: increasing the fetch size decreases the miss ratio but increases the miss penalty. The point at which those two effects are balanced defines the minimum execution time as a function of the fetch size. This size is inevitably less than the fetch size that minimizes the miss ratio. This effect is illustrated in Figure 4-1. For the model simulated here (the default 64KB caches, and a 260ns latency, one word per cycle memory system) the load miss ratio is minimized by a fetch and block size of 32

words, while the instruction cache miss ratio is minimized by a fetch/block size of somewhat more than 64 words. In contrast, the fetch/block size that minimized the execution time was 8 words.
`
`I
`
`I
`
`I
`
`I
`
`I
`
`I
`
`. . . . . . - b i d Miss G a b
`
`...
`
`A.0.048.2
`0.0428
`
`0.018
`-.0.012
`- .0.006
`
`I
`2
`
`4
`
`
`
`Block Size (Words)
`Block Size De endencies
`(1.0 w per cycle, %O ns latency)
`Figure 4-1
Smith has pointed out that if the cache miss penalty is expressed as the sum of a latency (la) and a transfer time, the ratio of the fetch size and the transfer rate (FS/tr), then the fetch size that minimizes the mean read time, the product of the cache miss penalty and the cache miss ratio, is only dependent on the product of the latency and transfer rates (la×tr), and not on either independently [21]. To first order, the fetch size that minimizes the mean read time also minimizes the execution time. This sole dependence on the memory speed product is verified experimentally by simulating a variety of memory configurations and plotting the performance optimal block size as a function of the product.
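Smith's observation can be checked numerically. The sketch below uses an invented miss-ratio model (spatial locality lowers the miss ratio at first; displacement of useful data raises it again); with latency measured in cycles and transfer rate in words per cycle, two memory systems with the same la×tr product yield the same optimal fetch size:

```python
def mean_read_time(fs, la, tr, miss_ratio):
    """Miss penalty (la + FS/tr) times the miss ratio MR(FS)."""
    return (la + fs / tr) * miss_ratio(fs)

def best_fetch_size(la, tr, miss_ratio, sizes):
    return min(sizes, key=lambda fs: mean_read_time(fs, la, tr, miss_ratio))

# Illustrative miss-ratio model only, not measured data.
mr = lambda fs: 0.02 * (1.0 / fs + fs / 64.0)

sizes = [2 ** k for k in range(8)]          # 1 to 128 words
a = best_fetch_size(8.0, 2.0, mr, sizes)    # la*tr = 16
b = best_fetch_size(16.0, 1.0, mr, sizes)   # la*tr = 16: same optimum as a
c = best_fetch_size(1.0, 1.0, mr, sizes)    # la*tr = 1: smaller optimum
```

Differentiating (la + FS/tr)·MR(FS) shows why: the optimum satisfies FS + MR(FS)/MR'(FS) = -(la·tr), a condition on the product alone, whatever the miss-ratio curve.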
Figure 4-2 shows the aggregate relative execution time as a function of the block size of both caches and of the memory characteristics. In this experiment, and in those that follow, the transfer rate is varied from one word per CPU cycle to four words per cycle. For the cycle time chosen for the base system these translate to bandwidths of 100MBps and 400MBps respectively. The latency is similarly varied from 100ns to 420ns. Thus we have a large spectrum of memory systems experimentally represented.
For each pair of memory characteristics, the performance curve has a minimum at the optimum block/fetch size. Fitting a parabola through the lowest three points approximates that non-integral block size and estimates the best performance level attainable with that hypothetical memory system. These minima are plotted in Figure 4-3. They are connected with a lattice that joins points with the same transfer rate and latency respectively. The conclusion that we can draw from this is that the optimal block size for a split I/D system with identical block and fetch sizes for both caches is typically either four or eight words. If the memory system is particularly unbalanced (that is, with both the transfer rate and latency being either high or low) then the optimum could be either two or sixteen words. More significantly, since the performance curves of Figure 4-2 are very flat near their minima, selecting the block size
[Figure: relative execution time versus block size for the various memory configurations.]
Execution Time vs Memory Parameters
Figure 4-2
[Figure: lattice of optimum block sizes and performance levels; legend: latencies of 100, 180, 260, 340 and 420 ns; transfer rates of 0.25, 0.5, 1.0, 2.0 and 4.0 words per 40 ns peak.]
Optimum Block Size
Figure 4-3
incorrectly by a factor of two (four versus eight or vice versa) does not have a very great performance impact.
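The parabola fit used to locate these minima is straightforward to reproduce. A sketch, under the assumption (not stated explicitly in the paper) that the fit is done in log2 of the block size, since the measured sizes are binary; the sample numbers are invented:

```python
def parabola_vertex(xs, ys):
    """Fit y = a*x^2 + b*x + c exactly through three points and return
    the vertex (x*, y*), i.e. the estimated non-integral optimum."""
    (x0, x1, x2), (y0, y1, y2) = xs, ys
    d1 = (y1 - y0) / (x1 - x0)      # divided differences give the
    d2 = (y2 - y1) / (x2 - x1)      # quadratic coefficients directly
    a = (d2 - d1) / (x2 - x0)
    b = d1 - a * (x0 + x1)
    c = y0 - (a * x0 + b) * x0
    xv = -b / (2.0 * a)
    return xv, (a * xv + b) * xv + c

# The three lowest measured points bracket the minimum.
log_sizes = [2.0, 3.0, 4.0]      # log2 of block sizes 4, 8, 16
rel_times = [1.06, 1.02, 1.05]   # relative execution times (invented)
opt_log_bs, opt_time = parabola_vertex(log_sizes, rel_times)
```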
Smith's assertion is directly verified in Figure 4-4, in which the optimal block/fetch sizes are plotted against the product of the latency and transfer rate. Indeed, as expected, the four line segments, which correspond to the different transfer rates, fall on top of each other, indicating that the optimal block size is a function of the product of the two variables and not of either independently. The dotted line indicates the block size for each memory speed product (la×tr) at which the transfer time equals the latency period. It is interesting to note that for memory systems that favour fast transfers (large la, tr and la×tr) it is best to spend more time waiting on the memory latency. For the reverse situation, when the latency is low with respect to the transfer rate (small la, tr and la×tr), then it is best to spend more than half the time in the transfers.
For larger caches, the web of points in Figure 4-3 shifts to the right somewhat. Figure 4-5 shows the optimal block sizes for 2MB of cache split equally between the instruction and data caches. One can see that the dependence of the optimal block size on the la×tr product remains, and for any particular product the optimal binary sized block size can be larger than in the 128KB total cache size case by up to a factor of two. For caches smaller than 128KB, the web of points does not shift appreciably to the

[Figure: optimum block size versus the ratio of transfer time to latency; legend: 0.25, 0.5, 1.0, 2.0 and 4.0 words per 40 ns cycle; dotted line marks where the transfer time equals the latency.]
Optimum Block Size versus Speed Ratio
Figure 4-4

[Figure: optimal block size versus memory speed product (Cyc × W/Cyc) for the 2MB cache.]
Optimal Block Size versus Memory Speed Product: 2MB Cache
Figure 4-5
left, and so the optimal block sizes remain essentially unchanged from Figure 4-3.

5. Effects of Cycle Time Penalties
There are two main motivations for changing the metric for evaluating caches from miss ratios to execution time. The first is, of course, relevancy: the system designer, the cache design engineer and their managers are ultimately much more interested in the system's performance than in the cache's miss ratio. The second, more subtle, reason is to expose to the designers the tradeoffs that inherently exist between the temporal and organizational design variables. In this section we examine one of those tradeoffs: the link that exists between the CPU cycle time and the block size. A simple, first order analysis shows that if the CPU cycle time is not a function of the block size, as in the previous section, then the performance optimal block size is independent of the value of the cycle time. In this section we relax this assumption and explore the behaviour of the performance optimal block size in the situation where there is a dependency between the cycle time and the block size. Again, we will be assuming that the block and fetch sizes of both caches are all identical.
We will examine this tradeoff in very general terms. Instead of assuming a fixed cycle time as a function of the block size, we will assume a small proportional increase in the cycle time for each doubling of the block size. Specifically, we will start by assuming a one percent degradation per doubling: that is, a 40ns cycle time for a block size of one, a 40.4ns cycle time for a block size of two, a 40.804ns cycle time for a block size of four, and so on. This level of degradation is consistent with increasing the fan-in of a multiplexor between the CPU and the caches, or additional loading on cache address lines due to the additional width of the caches' data array.
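The compounding model behind those numbers is easy to state in code; a small sketch of the stated figures (the function name is an assumption):

```python
import math

def degraded_cycle_ns(block_words, base_ns=40.0, frac_per_doubling=0.01):
    """Cycle time grows by a fixed fraction for each doubling of the
    block size, compounding: 40 ns at one word, 40.4 ns at two words,
    40.804 ns at four words for a 1% penalty per doubling."""
    doublings = math.log2(block_words)
    return base_ns * (1.0 + frac_per_doubling) ** doublings

cycle_at_64 = degraded_cycle_ns(64)   # six doublings of compounding
```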
Figure 5-1 shows the characteristic web of points for this scenario. The unshaded symbols show the reference scenario of no cycle time penalty. For most of the memory systems considered, a two to three percent performance degradation occurs due to an effect of this magnitude. The maximum performance decrease of 4.2% occurs at the extreme right hand corner of the web, while the minimum (0.7%) occurs at the left hand corner. Interestingly, the cycle time degradation causes a shift in the web of points that is roughly equivalent to a change in the transfer rate. This is intuitively explainable in terms of equivalent influences. Both effects have an increasingly pronounced negative impact on performance as the block size increases.
`
`QJ 1.7
`1.6
`
`8 1.3
`1.2
`
`I
`2
`
`I
`4
`
`I
`8
`
`I
`1'6
`
`m 1.0
`g 09
`
`I
`1
`
`I
`I
`1 8
`32
`64
`Block Size (W)
`optimal Block Size: 7% Cycle Time Degradation
`Figure 5-1
Figure 5-2 shows the corresponding graph of the optimal block size as a function of the memory speed product. It shows two effects: first, there is a small but noticeable decrease in the optimal block size. Only in a few cases, though, is the binary optimal block size changed. Second, the optimal block size's independence from the individual latency and transfer rate parameters is lost: the line segments no longer line up exactly on top of each other.

[Figure: optimal block size versus memory speed product (Cyc × W/Cyc).]
Optimal Block Size versus Memory Speed Product: 1% Cycle Time Degradation
Figure 5-2
When the percentage degradation per doubling is increased to five percent, both these effects are dramatically increased. The performance loss over the default scenario averages around 10% with a range of

1.2% to 15%. Figure 5-3 shows the choice of the optimal block size in this case. The penalty for large block sizes is so extreme here that the optimal block size never exceeds four words.

[Figure: optimal block size versus memory speed product (Cyc × W/Cyc).]
Optimal Block Size versus Memory Speed Product: 5% Cycle Time Degradation
Figure 5-3
Of course, the reality of cache design is dramatically different than this very simplified model. The relationship between a potential CPU's cycle time and the block/fetch size is not well behaved or even monotonic. The selection of different RAMs and multiplexors with different access times for different cache organizations will affect the cycle time in subtle ways. Therefore, the moral of this section is that the designer needs to be aware of the general influence on the optimal block size that results from any additional complexity involved in implementing long block and fetch sizes.

6. Different Block and Fetch Sizes
It is becoming increasingly common to have a fetch size that is larger than the block size [5]. This is typically done in conjunction with a more complicated fetch strategy that allows for early continuation of execution as soon as the requested block is received. We, however, will investigate these two decisions separately in this and the subsequent sections. In this section, the block size and fetch size are allowed to vary independently, but fetches still occur only on misses, and execution is still suspended until the entire fetch is complete.
Formally, the desire to free both the block and fetch size variables comes from the realization that the mean read time, the product of the miss penalty and the miss rate ((la + FS/tr) × MR(BS, FS)), is dependent on the two variables separately. In particular, the miss penalty is only dependent on the fetch size and the memory characteristics. A designer might therefore reasonably select the fetch size that is most closely suited to the memory system and then select the block size that minimizes the miss ratio given that fetch size.
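That separable structure can be sketched directly. The miss-ratio surface below is invented for illustration (a spatial-locality term in the fetch size plus a mild tag-count penalty in the block size); it is not the paper's measured data:

```python
def mean_read_time(bs, fs, la, tr, miss_ratio):
    """(la + FS/tr) * MR(BS, FS): the penalty depends only on the fetch
    size and the memory; the miss ratio depends on both sizes."""
    return (la + fs / tr) * miss_ratio(bs, fs)

def mr(bs, fs):
    # Illustrative model only: larger fetches exploit spatial locality
    # up to a point; larger blocks mean fewer tags and slightly more
    # misses.
    return 0.02 * (1.0 / fs + fs / 64.0) * (1.0 + 0.002 * bs)

sizes = [1, 2, 4, 8, 16, 32, 64]
best_bs, best_fs = min(
    ((bs, fs) for bs in sizes for fs in sizes if bs <= fs),
    key=lambda p: mean_read_time(p[0], p[1], 8.0, 1.0, mr))
```

Under this toy model the search settles on the smallest block with a moderate fetch, echoing the data-cache preference discussed next; the constraint bs <= fs reflects the lack of any intrinsic advantage to blocks longer than the fetch.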
Recalling the discussion of the influences on the miss ratio in Section 4, we note that as the block size increases there is a slight negative impact on the miss rate because of the reduction in the number of tags. For the default cache size and associativity, and a fetch size of sixteen words, the overall miss ratio increases from 0.0087 to 0.0090 as the block size increases from one word to sixteen words. All of that change occurs in the data cache, since there is no interference between reads and writes in the instruction cache: once a block is loaded by a fetch it must remain in the cache until a cache miss to a conflicting block forces it and all other blocks loaded at the same time to be replaced. In this case, the data cache load miss ratio went from 0.0161 to 0.0173, while the instruction fetch miss ratio was constant at 0.0062.
For block sizes greater than the fetch size this degradation in the miss ratio with increasing block size becomes more dramatic and affects both the instruction and data caches. By the time the block size reaches 64 words, the instruction and load miss ratios have reached 0.0084 and 0.0277 respectively, for a combined read miss ratio of 0.013.
Thus, given this fetch strategy, there is no intrinsic advantage to having a block size greater than the fetch size. For the instruction cache, there is no advantage to having a block size smaller than the fetch size. For the data cache, there is a miss ratio preference for a block size that is as small as practical.
Figure 6-1 shows the results of an experiment in which both the block and fetch sizes were varied over the range 1 word to 64 words. For each set of memory characteristics, a two dimensional array of execution times was the result. By fitting parabolas first in one direction and then the other, the optimal non-binary fetch and block size could be estimated along with the best case performance. The figure shows the web of the optimal fetch sizes and performance levels. Again, the hollow symbols indicate the base level case of equal block and fetch sizes.
`
`8 1.3
`& 1.2
`2 1.1
`m 1.0 <
`g 09
`
`I
`1
`
`I
`2
`
`I
`4
`
`I
`8
`
`I
`1'6
`
`I
`I
`1 8
`32
`64
`Fetch Size (w)
`
`Optimal Fetch Size
`Figure 6-1
The most significant observation is that there is only a small increase in the optimal fetch size, and a negligible improvement in the performance level (less than 0.6%). Clearly, selecting equal block and fetch sizes is a very good choice. Figure 6-2 shows the optimal block and fetch size for each memory speed product. The dashed lines again represent the baseline case of enforced equal block and fetch sizes. Freeing up the block and fetch sizes has allowed the fetch size to increase fractionally and the block size to decrease slightly. When restricted to binary sized blocks and fetches, the best case is occasionally equal blocks and fetches, and occasionally fetches of two blocks.

[Figure: optimal block and fetch sizes versus memory speed product (Cyc × W/Cyc); dashed lines show the enforced-equal baseline.]
Optimal Block and Fetch Sizes versus Memory Speed Product
Figure 6-2
The question that then arises is why the optimal block size was not significantly smaller than in the default scenario. The answer is write effects. Even though we have selected a write policy that minimizes the performance impa
