throbber
Scalable Performance Analysis: The Pablo Performance
`Analysis Environment
`
`Daniel A. Reed~
`Phillip C. Roth'
`
`1:1.uth A. Aydt'
`Keith A. Shields'
`Luis F. Tavera'
`
`Roger J. Noe'
`Bradley W. Schwartz'
`
`Department of Computer Science
`University of Illinois
`Urbana, Illinois 61801
`
`Abstract
`
`Developers of application codes for massively paral(cid:173)
`lel computer ~y~terws face daunting performance tun(cid:173)
`ing and optimization problems that must be solved if
`massively parallel systems are to fulfill their promise.
`Recording and analyzing the dynamics of applica(cid:173)
`tion program, system software, and hardware inter(cid:173)
`actions is the key to understanding and the prerequi(cid:173)
`site to performance tuning, but this instrumentation
`and analysis must not unduly perturb program execu(cid:173)
`tion. Pablo is a performance analysis environment de(cid:173)
`signed to provide unobtrusive performance data cap(cid:173)
`ture, analysis, and presentation across a wide variety
`of scalable parallel systems. Current efforts include
`dynamic statistical clustering to reduce the volume of
`data that must be captured and complete performance
`data immersion via head-mounted displays.
`
`1
`
`Introduction
`
`As computational science becomes an equal partner
`to theory and experiment, there is growing consen(cid:173)
`sus that massively parallel systems are the only tech(cid:173)
`nically and economically viable approach to achiev(cid:173)
`ing the computing power needed to effectively study
`the "Grand Challenge'1 problems (e.g., ocean and cli(cid:173)
`mate modeling, computational fluid and combustion
`
`'"Supported in part by the Adv-anced Research Projects
`Agency under ARPA Contract No. DAVT63~91~C~0029, by
`the National Science Foundation under grants NSF IRI 92~
`12976 and NSF CDA87~22836, by the National Aeronautics and
`Space Administration under NASA Contract Number NAG~l~
`613, and by a collaborative research agreement with the Intel
`Supercomputer Systems Division.
`
`dynamics, or computational drug design) [3]. Sev(cid:173)
`eral vendors huve begun delivery of massively purullel
`systems with peak performance of a few hundred gi(cid:173)
`gaflops, and all have clear performance growth paths
`to multiple teraflops within three years.
`Unfortunately, the developer of an application code
`for a massively parallel system faces many of the same
`difficulties one encounters when developing a man(cid:173)
`agement structure for a large, human organization -
`information acquisition problems and communication
`delays can limit performance. In a parallel system, the
`interactions occur on a much smaller time scale and
`the potential information volume is much greater.
`Given this complexity, the key to optimizing the
`performance of parallel application codes is under(cid:173)
`standing the dynamic interactions of application pro(cid:173)
`gram, system software, and hardware. By recording
`the dynamic pattern of software and hardware inter(cid:173)
`actions, one can identify and remove performance bot(cid:173)
`tlenecks. To gain insight from this data and to tune
`soitware, the data must be processed and presented in
`ways that not only show trends but also allow detailed
`exploration of small scale behavior.
`The remainder of this paper is organized as follows.
`In §2-§4, we describe the design philosophy and soft(cid:173)
`ware components of Pablo, a performance analysis en(cid:173)
`vironment designed to provide performance data cap(cid:173)
`ture, analysis, and presentation across a wide variety
`of scalable parallel systems. In §5, we describe an ap(cid:173)
`proach to dynamic statistical clustering that promises
`to dramatically reduce the volume of performance
`data that. must. be captured, while still providing accu(cid:173)
`rate, dynamic performance metrics. This is followed in
`§6 by a description of high-modality data presentation
`techniques that can immerse a performance analyst in
`
`PALO ALTO NETWORKS Exhibit 1028 Page 1
`
`

`
`a virtual world of dynamic performance data. Finally,
`§7 summarizes the current soitware state and suggests
`avenues for further research.
`
`with massively parallel systemsj these are the subject
`oi §5 and §6.
`
`3 Pablo Software Overview
`
`•
`
`2 Pablo Design Philosophy
`
`The precursor to performance optimization is in(cid:173)
`~ight -
`one mu~t under~tand not only the eirecb of
`potential optimizations, but also how and why they af(cid:173)
`fect the system. Traditional, static performance mea(cid:173)
`sures specify what happened (e.g., processor utiliza(cid:173)
`tion was low), but they reveal frustratingly little about
`the more important issues: why and how it happened.
`To understand these issues, one must capture and ana(cid:173)
`lyze the dynamics of application, system software, and
`hardware interactions. Hardware designers use logic
`analyzers to capture and display transient signals in
`different portions of a hardware designj designers of
`parallel application codes and system software need
`comparable tools.
`Drawing on the logic analyzer analogy 1 an ideal per(cid:173)
`formance analysis environment should support inter(cid:173)
`active insertion of instrumentation points 1 as well as
`data analysis, reduction, and display. Moreover, the
`environment should be portable across a range of par(cid:173)
`allel architectures, its performance and data analysis
`capabilities should be scalable v:ith the size of the sys(cid:173)
`tem being studied, and it should be eztensible, allow(cid:173)
`ing users to add environment functionality as needed.
`The motivations for performance environment porta(cid:173)
`bility and extensibility are Similar to those for all large
`software systems -
`one wants an environment that
`can be used in multiple contexts and that can grow
`with the user's sophistication. Less obvious is the fact
`that as parallel system scale in size and performance,
`the performance environment must scale commensu(cid:173)
`rately.
`Performance environment scalability not only im(cid:173)
`plies that the environment must be capable of cap(cid:173)
`turing and analyzing data from very large numbers of
`processors, it must also be capable of presenting that
`data in ways that are intuitive and instructive. Cur(cid:173)
`rent graphical techniques often represent the states
`of individual processors (e.g., by a colored square for
`each processor), and they do not scale to thousands of
`processors. For a massively parallel system, processor
`behaviors often form a small number of equivalence
`classes; it frequently suffices to see aggregate behavior
`with detail for equivalence class representatives and
`outliers. Hence, ne'N data capture and display idioms
`are needed if performance environments are to scale
`
`•
`
`•
`
`1
`
`'
`
`,.
`
`1
`
`( ' '
`
`The design of the Pablo performance analysis en(cid:173)
`vironment draws on the lessons learned from the de-
`~1gn, uup1en1en~auon anu u~e 01 ~wo prev1ou~ gener-
`ations of performance analysis software [6, 7]. Based
`on this experience, the bulk of the Pablo environment
`design effort has centered on supporting portability,
`scalability, and extensibility. Pablo is best viewed as
`a toolkit for the construction of performance analy(cid:173)
`sis environments. As such, it consists of two primary
`components: (1) portable software instrumentation,
`and (2) portable performance data analysis, with a
`performance data meta-format coupling the instru(cid:173)
`mentation with the data analysis. The instrumenta(cid:173)
`tion, data format, analysis environment, and display
`options are briefly described belowj for greater detail,
`see [9, 5, 8].
`Pablo's portable software instrumentation has been
`designed to support interactive specification of source
`code instrumentation points. The software instrumen(cid:173)
`tation can be used to gather performance data about
`either system or application codes, though our initial
`target is the latter .... A ... s part of the instrumentation,
`we have developed and are continuing to develop three
`pieces of software: a graphical interface for instrumen(cid:173)
`tation specification, modified C and Fortran parsers
`that emit instrumented source code, and a library
`of data capture templates whose members can cap(cid:173)
`ture performance data generated by the instrumented
`source code when it is executed on distributed memory
`parallel systems. Our initial architectural targets are
`the Thinking Machines CM-5 and the Intel Paragon
`XP /S systems.
`The performance analysis component of Pablo con(cid:173)
`sists of a set of data transformation modules that can
`be graphically interconnected, in the style of AVS [10],
`to form an acyclic, directed data analysis graph. Per(cid:173)
`formance data flows through the graph nodes and is
`transformed to yield the desired performance metrics.
`Each performance data transformation module con(cid:173)
`sists of a data transformation core (i.e., the actual
`data transform) and a system-provided data access
`"wrapper" that can read and write a self-documenting
`data stream meta-format that includes internal defi(cid:173)
`nitions of data types, sizes, and names, but does not
`include embedded semantics. The -..•:rapper's data ac(cid:173)
`cess methods permit access to data field information
`
`PALO ALTO NETWORKS Exhibit 1028 Page 2
`
`

`
`without a priori knowledge of the data stream struc(cid:173)
`ture, and associated soitware supports data replication
`and distribution to other modules.
`Although performance analysis occasionally re(cid:173)
`quires knowledge of architecture-specific data seman(cid:173)
`tics, the Pablo design philosophy presumes that em(cid:173)
`bedding this information in either the trace data for(cid:173)
`mat or the analysis software modules will preclude
`module reuse on differing architectures or applica(cid:173)
`tion software environments.
`For this reason,
`the
`performance data format has no embedded seman(cid:173)
`tics (i.e., there are no predefined event types or data
`sizes). Similarly, most data transformation modules
`provide architecture-neutral data reductions (e.g., av(cid:173)
`erages or histograms) on user-specified data stream
`components. Thus, one can write a data transforma(cid:173)
`tion module that maintains a running average without
`knowing the range of data values, their semantic in(cid:173)
`terpretation1 or their storage scheme.
`The performance data analysis software is based
`on an object-oriented design, written in c++) and by
`virtue of the performance data meta-format and the
`architecture-neutral data reductions, designed to be
`easily ported to new machine architectures. Also, be-
`cause the data analysis modules communicate solely
`via message passing, individual modules can poten(cid:173)
`tially execute m parallel, allowmg the construction of
`scalable data analysis graphs.
`In addition to dynamic X window graphics ior the
`display of performance data, Pablo supports the use
`of sonic data presentation via the replay of sampled
`sounds on a Sun SparcStation audio port or the use of
`a sound synthesizer via the Musical instrument Digital
`Interface (MIDI).
`
`4 Scalable Software Instrumentation
`
`Historically, performance measurement and instru(cid:173)
`mentation techniques have included profiling, event
`counting, and event tracing. Event tracing subsumes
`both profiling and countingj each trace event includes
`information on where, what, and whenj from this
`trace, one can construct both execution profiles and
`descriptions of tern poral behavior. Data associated
`with each trace event typically include the identity of
`the event (i.e., what happened), the time the action
`occurred (i.e., when it happened), and information
`about where the event occurred (e.g., the processor
`and task). The conceptual simplicity of event tracing
`belies its difficulty. A data capture library implemen-
`tation must balance event data rates against undue
`perturbation of the instrumented codej the only thing
`
`more disheartening than inadequate data is a large
`volume 01 aata whose accuracy is suspect. Below,
`we describe the design of the Pablo instrumentation
`library and our approach to limiting the volume oi
`captured data and its perturbation.
`
`4.1 Software Components
`
`As Figure 1 suggests1 the three components of the
`Pablo instrumentation software are a graphical inter(cid:173)
`face for interactively specifying source code instrumen(cid:173)
`tation points, parsers that emit source code with em(cid:173)
`bedded calls to a trace capture library, and a data
`capture library that records the performance data.
`This approach is necessarily a com promise to maxi-
`mize portability. Our intent is that the three indepen(cid:173)
`dent, though cooperating, instrumentation software
`components can be quickly ported to a new parallel
`system or the individual components can be replaced
`by machine-specific versions.
`The parsers accept source code to be instrumented
`and produce the parse tree information needed by the
`graphical instrumentation interface. The graphical in(cid:173)
`terface, based on X and Motif, interprets the parse
`tree data information and allows the user to graphi(cid:173)
`cally specify source code instrumentation points.
`Although the data capture library supports the
`instrumentation and capture of data from the man(cid:173)
`ual instrumentation of arbitrary source code points,
`at present, the graphical instrumentation interface
`supports only the interactive instrumentation of en(cid:173)
`tryjexlt points for procedure calls and outer loops.
`This limitation is intentionalj it is the most detailed
`level of instrumentation possible without precluding
`major optimizations by the compiler. Moreover, it
`strikes the right balance of detail and flexibility -
`additional instrumentation excessively perturbs the
`computation and generates masSive amounts of per-
`formance data.
`
`,....
`;1
`,.
`'
`'
`, .
`•
`'
`•('
`('.
`\JlVeU ~Ue ~peClHCaHOU 01 l!l~HUllleU~aHOU p011H~ 1
`the parsers emit modified source code with embedded
`instrumentation. At execution time, the inserted in(cid:173)
`strumentation code invokes routines supplied by the
`data capture library, producing performance data in a
`standard event data format [9, 1].
`
`4.2
`
`Instrumentation Options
`
`To balance the countervailing needs of detailed data
`and minimal perturbation, the Pablo instrumenta-
`tion sofh•1are supports three class of instrumentation
`events: tracing, counting, and time intervals. Each of
`
`PALO ALTO NETWORKS Exhibit 1028 Page 3
`
`

`
`Call Tree
`
`De> criplion f lm~~ ..
`
`__ s_= __ c_o_de_~~l __ p_.,_.,. _ _.l Sowce Coie
`
`S tanda!d Fonrul
`Exec:ution Trace
`
`-
`-
`
`Parallel
`
`h»irumented
`
`I Exec:u tW le Frog"""
`'--Mac-hine_· ___J1-
`
`Legend
`
`lil:l&l:l:
`
`Ploj e ct D e'""lop ed
`
`-
`
`VendorSupplied
`
`Compiler I
`I
`t Object C oie
`I lr»irumented
`I Data Caplme I
`I
`____, ....... ii-----il Lili my
`r=~/=q:j" r=\~=41
`1-:=11~
`I Compremon ~
`I C bteru,;
`~
`
`~
`
`Lj,]<.,-
`c__ _
`
`lbt:" lbttt r.~r.:l =:+~ H~ al a dH+.ttt:=l IX)~t:=l ~r.=
`::r.p·t::-:1 tl~m ~)r dti al~ m::d l>f:"tll~thal br.:.
`1.1t~~::-:t:" t'~t:: I:;., t t1>t~r.: I I bt" ~)::':::':D t ttt:==:-:t: ~) f a ::..1~ r5::-:
`t-1:-t:"t:=! ~::r..g., ~~ ::x:n !L:d~n :>t~:r:-:t:·d!~tt: !;=!-"~~ ::-:::~Ut::~ ~~~ !'th.
`1 j mt: l::t 1 H;;, pt~):;-:t-;;,;;,;:)t), ar.:d a p·tt !"ht mm::::-:t:- da 1 a r51!-:-
`t:-1":: It_;· =:;, j::-tGd D~j fm t:::~}:: = :-:::;.1 m::::-:t:. lr.: ::-:~m I t~=:;,l: ::-:~~ ;m I
`t'l:-t:'t:: I:;.::-::;) t:: I al t:: t::~) 1::::..tt ~hI a; ~)d_,- I bt: ::-:~)1:: t:: I ~ll bt: t:: 1:: m-
`•. r- • ••.• ~.
`~-- • •• r- ......... ........ ~- ~ .............. ,;
`..... ; .~ .•• ~ ....
`l.>r.r :.J 1 :.J:.~:.~1.: r ru.::,~;;,::o, HI• ~ r.::.~ ru1~ a.: 1 ;;.:.,: : ~Fr.::..: 1 r.:<. 1 ~ nH;- :.J 1 ~F.~r.:
`t'l:-t:'t:: I o:~::-: 1:: t tt't:::3!":' 0 II :!":'t:: t. r.:o I ::J~ t::~ rs~~ t::l . 1.'1-::t- l)-~~hb
`::m 1 ~F V:F p1 1:: re ;~ 1::-m r .r ~Rm'i'l'::.. ~:::;.,:;.r:;., 1 :;} ::..p·e::-:~1-f 'i'l' r.:er.: :;-::;, 1:: r.:1
`:
`.
`.
`:,-.,
`a
`.
`t-1:-t:"t:: I tt-::"!~Hd~ ~h) dd I+K:t:d ~ t::
`lht: l>t"t h t t"RH t::::-:t: dH I H
`'i+"l-::t:r.: 1 bt: ::-:~)1~ r.: 1 ~:·ai l~t: b~:;;, j r.:::-: tt-::~~ l::t a I~H-
`f:;it- ~.t:.~. =
`1~::-:D~m ~~Dat::l~l_r). {r.: I be !im~l~t::~ ::":~~:;;,:!":": bt j:=-tD::·~·:dDte
`ar.:d h~)l> ttl r_r/:t":"::;~l ~r.:~l t l~mttlalhr.:: I·H~ pt~)dl~::-:~ ar.:
`t-:=; t::-: I~ I hr.: ~~~ t"R rRm J ~)f j>t~)::":td I~ tt: at::d ~~X) I>~ t:: """~X":HI b t::~.
`
`'i-
`
`'
`
`lbr.
`
`•
`
`l~mt" v.:ny1r.:~ ::-:~}t::lt~hl~lhr.: I~) l~)h~ :t":"::;t:::-:l~lbr.: l~mt".
`l':ll "'"""" <li "~ I bttt d'"""'' I h p, hh :i" I" ""1>-
`1 D t:r. Hht~~ t .r ~ 1~ ppm 1 ~ Ol>l tor.: a I= D:Er.t"t-1;¥ rj 11 tt :r.::; 1 t-t::~Dt::
`f1~r.:::-:1*:}r.:~ I hal ::-:m~ pt~}::-:~~t""""ttl dalahdmt"~l *:.. 'i+"t~l­
`ltt I~) ll:::r. l>t:thtrRar.:::-::r.~hla fk. Vh 1bh mt:::-:h~r.:hrR:
`~~~tt~ ::-:Hr.: ::-:tt-::~l:r. H~b:r.t-h:~ t"'=-"t"t::l~ {:r..g. = l::t ::-:~)ml::.ar.:­
`~ r.:~ m D~ I~ 1>k t"'=-"t"t::l ~): ::..t:it:::-:1 ~~:--d_r dh,::-:md t"l:-"t"t:: I~: m ~:d­
`tt"::-:::)td fdd:.. ~~~r.:g li t::~)l;+"~~~gt" ~l lbt: a!>!>H(cid:173)
`~f_r :!":"1:--t"t::l
`::-:a I hr.: :;.::)fl~iti~ tt" ... ~ htbH~:o-~~) t: ~)t :Er.t"t"RHt:: I ~::-:aU_r :;-:~)ffi pt;-:;.:;.
`! l:t" l>:!":tfmmat:::":t":" da!a !D tt"dn:":'":"
`!l::t":" ~~D~:t":" Df da!::t
`t t::-:m d~L l~ ~ a r.: t"::; ~~m p~t" Df I bt" h II t"t: ~} t::t" m~~ bl ::-: tt"-
`~• .~"'I~,...+- I )...~•1
`~.:,... ,...~ ........ ~ .... ~,Ei~,...
`,..._.-,....,.,_..:I' t+-
`~·~ +-
`.-...~,,..,..,,.._,,I+-:.
`u.,. >r oHor•••n•
`• onro HV onr'"""'~'V '"'c'"lr•r•'~ lc,,n.,~r•r•'•
`pt~}nt-:;, ft~}m pt~}::-::t"=:~l~tt" t:r.:l t _r/:t":"::;~l +~Ia. 1.'l::h mt:::-:b(cid:173)
`~~ r.:hm: ! }.~) !::~}. ~~~i>! ~ ~t ::J !'tll~t": t"!'tl V)';=!-"t"t~ !::~:t:H ! ~)
`t;::; I t"t::d lh:r. })~~hh ~ t::~l t I~ t"Rtt I~~ I tor.: ~ t:: ~~ ':··mjt-1 t 0 f ';!v~~_r~.
`
`'i+"~ll::
`P~r.:a•.r: l~mt-~r.:lttv.:~ e"""t"t::l~~~:;;,:;.,::)~~~~t-~m t"'=-"t-t::l
`a ~~~t ~)f :;..::)l~t::-:t: ::":~)dt: !X~t::l~ ~:~.t:.: H b:~k~~ ::":~)dt: ftH:it"
`mt-r.: 1). t:a::-:1-:: ~)::-:::-:1~ t ttt=:-:t- l>t~x~ 1~::-:t-:;, ar.: t-1:-"tt 1 ::":::) r.:l ~~ d r.:~
`!b:t":" ! ~~:'":" !ba! dap-::.,:•·d dnt~t::~ :t":"::;~n!br.: of !bt- :;.::)Dt::"':'""
`::-:~)d:r. ft a::-d: mtt I . lk-:m~:;,:r. I b:r. ~~:;;,:Er,~x-:hl hr.: h 'i+"~l b h:~k~~:
`t~~!"t:r.t !bn:: :::-1-::_y-J::-:~~: ::-:~)~~ ftH~:}ffit:r.:!~: ~)t::t: =:-::Hr. ~~d~m~!
`b0::-:aU_r ~~t::~ 8::-:ar.: I ::-:::)d:t":"
`::...!:::-:I b r.:~ ar.:d mt-::~:;;.D t:t":" I bti t
`
`tt"(cid:173)
`l>t:bad:) t
`i>tt I I~ t l~~l tot:: ~)r i>t~)wa m
`E ::;::-:~ 1:--t"
`ma~ r.:::..
`! 1-::t" h~r.::t":" Df t"~t::! ! ta:ii r.~ ar.:d h!:'.. ~~ m~! :!":":j
`~!::..
`})~)I t"t:: I~~~ i>tt II~ t h~l b r.::;, W t::gt: ft~) m
`'i+"Ht-:;.pt-::1=:~ I~:Er.t".
`! "tt: l>t:~g r.: { :r..g.: ::J m :~:r. ~b~~:'::) 1;=!-"t::) ! ~) m~~~gr.:H r.! {:r..g.:
`::-:1-::ar.:~~r.:~ ll::t" ::-:::)dt" ... ~ :"t~l ~a~ pall: m t~)tdt"t~r.:~ t"~r.:l::..).
`
`PALO ALTO NETWORKS Exhibit 1028 Page 4
`
`

`
`Moreover, in the hands of a novice, event tracing can
`produce prodigious volumes oi inaccurate data. Users
`instrument procedures without a priori knowledge of
`procedure invocation frequency, which can lead unin(cid:173)
`tentionally to large data volumesj see §5.
`The Pablo trace library has been designed to reduce
`the likelihood of malignant perturbations by monitor(cid:173)
`ing and dynamically altering the volume, frequency,
`and types of event data recorded. If events with rel(cid:173)
`atively large amounts of associated trace data occur
`at rates higher than the user desires 1 the trace library
`will automatically substitute less invasive data record(cid:173)
`ing (e.g., periodic event counts rather than complete
`event traces). If the event rate returns to more mod(cid:173)
`est levels, the trace library v:ill resume recording of all
`events.
`Pablo;s implementation of adaptive instrumenta(cid:173)
`tion control associates a user-specified maximum trace
`level with each event. Each of the three event classes
`(tracing, counting, and time intervals) has an asso(cid:173)
`ciated, internal event threshold. If the user-specified
`trace level for an event of a particular class does not
`fall below the threshold for the class, the event data
`are recorded in the performance data file. Other(cid:173)
`wise, the class of the event is reduced (i.e., trace and
`time interval events are converted to count events). If
`the event 1s user-specified trace level is too low even
`for count events 1 the Pablo data capture library will
`silently record the occurrence of the event, but will
`not produ('.e a. performa.n('.e data. re('.ord. By ('.hoosing
`appropriate trace levels, it is possible to completely
`exclude certain events from the performance data file.
`In addition to instrumentation control via user(cid:173)
`specified trace levels, the Pablo instrumentation soft(cid:173)
`ware will dynamically adjust event trace levels within
`the user-specified range (i.e., the instrumentation soft(cid:173)
`ware may reduce the trace level below the user(cid:173)
`specified point, but it will not increase it beyond that
`point). Event trace levels vary dynamically based on
`each event's rate of occurrence. The adaptive instru(cid:173)
`mentation control software associates user-specified
`low water and high water marks with each event. If
`successive occurrences of a particular event occur in a
`time interval less than the low water mark, the instru(cid:173)
`mentation software responds to this high event rate by
`reducing this event's current trace level. Should this
`high rate persist, the trace level will eventually drop
`below one or more of the event trace level thresholds,
`reducing the volume of recorded event data. Con(cid:173)
`versely, if successive occurrences of a particular event
`occur in a time interval greater than the high 'Na(cid:173)
`ter mark, the instrumentation software increases the
`
`j
`
`I
`
`~
`I
`
`v
`
`V"l: .1.
`
`.l:J;:)
`
`~ 25M
`• v >.
`; -
`20M j
`I 128 PEs
`"" • "' v
`5
`j 15M 1
`I
`/
`1
`~
`I
`• 10M j
`j
`• Cl mj
`~nv~
`I /
`j
`.~ j um I I ~and 128 PEs (Clusters) I
`
`0
`
`0
`
`1
`
`2
`
`3
`
`4
`
`5
`
`Time (seconds)
`
`Figure 2: Performance Data Rates
`
`event's current trace level, perhaps increasing the vol(cid:173)
`ume of recorded data.
`The Pablo data capture library also monitors the
`aggregate event rate using a similar algorithm. To(cid:173)
`gether, the event thresholds and the global threshold
`allow the Pablo instrumentation software to balance
`event data volume against application perturbation 1
`maximizing the amount of useful trace data.
`
`5 Dynamic Statistical Clustering
`
`Altnough semantic aata compression and throttles
`on performance data rates can both reduce the vol(cid:173)
`ume of captured performance data, for systems with
`thousands of processors, they are not sufficient. As
`an illustration, .Figure 2 shows the volume of perfor(cid:173)
`mance data captured during the execution of a ray
`tracing code on 64 and 128 processor Intel iPSC/860
`hypercubesj the code was instrumented to record both
`the pattern of procedure entries and exists and all
`message transmissions and receipts. Even for a triv(cid:173)
`ial 128 X 128 pixel image, the unthrottled event data
`rate approached 10 megabytes/second, and the total
`event volume was nearly 22 metabytes. For a thou(cid:173)
`sand processor system and a more realistic image size,
`the data volume would be several gigabytes, even for
`a code that executes for just a few minutes. Clearly, a
`different approach is required -this volume of perfor(cid:173)
`mance data is prohibitive. Without dedicated perfor(cid:173)
`mance monitoring hardware, extracting the data from
`a parallel system 'Nould so perturb program execution
`that the data would be of little value.
`
`PALO ALTO NETWORKS Exhibit 1028 Page 5
`
`

`
`single-program-multiple data
`the
`Fortunately,
`(SPNiD) and data parallel programming models are
`the most commonly used on massively parallel sys(cid:173)
`tems. In both cases, the same code executes on ail pro(cid:173)
`cessors, with local, data-dependent branches. Hence,
`despite the large numbers of processors, there are only
`a few equivalence classes of processor behavior.
`The n dynamic performance metrics for each pro(cid:173)
`cessor define a group of evolving, n-dimensional
`points. Most points are closely clustered because they
`If
`correspond to processors with similar behavior.
`one could identify representatives of these equivalence
`classes and only record data from those representa(cid:173)
`tives, one would obtain detailed performance metrics
`without ('.a.pturing gigabytes of data..
`Statistical clustering [4] is one of the standard tech(cid:173)
`mques for identifying data equivalence classes.
`:r-vfost
`clustering techniques are either hierarchical or parti(cid:173)
`uonal. Hierarchical techniques initially assign each
`point to a singleton cluster and then combine clusters
`to produce a sequence oi hierarchical clusters. In con(cid:173)
`trast, partitional techniques divide the group of points
`into some number of independent clustersj no hierar(cid:173)
`chical relationship is presumed.
`Because both program and processor behaviors
`change during execution, the number of clusters, their
`sizes and their shapes will vary. Hence, any effective
`clustering technique must adapt to these changes by
`identifying clusters and cluster representatives dynam(cid:173)
`ically (i.e., as the parallel application executes).
`A host of thorny research issues must be resolved
`before the efficacy of dynamic statistical clustering is
`establishedj these include identifying good clustering
`algorithms, choosmg clustering frequencies, selecting
`cluster representatives and representative weights, as(cid:173)
`sessing the computation costs oi clustering, and quan(cid:173)
`tifying the relation between data volume and perfor(cid:173)
`mance prediction accuracy. Simply put, any dynamic
`clustering algorithm must track changes in program
`behavior by adaptively identifying new behavioral
`equivalence classes, selecting representatives from each
`class, saving data from those representatives, and not(cid:173)
`ing representative weights, while not unduly perturb(cid:173)
`ing program execution.
`As an early assessment of the feasibility of clus(cid:173)
`tering to reduce data volumes, we have developed a
`prototype implementation that uses dynamic perfor(cid:173)
`mance metrics to cluster processor behavior. Figure
`2 shows that the volume of captured data with clus(cid:173)
`tering is nearly an order of magnitude smaller than if
`data from all processors is captured. In addition, the
`performance metrics computed from the reduced data
`
`are encouragingly similar to those obtained using data
`from all the processors. Because the number oi behav(cid:173)
`ioral equivalence classes generally is small compared to
`the number oi processors, we believe the data reduc(cid:173)
`tion should be even larger for systems with hundreds
`or thousands of processors.
`
`6 Performance Data Immersion
`
`On a massively parallel system, the volume of cap(cid:173)
`tured performance data is potentially very large. Al(cid:173)
`though statistical clustering techniques can reduce the
`volume of data that must be captured, and the Pablo
`performance analysis software supports both graphi(cid:173)
`cal and sonic data presentation, the human-computer
`interface remains bandwidth-limited.
`Fortunately, new technologies now make it possible
`both to soni('.ally and visually immer.<:~e a. user in data..
`From a sensory perspective, the computer need not be
`simply the medium of interaction beh•1een user and
`dataj instead, users can interact directly with the data
`(e.g., moving their heud to chunge their visual field of
`view or using their hand to rotate or transform the
`data). Data immersion provides a reality perceived
`by the senses, albeit a simulated or "virtual reality."
`The user is no longer an external observer of data, but
`rather an active participant with the data.
`Using this technology, one can immerse users and
`performance analysts in a virtual world that is based
`on dynamic performance data from massively paral(cid:173)
`lel systems. Through a head-mounted display, three(cid:173)
`dimensional sound cues, and direct data manipulation,
`the user can explore and viscerally experience the dy(cid:173)
`namic behavior of application and system software on
`a massively parallel system. Below, we describe our
`current approach to performance data immersion.
`
`6.1 Data Implications
`
`tech(cid:173)
`Although performance data presentation
`niques share many attributes with more traditional
`scientific visualization,
`the nature of performance
`data does necessitate different data presentation
`metaphors. Because massively parallel systems con(cid:173)
`tain hundreds or thousands of processors, each po(cid:173)
`tentially with 5-10 associated, dynamic performance
`metrics drawn from multiple system levels, perfor(cid:173)
`mance data occupy a very sparsely populated, high(cid:173)
`dimensional space. Moreover, some met.rics are dis(cid:173)
`crete, others are continuous, and their dynamic ranges
`can differ by multiple orders of magnitude. In con(cid:173)
`sequence, performance data presentation techniques
`
`PALO ALTO NETWORKS Exhibit 1028 Page 6
`
`

`
`1
`
`,
`
`1
`
`•
`
`share many of the problems common to statistical
`r.,.,
`aata anruys1s l-'J, ana most tecnmques Ior v1suanzmg
`grid-based data (.e.g., volume visualization) are not
`directly applicable.
`
`1
`
`,
`
`,
`
`1
`
`•
`
`"
`
`•
`
`1•
`
`•
`
`6.2 Presentation Metaphors
`
`' .
`
`"
`
`1
`
`1
`
`'
`
`1
`
`, .
`
`•
`
`1
`
`The projection of a circle from three to two dimen(cid:173)
`sions is an ellipse; the eccentricity is determined by the
`viewing perspective and the orientation of the circle.
`Inferring that the perceived ellipse is, in fact, a circle,
`is only possible if one can view multiple projections.
`Similarly, understanding the "shape'1 of multiple per(cid:173)
`formance data metrics in a high-dimensional space is
`only possible if one can examine multiple projections.
`As we saw in §5, if each point in the space represents
`a. pro('.essor, all pro('.essors have similar performa.n('.e if
`the points are all clustered in the space whose dimen(cid:173)
`sions are defined by the performance met:cics; outliers
`are likely candidates for further study.
`Using dynumic performance dutu captured -..vith
`the Pablo instrumentation library, we have imple(cid:173)
`mented a performance data presentation metaphor
`that shows all the possible three-dimensional projec-
`Hon~ 01 a ~par~e1y popu1a~eu 1 n-uuueuswuat ~pace.
`This metaphor is a generalization of a two-dimensional
`scatterplot matrix [2].
`Two-dimensional scatterplot matrices contain an
`array of x-y scatterplots that collectively represent all
`n 2 -n combinations of two of then variables; then de(cid:173)
`generate cases where both variables are identical lie on
`the diagonal of the scatterplot matrix. Although the
`two-dimensional projections allow one to see the rela(cid:173)
`tionship between variable pairs, it is difficult to gen(cid:173)
`eralize the relationship to three or more variables. To
`exploit our kinematic and spatialization senses, higher
`dimensional projections are needed.
`Our three-dimensional extension of a scatterplot
`matrix contains n 3 cubes, each of which is a three(cid:173)
`dimensional scatterplot (i.e. 1 each of the three cube
`axes represents one of the n performance metrics). If
`one imagines a.n nxnxn ('.ube ofs('.a.tterplot ('.ubes, and
`labels each of the cubes along each axis with a different
`one of then performance met:cics, then each individual
`cube represents a projection of the performance data
`from n to three dimensions. Just us u t-..vo dimensional
`scatterplot matrix has n degenerate combinations, a
`three-dimensional scatterplot array has
`
`it contains all the one(cid:173)
`ray is three-fold degenerate -
`dimensional projections and all data values n lie on
`the diagonal of the respective cubes. Similarly, there
`are three cube planes that each contain n 2 - n two(cid:173)
`fold degenerate cubes -
`each of these cubes are two(cid:173)
`dimensional projections and all data values in each of
`these cubes lie in a diagonal plane.
`Although many of the cubes are redundant (i.e.,
`they are simply transpositions of other cubes), and
`some are degenerate, they provide symmetry and mul(cid:173)
`tiple perspectives as one moves about the array of
`cubes. Inside an individual scatterplot cube, each dis(cid:173)
`played point corresponds to a processor, and its spa(cid:173)
`tial position is determined by the value of the perfor(cid:173)
`manc

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket