(12) United States Patent
Ding et al.

(10) Patent No.: US 6,691,067 B1
(45) Date of Patent: Feb. 10, 2004
`
`(54) ENTERPRISE MANAGEMENT SYSTEM AND
METHOD WHICH INCLUDES STATISTICAL
`RECREATION OF SYSTEM RESOURCE
`USAGE FOR MORE ACCURATE
`MONITORING, PREDICTION, AND
`PERFORMANCE WORKLOAD
`CHARACTERIZATION
`
`(75)
`
`Inventors: Yiping Ding, Dover, MA (US);
`(3) Newman, Cambridge, MA
`
6,289,379 B1 *  9/2001  Urano et al. .......... 709/223
6,513,065 B1 *  1/2003  Hafez et al. .......... 709/224
6,560,647 B1 *  5/2003  Hafez et al. .......... 709/224
6,564,174 B1 *  5/2003  Ding et al. ........... 702/186

* cited by examiner
Primary Examiner: Patrick Assouad
(74) Attorney, Agent, or Firm: Wong, Cabello, Lutsch, Rutherford & Brucculeri, LLP
(73) Assignee: BMC Software, Inc., Houston, TX (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 0 days.

(57) ABSTRACT

A system and method for estimating statistics concerning
system metrics to provide for the accurate and efficient
monitoring of one or more computer systems. The system
preferably comprises a distributed computing environment,
i.e., an enterprise, which comprises a plurality of interconnected
computer systems. At least one of the computer
systems is an agent computer system which includes agent
software and/or system software for the collection of data
relating to one or more metrics, i.e., measurements of system
resources. Metric data is continually collected over the
course of a measurement interval, regularly placed into a
registry of metrics, and then periodically sampled from the
registry indirectly. Sampling-related uncertainty and inaccuracy
arise from two primary sources: the unsampled
residual segments of seen (i.e., sampled and therefore
known) events, and unseen (i.e., unsampled and therefore
unknown) events. The total unsampled utilization and the
total unseen utilization are accurately estimated according to
the properties of one or more process service time distributions.
The total unseen utilization is also estimated with an
iterative method using gradations of the sample interval. The
length distribution of the unseen processes is determined
with the same iterative method.

(21) Appl. No.: 09/287,601
(22) Filed: Apr. 7, 1999
(51) Int. Cl.7: G06F 19/00
(52) U.S. Cl.: 702/186; 709/224; 709/226
(58) Field of Search: 702/186; [illegible]; 709/224, 226
`
(56) References Cited

U.S. PATENT DOCUMENTS

5,655,081 A *   8/1997  Bonnell et al. ........ 709/202
5,696,701 A *  12/1997  Burgess et al. ........ 714/25
5,761,091 A *   6/1998  Agrawal et al. ........ 702/186
5,796,633 A *   8/1998  Burgess et al. ........ 702/187
5,920,719 A *   7/1999  Sutton et al. ......... 717/130
6,269,401 B1 *  7/2001  Fletcher et al. ....... 709/224
`
`75 Claims, 18 Drawing Sheets
`
Google Exhibit 1044
Google v. Valtrus
`
`
`
`
`
`
[Representative drawing: flowchart of the collection and sampling of metric data, steps 700 through 708; see FIG. 8]
`
`
`
U.S. Patent    Feb. 10, 2004    Sheet 1 of 18    US 6,691,067 B1

FIG. 1 [network diagram; legible labels: Enterprise Computing Environment, Wide Area Network 102, Local Area Network 104]
`
`
`
`
`
`
`
Sheet 2 of 18

FIG. 2 [illustration of a typical computer system 150 with computer programs 160]
`
`
`
Sheet 3 of 18

FIG. 3 [block diagram: overview of the Enterprise Management System 180, showing a Console Node 400 with Monitor 402, Collect GUI 404, Analyze 406, Predict 408, and Visualizer 410, and an Agent Node 300 with Agent 302 and Data Collectors 304]
`
`
`
`
`
`
`
Sheet 4 of 18

FIG. 4 [block diagram: Monitor Overview, showing Monitor Consoles, a Manager Daemon, a Registration Queue, and alert, update, graph, and drilldown requests; reference numerals illegible in source]
`
`
`
`
`
`
`
`
`
Sheet 5 of 18

FIG. 5 [block diagram: overview of the Agent component; most labels illegible in source]
`
`
`
`
`
`
`
`
`
Sheet 6 of 18

FIG. 6 [block diagram: Analyze Overview; most labels illegible in source]
`
`
`
`
`
`
`
Sheet 7 of 18

FIG. 7 [block diagram: Predict Overview; most labels illegible in source]
`
`
`
`
`
`
Sheet 8 of 18

FIG. 8 [flowchart]
700: Collect raw performance data at a high frequency
702: Store and/or update raw data in registry of metrics
704: Has sample interval Δ expired?
706: Sample registry of metrics
708: Has measurement interval L expired?
`
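The flow of FIG. 8 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `run_measurement`, `read_raw`, and the dictionary used as a stand-in for the registry of metrics are hypothetical names that do not appear in the patent, and the loop assumes the collection period is no longer than the sample interval.

```python
def run_measurement(read_raw, collect_period, sample_interval, measurement_interval):
    """Collect at a high frequency, sample at a lower one (illustrative only).

    read_raw is a hypothetical callable mapping a time t to {metric_name: value}.
    All times are expressed in the same unit, e.g. seconds.
    """
    registry = {}                 # stand-in for the registry of metrics
    samples = []                  # (time, snapshot) pairs taken from the registry
    next_sample = sample_interval
    t = 0
    while t < measurement_interval:              # step 708: has L expired?
        registry.update(read_raw(t))             # steps 700 and 702
        t += collect_period
        if t >= next_sample:                     # step 704: has the sample interval expired?
            samples.append((t, dict(registry)))  # step 706: sample the registry
            next_sample += sample_interval
    return samples


samples = run_measurement(lambda t: {"cpu_seconds": t}, 1, 5, 20)
print([time for time, _ in samples])  # prints [5, 10, 15, 20]
```

Twenty raw collections yield only four samples here, which is the gap between collected and sampled data that the estimation methods are meant to address.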
`
`
Sheet 9 of 18

FIG. 9 and FIG. 10 [timeline diagrams with Time axes; drawing content not recoverable from source]
`
`
`
Sheet 10 of 18

FIG. 11 [flowchart; step order and some reference numerals illegible in source]
Determine the measurement interval L
Determine [illegible]
Determine a total uncaptured utilization
Determine a total unseen utilization

FIG. 12 [flowchart]
740: Determine one or more (quantity d) process service time distributions
742: Determine a quantity $n_i$ of seen processes which follow each distribution $i$
744: Determine a mean residual time $\bar{r}_i$ for each distribution $i$
`
`
`
Sheet 11 of 18

FIG. 13 [flowchart]
750: Determine the process service time distribution
752: Determine the quantity of seen processes which follow this distribution
754: Determine
$$G_t(r) = P(R \le r \mid X > t) = \frac{P(t < X \le t + r)}{P(X > t)}$$
Determine
$$\bar{r} = \int_0^{\Delta} r \, dG_t(r)$$
End
`
`
`
Sheet 12 of 18

FIG. 14 [flowchart]
760: Determine the process service time distribution to be an exponential distribution with service rate λ
762: Determine the quantity of seen processes which follow this exponential distribution
764: Determine the mean residual time [equation illegible in source]

FIG. 15 [flowchart]
780: Determine the process service time distribution to be a uniform distribution between zero and C
782: Determine the quantity of seen processes which follow this uniform distribution
784: Determine
$$\bar{r} = \frac{\min(C - t, \Delta)}{2}$$
`
`
`
Sheet 13 of 18

FIG. 16 [flowchart]
800: Determine the process service time distribution to be an unknown distribution
802: Determine the quantity of seen processes which follow this unknown distribution
Determine the mean residual time [equation illegible in source]

FIG. 17 [flowchart]
820: Determine the process service time distribution to be an unknown distribution
822: Determine the quantity of seen processes which follow this unknown distribution
Determine the mean residual time [equation illegible in source]
`
`
`
Sheet 14 of 18

FIG. 18 [flowchart; symbol subscripts illegible in source, descriptive subscripts used]
840: Determine a total captured utilization $U_{\text{captured}}$ as the sum of all sampled lengths of all seen processes over the measurement interval L
842: Determine a total measured utilization $U_{\text{measured}}$
844: Determine
$$U_{\text{unseen}} = U_{\text{measured}} - U_{\text{captured}} - U_{\text{uncaptured}}$$
`
`
`
Sheet 15 of 18

FIG. 19 [matrix of buckets; cell labels not recoverable from source]
`
`
`
Sheet 16 of 18

[FIG. 20: drawing content not recoverable from source]
`
`
`
`
`
`
`
`
`
Sheet 17 of 18

FIG. 21 [flowchart]
860: Create m x n buckets
862: Place each seen process into the appropriate bucket
864: Start with the bucket with the longest seen process(es)
866: Count the processes in the current bucket
868: Subtract the fraction of longer processes that landed in this bucket
870: Multiply by m, the number of buckets per sample interval Δ
872: Estimate the number of unseen processes
874: Is this the lowest-ranked bucket?
876: Descend to the bucket at the next lower rank
`
`
`
Sheet 18 of 18

FIG. 22 and FIG. 23 [equations used to generate a length distribution of the unseen processes; not recoverable from source]
`
`
`
ENTERPRISE MANAGEMENT SYSTEM AND
METHOD WHICH INCLUDES STATISTICAL
RECREATION OF SYSTEM RESOURCE
USAGE FOR MORE ACCURATE
MONITORING, PREDICTION, AND
PERFORMANCE WORKLOAD
CHARACTERIZATION

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to the collection, analysis,
and management of system resource data in distributed or
enterprise computer systems, and particularly to the more
accurate monitoring of the state of a computer system and
more accurate prediction of system performance.

2. Description of the Related Art

The data processing resources of business organizations
are increasingly taking the form of a distributed computing
environment in which data and processing are dispersed
over a network comprising many interconnected,
heterogeneous, geographically remote computers. Such a
computing environment is commonly referred to as an
enterprise computing environment, or simply an enterprise.
Managers of the enterprise often employ software packages
known as enterprise management systems to monitor,
analyze, and manage the resources of the enterprise.
Enterprise management systems may provide for the collection of
measurements, or metrics, concerning the resources of
individual systems. For example, an enterprise management
system might include a software agent on an individual
computer system for the monitoring of particular resources
such as CPU usage or disk access. U.S. Pat. No. 5,655,081
discloses one example of an enterprise management system.

In a sophisticated enterprise management system, tools
for the analysis, modeling, planning, and prediction of
system resource utilization are useful for assuring the
satisfactory performance of one or more computer systems in
the enterprise. Examples of such analysis and modeling
tools are the “ANALYZE” and “PREDICT” components of
“BEST/1 FOR DISTRIBUTED SYSTEMS” available from
BMC Software, Inc. Such tools usually require the input of
periodic measurements of the usage of resources such as
central processing units (CPUs), memory, hard disks,
network bandwidth, and the like. To ensure accurate analysis
and modeling, therefore, the collection of accurate
performance data is critical.

Many modern operating systems, including “WINDOWS
NT” and UNIX, are capable of recording and maintaining an
enormous amount of performance data and other data
concerning the state of the hardware and software of a computer
system. Such data collection is a key step for any system
performance analysis and prediction. The operating system
or system software collects raw performance data, usually at
a high frequency, stores the data in a registry of metrics, and
then periodically updates the data. In most cases, metric data
is not used directly, but is instead sampled from the registry.
Sampling at a high frequency, however, can consume
substantial system resources such as CPU cycles, storage space,
and I/O bandwidth. Therefore, it is impractical to sample the
data at a high frequency. On the other hand, infrequent
sampling cannot capture the complete system state: for
example, significant short-lived events and/or processes can
be missed altogether. Infrequent sampling may therefore
distort a model of a system’s performance. The degree to
which the sampled data reliably reflects the raw data
determines the usefulness of the performance model for system
capacity planning. The degree of reliability also determines
the usefulness of the performance statistics presented to
end-users by performance tools.

Sensitivity to sampling frequency varies among data
types. Performance data can be classified into three
categories: cumulative, transient, and constant. Cumulative data is
data that accumulates over time. For example, a system CPU
time counter may collect the total number of seconds that a
processor has spent in system state since system boot. With
transient data, old data is replaced by new data. For example,
the amount of free memory is a transient metric which is
updated periodically to reflect the amount of memory not in
use. However, values such as the mean, variance, and
standard deviation can be computed based on a sampling
history of the transient metric. The third type of performance
data, constant data, does not change over the measurement
interval or lifetime of the event. For example, system
configuration information, process ID, and process start time
are generally constant values.

Of the three data types, transient performance metrics are
the most sensitive to variations in the sample interval and are
therefore the most likely to be characterized by uncertainty.
For example, with infrequent sampling, some state changes
may be missed completely. However, cumulative data may
also be rendered uncertain by infrequent sampling,
especially with regard to the variance of such a metric. Clearly,
then, uncertainty of data caused by infrequent sampling can
cause serious problems in performance modeling. Therefore,
the goal is to use sampling to capture the essence of the
system state with a sufficient degree of certainty.
Nevertheless, frequent sampling is usually not a viable
option because of the heavy resource usage involved.

For the foregoing reasons, there is a need for data
collection and analysis tools and methods that accurately and
efficiently reflect system resource usage at a lower sampling
frequency.

SUMMARY OF THE INVENTION

The present invention is directed to a system and method
that meet the needs for more accurate and efficient
monitoring and prediction of computer system performance. In
the preferred embodiment, the system and method are used
in a distributed computing environment, i.e., an enterprise.
The enterprise comprises a plurality of computer systems, or
nodes, which are interconnected through a network. At least
one of the computer systems is a monitor computer system
from which a user may monitor the nodes of the enterprise.
At least one of the computer systems is an agent computer
system. An agent computer system includes agent software
and/or system software that permits the collection of data
relating to one or more metrics, i.e., measurements of system
resources on the agent computer system. In the preferred
embodiment, metric data is continually collected at a high
frequency over the course of a measurement interval and
placed into a registry of metrics. The metric data is not used
directly but rather is routinely sampled at a constant sample
interval from the registry of metrics. Because sampling uses
substantial system resources, sampling is preferably
performed at a lesser frequency than the frequency of
collection.

Sampled metric data can be used to build performance
models for analysis and capacity planning. However, less
frequent sampling can result in inaccurate models and data
uncertainty, especially regarding the duration of events or
processes and the number of events or processes. The
present invention is directed to reducing said uncertainty.
`Uncertainty arises from two primary sources: the unsampled
`segment of a seen process or event, and the unseen process
or event. A seen process is a process that is sampled at least
once; therefore, its existence and starting time are known.
However, the residual time or utilization between the last
sampling of the process or event and the death of the process
or the termination of the event is unsampled and unknown.
An unseen process is shorter than the sample interval and is
not sampled at all, and therefore its entire utilization is
unknown. Nevertheless, the total unsampled (i.e., residual)
`utilization and the total unseen utilization can be estimated
`with the system and method of the present invention.
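The distinction between seen and unseen processes can be illustrated with a short Python sketch. It is a simplification under stated assumptions: the function name `classify` is hypothetical, samples are assumed to fall at integer multiples of the sample interval, and a process is assumed to be observed by a sample taken at time s whenever start <= s < start + duration.

```python
import math


def classify(start, duration, sample_interval):
    """Return (status, sampled_length, unsampled_length) for one process.

    Assumes samples are taken at multiples of sample_interval and that a
    process is observed at sample time s when start <= s < start + duration.
    """
    end = start + duration
    # First sample instant at or after the process start.
    first = math.ceil(start / sample_interval) * sample_interval
    if first >= end:
        # No sample falls inside the process lifetime: the process is unseen.
        return ("unseen", 0, duration)
    # Last sample instant strictly before the process ends.
    last = (math.ceil(end / sample_interval) - 1) * sample_interval
    # Time up to the last sample is captured; the residual after it is unsampled.
    return ("seen", last - start, end - last)


print(classify(8, 15, 10))  # prints ('seen', 12, 3)
print(classify(3, 4, 10))   # prints ('unseen', 0, 4)
```

In the first call the process is sampled at times 10 and 20, so 12 time units are captured and the final 3 are the unsampled residual; in the second call the process starts and ends between two samples and is never observed.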
In determining the total unsampled utilization, a quantity
of process service time distributions is determined, and
each of the seen processes is assigned a respective process
service time distribution. For each distribution, a mean
residual time is calculated using equations provided by the
system and method. The total unsampled utilization is the
sum of the mean residual time multiplied by the number of
seen processes for each distribution, all divided by the
measurement interval.
`
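The calculation described in the preceding paragraph amounts to a weighted sum divided by the measurement interval. The sketch below is illustrative only: the function name and the `groups` input format are assumptions, and the mean residual times are taken as given rather than derived from any particular distribution.

```python
def total_unsampled_utilization(groups, measurement_interval):
    """Sum of (seen process count x mean residual time), divided by L.

    groups is a hypothetical list with one (seen_process_count,
    mean_residual_time) pair per process service time distribution.
    """
    return sum(count * residual for count, residual in groups) / measurement_interval


# Two distributions: 10 seen processes with mean residual 0.5 s,
# and 4 seen processes with mean residual 2.0 s, over a 900 s interval.
print(total_unsampled_utilization([(10, 0.5), (4, 2.0)], 900))
```

Here the residual time totals 13 seconds, so the estimated unsampled utilization is 13/900 of the measurement interval.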
`In determining the total unseen utilization, first the total
`captured utilization is determined to be the sum of the
`sampled utilizations of all seen processes over the measure-
`ment interval. Next the total measured utilization, or the
“actual” utilization over the measurement interval, is
obtained from the system software or monitoring software.
The difference between the total measured utilization and
`the total captured utilization is the uncertainty. Because the
`uncertainty is due to either unsampled segments or unseen
`events, the total unseen utilization is calculated to be the
`uncertainty (the total measured utilization minus the total
`captured utilization) minus the total unsampled utilization.
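As a worked illustration of this subtraction, the following Python sketch computes the captured utilization and then the unseen utilization. The function names are hypothetical, and dividing the summed sampled lengths by the measurement interval is an assumption about how lengths are converted into a utilization.

```python
def captured_utilization(sampled_lengths, measurement_interval):
    """Sum of the sampled lengths of all seen processes, as a fraction of L."""
    return sum(sampled_lengths) / measurement_interval


def total_unseen_utilization(measured, captured, unsampled):
    """Uncertainty (measured minus captured) minus the unsampled utilization."""
    uncertainty = measured - captured
    return uncertainty - unsampled


captured = captured_utilization([12, 30, 48], 900)      # 90 / 900
unseen = total_unseen_utilization(0.15, captured, 0.02)
print(round(unseen, 6))
```

With a measured utilization of 0.15, a captured utilization of 0.10, and an unsampled estimate of 0.02, roughly 0.03 of the interval is attributed to unseen processes.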
When the total measured utilization is not available, the
total unseen utilization is estimated with an iterative bucket
method. A matrix of buckets is created, wherein each row
corresponds to the sample interval and each bucket to a
gradation of the sample interval. Each process is placed into
the appropriate bucket according to how many times it was
sampled and when in the sample interval it began. Starting
with the bucket with the longest process(es) and working
iteratively back through the other buckets, the number of
unseen processes is estimated for each length gradation of
the sample interval. The iterative bucket method is also used
`to determine a length distribution of unseen processes.
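Only the bucket-placement step is sketched below; the iterative estimation itself is not reproduced. The indexing scheme is an assumption made for illustration: the row is taken from the number of times a process was sampled, the column from which of m equal gradations of the sample interval contains the process start, and processes sampled more than n times are folded into the last row. The names `bucket_index` and `fill_buckets` are hypothetical.

```python
def bucket_index(start, times_sampled, sample_interval, m):
    """Map a seen process to (row, column) in a matrix with m buckets per row."""
    offset = start % sample_interval                 # start position within its interval
    column = min(int(offset * m / sample_interval), m - 1)
    return times_sampled - 1, column


def fill_buckets(processes, sample_interval, m, n):
    """Count seen processes per bucket; processes is a list of (start, times_sampled)."""
    counts = [[0] * m for _ in range(n)]
    for start, times_sampled in processes:
        row, column = bucket_index(start, times_sampled, sample_interval, m)
        counts[min(row, n - 1)][column] += 1         # assumption: clamp to the last row
    return counts


print(fill_buckets([(3, 1), (19, 2), (20, 1)], sample_interval=10, m=5, n=2))
# prints [[1, 1, 0, 0, 0], [0, 0, 0, 0, 1]]
```

A matrix filled this way would then be traversed from the bucket holding the longest processes toward the lowest-ranked bucket, as the paragraph above describes.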
`In response to the determination of utilizations described
`above, the system and method are able to use this informa-
`tion in modeling and/or analyzing the enterprise. In various
`embodiments, the modeling and/or analyzing may further
comprise one or more of the following: displaying the
determinations to a user, predicting future performance,
graphing a performance prediction, generating reports, asking
a user for further data, permitting a user to modify a
`model of the enterprise, and altering a configuration of the
`enterprise in response to the determinations.

BRIEF DESCRIPTION OF THE DRAWINGS

A better understanding of the present invention can be
obtained when the following detailed description of the
preferred embodiment is considered in conjunction with the
following drawings, in which:
FIG. 1 is a network diagram of an illustrative enterprise computing environment;
FIG. 2 is an illustration of a typical computer system with computer software programs;
FIG. 3 is a block diagram illustrating an overview of the enterprise management system according to the preferred embodiment of the present invention;
FIG. 4 is a block diagram illustrating an overview of the Monitor component of the enterprise management system according to the preferred embodiment of the present invention;
FIG. 5 is a block diagram illustrating an overview of the Agent component of the enterprise management system according to the preferred embodiment of the present invention;
FIG. 6 is a block diagram illustrating an overview of the Analyze component of the enterprise management system according to the preferred embodiment of the present invention;
FIG. 7 is a block diagram illustrating an overview of the Predict component of the enterprise management system according to the preferred embodiment of the present invention;
FIG. 8 is a flowchart illustrating an overview of the collection and sampling of metric data;
FIG. 9 is a diagram illustrating an unsampled segment of a seen event;
FIG. 10 is a diagram illustrating an unseen event;
FIG. 11 is a flowchart illustrating an overview of the estimation of metric data statistics;
FIG. 12 is a flowchart illustrating the determination of the total uncaptured utilization;
FIG. 13 is a flowchart further illustrating the determination of the total uncaptured utilization;
FIG. 14 is a flowchart illustrating the determination of the portion of the total uncaptured utilization for an exponential distribution;
FIG. 15 is a flowchart illustrating the determination of the portion of the total uncaptured utilization for a uniform distribution;
FIG. 16 is a flowchart illustrating the determination of the portion of the total uncaptured utilization for an unknown distribution;
FIG. 17 is a flowchart illustrating an alternative method of the determination of the portion of the total uncaptured utilization for an unknown distribution;
FIG. 18 is a flowchart illustrating the determination of the total unseen utilization;
FIG. 19 illustrates a matrix of buckets used in the estimation of the total unseen utilization;
FIG. 20 illustrates a specific example of the estimation of the total unseen utilization with buckets;
FIG. 21 is a flowchart illustrating the iterative bucket method of estimating the total unseen utilization;
FIGS. 22 and 23 are equations which are used to generate a length distribution of the unseen processes.

`DETAILED DESCRIPTION OF THE
`PREFERRED EMBODIMENT
`
`U.S. Pat. No. 5,655,081 titled “System for Monitoring and
`Managing Computer Resources and Applications Across a
Distributed Environment Using an Intelligent, Autonomous
Agent Architecture” is hereby incorporated by reference as
though fully and completely set forth herein.
U.S. Pat. No. 5,761,091 titled “Method and System for
Reducing the Errors in the Measurements of Resource
Usage in Computer System Processes and Analyzing Process
Data with Subsystem Data” is hereby incorporated by
`reference as though fully and completely set forth herein.
FIG. 1 illustrates an enterprise computing environment
according to one embodiment of the present invention. An
enterprise 100 comprises a plurality of computer systems
which are interconnected through one or more networks.
Although one particular embodiment is shown in FIG. 1, the
enterprise 100 may comprise a variety of heterogeneous
computer systems and networks which are interconnected in
`a variety of ways and which run a variety of software
`applications.
`One or more local area networks (LANs) 104 may be
included in the enterprise 100. A LAN 104 is a network that
spans a relatively small area. Typically, a LAN 104 is
confined to a single building or group of buildings. Each
node (i.e., individual computer system or device) on a LAN
104 preferably has its own CPU with which it executes
programs, and each node is also able to access data and
devices anywhere on the LAN 104. The LAN 104 thus
allows many users to share devices (e.g., printers) as well as
data stored on file servers. The LAN 104 may be characterized
by any of a variety of types of topology (i.e., the
geometric arrangement of devices on the network), of protocols
(i.e., the rules and encoding specifications for sending
data, and whether the network uses a peer-to-peer or client/
`server architecture), and of media (e.g., twisted-pair wire,
`coaxial cables, fiber optic cables, radio waves). As illus-
`trated in FIG. 1, the enterprise 100 includes one LAN 104.
`However, in alternate embodiments the enterprise 100 may
`include a plurality of LANs 104 which are coupled to one
`another through a wide area network (WAN) 102. A WAN
`102 is a network that spans a relatively large geographical
`area.
`
`Each LAN 104 comprises a plurality of interconnected
`computer systems and optionally one or more other devices:
for example, one or more workstations 110a, one or more
personal computers 112a, one or more laptop or notebook
`computer systems 114, one or more server computer systems
`116, and one or more network printers 118. As illustrated in
`FIG. 1, the LAN 104 comprises one of each of computer
`systems 110a, 112a, 114, and 116, and one printer 118. The
`LAN 104 may be coupled to other computer systems and/or
`other devices and/or other LANs 104 through a WAN 102.
`One or more mainframe computer systems 120 may
`optionally be coupled to the enterprise 100. As shown in
`FIG. 1, the mainframe 120 is coupled to the enterprise 100
`through the WAN 102, but alternatively one or more main-
`frames 120 may be coupled to the enterprise 100 through
`one or more LANs 104. As shown, the mainframe 120 is
coupled to a storage device or file server 124 and mainframe
`terminals 122a, 122b, and 122c. The mainframe terminals
`122a, 122b, and 122c access data stored in the storage
`device or file server 124 coupled to or comprised in the
`mainframe computer system 120.
`The enterprise 100 may also comprise one or more
`computer systems which are connected to the enterprise 100
through the WAN 102: as illustrated, a workstation 110b and
a personal computer 112b. In other words, the enterprise 100
`may optionally include one or more computer systems
`which are not coupled to the enterprise 100 through a LAN
`104. For example, the enterprise 100 may include computer
`systems which are geographically remote and connected to
`the enterprise 100 through the Internet.
The present invention preferably comprises computer
programs 160 stored on or accessible to each computer
system in the enterprise 100. FIG. 2 illustrates computer
programs 160 and a typical computer system 150. Each
computer system 150 typically comprises components such
as a CPU 152, with an associated memory media. The
memory media stores program instructions of the computer
programs 160, wherein the program instructions are executable
by the CPU 152. The memory media preferably comprises
a system memory such as RAM and/or a nonvolatile
`memory such as a hard disk. The computer system 150
`further comprises a display device such as a monitor 154, an
`alphanumeric input device such as a keyboard 156, and
`optionally a directional input device such as a mouse 158.
`The computer system 150 is operable to execute computer
`programs 160.
`When the computer programs are executed on one or
`more computer systems 150, an enterprise management
`system 180 is operable to monitor, analyze, and manage the
`computer programs, processes, and resources of the enter-
`prise 100. Each computer system 150 in the enterprise 100
`executes or runs a plurality of software applications or
`processes. Each software application or process consumes a
portion of the resources of a computer system and/or network:
for example, CPU time, system memory such as
RAM, nonvolatile memory such as a hard disk, network
`bandwidth, and input/output (I/O). The enterprise manage-
`ment system 180 permits users to monitor, analyze, and
`manage resource usage on heterogeneous computer systems
`150 across the enterprise 100.
`FIG. 3 shows an overview of the enterprise management
`system 180. The enterprise management system 180
includes at least one console node 400 and at least one agent
node 300, but it may include a plurality of console nodes 400
and/or a plurality of agent nodes 300. In general, an agent
node 300 executes software to collect metric data on its
computer system 150, and a console node 400 executes
software to monitor, analyze, and manage the collected
metrics from one or more agent nodes 300. A metric is a
measurement of a particular system resource. For example,
in the preferred embodiment, the enterprise management
system 180 collects metrics such as CPU, disk I/O, file
system usage, database usage, threads, processes, kernel,
registry, logical volumes, and paging. Each computer system
150 in the enterprise 100 may comprise a console node 400,
an agent node 300, or both a console node 400 and an agent
node 300. In the preferred embodiment, server computer
systems include agent nodes 300, and other computer systems
may also comprise agent nodes 300 as desired, e.g., file
servers, print servers, e-mail servers, and internet servers.
The console node 400 and agent node 300 are characterized
by an end-by-end relationship: a single console node 400
may be linked to a single agent node 300, or a single console
node 400 may be linked to a plurality of agent nodes 300, or
`a plurality of console nodes 400 may be linked to a single
`agent node 300, or a plurality of console nodes 400 may be
`linked to a plurality of agent nodes 300.
`In the preferred embodiment, the console node 400 com-
`prises four user-visible components: a Monitor component
`402, a Collect graphical user interface (GUI) 404, an Ana-
`lyze component 406, and a Predict component 408. In one
`embodiment, all four components 402, 404, 406, and 408 of
the console node 400 are part of the “BEST/1 FOR DISTRIBUTED
SYSTEMS” software package or the
“PATROL” software package, all available from BMC
Software, Inc. The agent node 300 comprises an Agent 302,
one or more data collectors 304, Universal Data Repository
(UDR) history files 210a, and Universal Data Format (UDF)
history files 212a. In alternate embodiments, the agent node
300 includes either of UDR 210a or UDF 212a, but not both.
`The Monitor component 402 allows a user to monitor, in
`real-time, data that is being collected by an Agent 302 and
`being sent to the Monitor 402. The Collect GUI 404 is
employed to schedule data collection on an agent node 300.
The Analyze component 406 takes historical data from a
UDR 210a and/or UDF 212a to create a model of the
enterprise 100. The Predict component 408 takes the model
from the Analyze component 406 and allows a user to alter
the model by specifying hypothetical changes to the enterprise
100. Analyze 406 and Predict 408 can create output in
a format which can be understood and displayed by a
Visualizer tool 410. In the preferred embodiment, Visualizer
410 is the “BEST/1-VISUALIZER” available from BMC
`Software, Inc. In one embodiment, Visualizer 410 is also
`part of the console node 400.
`The Agent 302 controls data collection on a particular
`computer system and reports the data in real time to one or
`more Monitors 402. In the preferred embodiment, the Agent
302 is part of the “BEST/1 FOR DISTRIBUTED SYSTEMS”
software package available from BMC Software,
Inc. The data collectors 304 collect data from various
processes and subsystems of the agent node 300. The Agent
302 sends real-time data to the UDR 210a, which is a
database of historical data in a particular data format. The
UDF 212a is similar to the UDR 210a, but the UDF 212a
`uses an alternative data format and is written directly by the
`data collectors 304.
`
FIG. 4 shows an overview of the Monitor component 402
of the console node 400 of the enterprise management
system 180. The Monitor 402 comprises a Manager Daemon
430, one or more Monitor Consoles (as illustrated, 420a and
420b), and a Policy Registration Queue 440. Although two
Monitor Consoles 420a and 420b are shown in FIG. 4, the
`present invention contemplates that one or more Monitor
`Consoles may be executing on any of one or more console
`nodes 400.
`
`In the preferred embodiment, the Monitor Consoles 420a
and 420b use a graphical user interface (GUI) for user input
and information display. Preferably, the Monitor Consoles
420a and 420b are capable of sending several different types
`of requests to an Agent 302, including: alert requests, update
`requests, graph requests, and drilldown requests. An alert
`request specifies one or more thresholds to be checked on a
`routine basis by the Agent 302 to detect a problem on the
`agent node 300. For example, an alert request might ask the
`Agen