`Castle Park
`Cambridge CB3 0RD
`United Kingdom
`
TELEPHONE: Cambridge (01223) 515010
INTERNATIONAL: +44 1223 515010
FAX: +44 1223 359779
E-MAIL: apm@ansa.co.uk
`
`ANSA Phase III
`
`Monitoring in Distributed Systems
`
`Yigal Hoffner
`
`Abstract
`
`A general model of management is introduced and used in the development of a model of the
`management of monitoring for object-based federated distributed systems. This model is
`subsequently used to show how monitoring and its management can be implemented in such
`systems.
`
`The information and structures necessary for conducting a monitoring session with multiple
`objects are presented. The problem of managing a monitoring session, where the set of objects
`under observation changes dynamically, is addressed. Finally, the problem of management
`across federation boundaries is discussed.
`
`APM.1008.01
`
`Approved
`Architecture Report
`
`25th October 1994
`
`Distribution:
`Supersedes:
`Superseded by:
`
Copyright © 1994 Architecture Projects Management Limited
`The copyright is held on behalf of the sponsors for the time being of the ANSA Workprogramme.
`
The material in this Report has been developed as part of the ANSA Architecture
for Open Distributed Systems. ANSA is a collaborative initiative, managed
by Architecture Projects Management Limited on behalf of the companies
sponsoring the ANSA Workprogramme.

The ANSA initiative is open to all companies and organisations. Further information
on the ANSA Workprogramme, the material in this report, and on other
reports can be obtained from the address below.

The authors acknowledge the help and assistance of their colleagues, in sponsoring
companies and the ANSA team in Cambridge in the preparation of this
report.
`
`Architecture Projects Management Limited
`
`Poseidon House
`Castle Park
`CAMBRIDGE
`CB3 0RD
`United Kingdom
`
TELEPHONE UK: (01223) 515010
INTERNATIONAL: +44 1223 515010
FAX: +44 1223 359779
E-MAIL: apm@ansa.co.uk
`
Copyright © 1994 Architecture Projects Management Limited
The copyright is held on behalf of the sponsors for the time being of the ANSA
Workprogramme.

Architecture Projects Management Limited takes no responsibility for the consequences
of errors or omissions in this Report, nor for any damages resulting
from the application of the ideas expressed herein.
`
`
Contents

1 Introduction
1.1 Abstract
1.2 Audience, scope and purpose
1.3 Context
1.4 Overview

2 Monitoring
2.1 The purpose of monitoring
2.2 Modelling and the level of monitoring
2.3 The problems of monitoring
2.3.1 Direct and indirect observations
2.3.2 Complete and incomplete observations
2.3.3 Presentation problems
2.3.4 Monitoring and interference

3 Distribution and monitoring
3.1 Introduction
3.2 Aspects of distribution
3.2.1 Physical separation
3.2.2 Concurrency
3.2.3 Heterogeneity
3.2.4 Federation
3.2.5 Scaling
3.2.6 Evolution
3.3 Problems and reversed assumptions
3.4 Conclusions about monitoring in a distributed system

4 Approach to monitoring in distributed systems
4.1 Introduction
4.2 Monitoring and its management in object-based federated distributed systems

5 A model of monitoring and its management
5.1 Introduction
5.2 The generic model of management
5.3 Applying the model of management to monitoring
5.4 Developing the model of management of monitoring
5.4.1 The generic model

6 The management of monitoring
6.1 Introduction
6.2 The monitoring process
6.3 Areas of management
6.4 Management of generation of monitoring events
6.5 Management of distribution, collation and logging
6.6 Management of processing
6.7 Management of processing and presentation processes
6.8 Management facilities
6.9 Monitoring, management and system development epochs

7 Monitoring and management in objects
7.1 Introduction
7.2 Model of monitoring facilities in an object
7.3 Monitoring and management facilities
7.3.1 Object query operations
7.3.2 Object activation operations
7.4 Model of monitoring facilities in a capsule
7.5 Monitoring and management facilities
7.5.1 Capsule query operations
7.5.2 Capsule activation operations
7.6 The granularity of monitoring

8 Managing a monitoring session
8.1 Introduction
8.2 Monitoring a configuration of multiple objects
8.2.1 Managing a dynamic monitoring session
8.2.2 Scope of monitoring
8.2.3 Views and information in a monitoring session
8.3 Assumptions about the monitoring session
8.4 Monitoring session management facilities
8.5 Managing a monitoring session
8.5.1 The phases of a monitoring session
8.6 Setting up the monitoring session components
8.7 Setting up of the initial scope of monitoring
8.8 Managing an ongoing monitoring session
8.8.1 Extending the scope of Monitoring
8.8.2 Notifying the MMgr

9 Monitoring across boundaries
9.1 Introduction
9.1.1 Framework for standardization
9.1.2 Management domains and boundaries
9.2 Integrating application and distributed infrastructure monitoring
9.3 Integrating monitoring in a single domain
9.4 Integrating monitoring across several domains
9.5 Monitoring facilities standardization issues
9.5.1 Basic set of events - a taxonomy
9.5.2 Management of monitoring in objects
9.5.3 Access to management and scope of monitoring
9.5.4 The Monitoring manager and Monitoring collator
9.5.5 Presentation issues
`
`1 Introduction
`
`1.1 Abstract
`
`A general model of management is introduced and used in the development of
`a model of the management of monitoring for object-based federated
`distributed systems. This model is subsequently used to show how monitoring
`and its management can be implemented in such systems.
`The information and structures necessary for conducting a monitoring session
`with multiple objects are presented. The problem of managing a monitoring
`session, where the set of objects under observation changes dynamically, is
`addressed. Finally, the problem of management across federation boundaries
`is discussed.
`
`1.2 Audience, scope and purpose
`
This document develops a model of monitoring and its management for object-based
federated distributed systems. It is addressed to designers of distributed
systems.
`The approach described in this document requires that the underlying
`distributed systems infrastructure can represent managed entities as
`encapsulated objects and can transmit references to object interfaces as
`parameters of management operations. Such a capability is the basis of the
`ISO Basic Reference Model for Open Distributed Processing [X.900 92], the
`OMG Combined Object Request Broker Specification [OMG 91] and the ANSA
`architecture [AR.001 93].
`
`1.3 Context
`
This document should be read in conjunction with [TR.39 93] and [TR.41 93]. The
documents are related to each other as follows:
• TR.39 explains the philosophy and general approach to management in
object-based federated distributed systems
• TR.41 explains the problems with visualizing distributed systems and
discusses the requirements such a process places on the monitoring
infrastructure
• this document explains and develops a model of monitoring and its
management. It uses the model presented in TR.39 in order to construct a
model of the management of monitoring. The requirements which the
visualization of distributed systems imposes on the monitoring
infrastructure, and which are outlined in TR.41, are also used.
`
`
`1.4 Overview
`
`Monitoring is the process of obtaining, collecting, and presenting the
`information required by an observer about the observed system [JOYCE 87],
`[DOMAINS 92], [MCDOWELL 89], [SAMANI 92], [SLOMAN 89a], [SLOMAN 89b],
`[LABARRE 91] and [WINTERBOTHAM 87].
`Monitoring is always carried out with a purpose in mind. The general aim is to
`obtain information in order to construct a model of system behaviour or to
`modify an existing model. The general activity of monitoring a system can be
`specialized to a particular purpose such as accounting, debugging or testing,
among others. The specialization of monitoring to the different purposes
determines the type of information collected and the way in which it is collected.
Distribution, and more specifically issues such as heterogeneity,
autonomy, physical separation and concurrency, complicates the process of
monitoring. The design and development of monitoring facilities needs to
deal with these problems.
`The object-based approach to building distributed systems requires the
`designers to incorporate monitoring and management facilities in each object.
`This paper is an investigation of the facilities which should be included in each
`object if monitoring is to be viable in distributed systems.
`Monitoring often involves correlating concurrent events at multiple objects.
`Management structures are, therefore, necessary to maintain information on
`the objects participating in a monitoring session, and manage the monitoring
`facilities in each of them. In addition, appropriate structures for collecting
`monitoring information are required.
`In a distributed system, a monitoring session can evolve dynamically, as
`activities related to an application spread throughout the system. The
`management of a monitoring session must therefore be able to extend and
`contract the set of objects under observation. The distributed system
`infrastructure must allow access to management functions of an object given a
`reference to one of the object’s service interfaces.
`When monitoring across federation boundaries, differences in monitoring and
`management facilities must be accommodated either by prior agreement to
`provide common facilities, or by supplying the appropriate translators which
`allow interworking. There is also a need to provide channels and mechanisms
`for resolving policy conflicts across federation boundaries. The integration of
`monitoring facilities of different systems is an important part of the problem of
`monitoring across boundaries.
`
`
`2 Monitoring
`
2.1 The purpose of monitoring
`
`Monitoring is carried out in order to obtain information about a system, and in
`general, monitoring is part of the process of management (Figure 2.1). Among
`the many activities which involve monitoring we find:
• debugging
• testing
• accounting
• performance evaluation
• resource utilisation analysis
• security
• fault detection
• teaching aid.
`Monitoring and its management are concerned with providing the
`necessary information in order to allow the construction of the required model
`of the observed system and its presentation. It is the purpose of monitoring
`which dictates what should be observed and also how the information is to be
`obtained.
`
Figure 2.1: The relationship between management and monitoring
[Figure omitted. Labelled elements: System, Monitoring, Decision making, Controlling.]
`
`2.2 Modelling and the level of monitoring
`
`The different purposes for which monitoring is carried out can be executed at
`different levels. Thus, for example, debugging a single object as opposed to
`debugging the interactions among multiple objects will require different
events to be observed. A language debugger will require events to be
generated at a finer granularity than a tool aimed at
debugging the interactions between objects.
Some of the purposes listed in §2.1 will require a different model, or models,
of the distributed system. The exact level of
`modelling will dictate the granularity of the events the observer wishes to
`monitor.
`
2.3 The problems of monitoring
`
`The following is an exposition of the problems encountered when monitoring
`centralized and distributed computer systems.
`
`2.3.1 Direct and indirect observations
The behaviour of some systems can be directly observed, thereby making the
process of monitoring relatively straightforward. In computer systems most
events of interest cannot be observed directly without special facilities,
thereby requiring the incorporation of a monitoring infrastructure in such
systems. The monitoring infrastructure will also facilitate the management of
monitoring.
There may be several levels of indirection between the observed system and
the observer. Indirection may:
• make changes to the observations which are not related to the behaviour
of the observed system, for example, change the order of the messages
sent to the observer
• affect the reliability of the observations
• introduce distance between observer and observed system and hence lack
of trust
• directly influence the behaviour of the observed system (interference).
Distribution complicates the process of monitoring, introducing additional
levels of indirection and subsequently additional problems. These are
discussed in Chapter 3.
`
`2.3.2 Complete and incomplete observations
`Completeness and incompleteness refer to whether the information necessary
`in order to construct a particular model of an observed system is available or
`not.
`It could be argued that any observation of a system only reveals part of the
`system. This is not a problem when the observer is constructing a particular
`model of the system and the observation fits this model. However,
`incompleteness can cause problems when it is not intended or not catered for
`[TR.41 93]. For example:
`• when information cannot be obtained thereby hiding certain aspects of the
`system from the observer
• when hidden information makes some of the available information
non-interpretable by creating the wrong context for its interpretation.
`In reality, the two problems may stem from the same cause.
`
`2.3.3 Presentation problems
In many cases it is necessary to modify the information from the observed
events in a system in order to overcome the following problems (Figure 2.2):
• observed events appear in a form which is not amenable to immediate
use by the observer
• observed events occur at a rate which cannot be easily used by the
observer
• the volume of observed events may be such that it overwhelms the
observer
• in a system which has no central point of observation, events of interest
may occur at different parts of the system. Structures and processes which
collect and order the information from the observed events are therefore
necessary.
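
The four problems above can be read as the stages of a monitor placed between the system and the observer. The following is a minimal sketch under stated assumptions, not the design proposed in this report: the names `SystemEvent`, `Monitor` and `present` are hypothetical, and each event is assumed to carry a timestamp that is comparable across sources (Chapter 3 explains why that assumption is hard to satisfy in a distributed system).

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class SystemEvent:
    """A raw event as generated in the observed system (hypothetical shape)."""
    source: str      # point of observation that reported the event
    timestamp: int   # assumed comparable across sources
    kind: str        # e.g. "invoke", "reply", "fault"


class Monitor:
    """Turns system events into monitoring events the observer can use."""

    def __init__(self, wanted: Callable[[SystemEvent], bool], max_events: int):
        self.wanted = wanted          # selection: reduces the volume of events
        self.max_events = max_events  # cap: protects the observer from overload

    def present(self, *streams: Iterable[SystemEvent]) -> List[str]:
        # Collect events from every point of observation.
        collected = [event for stream in streams for event in stream]
        # Keep only the events the observer asked for.
        selected = [event for event in collected if self.wanted(event)]
        # Order the events before presenting them.
        selected.sort(key=lambda event: (event.timestamp, event.source))
        # Keep only the most recent events if there are still too many.
        selected = selected[max(0, len(selected) - self.max_events):]
        # Translate each event into a form the observer can read.
        return [f"t={e.timestamp} {e.source}: {e.kind}" for e in selected]
```

Each stage answers one bullet: translation addresses form, the cap addresses rate, selection addresses volume, and collection with ordering addresses the absence of a central point of observation.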
`
`2.3.4 Monitoring and interference
`Every system is affected by being monitored. The extent of the influence may
`or may not be negligible from the point of view of the user(s) or the observer(s)
`of the system. There is a relation between the flexibility of the monitoring
`facilities, the cost of implementation, and the extent to which they interfere
`with the behaviour of the system.
The most general requirement of monitoring, which is independent of the
purpose for which it is introduced, is that, although the sequence of system
events may change as a result of the interference caused by monitoring, the
interference must not result in an illegal sequence of events taking place.
`
Figure 2.2: Monitoring: transforming the information from the system events
[Figure omitted. Labelled elements: System, System Event, Monitor, Monitoring Event, Observer.]
`
`
`3 Distribution and monitoring
`
3.1 Introduction
`
`This chapter discusses aspects of distribution which affect monitoring:
`physical separation, concurrency, heterogeneity, federation, scaling and
`evolution. A summary of the assumptions which are no longer valid when
`monitoring distributed systems, as opposed to centralised systems, is
`presented. Some general comments are made about the philosophy of
`management in an object-oriented distributed system. The chapter also
`identifies the three major problem areas of providing monitoring in distributed
`systems: management of monitoring, reconstruction of the causal flow of
`events, and presentation of monitoring information.
`
`3.2 Aspects of distribution
`
`The following sections discuss the aspects of distribution which have an effect
`on monitoring: physical separation, concurrency, heterogeneity, federation,
`scaling and evolution [WARNE 91].
`
`3.2.1 Physical separation
`In a distributed system the physical separation of objects is unavoidable. In
`addition, communication delays among objects are usually variable and
`unpredictable. As a result there is no single point of reference from which
`events in the entire system can be directly observed. In order to obtain a global
`view of the system it is necessary to collect information on local events from
`several locations, from which a reconstruction of the flow of global events can
be made. Such a reconstruction is needed, for example, to determine whether a
certain event at one location is
causally related to another event at some other location.
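
One well-known way to decide whether two events at different locations may be causally related is to stamp each event with a vector clock. The sketch below is an illustration only and is not necessarily the technique adopted in this report; the names `VectorClock`, `happened_before` and `causally_related` are hypothetical, and the test captures potential causality (one event could have influenced the other), not proof that it did.

```python
from typing import Dict

VectorClock = Dict[str, int]  # location name -> number of events seen from it


def happened_before(a: VectorClock, b: VectorClock) -> bool:
    """True if the event stamped `a` could have influenced the event stamped `b`."""
    locations = set(a) | set(b)
    no_later = all(a.get(loc, 0) <= b.get(loc, 0) for loc in locations)
    strictly_earlier = any(a.get(loc, 0) < b.get(loc, 0) for loc in locations)
    return no_later and strictly_earlier


def causally_related(a: VectorClock, b: VectorClock) -> bool:
    """True if one event precedes the other; False if neither precedes the other."""
    return happened_before(a, b) or happened_before(b, a)
```

For example, a send stamped `{"A": 1}` precedes a receive stamped `{"A": 1, "B": 1}`, while an event stamped `{"C": 1}` is unrelated to both.
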
`There are situations in which it is not possible to monitor events in certain
`parts of the system. This may be the result of the absence of monitoring
`facilities, or policy decisions imposed on an object. There are two additional
`complications in distributed systems:
• failures can occur during communication
• services may partially fail.
`Such failures may affect not only the activities being monitored, but also the
`monitoring of these activities, resulting in incomplete information.
`This complicates the reconstruction of the flow of events in the system, and
`results in an incomplete picture of the system. This problem is addressed in
`more detail in [TR.41 93].
`Distributed systems are characterized by the possibility of partial failures.
`Partial failures may lead to situations where some but not all of the managed
`objects in a system can be accessed. Moreover, some of the management
`infrastructure itself may fail. Fault tolerant techniques may therefore have to
`be applied to the management facilities themselves in order to make them
`more resilient to failure.
The physical separation of systems together with the variable communication
delays also means that there is no single point of control in a distributed
system. This, together with the absence of a single point of observation, means
that checkpoints, tracing, breakpoints and single stepping of a distributed
application are difficult, if not impossible, without changing the nature of the
system.
Figure 2.1 in Chapter 2 shows the relationship between a system, its monitor
and controller, and the decision making process. If the system is distributed,
the absence of a single point of control and the absence of a single point of
observation imply that the monitor and controller must be distributed as
well. Furthermore, in some systems the decision making process may be
distributed, may have to be carried out in the face of incomplete information,
or both.
`
`3.2.2 Concurrency
`Distributed systems will support multiple objects and activities. Bindings
`between objects will be set up and discarded, and objects will be able to invoke
`other objects asynchronously through these bindings. Furthermore, objects
`will be created and destroyed as the need arises. The dynamic initiation and
`termination of activities will lead to situations where the activities stemming
`from an application may spread throughout the system. The extent of the
`initiated activities may not be known in advance.
`From the point of view of monitoring this creates several difficulties. In order
`to gain sufficient understanding of the flow of events in a system it may not be
`enough to monitor a single object or simply its interactions with other objects.
`In fact we may wish to gain information on how activities spread in a system.
Thus we may wish to:
• fully activate monitoring of objects with which a monitored object
interacts
• follow the chain of activity as it moves from one object to another.
`As different combinations of these strategies may occasionally be required, the
`management of the monitoring activities in such circumstances will be
`difficult unless extremely flexible management structures can be provided.
`Together with different monitoring activation strategies, additional event
`information to allow the observer to follow activities throughout the system is
`necessary.
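
The two activation strategies can be sketched as rules for extending the set of monitored objects whenever an invocation is reported. This is a minimal illustration under stated assumptions: `Strategy`, `MonitoringScope` and `on_invocation` are hypothetical names, and every reported invocation is assumed to carry an activity identifier, which is the kind of additional event information mentioned above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Set


class Strategy(Enum):
    NEIGHBOURS = "monitor every object a monitored object interacts with"
    FOLLOW_ACTIVITY = "monitor only objects reached by a followed activity"


@dataclass
class MonitoringScope:
    strategy: Strategy
    objects: Set[str]                                  # objects under observation
    activities: Set[str] = field(default_factory=set)  # activity ids being followed

    def on_invocation(self, activity: str, caller: str, callee: str) -> bool:
        """Process one reported invocation; return True if the scope grew."""
        # Only invocations made by a monitored object are visible here,
        # and an object already in scope needs no further activation.
        if caller not in self.objects or callee in self.objects:
            return False
        # When following an activity, ignore invocations of other activities.
        if self.strategy is Strategy.FOLLOW_ACTIVITY and activity not in self.activities:
            return False
        self.objects.add(callee)
        return True
```

Given the invocations `("act1", "A", "B")`, `("act2", "A", "C")` and `("act1", "B", "D")` and an initial scope of `{"A"}`, following `act1` leaves C unmonitored, while the neighbour strategy spreads monitoring to all four objects.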
`
`3.2.3 Heterogeneity
`Large scale distributed systems inevitably include some diversity in their
`hardware, operating systems and their distributed system infrastructure. It is
`reasonable to assume, therefore, that this diversity will be reflected in the
`implementations of monitoring facilities. Distribution does not only refer to
`the run-time physical separation of components, but also to the possibility of a
`distributed development environment. In such a case it is possible to have
different implementations of monitoring which do not conform to one another.¹
In order to make possible monitoring across heterogeneous systems, it is
necessary to reach agreement on monitoring conformance issues. These are
discussed in Chapter 9.
Standard management facilities cannot be assumed across domain
boundaries. Different monitoring and control facilities may exist in different
management domains.
The problem of the integration of management infrastructures where different
monitoring and control facilities may exist can be overcome through
agreement on facilities or by the incorporation of facilities which allow
dynamic integration of the different local management facilities.

¹ This may happen between heterogeneous systems but may also happen in a
homogeneous environment.

3.2.4 Federation
`The existence of centralised ownership and universal and technical control in
`large scale distributed systems cannot be assumed, and separate sources of
`authority will inevitably reside side by side. In such systems a “federated”
`style of interworking will be necessary in which no participant is in control of
`the others. Each system controls its own services locally according to its
`policies. Different monitoring policies must be anticipated within federated
`systems and problems will arise when attempting to monitor across federation
`boundaries between systems whose monitoring policies clash. Cooperation
`between systems requires the parties responsible for them to negotiate the use
`of services either prior to the request for use of service or as a result of such a
`request.
`Examples of possible areas where negotiation is needed are:
`• where the authority allowed to request monitoring may differ
`• where the collation and logging strategies may be different causing, for
`example, security compromise or unacceptable resource usage
`• where granting access to monitoring management may be related in
`different systems to different conditions, e.g. system load, number of
`users, time of day, etc.
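
The conditions listed above can be pictured as a local policy that each domain evaluates when a monitoring request arrives from across a federation boundary. The sketch below is purely illustrative: `MonitoringPolicy`, `SystemState` and `objections` are hypothetical names, and the particular thresholds are assumptions chosen to mirror the examples in the text (authority, system load, number of users, time of day).

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class MonitoringPolicy:
    """Conditions one domain attaches to monitoring requests (hypothetical)."""
    authorised_requesters: FrozenSet[str]
    max_system_load: float   # refuse when load is above this fraction
    max_users: int           # refuse when more users than this are active
    permitted_hours: range   # hours of the day in which monitoring is allowed


@dataclass(frozen=True)
class SystemState:
    load: float
    users: int
    hour: int


def objections(policy: MonitoringPolicy, requester: str, state: SystemState) -> List[str]:
    """Return the reasons for refusing a request; an empty list means it is granted."""
    reasons = []
    if requester not in policy.authorised_requesters:
        reasons.append("requester not authorised")
    if state.load > policy.max_system_load:
        reasons.append("system load too high")
    if state.users > policy.max_users:
        reasons.append("too many users")
    if state.hour not in policy.permitted_hours:
        reasons.append("outside permitted hours")
    return reasons
```

Returning the reasons, not a bare yes or no, gives the requesting domain something to negotiate about, for example by retrying at a permitted hour.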
`
`3.2.5 Scaling
`As discussed in the section on concurrency, activities in a distributed system
`can spread and encompass large parts of the system. In cases where
`monitoring is expected to report on such activities, it is important to note that
`the monitoring activity itself will have to spread, thus consuming increasing
`storage, processing and communication resources. It is therefore essential that
`(the distribution of) monitoring itself scales well.
Scaling requires monitoring structures which can
accommodate distribution, system evolution, and growth of the activity in the
face of resource constraints and performance requirements. Both management
`and collation structures must be designed with scaling in mind; these issues
`are dealt with in Chapter 7 and Chapter 8.
`
`3.2.6 Evolution
`Distributed systems will evolve over time, possibly in an inconsistent manner.
`If monitoring procedures change over time, there may be clashes between
`monitoring standards embedded in new components and monitoring
`standards in existing components. The problems arising in evolving systems
`are often similar in nature to those arising in heterogeneous and federated
`systems.
`
`3.3 Problems and reversed assumptions
`
`Certain problems associated with monitoring in a centralized system are
`exacerbated when dealing with distributed systems. However, problems also
`arise because of the reversal of many of the implicit assumptions made when
`monitoring centralized systems:
• no central point of control: not being able to directly control the entire
system from any single point requires extensions to sequential techniques
involving monitoring, which in a centralized system are based on the
existence of a single thread of control. Examples of such techniques are
break-points, single stepping and checkpoints
• no central point of observation: not being able to directly observe the
system in its entirety from a single point of observation requires the
collection of locally observed events in order to construct global views.
However, this is more complicated than simply collecting the monitoring
information. Because communication delays are non-deterministic,
collation of monitoring information cannot be
based on the assumption that the order in which monitoring information
is collected is related to the order in which the events occurred. Furthermore,
events of interest may consist of sequences of events which occur at
different points of observation, thus requiring sequence recognition
facilities
• no central source of monitoring information: a frequent implicit
assumption in a centralized system is that the monitoring
information comes from a single source, and that error and monitoring messages
will be sent directly to the user's terminal or to a local file. Neither of
these assumptions holds in a distributed system. Collation strategies are
necessary to cater for multiple sources and destinations
• no central point of decision making: the process of making decisions
in a distributed system may itself be distributed. This may also be the
case with the management of monitoring, resulting in more than one
manager in a monitoring session or an object participating in more than
one monitoring session
• incomplete observability: in some cases it is not possible to observe
certain parts of the system at all, or only partially, resulting in incomplete
information about events in that part of the system
• non-determinism: distributed, asynchronous systems are inherently
non-deterministic. Thus, two executions of the same program may
produce different, but nevertheless valid, orderings of events. This makes
the reproduction of errors and the creation of certain test conditions
difficult, if not impossible, at times (monitoring information can be used to
reproduce test conditions if monitoring interference can be minimized
sufficiently)
• monitoring interference: the dependencies between different processes
in a distributed system are such that any change in the behaviour of one
process can alter the behaviour of the entire system. The inclusion of
monitoring in a distributed system can alter the behaviour of a program in
a manner which is important to the observer
• replication: in a distributed system an object may be implemented as a
replicated group [AR.002 93]. When wishing to monitor such an object it is
necessary to have the appropriate facilities to deal with both cases:
  - a group with replication transparency
  - a group without replication transparency
• migration: in a distributed system objects may migrate from one system
to another. This will cause difficulties with the control and collation of the
monitored objects. The appropriate monitoring and management facilities
must deal with such cases
• passivation: objects may be passivated [AR.006]. It is necessary to
decide what the meaning of monitoring a passivated object is and how to
notify the monitoring session of such a case
• objects, encapsulation and security: one of the problems with
monitoring in an object-oriented system is that the notion of monitoring is
directly opposed to one of the fundamental characteristics of such
systems, namely that of encapsulation. Ensuring that the state of objects
and their associated procedures are protected from external observation
and interference creates a conflict with the need to monitor those objects.
For example, the incorporation and usage of monitoring facilities in an
object may clash with security requirements
• objects administer their own management: in contrast to centralized
systems which are characterized by a central management entity,
management facilities are distributed to the objects. This also applies to
facilities for management of monitoring
• monitoring as a distributed activity: the monitoring of a distributed
system is itself a distributed activity and it therefore requires:
  - tools which allow the management of the monitoring process to access
    and use remote resources
  - that the monitoring services and their associated management
    structures do not interfere with the performance of the system to an
    unacceptable degree and that they scale well when active in large
    distributed systems
  - dynamic and selective monitoring activation: the ability to define
    the granularity of monitoring, activate it at run-time and modify it as
    the need arises without re-compilation. The granularity of monitoring
    is the level to which a single activity can be monitored in an object
    without having to activate the entire monitoring in the object
• agents and roles: the assumption that the same
agent in the same location may carry out several roles does not hold in
distributed systems. For example, the application programmer,
application user and the observer roles may be carried out by different
agents in different locations
• visualization and system models: this concerns the need to present the
data produced during a monitoring session to the user in an intelligible
form, relating it to known models of the system. Distribution adds an
extra level of complexity to intelligible presentation of monitoring
information. Special analysis and visualization tools are therefore
essential.
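
The point about collation and sequence recognition in the list above can be illustrated with a small collator that ignores arrival order. This is a sketch under stated assumptions, not the report's mechanism: `MonitoringEvent`, `collate` and `contains_sequence` are hypothetical names, and each source is assumed to stamp its events with a correctly maintained Lamport timestamp, so that sorting by timestamp yields a total order consistent with causality.

```python
from dataclasses import dataclass
from typing import Iterable, List, Sequence


@dataclass(frozen=True)
class MonitoringEvent:
    source: str    # object that generated the event
    lamport: int   # Lamport timestamp assigned at the source (assumption)
    kind: str


def collate(arrivals: Iterable[MonitoringEvent]) -> List[MonitoringEvent]:
    """Order events by logical time, breaking ties by source name."""
    return sorted(arrivals, key=lambda e: (e.lamport, e.source))


def contains_sequence(events: Sequence[MonitoringEvent], pattern: Sequence[str]) -> bool:
    """True if the kinds in `pattern` occur in `events` in that order (gaps allowed)."""
    kinds = iter(e.kind for e in events)
    # Each membership test consumes the iterator up to the match,
    # so later pattern items must be found after earlier ones.
    return all(kind in kinds for kind in pattern)
```

If a reply event from object B reaches the collator before the request event from object A that caused it, a pattern such as request followed by reply is missed in arrival order but recognised after `collate` has reordered the events.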
`
`3.4 Conclusions about monitoring in a distributed system
`
The problems cited above which distribution introduces to monitoring can be
grouped together into three major problem areas:
• the definition, design and incorporation of a monitoring and
management infrastructure to facilitate the dynamic monitoring of
distributed systems
• ordering and reconstruction of the flow of events in a distributed
system from the monitoring information: the transformation of a
collection of monitoring information of local events into a global picture.
The ability to reconstruct can be seen as a pre-requisite for providing
useful presentations of monitoring information
• the visualization of monitoring information in order to provide the
observer with useful models of the system and the activities in it.
`
`
4 Approach to monitoring in distributed systems

4.1 Introduction
`
This chapter presents the approach to monitoring and management in object-based
federated distributed systems, based on the problems cited in Chapter
3. A more detailed description of the approach and the rationale behind it is
given in [TR.39 93].
`
4.2 Monitoring and its management in object-based federated distributed systems

The principles of encapsulation and autonomy of objects mean that each
object will have its own management service and an interface to it (Figure 4.1).
`
Figure 4.1: Objects have their own monitoring management service and an interface to it
[Figure omitted. Labelled elements: Management interface, Mon Service, Service, Service interface.]
`
`Obtaining a referenc