`
`DHPN-1003 / Page 1 of 181
`
`
`
`Clusters for
`
`High Availability
`
`
`
`
`
`
Hewlett-Packard Professional Books

Blinn                        Portable Shell Programming: An Extensive Collection of Bourne Shell Examples
Blommers                     Practical Planning for Network Growth
Costa                        Planning and Designing High Speed Networks Using 100VG-AnyLAN, Second Edition
Crane                        A Simplified Approach to Image Processing: Classical and Modern Techniques
Fernandez                    Configuring the Common Desktop Environment
Fristrup                     USENET: Netnews for Everyone
Fristrup                     The Essential Web Surfer Survival Guide
Grady                        Practical Software Metrics for Project Management and Process Improvement
Grosvenor, Ichiro, O’Brien   Mainframe Downsizing to Upsize Your Business: IT-Preneuring
Gunn                         A Guide to NetWare® for UNIX®
Helsel                       Graphical Programming: A Tutorial for HP VEE
Helsel                       Visual Programming with HP-VEE
Kane                         PA-RISC 2.0 Architecture
Knouse                       Practical DCE Programming
Lewis                        The Art & Science of Smalltalk
Madell, Parsons, Abegg       Developing and Localizing International Software
Malan, Letsinger, Coleman    Object-Oriented Development at Work: Fusion In the Real World
McFarland                    X Windows on the World: Developing Internationalized Software with X, Motif®, and CDE
McMinds/Whitty               Writing Your Own OSF/Motif Widgets
Phaal                        LAN Traffic Management
Poniatowski                  The HP-UX System Administrator’s “How To” Book
Poniatowski                  HP-UX 10.x System Administration “How To” Book
Thomas                       Cable Television Proof-of-Performance: A Practical Guide to Cable TV Compliance Measurements Using a Spectrum Analyzer
Weygant                      Clusters for High Availability: A Primer of HP-UX Solutions
Witte                        Electronic Test Instruments
`
`
`
`
`Clusters for High
`Availability
`
`A Primer of HP-UX Solutions
`
`Peter Weygant
`
`Hewlett-Packard Company
`
`
`Prentice Hall PTR
`
`Upper Saddle River, New Jersey 07458
`
`
`
`
`Editorial/Production Supervision: Joanne Anzalone
`Acquisitions Editor: Karen Gettman
`Manufacturing Manager: Alexis R. Heydt
`Cover Design: Design Source
Manager, Hewlett-Packard Press: Pat Pekary

© 1996 by Hewlett-Packard Company
`
`
`
`Published by Prentice Hall PTR
`Prentice-Hall, Inc.
`A Simon & Schuster Company
`Upper Saddle River, NJ 07458
`
`All rights reserved. No part of this book may be
`reproduced, in any form or by any means, without
`permission in writing from the publisher.
`
MC/ServiceGuard and MC/LockManager are registered trademarks of Hewlett-Packard
Company. Oracle is a trademark of Oracle Corporation. Symmetrix and EMC are
trademarks of EMC Corporation. NFS is a trademark of Sun Microsystems, Inc. UNIX
is a registered trademark in the United States and in other countries, licensed exclusively
through X/Open Company, Ltd.

The publisher offers discounts on this book when ordered in bulk quantities.
For more information, contact the Corporate Sales Department, PTR Prentice Hall, One
Lake Street, Upper Saddle River, NJ 07458. Phone: 800-382-3419. FAX: 201-236-7141. E-mail:
corpsales@prenhall.com
`
`Printed in the United States of America
`
`10 9 8 7
`
ISBN 0-13-494758-4
`
`HP Part Number 33936-90007
`
Prentice-Hall International (UK) Limited, London
Prentice-Hall of Australia Pty, Limited, Sydney
Prentice-Hall of Canada, Inc., Toronto
Prentice-Hall Hispanoamericana S.A., Mexico
Prentice-Hall of India Private Limited, New Delhi
Prentice-Hall of Japan, Inc., Tokyo
Simon & Schuster Asia Pte. Ltd., Singapore
Editora Prentice-Hall do Brasil, Ltd., Rio de Janeiro
`
`
`
`
`
`
`Contents
`
`
`Foreword
`
`Preface
`
`Acknowledgements
`
`About the Author
`
`BASIC HIGH AVAILABILITY CONCEPTS
`
`What is High Availability?
`Available
`
`Highly Available
`Highly Available Computing
`Service Levels
`
`Continuous Availability
`Fault Tolerance
`
`Matching Availability to User Needs
`Choosing a Solution
`
`High Availability as a Business Requirement
`
`High Availability as Insurance
`
`
`High Availability as Opportunity
`Cost of High Availability
`
`What Are the Measures of High Availability?
`
`Calculating Availability
`Expected Period of Operation
`Calculating Mean Time Between Failures
`
`
Understanding the Obstacles to High Availability
`
`Duration of Outages
`Time Lines for Outages
`Causes of Planned Downtime
`Causes of Unplanned Downtime
`Severity of Unplanned Outages
`Designing for Reaction to Failure
`Identifying Points of Failure
`
`Preparing Your Organization for High
`Availability
`
`Stating Availability Goals
`Building the Appropriate Physical Environment
`Creating Automated Processes
`Using a Development and Test Environment
`Maintaining a Stock of Spare Parts
`Defining an Escalation Process
`Planning for Disasters
`Training System Administration Staff
`Using Dry Runs
`Documenting Every Detail
`
`
The Starting Point for a Highly Available System
`
`Basic Hardware Reliability
`Software Quality
`Intelligent Diagnostics
`
`
`Comprehensive System Management Tools
`Maintenance and Support Services
`
`Moving to High Availability
`
`Summary
`
`CREATING A HIGH AVAILABILITY CLUSTER
`
`Identifying Single Points of Failure in a
`Stand-alone System
`
`Eliminating Power Sources as Single
`Points of Failure
`
`Individual UPS Units
`
`Power Passthrough UPS Units
`
`Eliminating Disks as Single Points of Failure
`
`Data Protection with Disk Arrays
`Data Protection with Software Mirroring
`
`Eliminating the SPU as a Single Point of Failure
`
`
Eliminating Single Points of Failure in Networks
`
`Points of Failure in Client Connectivity
`Examples of Points of Failure
Points of Failure in Inter-Node Communication
Eliminating the Failure Points
Providing Redundant LAN Connections
Configuring Local Switching of LAN Interfaces
Providing Redundant FDDI Connections
Using Dual Attached FDDI
Redundancy for Dialup Lines, Hardwired Serial Connections and X.25
`
`
Eliminating Software as a Single Point of Failure
`
`Tailoring Applications for Cluster Use
`
`Implementing the
`High Availability Cluster
`
`Complete High Availability Solution
`
`HP’s HIGH AVAILABILITY CLUSTER
`
`COMPONENTS
`
`Choosing HA Architectures and Cluster
`Components
`
`Active/Standby Configurations
Using MC/ServiceGuard
Active/Active Configurations
Using MC/ServiceGuard
How MC/ServiceGuard Works
`Parallel Database Configuration
`Using MC/LockManager
`Oracle Parallel Server
`
`How MC/LockManager Works with OPS
`
`Selecting Other HA Subsystems
`
`MirrorDisk/UX
`High Availability Disk Storage Enclosure
`High Availability Disk Arrays
`EMC Disk Arrays
Journaled File System
OnLineJFS
`Transaction Processing Monitors
`Uninterruptible Power Supplies
`System and Network Management Tools
`
`
`Using Mission Critical Consulting and
`Support Services
`
`Availability Management Service
`Business Continuity Support
`Business Recovery Services
`
`
`SAMPLE HIGH AVAILABILITY SOLUTIONS
`
`
`Highly Available NFS System for Publishing
`
`High Availability Software and Packages
`Hardware Configuration
`Responses to Failures
`
`Stock Quotation Service
`
`High Availability Software and Packages
`Hardware Configuration
`Responses to Failures
`
`Order Entry and Catalog Application
`
`High Availability Software and Packages
`Hardware Configuration
`Responses to Failures
`
`Insurance Company Database
`
Two-Node OPS Configuration
`
`
`5
`
`GLOSSARY OF HIGH AVAILABILITY
`
`TERMINOLOGY
`
`AdminCenter
`
`Adoptive Node
`ADT
`AFR
`Alternate Node
`Annualized Failure Rate
`
`Architecture for HA
`Availability
`Average Downtime
`Cluster
`ClusterView
`
`Continuous Availability
`Custody
`Downtime
`Failure
`Failover
`Fault Tolerance
`
`Grouped Net
`Hardware Mirroring
`Highly Available
`Hot Plug Capability
`Hot Swap Capability
`LAN
`
`LAN interface
`Logical Volume Manager
`MC/LockManager
`MC/ServiceGuard
`Mean Time Between Failures
`
`Mean Time to Repair
MirrorDisk/UX
`Mirroring
`
`
MTBF
MTTR
Network Node Manager
Node
OpenView
`OperationsCenter
`Planned Downtime
`Primary Node
`Package
`Process Resource Manager
`RAID
`Redundancy
`Reliability
`Relocatable IP Address
`Service
`Service Level Agreement
`Shared Logical Volume Manager
`Single Point of Failure
`SLVM
`Software Mirroring
`SPOF
`SPU
`Subnet
`
SwitchOver/UX
`System Processor Unit
`Transfer of Packages
`Unplanned Downtime
`Volume Group
`
`Index
`
`
`
`
`Foreword
`
`
`
`
`
`
Over the last ten years, UNIX systems have moved from the specialized
role of providing desktop computing power for engineers into the
broader arena of commercial computing. This evolution is the result of
continual dramatic improvements in functionality, reliability,
performance, and supportability. We are now well into the next phase
of the UNIX evolution: providing solutions for mission critical
computing.

To best meet the requirements of the data center for availability,
scalability, and flexibility, Hewlett-Packard has developed a robust
cluster architecture for HP-UX that combines multiple systems into a
high availability cluster. Individual computers, known as nodes, are
connected in a loosely-coupled manner, each maintaining its own
separate processors, memory, operating system, and storage devices.
Special system processes bind these nodes together and allow them to
cooperate to provide outstanding levels of availability and
flexibility for supporting mission critical applications. The nodes in
a cluster can be configured either to share data on a set of disks or
to obtain exclusive access to data.

To maintain Hewlett-Packard’s commitment to the principles of open
systems, our high availability clusters use standards-based hardware
components such as SCSI disks and Ethernet LANs. There are no
proprietary APIs that force vendor lock-in, and most applications will
run on a high availability cluster without modification.
`
`
`
`
`As the world’s leading vendor of open systems, Hewlett-Packard
`is especially proud to publish this primer on cluster solutions for
`high availability. Peter Weygant has done a fine job of presenting
`the basic concepts, architectures, and terminology used in HP’s
`cluster solutions. This is the place to begin your exploration of the
`world of high availability clusters.
`
`Xuan Bui
`
`Hewlett-Packard General Systems Division
`Research and Development Laboratory Manager
`
`
`
`
`Preface
`
`
`
`
`
`This guide is about high availability (HA) computing through enterprise
`clusters. It presents basic concepts and terms, then describes the use of
`cluster technology to provide highly available open systems solutions
`for the commercial enterprise. Here are the topics:
`
- Chapter 1, “Basic High Availability Concepts,” presents the language
  used to describe highly available systems and components and
  introduces ways of measuring availability.

- Chapter 2, “Creating a High Availability Cluster,” describes in more
  detail the principles of HA configuration, with examples.

- Chapter 3, “HP’s High Availability Cluster Components,” is an
  overview of HP’s current roster of high availability software and
  hardware offerings.

- Chapter 4, “Sample HA Solutions,” discusses a few concrete examples
  of highly available cluster solutions.

- Chapter 5, “Glossary,” gives definitions of important words and
  phrases used to describe high availability.

Additional information is available in the HP publications Managing
MC/ServiceGuard and Configuring OPS Clusters with MC/LockManager.
The HP 9000 Servers Configuration Guide contains detailed
information about supported high availability configurations. This and
other more specialized documents on enterprise clusters are available
from your HP representative.
`
`
`Acknowledgments
`
This book has benefited from the careful review of many individuals
inside and outside of Hewlett-Packard. The author gratefully
acknowledges the contributions of these colleagues, many of whom are
listed here: Joe Algieri, Sally Anderson, Joe Bac, Bob Baird, Trent
Bass, Dan Beringer, Claude Brazell, Thomas Buenermann, Xuan Bui,
Karl-Heinz Busse, Bruce Campbell, Larry Cargnoni, Gina Cassinelli,
Marian Cochran, Annie Cooperman, Ron Czinski, Dan Dickerman, Pam
Dickerman, Larry Dino, Janie Felix, John Foxcroft, Shivaji Ganesh,
Janet Gee, Mike Gutter, Terry Hand, Michael Hayward, Frank Ho,
Margaret Hunter, Lisa Iarkowski, Art Ipri, Michael Kahn, Marty King,
Clark Macaulay, Gary Marcos, Debby McIsaac, Doug McKenzie, Tim
Metcalf, Parissa Mohamadi, Alex Morgan, Markus Ostrowicki, Bob Ramer,
Bob Sauers, Wesley Sawyer, David Scott, Dan Shive, Christine Smith,
Eric Soderberg, Steve Stichler, Tim Stockwell, Brad Stone, Liz Tam,
Bob Togasaki, Emil Velez, Tad Walsh, and Bev Woods. A special thank
you goes to those groups of Hewlett-Packard customers who read and
commented on early versions of the manuscript. Errors and omissions
are the author’s sole responsibility.
`
`
`
`
`About the Author
`
`
`
`
Peter S. Weygant is a Learning Products Engineer in the General Systems
Solutions laboratory at Hewlett-Packard. Formerly a professor of
`English, he has been a technical writer and consultant in the computer
`industry for the last 15 years. He has developed documentation and
`managed publication projects in the areas of digital imaging, relational
`database technology, and high availability systems. He has a BA degree
`in English Literature from Colby College as well as MA and PhD
`degrees in English from the University of Pennsylvania.
`
`
`
`
`
`
`CHAPTER 1
`Basic High Availability
`Concepts
`
`
This book takes an elementary look at high availability
(HA) computing and how it is implemented through
enterprise-level cluster solutions. We start in this chapter
with some of the basic concepts of HA. Here’s what we’ll
cover:

- What is High Availability?

- High Availability as a Business Requirement

- What Are the Measures of High Availability?

- Understanding the Obstacles to High Availability

- Preparing Your Organization for High Availability

- The Starting Point for a High Availability System

- From High Reliability to High Availability

- Designing a Highly Available System
`
`
`Later chapters explore the implementation of high
`availability in clusters, then describe HP’s high availability
`products in more detail. A separate chapter is devoted to
`concrete examples of business solutions that use HA.
`
`
`
`What is High Availability?
`
Before exploring the implications of high availability
in computer systems, we need to define some terms. What
do we mean by phrases like “availability,” “high
availability,” and “high availability computing”?
`
`Available
`
The term available describes a system that provides a
specific level of service as needed. This idea of availability
is part of everyday thinking. In computing, availability is
generally understood as the period of time when services
are available (for instance, 16 hours a day, six days a week)
or as the time required for the system to respond to users
(for example, under 1 second response time). Any loss of
service, whether planned or unplanned, is known as an
outage. Downtime is the duration of an outage measured
in units of time (e.g., minutes or hours).
`
`
`Highly Available
`
`
`
Figure 1.1 Highly Available Services: Electricity

Highly available characterizes a system that is designed
to avoid the loss of service by reducing or managing
failures as well as minimizing planned downtime for
the system. We expect a service to be highly available when
life, health, and well-being, including the economic
well-being of a company, depend on it.
`
`
For example, we expect electrical service to be highly
available. All but the smallest, shortest outages are
unacceptable, since we have geared our lives to depend on
electricity for refrigeration, heating, and lighting, in addition to
less important daily needs.
`
`Even the most highly available services occasionally
`go out, as anyone who has experienced a blackout or
`brownout in a large city can attest. But in these cases, we
`expect to see an effort to restore service at once. When a
`failure occurs, we expect the electric company to be on the
`road fixing the problem as soon as possible.
`
`Highly Available Computing
`
`In many businesses, the availability of computers has
`become just as important as the availability of electric
`power itself. Highly available computing uses computer
`systems which are designed and managed to operate with only
`a small amount of planned and unplanned downtime.
`
Note that highly available is not an absolute. The needs
of different businesses for high availability are quite
diverse. International businesses or companies running
multiple shifts may require user access to databases around
the clock. Financial institutions must be able to transfer
funds at any time of night or day, seven days a week. On
the other hand, some retail businesses may require the
computer to be available only 18 hours a day, but during
these 18 hours they may require sub-second response time
for transaction processing.

Figure 1.2 Service Outage
`
`Service Levels
`
The service level of a system is the degree of service
the system will provide to its users. Often, the service level
is spelled out in a document known as a service level
agreement (SLA). The service levels your business requires
determine the kind of applications you develop, and high
availability systems provide the hardware and software
framework in which these applications can work
effectively to provide the needed level of service. High
availability implies a service level in which both planned and
unplanned computer outages do not exceed a small stated
value.
`
`Continuous Availability
`
Continuous availability means non-stop service, that
is, there are no planned or unplanned outages at all. This is
a much more ambitious goal than high availability, since
there can be no lapse in service. In effect, continuous
availability is an ideal state rather than a characteristic of any
real world system.

The term is sometimes used to indicate a very high
level of availability in which only a very small known
quantity of downtime is acceptable. Note that high
availability does not imply continuous availability.
`
`Fault Tolerance
`
Fault tolerance is not a degree of availability so much
as a method for achieving very high levels of availability. A
fault tolerant system is characterized by redundancy in
most of the hardware components, including CPU,
memory, I/O subsystems, and other elements. A fault tolerant
system is one that has the ability to continue service in spite
of a hardware or software failure. However, even fault
tolerant systems are subject to outages from human error.
Note that high availability does not imply fault tolerance.
`
`
`Matching Availability to User Needs
`
A failure affects availability when it results in an
unplanned loss of service that lasts long enough to create a
problem for users of the system. User sensitivity will
depend on the specific application. For example, a failure
that is corrected within one second may not result in any
perceptible loss of service in an environment that does
online transaction processing (OLTP); but for a scientific
application that runs in a real-time environment, one
second may be an unacceptable interval.

Since any component can fail, the challenge is to
design systems in which problems can be predicted and
isolated before a failure occurs and in which failures are
quickly detected and corrected when they happen.
`
`Choosing a Solution
`
Your exact requirements for availability determine the
kind of solution you need. For example, if the loss of a
system for a few hours of planned downtime is acceptable to
you, then you may not need to purchase storage products
with hot pluggable disks. On the other hand, if you cannot
afford a planned period of maintenance during which a
disk replacement could be done on a mirrored disk system,
then you may wish to consider an HA disk array that
supports hot plugging or hot swapping of components.
(Descriptions of these HA products appear in later
sections.)
`
`
`High Availability as a Business
`Requirement
`
`In the current business climate, high availability com-
`puting is often seen as a requirement, not a luxury. From
`one perspective, high availability is a form of insurance
`against the loss of business due to computer downtime.
From another point of view, high availability provides new
`opportunities by allowing your company to provide better
`and more competitive customer service.
`
`High Availability as Insurance
`
`High availability computing is often seen as insurance
`against the following kinds of damage:
`
- Loss of income

- Customer dissatisfaction

- Missed opportunities
`
For commercial computing, a highly available solution
is needed when loss of the system results in loss of
revenue. In such cases, the application is said to be
mission-critical. For all mission-critical applications (that is, where
income may be lost through downtime), high availability
is a requirement. In banking, for example, the ability to
obtain certain account balances 24 hours a day may be
mission-critical. In parts of the securities business, the need for
high availability may only be for that portion of the day
when the stock market is active; at other times, systems
may be safely brought down.
`
`High Availability as Opportunity
`
Highly available computing provides a business
opportunity, since there is an increasing demand for
“around the clock” computerized services in areas as
diverse as banking and financial market operations,
communications, order entry and catalog services, resource
management, and others. It is not possible to give a simple
definition of when an application is mission-critical or of
when high availability of the application creates new
opportunities; this depends on the nature of the business.
However, in any business that depends on computers, the
following principles are always true:
`
- The degree of availability required is determined by
  business needs. There is no absolute amount of
  availability that is right for all businesses.

- There are many ways to achieve high availability.

- The means of achieving high availability affects all
  aspects of the system.

- The likelihood of failure can be reduced by creating
  an infrastructure that stresses clear procedures and
  preventive maintenance.

- Recovery from failures must be planned.
`
`
Some or all of the following are expectations for the
software applications that run in mission-critical
environments:

- There should be a low rate of application failures,
  that is, a maximum time between failures.

- Applications should be able to recover after failure.

- There should be minimal scheduled downtime.

- The system should be configurable without
  shutdown.

- System management tools must be available.
`
`Cost of High Availability
`
As with other kinds of insurance, the cost depends on
the degree of availability you choose. Thus the value of
high availability to the enterprise is directly related to the
costs of outages. The higher the cost of outage, the easier it
becomes to justify the expense of high availability
solutions. As the degree of availability approaches the ideal of
100% availability, the cost of the solution increases more
rapidly. Thus, the cost of 99.95% availability is significantly
greater than the cost of 99.5% availability, and the cost of
99.5% availability is significantly greater than 99%
availability, and so on.
`
`
`What Are the Measures of High
`Availability?
`
Availability and reliability can be described in
terms of numbers, though doing so can be very
misleading. In fact, there is no standard method for
modeling or calculating the degree of availability in a
computer system. The important thing is to create clear
definitions of what the numbers mean and then use
them consistently. Remember that availability is not a
measurable attribute of a system like CPU clock speed.
Availability can only be measured historically, based on
the behavior of the actual system. Moreover, in
measuring availability, it is important to ask not simply, “Is the
application available?” but “Is the entire system
providing service at the proper level?”

Availability is related to reliability, but they are not the
same thing. Availability is the percentage of total system
time the computer system is accessible for normal usage.
Reliability is the amount of time before a system is
expected to fail. Availability includes reliability.
`
`Calculating Availability
`
`The formula in Figure 1.3 defines availability as the
`percentage of elapsed time that a unit can be used. Elapsed
`time is continuous time (operating time + downtime).
`
`
`
\[
\%\ \text{Availability} = \frac{\text{Total Elapsed Time} - \text{Sum of Inoperative Times}}{\text{Total Elapsed Time}}
\]

Figure 1.3 Availability
`
Availability is actually the probability that a unit is
available (that is, operating normally). Availability is
usually expressed as a percentage of hours per week, month, or
year during which the system and its services can be used
for normal business.
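
As a minimal sketch of this calculation (not taken from any HP tool; the function name and the sample outage durations are assumptions made purely for illustration), availability can be computed from the elapsed time and a list of periods out of service:

```python
from typing import Iterable


def availability_percent(total_elapsed_hours: float,
                         inoperative_hours: Iterable[float]) -> float:
    """Return the percentage of elapsed time during which the unit was usable.

    total_elapsed_hours is continuous time (operating time + downtime).
    inoperative_hours lists the duration of each period out of service.
    """
    if total_elapsed_hours <= 0:
        raise ValueError("total elapsed time must be positive")
    downtime = sum(inoperative_hours)
    if downtime > total_elapsed_hours:
        raise ValueError("downtime cannot exceed total elapsed time")
    return 100.0 * (total_elapsed_hours - downtime) / total_elapsed_hours


# One year of continuous time (24 x 365 = 8760 hours) with three
# hypothetical outages totalling 4 hours.
print(round(availability_percent(8760, [2.0, 0.5, 1.5]), 3))
```

Because the result depends entirely on the recorded outage history, the same function gives different answers for different measurement periods, which is why availability can only be measured historically.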
`
`Expected Period of Operation
`
Measures of availability must be seen against the
background of the organization’s expected period of
operation of the system. The following tables show the actual
hours of uptime and downtime associated with different
percentages of availability for two common periods of
operation. Table 1.1 shows 24x7x365, which stands for a
system that is expected to be in use 24 hours a day, seven
days a week, 365 days a year.
`
Table 1.1 Uptime and Downtime for a 24x7x365 System

Availability   Minimum Expected   Maximum Allowable   Remaining
               Uptime (hours)     Downtime (hours)    Time (hours)
99%            8672               88                  0
99.5%          8716               44                  0
99.95%         8755               5                   0
100%           8760               0                   0
`
This table shows that there is no remaining time on the
system at all. All the available time in the year (8760 hours)
is accounted for. This means that all maintenance must be
carried out either when the system is up or during the
allowable downtime hours. In addition, the higher the
percentage of availability, the less time is allowable for failure.
`
`Table 1.2 shows a 12x5x52 system, which is expected
`to be up for 12 hours a day, five days a week, 52 weeks a
`year.
`
`
Table 1.2 Uptime and Downtime for a 12x5x52 System

Availability   Minimum Expected   Maximum Allowable   Remaining
               Uptime (hours)     Downtime (hours)    Time (hours)
99%            3088               32                  5642
99.5%          3104               16                  5642
99.95%         3118               2                   5642
100%           3118               0                   5642
`
This table shows that for the 12x5x52 system, there are
5642 hours of remaining time, which can be used for
planned maintenance operations requiring the system to be
down.
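
Figures of this kind can be approximated with a short calculation. The sketch below is only an illustration under stated assumptions: the function name is hypothetical, a year is taken as 8760 hours, and allowable downtime is rounded up to a whole hour. Because it works from 12 x 5 x 52 = 3120 scheduled hours, its results for a 12x5x52 system can differ from a printed table by an hour or two.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8760 hours of continuous time in a year


def uptime_budget(scheduled_hours: int, availability_pct: float):
    """Return (minimum uptime, maximum downtime, remaining time) in hours.

    scheduled_hours is the expected period of operation per year.
    Assumption: downtime is rounded up to the next whole hour, and the
    minimum uptime is whatever is left of the scheduled period.
    """
    raw_downtime = scheduled_hours * (100.0 - availability_pct) / 100.0
    # round() guards against tiny floating-point excess before ceil()
    max_downtime = math.ceil(round(raw_downtime, 9))
    min_uptime = scheduled_hours - max_downtime
    remaining = HOURS_PER_YEAR - scheduled_hours
    return min_uptime, max_downtime, remaining


# 24x7x365: every hour of the year is scheduled.
for pct in (99.0, 99.5, 99.95, 100.0):
    print(pct, uptime_budget(HOURS_PER_YEAR, pct))

# 12x5x52: 12 hours a day, five days a week, 52 weeks a year.
for pct in (99.0, 99.5, 99.95, 100.0):
    print(pct, uptime_budget(12 * 5 * 52, pct))
```

The remaining time is the part of the year that falls outside the scheduled period, so it does not change with the availability percentage.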
`
`Calculating Mean Time Between Failures
`
Availability is related to failure rates of system
components. A common measure of equipment reliability is the
mean time between failures (MTBF). This measure is
usually provided for individual system components, such as
disks. Measures like these are useful, but they are only one
dimension of the complete picture of high availability. For
example, they do not take into account the differences in
recovery times after failure.
`
`MTBF is given by the formula shown in Figure 1.4.
`
`
\[
\text{MTBF} = \frac{\text{Total Operating Time}}{\text{Total Number of Failures}}
\]

Figure 1.4 Mean Time Between Failures
`
The MTBF is calculated by summing the actual
operating times of all units, including units that do not fail, and
dividing that sum by the sum of all failures of the units.
Operating time is the sum of the hours when the system is
in use (that is, not powered off).

The MTBF is a statement of the time between failures
of a unit or units. In common applications, the MTBF is
used as a statement of the expected future performance
based on the past performance of a unit or population of
units. The failure rate is assumed to remain constant when
the MTBF is used as a predictive reliability measure.
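
The calculation just described can be sketched in a few lines. This is an illustration only: the function name and the sample fleet of four units are assumptions, not data from any real system.

```python
from typing import Sequence


def mtbf_hours(operating_hours: Sequence[float],
               failures: Sequence[int]) -> float:
    """Mean time between failures for a population of units.

    operating_hours[i] is the time unit i was in use (not powered off);
    failures[i] is the number of failures observed on unit i.
    Units that never failed still contribute their operating time.
    """
    if len(operating_hours) != len(failures):
        raise ValueError("one failure count is needed per unit")
    total_failures = sum(failures)
    if total_failures == 0:
        raise ValueError("MTBF is undefined when no failures were observed")
    return sum(operating_hours) / total_failures


# Hypothetical fleet: four units, each run for one year (8760 hours);
# two of them failed once, two did not fail at all.
print(mtbf_hours([8760, 8760, 8760, 8760], [1, 0, 0, 1]))
```

Using the result to predict future behavior carries the assumption stated above, namely that the failure rate stays constant.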
`
`
When gauging reliability for multiple instances of the
same unit, the individual MTBF figures are divided by the
number of units. This may result in much lower MTBF
figures for the disks in the system as a whole. For example, if
the MTBF for a disk mechanism is 500,000 hours, and the
MTBF of a disk module including fans and power supplies
is 200,000 hours, then the MTBF of 200 disks together in the
system is 1000 hours, which means about 9 expected
failures a year. The point is that the greater the number of units
operating together in a group, the greater the expected
failure rate within the group.
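
The arithmetic in this example can be checked with a small sketch. The function names are hypothetical, and the calculation assumes identical units that run continuously (8760 hours a year) with a constant failure rate.

```python
HOURS_PER_YEAR = 24 * 365  # 8760


def group_mtbf(unit_mtbf_hours: float, unit_count: int) -> float:
    """MTBF of a group of identical units operating together."""
    if unit_count < 1:
        raise ValueError("the group must contain at least one unit")
    return unit_mtbf_hours / unit_count


def expected_failures_per_year(mtbf_hours: float) -> float:
    """Expected number of failures in a year of continuous operation."""
    return HOURS_PER_YEAR / mtbf_hours


# 200 disk modules, each with an MTBF of 200,000 hours.
combined = group_mtbf(200_000, 200)          # 1000 hours for the group
print(combined, expected_failures_per_year(combined))  # about 9 per year
```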
`
`
`
`
`Understanding the Obstacles to High
`Availability
`
It is important to understand the obstacles to high
availability computing. This section describes some terms
that people often use to describe these obstacles.

A specific loss of a computer service as perceived by
the user is called an outage. The duration of an outage is
downtime. Downtime is either planned or unplanned.
Necessary outages are sometimes planned for system
upgrades, movement of an application from one system to
another, physical moves of equipment, and other reasons.
`
`
Unplanned outages occur when there is a failure
somewhere in the system. A failure is a cessation of
normal operation of some component. Failures occur in
hardware, software, system and network management,
and in the environment. Errors of human judgment also
cause failures. Not all failures cause outages, of course;
and not all unplanned outages are caused by failures.
Natural disasters and other catastrophic events can also
disrupt service.
`
`Duration of Outages
`
An important aspect of an outage is its duration.
Depending on the application, the duration of an outage
may be significant or insignificant. A 10-second outage
may not be critical, but two hours may be fatal to one
application, while another application may not even
tolerate a 10-second outage. Thus, your characterization of
availability must encompass the acceptable duration of
outages.
`
As an example, if your goal is 99.5% availability on a
24x7x365 system, you are allowed a maximum of 44 hours
of downtime per year. But you still need to determine what
duration is acceptable for a single outage. A large number
of 10-second outages might be acceptable (the total in 44
hours is 15,840 10-second outages); but most likely, a single
outage of 44 hours would be unacceptable.
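
One way to make this distinction concrete is to check an outage history against both limits: the total downtime budget and the longest single outage you are willing to accept. The sketch below is illustrative only; the function names and the 60-second single-outage limit are assumptions.

```python
from typing import Sequence


def max_outage_count(budget_hours: int, outage_seconds: int) -> int:
    """How many outages of a given length fit into a downtime budget."""
    return (budget_hours * 3600) // outage_seconds


def within_goal(outages_seconds: Sequence[float],
                budget_hours: float,
                longest_acceptable_seconds: float) -> bool:
    """True if total downtime fits the budget AND no single outage
    exceeds the longest acceptable duration."""
    total_ok = sum(outages_seconds) <= budget_hours * 3600
    each_ok = all(s <= longest_acceptable_seconds for s in outages_seconds)
    return total_ok and each_ok


# 44 hours of allowable downtime holds 15,840 ten-second outages.
print(max_outage_count(44, 10))

# Many short outages can meet the goal; one 44-hour outage uses the
# same budget but fails the (hypothetical) 60-second per-outage limit.
print(within_goal([10] * 1000, 44, 60))
print(within_goal([44 * 3600], 44, 60))
```

Stating both limits in the availability goal avoids the situation in which a yearly percentage is met on paper while a single long outage still does serious damage.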
`
`
`Time Lines for Outages
`
`The importance of high availability can be seen in the
`following illustrations, which show the time lines for a
`computer system outage following a disk crash. Figure 1.5
`shows a sequence of events that might take place when an
`OLTP client experiences a disk crash on a conventional sys-
`tem using unmirrored disks for data; when the disk
`crashes,