Clusters for
High
Availability

PETER S. WEYGANT
Hewlett-Packard Professional Books
`
`DHPN-1003
`Dell Inc. vs. Electronics and Telecommunications, IPR2013-00635
`Page 1 of 181
`
`
`
`Clusters for
`High Availability
`
`
`
`
`Hewlett-Packard Professional Books
`
Blinn                       Portable Shell Programming: An Extensive Collection of Bourne Shell Examples
Blommers                    Practical Planning for Network Growth
Costa                       Planning and Designing High Speed Networks Using 100VG-AnyLAN, Second Edition
Crane                       A Simplified Approach to Image Processing: Classical and Modern Techniques
Fernandez                   Configuring the Common Desktop Environment
Fristrup                    USENET: Netnews for Everyone
Fristrup                    The Essential Web Surfer Survival Guide
Grady                       Practical Software Metrics for Project Management and Process Improvement
Grosvenor, Ichiro, O'Brien  Mainframe Downsizing to Upsize Your Business: IT-Preneuring
Gunn                        A Guide to NetWare® for UNIX®
Helsel                      Graphical Programming: A Tutorial for HP VEE
Helsel                      Visual Programming with HP-VEE
Kane                        PA-RISC 2.0 Architecture
Knouse                      Practical DCE Programming
Lewis                       The Art & Science of Smalltalk
Madell, Parsons, Abegg      Developing and Localizing International Software
Malan, Letsinger, Coleman   Object-Oriented Development at Work: Fusion In the Real World
McFarland                   X Windows on the World: Developing Internationalized Software with X, Motif®, and CDE
McMinds/Whitty              Writing Your Own OSF/Motif Widgets
Phaal                       LAN Traffic Management
Poniatowski                 The HP-UX System Administrator's "How To" Book
Poniatowski                 HP-UX 10.x System Administration "How To" Book
Thomas                      Cable Television Proof-of-Performance: A Practical Guide to Cable TV Compliance Measurements Using a Spectrum Analyzer
Weygant                     Clusters for High Availability: A Primer of HP-UX Solutions
Witte                       Electronic Test Instruments
`
`
`
`Clusters for High
`Availability
`
`A Primer of HP-UX Solutions
`
`Peter Weygant
`
`Hewlett-Packard Company
`
`Prentice Hall PTR
`Upper Saddle River, New Jersey 07458
`
`
`
`
Editorial/Production Supervision: Joanne Anzalone
`Acquisitions Editor: Karen Gettman
`Manufacturing Manager: Alexis R. Heydt
`Cover Design: Design Source
`Manager, Hewlett-Packard Press: Pat Pekary
`
`© 1996 by Hewlett-Packard Company
`
`Published by Prentice Hall PTR
`Prentice-Hall, Inc.
`A Simon & Schuster Company
`Upper Saddle River, NJ 07458
`
`All rights reserved. No part of this book may be
`reproduced, in any form or by any means, without
`permission in writing from the publisher.
`MC/ServiceGuard and MC/LockManager are registered trademarks of Hewlett-Packard
`Company. Oracle is a trademark of Oracle Corporation. Symmetrix and EMC are
`trademarks of EMC Corporation. NFS is a trademark of Sun Microsystems, Inc. UNIX
`is a registered trademark in the United States and in other countries, licensed exclusively
`through X/Open Company, Ltd.
`The publisher offers discounts on this book when ordered in bulk quantities.
`For more information, contact the Corporate Sales Department, PTR Prentice Hall, One
`Lake Street, Upper Saddle River, NJ 07458. Phone: 800-382-3419. FAX: 201-236-7141. e-mail:
`corpsales@prenhall.com
`
`Printed in the United States of America
`10 9 8 7
`
`ISBN 0-13-494758-4
`
`HP Part Number B3936-90007
`
`Prentice-Hall International (UK) Limited, London
`Prentice-Hall of Australia Pty. Limited, Sydney
`Prentice-Hall of Canada, Inc., Toronto
`Prentice-Hall Hispanoamericana S.A., Mexico
`Prentice-Hall of India Private Limited, New Delhi
`Prentice-Hall of Japan, Inc., Tokyo
`Simon & Schuster Asia Pte. Ltd., Singapore
`Editora Prentice-Hall do Brasil, Ltd., Rio de Janeiro
`
`
`
`
`Contents
`
`Foreword
`Preface
`Acknowledgements
`About the Author
`
1 BASIC HIGH AVAILABILITY CONCEPTS

What is High Availability?
Available
Highly Available
Highly Available Computing
Service Levels
Continuous Availability
Fault Tolerance
Matching Availability to User Needs
Choosing a Solution
High Availability as a Business Requirement
High Availability as Insurance
High Availability as Opportunity
Cost of High Availability
What Are the Measures of High Availability?
Calculating Availability
Expected Period of Operation
Calculating Mean Time Between Failures
Understanding the Obstacles to High Availability
Duration of Outages
Time Lines for Outages
Causes of Planned Downtime
Causes of Unplanned Downtime
Severity of Unplanned Outages
Designing for Reaction to Failure
Identifying Points of Failure
Preparing Your Organization for High Availability
Stating Availability Goals
Building the Appropriate Physical Environment
Creating Automated Processes
Using a Development and Test Environment
Maintaining a Stock of Spare Parts
Defining an Escalation Process
Planning for Disasters
Training System Administration Staff
Using Dry Runs
Documenting Every Detail
The Starting Point for a Highly Available System
Basic Hardware Reliability
Software Quality
Intelligent Diagnostics
Comprehensive System Management Tools
Maintenance and Support Services
Moving to High Availability
Summary

2 CREATING A HIGH AVAILABILITY CLUSTER

Identifying Single Points of Failure in a Stand-alone System
Eliminating Power Sources as Single Points of Failure
Individual UPS Units
Power Passthrough UPS Units
Eliminating Disks as Single Points of Failure
Data Protection with Disk Arrays
Data Protection with Software Mirroring
Eliminating the SPU as a Single Point of Failure
Eliminating Single Points of Failure in Networks
Points of Failure in Client Connectivity
Examples of Points of Failure
Points of Failure in Inter-Node Communication
Eliminating the Failure Points
Providing Redundant LAN Connections
Configuring Local Switching of LAN Interfaces
Providing Redundant FDDI Connections
Using Dual Attached FDDI
Redundancy for Dialup Lines, Hardwired Serial Connections and X.25
Eliminating Software as a Single Point of Failure
Tailoring Applications for Cluster Use
Implementing the High Availability Cluster
Complete High Availability Solution

3 HP's HIGH AVAILABILITY CLUSTER COMPONENTS

Choosing HA Architectures and Cluster Components
Active/Standby Configurations Using MC/ServiceGuard
Active/Active Configurations Using MC/ServiceGuard
How MC/ServiceGuard Works
Parallel Database Configuration Using MC/LockManager
Oracle Parallel Server
How MC/LockManager Works with OPS
Selecting Other HA Subsystems
MirrorDisk/UX
High Availability Disk Storage Enclosure
High Availability Disk Arrays
EMC Disk Arrays
Journaled File System
OnLineJFS
Transaction Processing Monitors
Uninterruptible Power Supplies
System and Network Management Tools
Using Mission Critical Consulting and Support Services
Availability Management Service
Business Continuity Support
Business Recovery Services

4 SAMPLE HIGH AVAILABILITY SOLUTIONS

Highly Available NFS System for Publishing
High Availability Software and Packages
Hardware Configuration
Responses to Failures
Stock Quotation Service
High Availability Software and Packages
Hardware Configuration
Responses to Failures
Order Entry and Catalog Application
High Availability Software and Packages
Hardware Configuration
Responses to Failures
Insurance Company Database
Two-Node OPS Configuration

5 GLOSSARY OF HIGH AVAILABILITY TERMINOLOGY

AdminCenter
Adoptive Node
ADT
AFR
Alternate Node
Annualized Failure Rate
Architecture for HA
Availability
Average Downtime
Cluster
ClusterView
Continuous Availability
Custody
Downtime
Failure
Failover
Fault Tolerance
Grouped Net
Hardware Mirroring
Highly Available
Hot Plug Capability
Hot Swap Capability
LAN
LAN interface
Logical Volume Manager
MC/LockManager
MC/ServiceGuard
Mean Time Between Failures
Mean Time to Repair
MirrorDisk/UX
Mirroring
MTBF
MTTR
Network Node Manager
Node
OpenView
OperationsCenter
Planned Downtime
Primary Node
Package
Process Resource Manager
RAID
Redundancy
Reliability
Relocatable IP Address
Service
Service Level Agreement
Shared Logical Volume Manager
Single Point of Failure
SLVM
Software Mirroring
SPOF
SPU
Subnet
SwitchOver/UX
System Processor Unit
Transfer of Packages
Unplanned Downtime
Volume Group

Index
`
`
`
`Foreword
`
`
Over the last ten years, UNIX systems have moved from the
specialized role of providing desktop computing power for engineers
into the broader arena of commercial computing. This evolution is
the result of continual dramatic improvements in functionality,
reliability, performance, and supportability. We are now well into
the next phase of the UNIX evolution: providing solutions for
mission critical computing.

To best meet the requirements of the data center for availability,
scalability, and flexibility, Hewlett-Packard has developed a
robust cluster architecture for HP-UX that combines multiple
systems into a high availability cluster. Individual computers, known
as nodes, are connected in a loosely-coupled manner, each
maintaining its own separate processors, memory, operating system,
and storage devices. Special system processes bind these nodes
together and allow them to cooperate to provide outstanding levels
of availability and flexibility for supporting mission critical
applications. The nodes in a cluster can be configured either to share
data on a set of disks or to obtain exclusive access to data.

To maintain Hewlett-Packard's commitment to the principles of
open systems, our high availability clusters use standards-based
hardware components such as SCSI disks and Ethernet LANs.
There are no proprietary APIs that force vendor lock-in, and most
applications will run on a high availability cluster without
modification.
`
`
`
`
`As the world's leading vendor of open systems, Hewlett-Packard
`is especially proud to publish this primer on cluster solutions for
`high availability. Peter Weygant has done a fine job of presenting
`the basic concepts, architectures, and terminology used in HP's
`cluster solutions. This is the place to begin your exploration of the
`world of high availability clusters.
`
`Xuan Bui
`Hewlett-Packard General Systems Division
`Research and Development Laboratory Manager
`
`
`
`
`Preface
`
`
`This guide is about high availability (HA) computing through enterprise
`clusters. It presents basic concepts and terms, then describes the use of
`cluster technology to provide highly available open systems solutions
`for the commercial enterprise. Here are the topics:
`
• Chapter 1, "Basic High Availability Concepts," presents the
language used to describe highly available systems and components and
introduces ways of measuring availability.

• Chapter 2, "Creating a High Availability Cluster," describes in more
detail the principles of HA configuration, with examples.

• Chapter 3, "HP's High Availability Cluster Components," is an
overview of HP's current roster of high availability software and
hardware offerings.

• Chapter 4, "Sample HA Solutions," discusses a few concrete
examples of highly available cluster solutions.

• Chapter 5, "Glossary," gives definitions of important words and
phrases used to describe high availability.

Additional information is available in the HP publications Managing
MC/ServiceGuard and Configuring OPS Clusters with MC/LockManager.
The HP 9000 Servers Configuration Guide contains detailed
information about supported high availability configurations. This and
other more specialized documents on enterprise clusters are available
from your HP representative.
`
`
`
`
`Acknowledgments
`
This book has benefited from the careful review of many individuals
inside and outside of Hewlett-Packard. The author gratefully
acknowledges the contributions of these colleagues, many of whom are listed
here: Joe Algieri, Sally Anderson, Joe Bac, Bob Baird, Trent Bass, Dan
Beringer, Claude Brazell, Thomas Buenermann, Xuan Bui, Karl-Heinz
Busse, Bruce Campbell, Larry Cargnoni, Gina Cassinelli, Marian
Cochran, Annie Cooperman, Ron Czinski, Dan Dickerman, Pam
Dickerman, Larry Dino, Janie Felix, John Foxcroft, Shivaji Ganesh, Janet
Gee, Mike Gutter, Terry Hand, Michael Hayward, Frank Ho, Margaret
Hunter, Lisa Iarkowski, Art Ipri, Michael Kahn, Marty King, Clark
Macaulay, Gary Marcos, Debby McIsaac, Doug McKenzie, Tim
Metcalf, Parissa Mohamadi, Alex Morgan, Markus Ostrowicki, Bob Ramer,
Bob Sauers, Wesley Sawyer, David Scott, Dan Shive, Christine Smith,
Eric Soderberg, Steve Stichler, Tim Stockwell, Brad Stone, Liz Tam,
Bob Togasaki, Emil Velez, Tad Walsh, and Bev Woods. A special thank
you goes to those groups of Hewlett-Packard customers who read and
commented on early versions of the manuscript. Errors and omissions
are the author's sole responsibility.
`
`
`
`
`About the Author
`
`
Peter S. Weygant is a Learning Products Engineer in the General
Systems Solutions laboratory at Hewlett-Packard. Formerly a professor of
English, he has been a technical writer and consultant in the computer
industry for the last 15 years. He has developed documentation and
managed publication projects in the areas of digital imaging, relational
database technology, and high availability systems. He has a BA degree
in English Literature from Colby College as well as MA and PhD
degrees in English from the University of Pennsylvania.
`
`
`
`
`CHAPTER 1
`Basic High Availability
`Concepts
`
This book takes an elementary look at high availability
(HA) computing and how it is implemented through
enterprise-level cluster solutions. We start in this chapter
with some of the basic concepts of HA. Here's what we'll
cover:
`
`• What is High Availability?
`• High Availability as a Business Requirement
`• What Are the Measures of High Availability?
`• Understanding the Obstacles to High Availability
`• Preparing Your Organization for High Availability
`• The Starting Point for a High Availability System
`• From High Reliability to High Availability
`• Designing a Highly Available System
`
`
`Later chapters explore the implementation of high
`availability in clusters, then describe HP's high availability
`products in more detail. A separate chapter is devoted to
`concrete examples of business solutions that use HA.
`
`
`What is High Availability?
`
Before exploring the implications of high availability
in computer systems, we need to define some terms. What
do we mean by phrases like "availability," "high
availability," and "high availability computing?"
`
`Available
`
`The term available describes a system that provides a
`specific level of service as needed. This idea of availability
`is part of everyday thinking. In computing, availability is
`generally understood as the period of time when services
`are available (for instance, 16 hours a day, six days a week)
`or as the time required for the system to respond to users
`(for example, under 1 second response time). Any loss of
`service, whether planned or unplanned, is known as an
`outage. Downtime is the duration of an outage measured
`in units of time (e.g., minutes or hours).
`
`
`Highly Available
`
`Figure 1.1 Highly Available Services: Electricity
`
Highly available characterizes a system that is
designed to avoid the loss of service by reducing or
managing failures as well as minimizing planned downtime for
the system. We expect a service to be highly available when
life, health, and well-being, including the economic
well-being of a company, depend on it.
`
`
For example, we expect electrical service to be highly
available. All but the smallest, shortest outages are
unacceptable, since we have geared our lives to depend on
electricity for refrigeration, heating, and lighting, in addition to
less important daily needs.
`
`Even the most highly available services occasionally
`go out, as anyone who has experienced a blackout or
`brownout in a large city can attest. But in these cases, we
`expect to see an effort to restore service at once. When a
`failure occurs, we expect the electric company to be on the
`road fixing the problem as soon as possible.
`
`Highly Available Computing
`
`In many businesses, the availability of computers has
`become just as important as the availability of electric
`power itself. Highly available computing uses computer
`systems which are designed and managed to operate with only
`a small amount of planned and unplanned downtime.
`
Note that highly available is not an absolute. The needs
of different businesses for high availability are quite
diverse. International businesses or companies running
multiple shifts may require user access to databases around
the clock. Financial institutions must be able to transfer
funds at any time of night or day, seven days a week. On
the other hand, some retail businesses may require the
computer to be available only 18 hours a day, but during
these 18 hours they may require sub-second response time
for transaction processing.

Figure 1.2 Service Outage
`
`Service Levels
`
The service level of a system is the degree of service
the system will provide to its users. Often, the service level
is spelled out in a document known as a service level
agreement (SLA). The service levels your business requires
determine the kind of applications you develop, and high
availability systems provide the hardware and software
framework in which these applications can work
effectively to provide the needed level of service. High
availability implies a service level in which both planned and
unplanned computer outages do not exceed a small stated
value.
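
The idea of a stated limit on outages can be illustrated with a short sketch. The Outage record, the two helper functions, and the sample figures below are hypothetical; they are not taken from any HP product or from a real service level agreement.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    """One loss of service (hypothetical record layout)."""
    hours: float    # downtime caused by this outage
    planned: bool   # True for scheduled maintenance, False for a failure


def total_downtime(outages):
    """Sum planned and unplanned downtime, in hours."""
    return sum(o.hours for o in outages)


def meets_service_level(outages, max_downtime_hours):
    """True if total downtime stays within the stated limit."""
    return total_downtime(outages) <= max_downtime_hours


# Hypothetical month: one planned upgrade and two failures.
month = [Outage(2.0, True), Outage(0.5, False), Outage(0.25, False)]
print(total_downtime(month))            # 2.75
print(meets_service_level(month, 4.0))  # True
print(meets_service_level(month, 2.0))  # False
```

Planned and unplanned outages are counted together here because the definition above limits both kinds.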
`
`Continuous Availability
Continuous availability means non-stop service, that
is, there are no planned or unplanned outages at all. This is
a much more ambitious goal than high availability, since
there can be no lapse in service. In effect, continuous
availability is an ideal state rather than a characteristic of any
real world system.

The term is sometimes used to indicate a very high
level of availability in which only a very small known
quantity of downtime is acceptable. Note that high
availability does not imply continuous availability.
`
`Fault Tolerance
Fault tolerance is not a degree of availability so much
as a method for achieving very high levels of availability. A
fault tolerant system is characterized by redundancy in
most of the hardware components, including CPU,
memory, I/O subsystems, and other elements. A fault tolerant
system is one that has the ability to continue service in spite
of a hardware or software failure. However, even fault
tolerant systems are subject to outages from human error.
Note that high availability does not imply fault tolerance.
`
`
`Matching Availability to User Needs
`
A failure affects availability when it results in an
unplanned loss of service that lasts long enough to create a
problem for users of the system. User sensitivity will
depend on the specific application. For example, a failure
that is corrected within one second may not result in any
perceptible loss of service in an environment that does
online transaction processing (OLTP); but for a scientific
application that runs in a real-time environment, one
second may be an unacceptable interval.
`
`Since any component can fail, the challenge is to
`design systems in which problems can be predicted and
`isolated before a failure occurs and in which failures are
`quickly detected and corrected when they happen.
`
`Choosing a Solution
`
Your exact requirements for availability determine the
kind of solution you need. For example, if the loss of a
system for a few hours of planned downtime is acceptable to
you, then you may not need to purchase storage products
with hot pluggable disks. On the other hand, if you cannot
afford a planned period of maintenance during which a
disk replacement could be done on a mirrored disk system,
then you may wish to consider an HA disk array that
supports hot plugging or hot swapping of components.
(Descriptions of these HA products appear in later
sections.)
`
`
`High Availability as a Business
`Requirement
`
In the current business climate, high availability
computing is often seen as a requirement, not a luxury. From
one perspective, high availability is a form of insurance
against the loss of business due to computer downtime.
From another point of view, high availability provides new
opportunities by allowing your company to provide better
and more competitive customer service.
`
`High Availability as Insurance
`
`High availability computing is often seen as insurance
`against the following kinds of damage:
`
`• Loss of income
`• Customer dissatisfaction
`• Missed opportunities
`
For commercial computing, a highly available
solution is needed when loss of the system results in loss of
revenue. In such cases, the application is said to be
mission-critical. For all mission-critical applications, that is, where
income may be lost through downtime, high availability
is a requirement. In banking, for example, the ability to
obtain certain account balances 24 hours a day may be
mission-critical. In parts of the securities business, the need for
high availability may only be for that portion of the day
when the stock market is active; at other times, systems
may be safely brought down.
`
`High Availability as Opportunity
`
Highly available computing provides a business
opportunity, since there is an increasing demand for
"around the clock" computerized services in areas as
diverse as banking and financial market operations,
communications, order entry and catalog services, resource
management, and others. It is not possible to give a simple
definition of when an application is mission-critical or of
when high availability of the application creates new
opportunities; this depends on the nature of the business.
However, in any business that depends on computers, the
following principles are always true:
`
`• The degree of availability required is determined by
`business needs. There is no absolute amount of
`availability that is right for all businesses.
`• There are many ways to achieve high availability.
`• The means of achieving high availability affects all
`aspects of the system.
`• The likelihood of failure can be reduced by creating
`an infrastructure that stresses clear procedures and
`preventive maintenance.
`• Recovery from failures must be planned.
`
`
Some or all of the following are expectations for the
software applications that run in mission-critical
environments:
`
`• There should be a low rate of application failures,
`that is, a maximum time between failures.
`
`• Applications should be able to recover after failure.
`
`• There should be minimal scheduled downtime.
`
• The system should be configurable without shutdown.
`
`• System management tools must be available.
`
`Cost of High Availability
`
As with other kinds of insurance, the cost depends on
the degree of availability you choose. Thus the value of
high availability to the enterprise is directly related to the
costs of outages. The higher the cost of an outage, the easier it
becomes to justify the expense of high availability
solutions. As the degree of availability approaches the ideal of
100% availability, the cost of the solution increases more
rapidly. Thus, the cost of 99.95% availability is significantly
greater than the cost of 99.5% availability, and the cost of
99.5% availability is significantly greater than the cost of 99%
availability, and so on.
`
`
`What Are the Measures of High
`Availability?
`
Availability and reliability can be described in
terms of numbers, though doing so can be very
misleading. In fact, there is no standard method for
modeling or calculating the degree of availability in a
computer system. The important thing is to create clear
definitions of what the numbers mean and then use
them consistently. Remember that availability is not a
measurable attribute of a system like CPU clock speed.
Availability can only be measured historically, based on
the behavior of the actual system. Moreover, in
measuring availability, it is important to ask not simply, "Is the
application available?" but "Is the entire system
providing service at the proper level?"
`
`Availability is related to reliability, but they are not the
`same thing. Availability is the percentage of total system
`time the computer system is accessible for normal usage.
`Reliability is the amount of time before a system is
`expected to fail. Availability includes reliability.
`
`Calculating Availability
`
`The formula in Figure 1.3 defines availability as the
`percentage of elapsed time that a unit can be used. Elapsed
`time is continuous time (operating time + downtime).
`
\[
\%\ \text{Availability} =
  \frac{\text{Total Elapsed Time} - \sum \text{Inoperative Times}}
       {\text{Total Elapsed Time}} \times 100
\]

Figure 1.3 Availability
`
Availability is actually the probability that a unit is
available (that is, operating normally). Availability is
usually expressed as a percentage of hours per week, month, or
year during which the system and its services can be used
for normal business.
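
The formula in Figure 1.3 can be sketched in a few lines of Python. This is only an illustration: the function name, the 720-hour month, and the outage durations are assumptions made for the example, not figures from the text.

```python
def availability_percent(total_elapsed_hours, inoperative_hours):
    """Percentage of elapsed time during which the unit could be used.

    total_elapsed_hours: continuous time (operating time + downtime)
    inoperative_hours:   iterable of individual outage durations
    """
    if total_elapsed_hours <= 0:
        raise ValueError("total elapsed time must be positive")
    downtime = sum(inoperative_hours)
    if downtime > total_elapsed_hours:
        raise ValueError("downtime cannot exceed total elapsed time")
    return 100.0 * (total_elapsed_hours - downtime) / total_elapsed_hours


# Hypothetical example: a 30-day month (720 hours) with three outages.
month = availability_percent(720, [2.0, 0.5, 1.1])
print(f"{month:.2f}%")  # 99.50%
```

Because the result is computed from recorded outages, it describes how the system behaved over a past period rather than a fixed property of the hardware.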
`
`Expected Period of Operation
`
Measures of availability must be seen against the
background of the organization's expected period of
operation of the system. The following tables show the actual
hours of uptime and downtime associated with different
percentages of availability for two common periods of
operation. Table 1.1 shows 24x7x365, which stands for a
system that is expected to be in use 24 hours a day, seven
days a week, 365 days a year.
`
Table 1.1 Uptime and Downtime for a 24x7x365 System

Availability   Minimum Expected   Maximum Allowable   Remaining
               Uptime (hours)     Downtime (hours)    Time (hours)
99%            8672               88                  0
99.5%          8716               44                  0
99.95%         8755               5                   0
100%           8760               0                   0
`
This table shows that there is no remaining time on the
system at all. All the available time in the year (8760 hours)
is accounted for. This means that all maintenance must be
carried out either when the system is up or during the
allowable downtime hours. In addition, the higher the
percentage of availability, the less time is allowable for failure.
`
`Table 1.2 shows a 12x5x52 system, which is expected
`to be up for 12 hours a day, five days a week, 52 weeks a
`year.
`
`
Table 1.2 Uptime and Downtime for a 12x5x52 System

Availability   Minimum Expected   Maximum Allowable   Remaining
               Uptime (hours)     Downtime (hours)    Time (hours)
99%            3088               32                  5642
99.5%          3104               16                  5642
99.95%         3118               2                   5642
100%           3118               0                   5642
`
`This table shows that for the 12x5x52 system, there are
`5642 hours of remaining time, which can be used for
`planned maintenance operations requiring the system to be
`down.
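
The uptime and downtime columns in these tables follow from simple arithmetic on the expected period of operation. The sketch below is one way to do that arithmetic; the function name is hypothetical, and rounding the allowable downtime up to the next whole hour is an assumption, chosen because it reproduces the downtime figures printed in Tables 1.1 and 1.2.

```python
import math


def uptime_downtime(period_hours, availability_pct):
    """Return (minimum uptime, maximum downtime) in whole hours.

    Downtime is rounded up to the next whole hour (an assumption),
    and uptime is whatever remains of the period.
    """
    downtime = math.ceil(period_hours * (100 - availability_pct) / 100)
    return period_hours - downtime, downtime


HOURS_24x7x365 = 24 * 365    # 8760 hours in the period
HOURS_12x5x52 = 12 * 5 * 52  # 3120 hours in the period

for label, period in (("24x7x365", HOURS_24x7x365),
                      ("12x5x52", HOURS_12x5x52)):
    for pct in (99, 99.5, 99.95, 100):
        up, down = uptime_downtime(period, pct)
        print(f"{label} {pct}%: uptime {up} h, downtime {down} h")
```

Hours outside the expected period of operation do not enter this calculation; for the 12x5x52 system they form the remaining time that can absorb planned maintenance.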
`
`Calculating Mean Time Between Failures
Availability is related to failure rates of system
components. A common measure of equipment reliability is the
mean time between failures (MTBF). This measure is
usually provided for individual system components, such as
disks. Measures like these are useful, but they are only one
dimension of the complete picture of high availability. For
example, they do not take into account the differences in
recovery times after failure.
`
`MTBF is given by the formula shown in Figure 1.4.
`
\[
\text{MTBF} = \frac{\text{Total Operating Time}}{\text{Total No. of Failures}}
\]

Figure 1.4 Mean Time Between Failures
`
The MTBF is calculated by summing the actual
operating times of all units, including units that do not fail, and
dividing that sum by the sum of all failures of the units.
Operating time is the sum of the hours when the system is
in use (that is, not powered off).
`
`The MTBF is a statement of the time between failures
`of a unit or units. In common applications, the MTBF is
`used as a statement of the expected future performance
`based on the past performance of a unit or population of
`units. The failure rate is assumed to remain constant when
`the MTBF is used as a predictive reliability measure.
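
As a rough illustration of the calculation described above, the following sketch computes MTBF for a small population of units. The function name and the sample data (four disks observed for one year each) are hypothetical.

```python
def mtbf_hours(operating_hours_per_unit, failures_per_unit):
    """Total operating time of all units divided by total failures.

    Units that never failed still contribute their operating hours.
    """
    total_failures = sum(failures_per_unit)
    if total_failures == 0:
        raise ValueError("no failures observed; MTBF is undefined")
    return sum(operating_hours_per_unit) / total_failures


# Hypothetical sample: four disks, each powered on for 8760 hours;
# two of them failed once and two did not fail.
hours = [8760, 8760, 8760, 8760]
failures = [1, 0, 1, 0]
print(mtbf_hours(hours, failures))  # 17520.0
```

Used as a prediction, this figure carries the assumption stated above that the failure rate stays constant.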
`
`
When gauging reliability for multiple instances of the
same unit, the individual MTBF figures are divided by the
number of units. This may result in much lower MTBF
figures for the disks in the system as a whole. For example, if
the MTBF for a disk mechanism is 500,000 hours, and the
MTBF of a disk module including fans and power supplies
is 200,000 hours, then the MTBF of 200 disks together in the
system is 1000 hours, which means about 9 expected
failures a year. The point is that the greater the number of units
operating together in a group, the greater the expected
failure rate within the group.
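
The disk example can be checked with a few lines of arithmetic. In this sketch the function names are hypothetical, and a year is taken as 8760 hours of continuous operation, which is an assumption about how the "9 expected failures a year" figure was reached.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, assuming continuous operation


def group_mtbf(unit_mtbf_hours, unit_count):
    """MTBF of a group of identical units operating together."""
    return unit_mtbf_hours / unit_count


def expected_failures_per_year(mtbf_hours):
    """Expected failures in one year, assuming a constant failure rate."""
    return HOURS_PER_YEAR / mtbf_hours


# 200 disk modules, each with an MTBF of 200,000 hours.
disks = group_mtbf(200_000, 200)
print(disks)                                        # 1000.0
print(round(expected_failures_per_year(disks), 2))  # 8.76
```

The result of 8.76 rounds to the "about 9" failures a year given in the text.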
`
`Understanding the Obstacles to High
`Availability
`
`It is important to understand the obstacles to high
`availability computing. This section describes some terms
`that people often use to describe these obstacles.
`
`A specific loss of a computer service as perceived by
`the user is called an