`
O'REILLY®
`
`Duane Wessels
`
`AHBLT-2013.001
`
`
`
`Web Caching
`
`O'REILLY®
Beijing · Cambridge · Farnham · Köln · Paris · Sebastopol · Taipei · Tokyo
`
`Web Caching
`by Duane Wessels
`
Copyright © 2001 O'Reilly & Associates, Inc. All rights reserved.
`Printed in the United States of America.
`
`Published by O'Reilly & Associates, Inc., 101 Morris Street, Sebastopol, CA 95472.
`
Editors: Nathan Torkington and Paula Ferguson
`
`Production Editor: Leanne Clarke Soylemez
`
`Cover Designer: Edie Freedman
`
`Printing History:
`
June 2001: First Edition.
`
`Nutshell Handbook, the Nutshell Handbook logo, and the O'Reilly logo are registered
`trademarks of O'Reilly & Associates, Inc. Many of the designations used by manufacturers
`and sellers to distinguish their products are claimed as trademarks. Where those designations
`appear in this book, and O'Reilly & Associates, Inc. was aware of a trademark claim, the
`designations have been printed in caps or initial caps. The association between the image of
a rock thrush and web caching is a trademark of O'Reilly & Associates, Inc.
`
`While every precaution has been taken in the preparation of this book, the publisher assumes
`no responsibility for errors or omissions, or for damages resulting from the use of the
`information contained herein.
`
`Library of Congress Cataloging-in-Publication Data
`
`Wessels, Duane.
`Web Caching/Duane Wessels
`p. cm.
`ISBN 1-56592-536-X
1. Cache memory. 2. Browsers (Computer programs) 3. Software configuration
`management. 4. World Wide Web. I. Title.
`
`TK7895.M4 W45 2001
004.5'3--dc21
`
`ISBN: 1-56592-536-X
[C]
`
`2001033173
`
`Table of Contents
`
`Preface ..................................................................................................................... ix
`
`1. Introduction .................................................................................................. 1
`1.1 Web Architecture ........................................................................................ 2
`1.2 Web Transport Protocols ........................................................................... 6
1.3 Why Cache the Web? ............................................................................... 10
`1.4 Why Not Cache the Web? ........................................................................ 13
1.5 Types of Web Caches ............................................................................ 14
`1.6 Caching Proxy Features ........................................................................... 17
`1.7 Meshes, Clusters, and Hierarchies .......................................................... 18
`1.8 Products .................................................................................................... 19
`
`2. How Web Caching Works ....................................................................... 21
`2.1 HTTP Requests ......................................................................................... 21
2.2 Is It Cachable? .......................................................................................... 24
`2.3 Hits, Misses, and Freshness ..................................................................... 34
`2.4 Hit Ratios .................................................................................................. 37
`2.5 Validation ................................................................................................. 38
`2.6 Forcing a Cache to Refresh ..................................................................... 41
`2.7 Cache Replacement ................................................................................. 44
`
`3. Politics of Web Caching ........................................................................... 48
`3.1 Privacy ...................................................................................................... 49
`3.2 Request Blocking ..................................................................................... 55
`3.3 Copyright .................................................................................................. 57
`3.4 Offensive Content .................................................................................... 63
3.5 Dynamic Web Pages ................................................................................ 64
`3.6 Content Integrity ...................................................................................... 65
`3.7 Cache Busting and Server Busting .......................................................... 66
`3.8 Advertising ............................................................................................... 68
`3.9 Trust .......................................................................................................... 69
`3.10 Effects of Proxies ................................................................................... 70
`
`4. Configuring Cache Clients ..................................................................... 72
`4.1 Proxy Addresses ....................................................................................... 73
`4.2 Manual Proxy Configuration ................................................................... 73
`4.3 Proxy Auto-Configuration Script ............................................................. 77
`4.4 Web Proxy Auto-Discovery ..................................................................... 83
`4.5 Other Configuration Options .................................................................. 84
4.6 The Bottom Line ...................................................................................... 84
`
`5. Interception Proxying and Caching .................................................. 86
`5.1 Overview .................................................................................................. 87
`5.2 The IP Layer: Routing .............................................................................. 89
5.3 The TCP Layer: Ports and Delivery ......................................................... 96
5.4 The Application Layer: HTTP ............................................................... 100
`5.5 Debugging Interception ........................................................................ 101
5.6 Issues ...................................................................................................... 102
`5.7 To Intercept or Not To Intercept .......................................................... 108
`
`6. Configuring Servers to Work with Caches .................................... 109
6.1 Important HTTP Headers ...................................................................... 110
`6.2 Being Cache-Friendly ............................................................................ 115
`6.3 Being Cache-Unfriendly ........................................................................ 127
6.4 Other Issues for Content Providers ...................................................... 128
`
`7. Cache Hierarchies .................................................................................. 132
`7.1 How Hierarchies Work .......................................................................... 132
`7.2 Why Join a Hierarchy? ........................................................................... 134
`7.3 Why Not Join a Hierarchy? .................................................................... 136
`7.4 Optimizing Hierarchies ......................................................................... 142
`
`8. Intercache Protocols .............................................................................. 144
8.1 ICP .......................................................................................................... 145
`8.2 CARP ...................................................................................................... 156
`8.3 HTCP ...................................................................................................... 158
`8.4 Cache Digests ........................................................................................ 159
`8.5 Which Protocol to Use .......................................................................... 163
`
`9. Cache Clusters ......................................................................................... 165
`9.1 The Hot Spare ........................................................................................ 166
`9.2 Throughput and Load Sharing .............................................................. 167
`9.3 Bandwidth .............................................................................................. 168
`
`10. Design Considerations for Caching Services ............................... 170
`10.1 Appliance or Software Solution .......................................................... 170
`10.2 Disk Space ........................................................................................... 173
`10.3 Memory ................................................................................................ 175
`10.4 Network Interfaces .............................................................................. 175
`10.5 Operating Systems ............................................................................... 176
`10.6 High Availability .................................................................................. 177
10.7 Intercepting Traffic .............................................................................. 178
`10.8 Load Sharing ........................................................................................ 179
`10.9 Location ................................................................................................ 180
`10.10 Using a Hierarchy .............................................................................. 180
`
`11. Monitoring the Health of Your Caches .......................................... 182
`11.1 What to Monitor? ................................................................................. 183
`11.2 Monitoring Tools .................................................................................. 186
`
`12. Benchmarking Proxy Caches ............................................................. 191
`12.1 Metrics .................................................................................................. 192
`12.2 Performance Bottlenecks .................................................................... 194
`12.3 Benchmarking Tools ............................................................................ 197
`12.4 Benchmarking Gotchas ....................................................................... 203
`12.5 How to Benchmark a Proxy Cache .................................................... 206
`12.6 Sample Benchmark Results ................................................................. 210
`
`A. Analysis of Production Cache Trace Data .................................... 215
`
`B. Internet Cache Protocol ....................................................................... 235
C. Cache Array Routing Protocol .......................................................... 246
`
`D. Hypertext Caching Protocol ............................................................... 254
`
`E. Cache Digests ........................................................................................... 266
`
F. HTTP Status Codes ................................................................................ 274
`
G. U.S.C. 17 Sec. 512. Limitations on Liability
Relating to Material Online ............................................................... 279
`
`H. List of Acronyms ..................................................................................... 282
`
`Bibliography ..................................................................................................... 288
`
`Index .................................................................................................................... 291
`
Chapter 2: How Web Caching Works
`
2.6.1 The no-cache Directive
The no-cache directive notifies a cache that it cannot return a cached copy. Even if
a fresh copy of the response (one with a specific expiration time) is in the cache, the
client's request must be forwarded to the origin server. RFC 2616 calls such a
request an "end-to-end reload" (Section 14.9.4). The no-cache directive is sent
when you click on the Reload button in your browser. In an HTTP request, it
looks like this:
`
GET /index.html HTTP/1.1
Cache-control: no-cache
`
`Recall that the Cache-control header does not exist in the HTTP/1.0 standard.
`Instead, HTTP/1.0 clients use a Pragma header for the no-cache directive:
`
Pragma: no-cache
`
no-cache is the only directive defined for the Pragma header in RFC 1945. For
backwards compatibility, RFC 2616 also defines the Pragma header. In fact, many
recent HTTP/1.1 browsers still use Pragma for the no-cache directive instead of the
newer Cache-control.
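
Because the directive can arrive in either header, a cache has to check both. The following is a minimal Python sketch of that check, not code from any particular product; the function name and the dict-based representation of request headers are assumptions made for illustration.

```python
def wants_end_to_end_reload(headers):
    """Return True if a request carries the no-cache directive.

    `headers` is a hypothetical dict mapping request header names to values.
    """
    # Header names are case-insensitive, so normalize them first.
    lowered = {name.lower(): value for name, value in headers.items()}

    # HTTP/1.1 style: Cache-control may list several comma-separated directives.
    cache_control = lowered.get("cache-control", "")
    if "no-cache" in [d.strip().lower() for d in cache_control.split(",")]:
        return True

    # HTTP/1.0 style: Pragma: no-cache
    pragma = lowered.get("pragma", "")
    return "no-cache" in [p.strip().lower() for p in pragma.split(",")]


print(wants_end_to_end_reload({"Cache-control": "no-cache"}))
print(wants_end_to_end_reload({"Pragma": "no-cache"}))
print(wants_end_to_end_reload({"Cache-control": "max-age=60"}))
```

When the function returns True, the cache forwards the request to the origin server instead of answering from its own store.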
`
Note that the no-cache directive does not necessarily require the cache to purge its
copy of the object. The client may generate a conditional request (with
If-modified-since or another validator), in which case the origin server's response may
be 304 (Not Modified). If, however, the server responds with 200 (OK), then the
cache replaces the old object with the new one.
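
The two outcomes can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the cache is a plain dict keyed by URL, and `handle_reload`, `origin_status`, and `origin_body` are hypothetical names, not part of any real cache.

```python
def handle_reload(cache, url, origin_status, origin_body=None):
    """Update a dict-based cache after a no-cache request was forwarded."""
    if origin_status == 304:
        # Not Modified: the cached copy is still valid, so it is kept as is.
        return cache[url]
    if origin_status == 200:
        # OK: the origin sent a full response, which replaces the old object.
        cache[url] = origin_body
        return origin_body
    # Any other status is passed through without touching the cache.
    return origin_body


cache = {"/index.html": "old page"}
print(handle_reload(cache, "/index.html", 304))
print(handle_reload(cache, "/index.html", 200, "new page"))
```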
`
`The interaction between no-cache and If-modified-since is tricky and often the
`source of some confusion. Consider, for example, the following sequence of
`events:
`
`1. You are viewing an HTML page in your browser. This page is cached in your
`browser and was last modified on Friday, February 16, 2001, at 12:00:00.
`
`2. The page author replaces the current HTML page with an older, backup copy
`of the page, perhaps with this Unix command:
`
`mv index.html.old index.html
`
`Now there is a "new" version of the HTML page on the server, but it has an
`older modification timestamp.
`
`3. You try to reload the HTML page by using the Reload button. Your browser
`sends this request:
`
`GET http://www.foo.com/index.html
Pragma: no-cache
`If-Modified-Since: Fri, 16 Feb 2001 09:46:18 GMT
`
4. The origin server sends a 304 (Not Modified) response and your browser
displays the same page as before.
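
The 304 in step 4 follows from a simple date comparison. The sketch below assumes an origin server that returns the full object only when the file's modification time is later than the If-Modified-Since date; real servers may compare differently. The function name and the backup file's timestamp are hypothetical.

```python
from email.utils import parsedate_to_datetime


def origin_status(last_modified, if_modified_since):
    """Return 200 if the file changed after the client's date, else 304."""
    modified = parsedate_to_datetime(last_modified)
    client_date = parsedate_to_datetime(if_modified_since)
    return 200 if modified > client_date else 304


# The restored backup copy carries an older timestamp (assumed here),
# so the comparison fails and the server answers 304.
print(origin_status("Mon, 05 Feb 2001 08:00:00 GMT",
                    "Fri, 16 Feb 2001 09:46:18 GMT"))
```

Under this assumption, the "new" page can never look newer than the copy in the browser's cache, no matter how often the request is repeated.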
`
`You could click on Reload until your mouse wears out and you would never get
`the "new" HTML page. What can you do to see the correct page?
`
If you are using Netscape Navigator, you can hold the Shift key down while clicking
on Reload. This instructs Netscape to leave out the If-modified-since header.
If you use Internet Explorer, hold down the Ctrl key while clicking on Reload.
Alternatively, you can flush your browser's cache and then press Reload, which
prevents the browser from sending an If-modified-since header in its request.
Note that this is a user-agent problem, not a caching proxy problem.
`
In addition to the above problem, the Reload button, as implemented in most web
browsers, leaves much to be desired. For example, it is not possible to reload a
single inline image object. Similarly, it is not possible to reload web objects that
are displayed externally from the browser, such as sound files and other
"application" content types. If you need to refresh an image, PostScript document, or other
externally displayed object, you may need to ask the cache administrator to do it
for you. Some caches may have a web form that allows you to refresh cache
objects. For this you need to know (and type in) the object's full URL.
`
Another problem with Reload is that it is often misused simply to re-request a page.
When the Web seems slow, we often interrupt a request as the page is being
retrieved. To request the page again, you might use the Reload button. This, of
course, sends the no-cache directive. Browsers do not have a button that
requests a page again without sending no-cache. You can accomplish this by
simply moving the cursor to the URL location box and pressing the Enter key.
`
As a cache administrator, you might wonder if caches ever can, or should, ignore
the no-cache directive. A person who keeps a close watch on bandwidth usage
might have the impression that the Reload button gets used much more often than
necessary. Some products, such as Squid, have features that provide special
treatment for no-cache requests. However, I personally do not recommend enabling
these features because they violate the HTTP/1.1 protocol and leave users unable
to get up-to-date information. One Squid option turns a no-cache request into an
If-modified-since request. Another ignores the no-cache directive entirely.
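
To make the first of these options concrete, here is a rough Python sketch of what turning a no-cache request into an If-modified-since request amounts to. It is not Squid's implementation or its configuration syntax; the function name, the dict of headers, and the `cached_last_modified` parameter are assumptions for illustration only.

```python
def reload_into_ims(headers, cached_last_modified):
    """Rewrite a no-cache request as a conditional request.

    Simplification: the whole Cache-control header is dropped, even if it
    carried other directives besides no-cache.
    """
    dropped = ("pragma", "cache-control", "if-modified-since")
    rewritten = {name: value for name, value in headers.items()
                 if name.lower() not in dropped}
    # Validate against the cached copy's date instead of forcing a reload.
    rewritten["If-Modified-Since"] = cached_last_modified
    return rewritten


print(reload_into_ims({"Pragma": "no-cache", "Accept": "text/html"},
                      "Fri, 16 Feb 2001 12:00:00 GMT"))
```

The rewritten request no longer forces the origin server to send the full object, which is why a user who clicks Reload may still receive the cached copy.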
`
2.6.2 The max-age Directive
`The max-age directive specifies in seconds the maximum age of a cached response
`that the client is willing to accept. Whereas no-cache means "I won't accept any
`