`
`Reducing Network Traffic
`
`O’RE ILLY*
`
`Duane Wessels
`
`
`
`Web Caching
`
`Duane Wessels
`
`O’REILLY*
`Beijing » Cambridge « Farnbam « Kdln « Parts « Sebastopol - Taipei « Tokyo
`
`
`
`Web Caching
`by Duane Wessels
`Copyright © 2001 C’Reilly & Associates, Inc. All rights reserved.
`Printed in the United States of America.
`
`Published by O'Reilly & Associates, Inc., 101 Morris Street, Sebastopol, CA 95472.
`
`Edifors: Sathan Torkingion and Paula Ferguson
`
`Production Editor: Leanne Clarke Soylemez
`
`Cover Designer: Edie Freedman
`
`Printing History:
`June 2001:
`
`First Edition.
`
`Nutshell Handbook, the Nutshell Handbook logo, and the O'Reilly logo are registered
`trademarks of O'Reilly & Associates, Inc. Manyof the designations used by manufacturers
`and sellers to distinguish their products are claimed as trademarks. Where those designations
`appearin this book, and O'Reilly & Associates, Inc. was aware of a trademark claim, the
`designations have been printed in caps or initial caps. The association berween the image of
`a sock thrush and web caching is a trademark of O'Reilly & Associates, Inc,
`
`While every precaution has been taken in the preparation of this book, the publisher assumes
`no responsibility for emors or omissions, or for damages resulting from the use of the
`information contained herein,
`
`Library ofCongress Cataloging-in-Publication Data
`Wessels, Duane,
`Web Caching/Duane Wessels
`p. cm.
`ISBN 1-56592-536-X.
`1. Cache memory. 2. Browsers (Computer programs) 3. Software configuration
`Management. 4. World Wide Web. L Title.
`TRTS95.M4 W45 2001
`004.5'3—de21
`
`2001035173
`
`ISBN; 1-56592-536-X
`[c]
`
`
`
`
`
`iso Chapter &Intercache Protocols
`
`Rr
`
`P
`
`Average ICP service time from neighbor caches
`
`ICP hit ratio
`
`The average service ime without ICP is simply:
`4
`
`The average service time with ICP is:
`P5,, +01 — PS, + Ra
`We are interested in the case when:
`
`which reduces to:
`
`PS, 401-5, + Ry < 5,
`
`P>
`
`
`Ry
`So7 3h
`In other words, when the ICP hit ratio is larger than the ratio of the ICP service
`time to the difference in the HTTP service times, then the benefits of the neighbor
`hits are greater than the costs of doing the ICP queries. If the ICP hit ratio is
`smaller than the service time ratio, using ICP does not reduce the overall service
`time for user requests.
`Also note that some people use caches primanily for saving bandwidth, not for
`reducing service times. They probably don't mind the slight extra delay if it
`reduces their monthly bandwidth costs.
`
`813.2 Bandwidth
`
`Speaking of bandwidth, you may be wondering how much ofit ICP consumes. It
`is quite easy to calculate the additional bandwidth ICP uses. The size of an ICP
`query is 24 bytes plus the length of the URL. The average URL length seems to be
`about 55 characters, so ICP queries average about 80 bytes. ICP replies are four
`bytes smaller than queries.
`
`Taken together, an ICP query/reply transaction uses about 160 bytes per neighbor
`before including UDP and IP headers. A cache with 3 neighbors uses 480 bytes for
`each cache miss. To put this in perspective, consider that the average web object
`size is about 10 KB. Thus, a cache with three [CP neighbors increases its band-
`width consumption by about 5% on average. Note that this corresponds to a mea-
`surement made at the cache’s network interface. Whether ICP increases your wide
`area bandwidth depends on the location of your neighbor caches.
`
`
`
`ot
`
`Chapter & Intercache Protocols
`
`To predict hits in a neighbor, a cache needs to know which objects ane stored
`there, A neighbor's cache contents can be represented as a database or directory.
`These directories must be exchanged between neighbors. By looking up objects in
`the directory, we can predict cache hits. Since a cache's content changes over
`time, the directories must also be updated. A primary goal of our new protecol is
`to make the transmission and storage of this directory as efficient as possible.
`
`Probably the worst we could do is use a plain list of cached URLs. The average
`URL length is about 55 bytes, or 440 bits. The URL list is poor choice for a direc-
`tory because URLs are an inefficient representation—they use only printable char-
`acters, and some sequences appear very frequently. In information theory terms,
`URLs have a low amount of entropy. A better option is to use a list of hash values
`of the URLs, With hash values, we may have collisions where two or more URLs
`have the sanie hash value. A collision's probability is related to the hash value
`range and the number of URLs in the list. Fora cache of about one million objects,
`we probably need 24 to 32 bits per object. That's a significant reduction in direc-
`tory size compared with 440 bits per object in a URL list. Believe it or not, we can
`do even better. With an algorithm known as a Bloom filter, storage is reduced to
`only a fewbits per object.
`
`&4.1 Bloom Filters
`
`The Bloom filter algorithm, first described by Burton Bloom in 1970, encodes a set
`of input keys with the help of some hash functions. Recall that a hash function is
`simply an algonthm that takes a chunk of data as input (the #ey) and produces an
`integer as output, A good hash function has the characteristic of being random.
`That is, given a sufficiently large number of input keys, the ourput values should
`have an apparently random distribution. Another hash function characteristic is the
`range of its output. Often, the range is expressed as a power of 2, such as 32 bits
`or 22, One of the most often used hash functions is called MD5, which is short for
`Message Digest (Version 5) [Rivest, 1992], The MD5 hash function outputs 128-bit
`values and is used extensively in the current Cache Digest implementation.
`A Bloomfilter is defined by two parameters, K independent hash functions and an
`array of M bits. The bits of the filter are used to efficiently encode a collection of Vv
`items, For a Cache Digest, the collection is the set of object keys (URIs or URLs)
`stored ina cache. To construct a Bloom filter, we iterate over the items in the col-
`lection. Applying the K hash functions to each item gives us A hash values. Each
`hash value represents a bit position in the filter that should be turned on, If a hash
`value is larger than AW, we apply the modulus operator to keep it in range. Note
`that M should be less than the hash function range; otherwise, some bits can never
`be turned on.
`
`