`
`Duane Wessels
`
O'REILLY
Beijing - Cambridge - Farnham - Köln - Paris - Sebastopol - Taipei - Tokyo
`
`APPLE 1055
`Apple v. SpaceTime3D, Inc.
`IPR2023-00242
`
`
`
`
`Web Caching
`by Duane Wessels
`Copyright © 2001 O'Reilly & Associates, Inc. All rights reserved.
`Printed in the United States of America
`
`Published by O'Reilly & Associates, Inc., 101 Morris Street, Sebastopol, CA 95472.
`
`Editors: Nathan Torkington and Paula Ferguson
`
`Production Editor: Leanne Clarke Soylemez
`
Cover Designer: Edie Freedman
`
Printing History:
June 2001: First Edition.
`
Nutshell Handbook, the Nutshell Handbook logo, and the O'Reilly logo are registered
trademarks of O'Reilly & Associates, Inc. Many of the designations used by manufacturers
and sellers to distinguish their products are claimed as trademarks. Where those designations
appear in this book, and O'Reilly & Associates, Inc. was aware of a trademark claim, the
designations have been printed in caps or initial caps. The association between the image of
a rock thrush and web caching is a trademark of O'Reilly & Associates, Inc.

While every precaution has been taken in the preparation of this book, the publisher assumes
no responsibility for errors or omissions, or for damages resulting from the use of the
information contained herein.
`
Library of Congress Cataloging-in-Publication Data
`Wessels, Duane.
`Web Caching/Duane Wessels
`p. cm.
`ISBN 1-56592-536-X
`1. Cache memory. 2. Browsers (Computer programs) 3. Software configuration
management. 4. World Wide Web. I. Title.
TK7895.M4 W45 2001
004.5'3--dc21
ISBN: 1-56592-536-X
`
`2001033173
`
`
`
`
`
Table of Contents

Preface

1. Introduction
    1.1 Web Architecture
    1.2 Web Transport Protocols
    1.3 Why Cache the Web?
    1.4 Why Not Cache the Web?
    1.5 Types of Web Caches
    1.6 Caching Proxy Features
    1.7 Meshes, Clusters, and Hierarchies
    1.8 Products

2. How Web Caching Works
    2.1 HTTP Requests
    2.2 Is It Cachable?
    2.3 Hits, Misses, and Freshness
    2.4 Hit Ratios
    2.5 Validation
    2.6 Forcing a Cache to Refresh
    2.7 Cache Replacement

3. Politics of Web Caching
    3.1 Privacy
    3.2 Request Blocking
    3.3 Copyright
    3.4 Offensive Content
    3.5 Dynamic Web Pages
    3.6 Content Integrity
    3.7 Cache Busting and Server Busting
    3.8 Advertising
    3.9 Trust
    3.10 Effects of Proxies

4. Configuring Cache Clients
    4.1 Proxy Addresses
    4.2 Manual Proxy Configuration
    4.3 Proxy Auto-Configuration Script
    4.4 Web Proxy Auto-Discovery
    4.5 Other Configuration Options
    4.6 The Bottom Line

5. Interception Proxying and Caching
    5.1 Overview
    5.2 The IP Layer: Routing
    5.3 The TCP Layer: Ports and Delivery
    5.4 The Application Layer: HTTP
    5.5 Debugging Interception
    5.6 Issues
    5.7 To Intercept or Not To Intercept

6. Configuring Servers to Work with Caches
    6.1 Important HTTP Headers
    6.2 Being Cache-Friendly
    6.3 Being Cache-Unfriendly
    6.4 Other Issues for Content Providers

7. Cache Hierarchies
    7.1 How Hierarchies Work
    7.2
    7.3
    7.4 Optimizing Hierarchies

8. Intercache Protocols
    8.1
    8.2
    8.3
    8.4 Cache Digests
    8.5 Which Protocol to Use

9. Cache Clusters
    9.1 The Hot Spare
    9.2 Throughput and Load Sharing
    9.3 Bandwidth

10. Design Considerations for Caching Services
    10.1 Appliance or Software Solution
    10.2 Disk Space
    10.3 Memory
    10.4 Network Interfaces
    10.5 Operating Systems
    10.6 High Availability
    10.7 Intercepting Traffic
    10.8 Load Sharing
    10.9 Location
    10.10 Using a Hierarchy

11. Monitoring the Health of Your Caches
    11.1 What to Monitor?
    11.2 Monitoring Tools

12. Benchmarking Proxy Caches
    12.1 Metrics
    12.2 Performance Bottlenecks
    12.3 Benchmarking Tools
    12.4 Benchmarking Gotchas
    12.5 How to Benchmark a Proxy Cache
    12.6 Sample Benchmark Results

A. Analysis of Production Cache Trace Data
B. Internet Cache Protocol
C. Cache Array Routing Protocol
D. Hypertext Caching Protocol
E. Cache Digests
F. HTTP Status Codes
G. U.S.C. 17 Sec. 512. Limitations on Liability Relating to Material Online
H. List of Acronyms

Bibliography

Index

`
`Preface
`
When I first started using the Internet in 1986, my friends and I were obsessed
with anonymous FTP servers. What a wonderful concept! We could download all
sorts of interesting files, such as FAQs, source code, GIF images, and PC shareware.
Of course, downloading could be slow, especially from the busy sites like
the famous WSMR-SIMTEL20.ARMY.MIL archive.

In order to download files to my PC, I would first ftp them to my Unix account
and then use Zmodem to transfer them to my PC through my 1200 bps modem.
Usually, I deleted a file after downloading it, but there were certain files, like
HOSTS.TXT and the “Anonymous FTP List,” that I kept on the Unix system. After
a while, I had some scripts to automatically locate and retrieve a list of files for
later download. Since our accounts had disk quotas, I had to carefully remove old,
unused files and keep the useful ones. Also, I knew that if I had to delete a useful
file, Mark, Mark, Ed, Jay, or Wim probably had a copy in their account.

Although I didn’t realize it at the time, I was caching the FTP files. My Unix
account provided temporary storage for the files I was downloading. Frequently
referenced files were kept as long as possible, subject to disk space limitations.
Before retrieving a file from an FTP server, I often checked my friends’ “caches” to
see if they already had what I was looking for.

Nowadays, the World Wide Web is where it's at, and caching is here too. Caching
makes the Web feel faster, especially for popular pages. Requests for cached information
come back much faster than requests sent to the content provider. Furthermore,
caching reduces network bandwidth usage, which translates directly into cost
savings for many organizations.

In many ways, web caching is similar to the way it was in the Good Ol’ Days. The
basic ideas are the same: retrieve and store files for the user. When the cache
becomes full, some files must be deleted. Web caches can cooperate and talk to
each other when looking for a particular file before retrieving it from the source.

Of course, web caching is significantly more sophisticated and complicated than
it was in my early Internet years. Caches are tightly integrated into the web architecture,
often without the user’s knowledge. The Hypertext Transfer Protocol was designed
with caching in mind. This gives users and content providers more control (perhaps
too much) over the treatment of cached data.

In this book, you'll learn how caches work, how clients and servers can take
advantage of caching, what issues are important, how to design a caching service
for your organization, and more.
`
`Audience
`The material in this book is relevant to the following groups of people:
`Administrators
This book is primarily written for those of you who are, or will be, responsible
for the day-to-day operation of one or more web caches. You might work for
an ISP, a corporation, or an educational institution. Or perhaps you'd like to
set up a web cache for your home computer.
Content providers
I sincerely hope that content providers take a look at this book, and especially
Chapter 6, to see how making their content more “cache aware” can improve
their users’ surfing experiences.
Web developers
Anyone developing an application that uses HTTP needs to understand how
web caching works. Many users today are behind firewalls and caching proxies.
A significant amount of HTTP traffic is automatically intercepted and sent
to web caches. Failure to take caching issues into consideration may adversely
affect the operation of your application.
Web users
Usually, the people who deploy caches want them to be transparent to the
end user. Indeed, users are often unaware that they are using a web cache.
Even so, if you are “only” a user, I hope that you find this book useful and
interesting. It can help you understand why you sometimes see stale web
pages and what you can do about it. If you are concerned about your privacy
on the Internet, be sure to read Chapter 3. If you want to know how to configure
your browser for caching, see Chapter 4.

To benefit from this book, you need to have only a user-level understanding of the
Web. You should know that Netscape Navigator and Internet Explorer are web
browsers, that Apache is a web server, and that http://www.oreilly.com is a URL. If
you have some Unix system administration experience, you can use some of the
examples in later chapters.
`
`What You Will and Won't Find Here
Chapter 1 introduces caching and provides some background material to help the
rest of the book make sense. In addition, companies that provide caching products
are listed here. In Chapter 2, we'll dive into the Hypertext Transfer Protocol and
explore its features for caching. Chapter 3 is relatively nontechnical and discusses
some of the controversies that surround web caching, such as copyrights and privacy.

In Chapter 4, you'll see the various ways to configure user agents (browsers) for
caching, with a focus on Netscape Navigator and Microsoft Internet Explorer. Many
administrators prefer to automatically intercept and divert HTTP connections to a
cache. We'll talk about that in Chapter 5. Then, in Chapter 6, we’ll turn to servers
and see how content providers can make their information cache-friendly.

Chapter 7 and Chapter 8 are about cache hierarchies. First we'll talk about them in
general, including why you should or should not participate in a hierarchy. Then
you'll learn about the protocols caches use to communicate with each other.
Chapter 9 is a short chapter about cache clusters. Although clusters have some things in
common with cache hierarchies, it is easier to understand some of the nuances
after you've learned about the intercache protocols.

In Chapter 10, I'll walk you through some of the decisions you'll face in procuring
and building a caching service for your organization. Following that, Chapter 11
offers advice on monitoring the health of your caches once they are operational.
For the Unix-savvy, I'll show how to set up UCD-SNMPD and RRDTool for this
purpose. Chapter 12 is about benchmarking the performance of caches.

I analyze some logfiles from production caches in Appendix A. Here you can see
some sample file size distributions, content types, HTTP headers, and hit ratio
simulations. The next four appendixes are about intercache protocols. Appendix B
describes the technical details of ICP. Appendix D does the same for HTCP,
Appendix C for CARP, and Appendix E for cache digests. Appendix F is a list of
HTTP status codes from RFC 2616. Appendix G contains the text of a U.S. copyright
statute that mentions caching. Finally, in Appendix H, you'll find definitions
for many of the acronyms I use in this book.

The new, hot topics in the caching industry are streaming media and content
distribution networks. This book focuses on HTTP and FTP caching techniques with
proven results, eschewing technology that is still evolving.
`
`
`
Caching Resources
Here are a few resources you can use to find additional information about
caching. Caching vendors’ web sites are listed in Section 1.8, “Products.”
`
`Web Sites
`
See the following web sites for more information about web caching:
http://www.web-caching.com
This well-designed site contains a lot of up-to-date information, including
product information, press releases, links to magazine articles online, industry
events, and job postings. The site also has a bulletin board discussion forum.
The banner ads at the top of every page are mildly annoying.
http://www.caching.com
This professional site is full of web caching information in the form of press
releases, upcoming events, vendor white papers, online magazine articles, and
analyst commentaries. The site is apparently sponsored by a number of the
caching companies.
http://www.web-cache.com
At my own site for caching information, you'll find mostly links to other web
sites, including caching vendors and caching-related services. I also try to
keep up with relevant technical documents (e.g., RFCs) and research papers.
http://dmoz.org/Computers/Software/Internet/Servers/Proxy/Caching/
The Open Directory Project has a decent-sized collection of web caching links
at the above URL.
http://www.wrec.org
The Web Replication and Caching (WREC) working group of the IETF is officially
dead, but this site still has some useful information.
http://www.iwcw.org
This site provides information about the series of annual International Web
Caching Workshops.
`
`Mailing Lists
The following mailing lists discuss various aspects of web caching:
isp-caching
Currently the most active caching-related mailing list. Averages about 2-3
messages per day. Posting here is likely to result in a number of salespeople
knocking on your mailbox. One of the great things about this list is that
replies automatically go back to the list unless the message composer is
careful. On many occasions people have posted messages that they wish they
could take back! For more information, visit http://www.isp-caching.com.
WEBI
WEBI (Web Intermediaries) is a new IETF working group, replacing WREC.
The discussion is bursty, averaging about 1-2 messages per day. This is not an
appropriate forum for discussion of web caching in general; topics should be
related to the working group charter. Currently the group is addressing intermediary
discovery (a la WPAD) and the resource update protocol. For additional
information, including the charter and subscription instructions, visit
http://www.ietf.org/html.charters/webi-charter.html.
HTTP working group
Although the HTTP working group is officially dead, the mailing list still
receives a small amount of traffic. Messages are typically from somebody asking
for clarification about the RFC. For subscription information and access to
the archives, visit http://www.ics.uci.edu/pub/ietf/http/hypermail.
loadbalancing
The people on this list discuss all aspects of load balancing, including hints for
configuring specific hardware, performance issues, and security alerts. Traffic
averages about 3-4 messages per day. For more information, visit
http://www.lbdigest.com.
`
`Conventions Used in This Book
I use the following typesetting conventions in this book:
Italic
Used for emphasis and to signify the first use of a term. Italic is also used for
URLs, host names, email addresses, FTP sites, file and directory names, and
commands.
Constant width
Used for HTTP header names and directives, such as If-modified-since and
no-cache.
`
`How To Contact Us
You can contact the author at wessels@packet-pushers.com.
Please address comments and questions concerning this book to the publisher:
O'Reilly & Associates, Inc.
101 Morris Street
Sebastopol, CA 95472
(800) 998-9938 (in the United States or Canada)
(707) 829-0515 (international or local)
(707) 829-0104 (fax)

We have a web page for this book, where we list examples, errata, or any additional
information. You can access this page at:
http://www.oreilly.com/catalog/webcaching/
To comment or ask technical questions about this book, send email to:
bookquestions@oreilly.com

For more information about our books, conferences, software, Resource Centers,
and the O'Reilly Network, see our web site at:
http://www.oreilly.com
`
`Acknowledgments
I am extremely lucky to have been put in the position to write this book. There
are so many people who have helped me along the way. First, I want to thank Dr.
Jon Sauer and Dr. Ken Klingenstein at the University of Colorado for supporting
my graduate work in this field. Huge thanks to Michael Schwartz, Peter Danzig,
and other members of the Harvest project for the most enjoyable job I'll probably
ever have. I don’t think I can ever thank k claffy and Hans-Werner Braun of
NLANR enough for taking me in and allowing me to work on the IRCache project.
I am also in karma-debt to all of my CAIDA friends (Tracie, Amy, Rochell, Jenniffer,
Jambi) for taking care of business in San Diego so I could stay in Boulder.
Thanks to Marla Meehl and the National Center for Atmospheric Research for a
place to sit and an OC-3 connection.

This book has benefited immensely from the attentive eyes of the folks who
reviewed the manuscript: Ittai Gilat (Microsoft), Lee Beaumont (Lucent), Jeff Boote
(NCAR), Reuben Farrelley, and Valery Soloviev (Inktomi). Special thanks also to
Andy Cervantes of The Privacy Foundation.

As usual, the folks at O'Reilly have done a fantastic job. Nat, Paula, Lenny, Erik:
Let’s do it again sometime!

Since I've been working on web caching, I have been very fortunate to work with
many wonderful people. I truly appreciate the support and friendship of Jay Adelson,
Kostas Anagnostakis, Pei Cao, Glenn Chisholm, Ian Cooper, Steve Feldman,
Henry Guillen, Martin Hamilton, Ted Hardie, Solom Heddaya, Ron Lee, Ulana Legedza,
Carlos Maltzahn, John Martin, Ingrid Melve, Wojtek Sylwestrzak, Bill Woodcock,
and Lixia Zhang.

Thanks to my family (Karen, John, Theresa, Roy) for their constant support and
understanding. Despite the efforts of my good friends Ken, Scott, Bronwyn, Gennevive,
and Brian, who tried to tie up all my free time, I finished anyway! My
coworkers, Alex Rousskov and Matthew Weaver, are champs for putting up with an
endless barrage of questions, and for tolerating my odd working hours. A big
thank you to everyone who writes free software, especially the FreeBSD hackers.
But most of all, thanks to all the Squid users and developers out there!
`
`
`
`
`
`
`
`
`
`
`Introduction
`
The term cache has French roots and means, literally, to store. As a data processing
term, caching refers to the storage of recently retrieved computer information
for future reference. The stored information may or may not be used again, so
caches are beneficial only when the cost of storing the information is less than the
cost of retrieving or computing the information again.

The concept of caching has found its way into almost every aspect of computing
and networking systems. Computer processors have both data and instruction
caches. Computer operating systems have buffer caches for disk drives and filesystems.
Distributed (networked) filesystems such as NFS and AFS rely heavily on
caching for good performance. Internet routers cache recently used routes.
Domain Name System (DNS) servers cache hostname-to-address and other
lookups.

Caches work well because of a principle known as locality of reference. There are
two flavors of locality: temporal and spatial. Temporal locality means that some
pieces of data are more popular than others. CNN’s home page is more popular
than mine. Within a given period of time, somebody is more likely to request the
CNN page than my page. Spatial locality means that requests for certain pieces of
data are likely to occur together. A request for the CNN home page is usually followed
by requests for all of the page’s embedded graphics. Caches use locality of
reference to predict future accesses based on previous ones. When the prediction
is correct, there is a significant performance improvement. In practice, this technique
works so well that we would find computer systems unbearably slow without
memory and disk caches. Almost all data processing tasks exhibit locality of
reference and therefore benefit from caching.
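
To make temporal locality concrete, here is a minimal sketch of a small cache that keeps the most recently requested items and counts hits and misses. It is only an illustration: the `LRUCache` class, the `fetch` callback, and the request names are assumptions made for this example, and least-recently-used eviction is just one simple way to act on the idea that recently requested data is likely to be requested again.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny least-recently-used cache (hypothetical, for illustration only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()  # oldest entry first, newest entry last
        self.hits = 0
        self.misses = 0

    def get(self, key, fetch):
        """Return the cached value for key, calling fetch(key) on a miss."""
        if key in self.items:
            self.items.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self.items[key]
        self.misses += 1
        value = fetch(key)  # the expensive retrieval we hope to avoid
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used
        return value


def run_demo():
    """Replay a request stream in which one page is far more popular."""
    cache = LRUCache(capacity=2)
    requests = ["cnn", "cnn", "mine", "cnn", "other", "cnn"]
    for name in requests:
        cache.get(name, fetch=lambda key: "page for " + key)
    return cache.hits, cache.misses
```

With this request stream, `run_demo()` returns `(3, 3)`: every repeat request for the popular page is a hit, while each first-time request is a miss.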

When requested data is found in the cache, we call it a hit. Similarly, referenced
data that is not cached is known as a miss. The performance improvement that a
cache provides is based mostly on the difference in service times for cache hits
compared to misses. The percentage of all requests that are hits is called the hit
ratio.
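
The relationship between hit ratio and service time can be sketched with a little arithmetic. The function names and the sample timings below are assumptions chosen for illustration, not measurements; the sketch expresses the hit ratio as a fraction between 0 and 1 rather than a percentage.

```python
def hit_ratio(hits, misses):
    """Fraction of all requests that were served from the cache."""
    total = hits + misses
    if total == 0:
        return 0.0  # no requests yet, so no meaningful ratio
    return hits / total


def mean_service_time(ratio, hit_time, miss_time):
    """Average time per request, weighting hit and miss times by frequency."""
    return ratio * hit_time + (1.0 - ratio) * miss_time


# Hypothetical numbers: 40 hits and 60 misses, with hits served in 0.05 s
# and misses in 1.0 s.
ratio = hit_ratio(40, 60)
average = mean_service_time(ratio, hit_time=0.05, miss_time=1.0)
```

Under these assumed numbers the average works out to roughly 0.62 seconds, compared with 1.0 second if every request went to the origin. The gain grows with both the hit ratio and the gap between hit and miss times.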
`
Any system that utilizes caching must have mechanisms for maintaining cache consistency.
This is the process by which cached copies are kept up-to-date with the
originals. We say that cached data is either fresh or stale. Caches can reuse fresh
copies immediately, but stale data usually requires validation. The algorithms that
are used to maintain consistency may be either weak or strong. Weak consistency
means that the cache sometimes returns outdated information. Strong consistency,
on the other hand, means that cached data is always validated before it is used.
CPU and filesystem caches require strong consistency. However, some types of
caches, such as those in routers and DNS resolvers, are effective even if they
return stale information.
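
The difference between weak and strong consistency can be sketched as follows. This is a simplified model under stated assumptions: the `CacheEntry` class, the fixed `lifetime`, and the `validate` callback (which stands in for asking the original source for the current copy) are all hypothetical names introduced for this example.

```python
class CacheEntry:
    """A cached copy plus the bookkeeping needed to judge freshness."""

    def __init__(self, body, stored_at, lifetime):
        self.body = body
        self.stored_at = stored_at  # time the copy was stored or last validated
        self.lifetime = lifetime    # seconds the copy is assumed to stay fresh

    def is_fresh(self, now):
        return now - self.stored_at < self.lifetime


def serve(entry, now, validate, strong=False):
    """Return (body, validated) for one request.

    Weak consistency (the default here) reuses a fresh copy without checking,
    so it can occasionally return outdated data. Strong consistency validates
    on every request.
    """
    if not strong and entry.is_fresh(now):
        return entry.body, False
    entry.body = validate(entry)  # hypothetical check against the original
    entry.stored_at = now
    return entry.body, True
```

For example, with an entry stored at time 0 with a 60-second lifetime, a weak-consistency request at time 10 returns the cached body unchecked even if the original has changed, whereas `strong=True` always calls `validate` first.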
`
We know that caching plays an important role in modern computer memory and
disk systems. Can it be applied to the Web with equal success? Ask different people
and you're likely to get different answers. For some, caching is critical to making
the Web usable. Others view caching as a necessary evil. A fraction probably
consider it just plain evil [Tewksbury, 1998].

In this book, I'll talk about applying caching techniques to the World Wide Web
and try to convince you that web caching is a worthwhile endeavor. We'll see how
web caches work, how they interact with clients and servers, and the role that
HTTP plays. You'll learn about a number of protocols that are used to build cache
clusters and hierarchies. In addition to talking about the technical aspects, I also
spend a lot of time on the issues and politics. The Web presents some interesting
problems due to its highly distributed nature.
`
After you've read this book, you should be able to design and evaluate a caching
proxy solution for your organization. Perhaps you'll install a single caching proxy
on your firewall, or maybe you need many caches located throughout your network.
Furthermore, you should be well prepared to understand and diagnose any
problems that may arise from the operation or failure of your caches. If you're a
content provider, then I hope I'll have convinced you to increase the cachability of
the information you serve.
`
1.1 Web Architecture
Before we can talk more about caching, we need to agree on some terminology.
Whenever possible, I use words and meanings taken from Internet standards documents.
Unfortunately, colloquial usage of web caching terminology is often just
different enough to be confusing.
`
1.1.1 Clients and Servers
The fundamental building blocks of the Web (and indeed most distributed systems)
are clients and servers. A web server manages and provides access to a set
of resources. The resources might be simple text files and images, or something
more complex, such as a relational database. Clients, also known as user agents,
initiate a transaction by sending a request to a server. The server then processes
the request and sends a response back to the client.

On the Web, most transactions are download operations; the client downloads
some information from the server. In these cases, the request itself is quite small
(about 200 bytes) and contains the name of the resource, plus a small amount of
additional information from the client. The information being downloaded is usually
an image or text file with an average size of about 10,000 bytes. This characteristic
of the Web makes cable- and satellite-based Internet services viable. These
services offer much higher data rates for receiving than for sending, which suits
web users because they mostly receive information.
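
To see why a download request is so small, the following sketch builds a minimal HTTP GET request and measures its size. The helper name, the header values, and the client name are assumptions chosen for illustration; real browsers send more headers, which is how requests reach the size quoted above.

```python
def build_get_request(host, path, extra_headers=None):
    """Assemble the bytes of a simple HTTP/1.1 GET request."""
    lines = ["GET " + path + " HTTP/1.1", "Host: " + host]
    for name, value in (extra_headers or {}).items():
        lines.append(name + ": " + value)
    # A blank line (CRLF CRLF) marks the end of the request headers.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


# Hypothetical client details; only the URL comes from the text.
request = build_get_request(
    "www.oreilly.com",
    "/catalog/webcaching/",
    {"User-Agent": "example-client/0.1", "Accept": "*/*"},
)
request_size = len(request)  # well under the ~200 bytes mentioned above
```

The request carries little more than the resource name and a few client details, while the response it triggers is typically many times larger.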

A small percentage of web transactions are more correctly characterized as upload
operations. In these cases, requests are relatively large and responses are very
small. Examples of uploads include sending an email message and transferring an
image file from your computer to a server.

The most common web clients are called browsers. These are applications such as
Netscape Navigator and Microsoft Internet Explorer. The purpose of a browser is
to render the web content for us to view and interact with. Because of the myriad
features present in web browsers, they are really very large and complicated
programs. In addition to the GUI-based clients, there are a few simple command-line
client programs, such as Lynx and Wget.

A number of different servers are in widespread use on the Web. The Apache
HTTP server is a popular choice and freely available. Netscape, Microsoft, and
other companies also have server products. Many content providers are concerned
with the performance of their servers. The most popular sites on the Net can
receive ten million requests per day with peak request rates of 1000 per second. At
this scale, both the hardware and software must be very carefully designed to
cope with the load. Many sites run multiple servers in parallel to handle their high
request rates and for redundancy.

Recently, there has been a lot of excitement surrounding peer-to-peer applications,
such as Napster. In these systems, clients share files and other resources (e.g., CPU
cycles) directly with each other. Napster, which enables people to share MP3 files,
does not store the files on its servers. Rather, it acts as a directory and returns
pointers to files so that two clients can communicate directly. In the peer-to-peer
realm, there are no centralized servers; every client is a server.
`
`
`
The peer-to-peer movement is relatively young but already very popular. It’s likely
that a significant percentage of Internet traffic today is due to Napster alone. However,
I won't discuss peer-to-peer clients in this book. One reason for this is that
Napster uses its own transfer protocol, whereas here we'll focus on HTTP.
`
`1.1.2 Proxies
`
Much of this book is about proxies. A proxy is an intermediary in a web transaction.
It is an application that sits somewhere between the client and the origin
server. Proxies are often used on firewalls to provide security. They allow (and
record) requests from the internal network to the outside Internet.
`
A proxy behaves like both a client and a server. It acts like a server to clients, and
like a client to servers. A proxy receives and processes requests from clients, and
then it forwards those requests to origin servers. Some people refer to proxies as
“application layer gateways.” This name reflects the fact that the proxy lives at the
application layer of the OSI reference model, just like clients and servers. An
important characteristic of an application layer gateway is that it uses two TCP
connections: one to the client and one to the server. This has important ramifications
for some of the topics we'll discuss later.
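
The dual role of a proxy can be sketched in a few lines. This is an in-process model, not a network program: the `OriginServer` and `Proxy` classes and the dictionary-based request and response formats are assumptions made for this example, and each method call stands in for one of the two separate TCP connections a real proxy would use.

```python
class OriginServer:
    """Stands in for an origin server that owns a set of resources."""

    def __init__(self, resources):
        self.resources = resources  # maps path -> body

    def handle(self, request):
        body = self.resources.get(request["path"])
        if body is None:
            return {"status": 404, "body": ""}
        return {"status": 200, "body": body}


class Proxy:
    """Acts as a server toward clients and as a client toward the origin."""

    def __init__(self, origin):
        self.origin = origin
        self.log = []  # proxies often record the requests they pass along

    def handle(self, request):
        # Server role: receive and process the client's request
        # (first connection).
        self.log.append(request["path"])
        # Client role: forward the request to the origin server
        # (second connection).
        return self.origin.handle(request)


origin = OriginServer({"/index.html": "<html>hello</html>"})
proxy = Proxy(origin)
response = proxy.handle({"path": "/index.html"})
```

Because the client talks only to the proxy, the proxy is free to log, filter, or (as later chapters discuss) answer from a cache instead of forwarding every request.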
`
Proxies are used for a number of different things, including logging, access controls,
filtering, translation, virus checking, and caching. We'll talk more about these
and the issues they create in Chapter 3.
`
1.1.3 Web Objects
I use the term object to refer to the entity exchanged between a client and a
server. Some people may use document or page, but these terms are misleading
because they imply textual information or a collection of text and images. “Object”
is generic and better describes the different types of content returned from servers,
such as audio files, ZIP files, and C programs. The standards documents (RFCs)
that describe web components and protocols prefer the terms entity, resource, and
response. My use of object corresponds to their use of entity, where an object
(entity) is a particular response generated from a particular resource. Web objects
have a number of important characteristics, including size (number of bytes), type
(HTML, image, audio, etc.), time of creation, and time of last modification.

In broad terms, web resources can be considered either dynamic or static.
Responses for dynamic resources are generated on the fly when the request is
made. Static responses are pregenerated, independent of client requests. When
people think of dynamic responses, often what comes to mind are stock quotes,
live camera images, and web page counters. Digitized photographs, magazine articles,
and software distributions are all static information. The distinction between
dynamic and static content is not necessarily so clearly defined. Many web
resources are updated at various intervals (perhaps daily) but not uniquely generated
on a per-request basis. The distinction between dynamic and static resources
is important because it has serious consequences for cache consistency.
`
1.1.4 Resource Identifiers
Resource identifiers are a fundamental piece of the architecture of the Web. These
are the names and addresses for web objects, analogous to street addresses and
telephone numbers. Officially, they are called Uniform Resource Identifiers, or
URIs. They are used by people and computers alike. Caches use them to
identify and index the stored objects. According to the design specification, RFC
2396, URIs must be extensible, printable, and able to encode all current and future
naming schemes. Because of these requirements, only certain characters may
appear in URIs, and some characters have special meanings.

Uniform Resource Locators (URLs) are the most common form of URI in use today.
The URL syntax is described in RFC 1738. Here are some sample URLs:
http://www.zoidberg.net
http://www.oasis-open.org/docbook/index.html
ftp://ftp.freebsd.org/pub/FreeBSD/README.TXT
`URLs have a very important characteristic worth mentioning here. Every URL
`includes a network host address—either a hostname or an IP address. Thus, a URL
`is bound to a specific server, called the origin server. This characteristic has some
`negative side effects for caching. Occasionally, the same resource exists on two or
more servers, as occurs with mirror sites. When a resource has more than one
name, it can get cached under different names. This wastes storage space and
bandwidth.
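
The following sketch illustrates both points: every URL names a host, and a cache indexed by URL stores a mirrored resource once per name. The mirror hostnames and the dictionary-based cache are assumptions made for this example; only the FreeBSD URL comes from the text.

```python
from urllib.parse import urlsplit


def origin_host(url):
    """Return the network host that a URL is bound to."""
    return urlsplit(url).hostname


def store(cache, url, body):
    """Index a cached object by its URL, as described above."""
    cache[url] = body


host = origin_host("ftp://ftp.freebsd.org/pub/FreeBSD/README.TXT")

# Hypothetical mirror sites serving identical content under two names.
cache = {}
store(cache, "http://www.example.com/pub/README.TXT", b"same bytes")
store(cache, "http://mirror.example.net/pub/README.TXT", b"same bytes")
duplicate_copies = len(cache)  # two entries for one underlying resource
```

Here `host` is `ftp.freebsd.org`, and the cache ends up holding two copies of the same bytes because it has no way to know that the two URLs name the same resource.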
`
`Uniform Resource Names (URNs) are similar to URLs, but they refer to resources in
`a location-independent manner. RFC 2141 describes URNs, which are also some-
`times called persistent names. Resources named with URNs can be moved from
`one server (location) to another without causing problems. Here are some sample
`(hypothetical) URNs:
urn:duns:002372413:annual-report-1997
urn:isbn:156592530X
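
A URN has the general form `urn:<NID>:<NSS>`, where the namespace identifier (NID) names the naming scheme and the namespace-specific string (NSS) identifies the resource within it. The parser below is a minimal sketch of that split; the function name is an assumption for this example, and it does not check the character rules that RFC 2141 imposes.

```python
def parse_urn(urn):
    """Split a URN into (namespace identifier, namespace-specific string)."""
    # Split on the first two colons only; the NSS may itself contain colons.
    scheme, nid, nss = urn.split(":", 2)
    if scheme.lower() != "urn":
        raise ValueError("not a URN: " + urn)
    # The "urn" prefix and the NID are case-insensitive; the NSS is kept as is.
    return nid.lower(), nss


isbn_urn = parse_urn("urn:isbn:156592530X")
duns_urn = parse_urn("urn:duns:002372413:annual-report-1997")
```

Neither part contains a hostname, which is what makes the name independent of any particular server.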
`
In 1995, the World Wide Web Project left its birthplace at CERN in Geneva,
Switzerland, and became the World Wide Web Consortium. In conjunction with
this move, their web site location changed from info.cern.ch to www.w3c.org.
Everyone who used a URL with the old location received a page with a link to the
new location and a reminder to “update your links and hotlist."* Had URNs been
implemented and in use back then, such a problem could have been avoided.
`Another