Many of the designations used by manufacturers and sellers to distinguish
their products are claimed as trademarks. Where those designations appear
in this book and Addison-Wesley was aware of a trademark claim, the
designations have been printed in initial capital letters.
`The publisher offers discounts on this book when ordered in quantity
`for special sales.
`For more information, please contact:
`Corporate & Professional Publishing Group
`Addison-Wesley Publishing Company
`One Jacob Way
`Reading, Massachusetts 01867
`Stein, Lincoln D., 1960-
`How to set up and maintain a World Wide Web site : the guide for
`information providers / Lincoln D. Stein.
`Includes index.
`ISBN 0-201-63389-2 (alk. paper)
1. World Wide Web (Information retrieval system)
`TK5105.888.S74 1995
`I. Title.
`Copyright © 1995 by Addison-Wesley Publishing Company, Inc.
All rights reserved. No part of this publication may be reproduced, stored in a
`retrieval system, or transmitted, in any form or by any means, electronic,
`mechanical, photocopying, recording, or otherwise, without the prior written
`permission of the publisher. Printed in the United States of America. Published
`simultaneously in Canada.
`1 2 3 4 5 6 7 8 9-CRW-98979695
`First printing, August 1995
`Guided Tour
`information sharing among collaborators, but interest in the system soon
`spread to other laboratories and academic institutions.
A turning point for the Web came in February 1993, when the U.S.
National Center for Supercomputing Applications (NCSA) released an
early version of Mosaic, a Web browser for Unix machines running the X
Windows system. Mosaic used icons, popup menus, rendered bitmapped
text, and color links to display hypertext documents. In addition, Mosaic
was capable of incorporating color images directly onto the page along
with the text, and provided support for sounds, animation, and other
types of multimedia. In mid-November 1993, Mosaic was released
simultaneously for three popular platforms: the Apple Macintosh, Microsoft
Windows-based machines, and X Windows.
`The Web took off explosively. In October 1993, eight months after the
`release of Mosaic for X Windows, the number of Web servers registered at
`CERN had increased to 500. A year later there were an estimated 4600
`sites, with more being added exponentially. In August 1994, Web network
`traffic on the National Science Foundation's Internet backbone exceeded
that for e-mail, the only service ever to do so. Recent estimates of the Web
put the number of servers at more than 12,000, and estimate an annual
growth rate of 3000%.
`A short walk through the World Wide Web will show you what it's all
`about. The screen shots that follow use a Macintosh-based Web browser
`called MacWeb, produced and distributed freely by EINet (a service run
`by Microelectronics and Computer Corporation). MacWeb was chosen
`for the screen shots mainly because it isn't Mosaic. Although Mosaic and
`the Web have become synonymous in the public perception, Mosaic is
only the best known browser; many others are available both freely and
commercially.
`Figure 1.1: SIPB Main Page. We start our tour at the MIT Student
`Information Processing Board (SIPB), a Web site maintained by one of
`MIT's student organizations. The Web has no particular starting point, so
`this is as good a place to jump in as any. The first thing that grabs your
`attention is the Web's use of the document metaphor. The Web is organized
as a series of pages, each with a distinctly book-like feel. You'll find
paragraphs, headings, subheadings, changes of font and emphasis, indented
lists, and embedded color graphics. The underlined words and phrases
are hypertext links. These links, when selected, take the user to a different
page or to a different location on the same page. In this case, we use the
mouse to select the link named "IAP Course Guide" to learn more about
what's going on during MIT's Independent Activities Period.
`FIGURE 1.1 MIT SIPB Main Page
`Figure 1.2: Freshman Fishwrap. This link takes us to another page, this
one maintained by the Freshman Fishwrap, a student newspaper. Each page
`on the Web has a unique address, known as its URL, or Uniform Resource
`Locator. You can see the URL for this page in the box on the upper right-
`hand corner of this Web browser's window. URL formats are explained in
`great depth later, but for now just notice that the URL begins with the text
`http, indicating that this page is accessed using the Hypertext Transfer
`Protocol (HTTP) and that the Internet address of the machine on which
`this page lives is fishwrap—docs . edu. Also notice
`that this page lives on a different machine than the SIPB main page, which
is hosted by www.mit.edu.
`This page contains a graphic calendar with instructions to click on a
`day in order to see the corresponding class schedule. This is an example
`of a clickable map. Clicking the mouse on different parts of the image
`takes us to different pages. In this case, we click on January 9, marked
`”IAP Start.”
`Figure 1.3: IAP Schedule for January 9. This link takes us to a course
`schedule. The schedule itself is made up of more links, any one of which we
could select to get a short course description and pointers to other courses of
`interest. Instead, we'll do some more exploring. We jump back to the main
`SIPB page (by clicking the browser's left arrow button a few times) and select
`the link marked ”official MIT web server.”
`FIGURE 1.3 Independent Activities Period Schedule
`IP Addresses
`Domain Names
`TCP/IP uses a static addressing scheme in which each and every machine on
`the Internet is assigned a unique, unchanging IP address. IP addresses are
`32-bit numbers that are usually written out as four 8-bit numbers separated
`by dots. Examples of IP addresses include and
Although four billion addresses sounds like more than enough to go
around, this isn't really the case. For one thing, various ranges of IP
addresses are reserved for special purposes such as multicasting. For another,
`IP addresses are organized in a hierarchical way into a series of networks
`and subnetworks. The Network Information Center (NIC) allocates blocks of
`contiguous addresses to organizations and regional networks (Table 2.1). A
`small organization, such as a privately held company, might receive the
`block of 255 addresses from to (this is called a class
`”C” address.) It could then divvy the addresses up among its various depart-
`ments. A large organization, such as a university, might receive the block of
approximately 65,000 addresses from to (this is a
`class ”B” address.) Even larger entities, such as the U.S. military or the
`NEARnet regional network, could be granted one or more class ”A”
`addresses, such as the block to, encompassing more
`than 16 million addresses. The advantages of this hierarchical way of divid-
`ing the addresses are twofold. Organizationally, it's simpler to give blocks of
`addresses to organizations and allow them to divide them up as they see fit.
`Technically, it's much easier for network routers to determine how to get
`packets of data from one address to another when the Internet is organized
`into a series of networks and subnetworks.
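
The arithmetic behind these address blocks can be checked with a short Python sketch. It relies on the standard ipaddress module. The address 192.0.2.56 and the three example networks are placeholders chosen for illustration and do not come from the text, and the /24, /16, and /8 prefix lengths are an assumed modern notation for what the text calls class C, B, and A blocks.

```python
import ipaddress


def to_number(dotted):
    """Return the 32-bit integer behind a dotted IP address."""
    return int(ipaddress.IPv4Address(dotted))


def to_octets(dotted):
    """Return the four 8-bit numbers that make up the address."""
    return list(ipaddress.IPv4Address(dotted).packed)


def block_size(network):
    """Return how many addresses a contiguous block contains."""
    return ipaddress.ip_network(network).num_addresses


if __name__ == "__main__":
    # Placeholder address, not one taken from the book.
    print(to_octets("192.0.2.56"), to_number("192.0.2.56"))
    # Placeholder networks standing in for class C, B, and A blocks.
    for label, net in [("class C", "192.0.2.0/24"),
                       ("class B", "172.16.0.0/16"),
                       ("class A", "10.0.0.0/8")]:
        print(label, net, block_size(net))
```

The block sizes come out as powers of two (2 to the 8th, 16th, and 24th power), which is where the "approximately 65,000" and "more than 16 million" figures originate.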
`As a result of its rapid growth, the Internet is close to running out of
`unallocated addresses. A new system that uses longer addresses will
`replace the current one over the next few years. The new system will be
designed to maintain compatibility with the current addressing scheme.
Raw IP addresses are unfriendly. They are difficult to remember and hard
to type. For this reason, IP addresses are usually assigned human-readable
names using a distributed hierarchical lookup system known as the Domain
Name System (DNS). In DNS, each machine has a unique name consisting
of multiple parts separated by dots. The first part is the machine's host
name, followed by a list of domains. The first domain is usually an identifier
for the organization to which the machine belongs, followed by
more organizational subdivisions if necessary, and finally a label for the
top-level domain. In the USA, the top-level domain is usually an identifier
for the type of organization: edu for educational institutions, com for
commercial organizations, mil for military establishments, net for network
providers, and org for organizations that don't fit anywhere else. For the
rest of the world, the top-level domain usually identifies the country: jp
for Japan, de for Germany (Deutschland), ch for Switzerland, and so on.
The host name and domains together form a fully qualified domain name that
uniquely identifies that machine on the Internet. The dots in domain names
have no correspondence to the dots in IP addresses. Whereas IP addresses
have four parts, domain names may have two, three, or more, depending
on how the local naming system happens to have been set up.

TABLE 2.1 Networks and Hosts
Class    Example Address    Network Part    Host Part
C        192.66.12.56       192.66.12       56
`For example, one of the Sun workstations inside the Whitehead
`Institute of Biomedical Research's local network has the IP address
Its full domain name is loco.wi.mit.edu. Here's how the
name is formed (Figure 2.1): its host name is loco, it belongs to a network
maintained by the Whitehead Institute, wi, which in turn is part of MIT's
network, mit, which is itself a U.S. educational institution, edu.
The information in the DNS is distributed among a large number of DNS
databases, each one stored on a name server maintained by the
organization responsible for its piece of the network. When a program is
given a domain name to connect to, it must first send an inquiry to its local
name server in order to find the numeric IP address to which the name
corresponds. If the name server doesn't know (and often it doesn't), it queries
another name server closer to the destination, and that name server may in
turn query a third. For example, a program in Japan wanting to look up the
address of loco.wi.mit.edu might first send a query to one of the name
servers in the U.S. responsible for the edu names. That machine would
then forward the request to the MIT machine responsible for the mit
domain, which would in turn defer to a name server at the Whitehead
Institute. Physically, the DNS databases are just human-readable tables. To
add or modify a machine name, the local DNS administrator makes a
simple addition or modification to the table.
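
The chain of referrals described above can be modeled in a few lines of Python. This is only a toy sketch: the tables are plain dictionaries rather than real DNS databases, the server names (local, edu-server, mit-server, wi-server) are invented for illustration, and the address 192.0.2.17 is a placeholder, not the real address of loco.wi.mit.edu.

```python
# Toy referral tables. Every server name and the final address are hypothetical.
NAME_SERVERS = {
    "local": {"edu": ("ask", "edu-server")},
    "edu-server": {"mit.edu": ("ask", "mit-server")},
    "mit-server": {"wi.mit.edu": ("ask", "wi-server")},
    "wi-server": {"loco.wi.mit.edu": ("address", "192.0.2.17")},
}


def lookup(name, server="local"):
    """Follow referrals between name servers until one returns an address.

    Returns the address and the list of servers that were consulted.
    """
    consulted = []
    while True:
        consulted.append(server)
        table = NAME_SERVERS[server]
        # Pick the entry whose domain matches the end of the requested name.
        domain = next(
            (d for d in table if name == d or name.endswith("." + d)), None
        )
        if domain is None:
            raise LookupError(f"{server} knows nothing about {name}")
        kind, value = table[domain]
        if kind == "address":
            return value, consulted
        server = value  # referral: ask the next, more specific server


if __name__ == "__main__":
    print(lookup("loco.wi.mit.edu"))
```

Each table plays the role of one organization's human-readable database, so adding a machine means adding one entry to the table of the server responsible for it.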
FIGURE 2.1 Anatomy of a Fully Qualified Domain Name
(labels: host name, yet another organization, organization, organization type)
One of the nice features of the DNS is that a single machine can have
one or more "aliases" assigned to it in addition to its true name. This
feature is widely used by Web administrators to give descriptive names to
their server machines. For example, an organization whose domain
name is capricorn.org might run its Web server on a host named
toggenberg.capricorn.org. Instead of using this as its publicly
known Web name, the organization could create a www alias for the
machine, making it known to the world as www.capricorn.org. In
addition to being the obvious name for people to guess at when trying to
find the organization's Web server, use of the alias makes it easy to move
the Web service to a different machine later. The Web administrator just
has to let the person who runs the local DNS know that the alias needs to
be reassigned to the new machine.
To establish a communications channel between two programs running
on different machines, or even two programs running on the same
machine, one program must initiate the connection and the other accept it.
This is accomplished using a client/server scheme. The server runs first.
When it first starts up it signals the operating system that it wants to
accept incoming network connections. Then it waits around for the
connections to start rolling in. When a client on a remote machine needs to
send or retrieve information from the server, it opens up a connection to
the server, passes information back and forth, and closes the connection.
Most servers can handle multiple simultaneous incoming connections.
They do this either by duplicating themselves in memory each time an
incoming connection comes in, or by cleverly interleaving their
communications activity.
The distinction between client and server rests on who initiates the
connection and who accepts it. Although the server is usually the
information provider and the client is usually the information customer, this is not
necessarily the case. However, it is generally true that the client
interacts directly with the user, processing keystrokes and displaying
results, while the server skulks unseen in the background.
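
The sequence described above (the server starts first and waits, the client connects, the two exchange data, the connection closes) can be sketched with Python's standard socket module. This is a minimal illustration, not how any particular Web server is written: the function names are invented, the "service" merely upper-cases whatever it receives, and threads stand in for the duplication or interleaving that real servers use to handle simultaneous connections.

```python
import socket
import threading


def start_server(host="127.0.0.1", port=0):
    """Start a tiny server; return (listening socket, port actually bound)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind((host, port))  # port 0 asks the OS for any free port
    listener.listen()            # tell the OS we accept incoming connections

    def handle(conn):
        # One request per connection: read, reply in upper case, close.
        with conn:
            data = conn.recv(1024)
            conn.sendall(data.upper())

    def serve():
        while True:
            try:
                conn, _ = listener.accept()  # wait for a client to connect
            except OSError:
                return  # the listening socket was closed
            threading.Thread(target=handle, args=(conn,), daemon=True).start()

    threading.Thread(target=serve, daemon=True).start()
    return listener, listener.getsockname()[1]


def ask(host, port, message):
    """Client side: open a connection, send a message, read the reply."""
    with socket.create_connection((host, port), timeout=5) as client:
        client.sendall(message)
        return client.recv(1024)


if __name__ == "__main__":
    listener, port = start_server()
    print(ask("127.0.0.1", port, b"hello"))
    listener.close()
```

Note that the client is the side that calls create_connection and the server is the side that calls accept, which is exactly the distinction the text draws.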
When two programs want to communicate with each other, it isn't enough
for them to know each other's IP addresses. They also need a way to
rendezvous. This is because a single machine often runs multiple types of
servers. For example, the typical Unix machine offers a telnet service for
network log-ins, a time service for exchanging the time of day, an ftp
service for transferring files, and several others. A machine offering Web
or Gopher services will run HTTP or Gopher servers as well. When a
program connects to a remote machine, how does it ensure that it will connect
to the right program?
`Clients and
This is done through well-known ports. A port is to an IP address what
an apartment number is to an apartment building's street address: the IP
address identifies the machine, and the port identifies a particular
program running on the machine (Figure 2.2). Ports are identified by a
number from 0 to 65,535. When a server starts up, it notifies the operating
system to reserve a particular port. On Unix systems port numbers
below 1024 are privileged: They can only be reserved by servers
run by the root user (also known as the superuser). The other ports are
available for anyone's use. (Personal computers don't have this restriction
on the use of low-numbered ports.) Well-known ports are those which, by
convention, are assigned to be used for particular services (Table 2.2). For
example, port 23 is used for Telnet, and port 80 is used for the Web's
hypertext transfer protocol, HTTP.
`FIGURE 2.2 Clients Use Well-Known Port Numbers to Identify Particular Server
`Programs Running on a Host
TABLE 2.2 Well-Known Ports for Common Protocols
Protocol              Port
NNTP (Usenet news)    119
WAIS                  210
* This discussion glosses over the fact that there are really two low-level
TCP/IP communications protocols: TCP, a reliable protocol suitable for sending
long streams of data, and UDP, an unreliable protocol suitable for exchanging
brief messages. Although TCP is preferred by most servers, including all the
servers discussed in this book, some specialized servers use UDP instead. A
TCP and a UDP program can both use the same port number without conflict,
because in actuality a network program is uniquely identified by the
combination of an IP address, a port number, and a communications protocol.
`Daemons and
`For example, when a Web server starts up, it reserves port 80 for its
`exclusive use (unless it's been configured to use a different one). Incoming
`clients know they should use port 80 for connecting to HTTP servers,
`making the rendezvous successful.
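
The point that a program is identified by the combination of address, port number, and protocol can be tried out directly. The sketch below, using Python's standard socket module, reserves the same port number once for TCP and once for UDP on the local machine. The function name is made up for illustration, and the port is whatever free unprivileged port the operating system hands out, not a well-known one. The retry loop is only there in case some unrelated program already holds the UDP port.

```python
import socket


def reserve_tcp_and_udp(host="127.0.0.1", attempts=5):
    """Bind a TCP socket and a UDP socket to the same port number.

    Returns (tcp_socket, udp_socket, port). The caller closes both sockets.
    """
    for _ in range(attempts):
        tcp = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        udp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        try:
            tcp.bind((host, 0))              # let the OS pick a free TCP port
            port = tcp.getsockname()[1]
            udp.bind((host, port))           # same number, different protocol
            return tcp, udp, port
        except OSError:
            # The UDP side of that number was taken; try another port.
            tcp.close()
            udp.close()
    raise RuntimeError("could not find a port free for both TCP and UDP")


if __name__ == "__main__":
    tcp, udp, port = reserve_tcp_and_udp()
    print("TCP and UDP both bound to port", port)
    tcp.close()
    udp.close()
```

Binding a second TCP socket to the same address and port while the first is still open would normally fail, which is what lets a server reserve a port for its exclusive use.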
`In Unix systems, servers are run in either of two modes: stand-alone or
`under the control of a program called inetd. Stand-alone servers, also
`known as daemons, follow the model described earlier. They start up, listen
`for incoming connections, service the requests, and then go back to listen-
`ing. Most daemons can service multiple simultaneous incoming connec-
`tions. They do this by ”forking” a copy of themselves whenever there's a
`new incoming connection. The copy handles the request, leaving the origi-
`nal free to listen for new requests.
It's possible for a system to support dozens of servers, each one assigned
to a different port. At any time, only a fraction of them are actually doing
any work; the rest are just hanging around, waiting for a connection, and
consuming memory needlessly. To prevent this waste, the "super daemon"
inetd was invented. When inetd starts up it reads a configuration file
that gives it a list of ports to listen to and servers to run in response to
incoming connections on each port. When a client connects to one of these
ports, inetd quickly launches the designated server and hands off the
connection to it. When the communication is finished, the server exits, releasing
system resources. inetd will launch it again when needed.
Most servers, including the FTP, Telnet, and Gopher servers, run
under inetd. Although Web HTTP servers can be configured to run this
way as well, they usually aren't. Web servers, large programs with long
and complex configuration files, take a significant amount of time to
launch, and performance suffers seriously when they are run under inetd. For
this reason, Web servers are usually run in stand-alone daemon mode.
`Uniform Resource Locators
Because browsers speak many different protocols, there has to be some
unambiguous way of telling them how and where to find an item of
interest on the Internet. This is done through Uniform Resource Locator (URL)
notation, a straightforward way of indicating the protocol, host, and
location of an Internet resource. If you've used any of the Web browsers,
you're already familiar with URLs: they are the "address" of a Web page.
The anatomy of an URL is diagrammed in Figure 2.3. The first part of
the URL specifies the communications protocol. It's separated from the
rest of the URL by a colon. The second part, beginning with a double slash
`and ending with a single slash, is the name of the host machine on which
`the resource resides and optionally the communications port to which you
`will connect. It's only necessary to specify the port if for some reason the
`remote server has been configured to use a nonstandard port. Otherwise
`the default port will be used (see Table 2.2 for a list of default ports). The
`host can be specified either by name (preferred), or by dotted Internet
`address. The rest of the URL is the path, a string of characters that tells the
`server how to locate the resource. Its format is different for each of the
`protocols: In some cases it will be the path to a file; in others it will be a
`query used to retrieve a document from a database or other program.
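
The parts just described can be pulled out of an URL with Python's standard urllib.parse module. The sketch below is one possible illustration rather than a description of how any browser does it: the anatomy function and the DEFAULT_PORTS table are made up for this example, and the table lists only the default ports this chapter mentions (80 for HTTP, 23 for Telnet, 210 for WAIS).

```python
from urllib.parse import urlsplit

# Default ports taken from this chapter; the table itself is illustrative.
DEFAULT_PORTS = {"http": 80, "telnet": 23, "wais": 210}


def anatomy(url):
    """Split an URL into protocol, host, port, and path."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,
        "host": parts.hostname,
        # Fall back to the protocol's default port when none is given.
        "port": parts.port or DEFAULT_PORTS.get(parts.scheme),
        "path": parts.path,
    }


if __name__ == "__main__":
    print(anatomy("http://www.capricorn.org:8080/expensive_fish/kobi.html"))
    print(anatomy("http://www.capricorn.org/careers/heavy_industry.html"))
```

For the first URL the port comes from the URL itself (8080); for the second it is filled in from the default table (80).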
Only some characters are legal within URLs. Upper and lowercase
letters, numerals, and the characters $ _ @ . - are OK. The characters
= ; / # ? : % & + and the space character are also legal but have special
meanings. Everything else, including tabs, spaces, carriage returns, newlines,
accented characters, and other symbols, is illegal. To include these
characters in an URL they must be escaped, using an escape code consisting of
the % sign followed by the two-digit hexadecimal code of the character.
For example, a carriage return can be entered into a URL with "%0D", a
space with "%20", and the percent sign itself with the sequence "%25".
You'll find a list of ASCII codes in Table 2.3 as well as in Appendix B.
`It can be difficult to remember which characters are legal and which
`aren't. Fortunately, most browsers are pretty forgiving. Commonly used
`”illegal” characters, such as the ~ symbol, are automatically translated
`into the correct escape code by browsers before being sent to the server.
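
The escape rule is mechanical enough to write down in a few lines. In the Python sketch below, escape_char is a hypothetical helper that handles single ASCII characters only, while quote and unquote are the standard library routines in urllib.parse that do the same job for whole strings. Which characters quote leaves alone is a property of that library and may differ from the list given in the text.

```python
from urllib.parse import quote, unquote


def escape_char(ch):
    """Return the escape code for one ASCII character: % plus two hex digits."""
    return "%{:02X}".format(ord(ch))


if __name__ == "__main__":
    for ch in ["\r", " ", "%"]:
        print(repr(ch), "->", escape_char(ch))
    # Escaping and unescaping a whole path with the standard library.
    print(quote("/games/llama attack.html"))
    print(unquote("/%7Efred/tapir.gif"))
```

The three escape codes printed first are the same ones used as examples in the text.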
host name: www.capricorn.org
http://www.capricorn.org:8080/expensive_fish/kobi.html
FIGURE 2.3 Anatomy of an URL
Complete Versus Partial URLs
URLs can be complete, partial, or relative. Complete URLs contain all parts
of the URL, including the protocol part, the host name part, and the
document path. A hypertext link containing a complete URL will always point
the browser to the correct location. An example of a complete URL is:
http://www.capricorn.org/careers/heavy_industry.html
TABLE 2.3 ASCII Character Codes
Dec Hex Char
In contrast, an example of a partial URL is the simpler
/careers/heavy_industry.html
In partial URLs, the protocol and host name parts are left off and the
URL begins with the path name part. When browsers encounter links
containing this type of URL, they interpret the URL relative to the current
page, assuming the same protocol and host name. In the preceding
example, if the user is viewing the document
http://www.capricorn.org/heavy_industry.html
and selects a link referring to URL /careers/steel.html, the browser
would interpret this partial URL as if it were written out as
http://www.capricorn.org/careers/steel.html
This shorthand notation can be taken even further to create relative URLs.
In this type not only are the protocol and host omitted, but part of the path
is left out as well, as in the stripped down
strip_mining.html
Everything, including the path itself, is now interpreted relative to the
current document. The path names of relative URLs follow the same
conventions as relative paths in the Unix and MS-DOS file systems. The directory
name "." is used to indicate the current directory and the name ".." is used
to indicate the directory above the current one. So the relative URL
automotive/openings.html refers to a document in a directory below
the current document, whereas ../light_industry.html tells the
browser to hop up one level before looking for the document.
`Relative URLs are most useful for creating logically linked sets of doc-
`uments within a site. The documents refer to each other using relative
`links only, allowing the entire set to be moved from place to place within a
`site, or even to a new site entirely without changing all the links. Absolute
`URLs are usually used to refer to documents located at remote sites.
`Chapter 5 shows how this works.
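
The way a browser expands partial and relative URLs against the current page can be reproduced with urljoin from Python's standard urllib.parse module. This is a sketch of the resolution rules, not of any particular browser; the resolve helper is just a thin, illustrative wrapper, and the capricorn.org pages are the chapter's own fictional examples.

```python
from urllib.parse import urljoin


def resolve(current_page, link):
    """Expand a link found on current_page into a complete URL."""
    return urljoin(current_page, link)


if __name__ == "__main__":
    page = "http://www.capricorn.org/careers/heavy_industry.html"
    for link in ["/careers/steel.html",        # partial: same protocol and host
                 "strip_mining.html",          # relative: same directory
                 "automotive/openings.html",   # relative: directory below
                 "../light_industry.html"]:    # relative: one level up
        print(link, "->", resolve(page, link))
```

Because none of the relative links mention the host or the enclosing directory, the same set of documents keeps working if it is moved to another directory or another site.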
Specific URLs
`There are as many different kinds of URLs as there are protocols supported
`by browsers. This section lists the common ones, and Table 2.4 gives a
`quick summary.
`These are the most basic of URLs. They specify a file located on the local
`machine. The general form of a file URL is:
file:///path_to_the_file
TABLE 2.4 Common URLs

Local files
  file:///usr/local/birds/emus.gif
      A file on the local computer
HTTP protocol
  http://a.remote.host/birds/emus.gif
      A file on an HTTP server
  http://a.remote.host/birds/
      A directory listing on an HTTP server
  http://a.remote.host/cgi-bin/search?emu
      An executable script on an HTTP server
  http://a.remote.host/cgi-bin/search
      An executable script without parameters
  http://a.remote.host/~fred/tapir.gif
      A file in a user-supported HTTP directory
  http://a.remote.host/~fred/
      A listing of a user-supported HTTP directory
FTP protocol
  ftp://a.remote.host/pub/emus.gif
      A file on an anonymous FTP server
  ftp://a.remote.host/pub/server
      A directory listing on an anonymous FTP server
  ftp://fred:xyzzy@a.remote.host/letter.txt
      A file on an FTP server that requires a user name
Gopher protocol
  gopher://a.remote.host/
      Top-level menu of a Gopher host
Telnet protocol
  telnet://a.remote.host/
      Telnet to a remote host
SMTP protocol
  mailto:fred@bedrock.capricorn.org
      Send mail to user
NNTP protocol
  news:comp.infosystems.www.providers
      Read recent news in a newsgroup
WAIS protocol
  wais://a.remote.host/birds_of_NA?emu
      WAIS search on the named document index
`The host name and port should always be left blank in this type of URL
`(with one exception, as discussed later). Following this is the full path
`name to the file of interest using whatever notation is appropriate for the
`browser's operating system (slash for Unix, backslash for DOS, and colon
`for Macintosh OS). Most if not all browsers are kind enough to translate
`the Unix path notation into the local language, so a Unix-style path name,
`using slashes to separate directories, always works.
File URLs should never be used in documents intended to be served over
the Web. Say a user is browsing an HTML document that contains a link
to file:///usr/local/games/llama_attack. When the user selects
this link the browser will attempt to retrieve a file named llama_attack
from the user's local file system, which is probably not what was intended.
File URLs are best used during testing of a set of HTML documents, or for
documents that are intended for local consumption only. However, a
better solution is to use relative URLs during the development of a set of
linked pages. Otherwise all the links will have to be revised when you
move the finished documents into place.
It's possible for a file URL to specify a host in the host name section. If
it does so, the URL isn't treated as a file URL at all, but as an FTP URL.
The browser will attempt to retrieve the file via the anonymous file
transfer protocol as described later. This is an archaic feature included for
backward compatibility with old documents and should be avoided.
Web servers, by definition, speak HTTP. Naturally enough, HTTP URLs
account for the vast majority of URLs that you will see. The format of an
HTTP URL is:
http://hostname:port/path/to/the/resource
As with other URLs, you need only specify the communications port if
the remote HTTP server is configured to use something other than the
standard port 80. The resource path has exactly the same format as a Unix
path name: The slashes separate a hierarchy of directories. Double dots (..)
can be used to move up in the directory hierarchy and a single dot (.)
indicates the current directory.
Although the path used in an HTTP URL looks like a Unix path, it
doesn't usually correspond exactly to a real physical file path on the
remote machine. For one thing, the Web server interprets the URL path
relative to the document root directory set in the server's configuration
(the next chapter describes how this is done). For example, the URL
http://www.capricorn.org/cooking/Curry.html
may very well point to a file physically located on host capricorn.org
at
/local/web/cooking/Curry.html.
The path part of this kind of URL is often called a virtual path.
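
The translation from virtual path to physical path amounts to gluing the URL path onto the document root. The Python sketch below shows one way this could be done; it is an assumption for illustration, not the code of any actual Web server. The to_physical_path name is invented, and /local/web is the document root from the example above.

```python
import posixpath


def to_physical_path(document_root, virtual_path):
    """Map the path part of an HTTP URL onto a file below the document root."""
    # Collapse "." and ".." first so the result cannot climb above the root.
    cleaned = posixpath.normpath("/" + virtual_path.lstrip("/"))
    return posixpath.join(document_root, cleaned.lstrip("/"))


if __name__ == "__main__":
    print(to_physical_path("/local/web", "/cooking/Curry.html"))
    print(to_physical_path("/local/web", "/../../etc/passwd"))
```

Normalizing the path before joining it means that a request full of double dots still lands inside the document root instead of exposing the rest of the file system.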
The response by the HTTP server to the request for a particular URL is
somewhat different depending on the resource type. If the path name
points to a file, the server will return its contents. The browser can then do
whatever is appropriate for the type. If the path name points to a directory,
the HTTP server will do one of two things. If the directory contains a
welcome page (often named welcome.html or index.html), this document
will be retrieved and sent to the browser. This is how to drop the user into
the welcome page when she accesses the site's root directory with an URL
like http://www.capricorn.org/. If no such file exists, the server will
construct a directory listing on the fly and send it back to the browser.
Depending on the server configuration, this listing may contain icons,
hypertext links, file descriptions, and the contents of any README files
found in the directory (examples of directory listings are shown in the next
chapter, Figures 3.1 and 3.2). Servers can also be configured to ignore
certain types of files or to give others special treatment. Refer to the next
chapter for full details on configuring your server for the various directory
listing display options.
`HTTP URLs can also point to executable scripts. When an HTTP server
`receives a request for an URL that involves a server script, it invokes the
`program and sends the program's output to the browser. You can't tell
`from looking at it whether an URL points to a regular document or to a
`script, but if you do happen to know that a particular URL points to an
`executable script, you can pass information to it by following the URL
`with a question mark and a query string:
`The format of the query string can get fairly complex and is taken up in
`more detail in Chapter 8.
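
As a rough preview, the Python sketch below builds and takes apart query strings with the standard urllib.parse module. The script URL is the one from Table 2.4; the parameter names bird and region are invented, and the name=value form shown is a widely used convention that is assumed here rather than taken from this chapter.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def add_query(script_url, params):
    """Append a query string built from a dict of parameters."""
    return script_url + "?" + urlencode(params)


def read_query(url):
    """Return the raw query string that follows the question mark."""
    return urlsplit(url).query


if __name__ == "__main__":
    # Simplest form: a single keyword after the question mark.
    print(read_query("http://a.remote.host/cgi-bin/search?emu"))
    # Named parameters (hypothetical names); spaces are encoded for transport.
    url = add_query("http://a.remote.host/cgi-bin/search",
                    {"bird": "emu", "region": "North America"})
    print(url)
    print(parse_qs(read_query(url)))
```

Everything before the question mark still identifies the script; only the part after it is handed to the program as input.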
`Another common type of HTTP URL looks like
This points to a user-supported directory, a set of pages located in user
fred's home directory. This feature lets ordinary users of the Web host
create and maintain their own home pages.
`FTP (file transfer protocol) is one of the oldest and probably still the most
`popular of the methods for moving files around the Internet. The usual
`FTP URL looks like
ftp://hostname/path/to/the/file
`The browser will attempt to retrieve the file pointed to by an FTP URL by
`connecting to the specified host via anonymous FTP and issuing the correct
`sequence of commands to download the indicated file. If the URL points to a
`directory rather than a file, the browser constructs a directory listing that can
`be used for selecting files or for navigating to other directories. This means
that the simple ftp://hostname/ can be used to browse an entire FTP site.
`Some FTP sites require a user name and password for access. These
`sites can be handled with the full form of the FTP URL:
ftp://user:password@hostname:port/path/to/the/file
For example, here's an URL that can be used for retrieving a file under the
user name fred, password bedrock:
ftp://fred:bedrock@www.capricorn.org/strip_mining.html
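
The full form of the FTP URL can be taken apart with Python's standard urllib.parse module, as sketched below. The ftp_login_details helper is hypothetical, and treating a missing user name as an anonymous login is an assumption drawn from the description of anonymous FTP above; the URLs are the chapter's own fictional examples.

```python
from urllib.parse import urlsplit


def ftp_login_details(url):
    """Extract user, password, host, port, and path from an FTP URL."""
    parts = urlsplit(url)
    return {
        # No user name in the URL is treated here as an anonymous login.
        "user": parts.username or "anonymous",
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,  # None when the URL does not name a port
        "path": parts.path,
    }


if __name__ == "__main__":
    print(ftp_login_details(
        "ftp://fred:bedrock@www.capricorn.org/strip_mining.html"))
    print(ftp_login_details("ftp://a.remote.host/pub/emus.gif"))
```

Because the password travels in plain sight inside the URL, this form is best kept out of documents that other people can read.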
