`
`Proceedings of the
`1995 USENIX Technical Conference
`
`January 16 - 20, 1995
`New Orleans, Louisiana, USA
`
`
`
`
`For additional copies of these proceedings, write:
`
`USENIX Association
`2560 Ninth Street, Suite 215
`Berkeley, CA 94710 USA
`
`The price is $30 for members and $39 for nonmembers.
Outside the USA and Canada, please add
`$20 per copy for postage (via air printed matter).
`
`Past USENIX Technical Conferences
`
`1994 Summer Boston
`1994 Winter San Diego
`1993 Summer Cincinnati
`1993 Winter San Diego
`1992 Summer San Antonio
`1992 Winter San Francisco
`1991 Summer Nashville
`1991 Winter Dallas
`1990 Summer Anaheim
`1990 Winter Washington, DC
`1989 Summer Baltimore
`1989 Winter San Diego
`
`1988 Summer San Francisco
`1988 Winter Dallas
`1987 Summer Phoenix
`1987 Winter Washington, DC
`1986 Summer Atlanta
`1986 Winter Denver
`1985 Summer Portland
`1985 Winter Dallas
`1984 Summer Salt Lake City
`1984 Winter Washington, DC
`1983 Summer Toronto
`1983 Winter San Diego
`
`1995 © Copyright by The USENIX Association
`All Rights Reserved.
`
This volume is published as a collective work. Rights to individual
papers remain with the author or the author's employer. Permission is
granted for the noncommercial reproduction of the complete work for
educational or research purposes. USENIX acknowledges all trademarks
herein.
`
`ISBN 1-880446-67-7
`
Printed in the United States of America on 50% recycled paper, 10-15% post consumer waste.
`
`
`
`
`SIFT - A Tool for
`Wide-Area Information Dissemination *
`
`Tak W. Yan Hector Garcia-Molina
`Department of Computer Science
`Stanford University
Stanford, CA 94305
{tyan, hector}@cs.stanford.edu
`
`Abstract
`
The dissemination model is becoming increasingly important in wide-area
information systems. In this model, the user subscribes to an information
dissemination service by submitting profiles that describe his interests.
He then passively receives new, filtered information. The Stanford
Information Filtering Tool (SIFT) is a tool to help provide such a
service. It supports full-text filtering using well-known information
retrieval models. The SIFT filtering engine implements novel indexing
techniques, capable of processing large volumes of information against a
large number of profiles. It runs on several major Unix platforms and is
freely available to the public. In this paper we present SIFT's approach
to user interest modeling and user-server communication. We demonstrate
the processing capability of SIFT by describing a running server that
disseminates USENET News. We present an empirical study of SIFT's
performance, examining its main memory requirement and ability to scale
with information volume and user population.
`
`1
`
`Introduction
`
Technological advances have made wide-area information sharing
commonplace. A suite of tools has
`
*This research was sponsored by the Advanced Research Projects Agency
(ARPA) of the Department of Defense under Grant No. MDA972-92-J-1029 with
the Corporation for National Research Initiatives (CNRI). The views and
conclusions contained in this document are those of the authors and should
not be interpreted as necessarily representing the official policies or
endorsement, either expressed or implied, of ARPA, the U.S. Government,
or CNRI.
`
emerged for network information finding and discovery; e.g., Wide-Area
Information Servers (WAIS) [KM91], archie [ED92], World-Wide Web (WWW)
[BLCGP92], and gopher [McC92]. However, these new tools have one important
missing element. They provide a means to search for existing information,
but lack a mechanism for continuously informing the user of new
information. The exploding volume of digital information makes it
difficult for the user, equipped with only search capability, to keep up
with the fast pace of information generation. Instead of making the user
go after the information, it is desirable to have information selectively
flow to the user. In an information dissemination (a.k.a. alert,
information filtering, selective dissemination of information) service,
the user expresses his interests in a number of long-term, continuously
evaluated queries, called profiles. He then passively receives documents
filtered according to the profiles. Such a service will become
increasingly important and form an indispensable tool for the dynamic
environment of wide-area information systems.

A very simple kind of information dissemination service is already
available on the Internet: mailing lists (see e.g., [Kro92]). Hundreds of
mailing lists exist, covering a wide variety of topics. The user
subscribes to lists of interest to him and receives messages on the topic
via email. He may also send messages to the lists to reach other
subscribers. LISTSERV is a software system for maintaining mailing lists.
A problem with the mailing list mechanism as a tool for information
dissemination is that it provides a crude granularity of interest
matching. A user whose information need does not exactly match certain
lists will either receive too many irrelevant or too few relevant
messages. The USENET News (or Netnews) system [Kro92], an electronic
bulletin board system on the Internet, is similar in nature to mailing
lists. While Netnews is extremely successful, with millions of users and
megabytes of daily traffic, at the same time it often creates an
information overload. Like mailing lists, the coarse classification of
topics into newsgroups means that a user subscribing to certain
newsgroups may not find all articles interesting, and also that he will
miss relevant articles posted in newsgroups that he does not subscribe to.
`
Recent research efforts on information filtering focus on filtering
effectiveness, attempting to provide more fine-grained filtering using
relational, rule-based, information retrieval (IR), and artificial
intelligence approaches. With few exceptions, they are often small scale
(i.e., involving a small number of users or profiles), and thus the need
to provide efficient filtering is not apparent. However, in a large-scale
wide-area system where the number of information providers and seekers is
large, efficiency in the dissemination process is an important issue and
must be addressed.
`
The Stanford Information Filtering Tool (SIFT) is a tool for information
providers to perform large-scale information dissemination. It can be
used to set up a clearinghouse service that gathers large amounts of
information and selectively disseminates the information to a large
population of users. It supports full-text filtering, using well-known
and well-studied IR models. The SIFT filtering engine implements novel
indexing techniques, capable of scaling to large numbers of documents and
profiles. It runs on several major Unix platforms and is freely available
to the public by anonymous ftp at the URL:

ftp://db.stanford.edu/pub/sift/sift-1.0.tar.Z
`
In this paper we describe SIFT. We present its approach to user interest
modeling and user-server communication. We demonstrate the processing
capability of SIFT by describing a running server that disseminates tens
of thousands of Netnews articles daily to some 13,000 subscriptions. We
describe the implementation of SIFT, focusing on the filtering engine.
Finally, we present an empirical study of SIFT's performance, examining
its main memory requirement and ability to scale with information volume
and user population.
`
`2 Other Previous Work
`
The Boston Community Information System [GBBL85] is an experimental
information dissemination system. Like SIFT, it allows a finer granularity
of interest matching than mailing lists or Netnews. Users can express
their interests with IR-style, keyword-based profiles. The system
broadcasts new information via a radio channel to all users, who then
apply their own filters locally. While the radio communication channel
makes broadcast inexpensive, local processing of mostly irrelevant
information is very expensive. (For every user, a personal computer is
dedicated to this purpose; this is cited as a major source of complaints
from the users.)
Information Lens [MGT+87] provides categorization and filtering of
semi-structured messages such as email. The user defines rules for
filtering, and the processing is done at the user site. It provides
effective filtering, but local processing is expensive for large-scale
information dissemination. Similarly, the "kill file" mechanism in
certain news reader programs allows the user to locally screen out
irrelevant articles. A kill file only removes specified articles from
newsgroups that a user subscribes to; it does not discover relevant
articles in other newsgroups. To provide the same kind of filtering power
as SIFT would require much local processing. It is more cost-effective to
pool profiles together to share the processing overhead.
The Tapestry system [GNOT92] is a research prototype that uses the
relational model for matching user interests and documents; filtering
computation is done not on the properties of individual documents, but
rather on the entire append-only document database. Efficient query
processing techniques are proposed for handling this kind of query.
Tapestry is built on top of a commercial relational database system. The
Pasadena system [WF91] investigates the effectiveness of different IR
techniques in filtering. It collects new documents from several Internet
information sources, and periodically runs profiles against them. Similar
to SIFT, it uses a clearinghouse approach, but efficient filtering is not
addressed.
In [YGM94b, YGM94c] we proposed a variety of indexing techniques for
speeding up information filtering under IR models. There we evaluated
these techniques using analysis and simulation. The SIFT filtering engine
is a real implementation of one class of index structures that we found
to be efficient.
`
`178
`
`1995 USENIX Technical Conference- January 16-20, 1995- New Orleans, LA
`
`
`
`
3.1.1 Filtering Model
`
The interest profile can be expressed in one of two IR models: boolean
and vector space [Sal89]. We first focus on the vector space model.

In the vector space model, queries and documents are identified by terms,
usually words. If there are m terms for content identification, then a
document D is conceptually represented as an m-dimensional vector
D = (w1, w2, ..., wm), where weight wi for term ti signifies its
statistical importance, such as its frequency in the document. We may
also write D as ((ti1, wi1), ..., (tik, wik)) where wij != 0; e.g.,
((underwater, 60), (archeology, 60)). A query is similarly represented.
For a document-query pair, a similarity measure (such as the dot product)
can be computed to determine how "similar" the two are. In an IR
environment, the top ranked documents are retrieved for a query.
In an information filtering setting, a key question is how many documents
to return to the user. We could have allowed the user to receive a fixed
number of top ranked documents per update period, e.g., as done in
[FD92]. However, this is not very desirable: in a period when there are
many relevant documents, he may miss some (low recall [1]); in a period
when there are few interesting documents, he may receive irrelevant ones
(low precision [1]). We instead allow the user to specify a relevance
threshold, which is the minimum similarity score that a document must
have against the profile for it to be delivered. A default value is
supplied to the user for convenience.
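The scoring rule just described can be sketched in a few lines of Python (an illustration only; the term weights and the threshold value are made up, and SIFT itself is written in C):

```python
# Vector-space matching with a relevance threshold. Profiles and documents
# are sparse term-weight vectors (dicts); the weights and the threshold
# value below are illustrative, not SIFT defaults.
def dot_product(profile, document):
    """Similarity = sum of weight products over shared terms."""
    return sum(w * document[t] for t, w in profile.items() if t in document)

def matches(profile, document, threshold):
    """Deliver the document only if its score reaches the user's threshold."""
    return dot_product(profile, document) >= threshold

profile = {"underwater": 60, "archeology": 60}
document = {"underwater": 80, "archeology": 80, "shipwreck": 10}

print(dot_product(profile, document))              # 60*80 + 60*80 = 9600
print(matches(profile, document, threshold=5000))  # True
```

Note that terms appearing only in the document (here "shipwreck") contribute nothing to the score; only terms shared with the profile matter.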
Instead of using the vector space model, the user may use boolean
profiles to specify words that he wants in documents received, and words
to be excluded. For example, the boolean profile "fly fishing not
underwater" is for documents that contain both the words "fly" and
"fishing" but not the word "underwater." The reader may note that the
SIFT boolean model only allows conjunction and negation of words.
However, the user may approximate disjunction semantics by submitting
multiple subscriptions (though a document may then match more than one
subscription).
`
3.1.2 Profile Construction and Modification
`
To assist the user with the construction of a profile, a SIFT server
provides a test run facility. The user may
`
[1] Recall is the proportion of relevant documents returned, and
precision is the proportion of documents returned that are relevant.
`
`Figure 1: An overview of SIFT
`
`3 SIFT
`
We begin our presentation of SIFT with an example. Suppose a user is
interested in underwater archeology (Figure 1). He sends an email
subscribe request to a SIFT server specifying the profile "underwater
archeology," and optionally parameters that control, for example, how
often he wants to be updated or how long his subscription is for. He may
alternatively access the SIFT server via WWW, using a graphical WWW
client interface to fill out a form with the subscription information.
(Before he subscribes, he may test run his profile against an index of a
sample document collection.) His subscription is stored in the
subscription database. As the SIFT server receives new documents, the
filtering engine will process them against the stored subscriptions, and
notifications will be sent out based on the user-specified parameters.

In the following we detail the SIFT system from the perspectives of user
interest modeling and communication protocol. We then illustrate SIFT
using the Netnews SIFT server.
`
3.1 User Interest Modeling
`
A user subscribes to a SIFT server with one or more subscriptions, one
for each topic of interest. A subscription includes an IR-style profile
and, as mentioned, additional parameters to control the frequency of
updates, the amount of information to receive, and the length of the
subscription. A subscription is identified by the email address of the
user and a subscription identifier.
`
`
`
`
run his initial profile against an existing, representative collection of
documents to test the effectiveness of his profile. He may interactively
change the profile and (for weighted profiles) adjust the threshold to
the desired level of precision and recall. When he is satisfied with the
performance of the filtering, he may then subscribe with the selected
settings.

After the user receives some periodic updates, he may decide to modify
his profile or change the threshold for vector space profiles. He may do
this by accessing the SIFT server. Furthermore, for vector space
profiles, relevance feedback [Sal89], a well-known technique in IR to
improve retrieval effectiveness, can be used. The user simply gives SIFT
the documents that he finds interesting; after examining them, the server
adjusts the weights of the words in the user's profile accordingly.
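The paper does not state the exact weight-adjustment formula; a common choice in the IR literature is a Rocchio-style update [Sal89], sketched below with illustrative mixing constants:

```python
# Simplified Rocchio-style relevance feedback: shrink the current profile
# and add a fraction of the mean vector of documents the user marked
# interesting. The alpha/beta constants are illustrative assumptions; the
# paper does not specify SIFT's exact formula.
def feedback(profile, relevant_docs, alpha=0.8, beta=0.2):
    new_profile = {t: alpha * w for t, w in profile.items()}
    for doc in relevant_docs:
        for t, w in doc.items():
            new_profile[t] = new_profile.get(t, 0.0) + beta * w / len(relevant_docs)
    return new_profile

profile = {"underwater": 60.0, "archeology": 60.0}
liked = [{"underwater": 80.0, "shipwreck": 40.0}]
print(feedback(profile, liked))
# underwater: 0.8*60 + 0.2*80 = 64; archeology: 48; shipwreck: 8
```

Words that occur in liked documents but not in the profile (here "shipwreck") enter the profile with a small weight, which is how feedback can broaden a subscription over time.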
`
`3.2 Communication Protocol
`
There are two modes of communication between the user and a SIFT server.
In the interactive mode, the user subscribes, test-runs a profile, and
views, updates, or cancels his subscriptions. In the passive mode, the
user periodically receives information updates. Instead of developing a
new communication protocol, we make use of current technologies: email
[Cro82] and World-Wide Web HTTP [BLCGP92].

First we discuss the interactive communication mode. Email communication
is the lowest common denominator of network connectivity. By having an
email interface, a SIFT server is accessible from users with less
powerful machines, with limited network capability, or behind
Internet-access firewalls. We adopt the LISTSERV mailing list syntax
wherever possible, to ensure that minimal learning is required. We also
use default settings to reduce the complexity of email requests for
novice users.

As the appeal of hypermedia navigation and the development of
sophisticated client interfaces are making WWW the preferred tool for
wide-area information sharing, we also developed a WWW access interface
for SIFT. Using a WWW client program, the user interacts with the SIFT
server through a user-friendly graphical interface. We believe the dual
email and WWW access covers the tradeoff between wide availability and
ease of use.

In the passive mode of user notification, a SIFT server sends out email
messages that contain excerpts of new, potentially relevant documents (a
certain number of lines from the beginning, as specified by the user).
After the user reads the excerpts, he may access the SIFT server to
retrieve the entire documents. Currently the excerpts are formatted to be
read from regular mail readers. We plan to offer the option of formatting
them in HTML, so that the user may view the notifications from
sophisticated WWW viewers, and then interactively retrieve interesting
documents from SIFT or provide feedback via HTTP.
`
`3.3 SIFTing Netnews
`
Using SIFT, we have set up a server for selectively disseminating Netnews
articles (text articles only; binary ones are first screened out). Like
any other SIFT server, the user accesses the Netnews SIFT server via
email or WWW. The reader is encouraged to try it out: for email access,
please send an electronic message to netnews@db.stanford.edu, with the
word "help" in the body; for WWW access, please connect to
http://sift.stanford.edu. In February 1994, we publicized the Netnews
server in two newsgroups; within ten days of the announcement, we
received well over a thousand profiles. The number of profiles keeps
increasing and now (November 1994) exceeds 13,000, submitted by users
from almost all continents. Table 1 shows some interesting statistics
obtained from the server on the day of November 10, 1994. The average
number of articles is over the week of November 4-10, 1994.
`
Number of subscriptions                  13,381
Number of users                           5,146
Daily average number of articles         45,127
Average notification period (in days)       1.4
`
`Table 1: Netnews SIFT server statistics
`
Apparent from these numbers is that the load on the SIFT server is very
high. It is necessary to match an average of over 45,000 articles against
some 13,000 profiles and deliver most updates within a day. The efficient
implementation of the SIFT filtering engine enables the job to be done on
regular hardware, a DECstation 5000/240. Even though we believe SIFT is
efficient, there is a limit to the load that it can handle. It is
necessary to replicate the server; in fact, we have been in contact with
several sites (in the U.S. and Europe) that have expressed interest in
providing the same service.
The Netnews SIFT server is not meant to be a replacement for the Netnews
system, which is an invaluable tool for carrying on distributed
discussions. Rather, it provides a complementary filtering service. Only
the beginning lines of matched articles are sent out. The user may, after
determining that an article is relevant, access his own local news host
to retrieve the article. In cases where he does not have access to a news
host, he may request that the whole article be sent from the SIFT server.

Sifting Netnews is just one application of SIFT. We have set up another
SIFT server, disseminating Computer Science Technical Reports (email
access: elib@db.stanford.edu; WWW access will soon be available). This
server has close to 1,000 subscriptions (November 1994).
`
To provide the test run capability, we need to index a collection of
documents. This can be done by some publicly available software, and we
chose to use the index engine in the WAIS distribution (called
waisindex). From our experience it is a very robust implementation.
However, since we use a different scoring scheme than WAIS, we made
slight modifications to the WAIS code to make it compatible with our
filtering engine (i.e., giving the same results). The major changes are
to support the use of the relevance threshold and the processing of
boolean queries. The modified waisindex code is included in our SIFT
distribution.
`
`4
`
`Implementation
`
SIFT is implemented in C and has been compiled on several Unix platforms:
DEC Ultrix 4.2, HP-UX 8.07, and SunOS 4.1. In the following we first give
an overview of the components of the system and then focus on the
implementation of the filtering engine.
`
`4.1 System Components
`
The program mailui implements the email request handler shown in Figure
1. It parses email messages, assuming the format in [Cro82]. User
requests are processed by accessing the subscription database and the
document index (for test runs).

The program wwwui is a so-called "cgi-script" for use with the HTTP
daemon released by the National Center for Supercomputing Applications.
It accepts requests submitted by users using a WWW client (with a
form-filling graphical interface) and, similar to mailui, accesses the
subscription database and document index to process the requests.
The program filter implements the filtering engine. It accepts as input a
list of documents (in the form of Unix path names) and matches them
against the stored profiles. The implementation details are discussed in
the next subsection. The output from filter is a file of matchings.

After the filtering is completed, the matchings are sorted by user and
subscription identifier. This is necessary because SIFT sends out updates
on a per-subscription basis. After sorting, the program alert can be used
to send out the messages one by one, using Unix sendmail. Included in the
messages are excerpts from the matched documents (a few lines from the
beginning of the document).
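The sort-then-notify step can be sketched as follows (a Python illustration; the addresses and document names are invented, and the real pipeline is a C program driving the Unix sort command and sendmail):

```python
from itertools import groupby

# Group document-profile matchings per (user, subscription id) before
# notifying, mirroring the sort step described above. All data is made up.
matchings = [
    ("alice@x.edu", 2, "doc_17"),
    ("bob@y.org",   1, "doc_03"),
    ("alice@x.edu", 1, "doc_05"),
    ("alice@x.edu", 2, "doc_42"),
]

matchings.sort(key=lambda m: (m[0], m[1]))       # by user, then subscription id
grouped = {
    key: [doc for _, _, doc in group]
    for key, group in groupby(matchings, key=lambda m: (m[0], m[1]))
}
for (user, sub_id), docs in grouped.items():
    print(f"notify {user} (subscription {sub_id}): {docs}")
```

Sorting first is what lets all excerpts for one subscription be bundled into a single email message rather than one message per matching.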
`
`4.2 Filtering Engine Implementation
`
In IR, to efficiently process queries against a collection of documents,
an inverted index [Sal89] of the documents is built, which associates a
word with its occurrences in the documents. In the information filtering
scenario, we may use the same approach: an index is built of a batch of
new documents, and profiles are treated as stored queries that are run
periodically. This is the approach used in an earlier SIFT prototype. We
refer to this as the document index approach. This approach may not be
very appropriate with a high volume of new information. The document
index is large and resides on disk. Running a large number of profiles
against it requires a lot of disk I/Os.
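As a minimal in-memory illustration of the document index approach (the real index is disk-resident and built by waisindex-style software, which is exactly what makes it I/O-bound at scale):

```python
from collections import Counter, defaultdict

# Minimal in-memory inverted index over a batch of documents; each word
# maps to postings of (document id, term frequency). A toy stand-in for
# the disk-resident index described in the text.
def build_index(docs):
    index = defaultdict(list)
    for doc_id, text in sorted(docs.items()):
        for word, freq in sorted(Counter(text.split()).items()):
            index[word].append((doc_id, freq))
    return index

docs = {1: "underwater archeology underwater", 2: "fly fishing"}
index = build_index(docs)
print(index["underwater"])    # [(1, 2)]
```

Running each stored profile as a query against such an index touches many posting lists per profile, which is why the next paragraph inverts the roles of documents and profiles.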
In the current SIFT implementation, we use a different approach. The idea
is to consider information filtering as the dual problem of IR, and treat
profiles as documents and documents as queries. Instead of building a
document index, we build a profile index. In [YGM94b, YGM94c] we propose
and analyze a variety of profile indexing techniques for the vector space
and boolean models, respectively. In our SIFT implementation, we have
chosen good indexing techniques for the two models that are compatible in
nature and combined them in the same index structure. Below we briefly
present this combined index structure through an example. For a detailed
description, as well as descriptions of other profile indexing techniques
and analysis, the reader is referred to [YGM94b, YGM94c].
Referring to Figure 2, suppose P1 is a weighted profile ((underwater,
60), (archeology, 60)) and P2 is a boolean profile "fly fishing not
underwater." For each word, we associate it with an inverted list of
postings that reference profiles containing it. For example, in
`
`
`
5 Performance Study
`
A practical challenge for SIFT is to process a new batch of documents
within the smallest time unit of notification. For example, in the
Netnews SIFT server, it is critical that the filter-and-notify process
finishes within 24 hours. Thus, it is important to understand and
characterize the performance of the SIFT system, and measure how it
scales with the number of profiles and volume of information.
First we study the processing time with respect to the number of
documents and the number of profiles. We then compare the performance of
the filtering engine with that of the alternative document index approach
discussed in Section 4.2. Finally, we investigate the main memory
requirement of the filtering engine. All experiments are run on a
dedicated DECstation 5000/240 running Ultrix 4.2, and all times reported
are real (wall clock) times.

The study was performed in early July, 1994. The test data consisted of
38,000 articles received on the day of July 6, 1994 by our department's
news host and 7,000 randomly selected subscriptions from the Netnews SIFT
server at that time. The average article size was 269 words (a word is
defined as an alphanumeric string longer than 2 characters). The
subscriptions were stored on a local disk, while the news articles were
stored on an NFS-mounted disk (from the news host). We repeated the
experiments with test data from two other days and the results were
within ±10% of those reported below.

In the result graphs that follow, we divide SIFT processing time into
these four steps:
`
1. Build Time: The profile index structure is built. Other auxiliary
data structures are allocated and initialized.

2. Filtering Time: Documents are run against the index one by one.
Document-profile matchings are written into a file.

3. Sorting Time: The document-profile matching file is sorted by user
email address and subscription identifier, using the Unix sort command.

4. Notify Time: The sorted matching file is read. Excerpts of matchings
for each subscription are prepared into an email message. Unix sendmail
is invoked to send out each message.
`
`Figure 2: Profile index and auxiliary data structures
`
Figure 2, the inverted list for the word "underwater" is made up of two
postings referencing P1 and P2. In the posting, if the profile is
weighted, we include the weight of the word in the profile; e.g., 60 for
"underwater" in P1. If the profile is a boolean one, we assign a weight
of 1 to a non-negated word, and -1 to a negated one.

In addition, we make use of two arrays called Threshold and Score; each
profile has an entry in each array. For each profile, its Threshold entry
stores the minimum score that a document must have against it to be a
match. For a boolean profile, this minimum is equal to the number of
non-negated words. For a weighted profile, the minimum is derived using
the relevance threshold specified by the user. The Score entry is used to
accumulate the score that a document has against the profile.

Now suppose a document D containing the words "underwater archeology"
(and none of the other profile words) is being processed. Suppose the
vector space representation for D is ((underwater, 80), (archeology, 80),
...); i.e., the words "underwater" and "archeology" are both assigned
weights of 80. To process the word "underwater," we look up the directory
and access its inverted list. For a weighted profile like P1, the
document weight of "underwater" is multiplied with the weight in the
posting (60), and the product is accumulated in P1's Score entry (note
that we are calculating the dot product of the two vectors). For a
boolean profile like P2, we simply add its posting weight (-1 for
"underwater") to the Score entry. Similarly the word "archeology" is
processed. Finally, profiles whose resulting Score entry is no less than
their Threshold entry match the document, such as P1 in Figure 2.
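The walk-through above can be condensed into a short sketch of the combined index (illustrative Python only; SIFT itself is written in C, and telling boolean postings apart by their +/-1 weights is a simplification made for this sketch, not the actual encoding):

```python
from collections import defaultdict

# Sketch of the combined profile index of Figure 2. Each word has an
# inverted list of (profile id, weight) postings; boolean profiles use
# weight 1 for required words and -1 for negated ones, with Threshold
# equal to the number of non-negated words.

index = defaultdict(list)     # word -> [(profile_id, weight), ...]
threshold = {}                # the Threshold array: profile_id -> min score

def add_weighted_profile(pid, terms, min_score):
    for word, weight in terms.items():
        index[word].append((pid, weight))
    threshold[pid] = min_score

def add_boolean_profile(pid, required, negated):
    for word in required:
        index[word].append((pid, 1))
    for word in negated:
        index[word].append((pid, -1))
    threshold[pid] = len(required)

def filter_document(doc_terms):
    """Run one document (word -> weight) against all stored profiles."""
    score = defaultdict(float)                  # the Score array
    for word, doc_w in doc_terms.items():
        for pid, prof_w in index.get(word, ()):
            # weighted postings accumulate a dot product; boolean postings
            # just add +1 (required word present) or -1 (negated word),
            # distinguished here only by the +/-1 convention of this sketch
            score[pid] += prof_w * doc_w if prof_w not in (1, -1) else prof_w
    return [pid for pid, s in score.items() if s >= threshold[pid]]

add_weighted_profile("P1", {"underwater": 60, "archeology": 60}, min_score=5000)
add_boolean_profile("P2", required=["fly", "fishing"], negated=["underwater"])

print(filter_document({"underwater": 80, "archeology": 80}))   # ['P1']
print(filter_document({"fly": 10, "fishing": 5}))              # ['P2']
```

The key property is that only the inverted lists of words actually occurring in the document are touched, so the cost per document depends on the document's vocabulary rather than on the total number of profiles.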
`
`182
`
`1995 USENIX Technical Conference- January 16-20, 1995- New Orleans, LA
`
`
`
[Figures 3 and 4 each plot four curves: Build Time; Build + Filter Time;
Build + Filter + Sort Time; and Build + Filter + Sort + Notify Time,
against the number of documents (up to 35,000) and the number of profiles
(up to 14,000), respectively.]
`
`Figure 3: SIFT performance vs. # documents
`
`Figure 4: SIFT performance vs. # profiles
`
`5.1 Coping with Information Volume
`
First we investigate the relationship between the processing time and the
number of documents. Figure 3 shows the results of processing the 38,000
articles against the 7,000 subscriptions. The time it takes to build the
profile index is very small and is, as expected, independent of the
number of documents. The filtering time is linear with respect to the
number of documents, with slope 0.21 sec./document (value obtained by
curve-fitting using Mathematica). The time it takes to sort the matching
file is negligible. The proportionality constant for the total running
time is 0.63 sec./document. (Recall that the articles are stored on a
remote news host, so some fraction of this time is spent reading the
articles via the local network. This is confirmed by looking at the CPU
utilization, which is found to be 42.8%.)
`
To measure the notify time, since it would be impossible to send the same
updates repeatedly to our real users, the notify times shown in Figure 3
(except the rightmost data point) are interpolated results. We assume
that the time it takes to send out notifications is proportional to the
number of matchings in the matching file. We have verified this
assumption with a separate experiment and derived the proportionality
constant. Then, in the experiment for Figure 3, we count the number of
matchings in the matching file at the data points shown. We compute the
notify time as the product of the number of matchings and the
proportionality constant. The results show that a large fraction (about
63% for 38,000 documents) of the total time is spent sending out
notifications.
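A back-of-envelope check of these figures, using only the slopes quoted above:

```python
# Back-of-envelope check of the reported slopes (all constants are taken
# from the text above; nothing here is newly measured).
docs = 38_000
total_slope = 0.63         # sec/document, total running time
filter_slope = 0.21        # sec/document, filtering step only

total_hours = docs * total_slope / 3600
print(f"total run: {total_hours:.1f} hours for one day's articles")
print(f"filtering share of total: {filter_slope / total_slope:.0%}")
```

At 0.63 sec./document, one day's 38,000 articles take roughly six and a half hours of wall-clock time, comfortably inside the 24-hour notification window, with filtering accounting for about a third of the total.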
`
`5.2 Coping with Subscription Growth
`
In the second experiment, we look at the relationship between the
processing time and the number of profiles. We repeat the process of
filtering the 38,000 documents against 1, 1,000, 4,000, 7,000, and 14,000
subscriptions. The last run is done simply by duplicating the 7,000
subscriptions in the database. Figure 4 shows the results. Again, the
profile index build time is negligible. For the filtering time, we find
that even for one profile, it takes 5,105 sec. to filter. This is the
time spent reading in the articles via NFS. Beyond the initial start-up
cost, filter time is apparently linear in the number of profiles, with
slope 0.62 sec./profile. The notify time is similarly obtained as
discussed above. We find the total running time to be linear with respect
to the number of profiles, and the slope is 2.63 sec./profile.
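Treating the NFS start-up cost as the fixed intercept (an approximation; the true intercept also includes index build time), the 2.63 sec./profile slope suggests a rough upper bound on how many profiles a single server of this class could handle within a 24-hour notification period:

```python
# Rough capacity estimate from the measured per-profile slope. Using the
# 5,105 sec. NFS read time as the fixed intercept is an approximation:
# the true intercept also includes index build time.
startup = 5_105            # sec: reading 38,000 articles over NFS
total_slope = 2.63         # sec/profile, total running time
day = 24 * 3600            # 24-hour notification budget, in seconds

max_profiles = (day - startup) / total_slope
print(f"roughly {int(max_profiles):,} profiles per day on this hardware")
```

This lands near 31,000 profiles, consistent with the earlier observation that the server handles 13,000 subscriptions today but will need replication as the subscription base keeps growing.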
`
`5.3 Performance Improvement
`
`To quantify