throbber
The USENIX Association
`
`Proceedings of the
`1995 USENIX Technical Conference
`
`January 16 - 20, 1995
`New Orleans, Louisiana, USA
`
`GOOG 1042
`IPR of U.S. Patent No. 8,601,154
`
`

`
`For additional copies of these proceedings, write:
`
`USENIX Association
`2560 Ninth Street, Suite 215
`Berkeley, CA 94710 USA
`
`The price is $30 for members and $39 for nonmembers.
`Outside the USA and canada, please add
`$20 per copy for postage (via air printed matter).
`
`Past USENIX Technical Conferences
`
`1994 Summer Boston
`1994 Winter San Diego
`1993 Summer Cincinnati
`1993 Winter San Diego
`1992 Summer San Antonio
`1992 Winter San Francisco
`1991 Summer Nashville
`1991 Winter Dallas
`1990 Summer Anaheim
`1990 Winter Washington, DC
`1989 Summer Baltimore
`1989 Winter San Diego
`
`1988 Summer San Francisco
`1988 Winter Dallas
`1987 Summer Phoenix
`1987 Winter Washington, DC
`1986 Summer Atlanta
`1986 Winter Denver
`1985 Summer Portland
`1985 Winter Dallas
`1984 Summer Salt Lake City
`1984 Winter Washington, DC
`1983 Summer Toronto
`1983 Winter San Diego
`
`1995 © Copyright by The USENIX Association
`All Rights Reserved.
`
`This volume is published as a collective work. Rights to individual
`papers remain with the author or the author's employer. Permission is
`granted for the noncommercial reproduction of the complete work for
`educational or research purposes. USENIX acknowledges all trade(cid:173)
`marks herein.
`
`ISBN 1-880446-67-7
`
`Printed in the United States of America on 50% recycled paper, 10-15% post consumer waste. @
`I FEB 26 ZOIO I
`
`M.I.T. LIBRARIES
`
`RECEIVED
`
`

`
`(cid:3) (cid:55)(cid:75)(cid:76)(cid:86)(cid:3)(cid:80)(cid:68)(cid:87)(cid:72)(cid:85)(cid:76)(cid:68)(cid:79)(cid:3)(cid:80)(cid:68)(cid:92)(cid:3)(cid:69)(cid:72)(cid:3)(cid:83)(cid:85)(cid:82)(cid:87)(cid:72)(cid:70)(cid:87)(cid:72)(cid:71)(cid:3)(cid:69)(cid:92)(cid:3)(cid:38)(cid:82)(cid:83)(cid:92)(cid:85)(cid:76)(cid:74)(cid:75)(cid:87)(cid:3)(cid:79)(cid:68)(cid:90)(cid:3)(cid:11)(cid:55)(cid:76)(cid:87)(cid:79)(cid:72)(cid:3)(cid:20)(cid:26)(cid:3)(cid:56)(cid:17)(cid:54)(cid:17)(cid:3)(cid:38)(cid:82)(cid:71)(cid:72)(cid:12)(cid:3)
`
`SIFT - A Tool for
`Wide-Area Information Dissemination *
`
`Tak W. Yan Hector Garcia-Molina
`Department of Computer Science
`Stanford University
`Stanford) CA 94305
`{tyan, hector }@cs.stanford.edu
`
`Abstract
`
`The dissemination model is becoming increasingly
`important in wide-area information system. In this
`model , the user subscribes to an information dissem(cid:173)
`ination service by submitting profiles that describe
`his interests. He then passively receives new, fil(cid:173)
`tered information. The Stanford Information Filter(cid:173)
`ing Tool (SIFT) is a tool to help provide such ser(cid:173)
`vice. It supports full-text filtering using well-known
`information retrieval models. The SIFT filtering en(cid:173)
`gine implements novel indexing techniques, capable
`of processing large volumes of information against a
`large number of profiles. It runs on several major
`Unix platforms and is freely available to the public.
`In this paper we present SIFT's approach to user in(cid:173)
`terest modeling and user-server communication. We
`demonstrate the processing capability of SIFT by de(cid:173)
`scribing a running server that disseminates USENET
`News. We present an empirical study of SIFT's per(cid:173)
`formance, examining its main memory requirement
`and ability to scale with information volume and user
`population.
`
`1
`
`Introduction
`
`Technological advances have made wide-area infor(cid:173)
`mation sharing commonplace. A suite of tools have
`
`*This research was sponsored by the Advanced Research
`Projects Agency (ARPA) of the Department of Defense un(cid:173)
`der Grant No.MDA972-92-J-1029 with the Corporation for Na(cid:173)
`tional Research Initiatives (CNRI). The views and conclusions
`contained in this document are those of the authors and should
`not be interpreted as necessarily representing the official poli(cid:173)
`cies or endorsement, either expressed or implied, of ARPA, the
`U.S. Government, or CNRI.
`
`emerged for network information finding and dis(cid:173)
`covery; e.g., Wide-Area Information Servers (WAIS)
`[KM91], archie [ED92], World- Wide Web (WWW)
`[BLCGP92], and gopher [McC92]. However , these
`new tools have one important missing element. They
`provide a means to search for existing information,
`but lack a mechanism for continuously informing the
`user of new information. The exploding volume of
`digital information makes it difficult for the user,
`equipped with only search capability, to keep up with
`the fast pace of information generation. Instead of
`making the user go after the information , it is desir(cid:173)
`able to have information selectively flow to the user.
`In an information dissemination (aka. alert, informa(cid:173)
`tion filtering, selective dissemination of information)
`service, the user expresses his interests in a number
`of long-term, continuously evaluated queries, called
`profiles. He will then passively receive documents fil(cid:173)
`tered according to the profiles. Such a service wi.ll be(cid:173)
`come increasingly important and form an indispens(cid:173)
`able tool for the dynamic environment of wide-area
`information systems.
`A very simple kind of information dissemination
`service is already available on the Internet: mailing
`lists (see e.g., [Kro92]). Hundreds of mailing lists ex(cid:173)
`ist, covering a wide variety of topics. The user sub(cid:173)
`scribes to lists of interest to him and receives mes(cid:173)
`sages on the topic via email. He may also send mes(cid:173)
`sages to the lists to reach other subscribers. LIST(cid:173)
`SERV is a software system for maintaining mailing
`lists. A problem with the mailing list mechanism as
`a tool for information dissemination is that it pro(cid:173)
`vides a crude granularity of interest matching. A user
`whose information need does not exactly match cer(cid:173)
`tain lists will either receive too many irrelevant or too
`few relevant messages. The USENET News (or Net-
`
`1995 USENIX Technical Conference- January 16-20, 1995 - New Orleans, LA
`
`177
`
`

`
`news) system [Kro92], an electronic bulletin board
`system on the Internet, is similar in nature to mail(cid:173)
`ing lists. While Netnews is extremely successful with
`millions of users and megabytes of daily traffic, at
`the same time it often creates an information over(cid:173)
`load. Like mailing lists, the coarse classification of
`topics into newsgroups means that a user subscribing
`to certain newsgroups may not find all articles inter(cid:173)
`esting, and also he will miss relevant articles posted
`in newsgroups that he does not subscribe to.
`
`Recent research efforts on information filtering fo(cid:173)
`cus on the filtering effectiveness, attempting to pro(cid:173)
`vide more fine-grained filtering using relational, rule(cid:173)
`based , information retrieval (IR) , and artificial intel(cid:173)
`ligence approaches. With few exceptions, they are
`often small scale (i.e., involving a small number of
`users or profiles) and thus the need to provide effi(cid:173)
`cient filtering is not apparent. However, in a large(cid:173)
`scale wide-area system where the number of informa(cid:173)
`tion providers and seekers are large, efficiency in the
`dissemination process is an important issue and must
`be addressed.
`
`The Stanford Information Filtering Tool (SIFT) is
`a tool for information providers to perform large-scale
`information dissemination. It can be used to set up
`a clearinghouse service that gathers large amount of
`information and selectively disseminates the informa(cid:173)
`tion to a large population of users. It supports full(cid:173)
`text filtering, using well-known and well-studied IR
`models. The SIFT filtering engine implements novel
`indexing techniques, capable of scaling to large num(cid:173)
`ber of documents and profiles. It runs on several
`major Unix platforms and is freely available to the
`public by anonymous ftp at URL :
`
`ftp://db.stanford.edu/pub/sift/sift-l.O.tar.Z
`
`In this paper we describe SIFT. We present its ap(cid:173)
`proach to user interest modeling and user-server com(cid:173)
`munication. We demonstrate the processing capabil(cid:173)
`ity of SIFT by describing a running server that dis(cid:173)
`seminates tens of thousands of Netnews articles daily
`to some 13,000 subscriptions. We describe the imple(cid:173)
`mentation of SIFT, focusing on the filtering engine.
`Finally we present an empirical study of SIFT's per(cid:173)
`formance, examining its main memory requirement
`and ability to scale with information volume and user
`population.
`
`2 Other Previous Work
`
`Boston Community Information System [GBBL85] is
`an experimental information dissemination system.
`Like SIFT, it allows a finer granularity of interest
`matching than mailing lists or Netnews. Users can
`express their interests with IR-style, keyword-based
`profiles. The system broadcasts new information via
`radio channel to all users, who then apply their own
`filters locally. While the radio communication chan(cid:173)
`nel makes broadcast inexpensive, local processing of
`mostly irrelevant information is very expensive. (For
`every user, a personal computer is dedicated for this
`purpose -this is cited as a major source of complaints
`from the users.)
`Information Lens (MGT+87] provides categoriza(cid:173)
`tion and filtering of semi-structured messages such
`as email. The user defines rules for filtering, and
`the processing is done at the user site. It provides
`effective filtering, but local processing is expensive
`for large-scale information dissemination . Similarly,
`the "kill file" mechanism in certain news reader pro(cid:173)
`grams allows the user to locally screen out irrele(cid:173)
`vant articles. A kill file only removes specified ar(cid:173)
`ticles from newsgroups that a user subscribes to, but
`it does not discover relevant articles in other news(cid:173)
`groups. To provide the same kind of filtering power
`as SIFT would require much local processing. It is
`more cost-effective to pool profiles together to share
`the processing overhead.
`The Tapestry system (GNOT92] is a research pro(cid:173)
`totype that uses the relational model for matching
`user interests and documents; filtering computation
`is done not on the properties of individual docu(cid:173)
`ments, but rather on the entire append-only doc(cid:173)
`ument database. Efficient query processing tech(cid:173)
`niques are proposed for handling this kind of queries.
`Tapestry is built on top of a commercial relational
`database system. The Pasadena system [WF91] in(cid:173)
`vestigates the effectiveness of different IR techniques
`in filtering. It collects new documents from several
`Internet information sources, and periodically run
`profiles against them. Similar to SIFT, it uses a
`clearinghouse approach, but efficient filtering is not
`addressed.
`In [YGM94b, YGM94c] we proposed a variety of
`indexing techniques for speeding up information fil(cid:173)
`tering under IR models. There we evaluated these
`techniques using analysis and simulation. The SIFT
`filtering engine is a real implementation of one class
`of index structures that we found to be efficient.
`
`178
`
`1995 USENIX Technical Conference- January 16-20, 1995- New Orleans, LA
`
`

`
`Uaer
`
`Documents
`
`3.1.1 F iltering M odel
`
`The interest profile can be expressed in one of t wo
`IR models: boolean and vector space (Sal89). We first
`focus on the vector space model.
`In vector space model, queries and documents are
`identified by terms, usually words. If there are m
`terms for content identification, then a document D
`is conceptually represented as an m-dimensional vec(cid:173)
`tor D = (w1, w2 , ... , wm), where weight w; for term
`l; signifies its statistical importance, such as its fre(cid:173)
`quency in the document. We may also write D as
`((t;.,w;,), ... ,(t;.,w;.)) where w;j i= 0; e.g., ( (un(cid:173)
`derwater, 60), (archeology, 60) ). A query is similarly
`represented. For a document-query pair, a similarity
`measure (such as the dot product) can be computed
`to determine how "similar" the two are. In an IR en(cid:173)
`vironment, the top ranked documents are retrieved
`for a query.
`In an information filtering setting, a key question
`is how many documents to return to the user. We
`could have allowed the user to receive a fixed number
`of top ranked documents per update period, e.g., as
`done in [FD92]. However, this is not very desirable:
`in a period when there are many relevant documents,
`he may miss some (low recall1 ); in a period when
`there are few interesting documen ts, he may receive
`irrelevant ones (low p1·ecision1). We instead allow
`the user to specify a relevance threshold, which is the
`minimum similarity score that a document must have
`against the profile for it to be delivered. A default
`value is supplied to the user for convenience.
`Instead of using the vector space model, the user
`may use boolean profiles to specify words that he
`wants in documents received, and words to be ex(cid:173)
`cluded. For example, the boolean profile "fly fishing
`not underwater" is for documents that contain both
`words "fly" and "fishing" but not the word "under(cid:173)
`water." The reader may note that the SIFT boolean
`model only allows conjunction and negation of words.
`However, the user may approximate d isjunction se(cid:173)
`mantics by submitting multiple subscriptions (though
`a document may match more than one subscriptions).
`
`3.1.2 Profile Constructio n an d Modification
`
`To assist the user with t he construction of a profile, a
`SIFT server provides a test run facility. The user may
`
`1 Recall is the proportion of relevant documents retumed
`and precision is the proportion of documents returned that ar~
`relevant.
`
`Figure 1: An overview of SIFT
`
`3 SIFT
`
`We begin our presentation of SIFT with an example.
`Suppose a user is interested in underwater archeol(cid:173)
`ogy (Figure 1). He sends an email subscribe request
`to a SIFT server specifying the profile "underwater
`archeology," and optionally parameters that control,
`for example, how often he wants to be updated or how
`long his subscription is for. He may alternativelv ac(cid:173)
`cess the SIFT server via WWW, using a graphical
`WWW client interface to fi ll out a form with the
`subscription information. (Before he subscribes, he
`may test run his profile against an index of a sam(cid:173)
`ple document collection.) His subscription is stored
`in the subscription database. As the SIFT server re(cid:173)
`ceives new documents, the fi ltering engine will pro(cid:173)
`cess them against the stored subscriptions, and noti(cid:173)
`fications will be sent out based on the user-specified
`parameters.
`In the following we detail the SIFT system from
`the perspectives of user interest modeling and com(cid:173)
`munication protocol. We then illustrate SIFT using
`the Netnews SIFT server.
`
`3.1 U ser Inter est Modeling
`
`A user subscribes to a SIFT server with one or more
`subscriptions, one for each topic of interest. A sub(cid:173)
`scription includes an IR-style profile, and as men(cid:173)
`tioned additional parameters to control the frequency
`of updates, the amount of information to receive, and
`the length of the subscription. A subscription is iden(cid:173)
`tified by the email address of the user and a subscrip(cid:173)
`tion identifier.
`
`1995 USENIX Technical Conference- January 16-20, 1995- New Orleans, LA
`
`179
`
`

`
`run his initial profile against an existing, representa(cid:173)
`tive collection of documents to test the effectiveness
`of his profile. He may interactively change the pro(cid:173)
`file and (for weighted profiles) adjust the threshold
`to the desired level of precision and recall. When he
`is satisfied with the performance of the filtering, he
`may then subscribe with the selected settings.
`After the user receives some periodic updates, he
`may decide to modify his profile or change the thresh(cid:173)
`old for vector space profiles. He may do this by
`accessing the SIFT server. Furthermore, for vec(cid:173)
`tor space profiles, relevance feedback [Sal89], a well(cid:173)
`known technique in IR to improve retrieval effective(cid:173)
`ness, can be used. The user simply gives SIFT the
`documents that he finds interesting; after examining
`them, the server adjusts the weights of the words in
`the user's profile accordingly.
`
`3.2 Communication Protocol
`
`There are two modes of communication between the
`user and a SIFT server. In the interactive mode, the
`user subscribes, test-runs a profile, views, updates,
`or cancels his subscriptions. In the passive mode, the
`user periodically receives information updates. In(cid:173)
`stead of developing a new communication protocol,
`we make use of current technologies: email [Cro82]
`and World-Wide Web HTIP [BLCGP92].
`First we discuss the interactive communication
`mode. Email communication is the lowest common
`denominator of network connectivity. By having an
`email interface, a SIFT server is accessible from users
`with less powerful machines, with limited network
`capability, or behind Internet-access firewalls. We
`adopt the LISTSERV mailing list syntax wherever
`possible, to ensure that minimal learning is required .
`We also use default settings to reduce the complexity
`of email requests for novice users.
`As the appeal of hypermedia navigation and the de(cid:173)
`velopment of sophisticated client interfaces are mak(cid:173)
`ing WWW the preferred tool for wide-area infor(cid:173)
`mation sharing, we also developed a WWW access
`interface for SIFT. Using a WWW client program,
`the user interacts with the SIFT server through a
`user-friendly graphical interface. We believe the dual
`email and WWW access covers the tradeoff between
`wide availability and ease of use.
`In the passive mode of user notification, a SIFT
`server sends out email messages that contain excerpts
`of new, potentially relevant documents (certain num(cid:173)
`ber of lines from the beginning, as specified by the
`
`user). After the user reads the excerpts, he may ac(cid:173)
`cess the SIFT server to retrieve the entire documents.
`Currently the excerpts are formatted to be read from
`regular mail readers. We plan to offer the option of
`formatting them in HTML, so that the user may view
`the notifications from sophisticated WWW viewers,
`and then interactively retrieve interesting documents
`from SIFT or provide feedback via HTIP.
`
`3.3 SIFTing Netnews
`
`Using SIFT, we have set up a server for selectively
`disseminating Netnews articles (text articles only; bi(cid:173)
`nary ones are first screened out). Like any other SIFT
`server, the user accesses the Netnews SIFT server via
`email or WWW. The reader is encouraged to try it
`out: for email access, please send an electronic mes(cid:173)
`sage to netnewsCldb. stanford. edu, with the word
`"help" in the body; for WWW access, please con(cid:173)
`nect to http://sift.stanford.edu. In February
`1994, we publicized the Netnews server in two news(cid:173)
`groups; within ten days of the announcement, we re(cid:173)
`ceived well over a thousand profiles. The number of
`profiles keeps increasing and now (November 1994)
`exceeds 13,000, submitted by users from almost all
`continents. Table 1 shows some interesting statistics
`obtained from the server on the day of November 10,
`1994. The average number of articles is over the week
`of November 4 - 10, 1994.
`
`Number of subscriptions
`Number of users
`Daily average number of articles
`Average notification period (in days)
`
`13,381
`5,146
`45,127
`1.4
`
`Table 1: Netnews SIFT server statistics
`
`Apparent from these numbers is that the load on
`the SIFT server is very high. It is necessary to match
`an average of over 45,000 articles against some 13,000
`profiles and deliver most updates within a day. The
`efficient implementation of the SIFT filtering engine
`enables the job to be done on regular hardware, a
`DECstation 5000/240. Even though we believe SIFT
`is efficient, there is a limit to the load that it can
`handle. It is necessary to replicate the server; in fact,
`we have been in contact with several sites (in the U.S.
`and Europe) that expressed interests in providing the
`same service.
`The Netnews SIFT server is not meant to be a
`replacement of the Netnews system, which is an in-
`
`180
`
`1995 USENIX Technical Conference- January 16-20, 1995- New Orleans, LA
`
`

`
`valuable tool for carrying on distributed discussions.
`Rather, it provides a complementary filtering service.
`Only the beginning lines of matched articles are sent
`out. The user may, after determining that an article
`is relevant, access his own local news host to access
`the article. In cases where he does not have access to
`a news host, he may request the whole article be sent
`from the SIFT server.
`Sifting Netnews is just one application of SIFT.
`We have set up another SIFT server, disseminating
`Computer Science Technical Reports (email access:
`elibOdb . stanford . edu, WWW access will soon be
`available). This server has close to 1,000 subscrip(cid:173)
`tions (November 1994).
`
`To provide the test run capability, we need to in(cid:173)
`dex a collection of documents. This can be done by
`some publicly available software, and we choose to
`use the index engine in the WAIS distribution (called
`waisindex). From our experience it is a very robust
`implementation. However, since we use a different
`scoring scheme than WJ\IS , we made slight modifica(cid:173)
`tions to the WAIS code to make it compatible with
`our filtering engine (i.e., giving the same results). The
`major changes are to support the use of relevance
`threshold and the processing of boolean queries. The
`modified waisindex code is included in our SIFT dis(cid:173)
`tribution.
`
`4
`
`Implementation
`
`SIFT is implemented in C and has been compiled
`on several Unix platforms: DEC Ultrix 4.2, HPUX
`8.07, and Sun OS 4.1. In the following we first give
`an overview of the components of the system and then
`focus on the implementation of the filtering engine.
`
`4.1 System Components
`
`The program mailui implements the email request
`handler shown in Figure 1. It parses email messages,
`assuming the format in [Cro82] . User requests are
`processed by accessing the subscription database and
`the document index (for test runs).
`The program wwvui is a so-called "cgi-script" for
`use with the HTTP daemon released by National
`Center for Supercomputing Applications. It accepts
`requests submitted by users using a WWW client
`(with a form-filling graphical interface), and, simi(cid:173)
`lar to mailui , it accesses the subscription database
`and document index to process the requests.
`The program filter implements the filtering en(cid:173)
`gine. It accepts as input a list of documents (in the
`form of Unix path names) and matches them against
`the stored profiles. The implementation details are
`discussed in the next subsection. The output from
`filter is a file of matchings.
`After the filtering is completed, the matchings are
`sorted by users and subscription identifiers. This is
`necessary because SIFT sends out updates on a per
`subscription basis. After sorting, the program alert
`can be used to send out the message one by one,
`using Unix sendmail. Included in the messages are
`excerpts from the matched documents (a few lines
`from the beginning of the document).
`
`4.2 Filtering Engine Implementation
`
`In IR, to efficient process queries against a collection
`of documents, an inverted index [Sal89] of documents
`is built, which associales a word with its occurrences
`in the documents. In the information filtering sce(cid:173)
`nario, we may use the same approach: an index is
`built of a batch of new documents, and profiles are
`treated as stored queries that are run periodically.
`This is the approach used in an earlier SIFT proto(cid:173)
`type. We refer to this as the document index ap(cid:173)
`proach. This approach may not be very appropriate
`with a high volume of new information. The docu(cid:173)
`ment index is large and resides on disk. Running a
`large number of profiles against it requires a lot of
`disk 1/0s.
`In the current SIFT implementation , we use a dif(cid:173)
`ferent approach. The idea is to consider information
`filtering as the dual problem of IR, and treat pro(cid:173)
`files as documents and documents as queries. Instead
`of building a document index , we build a profile in(cid:173)
`dex. In [YGM94b,YGM94c] we propose and analyze
`a variety of profile indexing techniques for the vec(cid:173)
`tor space and boolean models respectively.
`In our
`SIFT implementation, we have chosen good indexing
`techniques for the two models that are compatible in
`nature and combined them together in the same in(cid:173)
`dex structure. Below we briefly present lhis combined
`index structure through an example. For a detailed
`description, as well as descriptions of other profile in(cid:173)
`dexing techniques and analysis, the reader is referred
`to [YGM94b,YGM94c].
`Referring to Figure 2, suppose Pl is a weighted pro(cid:173)
`file ((underwater, 60), (archeology, 60)} and P2 is a
`boolean profile "fly fishing not underwater." For each
`word, we associate it with an inverted list of postings
`that reference profiles containing it. For example, in
`
`1995 USENIX Technical Conference- January 16-20, 1995 - New Orleans, LA
`
`181
`
`

`
`5 P erformance St udy
`
`A practical challenge for SIFT is to process a new
`batch of documents within the smallest time unit
`of notification . For example, in the Netnews SIFT
`server, it is critical that the filter and notify process
`is finished within 24 hours. Thus, it is important
`to understand and characterize the performance of
`the SIFT system, and measure how it scales with the
`number of profiles and volume of information.
`First we study the performance of the processing
`time with respect to the number of documents and
`number of profiles. We then compare the perfor(cid:173)
`mance of the filtering engine with and the alternative
`document index approach discussed in Section 4.2.
`Finally we investigate the main memory requirement
`of the filtering engine. All experiments are run on
`a dedicated DECstation 5000/240 running Ultrix 4.2
`and all times reported are real (wall clock) time.
`The study was performed in early July, 1994. The
`test data consisted of 38,000 articles received on the
`day of July 6, 1994 by our department's news host
`and 7 ,000 randomly selected subscriptions from the
`Netnews SIFT server at that time. The average ar(cid:173)
`ticle size was 269 words (a word is defined as an al(cid:173)
`phanumeric string longer than 2 characters). The
`subscriptions were stored on a local disk, while the
`news articles were stored on an NFS-mounted disk
`(from the news host). We repealed the experiments
`with test data from two other days and the results
`were within ±10% of those reported below.
`In the result graphs tha.t follow, we divide SIFT
`processing time into these f?ur steps:
`
`1. Build Time The profile index structure is built.
`Other auxiliary data structures are allocated and
`initialized.
`
`2. Filtering Time Documents are run against the
`index one by one. Document-profile matchings
`are written into a file.
`
`3. Sorting Time The document-profile matching
`file is sorted by user emai l address and subscrip(cid:173)
`tion identifier, using Unix sort command.
`
`4. Notify Time The sorted matching file is read.
`Excerpts of matchings for each subscription are
`prepared into an email message. Unix sendmail
`is invoked to send out each message.
`
`Threshold
`
`Score
`
`·1
`
`Figure 2: Profile index and auxiliary data structures
`
`Figure 2 the inverted list for word "underwater" is
`made up of two postings referencing P1 and P2. In
`the posting, if the profile is weighted, we include the
`weight of the word in the profile; e.g., 60 for "un(cid:173)
`derwater" in Pl. If the profile is a boolean one, we
`assign a weight of 1 to a non-negated word, and -1 to
`a negated one.
`In addition, we make use of two arrays called
`Threshold and Score; each profile has an entry in each
`array. For each profile, its Threshold entry stores the
`minimum score that a document must have against it
`to be a match. For a boolean profile, this minimum
`is equal to the number of non-negated words. For a
`weighted profile, the minimum is derived using the
`relevance threshold specified by the user. T he Score
`entry is used to accumulate the score that a document
`has against the profile.
`Now suppose a document D containing the words
`"underwater archeology" (and none of the other pro(cid:173)
`file words) is being processed. Suppose the vec(cid:173)
`tor space representation for D is ((underwater, 80),
`(archeology, 80), ... ); i.e., the words "underwater"
`and "archeology" are both assigned weights of 80.
`To process the word "underwater," we look up the
`directory and access its inverted list. For a weighted
`profile like Pl , the document weight of "underwater"
`is multiplied with the weight in the posting (60), and
`the product is accumulated in P1 's Score entry {note
`that we are calculating the dot product of the two
`vectors). For a boolean profile like l>2, we simply add
`its posting weight (-1 for "underwater") to the Score
`entry. Similarly the word "archeology" is processed.
`Finally, profiles whose resulting Score entry is no less
`than the Threshold entry match the document , such
`as Pl in Figure 2.
`
`182
`
`1995 USENIX Technical Conference- January 16-20, 1995- New Orleans, LA
`
`

`
`, T1.
`
`,.
`
`Suits 1- Filter 4- Son + Noiily Time -O-~
`Build + Filler 1- Sort Tirrie -+---
`Build + Filter ‘firne 8--
`Build Time —)<—
`
`_..o-''
`
`,..es--
`
`.—-‘$.--‘U.-‘B“
`
`"-‘‘“B,,..---
`
`0
`
`5000
`
`..~sr"ar
`--)<:ai¢—~ ><'
`><«
`)9
`x I
`15030 20000 25000 30000 35000
`NumberofDocumen1s
`
`10000
`
`Build + Filter + Son 4- Nolily Time -om
`Buiici 1- Filter + Sort Time —i---
`Build 1- Filter Time E}-
`Build Time ->(—
`
`4000
`
`8000
`5000
`Number oi Profiles
`
`10000
`
`12000
`
`14000
`
`Figure 3: SIFT performance vs. # documents
`
`Figure 4: SIFT performance vs. # profiles
`
`5.1 Coping with Information Volume
`
`First we investigate the relationship between the pro-
`cessing time and the number of documents. Figure
`3 shows the results of processing the 38,000 articles
`against the 7,000 subscriptions. The time it takes
`to build the profile index is very small and is as ex-
`pected independent of the number of documents. The
`filtering time is linear with respect
`to the number
`of documents, with slope 0.21 sec./document [value
`obtained by curve-fitting using Mathematica). The
`time it takes to sort the matching file is negligible.
`The proportionality constant for the total running
`time is 0.63 sec. /document. [Recall that the articles
`are stored on a remote news host, so some fraction
`
`of this time is spent in reading the articles via local
`network. This is confirmed by looking at the CPU
`utilization, which is found to be 42.8%.)
`
`To measure the notify time, since it would be im-
`possible to send the same updates repeatedly to our
`real users, the notify times shown in Figure 3 (except
`the rightmost data point) are interpolated results.
`We assume that the time it takes to send out notifi-
`
`cations is proportional to the number of matchings in
`the matching file. We have verified this assumption
`with a separate experiment and derived the propor-
`tionality constant. Then in the experiment for Figure
`3, we count the number of matchings in the matching
`file at the data points shown. We compute the notify
`time as the product of the number of matchings and
`the proportionality constant. The results show that
`a large fraction {z 63% for 38,000 documents) of the
`total time is spent in sending out subscriptions.
`
`5.2 Coping with Subscription Growth
`
`In the Second experiment, we look at the relation-
`ship between the processing time and the number of
`profiles. We repeat
`the process of filtering 38,000
`documents against 1, 1,000, 4,000, 7,000, and 14,000
`subscriptions. The last run is done simply by dupli-
`cating the 7,000 subscriptions in the database. Fig-
`ure 4 shows the results. Again, the profile index build
`time is negligible. For the filtering time, we find that
`even for one profile, it takes 5,105 sec.
`to filter. This
`is the time spent reading in the articles via NFS. Be-
`yond the initial start—up cost, filter time is appar-
`ently linear to the number of profiles, with slope 0.62
`sec./profile. The notify time is similarly obtained as
`discussed above. We find the total running time to
`be linear with respect to the number of profiles, and
`the slope is 2.63 sec./profile.
`
`5.3 Performance Improvement
`
`To quantify

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket