`
`
`
`
`UNIFIED PATENTS
`
`EXHIBIT 1012
`
`Unfied Patents Exhibit 1012
`Pg. 1
`
`
`
`In Proceedings of the USENIX Technical Conference(cid:2) pages (cid:5)(cid:2)
`
`SIFT A Tool for
`Wide(cid:3)Area Information Dissemination
`
`Tak W(cid:2) Yan Hector Garcia(cid:3)Molina
`
`Department of Computer Science
`Stanford University
`Stanford(cid:2) CA
`ftyan(cid:4) hectorg(cid:5)cs(cid:2)stanford(cid:2)edu
`
`February (cid:4)
`
`Abstract
`
`The dissemination model is becoming increasingly
`important in wide(cid:2)area information system(cid:3) In this
`model(cid:4) the user subscribes to an information dissem(cid:2)
`ination service by submitting pro les that describe
`his interests(cid:3) He then passively receives new(cid:4) l(cid:2)
`tered information(cid:3) The Stanford Information Filter(cid:2)
`ing Tool SIFT is a tool to help provide such ser(cid:2)
`vice(cid:3) It supports full(cid:2)text ltering using well(cid:2)known
`information retrieval models(cid:3) The SIFT ltering en(cid:2)
`gine implements novel indexing techniques(cid:4) capable
`of processing large volumes of information against a
`large number of pro les(cid:3)
`It runs on several major
`Unix platforms and is freely available to the public(cid:3)
`In this paper we present SIFT(cid:8)s approach to user in(cid:2)
`terest modeling and user(cid:2)server communication(cid:3) We
`demonstrate the processing capability of SIFT by de(cid:2)
`scribing a running server that disseminates USENET
`News(cid:3) We present an empirical study of SIFT(cid:8)s per(cid:2)
`formance(cid:4) examining its main memory requirement
`and ability to scale with information volume and user
`population(cid:3)
`
`
`
`Introduction
`
`Technological advances have made wide(cid:2)area infor(cid:2)
`mation sharing commonplace(cid:3) A suite of tools have
`
` This research was sponsored by the Advanced Research
`Projects Agency ARPA of the Department of Defense un(cid:4)
`der Grant No(cid:5)MDA (cid:4) (cid:4)J(cid:4) with the Corporation for Na(cid:4)
`tional Research Initiatives CNRI(cid:5) The views and conclusions
`contained in this document are those of the authors and should
`not be interpreted as necessarily representing the o cial poli(cid:4)
`cies or endorsement(cid:12) either expressed or implied(cid:12) of ARPA(cid:12) the
`U(cid:5)S(cid:5) Government(cid:12) or CNRI(cid:5)
`
`emerged for network information nding and dis(cid:2)
`covery(cid:9) e(cid:3)g(cid:3)(cid:4) Wide(cid:2)Area Information Servers WAIS
`KM (cid:4) archie ED (cid:4) World(cid:2)Wide Web WWW
`BLCGP (cid:4) and gopher McC (cid:3) However(cid:4) these
`new tools have one important missing element(cid:3) They
`provide a means to search for existing information(cid:4)
`but lack a mechanism for continuously informing the
`user of new information(cid:3) The exploding volume of
`digital information makes it di cult for the user(cid:4)
`equipped with only search capability(cid:4) to keep up with
`the fast pace of information generation(cid:3)
`Instead of
`making the user go after the information(cid:4) it is desir(cid:2)
`able to have information selectively ow to the user(cid:3)
`In an information dissemination aka(cid:3) alert(cid:2) informa(cid:3)
`tion ltering(cid:2) selective dissemination of information
`service(cid:4) the user expresses his interests in a number
`of long(cid:2)term(cid:4) continuously evaluated queries(cid:4) called
`pro les(cid:3) He will then passively receive documents l(cid:2)
`tered according to the pro les(cid:3) Such a service will be(cid:2)
`come increasingly important and form an indispens(cid:2)
`able tool for the dynamic environment of wide(cid:2)area
`information systems(cid:3)
`A very simple kind of information dissemination
`service is already available on the Internet(cid:17) mailing
`lists see e(cid:3)g(cid:3)(cid:4) Kro (cid:3) Hundreds of mailing lists ex(cid:2)
`ist(cid:4) covering a wide variety of topics(cid:3) The user sub(cid:2)
`scribes to lists of interest to him and receives mes(cid:2)
`sages on the topic via email(cid:3) He may also send mes(cid:2)
`sages to the lists to reach other subscribers(cid:3) LIST(cid:2)
`SERV is a software system for maintaining mailing
`lists(cid:3) A problem with the mailing list mechanism as
`a tool for information dissemination is that it pro(cid:2)
`vides a crude granularity of interest matching(cid:3) A user
`whose information need does not exactly match cer(cid:2)
`tain lists will either receive too many irrelevant or too
`few relevant messages(cid:3) The USENET News or Net(cid:2)
`
`
`
`Unfied Patents Exhibit 1012
`Pg. 2
`
`
`
`news system Kro (cid:7) an electronic bulletin board
`system on the Internet(cid:7) is similar in nature to mail(cid:8)
`ing lists(cid:9) While Netnews is extremely successful with
`millions of users and megabytes of daily tra c(cid:7) at
`the same time it often creates an information over(cid:8)
`load(cid:9) Like mailing lists(cid:7) the coarse classi cation of
`topics into newsgroups means that a user subscribing
`to certain newsgroups may not nd all articles inter(cid:8)
`esting(cid:7) and also he will miss relevant articles posted
`in newsgroups that he does not subscribe to(cid:9)
`
`Recent research e orts on information ltering fo(cid:8)
`cus on the ltering e ectiveness(cid:7) attempting to pro(cid:8)
`vide more ne(cid:8)grained ltering using relational(cid:7) rule(cid:8)
`based(cid:7) information retrieval IR(cid:7) and arti cial intel(cid:8)
`ligence approaches(cid:9) With few exceptions(cid:7) they are
`often small scale i(cid:9)e(cid:9)(cid:7) involving a small number of
`users or pro les and thus the need to provide e (cid:4)
`cient ltering is not apparent(cid:9) However(cid:7) in a large(cid:8)
`scale wide(cid:8)area system where the number of informa(cid:8)
`tion providers and seekers are large(cid:7) e ciency in the
`dissemination process is an important issue and must
`be addressed(cid:9)
`
`The Stanford Information Filtering Tool SIFT is
`a tool for information providers to perform large(cid:8)scale
`information dissemination(cid:9) It can be used to set up
`a clearinghouse service that gathers large amount of
`information and selectively disseminates the informa(cid:8)
`tion to a large population of users(cid:9) It supports full(cid:8)
`text ltering(cid:7) using well(cid:8)known and well(cid:8)studied IR
`models(cid:9) The SIFT ltering engine implements novel
`indexing techniques(cid:7) capable of scaling to large num(cid:8)
`ber of documents and pro les(cid:9)
`It runs on several
`major Unix platforms and is freely available to the
`public by anonymous ftp at URL(cid:14)
`
`ftp(cid:2)db(cid:4)stanford(cid:4)edupubsiftsift(cid:5) (cid:4) (cid:4)tar(cid:4)Z
`
`In this paper we describe SIFT(cid:9) We present its ap(cid:8)
`proach to user interest modeling and user(cid:8)server com(cid:8)
`munication(cid:9) We demonstrate the processing capabil(cid:8)
`ity of SIFT by describing a running server that dis(cid:8)
`seminates tens of thousands of Netnews articles daily
`to some (cid:7) subscriptions(cid:9) We describe the imple(cid:8)
`mentation of SIFT(cid:7) focusing on the ltering engine(cid:9)
`Finally we present an empirical study of SIFT(cid:18)s per(cid:8)
`formance(cid:7) examining its main memory requirement
`and ability to scale with information volume and user
`population(cid:9)
`
` Other Previous Work
`
`Boston Community Information System GBBL is
`an experimental information dissemination system(cid:9)
`Like SIFT(cid:7) it allows a ner granularity of interest
`matching than mailing lists or Netnews(cid:9) Users can
`express their interests with IR(cid:8)style(cid:7) keyword(cid:8)based
`pro les(cid:9) The system broadcasts new information via
`radio channel to all users(cid:7) who then apply their own
` lters locally(cid:9) While the radio communication chan(cid:8)
`nel makes broadcast inexpensive(cid:7) local processing of
`mostly irrelevant information is very expensive(cid:9) For
`every user(cid:7) a personal computer is dedicated for this
`purpose this is cited as a major source of complaints
`from the users(cid:9)
`Information Lens MGT(cid:0) provides categoriza(cid:8)
`tion and ltering of semi(cid:8)structured messages such
`as email(cid:9) The user de nes rules for ltering(cid:7) and
`the processing is done at the user site(cid:9)
`It provides
`e ective ltering(cid:7) but local processing is expensive
`for large(cid:8)scale information dissemination(cid:9) Similarly(cid:7)
`the kill le(cid:24) mechanism in certain news reader pro(cid:8)
`grams allows the user to locally screen out irrele(cid:8)
`vant articles(cid:9) A kill le only removes speci ed ar(cid:8)
`ticles from newsgroups that a user subscribes to(cid:7) but
`it does not discover relevant articles in other news(cid:8)
`groups(cid:9) To provide the same kind of ltering power
`as SIFT would require much local processing(cid:9) It is
`more cost(cid:8)e ective to pool pro les together to share
`the processing overhead(cid:9)
`The Tapestry system GNOT is a research pro(cid:8)
`totype that uses the relational model for matching
`user interests and documents(cid:25) ltering computation
`is done not on the properties of individual docu(cid:8)
`ments(cid:7) but rather on the entire append(cid:8)only doc(cid:8)
`ument database(cid:9) E cient query processing tech(cid:8)
`niques are proposed for handling this kind of queries(cid:9)
`Tapestry is built on top of a commercial relational
`database system(cid:9) The Pasadena system WF in(cid:8)
`vestigates the e ectiveness of di erent IR techniques
`in ltering(cid:9) It collects new documents from several
`Internet information sources(cid:7) and periodically run
`pro les against them(cid:9) Similar to SIFT(cid:7) it uses a
`clearinghouse approach(cid:7) but e cient ltering is not
`addressed(cid:9)
`In YGM b(cid:7) YGM c we proposed a variety of
`indexing techniques for speeding up information l(cid:8)
`tering under IR models(cid:9) There we evaluated these
`techniques using analysis and simulation(cid:9) The SIFT
` ltering engine is a real implementation of one class
`of index structures that we found to be e cient(cid:9)
`
`Unfied Patents Exhibit 1012
`Pg. 3
`
`
`
`Documents
`
` (cid:2) (cid:2) Filtering Model
`
`User
`
`"underwater
`archeology?"
`
`Email / WWW
`Request Handler
`
`SIFT
`
`...
`
`Document
`Index
`
`Subscription
`Database
`
`Filtering
`Engine
`
`Alerter
`
`Updates
`
`The interest pro le can be expressed in one of two
`IR models(cid:3) boolean and vector space Sal (cid:4) We rst
`focus on the vector space model(cid:4)
`In vector space model(cid:10) queries and documents are
`identi ed by terms(cid:10) usually words(cid:4)
`If there are m
`terms for content identi cation(cid:10) then a document D
`is conceptually represented as an m(cid:5)dimensional vec(cid:5)
`tor D (cid:16) hw (cid:2) w(cid:2) (cid:3) (cid:3) (cid:3) (cid:2) wmi(cid:10) where weight wi for term
`ti signi es its statistical importance(cid:10) such as its fre(cid:5)
`quency in the document(cid:4) We may also write D as
`hti (cid:2) wi (cid:2) (cid:3) (cid:3) (cid:3) (cid:2) tik (cid:2) wiki where wij (cid:16) (cid:18) e(cid:4)g(cid:4)(cid:10) h un(cid:5)
`derwater(cid:10) (cid:10) archeology(cid:10) i(cid:4) A query is similarly
`represented(cid:4) For a document(cid:5)query pair(cid:10) a similarity
`measure such as the dot product can be computed
`to determine how similar(cid:11) the two are(cid:4) In an IR en(cid:5)
`vironment(cid:10) the top ranked documents are retrieved
`for a query(cid:4)
`In an information ltering setting(cid:10) a key question
`is how many documents to return to the user(cid:4) We
`could have allowed the user to receive a xed number
`of top ranked documents per update period(cid:10) e(cid:4)g(cid:4)(cid:10) as
`done in FD (cid:4) However(cid:10) this is not very desirable(cid:3)
`in a period when there are many relevant documents(cid:10)
`he may miss some low recall (cid:18) in a period when
`there are few interesting documents(cid:10) he may receive
`irrelevant ones low precision (cid:4) We instead allow
`the user to specify a relevance threshold(cid:10) which is the
`minimum similarity score that a document must have
`against the pro le for it to be delivered(cid:4) A default
`value is supplied to the user for convenience(cid:4)
`Instead of using the vector space model(cid:10) the user
`may use boolean pro les to specify words that he
`wants in documents received(cid:10) and words to be ex(cid:5)
`cluded(cid:4) For example(cid:10) the boolean pro le y shing
`not underwater(cid:11) is for documents that contain both
`words y(cid:11) and shing(cid:11) but not the word under(cid:5)
`water(cid:4)(cid:11) The reader may note that the SIFT boolean
`model only allows conjunction and negation of words(cid:4)
`However(cid:10) the user may approximate disjunction se(cid:5)
`mantics by submitting multiple subscriptions though
`a document may match more than one subscriptions(cid:4)
`
` (cid:2) (cid:2) Pro le Construction and Modi cation
`
`To assist the user with the construction of a pro le(cid:10) a
`SIFT server provides a test run facility(cid:4) The user may
`
` Recall is the proportion of relevant documents returned(cid:2)
`and precision is the proportion of documents returned that are
`relevant(cid:3)
`
`Figure (cid:3) An overview of SIFT
`
` SIFT
`
`We begin our presentation of SIFT with an example(cid:4)
`Suppose a user is interested in underwater archeol(cid:5)
`ogy Figure (cid:4) He sends an email subscribe request
`to a SIFT server specifying the pro le underwater
`archeology(cid:10)(cid:11) and optionally parameters that control(cid:10)
`for example(cid:10) how often he wants to be updated or how
`long his subscription is for(cid:4) He may alternatively ac(cid:5)
`cess the SIFT server via WWW(cid:10) using a graphical
`WWW client interface to ll out a form with the
`subscription information(cid:4) Before he subscribes(cid:10) he
`may test run his pro le against an index of a sam(cid:5)
`ple document collection(cid:4) His subscription is stored
`in the subscription database(cid:4) As the SIFT server re(cid:5)
`ceives new documents(cid:10) the ltering engine will pro(cid:5)
`cess them against the stored subscriptions(cid:10) and noti(cid:5)
` cations will be sent out based on the user(cid:5)speci ed
`parameters(cid:4)
`In the following we detail the SIFT system from
`the perspectives of user interest modeling and com(cid:5)
`munication protocol(cid:4) We then illustrate SIFT using
`the Netnews SIFT server(cid:4)
`
` (cid:2) User Interest Modeling
`
`A user subscribes to a SIFT server with one or more
`subscriptions(cid:10) one for each topic of interest(cid:4) A sub(cid:5)
`scription includes an IR(cid:5)style pro le(cid:10) and as men(cid:5)
`tioned additional parameters to control the frequency
`of updates(cid:10) the amount of information to receive(cid:10) and
`the length of the subscription(cid:4) A subscription is iden(cid:5)
`ti ed by the email address of the user and a subscrip(cid:5)
`tion identi er(cid:4)
`
`Unfied Patents Exhibit 1012
`Pg. 4
`
`
`
`run his initial pro le against an existing(cid:3) representa(cid:4)
`tive collection of documents to test the e ectiveness
`of his pro le(cid:6) He may interactively change the pro(cid:4)
` le and for weighted pro les adjust the threshold
`to the desired level of precision and recall(cid:6) When he
`is satis ed with the performance of the ltering(cid:3) he
`may then subscribe with the selected settings(cid:6)
`After the user receives some periodic updates(cid:3) he
`may decide to modify his pro le or change the thresh(cid:4)
`old for vector space pro les(cid:6) He may do this by
`accessing the SIFT server(cid:6) Furthermore(cid:3) for vec(cid:4)
`tor space pro les(cid:3) relevance feedback Sal (cid:3) a well(cid:4)
`known technique in IR to improve retrieval e ective(cid:4)
`ness(cid:3) can be used(cid:6) The user simply gives SIFT the
`documents that he nds interesting(cid:13) after examining
`them(cid:3) the server adjusts the weights of the words in
`the user(cid:14)s pro le accordingly(cid:6)
`
` (cid:2) Communication Protocol
`
`There are two modes of communication between the
`user and a SIFT server(cid:6) In the interactive mode(cid:3) the
`user subscribes(cid:3) test(cid:4)runs a pro le(cid:3) views(cid:3) updates(cid:3)
`or cancels his subscriptions(cid:6) In the passive mode(cid:3) the
`user periodically receives information updates(cid:6)
`In(cid:4)
`stead of developing a new communication protocol(cid:3)
`we make use of current technologies(cid:15) email Cro
`and World(cid:4)Wide Web HTTP BLCGP (cid:6)
`First we discuss the interactive communication
`mode(cid:6) Email communication is the lowest common
`denominator of network connectivity(cid:6) By having an
`email interface(cid:3) a SIFT server is accessible from users
`with less powerful machines(cid:3) with limited network
`capability(cid:3) or behind Internet(cid:4)access rewalls(cid:6) We
`adopt the LISTSERV mailing list syntax wherever
`possible(cid:3) to ensure that minimal learning is required(cid:6)
`We also use default settings to reduce the complexity
`of email requests for novice users(cid:6)
`As the appeal of hypermedia navigation and the de(cid:4)
`velopment of sophisticated client interfaces are mak(cid:4)
`ing WWW the preferred tool for wide(cid:4)area infor(cid:4)
`mation sharing(cid:3) we also developed a WWW access
`interface for SIFT(cid:6) Using a WWW client program(cid:3)
`the user interacts with the SIFT server through a
`user(cid:4)friendly graphical interface(cid:6) We believe the dual
`email and WWW access covers the tradeo between
`wide availability and ease of use(cid:6)
`In the passive mode of user noti cation(cid:3) a SIFT
`server sends out email messages that contain excerpts
`of new(cid:3) potentially relevant documents certain num(cid:4)
`ber of lines from the beginning(cid:3) as speci ed by the
`
`user(cid:6) After the user reads the excerpts(cid:3) he may ac(cid:4)
`cess the SIFT server to retrieve the entire documents(cid:6)
`Currently the excerpts are formatted to be read from
`regular mail readers(cid:6) We plan to o er the option of
`formatting them in HTML(cid:3) so that the user may view
`the noti cations from sophisticated WWW viewers(cid:3)
`and then interactively retrieve interesting documents
`from SIFT or provide feedback via HTTP(cid:6)
`
` (cid:2) SIFTing Netnews
`
`Using SIFT(cid:3) we have set up a server for selectively
`disseminating Netnews articles text articles only(cid:13) bi(cid:4)
`nary ones are rst screened out(cid:6) Like any other SIFT
`server(cid:3) the user accesses the Netnews SIFT server via
`email or WWW(cid:6) The reader is encouraged to try it
`out(cid:15) for email access(cid:3) please send an electronic mes(cid:4)
`sage to netnews(cid:2)db(cid:3)stanford(cid:3)edu(cid:3) with the word
`help(cid:18) in the body(cid:13) for WWW access(cid:3) please con(cid:4)
`nect to http(cid:4)sift(cid:3)stanford(cid:3)edu(cid:6)
`In February
` (cid:3) we publicized the Netnews server in two news(cid:4)
`groups(cid:13) within ten days of the announcement(cid:3) we re(cid:4)
`ceived well over a thousand pro les(cid:6) The number of
`pro les keeps increasing and now November
`exceeds (cid:3) (cid:3) submitted by users from almost all
`continents(cid:6) Table shows some interesting statistics
`obtained from the server on the day of November (cid:3)
` (cid:6) The average number of articles is over the week
`of November (cid:4) (cid:3) (cid:6)
`
`Number of subscriptions
`Number of users
`Daily average number of articles
`Average noti cation period in days
`
` (cid:4)
`(cid:4)
`(cid:4)
` (cid:14)
`
`Table (cid:15) Netnews SIFT server statistics
`
`Apparent from these numbers is that the load on
`the SIFT server is very high(cid:6) It is necessary to match
`an average of over (cid:3) articles against some (cid:3)
`pro les and deliver most updates within a day(cid:6) The
`e cient implementation of the SIFT ltering engine
`enables the job to be done on regular hardware(cid:3) a
`DECstation (cid:6) Even though we believe SIFT
`is e cient(cid:3) there is a limit to the load that it can
`handle(cid:6) It is necessary to replicate the server(cid:13) in fact(cid:3)
`we have been in contact with several sites in the U(cid:6)S(cid:6)
`and Europe that expressed interests in providing the
`same service(cid:6)
`The Netnews SIFT server is not meant to be a
`replacement of the Netnews system(cid:3) which is an in(cid:4)
`
`Unfied Patents Exhibit 1012
`Pg. 5
`
`
`
`valuable tool for carrying on distributed discussions(cid:2)
`Rather(cid:3) it provides a complementary ltering service(cid:2)
`Only the beginning lines of matched articles are sent
`out(cid:2) The user may(cid:3) after determining that an article
`is relevant(cid:3) access his own local news host to access
`the article(cid:2) In cases where he does not have access to
`a news host(cid:3) he may request the whole article be sent
`from the SIFT server(cid:2)
`Sifting Netnews is just one application of SIFT(cid:2)
`We have set up another SIFT server(cid:3) disseminating
`Computer Science Technical Reports email access(cid:6)
`elib(cid:2)db(cid:3)stanford(cid:3)edu(cid:3) WWW access will soon be
`available(cid:2) This server has close to (cid:3) subscrip(cid:10)
`tions November (cid:2)
`
`To provide the test run capability(cid:3) we need to in(cid:10)
`dex a collection of documents(cid:2) This can be done by
`some publicly available software(cid:3) and we choose to
`use the index engine in the WAIS distribution called
`waisindex(cid:2) From our experience it is a very robust
`implementation(cid:2) However(cid:3) since we use a di erent
`scoring scheme than WAIS(cid:3) we made slight modi ca(cid:10)
`tions to the WAIS code to make it compatible with
`our ltering engine i(cid:2)e(cid:2)(cid:3) giving the same results(cid:2) The
`major changes are to support the use of relevance
`threshold and the processing of boolean queries(cid:2) The
`modi ed waisindex code is included in our SIFT dis(cid:10)
`tribution(cid:2)
`
`
`
`Implementation
`
`SIFT is implemented in C and has been compiled
`on several Unix platforms(cid:6) DEC Ultrix (cid:2)(cid:3) HPUX
`(cid:2) (cid:3) and SunOS (cid:2) (cid:2) In the following we rst give
`an overview of the components of the system and then
`focus on the implementation of the ltering engine(cid:2)
`
`(cid:2) System Components
`
`The program mailui implements the email request
`handler shown in Figure (cid:2) It parses email messages(cid:3)
`assuming the format in Cro(cid:2) User requests are
`processed by accessing the subscription database and
`the document index for test runs(cid:2)
`The program wwwui is a so(cid:10)called cgi(cid:10)script(cid:19) for
`use with the HTTP daemon released by National
`Center for Supercomputing Applications(cid:2) It accepts
`requests submitted by users using a WWW client
`with a form(cid:10) lling graphical interface(cid:3) and(cid:3) simi(cid:10)
`lar to mailui(cid:3) it accesses the subscription database
`and document index to process the requests(cid:2)
`The program filter implements the ltering en(cid:10)
`gine(cid:2) It accepts as input a list of documents in the
`form of Unix path names and matches them against
`the stored pro les(cid:2) The implementation details are
`discussed in the next subsection(cid:2) The output from
`filter is a le of matchings(cid:2)
`After the ltering is completed(cid:3) the matchings are
`sorted by users and subscription identi ers(cid:2) This is
`necessary because SIFT sends out updates on a per
`subscription basis(cid:2) After sorting(cid:3) the program alert
`can be used to send out the message one by one(cid:3)
`using Unix sendmail(cid:2) Included in the messages are
`excerpts from the matched documents a few lines
`from the beginning of the document(cid:2)
`
`(cid:2) Filtering Engine Implementation
`
`In IR(cid:3) to e cient process queries against a collection
`of documents(cid:3) an inverted index Sal of documents
`is built(cid:3) which associates a word with its occurrences
`in the documents(cid:2)
`In the information ltering sce(cid:10)
`nario(cid:3) we may use the same approach(cid:6) an index is
`built of a batch of new documents(cid:3) and pro les are
`treated as stored queries that are run periodically(cid:2)
`This is the approach used in an earlier SIFT proto(cid:10)
`type(cid:2) We refer to this as the document index ap(cid:10)
`proach(cid:2) This approach may not be very appropriate
`with a high volume of new information(cid:2) The docu(cid:10)
`ment index is large and resides on disk(cid:2) Running a
`large number of pro les against it requires a lot of
`disk IOs(cid:2)
`In the current SIFT implementation(cid:3) we use a dif(cid:10)
`ferent approach(cid:2) The idea is to consider information
` ltering as the dual problem of IR(cid:3) and treat pro(cid:10)
` les as documents and documents as queries(cid:2) Instead
`of building a document index(cid:3) we build a pro le in(cid:3)
`dex(cid:2) In YGM b(cid:3)YGM c we propose and analyze
`a variety of pro le indexing techniques for the vec(cid:10)
`tor space and boolean models respectively(cid:2)
`In our
`SIFT implementation(cid:3) we have chosen good indexing
`techniques for the two models that are compatible in
`nature and combined them together in the same in(cid:10)
`dex structure(cid:2) Below we brie y present this combined
`index structure through an example(cid:2) For a detailed
`description(cid:3) as well as descriptions of other pro le in(cid:10)
`dexing techniques and analysis(cid:3) the reader is referred
`to YGM b(cid:3)YGM c(cid:2)
`Referring to Figure (cid:3) suppose P is a weighted pro(cid:10)
` le hunderwater(cid:3) (cid:3) archeology(cid:3) i and P is a
`boolean pro le y shing not underwater(cid:2)(cid:19) For each
`word(cid:3) we associate it with an inverted list of postings
`that reference pro les containing it(cid:2) For example(cid:3) in
`
`Unfied Patents Exhibit 1012
`Pg. 6
`
`
`
`Inverted lists
`
` Performance Study
`
`Directory
`
`underwater
`
`archeology
`
`fly
`
`fishing
`
`P2
`
`-1
`
`Postings
`
`P1
`
`P1
`
`P2
`
`P2
`
`60
`
`60
`
`1
`
`1
`
`Threshold
`
`Score
`
`8400
`
`9600
`
`2
`
`-1
`
`P1
`
`P2
`
`Figure (cid:3) Pro le index and auxiliary data structures
`
`Figure the inverted list for word underwater(cid:6) is
`made up of two postings referencing P and P(cid:8) In
`the posting(cid:9) if the pro le is weighted(cid:9) we include the
`weight of the word in the pro le(cid:10) e(cid:8)g(cid:8)(cid:9) for un(cid:13)
`derwater(cid:6) in P (cid:8) If the pro le is a boolean one(cid:9) we
`assign a weight of to a non(cid:13)negated word(cid:9) and (cid:13) to
`a negated one(cid:8)
`In addition(cid:9) we make use of two arrays called
`Threshold and Score(cid:10) each pro le has an entry in each
`array(cid:8) For each pro le(cid:9) its Threshold entry stores the
`minimum score that a document must have against it
`to be a match(cid:8) For a boolean pro le(cid:9) this minimum
`is equal to the number of non(cid:13)negated words(cid:8) For a
`weighted pro le(cid:9) the minimum is derived using the
`relevance threshold speci ed by the user(cid:8) The Score
`entry is used to accumulate the score that a document
`has against the pro le(cid:8)
`Now suppose a document D containing the words
`underwater archeology(cid:6) and none of the other pro(cid:13)
` le words is being processed(cid:8)
`Suppose the vec(cid:13)
`tor space representation for D is hunderwater(cid:9) (cid:9)
`archeology(cid:9) (cid:9) (cid:0) (cid:0) (cid:0)i(cid:10) i(cid:8)e(cid:8)(cid:9) the words underwater(cid:6)
`and archeology(cid:6) are both assigned weights of (cid:8)
`To process the word underwater(cid:9)(cid:6) we look up the
`directory and access its inverted list(cid:8) For a weighted
`pro le like P (cid:9) the document weight of underwater(cid:6)
`is multiplied with the weight in the posting (cid:9) and
`the product is accumulated in P (cid:17)s Score entry note
`that we are calculating the dot product of the two
`vectors(cid:8) For a boolean pro le like P(cid:9) we simply add
`its posting weight (cid:13) for underwater(cid:6) to the Score
`entry(cid:8) Similarly the word archeology(cid:6) is processed(cid:8)
`Finally(cid:9) pro les whose resulting Score entry is no less
`than the Threshold entry match the document(cid:9) such
`as P in Figure (cid:8)
`
`A practical challenge for SIFT is to process a new
`batch of documents within the smallest time unit
`of noti cation(cid:8) For example(cid:9) in the Netnews SIFT
`server(cid:9) it is critical that the lter and notify process
`is nished within hours(cid:8) Thus(cid:9) it is important
`to understand and characterize the performance of
`the SIFT system(cid:9) and measure how it scales with the
`number of pro les and volume of information(cid:8)
`First we study the performance of the processing
`time with respect to the number of documents and
`number of pro les(cid:8) We then compare the perfor(cid:13)
`mance of the ltering engine with and the alternative
`document index approach discussed in Section (cid:8)(cid:8)
`Finally we investigate the main memory requirement
`of the ltering engine(cid:8) All experiments are run on
`a dedicated DECstation running Ultrix (cid:8)
`and all times reported are real wall clock time(cid:8)
`The study was performed in early July(cid:9) (cid:8) The
`test data consisted of (cid:9) articles received on the
`day of July (cid:9) by our department(cid:17)s news host
`and (cid:9) randomly selected subscriptions from the
`Netnews SIFT server at that time(cid:8) The average ar(cid:13)
`ticle size was words a word is de ned as an al(cid:13)
`phanumeric string longer than characters(cid:8) The
`subscriptions were stored on a local disk(cid:9) while the
`news articles were stored on an NFS(cid:13)mounted disk
`from the news host(cid:8) We repeated the experiments
`with test data from two other days and the results
`were within of those reported below(cid:8)
`In the result graphs that follow(cid:9) we divide SIFT
`processing time into these four steps(cid:3)
`
` (cid:8) Build Time The pro le index structure is built(cid:8)
`Other auxiliary data structures are allocated and
`initialized(cid:8)
`
`(cid:8) Filtering Time Documents are run against the
`index one by one(cid:8) Document(cid:13)pro le matchings
`are written into a le(cid:8)
`
` (cid:8) Sorting Time The document(cid:13)pro le matching
` le is sorted by user email address and subscrip(cid:13)
`tion identi er(cid:9) using Unix sort command(cid:8)
`
`(cid:8) Notify Time The sorted matching le is read(cid:8)
`Excerpts of matchings for each subscription are
`prepared into an email message(cid:8) Unix sendmail
`is invoked to send out each message(cid:8)
`
`Unfied Patents Exhibit 1012
`Pg. 7
`
`
`
`Build + Filter + Sort + Notify Time
`Build + Filter + Sort Time
`Build + Filter Time
`Build Time
`
`2000
`
`4000
`
`8000
`6000
`Number of Profiles
`
`10000
`
`12000
`
`14000
`
`45000
`
`40000
`
`35000
`
`30000
`
`25000
`
`20000
`
`15000
`
`10000
`
`5000
`
`0
`
`0
`
`Time (sec.)
`
`Build + Filter + Sort + Notify Time
`Build + Filter + Sort Time
`Build + Filter Time
`Build Time
`
`5000
`
`10000 15000 20000 25000 30000 35000
`Number of Documents
`
`25000
`
`20000
`
`15000
`
`10000
`
`5000
`
`0
`
`0
`
`Time (sec.)
`
`Figure (cid:3) SIFT performance vs(cid:4) (cid:5) documents
`
`Figure (cid:3) SIFT performance vs(cid:4) (cid:5) pro les
`
`(cid:2) Coping with Information Volume
`
`First we investigate the relationship between the pro(cid:6)
`cessing time and the number of documents(cid:4) Figure
` shows the results of processing the (cid:8) articles
`against the (cid:8) subscriptions(cid:4) The time it takes
`to build the pro le index is very small and is as ex(cid:6)
`pected independent of the number of documents(cid:4) The
` ltering time is linear with respect to the number
`of documents(cid:8) with slope (cid:4) sec(cid:4)document value
`obtained by curve(cid:6) tting using Mathematica(cid:4) The
`time it takes to sort the matching le is negligible(cid:4)
`The proportionality constant for the total running
`time is (cid:4) sec(cid:4)document(cid:4) Recall that the articles
`are stored on a remote news host(cid:8) so some fraction
`of this time is spent in reading the articles via local
`network(cid:4) This is con rmed by looking at the CPU
`utilization(cid:8) which is found to be (cid:4)(cid:4)
`
`To measure the notify time(cid:8) since it would be im(cid:6)
`possible to send the same updates repeatedly to our
`real users(cid:8) the notify times shown in Figure except
`the rightmost data point are interpolated results(cid:4)
`We assume that the time it takes to send out noti (cid:6)
`cations is proportional to the number of matchings in
`the matching le(cid:4) We have veri ed this assumption
`with a separate experiment and derived the propor(cid:6)
`tionality constant(cid:4) Then in the experiment for Figure
` (cid:8) we count the number of matchings in the matching
` le at the data points shown(cid:4)