throbber

`
`
`
`
`UNIFIED PATENTS
`
`EXHIBIT 1012
`
`Unfied Patents Exhibit 1012
`Pg. 1
`
`

`

`In Proceedings of the  USENIX Technical Conference(cid:2) pages (cid:5)(cid:2) 
`
`SIFT  A Tool for
`Wide(cid:3)Area Information Dissemination
`
`Tak W(cid:2) Yan Hector Garcia(cid:3)Molina
`
`Department of Computer Science
`Stanford University
`Stanford(cid:2) CA  
`ftyan(cid:4) hectorg(cid:5)cs(cid:2)stanford(cid:2)edu
`
`February (cid:4) 
`
`Abstract
`
`The dissemination model is becoming increasingly
`important in wide(cid:2)area information system(cid:3) In this
`model(cid:4) the user subscribes to an information dissem(cid:2)
`ination service by submitting proles that describe
`his interests(cid:3) He then passively receives new(cid:4) l(cid:2)
`tered information(cid:3) The Stanford Information Filter(cid:2)
`ing Tool SIFT is a tool to help provide such ser(cid:2)
`vice(cid:3) It supports full(cid:2)text ltering using well(cid:2)known
`information retrieval models(cid:3) The SIFT ltering en(cid:2)
`gine implements novel indexing techniques(cid:4) capable
`of processing large volumes of information against a
`large number of proles(cid:3)
`It runs on several major
`Unix platforms and is freely available to the public(cid:3)
`In this paper we present SIFT(cid:8)s approach to user in(cid:2)
`terest modeling and user(cid:2)server communication(cid:3) We
`demonstrate the processing capability of SIFT by de(cid:2)
`scribing a running server that disseminates USENET
`News(cid:3) We present an empirical study of SIFT(cid:8)s per(cid:2)
`formance(cid:4) examining its main memory requirement
`and ability to scale with information volume and user
`population(cid:3)
`
`
`
`Introduction
`
`Technological advances have made wide(cid:2)area infor(cid:2)
`mation sharing commonplace(cid:3) A suite of tools have
`
` This research was sponsored by the Advanced Research
`Projects Agency ARPA of the Department of Defense un(cid:4)
`der Grant No(cid:5)MDA (cid:4) (cid:4)J(cid:4)  with the Corporation for Na(cid:4)
`tional Research Initiatives CNRI(cid:5) The views and conclusions
`contained in this document are those of the authors and should
`not be interpreted as necessarily representing the ocial poli(cid:4)
`cies or endorsement(cid:12) either expressed or implied(cid:12) of ARPA(cid:12) the
`U(cid:5)S(cid:5) Government(cid:12) or CNRI(cid:5)
`
`emerged for network information nding and dis(cid:2)
`covery(cid:9) e(cid:3)g(cid:3)(cid:4) Wide(cid:2)Area Information Servers WAIS
`KM (cid:4) archie ED (cid:4) World(cid:2)Wide Web WWW
`BLCGP (cid:4) and gopher McC (cid:3) However(cid:4) these
`new tools have one important missing element(cid:3) They
`provide a means to search for existing information(cid:4)
`but lack a mechanism for continuously informing the
`user of new information(cid:3) The exploding volume of
`digital information makes it dicult for the user(cid:4)
`equipped with only search capability(cid:4) to keep up with
`the fast pace of information generation(cid:3)
`Instead of
`making the user go after the information(cid:4) it is desir(cid:2)
`able to have information selectively ow to the user(cid:3)
`In an information dissemination aka(cid:3) alert(cid:2) informa(cid:3)
`tion ltering(cid:2) selective dissemination of information
`service(cid:4) the user expresses his interests in a number
`of long(cid:2)term(cid:4) continuously evaluated queries(cid:4) called
`proles(cid:3) He will then passively receive documents l(cid:2)
`tered according to the proles(cid:3) Such a service will be(cid:2)
`come increasingly important and form an indispens(cid:2)
`able tool for the dynamic environment of wide(cid:2)area
`information systems(cid:3)
`A very simple kind of information dissemination
`service is already available on the Internet(cid:17) mailing
`lists see e(cid:3)g(cid:3)(cid:4) Kro (cid:3) Hundreds of mailing lists ex(cid:2)
`ist(cid:4) covering a wide variety of topics(cid:3) The user sub(cid:2)
`scribes to lists of interest to him and receives mes(cid:2)
`sages on the topic via email(cid:3) He may also send mes(cid:2)
`sages to the lists to reach other subscribers(cid:3) LIST(cid:2)
`SERV is a software system for maintaining mailing
`lists(cid:3) A problem with the mailing list mechanism as
`a tool for information dissemination is that it pro(cid:2)
`vides a crude granularity of interest matching(cid:3) A user
`whose information need does not exactly match cer(cid:2)
`tain lists will either receive too many irrelevant or too
`few relevant messages(cid:3) The USENET News or Net(cid:2)
`
`
`
`Unfied Patents Exhibit 1012
`Pg. 2
`
`

`

`news system Kro (cid:7) an electronic bulletin board
`system on the Internet(cid:7) is similar in nature to mail(cid:8)
`ing lists(cid:9) While Netnews is extremely successful with
`millions of users and megabytes of daily trac(cid:7) at
`the same time it often creates an information over(cid:8)
`load(cid:9) Like mailing lists(cid:7) the coarse classication of
`topics into newsgroups means that a user subscribing
`to certain newsgroups may not nd all articles inter(cid:8)
`esting(cid:7) and also he will miss relevant articles posted
`in newsgroups that he does not subscribe to(cid:9)
`
`Recent research eorts on information ltering fo(cid:8)
`cus on the ltering eectiveness(cid:7) attempting to pro(cid:8)
`vide more ne(cid:8)grained ltering using relational(cid:7) rule(cid:8)
`based(cid:7) information retrieval IR(cid:7) and articial intel(cid:8)
`ligence approaches(cid:9) With few exceptions(cid:7) they are
`often small scale i(cid:9)e(cid:9)(cid:7) involving a small number of
`users or proles and thus the need to provide e(cid:4)
`cient ltering is not apparent(cid:9) However(cid:7) in a large(cid:8)
`scale wide(cid:8)area system where the number of informa(cid:8)
`tion providers and seekers are large(cid:7) eciency in the
`dissemination process is an important issue and must
`be addressed(cid:9)
`
`The Stanford Information Filtering Tool SIFT is
`a tool for information providers to perform large(cid:8)scale
`information dissemination(cid:9) It can be used to set up
`a clearinghouse service that gathers large amount of
`information and selectively disseminates the informa(cid:8)
`tion to a large population of users(cid:9) It supports full(cid:8)
`text ltering(cid:7) using well(cid:8)known and well(cid:8)studied IR
`models(cid:9) The SIFT ltering engine implements novel
`indexing techniques(cid:7) capable of scaling to large num(cid:8)
`ber of documents and proles(cid:9)
`It runs on several
`major Unix platforms and is freely available to the
`public by anonymous ftp at URL(cid:14)
`
`ftp(cid:2)db(cid:4)stanford(cid:4)edupubsiftsift(cid:5) (cid:4)(cid:4)tar(cid:4)Z
`
`In this paper we describe SIFT(cid:9) We present its ap(cid:8)
`proach to user interest modeling and user(cid:8)server com(cid:8)
`munication(cid:9) We demonstrate the processing capabil(cid:8)
`ity of SIFT by describing a running server that dis(cid:8)
`seminates tens of thousands of Netnews articles daily
`to some (cid:7) subscriptions(cid:9) We describe the imple(cid:8)
`mentation of SIFT(cid:7) focusing on the ltering engine(cid:9)
`Finally we present an empirical study of SIFT(cid:18)s per(cid:8)
`formance(cid:7) examining its main memory requirement
`and ability to scale with information volume and user
`population(cid:9)
`
` Other Previous Work
`
`Boston Community Information System GBBL is
`an experimental information dissemination system(cid:9)
`Like SIFT(cid:7) it allows a ner granularity of interest
`matching than mailing lists or Netnews(cid:9) Users can
`express their interests with IR(cid:8)style(cid:7) keyword(cid:8)based
`proles(cid:9) The system broadcasts new information via
`radio channel to all users(cid:7) who then apply their own
`lters locally(cid:9) While the radio communication chan(cid:8)
`nel makes broadcast inexpensive(cid:7) local processing of
`mostly irrelevant information is very expensive(cid:9) For
`every user(cid:7) a personal computer is dedicated for this
`purpose  this is cited as a major source of complaints
`from the users(cid:9)
`Information Lens MGT(cid:0) provides categoriza(cid:8)
`tion and ltering of semi(cid:8)structured messages such
`as email(cid:9) The user denes rules for ltering(cid:7) and
`the processing is done at the user site(cid:9)
`It provides
`eective ltering(cid:7) but local processing is expensive
`for large(cid:8)scale information dissemination(cid:9) Similarly(cid:7)
`the kill le(cid:24) mechanism in certain news reader pro(cid:8)
`grams allows the user to locally screen out irrele(cid:8)
`vant articles(cid:9) A kill le only removes specied ar(cid:8)
`ticles from newsgroups that a user subscribes to(cid:7) but
`it does not discover relevant articles in other news(cid:8)
`groups(cid:9) To provide the same kind of ltering power
`as SIFT would require much local processing(cid:9) It is
`more cost(cid:8)eective to pool proles together to share
`the processing overhead(cid:9)
`The Tapestry system GNOT  is a research pro(cid:8)
`totype that uses the relational model for matching
`user interests and documents(cid:25) ltering computation
`is done not on the properties of individual docu(cid:8)
`ments(cid:7) but rather on the entire append(cid:8)only doc(cid:8)
`ument database(cid:9) Ecient query processing tech(cid:8)
`niques are proposed for handling this kind of queries(cid:9)
`Tapestry is built on top of a commercial relational
`database system(cid:9) The Pasadena system WF  in(cid:8)
`vestigates the eectiveness of dierent IR techniques
`in ltering(cid:9) It collects new documents from several
`Internet information sources(cid:7) and periodically run
`proles against them(cid:9) Similar to SIFT(cid:7) it uses a
`clearinghouse approach(cid:7) but ecient ltering is not
`addressed(cid:9)
`In YGM b(cid:7) YGM c we proposed a variety of
`indexing techniques for speeding up information l(cid:8)
`tering under IR models(cid:9) There we evaluated these
`techniques using analysis and simulation(cid:9) The SIFT
`ltering engine is a real implementation of one class
`of index structures that we found to be ecient(cid:9)
`
`Unfied Patents Exhibit 1012
`Pg. 3
`
`

`

`Documents
`
` (cid:2) (cid:2) Filtering Model
`
`User
`
`"underwater
`archeology?"
`
`Email / WWW
`Request Handler
`
`SIFT
`
`...
`
`Document
`Index
`
`Subscription
`Database
`
`Filtering
`Engine
`
`Alerter
`
`Email
`Updates
`
`The interest prole can be expressed in one of two
`IR models(cid:3) boolean and vector space Sal (cid:4) We rst
`focus on the vector space model(cid:4)
`In vector space model(cid:10) queries and documents are
`identied by terms(cid:10) usually words(cid:4)
`If there are m
`terms for content identication(cid:10) then a document D
`is conceptually represented as an m(cid:5)dimensional vec(cid:5)
`tor D (cid:16) hw (cid:2) w(cid:2) (cid:3) (cid:3) (cid:3) (cid:2) wmi(cid:10) where weight wi for term
`ti signies its statistical importance(cid:10) such as its fre(cid:5)
`quency in the document(cid:4) We may also write D as
`hti (cid:2) wi (cid:2) (cid:3) (cid:3) (cid:3) (cid:2) tik (cid:2) wiki where wij (cid:16) (cid:18) e(cid:4)g(cid:4)(cid:10) h un(cid:5)
`derwater(cid:10) (cid:10) archeology(cid:10)  i(cid:4) A query is similarly
`represented(cid:4) For a document(cid:5)query pair(cid:10) a similarity
`measure such as the dot product can be computed
`to determine how similar(cid:11) the two are(cid:4) In an IR en(cid:5)
`vironment(cid:10) the top ranked documents are retrieved
`for a query(cid:4)
`In an information ltering setting(cid:10) a key question
`is how many documents to return to the user(cid:4) We
`could have allowed the user to receive a xed number
`of top ranked documents per update period(cid:10) e(cid:4)g(cid:4)(cid:10) as
`done in FD (cid:4) However(cid:10) this is not very desirable(cid:3)
`in a period when there are many relevant documents(cid:10)
`he may miss some low recall (cid:18) in a period when
`there are few interesting documents(cid:10) he may receive
`irrelevant ones low precision (cid:4) We instead allow
`the user to specify a relevance threshold(cid:10) which is the
`minimum similarity score that a document must have
`against the prole for it to be delivered(cid:4) A default
`value is supplied to the user for convenience(cid:4)
`Instead of using the vector space model(cid:10) the user
`may use boolean proles to specify words that he
`wants in documents received(cid:10) and words to be ex(cid:5)
`cluded(cid:4) For example(cid:10) the boolean prole y shing
`not underwater(cid:11) is for documents that contain both
`words y(cid:11) and shing(cid:11) but not the word under(cid:5)
`water(cid:4)(cid:11) The reader may note that the SIFT boolean
`model only allows conjunction and negation of words(cid:4)
`However(cid:10) the user may approximate disjunction se(cid:5)
`mantics by submitting multiple subscriptions though
`a document may match more than one subscriptions(cid:4)
`
` (cid:2) (cid:2) Prole Construction and Modication
`
`To assist the user with the construction of a prole(cid:10) a
`SIFT server provides a test run facility(cid:4) The user may
`
` Recall is the proportion of relevant documents returned(cid:2)
`and precision is the proportion of documents returned that are
`relevant(cid:3)
`
`Figure (cid:3) An overview of SIFT
`
` SIFT
`
`We begin our presentation of SIFT with an example(cid:4)
`Suppose a user is interested in underwater archeol(cid:5)
`ogy Figure (cid:4) He sends an email subscribe request
`to a SIFT server specifying the prole underwater
`archeology(cid:10)(cid:11) and optionally parameters that control(cid:10)
`for example(cid:10) how often he wants to be updated or how
`long his subscription is for(cid:4) He may alternatively ac(cid:5)
`cess the SIFT server via WWW(cid:10) using a graphical
`WWW client interface to ll out a form with the
`subscription information(cid:4) Before he subscribes(cid:10) he
`may test run his prole against an index of a sam(cid:5)
`ple document collection(cid:4) His subscription is stored
`in the subscription database(cid:4) As the SIFT server re(cid:5)
`ceives new documents(cid:10) the ltering engine will pro(cid:5)
`cess them against the stored subscriptions(cid:10) and noti(cid:5)
`cations will be sent out based on the user(cid:5)specied
`parameters(cid:4)
`In the following we detail the SIFT system from
`the perspectives of user interest modeling and com(cid:5)
`munication protocol(cid:4) We then illustrate SIFT using
`the Netnews SIFT server(cid:4)
`
` (cid:2) User Interest Modeling
`
`A user subscribes to a SIFT server with one or more
`subscriptions(cid:10) one for each topic of interest(cid:4) A sub(cid:5)
`scription includes an IR(cid:5)style prole(cid:10) and as men(cid:5)
`tioned additional parameters to control the frequency
`of updates(cid:10) the amount of information to receive(cid:10) and
`the length of the subscription(cid:4) A subscription is iden(cid:5)
`tied by the email address of the user and a subscrip(cid:5)
`tion identier(cid:4)
`
`Unfied Patents Exhibit 1012
`Pg. 4
`
`

`

`run his initial prole against an existing(cid:3) representa(cid:4)
`tive collection of documents to test the eectiveness
`of his prole(cid:6) He may interactively change the pro(cid:4)
`le and for weighted proles adjust the threshold
`to the desired level of precision and recall(cid:6) When he
`is satised with the performance of the ltering(cid:3) he
`may then subscribe with the selected settings(cid:6)
`After the user receives some periodic updates(cid:3) he
`may decide to modify his prole or change the thresh(cid:4)
`old for vector space proles(cid:6) He may do this by
`accessing the SIFT server(cid:6) Furthermore(cid:3) for vec(cid:4)
`tor space proles(cid:3) relevance feedback Sal (cid:3) a well(cid:4)
`known technique in IR to improve retrieval eective(cid:4)
`ness(cid:3) can be used(cid:6) The user simply gives SIFT the
`documents that he nds interesting(cid:13) after examining
`them(cid:3) the server adjusts the weights of the words in
`the user(cid:14)s prole accordingly(cid:6)
`
` (cid:2) Communication Protocol
`
`There are two modes of communication between the
`user and a SIFT server(cid:6) In the interactive mode(cid:3) the
`user subscribes(cid:3) test(cid:4)runs a prole(cid:3) views(cid:3) updates(cid:3)
`or cancels his subscriptions(cid:6) In the passive mode(cid:3) the
`user periodically receives information updates(cid:6)
`In(cid:4)
`stead of developing a new communication protocol(cid:3)
`we make use of current technologies(cid:15) email Cro
`and World(cid:4)Wide Web HTTP BLCGP (cid:6)
`First we discuss the interactive communication
`mode(cid:6) Email communication is the lowest common
`denominator of network connectivity(cid:6) By having an
`email interface(cid:3) a SIFT server is accessible from users
`with less powerful machines(cid:3) with limited network
`capability(cid:3) or behind Internet(cid:4)access rewalls(cid:6) We
`adopt the LISTSERV mailing list syntax wherever
`possible(cid:3) to ensure that minimal learning is required(cid:6)
`We also use default settings to reduce the complexity
`of email requests for novice users(cid:6)
`As the appeal of hypermedia navigation and the de(cid:4)
`velopment of sophisticated client interfaces are mak(cid:4)
`ing WWW the preferred tool for wide(cid:4)area infor(cid:4)
`mation sharing(cid:3) we also developed a WWW access
`interface for SIFT(cid:6) Using a WWW client program(cid:3)
`the user interacts with the SIFT server through a
`user(cid:4)friendly graphical interface(cid:6) We believe the dual
`email and WWW access covers the tradeo between
`wide availability and ease of use(cid:6)
`In the passive mode of user notication(cid:3) a SIFT
`server sends out email messages that contain excerpts
`of new(cid:3) potentially relevant documents certain num(cid:4)
`ber of lines from the beginning(cid:3) as specied by the
`
`user(cid:6) After the user reads the excerpts(cid:3) he may ac(cid:4)
`cess the SIFT server to retrieve the entire documents(cid:6)
`Currently the excerpts are formatted to be read from
`regular mail readers(cid:6) We plan to oer the option of
`formatting them in HTML(cid:3) so that the user may view
`the notications from sophisticated WWW viewers(cid:3)
`and then interactively retrieve interesting documents
`from SIFT or provide feedback via HTTP(cid:6)
`
` (cid:2) SIFTing Netnews
`
`Using SIFT(cid:3) we have set up a server for selectively
`disseminating Netnews articles text articles only(cid:13) bi(cid:4)
`nary ones are rst screened out(cid:6) Like any other SIFT
`server(cid:3) the user accesses the Netnews SIFT server via
`email or WWW(cid:6) The reader is encouraged to try it
`out(cid:15) for email access(cid:3) please send an electronic mes(cid:4)
`sage to netnews(cid:2)db(cid:3)stanford(cid:3)edu(cid:3) with the word
`help(cid:18) in the body(cid:13) for WWW access(cid:3) please con(cid:4)
`nect to http(cid:4)sift(cid:3)stanford(cid:3)edu(cid:6)
`In February
` (cid:3) we publicized the Netnews server in two news(cid:4)
`groups(cid:13) within ten days of the announcement(cid:3) we re(cid:4)
`ceived well over a thousand proles(cid:6) The number of
`proles keeps increasing and now November 
`exceeds (cid:3)(cid:3) submitted by users from almost all
`continents(cid:6) Table shows some interesting statistics
`obtained from the server on the day of November (cid:3)
` (cid:6) The average number of articles is over the week
`of November  (cid:4) (cid:3) (cid:6)
`
`Number of subscriptions
`Number of users
`Daily average number of articles
`Average notication period in days
`
` (cid:4) 
`(cid:4) 
`(cid:4) 
` (cid:14)
`
`Table (cid:15) Netnews SIFT server statistics
`
`Apparent from these numbers is that the load on
`the SIFT server is very high(cid:6) It is necessary to match
`an average of over (cid:3) articles against some (cid:3)
`proles and deliver most updates within a day(cid:6) The
`ecient implementation of the SIFT ltering engine
`enables the job to be done on regular hardware(cid:3) a
`DECstation (cid:6) Even though we believe SIFT
`is ecient(cid:3) there is a limit to the load that it can
`handle(cid:6) It is necessary to replicate the server(cid:13) in fact(cid:3)
`we have been in contact with several sites in the U(cid:6)S(cid:6)
`and Europe that expressed interests in providing the
`same service(cid:6)
`The Netnews SIFT server is not meant to be a
`replacement of the Netnews system(cid:3) which is an in(cid:4)
`
`Unfied Patents Exhibit 1012
`Pg. 5
`
`

`

`valuable tool for carrying on distributed discussions(cid:2)
`Rather(cid:3) it provides a complementary ltering service(cid:2)
`Only the beginning lines of matched articles are sent
`out(cid:2) The user may(cid:3) after determining that an article
`is relevant(cid:3) access his own local news host to access
`the article(cid:2) In cases where he does not have access to
`a news host(cid:3) he may request the whole article be sent
`from the SIFT server(cid:2)
`Sifting Netnews is just one application of SIFT(cid:2)
`We have set up another SIFT server(cid:3) disseminating
`Computer Science Technical Reports email access(cid:6)
`elib(cid:2)db(cid:3)stanford(cid:3)edu(cid:3) WWW access will soon be
`available(cid:2) This server has close to (cid:3) subscrip(cid:10)
`tions November (cid:2)
`
`To provide the test run capability(cid:3) we need to in(cid:10)
`dex a collection of documents(cid:2) This can be done by
`some publicly available software(cid:3) and we choose to
`use the index engine in the WAIS distribution called
`waisindex(cid:2) From our experience it is a very robust
`implementation(cid:2) However(cid:3) since we use a dierent
`scoring scheme than WAIS(cid:3) we made slight modica(cid:10)
`tions to the WAIS code to make it compatible with
`our ltering engine i(cid:2)e(cid:2)(cid:3) giving the same results(cid:2) The
`major changes are to support the use of relevance
`threshold and the processing of boolean queries(cid:2) The
`modied waisindex code is included in our SIFT dis(cid:10)
`tribution(cid:2)
`
`
`
`Implementation
`
`SIFT is implemented in C and has been compiled
`on several Unix platforms(cid:6) DEC Ultrix (cid:2)(cid:3) HPUX
`(cid:2)(cid:3) and SunOS (cid:2) (cid:2) In the following we rst give
`an overview of the components of the system and then
`focus on the implementation of the ltering engine(cid:2)
`
`(cid:2) System Components
`
`The program mailui implements the email request
`handler shown in Figure (cid:2) It parses email messages(cid:3)
`assuming the format in Cro(cid:2) User requests are
`processed by accessing the subscription database and
`the document index for test runs(cid:2)
`The program wwwui is a so(cid:10)called cgi(cid:10)script(cid:19) for
`use with the HTTP daemon released by National
`Center for Supercomputing Applications(cid:2) It accepts
`requests submitted by users using a WWW client
`with a form(cid:10)lling graphical interface(cid:3) and(cid:3) simi(cid:10)
`lar to mailui(cid:3) it accesses the subscription database
`and document index to process the requests(cid:2)
`The program filter implements the ltering en(cid:10)
`gine(cid:2) It accepts as input a list of documents in the
`form of Unix path names and matches them against
`the stored proles(cid:2) The implementation details are
`discussed in the next subsection(cid:2) The output from
`filter is a le of matchings(cid:2)
`After the ltering is completed(cid:3) the matchings are
`sorted by users and subscription identiers(cid:2) This is
`necessary because SIFT sends out updates on a per
`subscription basis(cid:2) After sorting(cid:3) the program alert
`can be used to send out the message one by one(cid:3)
`using Unix sendmail(cid:2) Included in the messages are
`excerpts from the matched documents a few lines
`from the beginning of the document(cid:2)
`
`(cid:2) Filtering Engine Implementation
`
`In IR(cid:3) to ecient process queries against a collection
`of documents(cid:3) an inverted index Sal  of documents
`is built(cid:3) which associates a word with its occurrences
`in the documents(cid:2)
`In the information ltering sce(cid:10)
`nario(cid:3) we may use the same approach(cid:6) an index is
`built of a batch of new documents(cid:3) and proles are
`treated as stored queries that are run periodically(cid:2)
`This is the approach used in an earlier SIFT proto(cid:10)
`type(cid:2) We refer to this as the document index ap(cid:10)
`proach(cid:2) This approach may not be very appropriate
`with a high volume of new information(cid:2) The docu(cid:10)
`ment index is large and resides on disk(cid:2) Running a
`large number of proles against it requires a lot of
`disk IOs(cid:2)
`In the current SIFT implementation(cid:3) we use a dif(cid:10)
`ferent approach(cid:2) The idea is to consider information
`ltering as the dual problem of IR(cid:3) and treat pro(cid:10)
`les as documents and documents as queries(cid:2) Instead
`of building a document index(cid:3) we build a prole in(cid:3)
`dex(cid:2) In YGM b(cid:3)YGM c we propose and analyze
`a variety of prole indexing techniques for the vec(cid:10)
`tor space and boolean models respectively(cid:2)
`In our
`SIFT implementation(cid:3) we have chosen good indexing
`techniques for the two models that are compatible in
`nature and combined them together in the same in(cid:10)
`dex structure(cid:2) Below we briey present this combined
`index structure through an example(cid:2) For a detailed
`description(cid:3) as well as descriptions of other prole in(cid:10)
`dexing techniques and analysis(cid:3) the reader is referred
`to YGM b(cid:3)YGM c(cid:2)
`Referring to Figure (cid:3) suppose P is a weighted pro(cid:10)
`le hunderwater(cid:3) (cid:3) archeology(cid:3) i and P is a
`boolean prole y shing not underwater(cid:2)(cid:19) For each
`word(cid:3) we associate it with an inverted list of postings
`that reference proles containing it(cid:2) For example(cid:3) in
`
`Unfied Patents Exhibit 1012
`Pg. 6
`
`

`

`Inverted lists
`
` Performance Study
`
`Directory
`
`underwater
`
`archeology
`
`fly
`
`fishing
`
`P2
`
`-1
`
`Postings
`
`P1
`
`P1
`
`P2
`
`P2
`
`60
`
`60
`
`1
`
`1
`
`Threshold
`
`Score
`
`8400
`
`9600
`
`2
`
`-1
`
`P1
`
`P2
`
`Figure (cid:3) Prole index and auxiliary data structures
`
`Figure  the inverted list for word underwater(cid:6) is
`made up of two postings referencing P and P(cid:8) In
`the posting(cid:9) if the prole is weighted(cid:9) we include the
`weight of the word in the prole(cid:10) e(cid:8)g(cid:8)(cid:9)  for un(cid:13)
`derwater(cid:6) in P (cid:8) If the prole is a boolean one(cid:9) we
`assign a weight of to a non(cid:13)negated word(cid:9) and (cid:13) to
`a negated one(cid:8)
`In addition(cid:9) we make use of two arrays called
`Threshold and Score(cid:10) each prole has an entry in each
`array(cid:8) For each prole(cid:9) its Threshold entry stores the
`minimum score that a document must have against it
`to be a match(cid:8) For a boolean prole(cid:9) this minimum
`is equal to the number of non(cid:13)negated words(cid:8) For a
`weighted prole(cid:9) the minimum is derived using the
`relevance threshold specied by the user(cid:8) The Score
`entry is used to accumulate the score that a document
`has against the prole(cid:8)
`Now suppose a document D containing the words
`underwater archeology(cid:6) and none of the other pro(cid:13)
`le words is being processed(cid:8)
`Suppose the vec(cid:13)
`tor space representation for D is hunderwater(cid:9) (cid:9)
`archeology(cid:9) (cid:9) (cid:0) (cid:0) (cid:0)i(cid:10) i(cid:8)e(cid:8)(cid:9) the words underwater(cid:6)
`and archeology(cid:6) are both assigned weights of (cid:8)
`To process the word underwater(cid:9)(cid:6) we look up the
`directory and access its inverted list(cid:8) For a weighted
`prole like P (cid:9) the document weight of underwater(cid:6)
`is multiplied with the weight in the posting (cid:9) and
`the product is accumulated in P (cid:17)s Score entry note
`that we are calculating the dot product of the two
`vectors(cid:8) For a boolean prole like P(cid:9) we simply add
`its posting weight (cid:13) for underwater(cid:6) to the Score
`entry(cid:8) Similarly the word archeology(cid:6) is processed(cid:8)
`Finally(cid:9) proles whose resulting Score entry is no less
`than the Threshold entry match the document(cid:9) such
`as P in Figure (cid:8)
`
`A practical challenge for SIFT is to process a new
`batch of documents within the smallest time unit
`of notication(cid:8) For example(cid:9) in the Netnews SIFT
`server(cid:9) it is critical that the lter and notify process
`is nished within  hours(cid:8) Thus(cid:9) it is important
`to understand and characterize the performance of
`the SIFT system(cid:9) and measure how it scales with the
`number of proles and volume of information(cid:8)
`First we study the performance of the processing
`time with respect to the number of documents and
`number of proles(cid:8) We then compare the perfor(cid:13)
`mance of the ltering engine with and the alternative
`document index approach discussed in Section (cid:8)(cid:8)
`Finally we investigate the main memory requirement
`of the ltering engine(cid:8) All experiments are run on
`a dedicated DECstation  running Ultrix (cid:8)
`and all times reported are real wall clock time(cid:8)
`The study was performed in early July(cid:9) (cid:8) The
`test data consisted of (cid:9) articles received on the
`day of July (cid:9)  by our department(cid:17)s news host
`and (cid:9) randomly selected subscriptions from the
`Netnews SIFT server at that time(cid:8) The average ar(cid:13)
`ticle size was  words a word is dened as an al(cid:13)
`phanumeric string longer than  characters(cid:8) The
`subscriptions were stored on a local disk(cid:9) while the
`news articles were stored on an NFS(cid:13)mounted disk
`from the news host(cid:8) We repeated the experiments
`with test data from two other days and the results
`were within   of those reported below(cid:8)
`In the result graphs that follow(cid:9) we divide SIFT
`processing time into these four steps(cid:3)
`
` (cid:8) Build Time The prole index structure is built(cid:8)
`Other auxiliary data structures are allocated and
`initialized(cid:8)
`
`(cid:8) Filtering Time Documents are run against the
`index one by one(cid:8) Document(cid:13)prole matchings
`are written into a le(cid:8)
`
` (cid:8) Sorting Time The document(cid:13)prole matching
`le is sorted by user email address and subscrip(cid:13)
`tion identier(cid:9) using Unix sort command(cid:8)
`
`(cid:8) Notify Time The sorted matching le is read(cid:8)
`Excerpts of matchings for each subscription are
`prepared into an email message(cid:8) Unix sendmail
`is invoked to send out each message(cid:8)
`
`Unfied Patents Exhibit 1012
`Pg. 7
`
`

`

`Build + Filter + Sort + Notify Time
`Build + Filter + Sort Time
`Build + Filter Time
`Build Time
`
`2000
`
`4000
`
`8000
`6000
`Number of Profiles
`
`10000
`
`12000
`
`14000
`
`45000
`
`40000
`
`35000
`
`30000
`
`25000
`
`20000
`
`15000
`
`10000
`
`5000
`
`0
`
`0
`
`Time (sec.)
`
`Build + Filter + Sort + Notify Time
`Build + Filter + Sort Time
`Build + Filter Time
`Build Time
`
`5000
`
`10000 15000 20000 25000 30000 35000
`Number of Documents
`
`25000
`
`20000
`
`15000
`
`10000
`
`5000
`
`0
`
`0
`
`Time (sec.)
`
`Figure (cid:3) SIFT performance vs(cid:4) (cid:5) documents
`
`Figure (cid:3) SIFT performance vs(cid:4) (cid:5) proles
`
`(cid:2) Coping with Information Volume
`
`First we investigate the relationship between the pro(cid:6)
`cessing time and the number of documents(cid:4) Figure
` shows the results of processing the (cid:8) articles
`against the (cid:8) subscriptions(cid:4) The time it takes
`to build the prole index is very small and is as ex(cid:6)
`pected independent of the number of documents(cid:4) The
`ltering time is linear with respect to the number
`of documents(cid:8) with slope (cid:4) sec(cid:4)document value
`obtained by curve(cid:6)tting using Mathematica(cid:4) The
`time it takes to sort the matching le is negligible(cid:4)
`The proportionality constant for the total running
`time is (cid:4) sec(cid:4)document(cid:4) Recall that the articles
`are stored on a remote news host(cid:8) so some fraction
`of this time is spent in reading the articles via local
`network(cid:4) This is conrmed by looking at the CPU
`utilization(cid:8) which is found to be (cid:4)(cid:4)
`
`To measure the notify time(cid:8) since it would be im(cid:6)
`possible to send the same updates repeatedly to our
`real users(cid:8) the notify times shown in Figure except
`the rightmost data point are interpolated results(cid:4)
`We assume that the time it takes to send out noti(cid:6)
`cations is proportional to the number of matchings in
`the matching le(cid:4) We have veried this assumption
`with a separate experiment and derived the propor(cid:6)
`tionality constant(cid:4) Then in the experiment for Figure
` (cid:8) we count the number of matchings in the matching
`le at the data points shown(cid:4)

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket