`Information Filtering: Model, System, and
`Evaluation
`
`J. MOSTAFA
`Indiana University
`S. MUKHOPADHYAY
`Purdue University
`W. LAM
`The Chinese University of Hong Kong
`and
`M. PALAKAL
`Purdue University
`
`In information-filtering environments, uncertainties associated with changing interests of the
user and the dynamic document stream must be handled efficiently. In this article, a filtering
model is proposed that decomposes the overall task into subsystem functionalities and
`highlights the need for multiple adaptation techniques to cope with uncertainties. A filtering
`system, SIFTER, has been implemented based on the model, using established techniques in
information retrieval and artificial intelligence. These techniques include document
representation by a vector-space model, document classification by unsupervised learning, and user
`modeling by reinforcement learning. The system can filter information based on content and a
`user’s specific interests. The user’s interests are automatically learned with only limited user
`intervention in the form of optional relevance feedback for documents. We also describe
`experimental studies conducted with SIFTER to filter computer and information science
`documents collected from the Internet and commercial database services. The experimental
`results demonstrate that the system performs very well in filtering documents in a realistic
`problem setting.
Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information
Search and Retrieval—clustering; selection process; I.2.6 [Artificial Intelligence]:
Learning; I.7.3 [Text Processing]: Index Generation
`
`S. Mukhopadhyay was partially supported by NSF CAREER grant ECS-9623971 during the
`course of the research reported in this article.
Authors’ addresses: J. Mostafa, School of Library and Information Science, Indiana University,
Bloomington, IN 47405-1801; email: jm@juliet.ucs.indiana.edu; S. Mukhopadhyay and M.
Palakal, Computer and Information Science, Purdue University School of Science at Indianapolis,
Indianapolis, IN 46202; W. Lam, Department of Systems Engineering and Engineering
Management, The Chinese University of Hong Kong, Shatin, Hong Kong.
Permission to make digital/hard copy of part or all of this work for personal or classroom use
is granted without fee provided that the copies are not made or distributed for profit or
commercial advantage, the copyright notice, the title of the publication, and its date appear,
and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to
republish, to post on servers, or to redistribute to lists, requires prior specific permission
and/or a fee.
© 1997 ACM 1046-8188/97/1000–0368 $03.50
`
ACM Transactions on Information Systems, Vol. 15, No. 4, October 1997, Pages 368–399.
`
`IPR2019-01304
`BloomReach, Inc. EX1025 Page 1
`
`
`
A Multilevel Approach to Intelligent Information Filtering • 369
`
`General Terms: Algorithms, Experimentation, Theory
Additional Key Words and Phrases: Automated document representation, information filtering,
user modeling
`
`1. INTRODUCTION
`Information-filtering (IF) systems have recently gained popularity, mainly
`as part of various information services based on the Internet [Edwards et
`al. 1996; Oard 1996]. These systems are similar to conventional informa-
`tion retrieval (IR) systems in that they aid in selecting documents that
`satisfy users’ information needs. However, certain fundamental differences
`do exist between IF and IR systems, making IF systems interesting and an
`independent object of analysis [Belkin and Croft 1992]. IR systems are
usually designed to facilitate rapid retrieval of information units for
`relatively short-term needs of a diverse population of users. In contrast, IF
`systems are commonly personalized to support long-term information needs
`of a particular user or a group of users with similar needs. They accomplish
`the goal of personalization by directly or indirectly acquiring information
`from the user. In IF systems, these long-term information needs are
`represented as interest profiles (Lewis [1992a] refers to them as standing
`queries), which are subsequently used for matching or ranking purposes.
`The interest profiles are maintained beyond a single session and may be
`modified based on users’ feedback. Another important difference has to do
`with the document source. IR systems usually operate on a relatively static
`set of documents, whereas IF systems are usually concerned with identify-
`ing relevant documents from a continuously changing document stream.
`To operate efficiently, IF systems must acquire and maintain accurate
`knowledge regarding documents as well as users. The dynamic nature of
`users’ interests and the document stream makes the maintenance of such
`knowledge quite complex. Acquiring correct user interest profiles is diffi-
`cult, since users may be unsure of their interests and may not wish to
`invest a great deal of effort in creating such a profile. Acquiring informa-
`tion regarding documents is also difficult, because of the size of the
`document stream and the computational demands associated with parsing
voluminous texts. At any time, new topics may be introduced in the
document stream, or users’ interests in existing topics may change. Further-
`more, sufficiently representative documents may not be available to facili-
`tate a priori analysis or training. Research on filtering, so far, has not
`clarified to a significant extent how these particular problems associated
`with users and documents may influence the overall filtering process.
`In this article, we present both an analytical and an empirical examina-
tion of the basic problems in filtering. In investigating the demands
placed on IF systems, we identify the relevant functions and
`express them at a suitable abstraction level. This abstraction (we refer to it
as the model) is then implemented as a system using well-known techniques
from information science and machine learning. Following this, the
`performance of the resulting system is subjected to rigorous experimental
`analysis to clarify the influence of major constituent functions on the
`overall filtering process. The primary objective of an IF system is to
`perform a mapping from a space of documents to a space of user relevance
`values. This mapping, in turn, can be decomposed into a multilevel process,
`where the intermediate functions involve the subproblems of representa-
`tion, classification, and profile management. To ensure effective service, we
`further assume that these functions must be realized under two strict
`constraints. First, user intervention in the operation of the system must be
`minimized. That is, the system should rely on automated techniques as
`much as possible for acquiring information about documents and users.
`Second, when faced with changes in documents or users’ information needs,
`the system must adjust quickly with little or no degradation in perfor-
`mance.
`In the rest of this section, we discuss in more detail the challenges
`associated with performing effective filtering while minimizing user inter-
`vention and system degradation. We then identify some of the basic
`problems associated with filtering and delineate our approach for address-
`ing them. We conclude the section by surveying related research. In Section
`2, we present our model for information filtering. A description of an
`implementation of the model, named SIFTER (Smart Information Filtering
`Technology for Electronic Resources), is provided in Section 3. Results of
`experimental analysis conducted on SIFTER (and indirectly on the under-
`lying model used) are presented in Section 4. In Section 5, we discuss
`possible future extensions of SIFTER. Finally, we present our conclusions in
`Section 6.
`
`1.1 Problem Description
Uncertainties in the filtering environment—especially the dynamic nature
`of users’ interests and the document stream—make it extremely difficult to
`gather and maintain accurate information necessary for filtering. Rapid or
`gradual changes introduced in the environment, viewed from the perspec-
`tive of the filtering system, are sources of uncertainty. To manage such
`uncertainties requires a high level of adaptivity on the system’s part. This
`adaptivity can be achieved by applying various machine-learning tech-
`niques. The overall problem of IF may then be broadly posed as learning a
`map from a space of documents to the space of real-valued user relevance
factors. More precisely, denoting the space of documents as D, the objective
is to learn a map f : D → ℝ such that f(d) corresponds to the relevance of a
`document d. Given that such a map is known for all points in D, a finite set
`of documents can always be rank-ordered and presented in a prioritized
`fashion to the user.
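The rank-ordering step is straightforward once such a map is available. The sketch below is illustrative only (it is not from the article); `f` stands in for any learned relevance function:

```python
def rank_documents(docs, f):
    """Return documents sorted by estimated relevance f(d), highest first."""
    return sorted(docs, key=f, reverse=True)
```

For example, with a toy relevance function such as document length, `rank_documents` simply presents the longest documents first; in a real filter, `f` would be the learned map described above.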
`In an IF system, f is not known a priori and has to be estimated on-line
`based on queries and user feedback. This could, in principle, be accom-
`plished by setting up some form of a parameterized map approximator
`(such as artificial neural networks) and updating the parameters based on
the feedback. Such direct on-line learning of the map f, however, is
computationally intensive and requires a large amount of user feedback,
considering the high dimensionality of any reasonable representation of the
`documents. To provide a practically feasible solution to the filtering prob-
`lem, we decompose the latter into two levels. The higher level represents a
`classification mapping f1 from the document space to a finite number of
classes {C1, . . . , Cm} (i.e., f1 : D → {C1, . . . , Cm}). This mapping is
`learned in an off-line setting, based on a representative database of
`documents, either by using prior information concerning the classes and
`examples or by automatically discovering abstractions using a clustering
technique. Hence, this higher level partitions the document space into m
equivalence classes over which user relevance is estimated. The lower level
`subsequently estimates the mapping f2 describing user relevance for the
different classes (i.e., f2 : {C1, . . . , Cm} → ℝ). Since f2, unlike f and f1,
`deals with a finite input set of relatively few classes, the on-line learning of
f2 is not unrealistically time consuming or burdensome for the user. Thus,
the map f is learned as the composition of f1 and f2. The decomposi-
`tion of f into f1 and f2 clearly limits the maximum achievable filtering
`accuracy, since a class may not correspond to a constant user interest.
`However, in our experience, the resulting inaccuracy is more than ade-
`quately compensated for by the substantial reduction in learning complex-
`ity. If greater accuracy is desired, it can be achieved as a two-stage process.
`In the first stage, a two-level map (i.e., f1 and f2) is learned as stated
`before. Subsequently, a more general single-level learning scheme can be
`initialized on the basis of learned f1 and f2. From then onward, the general
`map can be used for ranking purposes and can be updated on the basis of
`user feedback.
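The two-level map can be sketched in a few lines. Here `classify` (playing the role of f1) and `class_relevance` (playing the role of f2) are hypothetical stand-ins for the classifier and profile components described above:

```python
def make_filter(classify, class_relevance):
    # f = f2 ∘ f1: map a document to its class (f1), then look up the
    # learned relevance estimate of that class (f2).
    def f(doc):
        return class_relevance[classify(doc)]
    return f

# Toy usage: a keyword-based stand-in classifier and two classes
# with learned relevance estimates.
classify = lambda doc: "C1" if "retrieval" in doc else "C2"
f = make_filter(classify, {"C1": 0.9, "C2": 0.2})
```

Because user feedback only has to tune the per-class table, the on-line learning burden scales with the number of classes rather than with the dimensionality of the document representation.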
`Decomposition of f only aids in reducing the learning complexity; it does
`not eliminate it. The on-line learning problem is made even more difficult
`due to the following factors:
`
`(1) Difficulty of Representation: In general, it is not possible to represent D
`exactly by a finite-dimensional space that corresponds to some features
`of the documents (e.g., the relative frequencies of some predefined
keywords). Hence, any finite-dimensional representation space D′ is
`merely an approximation to D, and there is always a loss of information
`in the process. The area of document representation and indexing
`[Salton and McGill 1983] is devoted to discovering methods for finite-
`dimensional representations that minimize the information loss in
`some sense. In a dynamic environment, to make the problem more
`difficult, the most preferable representation scheme is also a function of
`time. The choice of the representation scheme directly affects the
`realization of function f1.
`(2) Stochasticity of Feedback: The user relevance feedback may at certain
`times appear to be random to the filtering system. This can occur due to
`several reasons. First, the particular user interacting with the system
`may have uncertain needs or may not be very discriminating in
`expressing his or her needs. Second, depending on the f1 chosen, the
`target classes may not correspond to the way a user would normally
`group documents. This may lead to the generation of different user
`relevance feedback values for documents belonging to the same class.
`The third and final factor relates to the difficulty described in (1). On
`certain occasions, user feedback may be motivated by particular fea-
`tures (e.g., keywords) in documents that are actually not part of the
`underlying representation scheme. Feedback generated based on such
`“missing features” would appear as random, because the system would
`be unable to determine what caused such feedback.
`(3) Changing Interests of the User: Due to personal or professional reasons,
`a user’s interests may shift or change. These changes may happen in a
`relatively short duration of time or over a long period. We refer to all
`such situations as the nonstationary user case. The shifts can affect the
`user’s interests partially or fully. Whatever the scope of such shifts, the
`interest profile must be updated accordingly. The map f2 is directly
`affected by this problem.
`
`As mentioned earlier, due to the inherent complexity, filtering based on a
`direct learning approach is very difficult to accomplish in an efficient
`fashion. Decomposition allows us to isolate more specific problems, and we
`solve them by relying on existing and newly developed approaches. The
`main contributions of this article can now be summarized as follows:
`
`—We present a general model of filtering. As a way to reduce complexity,
`the architecture of the model incorporates multilevel functional decompo-
`sition and supports generality through modularity. It admits application
`of virtually any preferred techniques for basic tasks involving represen-
`tation, classification, and profile management.
`—The idea of learning is made central to the filtering process. We show
`how learning techniques can support the high degree of adaptivity
`required while minimizing user intervention. We apply learning tech-
`niques for acquiring information about both documents and users. To
`support adaptation to changes in the document stream, an unsupervised
`cluster discovery method is used. A reinforcement learning algorithm
`with very low overhead is used for user interest profile acquisition.
`—We demonstrate how representation can be conducted on a dynamic
`stream of text. The method provides a high degree of control in determin-
`ing what content to capture and what to ignore. The classification process
`is also designed to be flexible. The set of classes (i.e., the target of f1) can
`easily be changed by invocation of a relearning process. Both of these
`features allow convenient tuning of the filter to minimize user interven-
`tion.
`—We describe a method to handle profile degradation due to shifts in user
`interests. Graceful handling of interest shifts without requiring addi-
`tional data from the user is supported by the method. It is capable of
`detecting multiple interest shifts in the same user and can take appropri-
`ate actions to minimize possible negative effects on the function f2.
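As an illustration of what low-overhead profile learning can look like, the minimal sketch below (an assumption for exposition, not the specific reinforcement scheme used in SIFTER) nudges a per-class relevance estimate toward each feedback value:

```python
def update_profile(profile, doc_class, feedback, alpha=0.1):
    # Exponential moving average: move the stored relevance estimate
    # for the class a small step (learning rate alpha) toward the
    # latest relevance feedback value.
    profile[doc_class] += alpha * (feedback - profile[doc_class])
    return profile

profile = {"C1": 0.5, "C2": 0.5}
update_profile(profile, "C1", feedback=1.0)
```

An update of this form also tracks nonstationary interests: if a user's feedback for a class shifts, the estimate drifts toward the new level at a rate governed by alpha.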
`Preliminary simulation experiments involving an implementation of the
`general model have been reported in earlier sources [Lam et al. 1996;
`Mukhopadhyay et al. 1996]. These experiments simulated the operation of
`filtering LISTSERV emails. In this article, we describe the integration of
`all functionalities into a complete working system, conduct studies involv-
`ing human users in a real-world filtering application, and systematically
`analyze the influence of various user- and system-related parameters on
`the filtering performance.
`
`1.2 Related Work
`As mentioned in the introduction, IF systems are strongly related to IR
`systems in their functional goals and in the methods they apply to accom-
`plish those goals. Belkin and Croft [1992] provided an excellent review of
`IF, comparing it with IR and several other closely related processes (e.g.,
`text routing, text categorization, etc.). We do not intend to repeat such a
`comparison here. Instead, we review literature that deals more directly
`with the problems delineated in the last section.
`A basic filtering problem is to transform a large volume of information
`(text) into entities that permit efficient computation without significant
`loss of content. In our formulation of the problem, this is the objective of
`function f1, mapping documents to a more limited space of document
`classes. This particular task is generally referred to in IR as automated
`document classification. The first step in this process demands that a
`representative feature set for each document be identified. Various tech-
`niques have been developed for feature selection, ranging from simple
`procedures that calculate statistical distribution of keywords to more
`sophisticated techniques relying on analysis based on natural language
`processing (NLP) algorithms. Lewis [1992a] provided a thorough review of
`feature selection procedures and the influence of such procedures on
`document classification. The general and surprising finding of feature
`selection research in IR is that simple keyword-distribution-based ap-
`proaches are almost as effective as more sophisticated approaches [Lewis
`1992b]. The next step in classification involves assigning documents to one
`or more groups. In IR, hierarchical cluster generation techniques such as
`single-link and complete-link methods are commonly applied [Salton 1989].
`A particular track of research in IR has concentrated on generating
`predictive classifier functions based on off-line training conducted on
`representative document sets. As far back as 1963, Borko and Bernick
`[1963] described a text classifier based on a simple linear regression model
`that produced good results. A more recent successful effort that also
`applied a linear model classifier, based on least-squares fit, was described
`by Yang and Chute [1994]. The DARPA-sponsored MUC (Message Under-
`standing Conferences) initiative has generated significant research in the
`area of text routing [Lewis and Tong 1992]. The MUC efforts rely on strong
`NLP approaches to develop classifiers, since analysis is necessary at a
`fine-grain level to assess document content (e.g., identification of terrorist
`events based on news stories). It has been demonstrated, however, that less
`complex linear models may be appropriate if the type of information a
`system must handle is relatively simple (e.g., elements that constitute
`bibliographic document information). Generally, in most IF systems, the
`classification process has to be conducted fast, whereas the classifier
`building process can be delegated to a slower process.
Traditionally, IR research has paid little attention to the users’ role—specifically,
`identification of users’ interests, representation of interests, and applica-
tion of such representations in interactions [Belkin and Croft 1992].
Myaeng and Korfhage’s [1990] work on user profiles is one of the few and
`important efforts in this area. It attempted to integrate user interest
`profiles in IR systems and focused on various combinations of queries and
profiles in enhancing retrieval. The profiles, however, had certain limita-
tions. They had to be contributed directly by the user (who may be
`uncertain or unwilling to take the trouble), and profiles did not change
`during interaction. To keep up with changes in users’ interests automati-
`cally, systems can rely on internal knowledge representations or on learn-
`ing. Rich [1983] demonstrated how in an IR setting, in the absence of direct
`evidence about information needs, stereotypes can be applied to generate
`user models representing long-term interests. This is an innovative tech-
`nique; however, substantive human investment in knowledge engineering
`would be required to build the user stereotypes. Relevance feedback, a
`highly constrained and indirect form of evidence, has been successfully
`used to learn and adapt representations used for the purpose of query
`reformulation [Frants et al. 1993; Goker and McCluskey 1991]. It should be
`noted, though, that in many IF systems queries (in an IR sense) are not
`necessary, and users’ interests are more stable than in typical IR situa-
`tions. These factors must be taken into consideration in devising methods
`that minimize user involvement in profile management.
`A body of IF research exists that directly addresses problems associated
`with profile acquisition and maintenance, applying mostly AI-based tech-
niques. Malone et al. [1987] described an intelligent message-sharing
system called Information Lens in which users can generate profiles using rules.
`The rules prescribe appropriate actions with tests on content-based factors
`such as message type, date, and sender. Such explicit user-based knowl-
`edge acquisition methods support a high degree of transparency, permit-
`ting users to follow an “up-to-the-moment” knowledge state of the system.
`InfoScope, a system that applies a similar technique, has been developed
`for filtering Usenet news [Fischer and Stevens 1991]. InfoScope uses
`heuristic rules associating common patterns of usage (e.g., number of
`sessions, newsgroups read, frequencies of relevant terms in an article, etc.)
`to appropriate actions. To refine profiles, users must add or remove terms
`from the profile and must set appropriate rule-triggering thresholds. The
`requirement for direct and explicit user input for profile management, in
our view, is somewhat demanding, and furthermore, such rule-based
approaches may be too “brittle” to support efficient profile adaptation.
NewsWeeder [Lang 1995] is another Usenet filtering tool. In this, users’ ratings
`of documents are used as training examples for a machine-learning algo-
`rithm that is executed nightly to generate the user interest profiles for the
`next day. By limiting the user input to only ratings of documents, News-
`Weeder is successful in reducing user involvement. However, NewsWeed-
`er’s inability to adapt the profile in an on-line fashion limits its utility.
`SIFT (Stanford Information Filtering Tool) has also been developed to filter
`Usenet news [Yan and Garcia-Molina 1995]. SIFT requires users to specify
`keywords to generate the initial profile. Depending on the user’s choice, the
`filter may be represented using the vector-space model or simply as a
`boolean formula. If a vector-space approach is selected, SIFT can provide
`some adaptivity in profile refinement. In this mode, SIFT requires users to
`provide relevance feedback (by pointing out documents of interest), based
`on which weights in the profile are adjusted accordingly. Finally, NewT
(news tailor) [Sheth 1994] offers the user the option to select multiple
`profiles from a set of predefined profiles that cover common topical areas.
`NewT also applies relevance feedback for profile adaptation. To further
`reduce user involvement in profile refinement, NewT utilizes a genetic
`algorithm to evolve profiles toward increased fitness.
`In summary, IR provides a solid basis to exploit various document
`representation techniques, especially for the intermediate IF stage of
`document classification (i.e., f1). Relevance feedback and machine-learning-
`based approaches show promise in handling the subsequent IF operation of
`user modeling. However, at this point little is known as to how multiple
`functional components can be integrated satisfactorily in a single IF
`system, and additional empirical evidence is required to clarify how char-
`acteristics associated with users and the document stream may affect
`filtering performance.
`
`2. FILTERING MODEL
`There are three important and independent entities that constitute a
`filtering environment. These are the document source, the filter, and the
`user (Figure 1). Documents may exist at various sites and may be received
by the user through disparate channels. The task of storing such
documents, before filtering, is handled by a component we call document
`acquisition and management (DAM). DAM is a separate component from
`the filter, and its actual design may vary from one environment to another.
`For example, at its core, DAM may be a web-crawler utility that retrieves
`documents from designated sites, a daemon that maintains indexed files, or
`even a sophisticated DBMS. Whatever the construction of DAM, when
`invoked, it would produce a stream of documents that flows into the filter.
`The filter itself consists primarily of three modules: (M1) representer,
`(M2) classifier, and (M3) profile manager. In the context of the multilevel
decomposition of the map f : D → ℝ (i.e., f = f2 ∘ f1) discussed in Section
`1.1, M1 determines the input space for f1; M2 maps the resulting vector

Fig. 1. Model of the filtering process.

`representation to the classification space (i.e., the output for f1); and M3
`implements the mapping f2. The functions of these modules are best
`described in terms of two different modes: filter application and filter
`tuning. In the filter application mode, upon arrival of new documents the
`representer module transforms the stream into more efficient representa-
`tions. This transformation would involve identifying relevant concepts and
`correctly assessing the discriminatory value of concepts in relation to
`specific documents. To avoid unnecessary parsing of concepts, a thesaurus
`management submodule would be used to select concepts for only those
`domains that are of interest to the user. The function of the classifier
`module is to identify for each document its corresponding document class or
`group. The classifier module utilizes a classification scheme, generated by a
`submodule, as an off-line process. In selecting an appropriate size for the
`space of classes, a crucial constraint must be followed. The space of classes
in the filter must be smaller than the input document space. This
`aspect of the filtering model ensures a significant reduction in computa-
`tional complexity (see the discussion in Section 1.1). The profile manager
`module has the dual role of maintaining accurate interest profiles and
`applying the profiles to assess the relevance of documents. Profile represen-
`tation constitutes information concerning user preferences for document
`classes utilized by the filter. Such preference information may be acquired
`in various ways, but the method requiring the least user effort should be
`favored (i.e., it should be the default method used by the system). It
`appears that the best automatic profile acquisition methods are available
`from the machine-learning literature, relying on relevance feedback from
`the user. Whatever method is ultimately chosen, users should always have
`the option to enter or modify values in their profiles directly to ensure
`transparency of the filter. Once profile representation is achieved, docu-
`ments are ranked in relation to their membership in classes. It is worth
`noting here that, due to the strict imposition of a class space, assignment of
`semantically related documents to different classes may occur. But, as
`profile learning is always conducted over the set of classes, it would have
`minimal effect on the overall document ranking. After the profile is
`learned, the classes that are semantically related are treated approxi-
`mately equally by the system, for ranking purposes.
`At the output end of the filtering model lies the presentation and access
`management (PAM) system. PAM is more tightly coupled with the filter
`than DAM and would normally be the user interface of the filtering system.
`To support the filter application mode, PAM can offer various functions, the
`most important being the actual presentation of documents. PAM must
`allow the user to select documents for display and to control the way
`documents are actually displayed (e.g., window size, font size, color, etc.).
`Another important function of PAM is to collect information for the purpose
of profile management. For example, if relevance feedback is chosen,
`functions should exist to permit users to point out relevant documents.
`In modeling the filter, we also identified ways of tuning the filter so as to
`customize and improve its performance. Various types of tuning operations
`can be performed to influence the behavior of the three modules that
`constitute the filter. The frequency of such tuning would vary depending on
`the proximity of the particular module to the user (i.e., from the PAM end
`of the model). Hence, the profile manager is subjected to frequent tuning.
`An important type of tuning that applies to the profile manager module is
`avoidance of profile degradation when a user’s interests change due to some
`external circumstances. Because such a case can have an immediate effect
`on the filter’s performance, it should preferably be handled automatically.
`This would require continuous monitoring of users’ feedback and predicting
`shifts as quickly as possible. We show this tuning operation as a submodule
`of the profile manager module. The structure, size, and content of the
`classification scheme can also have a significant influence on the filter’s
behavior. Such a scheme is usually generated from a large and representative
training document set. However, the content of the
`document stream may change sufficiently over time to demand regenera-
`tion of the classification scheme. This type of tuning would be necessary
`less frequently and can be conducted by a submodule of the classifier (using
`the last n documents as the new training set). Finally, the structure and
`content of the thesaurus may directly affect document representation and
`consequently the rest of the filtering processes. When a domain or a field
`experiences significant change (which usually happens very slowly), tuning
`operations would be needed to update the thesaurus to keep up with such
`changes. We show these operations as a submodule of the representer
`module.
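The classifier regeneration described above, using the last n documents as the new training set, might be sketched as follows; the window size n and the `cluster` routine are assumptions standing in for whatever clustering technique the classifier submodule employs:

```python
from collections import deque

class ClassifierTuner:
    """Keep the last n documents from the stream and regenerate the
    classification scheme from them when the stream has drifted."""

    def __init__(self, n):
        self.window = deque(maxlen=n)  # sliding training set

    def observe(self, doc):
        self.window.append(doc)  # oldest document drops out at capacity

    def retrain(self, cluster):
        # `cluster` is any routine mapping a document list to a new
        # classification scheme (hypothetical signature).
        return cluster(list(self.window))

tuner = ClassifierTuner(n=3)
for doc in ["d1", "d2", "d3", "d4"]:
    tuner.observe(doc)
```

Because this tuning is needed only infrequently, the potentially expensive `cluster` call can run off-line without affecting filter application.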
`
`3. SIFTER: AN IMPLEMENTATION OF THE FILTERING MODEL
`As a way to empirically investigate the utility of the model, we imple-
`mented a filtering system named SIFTER (written in C and TCL/TK for a
`Unix environment) that incorporates the major components described in
`the last section. We now describe these components in detail. We begin
`with the filter part of SIFTER, focusing mainly on the three constituent
`modules.
`
`3.1 Document Representation Using a Vector-Space Model
`The first component of the filter (i.e., the document representation module)
`needs to convert documents into structures that can efficiently be parsed
`without the loss of vital content. We chose the vector-space model [Salton
`1989] for document representation, because it has been widely tested and is
`general enough to support other computational requirements of the filter-
ing environment. This, in turn, relies on a thesaurus management
submodule. At the core of the latter is a set of technical terms or concepts culled
`from authoritative s