`vinted in Greay Britain:
`
`0306-4573/89 $3.00 © 00
`Copyright = 1989 Pergamon Press ple
`
`AUTOMATIC THESAURUS CONSTRUCTION BY
`MACHINE LEARNING FROM RETRIEVAL SESSIONS*
`
`U. Gtntzer,t G. JoTrNer,t G. SEEGMULLER,§ and FP. SARRET
`{Department of Computer Science, Technical University of Munich, Postfach 202420, D-8000
`Munchen 2, FRG, tlnstitute for Application Oriented Knowledge Processing, Ulm, FRG,
`and §Gesellschaft fiir Mathematik und Datenverarbeitung, Bonn, FRG
`
`(Received 27 May 1988; accepted in final form \5 September 1988)
`
`Abstract — Users of information retrieval systems (IRS) know and use manyrelationships
`between concepts a long time before these find their way into textbooks, printed thesauri,
`or classification schemes. We present here an [RS component called TEGEN, which taps
`this expertise by automatically drawing conclusions from actual search behavior about
`possible thesaurus entries. This is done during an iterative knowledge acquisition pro-
`eess: only after explicit or implicit confirmation by other users of the IRS during the
`knowledge verification process, the results are incorporated into a thesaurus. TEGEN
`is written in PASCAL using a knowledge-based programming method. It uses the rela-
`tional database system IMF2 and is implemented at the Technical University of Munich
`and at the Leibniz Computer Center of the Bavarian Academy of Sciences.
`
`1. INTRODUCTION
`
`Information retrieval systems (IRS) enable searches to be carried out on a collection of doc-
`uments. During this process, the user receives information relevant to the questions that
`are of interest to him or her,
`The information retrieval (IR) is more successful if various concept relationships can
`be incorporated into a search request by means of a thesaurus. Examples of such concept
`relationships are: synonyms, related terms, broader and narrower terms, and inflected
`forms.
`Since the conventional manual methods of thesaurus generation are too costly and the
`automatic methods have not produced sufficiently good results to succeed in practice [1-
`3], we present in this article a thesaurus-generating system designated TEGEN, which pro-
`vides a method by which a thesaurus canlargely be learned by a computer without major
`effort or expenditure.
`TEGEN is implemented on top of a multilingual IRS with automatic (and rather
`crude) indexing, which is used by professionals and graduate students. In such an environ-
`ment a thesaurus is particularly helpful, namely:
`
`1. The primitive kind of automatic indexing we use extracts all nonstop words from
`titles and abstracts of documents. This means that a paper onartificial intelligence
`might be indexed by the term “kiinstliche Intelligenz”if it happens to be written
`in German or by “AI” if only this abbreviation occurred in the title. Furthermore,
`4 paper On trees might be indexed by “tree” or “trees” (or “arbres”ete.). Without
`the use of a thesaurus, recall will be rather low.
`2. The thesaurus componentis used interactively during query sessions. Thus, the user
`has immediate control when modifying the query with the help of the thesaurus.
`3, Furthermore, in our environment—as in many others—gains in recall are much
`more at a premium than precision, because professionals are very good at screening
`out irrelevant material, but cannot afford to overlook relevant things; therefore,
`even if one would use the thesaurus only for en/arging queries (with a correspond-
`ing loss in precision) this would be beneficial,
`
`*A shorter version of this article was presented al the conference "User-oriented Content-based Text and
`Image Handling” (RIAO88), MIT, Boston, MA, March 1988.
`
`265
`
`(cid:20)
`
`PETITIONERS - EXHIBIT 1024
`PETITIONERS- EXHIBIT 1024
`IPR2022-00217
`
`IPR2022-00217
`
`
`
`266
`
`U. Giwrzer ef al.
`
`To summarize: If the indexing method is rather primitive, then a good thesaurus can
`make up for its deficiencies and achieves considerably better search results. But normally,
`if an IRS had adequate resources to create and maintain a good thesaurus, presumably it
`could also index moreintelligently in the first place. The TEGEN system is intended to be
`a remedy to this dilemma. The intelligence of the user population exhibited during query
`sessions is tapped to construct a thesaurusreflecting expertise, interests, and even jargon
`of this population. Of course, it remains to be proven that a really good thesaurus can be
`generated this way.
`
`2. THE BASIC PRINCIPLES OF THE LEARNING PROCESS
`
`2.1 Learning by analyzing
`The learning process used by the TEGENsystem is called learning by analyzing. This
`is a refined variant of learning by observation in which, however, an additional feedback
`component is used to verify the learning results.
`This learning process is based on the assumption that a large number of users of an
`IRS possess expert knowledge in the field to which their research relates. In carrying out
`their IR, they adopt certain procedures to find the desired documents contained in the col-
`lection of data.
`The user searches are evaluated online to acquire information regarding significant
`concepts and their interrelationships simultaneously with the process of providing infor-
`mation concerning the relevant documents. Thus, the knowledge providing and the knowl-
`edge acquisition processes work cooperatively.
`TEGENobserves the searches carried out by its users and exploits the users’ expert
`knowledge. Probable semantic relationships among concepts are extracted online by means
`of acquisition rules from the syntax of a search request and from the responses of the user
`to certain reactions of the search system.
`At the beginning of the learning process, the thesaurus is almost empty and can give
`only very limited support to an IRS; its quality, however, increases with the number of
`qualified users and instances of use,
`
`2.2 Feedback from the user
`User searches are evaluated online by TEGENsince feedback from the user is neces-
`sary for two reasons:
`
`i. ambiguity of conclusions
`2. uncertainty of results.
`
`If the analysis of user searches is ambiguous as regards the conclusions to be drawn,
`the concept relationship is assigned by the user with the aid of explicit feedback. This
`means direct feedback from the user. The system asks questionsto clarify the situation and
`expects these questions to be answered unambiguously.
`Analysis of the search behavior of a user can lead to the acquisition of uncertain or
`wrong thesaurus contents. For this reason, the intermediate results of learning may be
`accepted as final only when their validity has been established by a sufficiently large num-
`ber of verification processes on the basis of implicit or explicit feedback.
`Implicit feedback is feedback from the user without direct questions. Thesaurus con-
`tents that require clarification are integrated in the search dialogue of the user at suitable
`points and the user’s reactions to them are used for verification purposes.
`
`2.3 Architecture of the learning system
`Concept relationships, which are recognized as a result of the analysis of a user search,
`are designated intermediate results (IRES) and areinitially assigned an appropriate status.
`These relationships have to be verified by further users so that their validity can be finally
`
`(cid:21)
`
`
`
`Automatic thesaurus construction
`
`267
`
`established. A learning intermediate result becomes a final result (FRES) when the veri-
`fication process is complete.
`Thus, TEGENconsists essentially of two parts: knowledge acquisition and knowledge
`verification. The knowledge acquisition process derives possible thesaurus contents as inter-
`mediate results from the user searches. The knowledge verification process checks these
`intermediate results and converts them into final results. The architecture of the learning
`IRS is shown in Fig. 1.
`The management of the IRES and the recording of the FRESis carried out by the
`thesaurus. This contains the relations SYNONYMS, RELATED TERMS, HOMONYMS,
`BASIC and INFLECTED FORMS, BROADER TERMS and NARROWER TERMS,
`NEGATIVE LIST and POSITIVE LIST, and USEFUL PREFIXES.
`A few remarks concerning these relations shall clarify their semantics. The relation
`SYNONYMS contains not only synonym pairsin the linguistic sense but also translations
`and abbreviations, e.g. the pairs (artificial intelligence, AI) and (artificial intelligence,
`kiinstliche Intelligenz) would qualify for inclusions into the relation SYNONYMS. For
`riultilingual information systems, the semiautomatic construction oftranslation tables (for
`at least those terms that are in actual use) is a great advantage.
`The relation POSITIVE LIST contains the controlled vocabulary, i.e. those terms that
`have been used several times as search terms within queries, At a later stage we intend to
`use this list for indexing. The unary relation NEGATIVE LIST contains stop words like
`“is”, “of”, “the”, and “der”.
`The relation USEFUL PREFLXES needs a special comment. It is designed to help
`the user when using truncation, e.g. “Datenba.” shall denote all terms starting with
`“Datenba ,
`. " like “Datenbank”, “Datenbanken”, etc. If the user truncates too early, too
`many documentswill be derived; if the user truncates too late, recall will be lost. To find
`an adequate truncation point, the user goes through a term character-by-character from
`the left to the right and sends these prefixes one after the other as queries to the system.
`As an example, we display the results of this method for the term “Datenbanksystem”(this
`is the German word for “database system”).
`
`User
`
`Document
`
`Information
`Rotrieval
`
`“2EZ
`G4ed
`
`Verlflcation
`
`LLL
`
`Thesaurus
`
`OOOO00
`
`Database
`
`Fig. 1. Architecture of TEGEN.
`
`(cid:22)
`
`
`
`268
`
`U..GUNTZER é! wl,
`
` Useful?
`
`Prefixes
`
`No, of documents retrieved
`
`-
`~
`no
`yes
`no
`yes
`no
`no!
`no
`yes
`no
`
`D.
`Da.
`Dat.
`Date.
`Daten.
`Datenb.
`Datenba.
`Datenban.
`Datenbank.
`Datenbanks.
`Datenbanksy.
`
`52,578
`17,050
`16,429
`2,182
`1,941
`§23
`469
`445
`445
`186
`162
`
`As one can see, “Datenban.” is not a useful prefix. If “Datenb.”is nol good enough
`(this depends on the rest of the query), then it will not be helpful to replace it by “Daten-
`ban,” The user will go on and eventually use *Datenbanks.”instead. Actually, the user will
`sel the truncation point at those points, where a significant drop in thehit rate takes place
`or where the semantics changes, like between “Daten.” and “Datenb.”
`
`3. KNOWLEDGE ACQUISITION
`
`Acquisition of the intermediate results takes place by means of acquisition rules, which
`are represented in the form of production rules such as those frequently used for knowl-
`edge representation in expert systems [4-6].
`The rules can be classified into three different types, which we present in the follow-
`ing sections. Examples of rules implemented in our system are given.
`TEGEN currently contains 29 sets of semantic rules with a total of about 300 produc-
`tion rules that are implemented in PASCALtogether with the rule interpreter. Experience
`gained in using the system will result in further acquisition rules.
`
`3.1 Indirect acquisition without feedback
`User searches allow an unambiguous conclusion to be drawn as regards a thesaurus
`content and this can accordingly be directly acquired without any query being directed at
`the user.
`Insofar as rules of this class are used to investigate a similarity relationship, these can-
`not be specified more precisely as SYNONYMS, RELATED TERMS, or INFLECTED
`FORMS without feedback from the user. In this class of rules, therefore, similarity rela-
`tionships are generally assigned to the relation RELATED TERMSand are specified more
`closely in a further stage of the process.
`EXAMPLE,
`Rule 14: Restriction by means of a new search request
`Syntax:
`if
`(a) two or more search terms x, in a search request X are combined by OR
`and
`(b) x; occur as key words
`and
`(c) the search request X is combined in a further search request Y by AND or
`AND NOTwith further search terms 9;
`
`then
`
`all x, are accepted in pairs as learning IRES in the RELATED TERMS
`relation,
`
`Semantics of Rule 14. If the user restricts the combination of search terms by the oper-
`ators AND or AND NOTwith further search terms in a further search request, the
`search terms of the first search request, which are used to form the union, in general
`
`(cid:23)
`
`
`
`Automatic thesaurus construction
`
`269
`
`have a conceptual relationship. The fact that the OR combination receives further con-
`ceptual treatment by way of restriction is a strong indication that the terms x, have
`not been combined accidentally.
`Condition (b) of Rule 14 implies that the search terms are complete and sensible
`concepts even if truncated.
`
`EXAMPLE.
`xX: (BAUM OR BAEUME OR TREE OR TREES)
`Y: X AND (SUCHE OR SEARCH)
`producessimilarity relationships among:
`BAUM, BAEUME, TREE, TREES.
`
`3.2 Indirect acquisition with feedback
`In this rule class, the conclusions regarding thesaurus contents are ambiguous; thesit-
`wation is clarified by explicit feedback. Similarity relationships can be unambiguously
`assigned to the different thesaurus relations on the basis of queries addressed to the user.
`EXAMPLE,
`Rule 25: New attempt in the case of AND combinations
`Syntax:
`if
`(a) asearch request X consists of m search terms x, with mn = 2
`and
`(b) the x; are combined by AND
`and
`(c) the search request X is repeated in search request Y but one of the search
`terms involved, x,, is replaced by the search term y
`and
`(d) the search request Y producesless results than X
`and
`(e) the user gets the results of Y printed
`and
`(f) .. . some further premise...
`
`then (1) ask the user, whether:
`(A) ¥ is a narrower term of x;
`(B) y is a synonym ora translation of x,
`(C) x; is an inflected form of »
`(D) » is an inflected form of x;
`(E) an affinityrelationship exists between x; and y
`(F) none of this applies
`(2) in case (A)-(E), the pair (y,x;) is accepted
`as [RESin the appropriate thesaurusrelation,
`
`Semantics of Rule 25. If, after an unsuccessful search attempt in which two or more
`search terms were combined by AND, a fresh attempt is made and one of the search
`terms x; involved is replaced by a search term y, a conceptual relationship generally
`exists between x; and y. If replacing x; by y produces less results and the user is satis-
`fied with the modification, because the results were printed, then it is rather likely that
`one has found in » a term narrower than x, or related to x,. This relationship is clas-
`sified more precisely by queries addressed to the user.
`The precondition 7 > 2 is necessary so that it can be seen that question Y is an
`emendation of question X.
`
`EXAMPLE.
`X: STORAGE METHOD AND INFORMATION RETRIEVAL
`Y: INVERTED LIST AND INFORMATION RETRIEVAL
`results, after query to user, in:
`INVERTEDLISTis a narrower term of STORAGE METHOD.
`
`(cid:24)
`
`
`
`270
`
`U. Ginwrzer et al.
`
`Of course, answering the question posed by the system requires some effort by the
`user. On the other hand, the question gives some hints about other possibilities to emend
`the query, Maybe the user has replaced x; by an inflected form and is reminded —when
`reading alternative (B)—that x, should also have been replaced by a synonym. In spite of
`the fact that the numberof hits appeared satisfactory to the searcher, there could be many
`other ways to improve the query. In this way, attention is drawn to the fact that thesau-
`rus Support is available, i.e. instead of being bothered the user might even be grateful for
`the interference by the system, In any case, the user can switch off the thesaurus compo-
`nent, e.g. if not ready to answer the questions. But even if the user has no immediate ad-
`vantage for the actual query by answering the questions of TEGEN,every improvement
`of the thesaurus will be to the benefit of later searchers,i.e. the colleagues of the present
`user,
`
`Our first experiences show that users are cooperative, but this acceptance question has
`to be evaluated carefully along the followinglines:
`
`|. Anonymous session history protocols are recorded and shall be evaluated manu-
`ally. There it can be seen how often users have asked the system for thesaurus help
`and, if users have left the thesaurus mode during a retrieval session (presumably
`feeling bothered), how many questions had been asked (and answered) and what
`was the last question.
`Because we anticipate that TEGEN would usually be used within an organiza-
`tion, the organization could make this kind of cooperation by its members obliga-
`tory. We are thinking of trying this method for a limited period of time. Of course,
`users could react by no longer using the system at all or by answering at random.
`(Even then the verification procedure would prevent incorporation of random re-
`sults into the thesaurus.) This appears to be rather unlikely, but has to be verified
`empirically,
`2. Additionally, a questionnaire is designed to find out what the users think of the
`thesaurus and how taxing they feel the additional effort is. It is very well possible
`that some types of questions are more bothering than others. Then one might do
`without these special questions, especially if the corresponding rule was not very
`productive. This can be measured, because we record for every intermediate and
`every final learning result the rule number responsible for acquiring (or verifying)
`this bit of knowledge.
`
`3.3 Direct acquisition
`As a further possible method of generating learning intermediate results, the user is
`asked explicitly about thesaurus contents known to him or her if this can be done within
`the search dialogue at a point that is appropriate from a semantic point of view and with-
`out delaying the user.
`For this reason, this form of acquisition is used only if the user has previously indi-
`cated that he or she wishes thesaurus support (thesaurus mode).
`
`EXAMPLE.
`
`Rule 29: Direct acquisition of thesaurus contents
`Syntax:
`if (a) the thesaurus mode is set
`and
`(b) the user requires concepts that have relationship type ¢ with a search term x
`then (1) provide with x all previously verified concepts of this relationship type ¢
`(2) carry out verification of not yet finally established intermediate results
`of « in respect of this relationship type f¢
`(3) ask the user for further concepts y, that have a relationship of type !
`with x
`(4) for all y, shown, accept the pair (x, ;) as intermediate results into the
`thesaurus relation of this relationship type.
`
`(cid:25)
`
`
`
`Automatic thesaurus construction
`
`27)
`
`Semantics of Rule 29. By means of this acquisition rule, thesaurus contents are
`obtained directly from the user by means of questions and are entered into the thesau-
`rus as IRES.
`Actions (1) and (2) are important for the verification process and also motivate
`the user to add to the thesaurus content.
`
`Here the acceptance problem is much less critical than in the last section. The user is
`asked for further concepts y, that have a relationship of type ¢ with x. The user has to
`enter these terms anywayandis neither delayed nor bothered, because he or she presum-
`ably wants to include them into the actual query.
`
`4. KNOWLEDGE VERIFICATION
`
`4.1 Verification of the intermediate results
`In this second phase of the learning process, the knowledge and the behavior of the
`searchers are once again used to confirm or reject intermediate results and thus to assign
`them a validity status.
`For this purpose, learning attributes are allocated to the intermediate results within
`the thesaurus database; these learning attributes are initialized during the knowledge acqui-
`sition process and updated in the course of verification, These learning attributes include:
`
`Learning weight —indicates the degree of validity of a concept relationship as
`ascertained so far.
`
`Learning status—indicates the current state of the concept relationship within the
`learning process.
`Verification confirmation—indicates the number of confirmations of an intermedi-
`ate result that have occurred so far.
`
`Verification rejection—records the number of rejections of an intermediate result.
`
`Finally, the dafe of entry shows the calendar date on which an IRES was acquired.
`Since, in addition, the number of the relevant acquisition rule is recorded for each
`intermediate result, TEGEN even possesses a metalearning capability, Once TEGEN has
`been in use for any adequate period oftime, it can be learned by checking this rule num-
`ber record, which acquisition rules have produced particularly good results.
`Abuse of TEGENis prevented by ensuring that no user is ever involyed in the veri-
`lication of an intermediate result that was acquired by him/herself.
`
`4,2 Variables for controlling the learning process
`There are some further quantities that express relationships among the learning attri-
`butes and are used for controlling the learning process. The verification frequency h is de-
`termined by the sum of the verification confirmations and verification rejections. The
`learning step g determines the amount by which the learning weight
`is increased, or
`reduced, depending on user reactions. The acceptance value a is defined as the difference
`between the numberof verification confirmations and the numberof verification rejec-
`tions; it indicates whether an [RES is being accepted or rejected by a majority of users.
`Finally, the verification barrier s indicates how high the acceptance value must be for an
`IRES to befinally accepted,
`
`4.3 Generationofthe final results
`In the course of the learning process, altempts are made to transform IRES, as far
`as possible automatically, into FRES so that they are available in the thesaurus.
`An intermediate result is converted into an established final result when the verifica-
`tion process has been carried out sufficiently often and thus the verification barrier s has
`been exceeded.
`A preliminaryfinal result is obtained when the period of investigation corresponding
`to learning period / has elapsed and either the IRES has not been verified sufficiently often
`IPM 25:3-D
`
`(cid:26)
`
`
`
`272
`
`U. Gintzer ef al,
`
`or the Users have contrary views as to the validity of a concept relationship. Since in these
`cases an automatic decision is not possible, manual checking by an external expert
`is
`necessary.
`
`4.4 Learning targets and learning parameters
`In systems using TEGEN, various learning targets—such as a high learning rate, a
`large number and high quality of automatically generated FRES, and a small number of
`external manual checking processes —are laid down; to achieve these targets, several Jearn-
`ing parameters can be set.
`These learning parameters, which can be influenced from outside to a greater or lesser
`extent, include the number of users, their degree of expertise, the verification barrier s, and
`the /earning period /, The degree of expertise is determined by the homogencity ofthe user
`group and also by their level of knowledge both in the specialized field (domain expertise)
`and relating to the search system (retrieval expertise).
`Since these learning parameters in part have opposite effects, it is necessary to carry
`out an experimentally based adaptation. For example, an increase in s will result in an
`increase in the quality of the automatically generated FRESsince a larger numberof users
`have toa participate in the verification of an IRES so that their decision becomes morereli-
`able. However, the learning rate is simultaneously reduced and the number of manual
`checking processes is increased for the same learning period /.
`At present, the value s is set to 5 in our system and the value /to 90 days. This means,
`for example, that a presumed thesaurus relationship must have been confirmed five times
`more often than it has been rejected before it is finally incorporated into the thesaurus as
`a final result. However, a preliminary decision is taken in any case after 90 days at the
`most.
`
`5. OUTLOOK
`
`TEGEN is implemented In PASCAL on a CYBER 990 computer, applying knowl-
`edge-based programming methods on the basis of a conventional programming language,
`By using these programming methods, it has been possible to show that their advantages
`(i.e. exploratory programming, easy maintenance) are also beneficial if a conventional pro-
`gramming language is used. Additionally, this solves the problems involved in connecting
`knowledge-based systems to conventional databases and other software systems.
`At present, TEGEN is being used within the documentretrieval system TUBIBMUE.
`Online searches on 130,000 documents of the Faculty of Mathematics and ComputerSci-
`ence of the Technical University of Munich are possible [7]. A thesaurus is being gener-
`ated in accordance with the methods described in this article and the final results are
`offered to the users to support their searches.
`It appears likely that term relationships that have been established by several expert
`users are in fact meaningful, but this has to be proved by a careful validation, e.g. it has
`to be validated, whether the methods implemented in fact create good thesaurus relation-
`ships, especially whether the verification procedures exclude most invalid intermediate
`results from incorporation into the thesaurus withoul being too restrictive.
`This validation is planned as follows: as already mentioned in Section 3.2, TEGEN
`records for every intermediate and every final result the number of the rule responsible for
`this result; even the complete verification and rejection history for every learning result is
`recorded. In this way, the quantitative productivity of the different rules can be easily mea-
`sured. Furthermore, it can automatically test whether the results produced by some rule
`were controversial during the verification process. Thus, the quality of a rule can be at least
`partially measured.
`Finally, a manual evaluation of a representative part of the thesaurus by a commit-
`tee is indispensable. However, we are rather optimistic that this committee will draw the
`same conclusions as the researchers during the feedback phase of the verification process,
`The quantity and quality of the results generated depend on the (uning parameters
`
`(cid:27)
`
`
`
`Automatic thesaurus construction
`
`ara
`
`explained in Section 4,4 Appropriate values for these parameters can only be fixed
`empirically.
`The TEGEN learning system can be integrated into an IRS with a Boolean search
`interface at any time. Especially, it is possible (and advisable) to use an existing thesau-
`rus to “prime” the system. It should also be mentioned that this type of computerlearn-
`ing will surely not remain limited to documentretrieval systems.
`
`REFERENCES
`
`eh
`
`al
`
`. Salton, G, Dynamic information and library processing. Englewood Cliffs, NJ: Prentice-Hall; 1975.
`. van Rijsbergen, C.J. Information retrieval, 2nd ed. London: Butterworths; 1978.
`. Foskett, D.J. Thesaurus. In: Dym, E.D., editor. Subject and information analysis. New York: Marcel Dek-
`ker; 1985; 270-317.
`Davis, R.; King, J. An overview of production systems. In: Eleock, E.W,; Michie, D., editors, Machine intel»
`ligence, vol. 8. New York: John Wiley and Sons; 1977: 300-332.
`. Mcalla, G.; CerCone, N. Approaches to knowledge representation. Computer, [6(10): 12-18; 1983,
`. Buchanan, B.G.; Shortliffe, E.H. Rule-based expert systems: The MYCIN experiment of the Stanford Heuristic
`Programming Project. Reading, MA: Addison Wesley; 1984.
`. Jitter, G.; Landherr, G. Die bibliotheksverwaltungssysteme LRZBIBLIO und TUBIBMUE, LR2Z-Bericht
`Nr. 8505, Miinchen; 1985.
`. Giintzer, U.; Jiittner, G.; Seegmuller, G. TEGEN—Ein lernfahiges information retrieval system, In; Caap,
`H.; Galinsky, C., editors. Terminology and knowledge engineering. Proceedings ofthe first international con-
`vress on terminology and knowledge engineering; 1987, 323-337; Trier.
`
`(cid:28)
`
`