Printed in Great Britain.

0306-4573/92 $5.00 + .00
Copyright © 1992 Pergamon Press Ltd.
`
OPINION PAPER

ON THE DIFFICULTIES OF APPLYING THE RESULTS OF
INFORMATION RETRIEVAL RESEARCH TO AID IN THE
SEARCHING OF LARGE SCIENTIFIC DATABASES

ROBERT LEDWITH
Chemical Abstracts Service, Columbus, OH 43210, U.S.A.

Abstract - Although much effort has been applied by researchers to the problem of
improving information retrieval systems during the last 20 years, the results of these efforts
are not always directly applicable to commercial online systems, especially information
retrieval (IR) from large scientific databases. In this paper, the difficulties of
extrapolating from the results of IR research to the searching of scientific files accessible via
STN International are discussed and suggestions for further investigation are given.
`
INTRODUCTION

When I was asked to contribute an article to this special issue, the editor explained that it
would be valuable to have the viewpoint of someone who develops and uses a commercial
online system. As a research scientist at Chemical Abstracts Service (CAS), a division of
the American Chemical Society (ACS), one of my responsibilities is to evaluate advances
in IR for possible application to an online information service, STN International. I have
often been concerned about a number of significant differences between the commercial
online environment and the typical IR research environment. As a result of contemplating
these concerns, I believe the most valuable contribution I can make to this special issue is
to discuss the difficulties of extrapolating the results of information retrieval experiments
to the problem of searching large scientific databases, and to provide insight on how an
online vendor examines advances in information retrieval. Although the article constitutes my
opinion, and does not represent the official position of CAS or the ACS, I hope that by
describing the difficulties and stating my concerns, each will eventually be resolved.
To help explain some of the difficulties in applying IR research results, this article uses
two large scientific files used in a commercial online system and compares them with a large
test collection used in IR research, the INSPEC 12,684 collection (Fox, 1983). It briefly
discusses differences in the data searched, the searching mechanisms, and the users of the
search systems. Using this information, the article discusses how an online vendor
wishing to improve service for its current users evaluates online system enhancements. Ranked
retrieval methods (one area of IR research) and examples of the differences between the
research experiments performed and the current online system are mentioned. I argue that
the differences preclude an adequate understanding of the actual performance that a ranked
retrieval facility would achieve within the commercial online system. The article concludes
with a list of suggestions that could help us acquire a better ability to predict the utility of
implementing IR advances within commercial online systems.
`
THE DATA

STN International provides access to scientific and engineering information. To help
illustrate specific points within the article, two STN files and a test collection used in IR
research are used as examples. The first file is the Chemical Journals of the American
Chemical Society (CJACS) File, a primary literature file that contains the text of 97,000
research articles. The second file, the Chemical Abstracts (CA) File, is a secondary
literature file that has 9.5 million citations, containing titles, abstracts, keywords, and
articulated indexing phrases. These files are compared to one of the larger test collections
commonly used in IR research, the INSPEC 12,684 collection. This collection consists of
12,684 document titles and abstracts from the INSPEC database, and 77 queries collected
at Cornell and Syracuse universities. (Both natural language and Boolean logic forms of
the queries are available and have been used in experiments.) Descriptive statistics for the
files appear in Table 1.

Requests for reprints should be sent to the author at Chemical Abstracts Service, P.O. Box 3012, Columbus,
OH 43210, U.S.A.
`
THE SEARCH SYSTEMS

In the STN online service, access to the CJACS and CA files is provided via a
conventional Boolean search system that supports the Boolean operators AND, OR, and NOT.
(Parentheses may be used to nest expressions.) The documents are divided into fields, such
as author name and abstract fields, and search terms may be qualified to match only the
occurrences of terms appearing within specific fields. The service also supports searching
via proximity operators, which match only the occurrences of specified terms that appear
adjacently, or appear within the same sentence, paragraph, or section of a document. When
using proximity operators, the user may specify that a variable number of words, sentences,
paragraphs, or document sections may appear between the terms being matched. The
documents retrieved may be displayed entirely or limited to specific fields.
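To make the distinction between Boolean and proximity matching concrete, the following minimal Python sketch contrasts a plain AND with an ordered adjacency test. It is an illustration only and not a description of the STN search engine: the tokenizer, the function names (tokenize, boolean_and, within), the max_gap parameter, and the two sample records are all assumptions introduced here.

```python
import re


def tokenize(text):
    """Lowercase a string and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def boolean_and(tokens, first_terms, second_terms):
    """Plain Boolean AND: each term group occurs somewhere in the record."""
    present = set(tokens)
    return bool(present & set(first_terms)) and bool(present & set(second_terms))


def within(tokens, first_terms, second_terms, max_gap=0):
    """Ordered proximity: a second-group term follows a first-group term
    with at most max_gap intervening words (max_gap=0 means adjacent)."""
    first, second = set(first_terms), set(second_terms)
    for i, token in enumerate(tokens):
        if token in first:
            window = tokens[i + 1 : i + 2 + max_gap]
            if any(t in second for t in window):
                return True
    return False


# Hypothetical records: both satisfy the AND, only the first satisfies adjacency.
records = [
    "Vitamin A deficiency in rats",
    "A new vitamin assay for vitamin C",
]
for text in records:
    toks = tokenize(text)
    print(text,
          boolean_and(toks, ["vitamin", "vitamins"], ["a"]),
          within(toks, ["vitamin", "vitamins"], ["a"]))
```

Raising max_gap loosens the adjacency requirement in the same spirit as allowing a variable number of words between the matched terms; sentence, paragraph, and section scopes would need the record to be segmented first, which this sketch does not attempt.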
In research systems, a variety of retrieval models have been used, including the
Vector Space, Probabilistic, Fuzzy Sets, and p-norm models. I focused on the p-norm model
described by Fox (1983) and Salton et al. (1983), which performed well when used to search
test collections. In the experiments using the p-norm model, the queries were augmented
Boolean queries containing AND and OR operators. Parentheses were used to nest
expressions and field-specific queries were supported. Proximity operators were not used in the
experiments.
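For readers unfamiliar with the p-norm model, the sketch below implements the similarity functions defined by Salton et al. (1983) for a single OR or AND clause. It is a minimal illustration, assuming document term weights between 0 and 1; the function names, the default of equal query weights, and the sample weights are hypothetical and are not taken from the experiments cited above.

```python
def pnorm_or(doc_weights, p, query_weights=None):
    """p-norm similarity of a document to an OR clause."""
    a = [1.0] * len(doc_weights) if query_weights is None else query_weights
    numerator = sum((ai ** p) * (di ** p) for ai, di in zip(a, doc_weights))
    denominator = sum(ai ** p for ai in a)
    return (numerator / denominator) ** (1.0 / p)


def pnorm_and(doc_weights, p, query_weights=None):
    """p-norm similarity of a document to an AND clause."""
    a = [1.0] * len(doc_weights) if query_weights is None else query_weights
    numerator = sum((ai ** p) * ((1.0 - di) ** p) for ai, di in zip(a, doc_weights))
    denominator = sum(ai ** p for ai in a)
    return 1.0 - (numerator / denominator) ** (1.0 / p)


# Hypothetical document containing the first query term but not the second.
weights = [1.0, 0.0]
for p in (1, 2, 10):
    print(p, round(pnorm_or(weights, p), 3), round(pnorm_and(weights, p), 3))
```

With p = 1 the two operators give the same score (a weighted average of the term weights), and as p grows the OR score approaches the largest term weight and the AND score the smallest, which is the strict Boolean behavior. Nested expressions can be scored by passing the score of an inner clause as one of the weights of the outer clause.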
`
THE SEARCHERS

Most STN searchers are highly trained in both the domain area and the use of online
systems, more so than searchers in information retrieval experiments. Typical STN users
have at least one degree in, for example, biology, chemistry, or library science. The users
have had several hours of formal training on using the online service and the specific files
being searched. Most have refined their searching skills by using the online service for many
hours. In short, although some STN searchers are end-users of the data retrieved, the
majority are highly trained search intermediaries. In informal discussions, STN users
consistently indicate that they are comfortable with the search command language and that they
understand and regularly use Boolean and proximity operators. The users are highly
motivated to use the system because it is a cost-effective way to find information for a
variety of reasons, from preliminary background and SDI searches (typically searches with very
low recall and high precision) to patent searches (typically searches with near exhaustive
recall and lower precision). The wide range of search types is different from many of the
IR test collection queries, which would be considered as preliminary background or SDI
searches by the users of STN.

Table 1. Characteristics of two large scientific files and an IR research collection(a)

File name      Record type   Records     Term type   Total terms     Distinct terms   Total terms/record   Distinct terms/record
CJACS          primary       96,900      words       270,000,000     5,536,000        2786                 768
CA             secondary     9,528,000   words       1,234,000,000   17,540,000       129                  58
INSPEC 12684   secondary     12,684      stems       733,800         14,683           58                   33

(a) For the CJACS and CA files, the words described are limited to those in the title, body, keywords, indexes, and
figure titles. Patent numbers, file keys, etc., are excluded, as are the non-searchable words (stopwords). For the
INSPEC 12684 collection, the stems are limited to those found in the title and abstracts after removing stopwords.
`
EVALUATING ONLINE SYSTEM ENHANCEMENTS

When a self-supporting service such as STN International evaluates a new approach
or technology, two primary questions need to be answered:

• To what degree will this benefit the user?
• Is the cost of implementing and using the technology recoverable?

Ultimately, it is the user's perceived needs and willingness to pay for new capabilities
that dictate STN's system enhancements. The sci-tech online industry is a small,
modest-growth industry, with most online services operating at only a small profit margin. Because
there are limited resources for implementing new features, potential online system
enhancements must be critically evaluated before implementation. Consider a traditional problem
of using Boolean operators. Certain classes of online users have difficulty understanding
the function of the Boolean AND and OR operators. As a consequence of this, various
schemes involving free-form or menu-based input have been proposed to surmount this
problem. STN users, however, have stated that this is not a problem for them; thus the
benefit to the current users is minimal. Accordingly, implementing these advancements to the
system would receive a low priority.
`
Evaluating the applicability of ranked retrieval to searching large scientific files
Ranked retrieval models have been examined as alternatives to the standard Boolean
retrieval model. However, despite the significant efforts to explore and develop these
models, there remain concerns about the models' utility for the searching of large scientific
databases. Using the p-norm retrieval experiment described in Fox (1983) as an example, I will
present my three major concerns.
1. The first concern is with the size and composition of the collections used for
testing in research. Most testing has used small collections containing fewer than 10,000 records
or collections containing very brief document surrogates, such as document titles. Of the
existing test collections used in IR research, the INSPEC collection, which is one of the
larger test collections available, appears to be an appropriate collection for basing
extrapolations to the searching of STN files, because it contains both titles and abstracts
describing scientific articles. Despite these features, the reliability of extrapolating the performance
of research systems that use the collection to a system that searches a file over 750 times larger
than the collection is highly questionable. At least two factors aggravate any attempts at
extrapolation. The first is that a retrieval system must include a human component.
Although it is possible to build larger, faster software and hardware components to handle
larger files, the human component of the system does not change. In particular, the human
cannot and should not be required to review and summarize more data from the larger
system than from the smaller one. The second factor deals with the likelihood of unexpected
(and undesirable) combinations of terms appearing within the documents, where
unexpected combinations cause nonrelevant documents to be ranked as highly relevant ones. To
illustrate why this is a potential problem, assume that for a specific set of queries an
undesirable combination of terms appears within only 0.003% of the documents in a
collection. If the collection contains 12,684 documents, this is equivalent to one document for
every three searches that the user ignores. This occasional document is statistically so small
that its influence is easily ignored when examining test results. However, if the collection
contains 9.5 million documents, the user must attempt to cope with 285 unwanted
documents for each search. Obviously, even very subtle factors within test collection searching
could translate into significant effects when searching large files.
2. A second concern is with the nature of the queries used in research collections.
Compared to STN user queries, the research queries are too broad. Looking at the INSPEC
collection, a typical query maps to 33 relevant documents out of a collection of 12,684. This
would extrapolate to an STN user retrieving and reviewing over 24,000 documents from
the CA File. However, a typical STN user reviews fewer than 50 documents per search.
Thus, it can be argued that many of the research queries are fundamentally different from
STN queries. Another difference between the queries is that most research queries do not
use proximity information. This differs from STN user queries, where over 85% of the CA
File and virtually all of the CJACS File searches contain one or more proximity operators.
The importance of proximity operators may be illustrated by using the example of a user
wishing to retrieve information about vitamin A. If one searches for "vitamin" or
"vitamins" and "A" in the CA File, 45,800 records are retrieved. However, requiring that "A"
must immediately follow "vitamin" or "vitamins" causes only 21% of the records from the
first search (9,950 records) to be retrieved. For the CJACS File, the results are even more
extreme, with 1500 records retrieved for the first search and 13% of the records (190
records) retrieved for the second. Clearly, using proximity operators can be a valuable tool
for improving the precision of some searches of large files.
3. The third concern deals with the performance of ranked retrieval systems and the
perceived benefit versus cost to the user. Ranked retrieval schemes are intrinsically more
expensive to perform than the unranked schemes. For users to be willing to pay
substantially more for a service, they must perceive a noticeable and valuable improvement.
However, there are concerns about whether the performance of the existing ranked retrieval
models is a large enough improvement over the Boolean search model to represent a
cost-effective alternative. To illustrate, assume that a standard Boolean search retrieves 100
documents from a collection, of which 10 are relevant. To find the 10 relevant documents, the
user might review 90 documents. A ranked retrieval search such as the p-norm model might
also retrieve the same 100 documents, but orders them in an attempt to place the relevant
documents first. To find all ten relevant documents, the user might review only 70
documents. While this is a statistically significant improvement in retrieval, in the eyes of the
user, it may not be worth the additional cost.
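The arithmetic behind the three concerns can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the record counts come from Table 1 and the text, linear scaling with collection size is the same simplification the argument itself makes, and the function names and the positions of the relevant documents in the two orderings for the third concern are hypothetical, chosen only to match the 90 and 70 documents quoted above.

```python
INSPEC_RECORDS = 12_684    # INSPEC test collection (Table 1)
CA_RECORDS = 9_528_000     # CA File (Table 1); the text rounds this to 9.5 million


def scale_to_collection(count, from_size, to_size):
    """Linearly extrapolate a per-search document count to another collection."""
    return count * to_size / from_size


def reviewed_until_all_found(relevance):
    """Number of documents read, in display order, up to the last relevant one."""
    return max(i for i, is_relevant in enumerate(relevance) if is_relevant) + 1


# Concern 1: an undesirable term combination in 0.003% of the documents.
bad_rate = 0.003 / 100
bad_inspec = bad_rate * INSPEC_RECORDS    # about 0.38, roughly one per three searches
bad_ca = bad_rate * 9_500_000             # about 285 per search

# Concern 2: 33 relevant documents per INSPEC query, scaled to the CA File.
size_ratio = CA_RECORDS / INSPEC_RECORDS  # a little over 750
relevant_ca = scale_to_collection(33, INSPEC_RECORDS, CA_RECORDS)

# Concern 2: fraction of the Boolean AND result kept by the adjacency requirement.
ca_kept = 9_950 / 45_800
cjacs_kept = 190 / 1_500

# Concern 3: hypothetical 0-based positions of the 10 relevant documents.
boolean_positions = set(range(8, 90, 9))               # last one is the 90th document
ranked_positions = {0, 1, 2, 3, 4, 9, 19, 34, 49, 69}  # last one is the 70th document
boolean_order = [i in boolean_positions for i in range(100)]
ranked_order = [i in ranked_positions for i in range(100)]

print(f"unwanted per search: INSPEC {bad_inspec:.2f}, CA {bad_ca:.0f}")
print(f"CA / INSPEC size ratio: {size_ratio:.0f}")
print(f"relevant documents scaled to CA: {relevant_ca:,.0f}")
print(f"kept by adjacency: CA {ca_kept:.1%}, CJACS {cjacs_kept:.1%}")
print("documents reviewed:",
      reviewed_until_all_found(boolean_order),
      reviewed_until_all_found(ranked_order))
```

In this constructed case the ranking saves the user 20 of the 90 documents, or a little over a fifth of the reviewing effort, which is the kind of gain that has to be weighed against the additional cost of ranked retrieval.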
`
SUGGESTIONS FOR FURTHER RESEARCH

Having raised these concerns, what suggestions can be made to resolve them? From
the perspective of an online vendor of large scientific databases, there are several
suggestions:
1. Research collections with larger vocabularies and more records are needed. For
testing the retrieval of primary and secondary literature, the collections must be large enough
to capture the size and complexity of the files that the collections represent.
2. Investigations of retrieval schemes that incorporate proximity information are
needed. As was shown in the vitamin A example, when larger collections are searched,
proximity information may be a valuable aid for improving precision.
3. Test collections that contain more specific queries are needed. If large research
collections become available, it will be possible to conduct meaningful experiments using
queries that correspond to minute portions of collections' records. This will permit better
modeling of the types of user searches than is possible with the existing collections.
4. Investigations into how the human component of the search system can be made
more tolerable are needed. As illustrated in an earlier example, even a statistically small
percentage of nonrelevant documents may translate into an unacceptable number of records
for the searcher to cope with. Possible mechanisms that might assist the user include aids
to integrate, summarize, and display search results.
5. Investigations into retrieval schemes and search languages for accessing primary
literature are needed. Specifically, the creation of new operators other than Boolean and
proximity operators could potentially be very valuable. As studies (such as Ro, 1988) have
shown, when searching primary and secondary literature files that represent the same
documents, the precision level of the primary literature search is usually much lower than the
equivalent secondary literature search. This drop in precision combined with the increased
size of primary literature records over secondary ones implies that the searcher's need for
improved access to and concise display of primary literature is even more crucial than when
searching secondary literature.
`
CONCLUSION

Although it is difficult to determine whether some IR research results may be
meaningfully applied to searching large scientific databases, there are efforts underway that
recognize the gaps between traditional research efforts and commercial systems. Three such
efforts are (a) a proposed investigation into the effect of proximity by Keen (1991); (b) an
exploration into issues dealing with a large online collection of chemical primary literature
articles within the Chemical Online Retrieval Experiment (CORE) research project (the
project is a collaborative effort of OCLC, ACS, CAS, Bell Communications Research
(Bellcore) and the Albert R. Mann Library at Cornell); and (c) the development of
concept-oriented databases for IR as an alternative approach to searching existing large text
databases (Ledwith, 1988). Efforts such as these could eventually lead to resolving the concerns
that I have discussed.
`
REFERENCES

Fox, E. (1983). Extending the Boolean and vector space models of information retrieval with p-norm queries and
multiple concept types. Unpublished doctoral dissertation, Cornell University, Ithaca, NY, USA.
Keen, E.M. (1991). The use of term position devices in ranked output experiments. Journal of Documentation,
47(1), 1-22.
Ledwith, R.H. (1988). Development of a large, concept-oriented database for information retrieval. Paper
presented at ACM Conference on Research and Development in Information Retrieval, Grenoble, France.
Ro, J.S. (1988). An evaluation of the applicability of ranking algorithms to improve the effectiveness of full-text
retrieval. I. On the effectiveness of full-text retrieval. Journal of the ASIS, 39(2), 73-78.
Salton, G., Fox, E., & Wu, H. (1983). Extended Boolean information retrieval. Communications of the ACM,
26(12), 1022-1036.
`
`