A look back and a look forward
`Karen Sparck Jones
`Computer Laboratory, University of Cambridge
`This paper was given for the SIGIR Award (now the Salton Award) at SIGIR 1988;
`the final version appeared in
`Proceedings of the 11th International Conference on Research and Development in
`Information Retrieval, ACM-SIGIR, 1988, 13-29.
`This paper is in two parts, following the suggestion that I first comment on my own past
`experience in information retrieval, and then present my views on the present and future.
`Some personal history
`I began serious work in IR in the mid sixties through one of those funding accidents that afflict
`everyone in research; but I had become involved with it before that, for respectable intellectual
`reasons. The group working under Margaret Masterman at the Cambridge Language Research
`Unit had argued for the use of a thesaurus as a semantic interlingua in machine translation,
`and had then seen that a thesaurus could be used in a similar way, as a normalising device,
`for document indexing and retrieval (Masterman et al 1958). My doctoral research was
`concerned with automatic methods of constructing thesauri for language interpretation and
`generation in tasks like machine translation; and Roger Needham was working at the same
`time on text-based methods of constructing retrieval thesauri, in the context of research on
`general-purpose automatic classification techniques.
`The essential common idea underlying this work was that word classes, defining lexical
`substitutability, could be derived by applying formal clustering methods to word occurrence,
`and hence cooccurrence, data (Sparck Jones 1971b).
`In the early sixties we saw semantic
`interlinguas, thesauri, and statistical classification as promising new forms of older ideas
`which were well suited to the challenges and the opportunities computers offered both for
`carrying out language-based information management, as in translation or retrieval, and for
`providing the tools, like thesauri, needed for these information extraction and transformation
`processes.
`In my doctoral research (Sparck Jones 1964/1986) I suggested that a thesaurus could be
`built up by starting from sets of synonymous word senses defined by substitution in sentential
`text contexts, and carried out classification experiments to derive larger groups of related
`word senses constituting thesaurus classes from these, though I was not able to test any of
`my classifications as a vehicle for their ultimate purpose, namely translation.
`In my first
`major project in IR I also worked on automatic thesaurus construction, but in this case with
`word classes defined not through direct substitution in restricted sentential contexts, but by
`cooccurrence in whole texts. This rather coarse-grained classification, of the type originally
`studied by Roger Needham, seemed to be appropriate for document indexing and retrieval
`purposes. Substitution classes not confined to synonyms, but extending to collocationally
`related items, could be used as indexing labels within the coordination matching framework


`that I have always thought natural for derivative indexing. Word classes based on text
`cooccurrence naturally pick up collocationally linked pairs, and capture synonym pairs only
`via their common collocates, but we argued that substituting a collocate is legitimate and
`indeed that to respond effectively to the very heterogeneous ways a concept can be expressed
`in text, it is necessary to allow for very heterogeneous word classes.
`But it soon became clear that plausible arguments are not enough in IR. The project
`we began in 1965 was designed to evaluate automatic classification not only in the sense
`of demonstrating that classification on a realistic scale was feasible, but of showing that it
`had the intended recall effect in retrieval. We were therefore working with the Cranfield 2
`material and began to do experiments with the smaller Cranfield collection, constructing term
`classifications and testing them in searching. At the CLRU we had always emphasised the
`need for testing in the language processing and translation work; and in the classification
`research, because this was concerned with automatic methods, there was a similar emphasis
`on testing. The importance of IR in this context was not only that it supplied challenging
`volumes of data, but that it came with an objective evaluation criterion: does classification
`promote the retrieval of relevant documents? Performance evaluation for many language
`processing tasks is an intrinsically difficult notion (Sparck Jones 1986a), and natural language processing
`research in general had in any case not advanced enough to support more than very partial
`or informal evaluation; while with many other applications there are no good, independent
`evaluation criteria because classification does not have the well-defined functional role it does
`in retrieval.
`In earlier research on classification methods Roger Needham had already stressed that
`equally plausible arguments could be advanced for very different forms of classification, and
`we found the same for the specific IR application. More generally we found that things did not
`work out as we expected, and we found it very difficult to see why. The major evaluation work
`of the sixties, like the Cranfield and Case Western investigations and Salton’s comparative
`experiments, showed how many environmental or data variables, and system parameters there
`are in an indexing and retrieval system. But we found that in trying to understand what
`was happening in our classification experiments, and to design experiments which would be
`both sufficiently informative about system behaviour and well-founded tests for particular
`techniques, we were driven to a finer descriptive and analytic framework which made the
`whole business of experiment very demanding. The same trend is clear in the Cornell research.
`The attempt to identify all the relevant variables and parameters, even within the relatively
`restricted indexing and searching area of IR systems as wholes within which we worked, that
`is to find an appropriate granularity in describing system properties, was a long haul driven
`by the need to understand system behaviour sufficiently to provide the controls required for
`automatic processes which have to be fully and precisely specified.
`In the late sixties we concentrated on those variables and parameters most obviously
`relevant to automatic classification, namely the distributional properties of the term
`vocabulary being indexed, and the definitional properties of the classification techniques
`being applied, in the attempt to get an automatic classification which worked. [Footnote: In
`earlier reports I referred to environmental parameters and system variables: I think my
`present usage is preferable.] I succeeded in this (Sparck Jones and Jackson 1970, Sparck
`Jones 1971a) and was able to obtain decent performance improvements with automatic
`classifications meeting certain requirements, restricting classification to non-frequent
`terms and classes to very strongly
`connected terms; and these results could be explained in terms of the way they limited the new
`terms entering document and request descriptions to ones with a high similarity in potential


`relevant document incidence to the given terms.
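The two restrictions just described (leave frequent terms unclassified, and group only very strongly connected terms) can be illustrated with a minimal sketch. It assumes a Tanimoto-style overlap of the document sets in which two terms occur as the similarity measure; the function names, the thresholds and the toy documents are illustrative assumptions, not the procedures used in the cited experiments.

```python
from itertools import combinations


def term_postings(docs):
    """Map each term to the set of ids of the documents it occurs in."""
    postings = {}
    for doc_id, terms in enumerate(docs):
        for term in set(terms):
            postings.setdefault(term, set()).add(doc_id)
    return postings


def strong_pairs(docs, max_doc_freq, min_similarity):
    """Return (term, term, similarity) for strongly connected non-frequent terms.

    Terms occurring in more than max_doc_freq documents are excluded, and a
    pair is kept only if the overlap of its document sets (shared documents
    divided by documents containing either term) reaches min_similarity.
    """
    postings = {
        term: ids
        for term, ids in term_postings(docs).items()
        if len(ids) <= max_doc_freq
    }
    pairs = []
    for a, b in combinations(sorted(postings), 2):
        shared = len(postings[a] & postings[b])
        similarity = shared / len(postings[a] | postings[b])
        if similarity >= min_similarity:
            pairs.append((a, b, similarity))
    return pairs


# Toy collection: "wing" occurs everywhere, so it is too frequent to classify.
docs = [
    ["wing", "flutter", "oscillation"],
    ["wing", "flutter", "oscillation"],
    ["wing", "pressure"],
    ["wing", "flow", "pressure"],
]
pairs = strong_pairs(docs, max_doc_freq=2, min_similarity=0.6)
```

With these settings only "flutter" and "oscillation" are grouped: "wing" is excluded as too frequent, and "flow" and "pressure" share too few of their documents.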
`However subsequent very detailed analytic experiments (Sparck Jones and Barber 1971)
`designed to discover exactly what happened when a classification was used and hence what the
`optimal classification strategy was, added to the earlier experience of not being led astray by
`plausible arguments for specific forms of classification by suggesting that the general argument
`for keyword clustering as a recall device might be suspect. Thus it appeared that a term
`classification could usefully function as a precision device.
`But good-looking results for one collection were clearly not enough. We were interested in
`generally applicable classification techniques and, further, in classification with an operational
`rather than a descriptive role. So, following the tradition established at Cornell, I began
`comparative tests with other collections.
`This led to a very complex period of research, because I found that classification was less
`effective on these other collections than it had been for the Cranfield one, but it was very dif-
`ficult to find out why. I wanted to show that a keyword classification, constrained and applied
`as in the Cranfield case, would help performance. The fact that it did not provoked a long
`series of analytic experiments designed to uncover the influences on classification behaviour,
`taking the characterisation of collections and devices down to whatever level of detail seemed
`to be required to support the specification of effective strategies (e.g. Sparck Jones 1973a).
`One outcome of this research was the Cluster Hypothesis Test (van Rijsbergen and Sparck
`Jones 1973). [Footnote: A program bug meant the specific results reported here were
`incorrect: see Sparck Jones and Bates 1977b; but the corrected results were very similar,
`and the test remains sound.] It turned out in some cases to be so difficult to get any kind
`of performance improvement over the term matching baseline as to suggest that it was not
`the devices being applied but the collection to which they were being applied that was
`intrinsically unrewarding.
`But the main results of this work of the early seventies were those concerned with index
`term weighting. The research on classification led us to take an interest in the distributional
`properties of terms, partly for their possible effects on classification (so, for example, one
`shouldn’t group frequent terms), and partly because term matching without the use of a
`classification provided a baseline standard of retrieval performance; and we found that
`collection frequency weighting (otherwise known as inverse document frequency weighting)
`was useful: it was cheap and effective, and applicable to different collections (Sparck Jones
`1971c, 1973b).
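Collection frequency weighting can be sketched in a few lines. The version below assumes the now common form log(N/n), where N is the number of documents in the collection and n the number containing the term; this is not necessarily the exact formula of the cited papers, and the function names and toy documents are illustrative assumptions.

```python
import math
from collections import Counter


def collection_frequency_weights(docs):
    """Weight each term by log(N / n): N documents, n of them containing it."""
    n_docs = len(docs)
    doc_freq = Counter()
    for terms in docs:
        doc_freq.update(set(terms))  # count each term once per document
    return {term: math.log(n_docs / n) for term, n in doc_freq.items()}


def match_score(request_terms, doc_terms, weights):
    """Sum the weights of the request terms that also occur in the document."""
    doc_set = set(doc_terms)
    return sum(weights.get(t, 0.0) for t in set(request_terms) if t in doc_set)


docs = [
    ["wing", "flow", "pressure"],
    ["wing", "boundary", "layer"],
    ["wing", "flutter"],
]
weights = collection_frequency_weights(docs)
scores = [match_score(["wing", "flutter"], d, weights) for d in docs]
```

A term occurring in every document, like "wing" here, gets weight zero and so contributes nothing to matching, while a term confined to one document counts most.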
`I nevertheless felt that all these various findings needed pulling together, and I therefore
`embarked on a major series of comparative experiments using a number of collections, includ-
`ing one large one. I still did not understand what was happening in indexing and retrieval
`sufficiently well, and thought that more systematic comparative information would help here:
`it could at least show what affected performance if not explain why or how. I also wanted
`to be able to demonstrate that any putative generally applicable techniques were really so.
`Moreover for both purposes, I wanted to feel satisfied that the tests were valid, in being
`properly controlled and with performance properly measured. I believed that the standard
`of my own experiments, as well as those of others, needed to be raised, in particular in terms
`of collection size, both because small scale tests were unlikely to be statistically valid and
`because, even if they were, the results obtained were not representative of the absolute levels
`of performance characteristic of large collections in actual use.
`The effort involved in these tests, the work of setting up the collections and the persistent
`obstacles in the way of detailed comparisons with the results obtained elsewhere, were all
`begetters of the idea of the Ideal Test Collection (Sparck Jones and van Rijsbergen 1976,
`Sparck Jones and Bates 1977a) as a well-founded community resource supporting at once


`individually satisfying and connectible experiments.
`The major series of tests concluded in 1976 (Sparck Jones and Bates 1977b) covered four
`input factors, four indexing factors and three output factors, each of them, and particularly
`the indexing factors, covering a range of alternatives; fourteen test collections representing different
`forms of primary indexing for four document and request sets; and nine performance mea-
`surement procedures: there were hundreds of runs each matching a request set against a
`document set. I felt that these tests, though far from perfect, represented a significant ad-
`vance in setting and maintaining experimental standards. I found the results saddening from
`one point of view, but exciting from another. It was depressing that, after ten years’ effort,
`we had not been able to get anything from classification. But the line of work we began on
`term weighting was very interesting. Collection frequency weighting was established as use-
`ful and reliable. This exploited only the distribution of terms in documents, but Miller and
`subsequently Robertson had suggested that it was worth looking at the more discriminating
`relative distribution of terms in relevant and non-relevant documents, and this led to a most
`exhilarating period of research interacting with Stephen Robertson in developing and testing
`relevance weighting (Robertson and Sparck Jones 1976). The work was particularly satisfying
`because it was clear that experiments could be done to test the theory and because the test
`results in turn stimulated more thorough theoretical analysis and a better formulation of the
`theory. The research with relevance weighting was also worthwhile because it provided both
`a realistic measure of optimal performance and a device, relevance feedback, for improving
`actual performance.
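As an illustration of how the relative distribution of a term in relevant and non-relevant documents can be turned into a weight, the sketch below uses the form commonly quoted from Robertson and Sparck Jones 1976, with 0.5 added to each count for predictive use. The formula is not stated in this paper, so its inclusion here, the function name and the sample counts are assumptions for illustration only.

```python
import math


def relevance_weight(r, R, n, N):
    """Relevance weight for one term.

    r: known relevant documents containing the term
    R: known relevant documents
    n: documents in the collection containing the term
    N: documents in the collection
    The 0.5 corrections keep the weight finite when a count is zero.
    """
    numerator = (r + 0.5) * (N - n - R + r + 0.5)
    denominator = (R - r + 0.5) * (n - r + 0.5)
    return math.log(numerator / denominator)


# A term concentrated in the known relevant documents gets a high weight ...
strong = relevance_weight(r=8, R=10, n=20, N=1000)
# ... one that is common but absent from them gets a negative weight ...
weak = relevance_weight(r=0, R=10, n=500, N=1000)
# ... and with no relevance information the weight depends only on n and N.
no_feedback = relevance_weight(r=0, R=0, n=20, N=1000)
```

With no relevance information the weight reduces to a function of the term's document frequency alone, which is one way of seeing why the same theoretical base can cover collection frequency weighting.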
`The results we obtained with predictive relevance weights were both much better than
`those given by simple terms and much better than we obtained with other devices. My next
`series of experiments was therefore a major one designed to evaluate relevance weighting
`in a wide range of conditions, and in particular for large test collections, and to measure
`performance with a wide variety of methods. This was a most gruelling business, but I was
`determined to reach a proper standard, and to ensure that any claims that might be made
`for relevance weighting were legitimate. These tests, like the previous ones, involved large
`numbers of variables and parameters; and they, like the previous ones, required very large
`amounts of preliminary data processing, to derive standard-form test collections from the raw
`data from various sources, for example ones representing abstracts or titles, or using regular
`requests or Boolean SDI profiles; setting up the subsets for predictive relevance weighting was
`also a significant effort. The tests again involved hundreds of runs, on seven test collections
`derived from four document sets, two of 11500 and 27000 documents respectively, with seven
`performance measures.
`But all this effort was worthwhile because the tests did establish the value of relevance
`weighting, even where little relevance information was available (Sparck Jones 1979a, Sparck
`Jones and Webster 1980). It was also encouraging to feel that the results had a good theoreti-
`cal base, which also applied to the earlier document frequency weighting, and which was being
`further studied and integrated into a broader probabilistic theory of indexing and retrieval
`by my colleagues Stephen Robertson and Keith van Rijsbergen and others.
`I felt, however, somewhat flattened by the continuous experimental grind in which we had
`been engaged. More importantly, I felt that the required next step in this line of work was
`to carry out real, rather than simulated, interactive searching, to investigate the behaviour of
`relevance weighting under the constraints imposed by real users, who might not be willing to
`look at enough documents to provide useful feedback information. Though we had already
`done some laboratory tests designed to see how well relevance weighting performed given little


`relevance information (Sparck Jones 1979b), something much nearer real feedback conditions
`was required. I hoped, indeed, that the results we had obtained would be sufficiently con-
`vincing to attract those engaged with operational services, though implementing relevance
`weighting in these contexts presents many practical difficulties.
`I was at the same time somewhat discouraged by the general lack of snap, crackle and
`pop evident in IR research by the end of the seventies, which did not offer stimulating new
`lines of work.
`I had maintained my interest in natural language processing, and this was
`manifestly then a much more dynamic area. I therefore returned to it, through a project on
`a natural language front end for conventional databases, though I maintained a connection
`with IR through the idea of an integrated inquiry system described in the second part of this
`paper. I further became involved with the problems of user modelling (Sparck Jones 1987)
`which, in its many aspects and as a general issue in discourse and dialogue processing, has
`become an active area of language processing research. This has also been recognised as a
`topic of concern for IR, which provides an interesting study context for work on the problems
`involved and for research on the related issues of interface architectures, that I shall consider
`further in the second part of this paper.
`I think it a fair judgement, in reviewing all the research I have described, to say that
`it did show that distributional information could be successfully exploited in indexing and
`searching devices, and that it helped to establish experimental standards. But throughout I
`owed a great deal to the examples set by Cyril Cleverdon, Mike Keen and Gerry Salton, and
`to the productive exchanges and collaborations I have had with them and with other close
`colleagues, notably Keith van Rijsbergen and Stephen Robertson, as well as to my research
`assistants of the seventies, Graham Bates and Chris Webster.
`Thoughts on the present and future
`The work I have described directly reflects the dominant preoccupations of research on auto-
`matic indexing and retrieval from the time in the late fifties when computers appeared to offer
`new possibilities in the way of power and objectivity. It was concentrated on the derivation
`of document and request descriptions from given text sources, and on the way these could be
`manipulated; and it sought to ground these processes in a formal theory of description and
`But these concerns, though worthy, had unfortunate consequences. One was that, in spite
`of references to environmental parameters and so forth, it tested information systems in an
`abstract, reductionist way which was not only felt to be disagreeably arid but was judged
`to neglect not only important operational matters but, more importantly, much of the vital
`business of establishing the user’s need. Relevance feedback, and a general concentration
`on requests rather than documents as more worthy of attention in improving performance
`(following the Case Western findings of the sixties) went some way towards the user, but
`did nothing like enough compared with the rich interaction observed between the human
`intermediary and the user. The neglect of the user does not invalidate what was done, but
`it suggests it plays a less important part in the information management activity involved in
`running and using a body of documents than the concentration on it implied. The rather
`narrow view was however also a natural consequence of the desperate struggle to achieve
`experimental control which was a very proper concern and which remains a serious problem
`for IR research, and particularly the work on interactive searching to which I shall return


`The second unfortunate consequence of the way we worked in the sixties and seventies
`was that while the research community was struggling to satisfy itself in the laboratory,
`the operational world could not wait, and passed it by. The research experiments were so
`small, the theory was so impenetrable, and the results it gave were at best so marginal in
`degree and locus, that they all seemed irrelevant. One of the main motivations for the Ideal
`Test Collection was the recognised need for larger scale and thus naturally more convincing
`experimental research. Many of the large services’ concerns are wholly proper and important
`ones. But we are in the unfortunate position that the services have become established in a
`form that makes it very difficult, psychologically as well as practically, to investigate the best
`research strategies in a fully operational environment.
`Carrying out well-founded experiments to compare, for example, relevance weights with
`more conventional search methods would be arduous and very costly, and there are fundamen-
`tal difficulties about evaluating essentially different strategies like those producing unranked
`and ranked output. A fair case can be made for automatic indexing (Salton 1986), but the
`miscellaneous tests comparing less conventional with more conventional indexing and search-
`ing devices which have been carried out over a long period have not, in dealing with general
`matters like relevance sampling or in distinguishing variables and parameters and testing with
`adequate ranges of values and settings, been thorough enough to support solid conclusions
`at the refined level of analysis and characterisation that is really required, and to justify
`the specific assumptions and claims that are made. The research community interested in
`statistically-based methods is open to the criticism that it is an in-group engaged in splitting
`formula hairs, and that even where it has done experiments these have not shown enough
`about the methods themselves, or about their relative contribution compared with that made
`by other factors to overall system performance as perceived by the user, for it to be legitimate
`to assert that we can now expect the operational community to pick up the results and apply
`them. We need to do these more serious experiments, and the question is how, given the
`challenges they present.
`As it is, it is impossible not to feel that continuing research on probabilistic weighting in
`the style in which it has been conducted, however good in itself in aims and conduct, is just
`bombinating in the void; and it is noticeable that the action is taking place somewhere else.
`In fact, IR as conventionally perceived and conducted is being left behind in the rush to climb
`on the information management bandwagon. The service vendors will continue to improve
`and consolidate their technology, and the library schools to train the professionals to guard
`the sacred flame of bibliographic control. But there is a new movement, and it is useful to
`look at its goals, to see what this suggests about the right directions for IR research.
`The current interest, clearly stemming from the growth of computing power and the ex-
`tension of computer use, is in integrated, personalisable information management systems.
`These are multifacetted information systems intended to bring together different types of
`information object, and to support different types of information use, exploiting modern
`workstation technology and, most importantly, calling on artificial intelligence in manipulat-
`ing the knowledge required to connect different objects and uses in a conveniently transparent,
`and personally oriented, way. The user should be able, for example, to edit, annotate and
`publish papers, scan and modify bibliographic files, submit database queries, send and receive
`mail, consult directories and checklists, organise schedules, and so forth, moving freely from
`one type of object or activity to another within a common command framework and inter-
`acting not only with his own files but with a larger system involving other active users. The


`argument is that to do all this effectively, for example to support a combination of literature
`searching and record management in a hospital, a knowledge base, in this case about the
`relevant medical domain, is required, to support the inference needed to allow, say, effective
`patient scheduling.
`Salton has already cast doubt on the AI approach (Salton 1987). I believe (Sparck Jones
`1988b) that there are fundamental difficulties about the programme just outlined, and that
`there is a misconception in the idea that AI in the shape of rampant inference on deep
`knowledge, could lead to the desired goal. An integrated system of the kind envisaged is
`thoroughly heterogeneous, in the nature of the objects involved and in their varied grain size,
`in the functions applicable to them, and in the relevance needs they serve. Integrating these
`heterogeneous resources so the individual user can move freely from one information source
`or information-using activity to another implies a commonality of characterisation that is
`arguably unattainable given the intrinsic indeterminacy of IR systems and the public/private
`conflict that is built into them.
`It is rather necessary to remember that information management systems are information
`access systems, and that what they primarily provide access to are linguistic objects: natural
`language words and texts to be read, and properly and unavoidably to be read. The starting
`points for access supplied by the user are themselves also language objects. So to the extent
`that integration and personalisation can generally be achieved, this has to be through the
`relationships that hold between natural language expressions themselves in all their untidy
`variation and not through some cleanly logical universal character. This is not to suggest
`that individual specialised components serving particular purposes which should enhance
`system performance in specific ways, and which fully exploit artificial intelligence as defined
`should not be sought, for example, an expert system to construct controlled language search
`specifications; but their depth will probably be inversely related to their breadth. In general
`we have to look to language-based ways of connecting different parts of the system, and of
`relating the individual user to the public world of information.
`I shall illustrate the kind of thing I believe is required, and on which we should therefore
`be working, with two personal examples.
`I am not claiming any special status for these
`particular cases: they are intended primarily to supply some concrete detail.
`The first example, Menunet (Brooks and Sparck Jones 1985), is very simple and does not
`make any reference to AI. Menunet was proposed as a device for allowing the user of a set of
`office utilities accessed and operated through hierarchically-organised menus to move laterally
`from one point to another without a prior knowledge of the relevant menu option names, via
`ad hoc route-finding referring to system actions or objects. Essentially the user would be able
`to say I want to do something like ’send’, or to operate on something like a ’document’, and
`given these words as starting points be presented with all the instantiations of the underlying
`concepts in their various menu option linguistic forms. This would be done through index
`menus, constructed on the fly, listing all the menu options indexed at their sources by the
`starting word(s). The user would thus be given all the system menus accessible from the given
`calling words, where the concept(s) invoked by the calling word(s) figured under whatever
`lexical label(s) were deemed appropriate and therefore were used there. The argument was
`that with a large and complex set of utilities of the kind encountered in office automation,
`the number and variety of local menu contexts implies that identical terms will not be used
`for the same or similar purposes, and that the user cannot be expected to remember all the
`labels used; but that both tracking up and down a large hierarchical menu, and relying on a
`conventional help system, are unsatisfactory as supports for optimal travel within the system.


`The basic model can be made more sophisticated by incorporating term weighting, indicating
`the relative value of index terms as labels for an option, and by making the system adaptive
`by allowing for change in the sets and weights of index terms indicating the pattern and
`intensity of term relationships to reflect the user’s behaviour over time.
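The index menu idea can be sketched as an inverted index from calling words to menu options, with the index menu built on the fly from the user's starting words. The menu paths, index words and function names below are hypothetical, invented for illustration; they are not taken from the Menunet proposal, and ranking by a simple count of matching words is likewise an assumption.

```python
from collections import defaultdict

# Hypothetical menu options, each indexed at its source by a few words.
MENU_OPTIONS = {
    "Mail > Compose > Dispatch": {"send", "mail", "message"},
    "Documents > Print > Transmit to printer": {"send", "print", "document"},
    "Documents > Edit > Open paper": {"edit", "document", "paper"},
}


def build_index(options):
    """Invert the option descriptions: index word -> set of menu paths."""
    index = defaultdict(set)
    for path, words in options.items():
        for word in words:
            index[word].add(path)
    return index


def index_menu(index, calling_words):
    """List the options reachable from the calling words, best match first.

    Options are ranked by the number of calling words that index them,
    with ties broken alphabetically by menu path.
    """
    hits = defaultdict(int)
    for word in calling_words:
        for path in index.get(word, ()):
            hits[path] += 1
    return sorted(hits, key=lambda path: (-hits[path], path))


INDEX = build_index(MENU_OPTIONS)
send_menu = index_menu(INDEX, ["send"])
send_document_menu = index_menu(INDEX, ["send", "document"])
```

Here the calling word "send" reaches both "Dispatch" and "Transmit to printer", although neither option uses that word as its label. Replacing the count with per-word weights that change as the user works would give the weighted, adaptive variant described above.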
`This particular suggestion is an application of document retrieval methods in the office
`area, and as such illustrates the role of the language-based associative structures I believe
`have a crucial part to play in the information management systems now being sought.
`My other, more ambitious example comes from the work we have done relating to the
`idea of an integrated inquiry system (Boguraev and Sparck Jones 1983, Sparck Jones and
`Tait 1984, Sparck Jones 1983).
`In this we assume that the system has different types of
`information source in some subject area, e.g. a database of the conventional coded sort, a
`bibliographic text base, and (in principle) a subject or domain knowledge base. Then if the
`user seeks information, expressing his need in a natural language question, the system will
`seek to respond with germane information items from whatever type of source these can be
`obtained. This would be a normal strategy where a particular type of source is not specified,
`reflecting the fact that the different types of source provide different sorts of information
`complementing one another and therefore potentially all of value to the user. It could also be
`a default strategy where information from a specified type of source cannot be obtained.
`This scenario requires appropriate ways of processing the input question to extract the
`different kinds of search specification suited to the different source types: a formal query
`in a data language in the database case, and a set of search terms, for example, in the
`document case. In our experiments we have used the same language analyser to obtain an
`initial interpretation of the input question, resolving its lexical and structural ambiguities and
`giving it an explicit, normalised meaning representation. This representation is then taken
`as the input for further processing of the different sorts required to obtain the appropriate
`search query and request forms.
`In the first case this involves structural transformations
`to derive a logical form, and substituting terms and expressions relating specifically to the
`database domain for the less restricted elements of the natural language input, so searching
`can be carried out on the set of artificial data language encodings of the domain information
`constituting the database.
`In this database-oriented processing the structure of the input
`question as a whole is retained (Boguraev and Sparck Jones 1984).
`For the document case it is more appropriate to look for a different type of derived
`question representation in which many of the initial structural constraints are relaxed or
`abandoned. We have, however, specifically concentrated on extracting not just simple terms,
`but complex ones, from the initial analyser output, by looking for well-founded components
`of the initial interpretation, like those defined by pairs of case-related items. These could
`in principle be mapped into controlled indexing terms if documents were indexed using a
`controlled vocabulary. But we have rather investigated the idea of generating, from each of
`these underlying representation constituents, a set of alternative English forms, to provide a
`set of equivalent search terms for each concept which can be directly matched onto the stored
`texts, full or surrogate, of the document file (Tait and Sparck Jones 1983).
`For the inquiry system design, however, unlike the Menunet utility interface, the more
`challenging access requirements imply the use of AI. Thus in the database case, it turns
`out that in a complex domain, deriving a correct search query from a natural language
`question can call for inference on world models, for example inference on a model of the
`database domain to establish the specific legitimate form for an entity characterisation given
`in the question:
`in a town planning domain, for instance, a reference to people in places


`has to be transformed into a reference to people owning property in places (Boguraev et al
`1986). We are currently investigating the use of a network-type knowledge representation
`scheme with associated inference operations, to encode and manipulate world knowledge.
`It seems appropriate, because the processes of query derivation can be viewed as linguistic
`translations, to treat the knowledge base as embodying relations between word senses rather
`than as directly characterising the world, i.e. to view it as a shallow, linguistically-oriented
`body of knowledge, and further, as one which is redundant rather than parsimonious in
`allowing for very different roles for, and expressions of, common concepts. Thus buildings
`as a concept in the town planning domain, for example, have to be characterised in terms
`of a whole mass of overlapping perspectives on their physical and functional properties. The
`kinds of inference procedure allowed are rather weak and limited, and are oriented towards
`establishing linguist

