H. P. Luhn
A Statistical Approach to Mechanized Encoding
and Searching of Literary Information*
Abstract: Written communication of ideas is carried out on the basis of statistical probability in that a writer
chooses that level of subject specificity and that combination of words which he feels will convey the most
meaning. Since this process varies among individuals and since similar ideas are therefore relayed at different
levels of specificity and by means of different words, the problem of literature searching by machines
still presents major difficulties. A statistical approach to this problem will be outlined and the various steps
of a system based on this approach will be described. Steps include the statistical analysis of a collection of
documents in a field of interest, the establishment of a set of "notions" and the vocabulary by which they
are expressed, the compilation of a thesaurus-type dictionary and index, the automatic encoding of documents
by machine with the aid of such a dictionary, the encoding of topological notations (such as branched
structures), the recording of the coded information, the establishment of a searching pattern for finding
pertinent information, and the programming of appropriate machines to carry out a search.
1. Introduction
The essential purpose of literature searching is to find
those documents within a collection which have a bearing
on a given topic. Many of the systems and devices, such
as classifications and subject-heading lists, that have been
developed in the past to solve the problems encountered
in this searching process are proving inadequate. The
need for new solutions is at present being intensified by
the rapid growth of literature and the demand for higher
levels of searching efficiency.
Specialists in the literature searching field are optimistic
about the future application of powerful electronic
devices in obtaining more satisfactory results. A successful
mechanical solution is unlikely, however, if such
modern devices are to be viewed merely as agents for
accelerating systems heretofore fitted to human capabilities.
The ultimate benefits of mechanization will be
realized only if the characteristics of machines are better
understood and systems are developed which exploit
these characteristics to the fullest. Rather than subtilize
the artful classificatory schemes now in use, new systems
would replace them in large part by mechanical routines
based on rather elementary reasoning.
The major technical effort involved in substituting
mechanical for intellectual means must, of course, be
justified by the improved results obtained. However, if
partial mechanical substitution for human effort cannot
be found in automation, there is a real danger that the
demand for professional talent will become too great to
fill. In view of the foreseeable strain, the most efficient
use of talent will have to be made even by automatic
systems. The operating requirements of these systems
will, above all, have to be well adapted to the degree of
education and experience of generally available personnel.
* Presented at American Chemical Society meeting in Miami, April 8, 1957.
Language difficulties, too, will have to be met. The
problems stemming from the mere volumes of literature
to be searched are being continually aggravated by the
increasing accession of foreign-language documents that
rate consideration on an equal level with domestic
material. To be of real value, future automatic systems
will have to provide a workable means of overcoming
the language barrier.
• Complexity levels of information systems
The general terms in which the problem of literature
searching has been treated might indicate the possibility
of a general, or universal, solution. It would be unrealistic
to assume that such is practical or desirable. It is
quite important to establish some differentiating criteria
by which information reference arrays may be distinguished
and graded as to their make-up, objectives, and
uses. It will then be possible to recognize better the existence
of different levels and the necessity of applying
appropriately different techniques to their mechanization.
The following list of six information systems in order of
increasing complexity, insofar as mechanical solution is
concerned, may be of use in indicating the differences in
information levels:
1. Ready reference look-up systems of facts such as
indexes, dictionaries, and parts catalogues.
2. Systems of limited and narrowly defined categories
especially where, as in lists of specifications, categories
are repetitive. (Personnel histories, medical case histories,
etc.)
3. Systems of the kind found commonly in chemistry that
deal with inventories of uniquely definable structures
and their interrelations and transformations.
4. Systems of mathematics, logic, and law that are based
on disciplined concepts of human intellect.
5. Systems dealing with the exploitation of natural
phenomena and objects, as in the applied sciences and
6. Systems, of which pure fiction is the extreme, dealing
with unrestricted association of human notions.
While there may be criteria other than those listed
above according to which the spectrum of information
can be graduated, the important fact to recognize here is
that there are radically different classes of information. It
therefore makes little sense to discuss a literature searching
system without also identifying the portion of the
spectrum to which the system is to be applied.
• Distribution of human effort
Since the graduation of the above list ranges from explicit
factual listings to the abstract concepts of creative writing,
it seems unavoidable that the efficiency of recognition
of desired information will decrease in this direction. The
various systems might therefore be characterized by their
recognition potential and the amount and distribution of
human and machine effort required. It seems to be an
inescapable fact that the less disciplined the language, the
greater the human effort that must be expended somewhere
in the system.
There are four distinct phases of human effort involved:
1. The design, setup, and maintenance of the system.
2. The interpretation and introduction of information
into the system.
3. The programming of wanted information for mechanical
recognition.
4. The interpretation of selected records to determine
whether they are relevant to the wanted information.
To arrive at an optimum process for a given information
level, the question of the quality and proportion of
human effort to be expended at each of these phases
must be answered.
• Time considerations
The introduction of time as an additional variable will
change proportions quite considerably. If, for instance,
any kind of information must be located in a matter of
minutes, the possible maximum of skilled effort will have
to be spent at the input phase of the system and in an
equal degree on every entry into the system. If, however,
time requirements are less pressing, input procedures
that require medium skill and minimum effort may be
chosen so that the skilled effort can be concentrated at
the output phase on only a small fraction of the records
of the collection. In the latter case, the fact that only a
small fraction of the records of a collection will ever be
selected should result in a reduction of the overall effort.
Time may affect a system in another way that makes
the shift of skilled effort to the output phase more desirable.
Excessive editing obviously increases the likelihood
of bias due to current interests, experiences, and points of
view. In consequence the usefulness of the system will be
reduced as emphases and interests change. It would therefore
appear that the less information is classified and
contracted at the input, the more it will lend itself to
dynamic interpretation at the output phase.
• A proposed solution by statistical methods
The following paragraphs will present the basis and
organization of a literature searching system which utilizes
statistical methods in conjunction with a high degree
of mechanization. The principles involved represent an
extension and refinement of those discussed in an earlier
paper.1 Although the specific system described is primarily
designed to satisfy the requirements of information
level 5 of the foregoing list, it may also be found adaptable
to levels 4 and 6.
Generally speaking, the proposed system is based on
what are variously referred to as cross-indexing,
multidimensional-indexing, coordinate-indexing,
multiple-aspect-indexing and encoded-abstract techniques. Actual
practices vary from lifting key words of a text by manual
editing to interpretive analysis by logical formulas of
well-defined concepts. M. R. Hyslop has given a general
description and a bibliography of methods using these
2. The statistical aspects of communicating ideas
Communication of ideas by way of words is carried out
on the basis of statistical probability. We speculate that
by using certain words we will be able to produce in
somebody else's mind a mood and disposition resembling
our own state of mind which resulted from an actual
experience or a process of thought. In order to communicate
an idea, we break it down into a series of little ideas,
i.e., more elementary ideas for which previous and common
experience might have led to an agreement of
meaning. We extend this process until we feel that we
have reached a level of conventional notions, a level at
which communication can be accomplished. This level
may vary depending on the degree of similarity of common
experiences. The fewer experiences we have in
common, the more words we must use.
A picture of this process, if it can be drawn at all,
might look something like the triangular portion of Fig. 1.
The process of communicating ideas is dynamic when
it can be performed by means of the spoken word. In the
first place, the addressor can size up the addressee and
adjust the process of subdividing his idea to the level of
common experience which most probably exists between
the two. Secondly, guided by the feedback of the addressee's
reactions and questions, the addressor may readjust
to a reasonably optimum level or change his
strategy of composition.
Figure 1 Communication of ideas.
Breakdown of basic idea into elementary concepts on experiential level common to reader and writer.
The process assumes static qualities as soon as ideas
are expressed in writing. Here the addressor has to make
certain assumptions as to the make-up of the potential
addressee and as to which level of common experience
he should choose. Since the addressor has to rely on some
kind of indirect feedback, he might therefore be guided
by the degree to which the written expressions of ideas
of others have raised the level of common experience
relative to the concepts he wishes to communicate.
The most general such guidance is furnished by the
dictionary. Here the verbal expressions of ideas at a
given level of common experience are defined in terms
of verbal expressions at other levels so that a broad
domain of common experience is assured. Thus, the dictionary
is a periodical report to word users on the ideas
which currently are most often conveyed by the words
in use. The level at which the lexicographer breaks off
his reporting may vary and is, of course, dictated by
economical factors. This is so because in the extreme
he would have to quote substantial parts of the current
literature to explain the slightest differentiations of ideas.
It is the task of special dictionaries to bring more remote
areas of experience into the common domain by explaining
ideas at higher levels.
In writing, the addressor may then take the special
dictionary of his field of interest as a next approximation
of a level of common experience on which to communicate
with the addressee. However, since the lexicographer
can never be up to date, there still remains a gap which
the addressor will have to fill to permit the addressee to
adjust himself to the desired level. This he may do by
referring the addressee, by means of a bibliography, to
that portion of the literature which the lexicographer has
not as yet analyzed.
It may be assumed that the means and procedures just
mentioned permit communication to be accomplished in
a satisfactory manner. If it is possible to establish a level
of common experience, it seems to follow that there is
also a common denominator for ideas between two or
more individuals. Thus the statistical probability of combinations
of similar ideas being similarly interpreted
must be very high.


If it were possible to recognize idea building blocks
irrespective of the words used to evoke them, these
building blocks might be considered the elements of a
syntax of notions. Communication could then be carried
out by relaying these notions by means of agreed-upon
symbols. Since these symbols would be independent of
style and language, they would help to overcome language
barriers. A symbol system of this kind would be
most useful in facilitating the process of information
recognition by automatic means.
3. Possible building blocks for a statistical system
The lack of uniformity of structure and the arbitrariness
of word usage make literature an unwieldy subject for
automation. Its information content must first be represented
and organized into a form that can be operated
on by a machine, for only then can the degree of
similarity between any two records be automatically
determined. The most efficient means of transforming
information for machine interpretation would be those
that permit the application of a minimum of logical
machine instructions for access to relevant information.
The very nature of free-style renderings of information
seems to preclude any system based on precise relationships
and values, such as has been developed in the
field of mathematics. Only by treatment of this problem
as a statistical proposition is a systematic approach possible.
The objectives of a system based on this proposition
would be first to transform information into arrays of
normalized idea building blocks and then to discover
similarities in the respective building-block patterns of
these arrays by means of a statistical analysis. It could be
reasonably assumed that the more closely two arrays are
matched, the greater the probability that the records they
represent contain similar information.
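The matching of two building-block arrays can be illustrated with a minimal sketch. The paper does not prescribe a particular measure of agreement; cosine similarity over notion counts is used here purely as an assumed stand-in, and the notion labels (`CIRCUIT`, `INSECT`, and so on) are hypothetical.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Agreement between two notion-count arrays, from 0.0 (none) to 1.0."""
    # Only notions present in both arrays contribute to the dot product.
    shared = set(a) & set(b)
    dot = sum(a[n] * b[n] for n in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty array matches nothing
    return dot / (norm_a * norm_b)


# Hypothetical records, each reduced to counts of normalized notions.
doc_a = Counter({"CIRCUIT": 3, "SWITCHING": 2, "MEMORY": 1})
doc_b = Counter({"CIRCUIT": 2, "SWITCHING": 1, "INSECT": 1})
doc_c = Counter({"INSECT": 4, "HABITAT": 2})

print(cosine_similarity(doc_a, doc_b))  # relatively high: two notions shared
print(cosine_similarity(doc_a, doc_c))  # 0.0: no notions shared
```

Any measure that grows with the number and weight of shared notions would serve the same illustrative purpose; the point is only that a closer match is read as a higher probability of similar content.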
It is true that the principle of pattern matching has
been applied previously in searching systems.3 The emphasis
here, however, is on the use of notions as a basis
for pattern derivation. Where such non-precise elements
can be used as building blocks, the possibility of creating
a practical information retrieval system is substantially
In the process of communicating ideas, an author
pursues a certain plan of organizing his ideas. The external
evidence of such a plan is the grouping of his
ideas into chapters, paragraphs, and sentences. Figure 1
illustrates how this organization may come about. Notions
are most closely and specifically related to each
other within a sentence. One sentence immediately following
another might either be related in its entirety to
previous notions or serve to relate these notions to new
ones. The same might be said of succeeding sentences.
However, a significant new argument is usually introduced
in a new paragraph. A still more decisive change
of aspects might be denoted by the start of a new chapter.
This conscious division by the author furnishes one
key to the relatedness of his notions, which although not
always accurate, may generally be accepted as a significant
and meaningful element of the information he is
attempting to relay. We may therefore consider several
degrees of relationship; namely, the first-order relationship
of notions within a sentence, a second-order relationship
between sentences within a paragraph and their
respective notions, a third-order relationship between
paragraphs within a chapter, and still higher orders for
larger divisions.
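The orders of relationship just described can be sketched as a small routine. The nested-list representation of a document, the function name `relation_order`, and the treatment of the whole document as a fourth order are all assumptions made for illustration.

```python
from typing import Optional

# Assumed layout: document > chapters > paragraphs > sentences > notions.
Document = list[list[list[list[str]]]]


def relation_order(doc: Document, a: str, b: str) -> Optional[int]:
    """Return the closest order at which notions a and b are related.

    1 = same sentence, 2 = same paragraph, 3 = same chapter,
    4 = same document only, None = one of the notions never occurs.
    """
    positions: dict[str, list[tuple[int, int, int]]] = {a: [], b: []}
    for ci, chapter in enumerate(doc):
        for pi, paragraph in enumerate(chapter):
            for si, sentence in enumerate(paragraph):
                for notion in sentence:
                    if notion in positions:
                        positions[notion].append((ci, pi, si))
    best: Optional[int] = None
    for pa in positions[a]:
        for pb in positions[b]:
            if pa == pb:
                order = 1
            elif pa[:2] == pb[:2]:
                order = 2
            elif pa[0] == pb[0]:
                order = 3
            else:
                order = 4
            if best is None or order < best:
                best = order
    return best


# Hypothetical two-chapter document.
doc = [
    [[["relay", "circuit"], ["contact"]], [["memory"]]],
    [[["tape"]]],
]
print(relation_order(doc, "relay", "circuit"))  # 1
print(relation_order(doc, "relay", "memory"))   # 3
```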
The sum of the relationships and divisions, as far as
the author is concerned, is the entirety of his message or
paper. However, since it is desirable to make the paper
or document comparable with other similar documents,
a still higher level of grouping is indicated, and this is the
level of common experience previously discussed. It was
argued that a level, or field, of common experience was a
requirement for communication. It follows that the more
specific the field, the closer will be the agreement among
the notions used in the mental process of people associated
with that field. It therefore seems important and
helpful to recognize these fields and to establish them as
a next order or level of division.
• Notions and "technese"
Communications at this specialized level are made as
though in a foreign tongue, in that people in various
specific fields each speak a "native" technical language.
However, since notions are here to be considered independently
of their implementation by words, we are
referring to the syntax of notions of the specialist. This
syntax of notions might be called technese, for lack of a
suitable existing term. We may talk, for example, about
the technese of the chemist, the lawyer, or the electrical
engineer.
For each kind of technese, a notion may be expressed
in the words of any desired language. The association of
words and notions will of course be typical of a given
field, and the more specialized the field, the more complex
may be the notion expressed in a single word. It
must be emphasized that language per se remains incidental.
The notions, which are the essential elements in
all technese, are assumed to be independent of any
language.
After individual special fields have been established,
a final grouping would be required to embrace the totality
of special fields. The notions to be applied at this level
would necessarily be more general and the process of
matching would be carried out by way of appropriately
broader notions.
In addition to the hierarchical organization just described,
there is another kind of division which should
be introduced to facilitate the adjustment of a system to
the constant expansion of knowledge and the associated
adaptations and changes of language. This may be done
by starting a new division or "age class" of documents at
given intervals, as time progresses. For each new interval
the system would be updated to reflect, for the ensuing
period, the changes during the preceding period. The
process of searching would then be performed first for
the current period, then for the preceding period and so
on, and to the extent dictated by the results obtained.
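A retrogressive search over age classes might be sketched as follows. The stopping rule (a fixed number of hits, `enough`) is an assumption standing in for "the extent dictated by the results obtained", and the record labels are hypothetical.

```python
from typing import Callable, Sequence


def search_by_age_class(
    age_classes: Sequence[Sequence[str]],
    is_relevant: Callable[[str], bool],
    enough: int,
) -> list[str]:
    """Search the youngest age class first and step back only as needed.

    age_classes must be ordered from the current period to the oldest.
    """
    hits: list[str] = []
    for records in age_classes:
        hits.extend(r for r in records if is_relevant(r))
        if len(hits) >= enough:
            break  # results suffice; older classes are never scanned
    return hits


# Hypothetical collection: three mutually exclusive age classes.
collection = [["1957-a", "1957-b"], ["1952-a"], ["1947-a"]]
found = search_by_age_class(collection, lambda r: r.endswith("a"), enough=2)
print(found)  # ['1957-a', '1952-a']
```

Because the classes are mutually exclusive, stopping early shortens the average search without any record being examined twice.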


The use of age classes seems to be the only method by
which a collection may be divided into mutually exclusive
sections. The searching of a collection in retrogressive
steps or by predetermined age groups is bound to
shorten the average time of a search. It also appears
useful in many instances to search the most recent literature
first.
The above system of notions and their degree of relatedness
is not necessarily the sole system by which
comparable patterns may be derived. Certain classes of
information elements such as names or symbolism of
structure, e.g., chemical structures, flow diagrams, circuit
diagrams, road maps, etc., might demand rather specific
identifications. The notations used to represent these
elements would assume the same status as that accorded
to notions.
4. The limitations of serial communication
The process of communicating notions by means of
words can only be performed in serial fashion. In order
to overcome this basic limitation, intricate devices have
to be incorporated into a language to instruct the addressee
how to relate notions in ways other than those
given by the linear sequence of words. By means of
additional words, the addressee is told how to construct
a mental image of the multidimensional conceptions of
the idea being communicated. Since these instructions
may become rather involved and subject to misinterpretation,
it is advantageous to utilize pictorial presentations.
When thus supplemented, serial language lends itself
much more readily to the investigation and description
of multidimensional relationships.
This limitation of serial communication and its associated
problems also inheres in data-processing machines.
Communication is carried out on a serial basis in the
same sense as among humans. For pictorial representations,
the machine is at a disadvantage, at least at the
present stage of the art. The best that can be done is to
instruct the machine to create a multidimensional array
and to further instruct the machine to analyze all the
many relationships contained in this array. For a machine
to do this, it must have an internal memory where
it can store the representation and analyze it over and
over again in accordance with a specific program.
The organization and recording of information capable
of being analyzed in the above fashion, as well as the
development of programs directing the machine to do
this, is a very exacting procedure. The machine, having
only logic to its credit, cannot function unless information
and instructions are given it in strictly logical language.
A system in which relationships between notions
were to be given and explicitly recognized would therefore
be dependent upon a major intellectual effort for
interpreting meanings and relationships and translating
them into unique notations. As with current classificatory
schemes, this effort would have to be repeated for each
new document. When it came to the searching operation,
inquiries would have to be similarly interpreted and
encoded. The machine would then have to recognize
similarity of representations through an iterative process
of identifying and comparing each of the specific relationships
given.
The question arises whether similarity of multidimensional
representations might not be established by more
direct methods without reliance on an internal memory
machine. It might be argued that, while it is true that a
given number of various notional or pictorial elements
could theoretically be related in countless patterns, only
a very limited number of these patterns represent meaningful
information. Moreover, each additional pattern,
in association, further limits the number of meaningful
interpretations applicable in the particular case.
On these grounds it would be possible to disregard
specific and explicit relationships and merely investigate
whether certain elements happen to be associated and
to what degree. Such a substitution of statistical for
critical criteria would facilitate the establishment of similarity
by matching. The more two representations agreed
in given elements and their distribution, the higher would
be the probability of their representing similar information.
The actual matching process would be performed
through a serial scanning of records. The machine used
need not be capable of temporarily storing blocks of
information in an internal memory.
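A serial scan of this kind can be approximated by a generator that holds only the query and the record currently passing the reading station. The score used here, the fraction of query notions found in a record, is an assumed simplification that ignores the distribution of elements; the record identifiers and notion labels are hypothetical.

```python
from typing import Iterable, Iterator


def scan(
    records: Iterable[tuple[str, frozenset[str]]],
    query: frozenset[str],
    min_score: float,
) -> Iterator[tuple[str, float]]:
    """Pass over the records once, in order, yielding sufficiently close matches."""
    for doc_id, notions in records:
        # Degree of association: share of query notions present in this record.
        score = len(query & notions) / len(query) if query else 0.0
        if score >= min_score:
            yield doc_id, score
        # Nothing about the record is retained once it has passed.


# Hypothetical "tape" of encoded records, consumed strictly in sequence.
tape = iter([
    ("D1", frozenset({"A", "B", "C"})),
    ("D2", frozenset({"A"})),
    ("D3", frozenset({"X"})),
])
matches = list(scan(tape, frozenset({"A", "B"}), min_score=0.5))
print(matches)  # [('D1', 1.0), ('D2', 0.5)]
```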
As will be seen, the type of scanning suggested is
applicable to the statistical searching system presented in
the following sections. The system concerns itself mainly
with information represented by the written word. In the
above discussion, however, reference was made to pictorial
representations without indicating how these might
be organized either for the purpose of exhaustive analysis
by machines with internal memory or scanning and
matching on a statistical basis. By way of example, the
reader will find the first kind of system presented in a
paper by Ascher Opler4 and the second kind in a paper
by the author.5
5. The organization of a statistical searching system
• Objective
The primary objective of the proposed system is the
minimization of the intellectual effort of professionals at
the document encoding stage of the system so that this
day-by-day routine may be performed by automatic
means and with a minimum of professional personnel, in
accordance with a few simple rules. The intellectual
effort of professionals, who are relieved of the routine
encoding task, is now shifted to and concentrated at the
creative stage of setting up the system itself. This effort
will be quite substantial when a system is first installed
and will call for above-average talent. Thereafter, a
moderate effort will be required periodically to update
the system.
• Creating a dictionary of notions
The procedure to be described is similar to the one used
by P. M. Roget for compiling his Thesaurus of English
Words.6 Roget created categories of words that had a
family resemblance on a conceptual level. He arrived at
approximately 1000 of these categories for the entirety
of experience. Under such a category as "space" he lists
all words and phrases that include any notion of spatiality.
This procedure, as adapted here, also relies greatly
upon the techniques used in the preparation of concordances
of significant works in literature such as have
been applied in connection with the complete works
of St. Thomas Aquinas as described in a recent article.7
The virtue of such a procedure is that it provides for the
greatest possible extent of mechanization. In the form
presented it is most applicable to a collection of documents
embracing a specialized field that would normally
be pertinent to a research activity serving an industrial
The first step in the procedure is the establishment of
a basic sample drawn from the collection (Fig. 2). Since
the system is to be a dynamic one, such a sample should
consist of the "youngest" age group, comprising all accessions
from the present back to a judiciously selected
date. The choice of this date should in part be governed
by the number of useful documents obtained.
The next step consists of transcribing the sample documents
into punched or magnetic tape, i.e., into a form
which will permit subsequent mechanical operations on
the information. Inasmuch as certain grammatical features
of words should be recognized in subsequent steps
of the procedure, it would be advantageous to identify
certain classes such as nouns, adjective qualifiers, and
names by special symbols. Eventually these differentiations
may be determined by machine, as they will have
to be when the art of machine translation is perfected.
The third step is the preparation of a card index of all
transcribed sentences. A concordance worked out with
these cards will then result in the grouping of words of
similar or related meaning into "notional" families. This
is so similar to the work required for the creation of a
thesaurus such as Roget's, that the basic organization of
such books may well serve as the skeleton for this process.
The formation of notional families constitutes a major
intellectual effort to be undertaken by experts thoroughly
familiar with the habits of communication among people
associated with the special field of the subject literature.
These experts would endeavor to differentiate the notional
families so as to resolve the material in terms of
an optimum number of equally weighted elements. For
instance, in a field that specializes in electricity the
notion "electricity" would be common to most documents
and therefore worthless as a discriminating element.
On the other hand, in this same field, the notion
"butterfly" would be entirely too specific for a separate
notional family. Instead, the notion "electricity" would
have to be broken down into an appropriate number of
subnotions in accordance with qualifying adjectives that
might accompany it, while the notion "butterfly" would
have to be relegated to a notional family of broader
aspects, such as the notions "insects," "animals," or
"living things," depending on the overall frequency of
occurrence of such notions.
In one of many possible systems, for instance, it is
assumed that nouns (including gerunds) and, where
necessary, qualified nouns are capable of providing an
effective set of discriminating notions. Such nouns would
then be grouped into notional families in accordance
with the principles established by Roget. The physical
result would be a dictionary in two parts. The first part
would be the listing in some systematic order of the
notional families, each identified by an index symbol
such as a number or key word. Each of these would
represent a listing of the words from the sample documents
which are related with respect to the notion they
express. If more than one language is involved, the words
within a family might be segregated by language. The
second part would be an alphabetic index of the words
occurring in the first part, giving the key word and index
number of the one or several notional families of which
the given word is a member. This index may also be
segregated by language.
Figure 2 Information searching system: creation of
a dictionary of notions.
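The two-part dictionary can be pictured with a minimal sketch. The family numbers, key words, and member words below are invented for illustration, and segregation by language is omitted for brevity.

```python
from collections import defaultdict

# Part one (hypothetical content): notional families, each identified by
# an index number and a key word, listing the words that express the notion.
families = {
    101: {"key": "CURRENT", "words": ["current", "flow", "stream"]},
    102: {"key": "SWITCH", "words": ["switch", "relay", "contact"]},
    103: {"key": "CONNECTION", "words": ["contact", "junction", "joint"]},
}


def build_index(families: dict) -> dict[str, list[tuple[str, int]]]:
    """Part two: alphabetic index from each word to its family or families."""
    index: dict[str, list[tuple[str, int]]] = defaultdict(list)
    for number, family in families.items():
        for word in family["words"]:
            index[word].append((family["key"], number))
    # Sorting the entries yields the alphabetic order of a printed index.
    return dict(sorted(index.items()))


index = build_index(families)
print(index["contact"])  # [('SWITCH', 102), ('CONNECTION', 103)]
print(index["relay"])    # [('SWITCH', 102)]
```

A word such as "contact" belongs to more than one family, which is why each index entry carries a list rather than a single family.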
As far as intellectual effort is concerned, the establishment
of notional categories at the word level, as practiced
in this system, appears to have advantages over the
development of classification or subject headings. It is
estimated that the number of notional categories required
will be less than a thousand and that this number will
grow at a very low rate. In the case of classifications or
subject headings, the number and growth rate would
probably be substantially higher. Also it is likely that
it is easier to establish notional family membership for
the words than to define exact classes of subject headings
and their subsequent interpretations. Lastly, the reduction
in the effort required to maintain and update the
system should prove significant.
• The encoding of documents
The encoding of the documents of the sample may now
be carried out with the aid of the dictionary of notions.
This process consists of recording each document in
terms of notional elements and thereby creating patterns
which later will serve as a means of recognizing conceptual
similarity in varying degrees between documents.
This process might best be carried out on the basis
of the prevalent patterns of literary organization. Implied
here is the capability for recognizing various levels of
the relatedness of notions as reflected by the author's
formation of sentences, paragraphs, chapters, et cetera.
There is also the probability that the more frequently a
notion and combination of notions occur, the more importance
the author attaches to them as reflecting the
essence of his overall idea.
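One way such an encoding might proceed is sketched below. The word-to-family table, the splitting of text on blank lines and sentence punctuation, and the decision to record every family a word belongs to are all assumptions for illustration; the paper leaves these details open.

```python
import re
from collections import Counter

# Hypothetical alphabetic index: word -> notional family numbers.
WORD_TO_NOTIONS = {
    "relay": [102],
    "switch": [102],
    "contact": [102, 103],
    "current": [101],
    "junction": [103],
}


def encode(text: str) -> tuple[list[list[list[int]]], Counter]:
    """Encode text as paragraphs > sentences > notion numbers, plus counts."""
    pattern: list[list[list[int]]] = []
    counts: Counter = Counter()
    for paragraph in text.split("\n\n"):
        sentences: list[list[int]] = []
        for sentence in re.split(r"(?<=[.!?])\s+", paragraph.strip()):
            notions: list[int] = []
            for word in re.findall(r"[a-z]+", sentence.lower()):
                # Words absent from the dictionary carry no notion.
                notions.extend(WORD_TO_NOTIONS.get(word, []))
            if notions:
                sentences.append(notions)
                counts.update(notions)
        if sentences:
            pattern.append(sentences)
    return pattern, counts


text = "The relay closes a contact. Current then flows.\n\nA junction is formed."
pattern, counts = encode(text)
print(pattern)  # [[[102, 102, 103], [101]], [[103]]]
print(counts.most_common())
```

The nesting preserves the author's sentence and paragraph divisions, while the counts reflect how often each notion recurs in the document as a whole.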
At this point the question arises: To what degree of
specificity must notions and their relationships specifically
be encoded to arrive at a practical measure of comparison?
The answer probably cannot be given without a background
of practical experience. A practical method is to
start with a broad system and to determine by experience
whether and where refinements are needed. If, on the
other hand, a s

