`Facebook Ex. 1014
`
`
`
`
`
`Datenverarbeitung im Recht (Data Processing in Law)
`Archive for the entire science of legal informatics, legal cybernetics,
`and data processing in law and administration.
`Citation form: DVR
`
`Editors:
`Dr. jur. Bernt Bühnemann, Wissenschaftlicher Oberrat (senior academic councillor), University of Hamburg
`Professor Dr. jur. Dr. rer. nat. Herbert Fiedler, University of Bonn/Gesellschaft für
`Mathematik und Datenverarbeitung, Birlinghoven
`Dr. jur. Hermann Heussner, Presiding Judge at the Bundessozialgericht (Federal Social Court), Kassel,
`Lecturer at the University of Gießen
`Professor Dr. jur. Dr. phil. Adalbert Podlech, Technische Hochschule Darmstadt
`Professor Dr. jur. Spiros Simitis, University of Frankfurt a. M.
`Professor Dr. jur. Wilhelm Steinmüller, University of Regensburg
`Dr. jur. Sigmar Uhlig, Regierungsdirektor (government director) in the Federal Ministry of Justice, Bonn
`(Managing Editor)
`
`Advisory editors and regular contributors:
`Dr. Hélène Bauer-Bernet, Legal Service of the Commission of the E.C., Brussels - Pierre
`Catala, Professor at the Faculty of Law of Paris, Director of the Institut de Recherches
`et d'Études pour le Traitement de l'Information Juridique, Montpellier -
`Prof. Dr. jur. Wilhelm Dodenhoff, Presiding Judge at the Bundesverwaltungsgericht (Federal Administrative Court),
`Berlin - Dr. Aviezri S. Fraenkel, Department of Applied Mathematics, The Weizman
`Institute of Science, Rehovot - Prof. Dr. jur. Dr. phil. Klaus J. Hopt, M. C. J.,
`University of Tübingen - Prof. Ejan Mackaay, Director of the Jurimetrics Research
`Group, Université de Montréal - mr. Jan Th. M. Palstra, Nederlandse Economische
`Hogeschool, Rotterdam - Professor Dr. Jürgen Rödig †, University of Gießen -
`Direktor Stb. Dr. jur. Otto Simmler, Administrative Bibliothek und Österreichische
`Rechtsdokumentation im Bundeskanzleramt, Vienna - Professor Dr. Lovro Šturm,
`Institute of Public Administration, University in Ljubljana - Professor Dr. jur. Dieter
`Suhr, Freie Universität Berlin - Professor Colin F. Tapper, Magdalen College,
`Oxford -
`lic. jur. Bernhard Vischer, UNIDATA AG, Zürich - Dr. Vladimir Vrecion,
`Faculty of Law, Charles University, Prague.
`
`Managing Editor:
`Dr. Sigmar Uhlig, An der Düne 13, D-5300 Bonn-Tannenbusch,
`Telephone 0 22 21/66 13 78 (private); 0 22 21/5 81 or 58 48 27 (office)
`Editorial assistant:
`Dieter Hebebrand, Fliederweg 1, D-3501 Niestetal, Telephone 05 61/52 46 31 (private);
`05 61/30 73 62 (Bundessozialgericht)
`
`Manuscripts, editorial enquiries and review copies should be sent to the Managing
`Editor; business communications should be sent to the publisher. No liability is
`accepted for unsolicited manuscripts.
`Contributions are accepted only on condition that the author does not at the same
`time treat the same subject in another journal. In handing over the manuscript the
`author also transfers to the publisher, for the duration of copyright protection, the
`right to authorise the making of photomechanical copies in commercial enterprises
`for internal use, provided that a revenue stamp of the collection office of the
`Börsenverein des Deutschen Buchhandels, Großer Hirschgraben 17/19, 6000 Frankfurt
`a. M., is affixed to each photocopied sheet at the tariff in force at the time.
`
`
`
`
`
`Colin F. H. Tapper
`
`Citation Patterns in Legal Information Retrieval
`
`Overview
`
`A. State of the Art
`1. The Established Method and Its Defects
`2. Improvements
`3. An alternative
`4. Defects of the Alternative
`
`B. Citation Patterns
`1. Citations and Research
`2. Citations and Information Retrieval
`3. Citations and Vectors
`4. Some Problems
`5. Uses of Citation Vectors
`6. Weighting
`
`A. State of the Art ¹
`
`1. The Established Method and Its Defects
`
`In the late 1950's and early 1960's it first became possible to contemplate the use
`of computers to assist in the retrieval of legal information. Those years saw the
`formation of a special sub-committee of the American Bar Association, the
`launching of a number of journals specifically devoted to computer applications in
`law and predominantly to legal information retrieval,² and the establishment of a
`number of research programs.³ Largely owing to the work of John Horty at the
`University of Pittsburgh the direction taken by most of these experiments, at least
`in the Anglo-American legal world, was towards the use of the full-text of legal
`documents for retrieval systems. There were a number of reasons for this. The
`main one was a distrust of any screen put between the lawyer confronted by the
`problem and the information made available to him. Indexing and abstracting are
`essentially methods of reducing the bulk of the information presented to the
`lawyer so as to make it of manageable magnitude.
`Information regarded by the indexer or abstracter as less important is either
`discarded altogether or restated in a more general and more concise form. It is
`then possible for the lawyer to scan this reduced version and to select those parts
`which seem relevant to his problem. With the advent of the computer it seemed
`that things had changed. The machine could scan any amount of information in a
`very short space of time, and so long as its judgment of relevance was satisfactory
`could save the lawyer work by presenting him with all and with only relevant
`
`1 See generally Tapper, 'Computers and the Law', chs. 5-7.
`2 Jurimetrics Journal (formerly Modern Uses of Logic in Law), Law and Computer Technology,
`Rutgers Journal of Computers and Law, Datenverarbeitung im Recht.
`3 Among the earliest experiments were those of Professor Horty at the University of
`Pittsburgh, Colin Tapper at Magdalen College, Oxford and Aviezri Fraenkel at the
`Weizman Institute of Science.
`
`material. It was argued that this method optimised the interaction of man and
`machine by restricting the machine to the mechanical job of searching for
`matches between words specified by the lawyer, and yet still allowed full scope for
`the creativity of the lawyer in selecting the words to be matched.
`
`At first some felt that this approach was best suited to statutory materials because
`their volume was smaller and the use of words was relatively more precise than in
`case-law. Others took the view that case-law was more suitable than statute on the
`basis that the greater volatility of statutory materials more than compensated for
`their relatively smaller bulk, and that the relative poverty of statutory vocabulary
`could lead to failure to retrieve relevant information. Conversely it was felt that the
`trend towards ever cheaper storage of information reduced the force of the
`argument based on bulk. In fact the best argument in favour of concentrating on
`statutory materials within full-text systems was hardly ever deployed. This is that
`the sort of search which is commonly employed in the statutory area corresponds
`much more closely to the scanning technique of full-text than does the sort of
`search commonly required in the area of case-law. Another argument is simply that
`there is less scope for reduction in statutory materials, where every word is
`authoritative, than in case-law, where only the rule in a case can ever be authoritative.
`As usual, however, the decisive arguments were economic ones. No system
`offering access to statutes alone is economically viable or psychologically acceptable
`in an environment in which both statutes and case-law have to be used. So
`working systems developed for widespread use catered for both statute and
`case-law as a data-base.
`
`These systems were essentially full-text systems based on the principle of word
`matching. This requires the lawyer to state his problem in terms of the words and
`combinations of words which he would expect to find in any document relevant to
`his problem. All of these terms are pregnant with difficulty. What is to count as a
`word? When is one word different from another? What sorts of combination are
`allowed? What principle of individuation is to be applied to legal documents?
`These questions have received pragmatic solutions. In general a word is equated
`with a string of characters terminated either by a space or by some punctuation,
`though there are exceptions to this rough definition. Different strings are regarded
`as different words. Strings can be combined into lists and lists into final search
`formulations by Boolean operators and semantic distance measured in terms of
`document, sentence and words. In legislation the section is usually regarded as
`the basic unit, in case-law the case. In general also, so as to reduce storage
`requirements in the concordances used by such systems, 'common words' are
`omitted. At first such systems were operated in batch mode, but increasingly they
`are offered on an interactive basis. This means that the user types in the words
`which are to characterise the answers to his problem, and is given the opportunity
`to review his characterisation in the light of interim results. The results are
`commonly expressed first as a numerical value representing the number of
`documents which satisfy the user's characterisation. The user then has the option
`of having the whole or part of the text of those documents displayed, or of
`modifying his characterisation. This process goes on until the user is satisfied by
`the responses obtained from the system at which stage he can get a hard copy,
`i.e., one printed on paper, of either the references to or text of those documents
`which satisfy his final characterisation.
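
The word-matching scheme described above can be sketched in a few lines. This is an illustrative sketch only, not the code of any system the article discusses: the stop list `COMMON_WORDS`, the three sample texts and all function names are assumptions made for the example. It builds a concordance of uncommon strings, combines two lists of strings with Boolean operators, and reports the number of satisfying documents first, as an interactive system would.

```python
import re

# Hypothetical list of 'common words'; each real system chose its own.
COMMON_WORDS = {"the", "of", "was", "for", "in", "a"}

def tokenize(text):
    """Treat each run of letters or digits, ended by a space or punctuation, as a word."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in COMMON_WORDS]

def build_concordance(documents):
    """Map every remaining string to the set of documents in which it occurs."""
    concordance = {}
    for doc_id, text in documents.items():
        for word in tokenize(text):
            concordance.setdefault(word, set()).add(doc_id)
    return concordance

def any_of(concordance, strings):
    """A list of strings is satisfied by a document containing any one of them (OR)."""
    hits = set()
    for s in strings:
        hits |= concordance.get(s, set())
    return hits

# Invented sample documents; each case is one basic unit.
documents = {
    "case-1": "The driver of the car was found liable for negligence.",
    "case-2": "Negligence in maintaining the automobile was alleged.",
    "case-3": "The contract for sale of the car was void.",
}
concordance = build_concordance(documents)

# Two lists combined by AND into a final search formulation.
result = any_of(concordance, ["car", "automobile"]) & any_of(concordance, ["negligence"])

# The count is reported first; the user may then ask for the text or refine the search.
print(len(result), sorted(result))
```

Note that "case-2" is found only because 'automobile' was added to the first list by hand; the machine does no more than match the strings it is given.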
`It is plain that these methods harbour a number of defects. These may be classified
`into two broad groups. The first includes those which affect the level of performance
`in terms of the quality of the material produced, and the second those which
`otherwise affect the acceptability of the system to users. So far as the first is
`concerned it is easier to propound theoretical reasons for it than to produce
`empirical evidence, since there has been no published report of any systematic
`test of the more advanced systems now being offered commercially. Such empirical
`evidence as there is relates only to cruder and earlier experimental systems.⁴
`That evidence is somewhat equivocal in suggesting that while machine performance
`does tend to retrieve relevant information which cannot be recovered in any
`other way, it does so only at the expense of recovering a vast amount of irrelevant
`information also. The explanation lies in a combination of several factors. First it
`occurs because the nature of the procedure relies upon occurrence and co-occurrence
`of character strings as a unique indicator of meaning. This process has a
`number of drawbacks. In the first place, very similar meanings can be encapsulated
`in very different character strings, so all must be specified. Secondly, the same
`character string can encapsulate very different meanings in different contexts. The
`former raises problems of synonyms and, much more potently, of levels of
`abstraction. The latter raises that of homologues. Thus 'auto', 'automobile', and 'car',
`although all different character strings, have meanings which are substantially
`similar; so, too, 'Chevrolet', 'car' and 'vehicle' can easily have the same meaning in
`the context of some legal problems though they may not do so in all; and then
`'jury' in the context of trial by jury has a different meaning from 'jury' in the sense
`of a temporary maritime repair although the character strings are identical. This
`means that in order to characterise his meaning uniquely and accurately the user
`must specify all possible synonyms, particularisations and generalisations (including
`all their different grammatical forms), and exclude all identical character
`strings having different meanings. The latter task can only be accomplished by
`way of the context in which the character strings appear. So some lists of strings
`must be combined and some use made of combinatorial logic. It follows that the
`user must not only be able to specify in advance all the different ways of
`expressing the meaning he wishes to include, but also all the different ways of
`expressing at least one other meaning which he wishes to find
`associated with the original meaning and the way in which the two are to be linked.
`It will be a great help to him in doing this to be able to think of the strings most
`likely to be associated with the other unwanted meanings of the string he wishes to
`use so as to be sure that they are excluded from the combined list. Thus if a lawyer
`wishes to find cases dealing with temporary maritime repairs, he must not only ask
`for occurrences of the string 'jury', which if specified alone would deluge him with
`unwanted references to jury trial, but must also specify some association with, for
`
`4 Summarised in Tapper op. cit. ch. 6.
`
`example, 'sail' or 'mast' or 'rig', remembering in the last case to be very careful
`about position so as to exclude cases referring to jury rigging in the unintended
`sense. It should by now be apparent wherein the difficulty lies. In order to retrieve
`all the relevant information the user must be able to specify all the possible
`synonyms, particularisations and generalisations, whereas in order to retrieve only
`relevant information he must combine his strings in such a way as to exclude every
`possible ambiguity of every string in every list. In practice these two conflicting
`tasks have to be balanced against each other so that the user has to be content
`with as much of the relevant information as is compatible with not getting too
`much irrelevant material, or more commonly vice versa.
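
The 'jury' example can be made concrete with a positional search. The sketch below is a hedged illustration, not a description of any actual system: the three sample sentences, the function names and the distance limit of two words are all invented for the purpose. It contrasts a search on the bare string with one that also requires a contextual string close by.

```python
import re

def positions(text):
    """Map each string to the word positions at which it occurs in the text."""
    index = {}
    for pos, word in enumerate(re.findall(r"[a-z]+", text.lower())):
        index.setdefault(word, []).append(pos)
    return index

def near(text, first, others, max_distance):
    """True if `first` occurs within max_distance words of any string in `others`."""
    index = positions(text)
    for p in index.get(first, []):
        for other in others:
            for q in index.get(other, []):
                if abs(p - q) <= max_distance:
                    return True
    return False

# Invented sample sentences standing in for whole cases.
cases = {
    "trial": "The accused was tried by jury and the verdict was later set aside.",
    "repair": "The crew fitted a jury mast after the storm.",
    "both": "The jury heard that the mast had been badly repaired.",
}
context = ["sail", "mast", "rig"]

# The bare string retrieves everything, wanted or not.
loose = [name for name, text in cases.items() if "jury" in positions(text)]
# Requiring a contextual string within two words excludes the jury-trial cases.
tight = [name for name, text in cases.items() if near(text, "jury", context, 2)]

print(loose)
print(tight)
```

The tighter formulation excludes the unwanted cases, but a relevant case that spoke of a 'jury rudder' would now be missed as well, which is the balance between the two tasks that the text describes.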
`
`It is at this point that the question of the general acceptability of these methods to
`potential users becomes apparent. It is neither easy to think in these ways nor is it
`customary for lawyers to do so. Such thinking is not common in any other context
`nor is it taught at law school. It can to some extent be taught, and most systems
`intended for wide use offer either instruction courses or practice manuals, or both.
`But the really effective method of learning to use such systems is by trial and error.
`After a short course or after reading a manual a lawyer can use the system in the
`sense of operating the terminal so as to secure some relevant results. But efficient
`use of the system in the sense of securing all the relevant information, and only the
`relevant information in the shortest possible time, is something which is acquired
`only over long periods of constant use. This is expensive in both time and money.
`In addition, such potent factors as the conservatism of many members of the legal
`profession, particularly older and more senior members, and the inability and
`reluctance of many to acquire the physical dexterity required to operate a keyboard
`effectively contribute to the relative reluctance of the legal profession to
`embrace these new systems.
`
`A further problem which has gradually become more apparent relates to access to
`the information. One of the reasons for the development of computerised methods
`was that the volume of legal material was increasing at such a startling rate that it
`could not be handled by conventional means. The choice of full-text as a method
`for computerised retrieval was taken in the teeth of that reasoning on the ground
`stated earlier that the cost of holding and securing access to data held in computer
`stores was decreasing at a time when the costs of all other forms and methods were
`increasing. That was and is true. But what it tended to gloss over was the initial
`cost of transforming existing legal information into a machine readable form, the
`unwieldiness of the enormous volumes of material required to be stored and the
`difficulty of keeping it, and especially the statutory material, up to date. It was also
`the case that the promoters of computerised legal information retrieval systems
`did not always have the legal rights to reproduce relevant legal materials where the
`copyright was in the hands of private publishers. This would not have been too
`serious if the development of automatic devices for transforming printed books
`into a computer readable format (optical character recognition devices) had been
`more successful, or even if the publishers of law reports had co-operated with
`those collecting data for retrieval systems by producing their printed versions from
`computerised typesetting processes and making the corrected tapes available. It
`seems also that the expense of maintaining and making accessible so large a
`volume of material as the full-text system demands has imposed considerable
`strains and constraints upon the commercial systems. It has tended to make them
`rather more expensive and rather more selective than would be ideal from the
`point of view of either users or promoters.
`All of these factors have contributed to slowing down the development of computerised
`legal information retrieval though they have not stopped it altogether,
`which in itself testifies to the felt need for some form of speedier access to legal
`information.
`
`2. Improvements
`
`These factors have not passed unnoticed by the designers of information retrieval
`systems, and various devices have been suggested to help overcome them. First
`there is the problem of specifying all of the strings necessary to retrieve all of the
`relevant material. In theory this can be mitigated in two ways, either by increasing
`the number of strings in the original question, or by increasing the number of
`answers that the original question retrieves. Most systems have tended to concentrate
`on the former. The precise method depends to some extent upon the way in
`which the particular system requires an enquiry to be prepared. If it is by the
`specification of lists of strings then a number of possibilities suggest themselves.
`The most obvious of these is the use of an automatic thesaurus. The difficulty with
`this solution is that it would be extremely difficult to prepare such a thesaurus in
`advance, and it is difficult to see how exactly it could cater for variation as between
`different legal fields. Thus the string 'election' has one set of synonyms in the
`context of equity and another in that of constitutional law. Even more difficult
`problems are posed by particularisations and generalisations; it would, for example,
`be difficult to give all the possible particular forms of a string like 'reasonable' in
`the context of the law of negligence. A different approach to this problem, and one
`adopted by the DATUM system in Montreal, is to develop the thesaurus ex post
`facto. Thus final and presumably successful lists of strings are stored and the
`strings in each list are regarded as equivalent for the purpose of enriching the
`strings specified by the current user. This may either be automatic, or, if the system
`is interactive, may operate by supplying the suggested equivalent strings to the
`user for his acceptance or rejection. It should be pointed out that some human
`intervention is required either at the stage of grouping the strings or at that of
`using them just because of the problem of homologues. In some systems questions
`can be asked not in the form of lists of strings accompanied by Boolean
`connectors and positional specification, but by a simple natural language question.
`Such systems often have to apply rules to break the question down into a
`succession of strings and operators. Since there is likely to be only one string to
`each operator, it is further necessary to amplify them. One method would be to
`include in the system a dictionary of word roots so that other strings with the same
`root could be supplied. The same could be done for grammatical variation. The
`difficulty here lies in the premise that all strings coming from the same root can
`first be identified and second regarded as equivalent. The same objection applies
`to a comparable solution to the problem of different grammatical forms. Of course
`these objections are less serious in an interactive environment where the user can
`check the amplification of his question, though this may be contrary to the
`intention behind the natural language question approach which is to spare the
`lawyer involvement with that sort of exercise. Two other techniques which can be
`used in either context involve truncation and contextual display. The former is
`quite commonly provided in retrieval systems. It permits the user to instruct the
`system to include all strings having a certain number of characters in certain
`positions, typically consecutively at the beginning. Thus the user might specify
`'taxa+', where '+' indicates to the system that all strings beginning 'taxa' are to be
`included, thus bringing in 'taxable' and 'taxation'. This is a very crude tool and it is
`necessary to be very careful in its use. Thus in the example given above it may be
`noted that 'tax' and 'taxing' are not included, but any specification short enough to
`include them such as 'tax+' would also include a huge variety of unwanted strings
`such as 'taxi' and 'taxonomy'. The other method is to display the string specified in
`its alphabetic place in the system's concordance so that the user may select which
`of the alphabetically close forms should be included. This is a natural preliminary
`to the method discussed above, and is subject to the same objection, namely that
`words may be so widely separated alphabetically that they might not be detected. It
`is not clear how effective such techniques are in amplifying lists of strings. There is
`no empirical evidence derived from testing systems with and without such facilities.
`In its absence it is possible to infer from the near universal adoption of some
`at least of these techniques that they are to some extent effective, but from the
`equally intensive search for ways to improve them, that they are not perfect.
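
The crudeness of truncation is easy to demonstrate. The following minimal sketch assumes a six-word vocabulary taken from the strings mentioned above and a hypothetical helper `expand_truncation`; the '+' convention follows the text, but the implementation is not that of any particular system.

```python
def expand_truncation(pattern, vocabulary):
    """Expand 'stem+' to every vocabulary string that begins with the stem."""
    if pattern.endswith("+"):
        stem = pattern[:-1]
        return sorted(w for w in vocabulary if w.startswith(stem))
    # Without the '+' only the exact string can match.
    return [pattern] if pattern in vocabulary else []

# Assumed concordance vocabulary, limited to the strings discussed in the text.
vocabulary = {"tax", "taxable", "taxation", "taxi", "taxing", "taxonomy"}

narrow = expand_truncation("taxa+", vocabulary)
broad = expand_truncation("tax+", vocabulary)

print(narrow)  # misses 'tax' and 'taxing'
print(broad)   # includes them, together with 'taxi' and 'taxonomy'
```

Displaying the alphabetically adjacent entries of the concordance, as the text goes on to describe, amounts to showing the user a list like `broad` and letting him strike out the unwanted strings.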
`The converse problem is that of reducing the amount of irrelevant material that
`would otherwise be retrieved. Here, as indeed with the problem just discussed, the
`most potent device is probably the use of an interactive system with the facility for
`supplying the number of documents retrieved by each search formulation and
`displaying any part of those documents. In the case of document reduction this
`may be made still more effective by including facilities for the display of those
`parts of the material physically adjacent to a required string; this is the keyword in
`context (KWIC) approach in which the user can specify as much context in
`physical terms as he chooses. In interactive systems it is more common to achieve
`a similar result by highlighting required strings in the displayed text either by the
`use of bold characters, reverse video or some similar technique. This may be
`further enhanced by the provision of keys allowing the text to be skipped to the
`next highlighted term. In this way it is hoped that the user will be able to identify
`speedily, and to reject retrieved documents which are irrelevant because one of
`the required strings has proved ambiguous in meaning. Another method devoted
`not to preventing the retrieval of irrelevant material but to marshalling all the
`material is to present it to the user not in the order in which the documents appear
`in the database, but in descending order of relevance to the user's search
`formulation. Techniques to achieve this are known as ranking algorithms. They
`usually work on the comparison of the frequency of occurrence of the required
`strings in individual documents and in the database as a whole. A cruder basis
`would simply rely upon the frequency of occurrence of the required strings in the
`documents retrieved. Once again there is no hard evidence of the effectiveness of
`such techniques. The main complaint from those who have used them is that they
`tend to exalt documents more by length than relevance, presumably because the
`longer the document the more likely it is that any given string will appear, and
`appear more frequently. Nor is it generally very easy for users to understand the
`probable effects of choosing particular algorithms.
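
The complaint about length can be reproduced with a small sketch. Everything in it is an assumption made for illustration: the two sample documents, the function name `rank`, and in particular the scoring formula, which divides the relative frequency of a string in the document by its relative frequency in the database as one possible reading of "comparison of the frequency of occurrence". The cruder basis simply counts occurrences.

```python
from collections import Counter

def rank(documents, strings, crude=False):
    """Return document ids in descending order of score for the required strings."""
    counts = {doc_id: Counter(text.lower().split()) for doc_id, text in documents.items()}
    database = Counter()
    for count in counts.values():
        database.update(count)
    database_size = sum(database.values())

    scores = {}
    for doc_id, count in counts.items():
        doc_size = sum(count.values())
        score = 0.0
        for s in strings:
            if crude:
                # Cruder basis: raw number of occurrences in the document.
                score += count[s]
            elif database[s] and doc_size:
                # Assumed formula: frequency in the document relative to the database.
                score += (count[s] / doc_size) / (database[s] / database_size)
        scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)

# Invented documents: a short one on the point, a long one that mentions it twice.
documents = {
    "short": "negligence of the driver",
    "long": ("the court reviewed the contract the lease the will and negligence "
             "and again negligence among many other matters"),
}

print(rank(documents, ["negligence"], crude=True))
print(rank(documents, ["negligence"], crude=False))
```

On the crude basis the long document comes first merely because it has room for two occurrences; comparing against document and database size reverses the order. Which behaviour a user wants is not obvious in advance, which is the difficulty noted in the text.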
`
`3. An alternative
`Because of the difficulties which were experienced with the string matching
`approach to legal information retrieval by computer an alternative, often referred
`to as the vector technique, was developed in the late 1960's. So far it has not been
`adopted by any widely used system in law,⁵ but it has a number of interesting
`features which could meet some of the objections raised above. In outline this
`technique substitutes the concept of finding an approximate match for the whole
`document in place of a precise match for part of it. The established system relies
`upon a precise fit between word strings: if the precise match is achieved then the
`document is considered relevant; if it fails in even the slightest respect then it is
`disregarded. It is, in the opinion of many, just this feature which makes it
`impossible for the established system to function efficiently. It presupposes that
`information is either relevant or irrelevant, and that this can be detected by testing
`for the presence or absence of particular character strings. The proponents of this
`alternative approach argue that relevance is a continuous variable in that legal
`material is more or less relevant, and thus that no all or nothing test can for this
`reason ever be completely successful. It may be interjected at this point that a
`more plausible position would accept the arguments of both sides, but direct them
`to different types of legal search. Thus there clearly are some sorts of legal search,
`such as those for all statutory uses of particular words or phrases, or all case-law
`discussion of particular provisions which are all or nothing in the required sense,
`and for which the established technique is ideal. Nevertheless, it is undeniable
`that much case-law searching is of a less precise nature, and it is to this that the
`vector technique is most appropriately directed.
`It too proceeds on the basis of a machine readable version of the full text of the
`material to be analysed, and usually common words are omitted, though there are
`some particular applications of the alternative technique such as the detection of
`authorship which depend upon the presence of common words. This full-text
`requirement presents a few problems since even in the established system common
`words appear in the original machine readable version, not because they are
`required in the retrieval procedure itself but because it is easier to prepare the text
`with the words included than to exclude them, and because it would make the text
`impossible to read if they were excluded from the version which is displayed
`during the retrieval procedures or printed out after the procedures have been
`completed. Similarly in just the same way as the established one the alternative
`
`5 This technique was first suggested by Professor Salton and forms the basis of the SMART
`retrieval system. It has been adapted to legal materials by Bryan Niblett and Gillian
`Boreham at the University of Kent whose generous assistance to the author is gratefully
`acknowledged.
`
`system constructs an inverted file of the uncommon words, though there is one
`difference in that the concordance has no need to store full positional information.
`It is sufficient merely to note the document references for each string. This
`concordance is then supplemented by a file or vector for each document which
`records the strings contained in that document. At this point, different applications
`of the alternative approach diverge. One possibility is to store these strings
`together with an indication of their frequency of occurrence in that particular
`document. It is then possible to accept a question in natural language and to
`compare the strings or vector used in the formulation of the question with the
`strings or vector which represent the documents contained in the database. This
`will not be on the basis of attempting to find a complete match, but the nearest
`match. Thus it will not matter if some of the strings in the question do not appear in
`some of the documents, but rather the documents will be presented to the user in
`the order of closeness of fit. Thus this technique incorporates a natural ranking
`algorithm. It is possible to exploit this feature still further. It is unlikely that any
`natural language formulation of a question in the true sense, that is a sentence that
`one might use in ordinary speech or writing would contain much repetition of key
`terms. It is thus possible that some relevant documents would contain none of the
`strings appearing in the question. On the other hand, some relevant documents
`almost certainly would contain some of the strings. This system has the facility not
`only to compare documents with the question that is asked, but also with each
`other. Thus it is quite likely that a document which matches a particular question
`very well may itself match another document very well although the other document
`does not match the question at all just because none of its strings happens to
`have been specified in that particular formulation. The vector system could be
`organised so as to indicate such a second document as having possible relevance
`to the question. In this way a document would be indicated which could not
`possibly be found by the established system (working on an unenriched natural
`question basis). This alternative can of course operate on the same sort of
`question formulation as the established system, namely on lists of strings connected
`by Boolean operators. The only difference would be that no positional qualifiers
`could be accepted. Similarly the alternative system could employ most of the
`methods described above to improve search formulation such as string truncation,
`root specification, KWIC and highlighted displays. An exception would be the case
`mentioned above where a document is retrieved without the appearance of any
`string which has been explicitly specified.
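
The two-stage matching described above may be sketched as follows. The text does not say how closeness of fit is measured; the cosine of the angle between frequency vectors is used here as an assumption, and the sample documents, the threshold of 0.5 and all names are likewise invented. Common words are taken to have been removed already.

```python
import math
from collections import Counter

def vector(text):
    """Represent a text by the frequency of each string it contains."""
    return Counter(text.lower().split())

def similarity(a, b):
    """Cosine measure (an assumed choice): 1.0 for identical profiles, 0.0 for no shared string."""
    dot = sum(a[s] * b[s] for s in a if s in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Invented documents with common words already omitted.
vectors = {
    "d1": vector("collision vessel harbour negligence pilot"),
    "d2": vector("vessel harbour pilot liability shipowner"),
    "d3": vector("lease tenant rent arrears"),
}
question = vector("negligence collision")

# Stage one: order the documents by closeness of fit to the question.
ranked = sorted(vectors, key=lambda d: similarity(question, vectors[d]), reverse=True)
best = ranked[0]

# Stage two: compare the best match with the other documents.
related = [d for d in vectors
           if d != best and similarity(vectors[best], vectors[d]) >= 0.5]

print(best, related)
```

Here "d2" shares no string with the question, so its fit with the question is zero, yet it is indicated as possibly relevant because it closely resembles the document that matched best.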
`
`A further enhancement possible in the alternative system is to give different
`weights to particular strings. This can be done either on the basis of the frequency
`of occurrence of a string, and is indeed very commonly done in that context, or
`perhaps by reference to the string's position in the question, or by the explicit
`prescription of the user. In any event the precise amount and effect of the
`weighting has to be decided upon and implemented. This makes it slightly
`awkward to allow user's prescription since few users would want to be bothered
`with transmuting an intuitive preference into a precise mathematical value.
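
A short sketch shows why the amount of weighting has to be fixed explicitly. The rule used here, under which the n-th distinct string of the question receives weight 1/n, is purely an assumption chosen for illustration, as are the function names and the sample counts; any other rule would serve equally well and would give different results.

```python
def position_weights(question):
    """Assumed rule: the n-th distinct string in the question gets weight 1/n."""
    weights = {}
    for s in question.lower().split():
        if s not in weights:
            weights[s] = 1.0 / (len(weights) + 1)
    return weights

def weighted_score(weights, document_counts):
    """Sum the occurrences of each string in the document, scaled by its weight."""
    return sum(w * document_counts.get(s, 0) for s, w in weights.items())

weights = position_weights("negligence pilot")

# Invented frequency counts for two documents.
doc_a = {"negligence": 1, "pilot": 1}
doc_b = {"pilot": 2}

print(weights)
print(weighted_score(weights, doc_a), weighted_score(weights, doc_b))
```

Unweighted, both documents contain two occurrences of the required strings and cannot be separated; once 'negligence' is given twice the weight of 'pilot', the first document is preferred. A user prescribing his own weights would have to supply such figures himself.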
`
`4. Defects of the Alternative
`
`To some extent it suffers from the same defects as the established system. Thus it
`requires a machine-readable and up to date version of the legal documents it
`includes. The enormous bulk of full-text presents to that extent the same problem.
`But in fact it is even more serious here since the amount of computing involved in
`this technique is very much greater than in the established systems. The established
`technique is based on string matching which essentially involves a trivial
`amount of subtraction. Thus if the string 'cab' should be represented by the value
`'312' it is merely a matter of subtracting '312' from all the other values and
`accounting it an exact match when the result is '000'. In the alternative system the
`long sequences of values which constitute th