`Published by the Faculties of Law, and Mathematical and Computing Sciences
`New South Wales Institute of Technology
`
`Vol. 1 No. 2
`
`1982
`
`EDITORIAL BOARD
`
`Chairman
`Hon. Mr. Justice M.D. Kirby,
`Chairman, Australian Law Reform Commission
`
`Editor
`Dr. R.A. Brown,
`Lecturer in Law, N.S.W.I.T.
`
`Members
`
`Mr. G.W. Bartholomew,
`Dean, Faculty of Law,
`N.S.W.I.T.
`
`Mr. D. Biles,
`Assistant Director (Research)
`Australian Institute of Criminology
`
`Professor J. Bing,
`Norwegian Research Centre for
`Computers and Law, Oslo University
`
`Professor A. Blackshield,
`Dept. of Legal Studies
`La Trobe University
`
`Professor P. Catala,
`Institut de Recherches et d'Etudes
`pour le Traitement de l'Information
`Juridique, Université de Montpellier I
`
`Dr. V.X. Gledhill,
`Dean, Faculty of Mathematical
`and Computing Sciences,
`N.S.W.I.T.
`
`Professor J. Goulet,
`Faculté de Droit
`Université Laval, Quebec
`
`Mr. C.H. Gray,
`Dept. of Mathematics and Statistics,
`C.S.I.R.O.
`
`Mr. W.A. Steiner,
`Librarian
`Institute of Advanced Legal Studies
`
`Mr. C. Tapper,
`All Souls Reader in Law,
`Magdalen College, Oxford
`
`Mr. R.J. Watt,
`Senior Lecturer in Law,
`N.S.W.I.T.
`
`Professor D.N. Weisstub,
`Professor of Law and Psychiatry,
`Osgoode Hall Law School
`
`Professor D. Whalan,
`Faculty of Law,
`Australian National University
`
`001
`
`Facebook Ex. 1015
`
`
`
`This volume may be cited as
`(1982) 1 J.L.I.S.
`
`Articles, books for review, subscriptions and all inquiries should be
`addressed to The Editor, Journal of Law and Information Science,
`C/O Faculty of Law, New South Wales Institute of Technology,
`P.O. Box 123, Broadway, N.S.W. 2007, Australia.
`
`Copyright © 1982 New South Wales Institute of Technology
`
`All rights reserved. Subject to the law of Copyright no part of this publication
`may be reproduced stored in a retrieval system or transmitted in any form
`or by any means electronic, mechanical, photocopying, recording or otherwise,
`without the permission of the owner of the copyright. All inquiries seeking
`permission to reproduce any part of this publication should be addressed in
`the first instance to The Editor, Journal of Law and Information Science,
`C/O Faculty of Law, New South Wales Institute of Technology, P.O. Box 123,
`Broadway, N.S.W. 2007, Australia.
`
`Printed in Singapore by Tak Seng Press Pte. Ltd.,
`147, Hill Street, Singapore 0617.
`
`
`Editorial, vii
`
`Articles
`
`Legality — Information Technology and the Laws of Evidence
`T.H. Smith, 89
`
`Theories of Information in Law (Recent Developments in the Discipline
`of Rechts- und Verwaltungsinformatik -RVI- in Germany) A Background Paper
`H. Burkert, 120
`
`The Use of Citation Vectors for Legal Information Retrieval
`C. Tapper, 131
`
`Computerisation of Legal Material in Australia
`P.J. Ward, 162
`
`Casenote
`
`Conwell v. Tapfield
`D.I. Robinson, 175
`
`Book Reviews
`
`The Solicitor and the Silicon Chip
`R.A. Brown, 179
`
`Control & Audit of Small/Medium Computer Systems
`R.J. Watt, 180
`
`THE USE OF CITATION VECTORS FOR LEGAL
`INFORMATION RETRIEVAL
`
`C. TAPPER
`
`Colin Tapper is one of the founders of the study of
`computers and law in England, and this paper adds to his
`contributions to the field. As its title indicates, the article
`is concerned with the use of case citations as selection vectors
`in legal information retrieval, and, in particular, with the value
`of citation vectors in comparison to the usual semantic vectors
`currently used.
`
`The author details recent experiments with citation vectors
`in the United States and at the Norwegian Research Centre
`for Computers and Law (NRCCL). The comparative results
`of using citation vectors against semantic vectors in these
`experiments are documented and considered, and Mr. Tapper
`provides some valuable discussion of the algorithms used in
`computing and assessing vectors in data retrieval. Despite the
`complexity of this work, it will be of great value to all
`interested in the field of computers and law, because of its
`implications for the future development of legal data retrieval.
`
`The first section of this article is intended for those who have no,
`or little, previous awareness of legal information retrieval techniques.
`Since the main aim of the article is to explain the theory behind the
`substitution of citation vectors for such methods, those who have the
`requisite familiarity with matching and vector based systems as applied
`to law might prefer to start with the second section.
`
`1. Current Legal Information Retrieval Techniques
`It is now about 25 years since Professor John Horty, at the University
`of Pittsburgh in Pennsylvania, first succeeded in applying
`computerised methods to the retrieval of legal information. It is a
`tribute to his insight that the techniques which he devised remain the
`bedrock of virtually all of the systems which operate in the world
`today. The essence of the technique is the identification in the text
`of a document of a word, or words, in a particular combination which
`have been selected by the lawyer as being likely to indicate the
`relevance of that document to the lawyer's problem. As normally
`implemented the system creates a concordance of the full legal texts
`constituting the database of the system, excluding only words of such
`low prima facie information content that they are highly unlikely to
`be nominated by lawyers as search terms. Each concordance item
`then becomes a potential search term, and searches are typically
`conducted by the nomination of classes of words, for example synonyms,
`grammatical variations, particularisations and generalisations, which
`must occur in a given relationship to other similar classes in a docu-
`ment in order for it to satisfy the search request as a potentially
`relevant document. So as to accomplish this process the lawyer must
`first accustom himself to thinking in terms of word occurrence rather
`than directly in terms of the meaning of a document. He must be
`comprehensive in his classification of terms, and he must be able to
`specify the appropriate logical relationship in terms of Boolean logic
`and relative sequential occurrence in order to secure an answer. In
`a commercially operational system he will also be well-advised to
`consider very carefully not only whether his categorisation is appro-
`priate, but whether it is the most efficiently appropriate formulation
`of his search, since the more efficient the search the quicker and
`cheaper it becomes.
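
The matching process just described can be sketched in a few lines of Python. This is an illustration only, not a reconstruction of any actual system: the stop list, the function names `build_concordance` and `search`, and the three sample documents are all assumptions made for the example, and positional (sequential occurrence) logic is left out for brevity.

```python
import re
from collections import defaultdict

# Hypothetical stop list of strings judged too low in information content to index.
STOP_WORDS = {"the", "of", "a", "an", "and", "to", "in", "by", "was"}

def build_concordance(documents):
    """Map each retained string to the set of document ids that contain it."""
    concordance = defaultdict(set)
    for doc_id, text in documents.items():
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            if token not in STOP_WORDS:
                concordance[token].add(doc_id)
    return concordance

def search(concordance, classes):
    """Return ids of documents containing at least one term from every class.

    Terms inside a class are alternatives (OR); the classes are combined with AND.
    """
    result = None
    for terms in classes:
        matches = set()
        for term in terms:
            matches |= concordance.get(term, set())
        result = matches if result is None else result & matches
    return result or set()

# Invented sample texts, used only to exercise the functions.
documents = {
    "case1": "The infant was injured by the negligent driver.",
    "case2": "A minor entered into a contract for the sale of goods.",
    "case3": "The child was injured when the driver was careless.",
}
concordance = build_concordance(documents)
hits = search(concordance, [{"minor", "infant", "child"}, {"negligent", "careless"}])
print(sorted(hits))  # ['case1', 'case3']
```

The second document is rejected although it mentions a minor, because it contains no member of the second class. Note also that the string "a" cannot be searched at all once it has been placed on the stop list.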
`
`It is no exaggeration to say that this process teems with problems
`both for the system designer and for the average lawyer. Many of
`these can be mitigated by proper training and continual practice.
`Some of them are more intractable. At the level of the selection of
`words with prima facie low information content there is the difficulty
`that "word" is strictly speaking an inaccurate designation. "Words"
`in the system also encompasses such things as numbers and abbre-
`viations, and would more properly be described as strings of characters.
`In this extended sense it is rarely possible to predict with certainty
`that a given string has no information content. Most systems for
`example exclude the string "A" on the basis that the upper case
`indefinite article is rarely essential to a search. This may be true,
`but it is not sufficient to justify the exclusion of the string "A" from
`the concordance since "A" does have meaning in some contexts,
`for example, the Australian abbreviation "A" followed immediately
`by "L" followed immediately by "R". In the United States "A" is
`itself an abbreviation for an important series of reports. It is, of
`course, immaterial that the abbreviation occurs only in some other
`jurisdiction if material from that jurisdiction can ever be reported
`in one's own.
`
`Semantics and syntax present further difficulties. A basic pro-
`blem of a semantic nature is that character strings may not denote
`concepts uniquely or exclusively. In many contexts the strings
`"minor", "infant", "child", "juvenile", "boy" and "girl" are equivalent,
`in others they are not. Conversely strings like "office", "bank", "safe",
`"deposit" and "flag" have more than one meaning. In the former
`case one of the problems is to think of all of the possible alternatives
`so as to include them in the search formulation, in the latter it is to
`think of them so as to draft the combination of classes in such a
`way as to exclude the unintended meanings. Given the presumptively
`inclusive range of search terms this can be extremely difficult, thus
`in one search of British material when seeking documents relevant
`to the Gas Board it was found that the string "Gas" was ambiguous
`because there had been an Indian litigant in one case of that name.
`To some extent these problems interact with each other, for example
`when in an effort to avoid the former problem so far as grammatical
`variants are concerned truncation is used, that is, specifying a word
`root followed by a special character to retrieve all strings commencing
`with that root, extra problems are created in relation to unanticipated
`homographs.
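
The side effect of truncation can be seen in a small Python sketch. The vocabulary below is invented for the illustration, and `truncation_search` is a hypothetical helper rather than the operator of any particular system.

```python
def truncation_search(vocabulary, root):
    """Return every string in the vocabulary that begins with the given root."""
    return sorted(word for word in vocabulary if word.startswith(root))

# Invented concordance entries sharing the root "will".
vocabulary = {"will", "wills", "willed", "willing", "willow", "william"}
matches = truncation_search(vocabulary, "will")
print(matches)
# ['will', 'willed', 'william', 'willing', 'willow', 'wills']
```

A lawyer who truncates in order to catch "wills" and "willed" also retrieves every document mentioning a willow or a William.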
`
`Syntax creates problems because English often permits a wide
`range of word orders to convey similar meanings, so the effectiveness
`of positional logic to avoid problems is reduced. Thus the meaning
`of "man bites dog" cannot be distinguished from the meaning of
`"dog bites man" by the simple expedient of requiring the string
`"man" to precede "dog" in the document, since the same meaning
`would be conveyed by "dog was bitten by man".
`
`A final difficulty which may be mentioned here is that the end
`product of a search on a system such as this is the specification of
`a number of cases from the database which satisfy the search criteria.
`In many systems the order of presentation of these cases is random
`and reflects only the internal organisation of the database; in others
`it reflects a crude judgment of potential interest, such as higher court
`before lower and most recent first. The essential point is that since
`degrees of relevance are not distinguished it is impossible to rank
`responses in order of relevance to the query.
`
`A number of expedients have been adopted to try to meet this
`last point. In some cases the basic Horty method has been retained
`for the selection of relevant responses, but then further computation
`has been carried out to indicate an order of relevance. Some such
`systems use a statistical algorithm operating upon the basis and
`comparison of string frequency in the documents retrieved and in
`the database as a whole. Others permit the lawyer to assign weights
`to search terms and by algorithms relating to these contrive an order
`of relevance. Another encourages users to multiply classes of search
`terms, and then ranks responses by reference to the number of classes
`represented in selected documents. All however suffer from the dis-
`advantage that lawyers rarely have enough understanding of the
`significance of the algorithm or the way in which it is likely to work
`to be able to use such aids successfully. It may be noted parenthe-
`tically at this point that Horty type systems tend to err on the side
`of over-retrieval, and the most effective way of homing in on the
`most relevant material is by interacting with the system so as to refine
`the search formulation in the light of the system's response to the
`previous formulation. This is clearly done most effectively by the
`person who understands the problem best, namely the lawyer who
`has himself identified and formulated the problem. For this reason
`systems in which lawyers themselves operate the system tend to give
`much better results than those in which the task is delegated to others.
`This constraint is thoroughly beneficial since it forces designers to
`create systems which are clearly thought out and very simple to operate.
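
One of the ranking expedients mentioned above, ordering responses by the number of search classes they represent, might be sketched as follows. The function name, the sample documents and the tie-breaking rule (document identifier) are assumptions made for the illustration.

```python
def rank_by_classes(documents, classes):
    """Rank documents by how many search classes they represent, highest first."""
    scored = []
    for doc_id, text in documents.items():
        tokens = set(text.lower().split())
        # A class is represented if the document contains any one of its terms.
        score = sum(1 for terms in classes if tokens & terms)
        if score:
            scored.append((score, doc_id))
    # Highest score first; ties are broken by document id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(doc_id, score) for score, doc_id in scored]

# Invented sample texts.
documents = {
    "d1": "the infant was struck by a negligent driver on the highway",
    "d2": "the child signed a contract",
    "d3": "a careless driver injured a minor on the road",
}
classes = [
    {"minor", "infant", "child"},
    {"negligent", "careless"},
    {"highway", "road"},
]
ranking = rank_by_classes(documents, classes)
print(ranking)  # [('d1', 3), ('d3', 3), ('d2', 1)]
```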
`
`Largely because of the drawbacks of full text systems of the
`Horty type a totally different technique was tried by an American
`information scientist, Professor Gerard Salton. The aim of this system
`was essentially twofold. It wanted to overcome the difficulty in
`nominating search terms so precisely as to define relevant documents
`with complete precision, and as a desirable corollary it wanted to
`present results in order of relevance. The essence of the new method
`was to replace the technique of seeking a precise match for part of
`the document, typically words and phrases, by seeking instead for
`an approximate match for the whole document. The responses could
`then be presented in order of their approximation. This is a highly
`ingenious idea. In lay rather than mathematical terms it looks for
`the degree of overlap between the terminology of different documents.
`If for example the lawyer has a problem involving the escape of oil
`from an undersea pipeline caused by negligent dredging by a harbour
`authority it might be thought that a document which had a very high
`frequency of occurrence of the strings "escape", "oil", "undersea",
`"pipeline", "negligent", "dredging", "harbour" and "authority" might
`be more relevant than one which had a lower frequency of the use
`of such terms, or than one which had a high frequency of occurrence
`of the strings "escape", "oil", "pipeline" and "negligent", but made
`no reference to "dredging", "harbour" or "authority". It might be
`more of a problem to say intuitively which of the two above rejected
`alternatives was the more relevant. The way in which the technique
`operates is to consider each document as constituted by the different
`strings it contains, and the weight of those strings as being the frequency
`with which they occur in the document. It is then possible to use
`a mathematical algorithm to calculate the similarity between the two
`documents, on a scale running from zero when there is no overlapping
`of strings to one where both the assortment of strings and their res-
`pective weights are identical.
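
The text does not name the algorithm used. A common choice for a similarity measure of this kind is the cosine of the angle between the two frequency vectors, and the Python sketch below assumes it. The function names and sample texts are invented for the illustration. One difference from the description above should be noted: the cosine measure also returns one when the weights of two documents are proportional rather than identical.

```python
import math
from collections import Counter

def term_vector(text):
    """Represent a text as its strings weighted by frequency of occurrence."""
    return Counter(text.lower().split())

def cosine_similarity(vec_a, vec_b):
    """Similarity from 0 (no shared strings) to 1 (same strings, proportional weights)."""
    dot = sum(weight * vec_b[term] for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

query = term_vector("escape oil undersea pipeline negligent dredging harbour authority")
full = term_vector(
    "negligent dredging by the harbour authority caused the escape of oil "
    "from the undersea pipeline"
)
partial = term_vector("the escape of oil from the pipeline was negligent")
unrelated = term_vector("a contract for sale")

for name, vector in [("full", full), ("partial", partial), ("unrelated", unrelated)]:
    print(name, round(cosine_similarity(query, vector), 3))
```

In this sketch the document using all eight strings scores higher than the one using four of them, and the document sharing no strings scores zero. No stop list is applied, so common strings such as "the" dilute the scores.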
`
`Unfortunately this method also has its drawbacks in a legal
`context. As the example quoted above indicates it too relies upon
`the specification of target strings. It is true that the non-occurrence
`of some in a document searched may not be fatal, but the occurrence
`of different equivalents will inevitably be overlooked, and a document
`containing them may be decisively undervalued. Suppose in the con-
`text of the previous example that a case referred only to the "leakage"
`of "hydrocarbon products" from an "undersea conduit" as a result
`of the "reckless operations" of a "docks board". Homographs also
`continue to present a problem. Suppose a lawyer has a problem
`relating to the validity of a will. He will surely find that the string
`"will" occurs so frequently in so many documents as to mask those
`with which he is really most concerned. A related difficulty is that
`while words with plenty of common synonyms will tend to be less
`indicative of content than they deserve, words with no positive negation
`but which rely instead upon a negative operator will tend to be equally
`undeservedly more indicative. A further problem is that this system
`dispenses with syntax altogether, it simply regards a document as
`the strings it contains and the frequency of their occurrence, or in
`the jargon of information science, as a vector. If these vectors are
`identical then the documents are regarded as identical by ascribing
`the similarity or correlation value of one. Yet it is possible for
`documents while equivalent in this sense nevertheless because of their
`word order to be very different in meaning. Thus in this system
`"dog bites man" is regarded as the equivalent of "man bites dog"
`because both would be expressed within the system as the vector
`"bites" (1); "dog" (1); and "man" (1).
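
The loss of word order can be checked directly. In the Python sketch below, `Counter` from the standard library stands in for the vector, which is an assumption of the illustration rather than a feature of any system discussed.

```python
from collections import Counter

first = Counter("dog bites man".split())
second = Counter("man bites dog".split())

print(sorted(first.items()))  # [('bites', 1), ('dog', 1), ('man', 1)]
print(first == second)        # True
```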
`
`A further, and practical, disadvantage of this technique is that
`it involves the system in an inordinate amount of computing. In the
`Horty system which involves only string recognition, Boolean operation
`and positional determination, the number of computations rises
`arithmetically with the length of the document. The Salton system saves
`on Boolean operation and positional determination, but it then requires
`a series of mathematical operations rising geometrically with the
`length of the document, each repeated at least twice. If the further
`step is taken of establishing the similarities of all of the documents
`within the database then the number of times these computations
`must be performed is about half of the square of the number of
`documents within the collection.1
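
The growth in the number of pairwise comparisons, N(N-1)/2 for N documents, is easily tabulated. The Python lines below are a worked illustration; the collection sizes are arbitrary.

```python
from itertools import combinations

def pairwise_comparisons(n_documents):
    """Number of distinct document pairs in a collection: N(N-1)/2."""
    return n_documents * (n_documents - 1) // 2

# Cross-check the formula against an explicit enumeration of pairs.
assert pairwise_comparisons(5) == len(list(combinations(range(5), 2)))

for n in (100, 1_000, 10_000):
    print(n, pairwise_comparisons(n))
# 100 4950
# 1000 499500
# 10000 49995000
```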
`
`It was against this state of the art that the possible substitution
`of citations for words in a Salton style system was first considered.
`This article will explain the theoretical basis for such a substitution,
`it will then describe the experiments which have been conducted to
`test the theory and their results, and will finally suggest ways in which
`the technique may be further developed.
`
`2. Theory of Citation Vectors
`The main difference between citation vectors and word based
`vector systems of the type pioneered by Salton is, as might be imagined,
`one of the content of the vectors. Whereas Salton characterised a
`document as a vector the elements of which were words and the
`weights of which were their frequency of occurrence, in a citation
`vector system the elements are citations and their weights could be
`their frequency, but could equally well represent a number of other
`parameters, as will be explained below.
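
A citation vector can be pictured as a table from cited authorities to weights. The Python sketch below takes the weights to be simple citation counts and uses a cosine measure for similarity. Both choices, together with the placeholder case names and the function names, are assumptions made for the illustration.

```python
import math

def citation_similarity(vec_a, vec_b):
    """Cosine similarity between two citation vectors (citation -> weight)."""
    dot = sum(weight * vec_b.get(citation, 0) for citation, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Placeholder cases and authorities; each weight is a citation count.
database = {
    "Case X": {"Authority A": 3, "Authority B": 1},
    "Case Y": {"Authority A": 2, "Authority C": 1},
    "Case Z": {"Authority D": 1},
}

def most_similar(known_case, database):
    """Rank every other case by similarity to the known case, best first."""
    target = database[known_case]
    scores = [
        (name, citation_similarity(target, vector))
        for name, vector in database.items()
        if name != known_case
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranking = most_similar("Case X", database)
print(ranking)
```

Case Y, which shares a heavily cited authority with Case X, is ranked first; Case Z, which shares none, scores zero.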
`
`There are several reasons for choosing to represent legal docu-
`ments by citation vectors rather than by word based vectors. By far
`the most important of these is that the method corresponds well with
`the intuitive approach that lawyers have to legal research. If a lawyer
`knows that a particular case deals with his problem he is likely to
`use it in at least two ways. The first is to exploit it to investigate
`recent material which may not have got into the textbooks, and which
`his usually haphazard personal up-dating system may have missed.
`He simply scans the recent material for cases which may have cited
`the case he knows. It has been discovered that many lawyers in the
`United States use the ordinary full text matching system LEXIS for
`just this purpose. It is ideal for this application since in a
`computerised legal information retrieval system material becomes available
`on the computer much faster than it can be published in conventional
`hard copy form. It may also happen that the given case, if very
`recent, has itself undergone some change. This possibility led Lawyers'
`Co-Operative Publishing Company to devise, first for their own in-
`house use, then publicly, and finally in association with LEXIS, a
`special computerised system, AUTOCITE, for checking the accuracy
`and currency of citations. A second way of using citations to assist
`with research is to use the well-known case to indicate the relevant
`part of an encyclopaedia or textbook by reference to the Table of
`Cases. It is significant that no scholarly legal publication can ever
`afford to omit such a Table.
`
`1 In fact N(N-1)/2 where N is the number of documents in the database.
`
`Because the use of citations as a research tool is so important
`in law it has been developed there far more thoroughly than in any
`other field of study. The United States partly because of its multi-
`plicity of jurisdictions, partly because of its federal character, partly
`because of the volume of litigation, partly because of the structure
`of the law, and partly because of the practices of the law reporters
`has by far the largest volume of case reporting of any common law
`jurisdiction. It is not surprising that this is reflected by the most
`far-reaching and sophisticated citation reference system ever devised.
`The most outstanding feature of this system is Shepard's Citator.
`It tracks the history of every reported decision in minute detail using
`an elaborate structure of codes, superscripts and subscripts to indicate
`the effects of the subsequent reference and the particular point in
`the cited case to which reference has been made. This magnificent
`work is the main resort of most American lawyers engaged on any
`piece of legal research. It may be asked why if Shepard is so useful
`is there any necessity to replace it with an automated system. The
`answer is that Shepard's greatest virtues, its comprehensiveness and
`its currency, are also its greatest weaknesses. It is cripplingly ex-
`pensive to buy and maintain if a thorough coverage is required: it
`is awkward to use because at least three, and sometimes four, separate
`volumes need to be consulted to trace the complete history of an
`early decision; it is hard to interpret because of the mass of detail
`crammed into a tiny space in coded form; and it becomes far too
`time-consuming to follow up all of the references to even one popular
`case, let alone the snowball effect of Shepardising all the references
`yielded by Shepard to the original point-of-entry case. It is believed
`that by the use of a computerised system of citation vectors the
`advantages of citation based research can be maintained, but in a
`very much more efficient way.
`
`Citations also have very substantial advantages over words as
`the elements in a vector based system. It will be recalled that among
`the disadvantages of word based systems were found to be the
`difficulties created by synonyms, grammatical variants, particularisations,
`generalisations and homographs. In all of these respects citation
`based systems have the advantage. The only sense in which a case
`citation can be said to have a synonym is in relation to parallel
`reports of the same decision. This is, however, very easily dealt with.
`Often the different citations appear in the text so that the appropriate
`one can be extracted, but more fundamentally such synonymity is
`fixed and invariable so it presents no problem to have a simple
`conversion table built into the system to permit automatic trans-
`formation into the chosen style. There are clearly no such things
`as grammatical variations, particularisations or generalisations. It is
`however worth mentioning a special problem with citations which
`relates to the page referencing. Ideally such references should contain
`two page numbers, that of the first page of the cited report and that
`of the page in it to which reference is made. Unfortunately while
`this sometimes occurs, it is not uncommon for one or other of the
`single page references to be used. In a well-developed system this
`feature could actually be turned to advantage because where there
`was a missing first page reference a simple table look-up technique
`should be able to yield the first page number. It might then be
`possible to exploit differences in the point of citation as possible
`parameters for element weighting in the correlation algorithm. If only
`the first page reference is given then it would require manual inter-
`vention, and perhaps even the exercise of delicate judgment to assign
`the page number of the point to which reference was made. It is,
`after all, sometimes a mystery to know how a particular citation
`supports the proposition for which it is prayed in aid. Case citations
`are not in the same sense as words capable of being homographs
`either. Each citation is exclusive. It always refers to the same report.
`It may however be ambiguous in the sense that the case may contain
`more than one discrete point, and thus the same citation may perfectly
`properly occur in two subsequent cases, each dealing with a totally
`different subject. This phenomenon poses difficult problems for a
`system based on citation vectors to handle as will be seen later.
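
The two table look-ups suggested in this paragraph, conversion of parallel citations into a chosen style and recovery of a missing first page number, might be sketched in Python as follows. The report series "XYZ" and "ABC", the page numbers and the function names are all invented for the illustration.

```python
import bisect

# Hypothetical conversion table: each parallel citation maps to one chosen form.
PARALLEL_CITATIONS = {
    "(1970) 10 ABC 200": "[1970] 1 XYZ 100",
    "[1970] 1 XYZ 100": "[1970] 1 XYZ 100",
}

# Hypothetical table of the first pages of the reports in one volume, in order.
FIRST_PAGES = {"[1970] 1 XYZ": [1, 100, 250, 400]}

def canonical(citation):
    """Convert a citation into the chosen style, if a parallel form is known."""
    normalised = " ".join(citation.split())
    return PARALLEL_CITATIONS.get(normalised, normalised)

def first_page(volume, cited_page):
    """Find the first page of the report containing the page actually cited."""
    pages = FIRST_PAGES[volume]
    index = bisect.bisect_right(pages, cited_page) - 1
    return pages[index] if index >= 0 else None

print(canonical("(1970)  10 ABC 200"))   # [1970] 1 XYZ 100
print(first_page("[1970] 1 XYZ", 137))   # 100
```

The reverse operation, assigning a point of reference when only the first page is given, has no comparable look-up and is not attempted here.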
`
`So far it has been assumed that citation vectors are confined to
`the citation of cases. While it is true that cases are much easier to
`handle, it is worth considering the possibility of including citations
`to statutory materials which are no less indicative of meaning than
`are citations to other cases. This consideration is, of course, confined
`to retrospective citation since it is highly uncommon for statutes to
`refer to cases explicitly. It is also worth considering statutory citations
`because there is a tiny proportion of cases which cite, and are cited
`by, no other cases, but which do cite statutes. It is a very rare case
`indeed which cites no authority of any sort to justify the decision
`arrived at. Statutes present two especially acute problems. The first
`is that there is such a wide variety of notation that it is extremely
`difficult to devise a framework into which all statutory provisions
`can be cast. This is not unconnected with the second problem which
`concerns the dissection of statutes into their component parts. A
`difficulty is to know when to stop. It is clear that if two cases merely
`cite the same statute they may have too little in common for it to
`be significant. In conventional information retrieval, here too following
`Horty's original technique, the statutory section is usually taken to
`be the minimum unit of reference. It is not clear that for the purposes
`of citation vectors significance should not be attached to lower levels
`of subordination. It is here that the link to the former problem
`becomes apparent. In some jurisdictions at some periods of history
`the fashion is to use long undivided sections, in others it is to divide
`statutes into very short self-contained sections, in still others it is to
`build into the statutory sections a complicated hierarchy of sub-
`sections and sub-sub-sections. It is not immediately obvious how
`all of this can be rendered into a sufficiently homogeneous form for
`the operation of the algorithms essential to the success of the vector
`system.
`
`The most obvious difference between statutory and case materials
`is that statutory material is subject to subsequent amendment and
`modification in a formal way. This presents another formidable
`problem for the vector method. It is not uncommon for a statutory
`section to have its meaning changed quite substantially by such
`amendment. If the amended section is cited before and after the
`amendment should this count as the same citation in both cases?
`It would seem that it should not. In this case a further problem is
`posed for the design of an appropriate notation since it will no longer
`bear so close a resemblance to the conventional form, which by
`hypothesis is identical in both cases. The situation is made still
`worse in cases where the original version of the statutory section
`dealt with two distinct points. If it is amended so far as one of them
`is concerned, but left unchanged so far as the other is concerned
`then the system must break down. Whether the two versions are
`treated as the same or as different the system will be inaccurate in
`the case of citation for one point or the other. The only way out
`would be to make an artificial sub-division, and then to assign reference
`to one or the other by manual means, but this is impractical in a
`large working system.
`
`A major advantage which working with citations rather than
`words possesses is that citations have more, and more readily quanti-
`fiable, parameters than have words. In the case of words the only
`objectively quantifiable criterion is the frequency of occurrence. It is
`true that other values could be assigned manually, but it is hard to
`see how they could be controlled in any satisfactory way in a working
`system. Citations on the other hand are different. They do also
`have a dimension of frequency which cannot sensibly be ignored.
`If a case makes dozens of references to another that other is likely
`to be more significant for the citing case than another case which is
`cited only once. This however raises a problem which does not
`exist with words. It is not difficult to determine whether or not a
`word occurs, and to count the separate occurrences. With citations
`it is equally easy to discover whether a source has been cited. It is
`much less easy to know how many times it has been cited. Suppose
`a judge discusses a case extensively throughout his judgment, and
`because it has become familiar he ceases to repeat its name, and the
`reporter stops footnoting references to it by the standard coding.
`It is clear that such a case makes more reference to that case than
`to another which may be referred to in full form once at the beginning
`and once at the end of the judgment, perhaps in support of some
`quite parenthetic comment by the judge. It is clear therefore that
`the former case must count as having been cited more than once,
`but there is no obvious and objective criterion to establish just how
`often.
`
`Other possible parameters present no such thorny problems. One
`of the most obvious is age, in the sense of the time difference between
`the citing and cited case. This is clearly quantifiable, and about the
`only problem is to know whether to take the relevant date as the
`date of decision or the date of reporting, bearing in mind that this
`can occasionally be very substantial. Other possible parameters in-
`clude such things as the level within the hierarchy at which the case
`was decided, and the degree of remoteness of the jurisdiction. It is
`true that here the numerical value which may be assigned is not
`objectively determined, but the application of the standard is objective.
`Thus one may attach a high numerical value to a decision of the
`High Court. This value will be an arbitrary one, but its assignment
`will be objectively determined. Here the main problem is to reconcile
`different structures in different jurisdictions so as to make values
`homogeneous. For example should the Australian Federal Court be
`given the same value as the Federal Courts of Appeals in the United
`States, or the same as the Federal District Courts? It is felt that,
`at least so far as the common law jurisdictions are concerned this
`should not be an impossible task, though even here historical change
`can present problems. Other possible parameters are more contro-
`versial. It is possible that one ought to pay some attention to the
`degree of approval or disapproval accorded by the citing case, whether
`it has followed, distinguished, overruled or doubted the previous case,
`for example. Such a judgment is made by Shepard, but it is extremely
`doubtful whether it is practical to control the ascertainment of this
`opinion, and in any event it too might be very much affected by the
`practices of different jurisdictions.
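
How several of these parameters might be combined into a single element weight is sketched below in Python. The article prescribes no formula, so everything here is an assumption made for illustration: the court values are arbitrary (although, as the text requires, they are applied in an objective way), older citations are assumed to count for less, and the parameters are simply multiplied together.

```python
from dataclasses import dataclass

# Arbitrary values, objectively assigned by court level (hypothetical scale).
COURT_WEIGHT = {
    "final_appellate": 3.0,
    "intermediate_appellate": 2.0,
    "first_instance": 1.0,
}

@dataclass
class Citation:
    cited_case: str
    frequency: int      # how often the citing case refers to the cited case
    court_level: str    # level in the hierarchy at which the cited case was decided
    citing_year: int
    cited_year: int

def element_weight(citation, half_life_years=20.0):
    """Combine frequency, court level and age into one weight (assumed formula)."""
    age = max(citation.citing_year - citation.cited_year, 0)
    recency = 0.5 ** (age / half_life_years)  # assumed: weight halves every 20 years
    return citation.frequency * COURT_WEIGHT[citation.court_level] * recency

example = Citation("Case A", frequency=2, court_level="final_appellate",
                   citing_year=1980, cited_year=1960)
print(element_weight(example))  # 3.0
```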
`
`For these reasons citation vectors appeared, in theory, to offer
`a useful supplement to full text matching systems for the retrieval
`of legal information, and to be much more promising than word based
`vector systems. The next step was to test the theory by experiment.
`
`3. Experiments with Citation Vectors
`In much the same way as it is wise to test a theory by experiment
`before putting it into practice, it is also wise to test an experimental
`design by a pilot project. An opportunity arose in 1975 to conduct
`such a pilot study of citation vectors at the Law School of the University
`of Stanford in California. The object of the pilot study was to see
`whether the practical implementation of a vector citation experiment
`would involve unforeseen difficulties. At that time there were well-
`established programmes for conducting vector based work, among
`them some interesting work conducted under the auspices of Dr. Bryan
`Niblett at the University of Kent. These programmes incorporated
`standard techniques for calculating the degree of similarity between
`vectors, and also of grouping the documents together into clusters on
`the basis of the similarity coefficients so established. The very first
`pilot experiments were conducted with the generous assistance of
`Dr. Niblett using the Kent programmes on some English data compiled
`at Stanford. The data were taken from a volume of the English
`Criminal Appeal Reports, and included as vector elements only cited
`cases without any weighting. The first runs were evaluated on an
`intuitive basis and it seemed that prima facie the technique was
`grouping cases together in a sufficiently satisfactory manner to justify
`