Regression Analysis of Extended Vectors
to Obtain Coefficients for Use in
Probabilistic Information Retrieval Systems

by

Gary L. Nunn

Project submitted to the Faculty of the
Virginia Polytechnic Institute and State University
in partial fulfillment of the requirements for the degree of

Master of Science

Computer Science and Applications

APPROVED:

December 7, 1987

Blacksburg, Virginia
EXHIBIT 2027
Facebook, Inc. et al.
v.
Software Rights Archive, LLC
CASE IPR2013-00479
Regression Analysis of Extended Vectors
to Obtain Coefficients for Use in
Probabilistic Information Retrieval Systems

by

Gary L. Nunn

Edward A. Fox, Chairman

Computer Science and Applications

(ABSTRACT)
Previous work by Fox has extended the vector space model of information retrieval and its implementation in the SMART system so different types of information about documents can be separately handled as multiple subvectors, each for a different concept type. We hypothesized that relevance of a document could be best predicted if proper coefficients are obtained to reflect the importance of the query-document similarity for each subvector when computing an overall similarity value. Two different research collections, CACM and ISI, each split into halves, were used to generate data for the regression studies to obtain coefficients. Most of the variance in relevance could be accounted for by only four of the subvectors (authors, Computing Review descriptors, links, and terms) for the CACM1 collection. In the ISI1 collection, two of the vectors (terms and cocitations) accounted for most of the variance. Log transformed data and samples of the records gave the best RSQ's; .6654 was the highest RSQ (binary relevance). The regression runs provided coefficients which were used in subsequent feedback runs in SMART. Having ranked relevance did not improve the regression model over binary relevance. The coefficients in the feedback runs with SMART proved to be of limited usefulness, since improvements in precision were in the 1-5% range. Although log data and samples of the records gave the best RSQ's, coefficients from log values of all data improved precision the most. The findings of this study support previous work of Fox, that additional information improves retrieval. Regression coefficients improved precision slightly when used as subvector weights. Log transforming the data values for the concept types modestly helped both the regression analyses and the retrieval in SMART.
Acknowledgements

I am grateful to Dr. Edward A. Fox for his patient help and many interesting discussions during the course of this research and my studies. I would like to thank Dr. Osman Balci and Dr. Clifford Shaffer for their aid in completing this project. I very much appreciate the many hours of effort that Mr. Whay Lee gave to this study.

I would especially like to thank my wife, Dr. Pamela Gam-Nunn, and our son Bradley, who gave me support and made many sacrifices during this project and my course of study.
Table of Contents

1.0 Introduction
1.1 Probabilistic retrieval
1.2 Information retrieval in SMART ........................................ 2
1.3 Research goals ........................................................ 3

2.0 Methods ............................................................... 5
2.1 Division of the collections ........................................... 5
2.2 Description of the collections and vectors ............................ 5
2.3 Descriptive analysis of the data ...................................... 7
2.4 Obtaining regression coefficients for use in SMART .................... 7
2.5 Testing the usefulness of the coefficients as weights in SMART ........ 9
2.6 Threshold techniques ................................................. 10

3.0 Results .............................................................. 11
3.1 Description of the data .............................................. 11
3.2 Linear regressions on the CACM1 collection ........................... 19
3.3 Linear regressions on the ISI1 collection ............................ 27
3.4 Additional regression techniques ..................................... 31
3.5 Use of regression coefficients in SMART for CACM1 and ISI1 collections ... 31
3.6 Use of regression coefficients in SMART for CACM2 and ISI2 collections ... 32
3.7 Use of threshold techniques .......................................... 33

4.0 Discussion ........................................................... 40
4.1 Usefulness of concept types as shown by linear regressions ........... 40
4.2 Ranked relevances versus binary relevances ........................... 41
4.3 Thresholds as aids in regression ..................................... 42
4.4 Improvement of retrieval using coefficients .......................... 42
4.5 Conclusions and implications for further research .................... 42

References ............................................................... 44

Vita ..................................................................... 45
List of Illustrations

Figure 1. Histograms of raw versus log terms in sample ISI1 data. ......... 17
Figure 2. Histograms of raw versus log Computing Reviews categories in all CACM1 data. ... 18
Figure 3. Predicted versus residuals for log sample binary data, CACM1 collection. ... 24
Figure 4. Predicted versus residuals for log sample ranked data, CACM1 collection. ... 25
Figure 5. Predicted versus residuals for log sample for the ISI collection. ... 29
List of Tables

Table 1. Organization of the CACM1 merged data set. ........................ 8
Table 2. Descriptive statistics and vector length of raw CACM1 data. ..... 13
Table 3. Descriptive statistics and vector length of raw ISI1 data. ...... 14
Table 4. Descriptive statistics of log CACM1 data. ....................... 15
Table 5. Descriptive statistics of log ISI1 data. ........................ 16
Table 6. Regression coefficients, ranks and RSQ's for CACM1 ranked relevances. ... 22
Table 7. Regression coefficients, ranks and RSQ's for CACM1 binary relevances. ... 23
Table 8. Sums of squares and probability values for CACM1 two-way interactions. ... 26
Table 9. Regression coefficients, ranks and RSQ's for ISI data. .......... 28
Table 10. Sums of squares and probability values for ISI1 (log sample data) interactions. ... 30
Table 11. Precision values from base and coefficient runs for the ISI1 collection. ... 34
Table 12. Precision values from base and coefficient runs for the CACM1 ..... 35
Table 13. Precision values from base and coefficient runs for the CACM1 ..... 36
Table 14. Precision values from base and coefficient runs for the ISI2 collection. ... 37
Table 15. Precision values from base and coefficient runs for CACM2 (binary relevance). ... 38
Table 16. Precision values from base and coefficient runs for CACM2 (ranked relevance). ... 39

1.0 Introduction

1.1 Probabilistic retrieval

The probabilistic model for information retrieval ((Yu and Salton, 1976, and Robertson and Sparck Jones, 1976) as cited in van Rijsbergen, 1981) assumes that the terms in a query and the terms in a collection of documents are used in an initial retrieval to obtain a sample of the documents. The terms in the sample are then used to estimate the probability that each document in the collection is relevant or not relevant. The collection has usually been indexed by building a vector of terms for each document, where the vector consists of binary values (1 for presence, 0 for absence) for all terms in the collection. A document thus is represented as a vector of length n:

    Term(1), Term(2) ... Term(n)

For a given query it is possible to estimate a probability of relevance for each document in the collection by computing an inner product of "term relevance" values for all terms considered:

    Probability(relevance|document) = SUM [term_relevance(i) * Term(i)]
where:

    term_relevance(i) = [r / (R - r)] / [(n - r) / (N - n - R + r)]

and:

    N = number of documents
    n = number of documents with term i
    R = number of relevant documents
    r = number of relevant documents with term i

The Bayesian decision rule can be used to decide whether a document has a high enough probability estimate to be chosen as relevant:

    Probability(relevance|document) > Probability(non-relevance|document)
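To make the estimate concrete, the term relevance weight and inner product above can be sketched in a few lines (illustrative only; the names follow the formulas, and no smoothing constants are added, so the counts must keep every denominator positive):

```python
def term_relevance(r, n, R, N):
    # [r / (R - r)] / [(n - r) / (N - n - R + r)]; assumes r < R, r < n,
    # and N - n - R + r > 0 (no smoothing is applied in this sketch).
    return (r / (R - r)) / ((n - r) / (N - n - R + r))

def probability_of_relevance(doc_terms, weights):
    # Inner product of term relevance weights with a binary document vector.
    return sum(w * t for w, t in zip(weights, doc_terms))
```

For example, with N = 100 documents, n = 10 containing the term, R = 20 relevant, and r = 5 relevant documents containing the term, the weight is (5/15) / (5/75) = 5.0.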
1.2 Information retrieval in SMART

The SMART information retrieval system (Salton and McGill, 1983) has options to use the probabilistic model to rank documents that are retrieved as part of a feedback process. Initial retrieval is usually accomplished after computing a cosine similarity (Salton and McGill, 1983) between the terms in a query and the terms in documents. Those documents with the highest similarity have the lowest ranks (also the highest probability of being relevant). The user is presented with a list of the "top ranked" documents. The user can then decide which of the "top ranked" documents are relevant. SMART then can perform a vector feedback search (if desired) by adding any new terms from the relevant documents to the initial query and subtracting those terms from the initial query which only appeared in the documents that were judged to be nonrelevant. The resultant feedback query is then used to provide a new ranked list of documents (Salton and McGill, 1983). Alternatively, a probabilistic feedback can be performed.
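A minimal set-based sketch of the vector feedback step just described, with binary term vectors modeled as sets of terms (the function and variable names are our own, not SMART's):

```python
def vector_feedback(query, relevant_docs, nonrelevant_docs):
    # Terms seen in at least one relevant document are added to the query;
    # terms that appeared only in nonrelevant documents are removed.
    relevant_terms = set().union(*relevant_docs) if relevant_docs else set()
    nonrelevant_terms = set().union(*nonrelevant_docs) if nonrelevant_docs else set()
    expanded = query | relevant_terms
    return expanded - (nonrelevant_terms - relevant_terms)
```

Query terms that appear in no judged document are left untouched, matching the description above: only terms unique to the nonrelevant documents are subtracted.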
Fox (1983a) has modified the application of the vector and probabilistic models to utilize additional information. This consists of author, date of publication, bibliographic coupling, bibliographic links, Computing Reviews' categories, and cocitations. SMART has been modified to add the additional information as subvectors (Fox, 1983a) to the document-term vectors of the original vector/probabilistic system. As modified, the collection then consists of extended vectors where the information is separated into subvectors:

    Document_identification_number, terms, authors, date of publication,
    bibliographic coupling, bibliographic links, Computing Reviews' categories,
    cocitations
1.3 Research goals

The goals of this study were to statistically examine the usefulness of the subvectors associated with different concept types (Fox, 1983a) and to determine if coefficients obtained via multiple regressions could be used to further enhance SMART's retrieval using the extended vectors. A less significant goal was to determine if knowledge of relevance as a ranking (from least (1) to most (4) relevant) would facilitate prediction, thus yielding better coefficients. To accomplish these objectives two different research collections were analyzed. These had been loaded into a version of SMART, which has been installed on a VAX-11/785 running UNIX (Fox, 1983b).

The collections, which have been described elsewhere (Fox, 1983b), consist of 3,204 abstracts of documents which appeared (1958-1979) in Communications of the Association for Computing Machinery (CACM collection) and 1,460 abstracts of documents from various sources (1969-1977) concerning information science, along with citation data obtained from the Institute for Scientific Information (ISI collection). The CACM collection has information necessary for all of the above mentioned extended vectors, but the ISI collection has only two additional information components (subvectors), author and cocitations. For both collections, a set of queries with known document-query relevances was used to generate data for the regression studies. Precision (the ratio of relevant documents retrieved to all documents retrieved for that query) averages were the principal measure used for determining the effectiveness of all retrieval runs with SMART.
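The precision measure defined above is straightforward to compute; a small sketch, with hypothetical document identifiers:

```python
def precision(retrieved, relevant):
    # Relevant documents retrieved divided by all documents retrieved.
    retrieved = set(retrieved)
    return len(retrieved & set(relevant)) / len(retrieved)

def average_precision(per_query_runs):
    # Mean of per-query precision values, the principal effectiveness measure.
    values = [precision(ret, rel) for ret, rel in per_query_runs]
    return sum(values) / len(values)
```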
2.0 Methods

2.1 Division of the collections

Both the CACM and ISI collections were divided into two approximately equally sized subcollections by randomly selecting documents. The collections were split so that all analyses could be performed on one half of the data and the results obtained could be tested on the other half of the data. The dividing procedure used kept the same proportion of relevant to nonrelevant documents with regard to sets of queries for each collection. In the description and discussion that follow these subcollections are referred to as the CACM1 or CACM2 and ISI1 or ISI2 collections.
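The dividing procedure can be sketched as a stratified random split (our own reconstruction of the idea, not the original code):

```python
import random

def stratified_split(records, is_relevant, seed=0):
    # Shuffle relevant and nonrelevant records separately, then halve each
    # group, so both halves keep the same relevant/nonrelevant proportion.
    rng = random.Random(seed)
    halves = ([], [])
    for keep in (True, False):
        group = [r for r in records if is_relevant(r) == keep]
        rng.shuffle(group)
        mid = len(group) // 2
        halves[0].extend(group[:mid])
        halves[1].extend(group[mid:])
    return halves
```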
2.2 Description of the collections and vectors

SMART was used to prepare data sets, for both the CACM1 and ISI1 collections, which consisted of query identification number (QID), document identification number (DID), rank in the probabilistic retrieval, and a similarity measure (SIM). However, as it was based on the value of the similarity, rank was not used in this study. A separate data set was obtained for each concept type, which had the appropriate measure of similarity for all QID-DID pairings.

For use in tables and figures the concept types will be represented by the following abbreviations:

    AUT   Authors
    CRC   Computing Reviews' Category
    DTE   Date of Publication
    TRM   Terms
    BBC   Bibliographic Coupling
    LNK   Bibliographic Links
    COC   Cocitations

For the CACM1 collection, there were 7 equal length data sets, one for each concept type, and a data set which listed the relevance judgment (ranked from 0 to 4) for each QID-DID pairing. The Statistical Analysis System (SAS, 1985) was used to merge the 7 concept type data sets with the relevance judgment data set by matching QID and DID. This gave a data set which has QID, DID, 7 different similarity measures (independent variables), and relevance judgment (dependent variable) for subsequent analysis.
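The effect of the SAS merge can be sketched with dictionaries keyed by (QID, DID); this is a simplified stand-in, and the zero-fill for a missing pair is our assumption, since the source data sets were of equal length:

```python
def merge_on_qid_did(relevance, concept_sets):
    # relevance maps (qid, did) -> judgment; each entry of concept_sets maps
    # (qid, did) -> similarity. Rows are matched on the (QID, DID) key, and a
    # pair absent from a concept type data set is treated as similarity 0.0.
    merged = {}
    for key, judgment in relevance.items():
        sims = [cs.get(key, 0.0) for cs in concept_sets]
        merged[key] = sims + [judgment]
    return merged
```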
For the ISI1 collection, only three concept types were available. Thus, only three concept type data sets with their appropriate similarity measure were obtained from SMART. Also, only binary relevance judgments were available for the ISI1 collection. SAS was used to merge the three concept type data sets with the relevance data set by matching a query and document. Thus, a data set with three similarity measures (independent variables) and relevance judgment was constructed. Table 1 on page 8 shows the nature of the matrix that resulted from the merger of the 7 concept type data sets and the relevance data set.
2.3 Descriptive analysis of the data

SAS, the Statistical Analysis System, Version 5 (1985), was used to produce all statistical and graphical results. For all statistical tests of significance a threshold of .05 was used.

Procedures MEANS and UNIVARIATE were used to obtain descriptive statistics for all concept types in the two collections. As the distributions of values for the concept types were quite variable, they were transformed by taking the natural log of the similarity measure plus one. The log transformation was chosen because the data were highly positively skewed due to the large values that occur in inner product calculations when there is a good match between a query and document. Log and square root transformations are good choices for positively skewed data, but the log transformation is more effective at bringing the large values closer to the mean. This made the distributions much more symmetrical. The variables for the two collections were summarized in tables and some representative histograms, which can be found in section "Description of the data" on page 11.
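The transformation and its effect on skewness can be illustrated directly (a sketch; the moment-based skewness below is a plain population estimate, not SAS's exact formula):

```python
import math

def log_transform(values):
    # Natural log of (similarity + 1), as described above.
    return [math.log(v + 1.0) for v in values]

def skewness(values):
    # Moment-based skewness: third central moment over variance**1.5.
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5
```

On a sharply skewed sample such as [0, 0, 0, 0, 1, 2, 400], the transform pulls the one large inner product value toward the mean and reduces the skewness.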
2.4 Obtaining regression coefficients for use in SMART

Procedure General Linear Model (GLM) was used for most regressions, with models specified so that regressions were run without an intercept being calculated. The intercept would have been useless in subsequent runs, which tested the coefficients obtained from regressions.

Table 1. Organization of the CACM1 merged data set: each row pairs a QID-DID combination with RNKREL and the probabilistic "similarity" for each concept type.
NOTE: QID = query identification number; DID = document identification number; RNKREL = ranked relevance. Other variables are as previously described.

For the CACM collection, two, four, and all variable models were run on raw and log transformed data. In the ISI collection studies, two and all variable models were used. An example of one of the regression equations (two variable model) is as follows:

    Relevance = a1 * (Similarity_TRM) + a2 * (Similarity_LNK) + Error

where a1 and a2 are regression coefficients for terms and bibliographic links respectively.

In addition to the linear regressions performed with GLM, logistic regression (Procedure LOGIST) and all possible regressions (Procedure REG) were used on a preliminary data set.
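For the two variable model, a no-intercept least squares fit amounts to solving a 2x2 system of normal equations; a self-contained sketch (not SAS GLM itself):

```python
def no_intercept_fit(x1, x2, y):
    # Solve the normal equations for y = a1*x1 + a2*x2 with no intercept term.
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * c for a, c in zip(x1, y))
    s2y = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12  # assumed nonzero (predictors not collinear)
    a1 = (s1y * s22 - s2y * s12) / det
    a2 = (s2y * s11 - s1y * s12) / det
    return a1, a2
```

Omitting the intercept forces the fitted relevance to be zero when both similarities are zero, which is what makes the coefficients usable directly as subvector weights.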
2.5 Testing the usefulness of the coefficients as weights in SMART

Base runs were made in SMART for both halves of both the CACM and ISI collections. These runs used combinations of concept types that corresponded to the two and three variable models for the ISI1 collections and two, four, and seven variable models for the CACM1 collection. As no coefficients were used in these runs, they provided equal weights for all concept types for comparison with runs having coefficients. The coefficients obtained in the regression runs were then used as weights in feedback runs to try to improve precision in SMART. The runs using coefficients were compared against base runs, which had no coefficients but did use the same concept types. The coefficients that were developed for CACM1 and ISI1 were also tested on CACM2 and ISI2.
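Using the coefficients as subvector weights amounts to replacing the equal-weight sum of the base runs with a weighted sum; a sketch:

```python
def overall_similarity(subvector_sims, coefficients=None):
    # With no coefficients, every concept type gets equal weight (a base run);
    # with regression coefficients, each subvector similarity is weighted.
    if coefficients is None:
        coefficients = [1.0] * len(subvector_sims)
    return sum(c * s for c, s in zip(coefficients, subvector_sims))
```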
2.6 Threshold techniques

Histograms and graphs of the data were examined for possible threshold values for the concept types. Threshold values could be useful if it could be shown that a value as high or higher than a certain percentile for a concept type gave a high probability that the document was relevant. Accordingly, the 60th, 75th and 90th percentiles were used as thresholds to test their usefulness. Variables were created for each concept type at each of the above percentiles. The corresponding threshold variables were given a value of 1.0 when the concept type value exceeded the threshold and 0.0 otherwise. The regressions were then rerun with the additional variables for each threshold. The results of these runs are reported in "Use of threshold techniques" on page 33.
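The threshold variables can be sketched as follows (a nearest-rank percentile is used here for simplicity; SAS may compute percentiles differently):

```python
def percentile(values, pct):
    # Nearest-rank percentile of a list of similarity values.
    ordered = sorted(values)
    k = max(0, int(round(pct / 100.0 * len(ordered))) - 1)
    return ordered[min(k, len(ordered) - 1)]

def threshold_indicators(values, pcts=(60, 75, 90)):
    # One binary indicator per percentile: 1.0 when the concept type value
    # exceeds that percentile's threshold, 0.0 otherwise.
    cuts = {p: percentile(values, p) for p in pcts}
    return {p: [1.0 if v > cuts[p] else 0.0 for v in values] for p in pcts}
```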
3.0 Results

3.1 Description of the data

Descriptive statistics for the CACM1 collection are given in Table 2 on page 13. Vectors for the seven concept types show considerable variation, as they range from zero to four digit numbers. All show relatively high values for CV (the coefficient of variation, from 235.0 to 674.4) and all are highly positively skewed (from 4.22 to 17.3). This is due to the high values that resulted from the inner product calculations when concept types in the query obtained a good match with a document. Kurtosis measures were all large (from 8.8 to 381.5) due to peakedness caused by the high proportion of low or zero values in most concept types that were produced when there was a poor match.

As seen in Table 3 on page 14, the variables in the ISI1 collection were similar. The numbers range from 0 to 1188.1 and had high skewness (from 3.4 to 5.8) and high kurtosis (from 17.1 to 45.4). Their CV's were somewhat lower (163.8 to 356.7). Although linear regression and the F test are robust, these data are rather variable and quite different from the kind of examples usually shown in textbooks on regression analysis. Some of the variation is due to the extremes in values, but much of the variability is due to the sparseness of the QID-DID array. Evidence of the sparseness is seen in Table 2 on page 13 and Table 3 on page 14 regarding length, the column that gives the number of non-zero values for each vector. In fact, five of the seven concept types for CACM1 have non-zero values in from 8.8% (AUT) to 17.9% (BBC) of the data records. In the ISI1 collection one of the three subvectors has only 13% of its values that are non-zero (AUT). An attempt to compensate for the sparseness is discussed under "Linear regressions on the CACM1 collection" on page 19. To try to minimize the effect of the high variability in the data, a natural log transformation was used on all variables and is reported in Table 4 on page 15 and Table 5 on page 16 for the CACM1 and ISI1 collections respectively. As can be seen in these two tables, the transformation does make the distributions considerably more symmetrical and reduces extremes among the various measures.

In the CACM1 collection, skewness (0 is normal) was reduced to more acceptable levels (-.009 to 2.53), kurtosis (0 is normal) was reduced similarly (-.44 to 10.2) and CV (100 is normal) was tighter (52.3 to 265.0). In the ISI1 collection, skewness dropped (-.37 to 2.6), kurtosis declined (-1.4 to 5.1), and CV was lowered (39.5 to 270.2). Thus, in both collections, the distributions of the concept type variables are more evenly matched and closer to normal than for the raw data.

Histograms of raw data and log data further illustrate the effects of the log transformation on the data. Histograms in Figure 1 on page 17 show the impact of the log transformation on TRM in the ISI1 collection (sample data). The large value of 4.8 for skewness in the raw data is shown by the long positive tail. However, the log transformed data show no tail in either direction, as skewness has been reduced to -.154. Not all of the histograms show such dramatic improvement as those seen in Figure 1 on page 17, but all of the concept types do show improved distributions. Figure 2 on page 18, which is of raw and log transformed CRC from the CACM1 collection (all records), is an example of a variable with only modest improvement.
Table 2. Descriptive statistics and vector length of raw CACM1 data.

    VARIABLE    MEAN     C.V.   MINIMUM   MAXIMUM   SKEWNESS   KURTOSIS   NUMBER
                                 VALUE     VALUE                          NON-ZERO
    AUT         8.38    601.6      0      1152.3      14.0      268.3       314
    CRC         4.24    668.1      0       187.9       7.3       78.9      1603
    DTE         3.41    674.4      0        87.8       4.2       24.3       725
    TRM        47.69    235.0      0      2389.0       7.6       90.5      3899
    BBC        10.89    313.9      0       433.2       4.7       30.2       819
    LNK        11.62    486.0      0      1574.3      12.1      226.3       616
    COC        10.76    603.7      0      1602.0      17.3      381.4       566

    NOTE: There were 4035 records in this set.
Table 3. Descriptive statistics and vector length of raw ISI1 data.

    VARIABLE    MEAN     C.V.   MINIMUM   MAXIMUM   SKEWNESS   KURTOSIS   NUMBER
                                 VALUE     VALUE                          NON-ZERO
    TRM        45.7    176.6       0      1188.1      5.8        45.4      5449
    AUT         3.8    356.6       0       200.3      5.2        36.4       710
    COC        33.7    163.7       0       727.5      3.4        17.1      3400

    NOTE: There were 5456 records in this set.
Table 4. Descriptive statistics of log CACM1 data.

    VARIABLE    MEAN     C.V.   MINIMUM   MAXIMUM   SKEWNESS   KURTOSIS
                                 VALUE     VALUE
    AUT         0.33    351.8      0        7.1       3.39      10.20
    CRC         0.81    138.4      0        5.2       1.07       0.05
    DTE         0.50    218.4      0        4.4       1.84       1.75
    TRM         2.83     52.3      0        7.7      -0.00      -0.44
    BBC         0.68    214.5      0        6.1       1.95       2.33
    LNK         0.53    257.1      0        7.3       2.49       4.90
    COC         0.50    265.0      0        7.3       2.53       5.16

    NOTE: Number of non-zero values and number of records are given in Table 2 on page 13.
Table 5. Descriptive statistics of log ISI1 data.

    VARIABLE    MEAN     C.V.   MINIMUM   MAXIMUM   SKEWNESS   KURTOSIS
                                 VALUE     VALUE
    TRM         3.19     35.2      0        7.1      -0.12       0.48
    AUT         0.40    268.5      0        5.3       2.51       4.85
    COC         2.34     74.2      0        6.6      -0.03      -1.30

    NOTE: Number of non-zero values and number of records are given in Table 3 on page 14.
Figure 1. Histograms of raw versus log terms in sample ISI1 data. The raw TRM values pile up near zero with a long positive tail, while the log transformed values are spread nearly symmetrically.
Figure 2. Histograms of raw versus log Computing Reviews categories in all CACM1 data. The log transformation improves the CRC distribution only modestly.
3.2 Linear regressions on the CACM1 collection

SAS Procedure GLM was used to run full models of all concept types as independent variables and ranked relevance judgment as the dependent variable. These runs were performed with the raw data and log transformed data and are summarized in Table 6 on page 22. However, the proportion of nonrelevant to relevant documents was too high, more than 9 to 1. This problem of unbalanced groups was partly responsible for the low coefficient of determination (RSQ) for the raw data (.387) and for the log data (.396). In order to improve the proportion of relevant versus nonrelevant records, most of the nonrelevant documents were randomly discarded, leaving a data set that had equal proportions of relevant versus nonrelevant records (766 total records). This modestly improved the RSQ of the raw data to .445 and considerably improved the RSQ of the log transformed data to .627, as can also be seen in Table 6 on page 22. The sparseness of many of the concept type subvectors is also probably contributing to the relatively low RSQ (see "Description of the data" on page 11), but nothing could be done about that.

Similar runs were made using binary relevance as the dependent variable. The results of these regressions are given in Table 7 on page 23. The same kind of improvement in RSQ that was seen in Table 6 on page 22 was found by discarding most of the nonrelevant records and balancing the relative number of relevant versus nonrelevant documents. However, the improvement between raw data and log data is a little greater, by approximately 1% to 4%. Furthermore, the binary relevance data with the log transformed independent variables gave a better RSQ than the ranked relevance (.6659 versus .6274).

A plot of predicted scores versus residuals for the best log sample model with binary relevance data is displayed in Figure 3 on page 24 and shows a fair degree of closeness for relevant documents (1.0) and considerable spread for nonrelevant documents. A similar plot for ranked relevance data is shown in Figure 4 on page 25, but here the relevant values are divided into values from 1.0 to 4.0. Again, the relevant documents show less spread than the nonrelevant.
The concept type variables for each regression run were ranked by their Type III Sum of Squares (SAS, 1985), which gives the sum of squares for each variable independently of its order in the regression model. From the rankings, some two and four variable models were chosen and run using the same two dependent variables for all records and for the sample set of records. The coefficients, RSQ's, and rankings are also provided in Table 7 on page 23 and Table 6 on page 22 for binary and ranked relevance data respectively.

The two variable model (TRM and LNK) using raw data and all records gave 86% of the RSQ of the seven variable model for both dependent variables. For sample raw data, 83% and 86% of the seven variable RSQ was obtained. For log transformed data, all record regressions with the same independent and dependent variables gave 79% and 81% of the original seven variable RSQ (ranked and binary relevances respectively). However, the sample of log data gave 98% of the seven variable RSQ for both ranked and binary relevance data. In fact the two variable model for log transformed independent variables and binary relevance data gave a higher RSQ than any of the other seven variable models. The four variable model (AUT, CRC, TRM, and LNK) gave modest improvement, but clearly most of the variance is accounted for by TRM and LNK.

All possible two-way interactions were tested (using PROC GLM in SAS) on ranked relevance and binary relevance data. Several were found to be significant at the .05 level. Table 8 on page 26 shows
