Regression Analysis of Extended Vectors
to Obtain Coefficients for Use in
Probabilistic Information Retrieval Systems

by

Gary L. Nunn

Project submitted to the Faculty of the
Virginia Polytechnic Institute and State University
in partial fulfillment of the requirements for the degree of

Master of Science

Computer Science and Applications

APPROVED:

December 7, 1987
Blacksburg, Virginia

Regression Analysis of Extended Vectors
to Obtain Coefficients for Use in
Probabilistic Information Retrieval Systems

by

Gary L. Nunn

Edward A. Fox, Chairman
Computer Science and Applications

(ABSTRACT)

Previous work by Fox has extended the vector space model of information retrieval and its implementation in the SMART system so different types of information about documents can be separately handled as multiple subvectors, each for a different concept type. We hypothesized that relevance of a document could be best predicted if proper coefficients are obtained to reflect the importance of the query-document similarity for each subvector when computing an overall similarity value. Two different research collections, CACM and ISI, each split into halves, were used to generate data for the regression studies to obtain coefficients. Most of the variance in relevance could be accounted for by only four of the subvectors (authors, Computing Review descriptors, links, and terms) for the CACM1 collection. In the ISI1 collection, two of the vectors (terms and cocitations) accounted for most of the variance. Log transformed data and samples of the records gave the best RSQ's; .6654 was the highest RSQ (binary relevance). The regression runs provided coefficients which were used in subsequent feedback runs in SMART. Having ranked relevance did not improve the regression model over binary relevance. The coefficients in the feedback runs with SMART proved to be of limited usefulness since improvements in precision were in the 1-5% range. Although log data and samples of the records gave the best RSQ's, coefficients from log values of all data improved precision the most. The findings of this study support previous work of Fox, that additional information improves retrieval. Regression coefficients improved precision slightly when used as subvector weights. Log transforming the data values for the concept types modestly helped both the regression analyses and the retrieval in SMART.

Acknowledgements

I am grateful to Dr. Edward A. Fox for his patient help and many interesting discussions during the course of this research and my studies. I would like to thank Dr. Osman Balci and Dr. Clifford Shaffer for their aid in completing this project. I very much appreciate the many hours of effort that Mr. Whay Lee gave to this study.

I would especially like to thank my wife, Dr. Pamela Gam-Nunn, and our son Bradley, who gave me support and made many sacrifices during this project and my course of study.

Table of Contents

1.0 Introduction ....................................................... 1
1.1 Probabilistic retrieval ............................................ 1
1.2 Information retrieval in SMART ..................................... 2
1.3 Research goals ..................................................... 3

2.0 Methods ............................................................ 5
2.1 Division of the collections ........................................ 5
2.2 Description of the collections and vectors ......................... 5
2.3 Descriptive analysis of the data ................................... 7
2.4 Obtaining regression coefficients for use in SMART ................. 7
2.5 Testing the usefulness of the coefficients as weights in SMART ..... 9
2.6 Threshold techniques .............................................. 10

3.0 Results ........................................................... 11
3.1 Description of the data ........................................... 11
3.2 Linear regressions on the CACM1 collection ........................ 19
3.3 Linear regressions on the ISI1 collection ......................... 27

3.4 Additional regression techniques .................................. 31
3.5 Use of regression coefficients in SMART for CACM1 and ISI1 collections ... 31
3.6 Use of regression coefficients in SMART for CACM2 and ISI2 collections ... 32
3.7 Use of threshold techniques ....................................... 33

4.0 Discussion ........................................................ 40
4.1 Usefulness of concept types as shown by linear regressions ........ 40
4.2 Ranked relevances versus binary relevances ........................ 41
4.3 Thresholds as aids in regression .................................. 42
4.4 Improvement of retrieval using coefficients ....................... 42
4.5 Conclusions and implications for further research ................. 42

References ............................................................ 44

Vita .................................................................. 45

List of Illustrations

Figure 1. Histograms of raw versus log terms in sample ISI1 data. .............. 17
Figure 2. Histograms of raw versus log Computing Reviews categories in all CACM1 data. ... 18
Figure 3. Predicted versus residuals for log sample binary data, CACM1 collection. ... 24
Figure 4. Predicted versus residuals for log sample ranked data, CACM1 collection. ... 25
Figure 5. Predicted versus residuals for log sample for the ISI collection. .... 29

List of Tables

Table 1. Organization of the CACM1 merged data set. ............................ 8
Table 2. Descriptive statistics and vector length of raw CACM1 data. ......... 13
Table 3. Descriptive statistics and vector length of raw ISI1 data. .......... 14
Table 4. Descriptive statistics of log CACM1 data. ........................... 15
Table 5. Descriptive statistics of log ISI1 data. ............................ 16
Table 6. Regression coefficients, ranks and RSQ's for CACM1 ranked relevances. ... 22
Table 7. Regression coefficients, ranks and RSQ's for CACM1 binary relevances. ... 23
Table 8. Sums of squares and probability values for CACM1 two-way interactions. ... 26
Table 9. Regression coefficients, ranks and RSQ's for ISI data. .............. 28
Table 10. Sums of squares and probability values for ISI1 (log sample data) interactions. ... 30
Table 11. Precision values from base and coefficient runs for the ISI1 collection. ... 34
Table 12. Precision values from base and coefficient runs for the CACM1 ...... 35
Table 13. Precision values from base and coefficient runs for the CACM1 ...... 36
Table 14. Precision values from base and coefficient runs for the ISI2 collection. ... 37
Table 15. Precision values from base and coefficient runs for CACM2 (binary relevance). ... 38
Table 16. Precision values from base and coefficient runs for CACM2 (ranked relevance). ... 39

1.0 Introduction

1.1 Probabilistic retrieval

The probabilistic model for information retrieval ((Yu and Salton, 1976 and Robertson and Sparck Jones, 1976) as cited in van Rijsbergen, 1981) assumes that the terms in a query and the terms in a collection of documents are used in an initial retrieval to obtain a sample of the documents. The terms in the sample are then used to estimate the probability that each document in the collection is relevant or not relevant. The collection has usually been indexed by building a vector of terms for each document where the vector consists of binary values (1 for presence, 0 for absence) for all terms in the collection. A document thus is represented as a vector of length n:

    Term(1), Term(2) ... Term(n)

For a given query it is possible to estimate a probability of relevance for each document in the collection by computing an inner product of "term relevance" values for all terms considered:

    Probability(relevance|document) = SUM [term_relevance(i) * Term(i)]

where:

    term_relevance(i) = [r / (R - r)] / [(n - r) / (N - n - R + r)]

and:

    N = number of documents
    n = number of documents with term i
    R = number of relevant documents
    r = number of relevant documents with term i

The Bayesian decision rule can be used to decide whether a document has a high enough probability estimate to be chosen as relevant:

    Probability(relevance|document) > Probability(non-relevance|document)

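The scoring just described can be sketched in a short program. This is a hypothetical illustration only: the collection statistics N, n, R, and r below are invented values, not figures from either test collection.

```python
# Sketch of the probabilistic scoring above: a term relevance weight per
# term, combined by inner product with a binary document vector.
# All statistics are invented for illustration.

def term_relevance(n, r, N, R):
    """Term relevance weight: [r/(R-r)] / [(n-r)/(N-n-R+r)]."""
    return (r / (R - r)) / ((n - r) / (N - n - R + r))

def probability_score(doc_vector, weights):
    """Inner product of a binary term vector with term relevance weights."""
    return sum(w * t for w, t in zip(weights, doc_vector))

# Assumed per-term statistics (n, r, N, R) for a 3-term vocabulary.
stats = [(100, 8, 1000, 10), (50, 2, 1000, 10), (400, 5, 1000, 10)]
weights = [term_relevance(n, r, N, R) for n, r, N, R in stats]

doc = [1, 0, 1]  # binary presence/absence vector for one document
score = probability_score(doc, weights)
```

A document would then be accepted as relevant when its score under the relevance hypothesis exceeds its score under the non-relevance hypothesis, per the decision rule above.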
1.2 Information retrieval in SMART

The SMART information retrieval system (Salton and McGill, 1983) has options to use the probabilistic model to rank documents that are retrieved as part of a feedback process. Initial retrieval is usually accomplished after computing a cosine similarity (Salton and McGill, 1983) between the terms in a query and the terms in documents. Those documents with the highest similarity have the lowest ranks (also the highest probability of being relevant). The user is presented with a list of the "top ranked" documents. The user can then decide which of the "top ranked" documents are relevant. SMART then can perform a vector feedback search (if desired) by adding any new terms from the relevant documents to the initial query and subtracting those terms from the initial query which only appeared in the documents that were judged to be nonrelevant. The resultant feedback query is then used to provide a new ranked list of documents (Salton and McGill, 1983). Alternatively, a probabilistic feedback can be performed.

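The vector feedback step just described can be sketched as set operations on query terms. This is a minimal illustration with invented terms; SMART itself operates on weighted term vectors, not bare sets.

```python
# Sketch of one round of vector feedback: add terms from documents judged
# relevant; drop terms that appeared only in documents judged nonrelevant.
# Term names are invented for illustration.

def feedback_query(query, relevant_docs, nonrelevant_docs):
    """Return a new query term set after one round of vector feedback."""
    rel_terms = set().union(*relevant_docs) if relevant_docs else set()
    nonrel_terms = set().union(*nonrelevant_docs) if nonrelevant_docs else set()
    # Add new terms from relevant documents; remove terms seen only in
    # nonrelevant documents (i.e., not also seen in a relevant one).
    return (set(query) | rel_terms) - (nonrel_terms - rel_terms)

q = {"retrieval", "vector"}
rel = [{"retrieval", "probabilistic"}]
nonrel = [{"database", "retrieval"}]
new_q = feedback_query(q, rel, nonrel)  # gains "probabilistic"
```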
Fox (1983a) has modified the application of the vector and probabilistic models to utilize additional information. This consists of author, date of publication, bibliographic coupling, bibliographic links, Computing Reviews' categories, and cocitations. SMART has been modified to add the additional information as subvectors (Fox, 1983a) to the document-term vectors of the original vector/probabilistic system. As modified, the collection then consists of extended vectors where the information is separated into subvectors:

    Document_identification_number, terms, authors, date of publication,
    bibliographic coupling, bibliographic links, Computing Reviews' categories,
    cocitations

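An overall query-document similarity over such extended vectors, of the kind this study seeks coefficients for, can be sketched as a weighted sum of per-subvector similarities. The concept-type names, similarity values, and coefficients below are invented for illustration.

```python
# Sketch of combining per-subvector (concept type) similarities into one
# overall query-document similarity using per-subvector coefficients.
# All values are illustrative assumptions.

def overall_similarity(subvector_sims, coefficients):
    """Weighted sum of per-subvector query-document similarities."""
    return sum(coefficients[c] * s for c, s in subvector_sims.items())

sims = {"terms": 0.62, "authors": 0.10, "links": 0.33}
coeffs = {"terms": 1.0, "authors": 0.4, "links": 0.7}
score = overall_similarity(sims, coeffs)
```

With equal coefficients this reduces to the unweighted base runs described later; the regressions in this study attempt to estimate better-than-equal weights.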
1.3 Research goals

The goals of this study were to statistically examine the usefulness of the subvectors associated with different concept types (Fox, 1983a) and to determine if coefficients obtained via multiple regressions could be used to further enhance SMART's retrieval using the extended vectors. A less significant goal was to determine if knowledge of relevance as a ranking (from least (1) to most (4) relevant) would facilitate prediction, thus yielding better coefficients. To accomplish these objectives two different research collections were analyzed. These had been loaded into a version of SMART, which has been installed on a VAX-11/785 running UNIX (Fox, 1983b).

The collections, which have been described elsewhere (Fox, 1983b), consist of 3,204 abstracts of documents which appeared (1958 - 1979) in Communications of the Association for Computing Machinery (CACM collection) and 1,460 abstracts of documents from various sources (1969 - 1977) concerning information science, along with citation data obtained from the Institute for Scientific Information (ISI collection). The CACM collection has information necessary for all of the above mentioned extended vectors, but the ISI collection has only two additional information components (subvectors), author and cocitations. For both collections, a set of queries with known document-query relevances was used to generate data for the regression studies. Precision (the ratio of relevant documents retrieved to all documents retrieved for that query) averages were the principal measure used for determining the effectiveness of all retrieval runs with SMART.

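The precision measure defined above can be computed directly; the document identifiers below are invented for illustration.

```python
# Sketch of the precision measure: relevant documents retrieved divided by
# all documents retrieved for a query. Identifiers are illustrative.

def precision(retrieved, relevant):
    """Precision for one query's retrieved list against its relevant set."""
    retrieved = list(retrieved)
    if not retrieved:
        return 0.0
    relevant = set(relevant)
    return sum(1 for d in retrieved if d in relevant) / len(retrieved)

p = precision([3, 8, 12, 40], {8, 12, 99})  # 2 of 4 retrieved are relevant
```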
2.0 Methods

2.1 Division of the collections

Both the CACM and ISI collections were divided into two approximately equally sized subcollections by randomly selecting documents. The collections were split so that all analyses could be performed on one half of the data and the results obtained could be tested on the other half of the data. The dividing procedure used kept the same proportion of relevant to nonrelevant documents with regard to sets of queries for each collection. In the description and discussion that follow these subcollections are referred to as the CACM1 or CACM2 and ISI1 or ISI2 collections.

2.2 Description of the collections and vectors

SMART was used to prepare data sets, for both the CACM1 and ISI1 collections, which consisted of query identification number (QID), document identification number (DID), rank in the probabilistic retrieval, and a similarity measure (SIM). However, as it was based on the value of the similarity, rank was not used in this study. A separate data set was obtained for each concept type, which had the appropriate measure of similarity for all QID - DID pairings.

For use in tables and figures the concept types will be represented by the following abbreviations:

    AUT  Authors
    CRC  Computing Reviews' Category
    DTE  Date of Publication
    TRM  Terms
    BBC  Bibliographic Coupling
    LNK  Bibliographic Links
    COC  Cocitations

For the CACM1 collection, there were 7 equal length data sets, one for each concept type, and a data set which listed the relevance judgment (ranked from 0 to 4) for each QID - DID pairing. The Statistical Analysis System (SAS, 1985) was used to merge the 7 concept type data sets with the relevance judgment data set by matching QID and DID. This gave a data set, which has QID, DID, and 7 different similarity measures (independent variables) and relevance judgment (dependent variable) for subsequent analysis.

For the ISI1 collection, only three concept types were available. Thus, only three concept type data sets with their appropriate similarity measure were obtained from SMART. Also, only binary relevance judgments were available for the ISI1 collection. SAS was used to merge the three concept type data sets with the relevance data set by matching a query and document. Thus, a data set with three similarity measures (independent variables) and relevance judgment was constructed.

Table 1 on page 8 shows the nature of the matrix that resulted from the merger of the 7 concept type data sets and the relevance data set.

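The merge performed with SAS can be sketched in pure Python. The field names and values here are illustrative assumptions; the actual analysis matched QID and DID within SAS.

```python
# Sketch of joining per-concept-type similarity records with relevance
# judgments on the (QID, DID) pair. Data are invented for illustration.

def merge_on_qid_did(concept_sets, relevance):
    """concept_sets: {name: {(qid, did): sim}}; relevance: {(qid, did): rel}."""
    merged = []
    for (qid, did), rel in relevance.items():
        row = {"QID": qid, "DID": did, "REL": rel}
        for name, sims in concept_sets.items():
            # A pairing absent from a concept-type set has similarity 0.
            row[name] = sims.get((qid, did), 0.0)
        merged.append(row)
    return merged

concepts = {"TRM": {(2, 98): 0.13}, "LNK": {(2, 98): 20.8}}
rows = merge_on_qid_did(concepts, {(2, 98): 0, (2, 655): 0})
```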
2.3 Descriptive analysis of the data

SAS, the Statistical Analysis System, Version 5 (1985), was used to produce all statistical and graphical results. For all statistical tests of significance a threshold of .05 was used.

Procedures MEANS and UNIVARIATE were used to obtain descriptive statistics for all concept types in the two collections. As the distributions of values for the concept types were quite variable, they were transformed by taking the natural log of the similarity measure plus one. The log transformation was chosen because the data were highly positively skewed due to the large values that occur in inner product calculations when there is a good match between a query and document. Log and square root transformations are good choices for positively skewed data, but the log transformation is more effective at bringing the large values closer to the mean. This made the distributions much more symmetrical. The variables for the two collections were summarized in tables and some representative histograms and can be found in section "Description of the data" on page 11.

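The transformation described above, the natural log of the similarity plus one, can be sketched as follows; the raw values are invented examples of a positively skewed similarity variable.

```python
# Sketch of the log transformation: ln(similarity + 1), which leaves zero
# similarities at zero and pulls large inner-product values toward the mean.
import math

def log_transform(values):
    return [math.log(v + 1.0) for v in values]

raw = [0.0, 0.0, 3.6, 47.7, 2389.0]   # illustrative, highly skewed values
logged = log_transform(raw)
```

Adding one before taking the log is what keeps the many zero-valued pairings defined (ln 1 = 0) while compressing the long positive tail.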
2.4 Obtaining regression coefficients for use in SMART

Procedure General Linear Model (GLM) was used for most regressions, with models specified so that regressions were run without an intercept being calculated. The intercept would have been useless in subsequent runs, which tested the coefficients obtained from regressions. For the CACM collection, two, four, and all variable models were run on raw and log transformed data. In the ISI collection studies, two and all variable models were used. An example of one of the regression equations (two variable model) is as follows:

    Relevance = a1 * (Similarity_TRM) + a2 * (Similarity_LNK) + Error

where a1 and a2 are regression coefficients for terms and bibliographic links respectively.

In addition to the linear regressions performed with GLM, logistic regression (Procedure LOGIST) and all possible regressions (Procedure REG) were used on a preliminary data set.

Table 1. Organization of the CACM1 merged data set.

    QID  DID  RNKREL  |  PROBABILISTIC "SIMILARITY" FOR EACH CONCEPT TYPE:
                         AUT, DTE, TRM, LNK, BBC, CRC, COC
    [Individual query-document rows not legible in the source scan.]

    NOTE: QID = query identification number
          DID = document identification number
          RNKREL = ranked relevance
          Other variables are as previously described.

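A two-variable regression through the origin of the form above can be sketched by solving the normal equations directly. The original analysis used SAS PROC GLM; the data values below are invented so that the fit is exact.

```python
# Sketch of a no-intercept least-squares fit for
#   Relevance = a1*Sim_TRM + a2*Sim_LNK + error,
# solving the 2x2 normal equations. Data are invented for illustration.

def fit_no_intercept(x1, x2, y):
    """Least-squares coefficients (a1, a2) for y ~ a1*x1 + a2*x2, no intercept."""
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * c for a, c in zip(x1, y))
    s2y = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12
    return ((s1y * s22 - s2y * s12) / det,
            (s2y * s11 - s1y * s12) / det)

trm = [0.5, 1.2, 0.1, 2.0]
lnk = [0.0, 0.8, 0.3, 1.1]
rel = [0.25, 0.84, 0.14, 1.33]  # constructed as 0.5*trm + 0.3*lnk
a1, a2 = fit_no_intercept(trm, lnk, rel)
```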
2.5 Testing the usefulness of the coefficients as weights in SMART

Base runs were made in SMART for both halves of both the CACM and ISI collections. These runs used combinations of concept types that corresponded to the two and three variable models for the ISI1 collection and two, four, and seven variable models for the CACM1 collection. As no coefficients were used in these runs, they provided equal weights for all concept types for comparison with runs having coefficients. The coefficients obtained in the regression runs were then used as weights in feedback runs to try to improve precision in SMART. The runs using coefficients were compared against base runs, which had no coefficients but did use the same concept types. The coefficients that were developed for CACM1 and ISI1 were also tested on CACM2 and ISI2.

2.6 Threshold techniques

Histograms and graphs of the data were examined for possible threshold values for the concept types. Threshold values could be useful if it could be shown that a value as high or higher than a certain percentile for a concept type gave a high probability that the document was relevant. Accordingly, the 60th, 75th and 90th percentiles were used as thresholds to test their usefulness. Variables were created for each concept type at each of the above percentiles. The corresponding threshold variables were given a value of 1.0 when the concept type value exceeded the threshold and 0.0 otherwise. The regressions were then rerun with the additional variables for each threshold. The results of these runs are reported in "Use of threshold techniques" on page 33.

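The construction of the percentile threshold variables can be sketched as follows, using a nearest-rank percentile; the similarity values are invented for illustration.

```python
# Sketch of threshold indicator variables: 1.0 when a concept-type
# similarity exceeds a given percentile of its observed values, else 0.0.
# Nearest-rank percentile; data are illustrative.

def percentile(values, pct):
    """Nearest-rank percentile of a list of values."""
    ordered = sorted(values)
    k = int(round(pct / 100.0 * len(ordered))) - 1
    return ordered[max(0, min(len(ordered) - 1, k))]

def threshold_indicators(values, pct):
    cut = percentile(values, pct)
    return [1.0 if v > cut else 0.0 for v in values]

sims = [0, 0, 0, 0, 1.5, 3.6, 9.6, 20.8, 87.5, 117.9]
flags90 = threshold_indicators(sims, 90)  # flags values above the 90th pctile
```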
3.0 Results

3.1 Description of the data

Descriptive statistics for the CACM1 collection are given in Table 2 on page 13. Vectors for the seven concept types show considerable variation, as they range from zero to four digit numbers. All show relatively high values for CV (the coefficient of variation, from 235.0 to 674.4) and all are highly positively skewed (from 4.22 to 17.3). This is due to the high values that resulted from the inner product calculations when concept types in the query obtained a good match with a document. Kurtosis measures were all large (from 8.8 to 381.5) due to peakedness caused by the high proportion of low or zero values in most concept types that were produced when there was a poor match.

As seen in Table 3 on page 14 the variables in the ISI1 collection were similar. The numbers range from 0 to 1188.1 and had high skewness (from 3.4 to 5.8) and high kurtosis (from 17.1 to 45.4). Their CV's were somewhat lower (163.8 to 356.7). Although linear regression and the F test are robust, these data are rather variable and quite different from the kind of examples usually shown in textbooks on regression analysis. Some of the variation is due to the extremes in values, but much of the variability is due to the sparseness of the QID - DID array. Evidence of the sparseness is seen in Table 2 on page 13 and Table 3 on page 14 regarding length, the column that gives the number of non-zero values for each vector. In fact, five of the seven concept types for CACM1 have non-zero values in from 8.8% (AUT) to 17.9% (BBC) of the data records. In the ISI1 collection one of the three subvectors has only 13% of its values that are non-zero (AUT). An attempt to compensate for the sparseness is discussed under "Linear regressions on the CACM1 collection" on page 19. To try to minimize the effect of the high variability in the data, a natural log transformation was used on all variables and is reported in Table 4 on page 15 and Table 5 on page 16 for the CACM1 and ISI1 collections respectively. As can be seen in these two tables, the transformation does make the distributions considerably more symmetrical and reduces extremes among the various measures.

In the CACM1 collection, skewness (0 is normal) was reduced to more acceptable levels (-.009 to 2.53), kurtosis (0 is normal) was reduced similarly (-.44 to 10.2) and CV (100 is normal) was tighter (52.3 to 265.0). In the ISI1 collection, skewness dropped (-.37 to 2.6), kurtosis declined (-1.4 to 5.1), and CV was lowered (39.5 to 270.2). Thus, in both collections, the distributions of the concept type variables are more evenly matched and closer to normal than for the raw data.

Histograms of raw data and log data further illustrate the effects of the log transformation on the data. Histograms in Figure 1 on page 17 show the impact of the log transformation on TRM in the ISI1 collection (sample data). The large value of 4.8 for skewness in the raw data is shown by the long positive tail. However, the log transformed data show no tail in either direction, as skewness has been reduced to -.154. Not all of the histograms show such dramatic improvement as those seen in Figure 1 on page 17, but all of the concept types do show improved distributions. Figure 2 on page 18, which is of raw and log transformed CRC from the CACM1 collection (all records), is an example of a variable with only modest improvement.

Table 2. Descriptive statistics and vector length of raw CACM1 data.

    VARIABLE   MEAN    C.V.    MINIMUM  MAXIMUM   SKEWNESS  KURTOSIS  NUMBER
                                VALUE    VALUE                        NON-ZERO
    AUT         8.38   601.6   0        1152.3     14.0      268.3      314
    CRC         4.24   668.1   0         187.9      7.3       78.9     1603
    DTE         3.41   674.4   0          87.8      4.2       24.3      725
    TRM        47.69   235.0   0        2389.0      7.6       90.5     3899
    BBC        10.89   313.9   0         433.2      4.7       30.2      819
    LNK        11.62   486.0   0        1574.3     12.1      226.3      616
    COC        10.76   603.7   0        1602.0     17.3      381.4      566

    NOTE: There were 4035 records in this set.

Table 3. Descriptive statistics and vector length of raw ISI1 data.

    VARIABLE   MEAN    C.V.    MINIMUM  MAXIMUM   SKEWNESS  KURTOSIS  NUMBER
                                VALUE    VALUE                        NON-ZERO
    TRM        45.7    176.6   0        1188.1     5.8       45.4      5449
    AUT         3.8    356.6   0         200.3     5.2       36.4       710
    COC        33.7    163.7   0         727.5     3.4       17.1      3400

    NOTE: There were 5456 records in this set.

Table 4. Descriptive statistics of log CACM1 data.

    VARIABLE   MEAN    C.V.    MINIMUM  MAXIMUM   SKEWNESS  KURTOSIS
                                VALUE    VALUE
    AUT        0.33    351.8   0        7.1        3.39     10.20
    CRC        0.81    138.4   0        5.2        1.07      0.05
    DTE        0.50    218.4   0        4.4        1.84      1.75
    TRM        2.83     52.3   0        7.7       -0.00     -0.44
    BBC        0.68    214.5   0        6.1        1.95      2.33
    LNK        0.53    257.1   0        7.3        2.49      4.90
    COC        0.50    265.0   0        7.3        2.53      5.16

    NOTE: Number of non-zero values and number of records
          are given in Table 2 on page 13.

Table 5. Descriptive statistics of log ISI1 data.

    VARIABLE   MEAN    C.V.    MINIMUM  MAXIMUM   SKEWNESS  KURTOSIS
                                VALUE    VALUE
    TRM        3.19     35.2   0        7.1       -0.12      0.48
    AUT        0.40    268.5   0        5.3        2.51      4.85
    COC        2.34     74.2   0        6.6       -0.03     -1.30

    NOTE: Number of non-zero values and number of records
          are given in Table 3 on page 14.

Figure 1. Histograms of raw versus log terms in sample ISI1 data.

    [Two histograms: the raw data are massed near zero with a long positive
    tail; the log transformed data form a nearly symmetrical distribution.]

Figure 2. Histograms of raw versus log Computing Reviews categories in all CACM1 data.

    [Two histograms: both raw and log transformed CRC values remain heavily
    concentrated at zero, with the log data only modestly more spread out.]

3.2 Linear regressions on the CACM1 collection

SAS Procedure GLM was used to run full models of all concept types as independent variables and ranked relevance judgment as the dependent variable. These runs were performed with the raw data and log transformed data and are summarized in Table 6 on page 22. However, the proportion of nonrelevant to relevant documents was too high, more than 9 to 1. This problem of unbalanced groups was partly responsible for the low coefficient of determination (RSQ) for the raw data (.387) and for the log data (.396). In order to improve the proportion of relevant versus nonrelevant records, most of the nonrelevant documents were randomly discarded, leaving a data set that had equal proportions of relevant versus nonrelevant records (766 total records). This modestly improved the RSQ of the raw data to .445 and considerably improved the RSQ of the log transformed data to .627, as can also be seen in Table 6 on page 22. The sparseness of many of the concept type subvectors is also probably contributing to the relatively low RSQ (see "Description of the data" on page 11), but nothing could be done about that.

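The balancing step, randomly discarding nonrelevant records until the two groups are of equal size, can be sketched as follows; the records are invented, and the thesis performed this step in SAS.

```python
# Sketch of balancing relevant vs. nonrelevant records by randomly
# downsampling the nonrelevant group. Data are invented for illustration.
import random

def balance(records, seed=0):
    """records: list of (features, relevance) pairs, relevance 0 or 1."""
    rel = [r for r in records if r[1] > 0]
    nonrel = [r for r in records if r[1] == 0]
    rng = random.Random(seed)  # seeded so the sample is reproducible
    kept = rng.sample(nonrel, min(len(rel), len(nonrel)))
    return rel + kept

data = [((i,), 1) for i in range(3)] + [((i,), 0) for i in range(20)]
balanced = balance(data)  # 3 relevant + 3 randomly kept nonrelevant
```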
Similar runs were made using binary relevance as the dependent variable. The results of these regressions are given in Table 7 on page 23. The same kind of improvement in RSQ that was seen in Table 6 on page 22 was found by discarding most of the nonrelevant records and balancing the relative number of relevant versus nonrelevant documents. However, the improvement between raw data and log data is a little greater, by approximately 1% to 4%. Furthermore, the binary relevance data with the log transformed independent variables gave a better RSQ than the ranked relevance (.6659 versus .6274).

A plot of predicted scores versus residuals for the best log sample model with binary relevance data is displayed in Figure 3 on page 24 and shows a fair degree of closeness for relevant documents (1.0) and considerable spread for nonrelevant documents. A similar plot for ranked relevance data is shown in Figure 4 on page 25, but here the relevant values are divided into values from 1.0 to 4.0. Again, the relevant documents show less spread than the nonrelevant.

The concept type variables for each regression run were ranked by their Type III Sum of Squares (SAS, 1985), which gives the sum of squares for each variable independently of its order in the regression model. From the rankings, some two and four variable models were chosen and run using the same two dependent variables for all records and for the sample set of records. The coefficients, RSQ's, and rankings are also provided in Table 7 on page 23 and Table 6 on page 22 for binary and ranked relevance data respectively.

The two variable model (TRM and LNK) using raw data and all records gave 86% of the RSQ of the seven variable model for both dependent variables. For sample raw data, 83% and 86% of the seven variable RSQ was obtained. For log transformed data, all record regressions with the same independent and dependent variables gave 79% and 81% of the original seven variable RSQ (ranked and binary relevances respectively). However, the sample of log data gave 98% of the seven variable RSQ for both ranked and binary relevance data. In fact the two variable model for log transformed independent variables and binary relevance data gave a higher RSQ than any of the other seven variable models. The four variable model (AUT, CRC, TRM, and LNK) gave modest improvement, but clearly most of the variance is accounted for by TRM and LNK.

All possible two-way interactions were tested (using proc GLM in SAS) on ranked relevance and binary relevance data. Several were found to be significant at the .05 level. Table 8 on page 26 shows interactions for binary