Regression Analysis of Extended Vectors
to Obtain Coefficients for Use in
Probabilistic Information Retrieval Systems
Gary L. Nunn
Project submitted to the Faculty of the
Virginia Polytechnic Institute and State University
in partial fulfillment of the requirements for the degree of
Master of Science
Computer Science and Applications
December 7, 1987
Blacksburg, Virginia
Previous work by Fox has extended the vector space model of information retrieval and its
implementation in the SMART system so that different types of information about documents can be
handled separately as multiple subvectors, each for a different concept type. We hypothesized that
the relevance of a document could be best predicted if proper coefficients were obtained to reflect
the importance of the query-document similarity for each subvector when computing an overall
similarity value. Two different research collections, CACM and ISI, each split into halves, were used
to generate data for the regression studies to obtain coefficients. Most of the variance in relevance
could be accounted for by only four of the subvectors (authors, Computing Reviews descriptors,
links, and terms) for the CACM1 collection. In the ISI1 collection, two of the vectors (terms and
cocitations) accounted for most of the variance. Log transformed data and samples of the records
gave the best RSQ's; .6654 was the highest RSQ (binary relevance). The regression runs provided
coefficients which were used in subsequent feedback runs in SMART. Having ranked relevance
did not improve the regression model over binary relevance. The coefficients in the feedback runs
with SMART proved to be of limited usefulness, since improvements in precision were in the 1-5%
range. Although log data and samples of the records gave the best RSQ's, coefficients from log
values of all data improved precision the most. The findings of this study support previous work
of Fox, that additional information improves retrieval. Regression coefficients improved precision
slightly when used as subvector weights. Log transforming the data values for the concept types
modestly helped both the regression analyses and the retrieval in SMART.


I am grateful to Dr. Edward A. Fox for his patient help and many interesting discussions during
the course of this research and my studies. I would like to thank Dr. Osman Balci and Dr. Clifford
Shaffer for their aid in completing this project. I very much appreciate the many hours of effort that
Mr. Whay Lee gave to this study.
I would especially like to thank my wife, Dr. Pamela Gam-Nunn, and our son Bradley, who gave
me support and made many sacrifices during this project and my course of study.


Table of Contents
1.1 Probabilistic retrieval
1.2 Information retrieval in SMART
1.3 Research goals
2.0 Methods
2.1 Division of the collections
2.2 Description of the collections and vectors
2.3 Descriptive analysis of the data
2.4 Obtaining regression coefficients for use in SMART
2.5 Testing the usefulness of the coefficients as weights in SMART
2.6 Threshold techniques
3.0 Results
3.1 Description of the data
3.2 Linear regressions on the CACM1 collection
3.3 Linear regressions on the ISI1 collection
3.4 Additional regression techniques
3.5 Use of regression coefficients in SMART for CACM1 and ISI1 collections
3.6 Use of regression coefficients in SMART for CACM2 and ISI2 collections
3.7 Use of threshold techniques
4.0 Discussion
4.1 Usefulness of concept types as shown by linear regressions
4.2 Ranked relevances versus binary relevances
4.3 Thresholds as aids in regression
4.4 Improvement of retrieval using coefficients
4.5 Conclusions and implications for further research
Vita


List of Illustrations
Figure 1. Histograms of raw versus log terms in sample ISI1 data.
Figure 2. Histograms of raw versus log Computing Reviews categories in all CACM1 data.
Figure 3. Predicted versus residuals for log sample binary data, CACM1 collection.
Figure 4. Predicted versus residuals for log sample ranked data, CACM1 collection.
Figure 5. Predicted versus residuals for log sample for the ISI collection.


List of Tables
Table 1. Organization of the CACM1 merged data set.
Table 2. Descriptive statistics and vector length of raw CACM1 data.
Table 3. Descriptive statistics and vector length of raw ISI1 data.
Table 4. Descriptive statistics of log CACM1 data.
Table 5. Descriptive statistics of log ISI1 data.
Table 6. Regression coefficients, ranks and RSQ's for CACM1 ranked relevances.
Table 7. Regression coefficients, ranks and RSQ's for CACM1 binary relevances.
Table 8. Sums of squares and probability values for CACM1 two-way interactions.
Table 9. Regression coefficients, ranks and RSQ's for ISI data.
Table 10. Sums of squares and probability values for ISI1 (log sample data) interactions.
Table 11. Precision values from base and coefficient runs for the ISI1 collection.
Table 12. Precision values from base and coefficient runs for the CACM1
Table 13. Precision values from base and coefficient runs for the CACM1
Table 14. Precision values from base and coefficient runs for the ISI2 collection.
Table 15. Precision values from base and coefficient runs for CACM2 (binary relevance).
Table 16. Precision values from base and coefficient runs for CACM2 (ranked relevance).


1.1 Probabilistic retrieval
The probabilistic model for information retrieval (Yu and Salton, 1976, and Robertson and Sparck
Jones, 1976, as cited in van Rijsbergen, 1981) assumes that the terms in a query and the terms in
a collection of documents are used in an initial retrieval to obtain a sample of the documents. The
terms in the sample are then used to estimate the probability that each document in the collection
is relevant or not relevant. The collection has usually been indexed by building a vector of terms
for each document, where the vector consists of binary values (1 for presence, 0 for absence) for
all terms in the collection. A document is thus represented as a vector of length n:
Term(1), Term(2), ..., Term(n)
For a given query it is possible to estimate a probability of relevance for each document in the
collection by computing an inner product of "term relevance" values for all terms considered:
Probability(relevance|document) = SUM [term_relevance(i) * Term(i)]
where
term_relevance(i) = [r / (R - r)] / [(n - r) / (N - n - R + r)]
N = number of documents
n = number of documents with term i
R = number of relevant documents
r = number of relevant documents with term i
The Bayesian decision rule can be used to decide whether a document has a high enough
probability estimate to be chosen as relevant:
Probability(relevance|document) > Probability(non-relevance|document)
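
As an illustration only, the term relevance formula and the inner product above might be sketched in Python as follows. The function names and the example counts are assumptions made for this sketch, and the case of a zero denominator (for example r = R) is not handled because the text does not say how it should be treated.

```python
def term_relevance(N, n, R, r):
    """Term relevance weight exactly as written in the text.

    N = number of documents, n = documents containing the term,
    R = relevant documents, r = relevant documents containing the term.
    Zero denominators are not handled (hypothetical simplification).
    """
    return (r / (R - r)) / ((n - r) / (N - n - R + r))


def relevance_score(weights, doc_vector):
    """Inner product of term relevance values with a binary document vector."""
    return sum(w * t for w, t in zip(weights, doc_vector))


# Hypothetical counts: 100 documents, 20 contain the term,
# 10 are relevant, and 5 of the relevant ones contain the term.
w = term_relevance(N=100, n=20, R=10, r=5)
score = relevance_score([w, 2.0, 0.5], [1, 0, 1])
```

Only the terms present in the document (value 1) contribute to the score, so a document matching highly weighted terms receives a higher estimate.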
1.2 Information retrieval in SMART
The SMART information retrieval system (Salton and McGill, 1983) has options to use the
probabilistic model to rank documents that are retrieved as part of a feedback process. Initial retrieval
is usually accomplished after computing a cosine similarity (Salton and McGill, 1983) between the
terms in a query and the terms in documents. Those documents with the highest similarity have
the lowest rank numbers (and thus the highest estimated probability of being relevant). The user is
presented with a list of the "top ranked" documents and can then decide which of them
are relevant. SMART can then perform a vector feedback search (if desired) by adding any new
terms from the relevant documents to the initial query and subtracting from the initial query those
terms which appeared only in the documents that were judged to be nonrelevant. The resultant
feedback query is then used to provide a new ranked list of documents (Salton and McGill, 1983).
Alternatively, a probabilistic feedback can be performed.
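
A minimal sketch of the cosine similarity and of the vector feedback step described above is given here, assuming that queries and documents are stored as dictionaries mapping terms to weights. This representation, the function names, and the unit adjustment of weights are assumptions for illustration, not SMART's actual implementation.

```python
import math


def cosine(query, doc):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * doc.get(t, 0.0) for t, w in query.items())
    nq = math.sqrt(sum(w * w for w in query.values()))
    nd = math.sqrt(sum(w * w for w in doc.values()))
    return dot / (nq * nd) if nq and nd else 0.0


def feedback_query(query, relevant_docs, nonrelevant_docs):
    """Add terms from relevant documents; subtract terms that occur
    only in nonrelevant documents (simplified, hypothetical weighting)."""
    new_query = dict(query)
    relevant_terms = set()
    for doc in relevant_docs:
        relevant_terms.update(doc)
        for term, weight in doc.items():
            new_query[term] = new_query.get(term, 0.0) + weight
    for doc in nonrelevant_docs:
        for term, weight in doc.items():
            if term not in relevant_terms:
                new_query[term] = new_query.get(term, 0.0) - weight
    return new_query
```

In this sketch a term that appears in both relevant and nonrelevant documents is never subtracted, which follows the description that only terms found solely in nonrelevant documents are removed.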
Fox (1983a) has modified the application of the vector and probabilistic models to utilize additional
information. This consists of author, date of publication, bibliographic coupling, bibliographic
links, Computing Reviews' categories, and cocitations. SMART has been modified to add the
additional information as subvectors (Fox, 1983a) to the document-term vectors of the original
vector/probabilistic system. As modified, the collection then consists of extended vectors where the
information is separated into subvectors:
terms, authors, date of publication, bibliographic coupling,
bibliographic links, Computing Reviews' categories, cocitations
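
The idea of an overall similarity built from per-subvector similarities can be sketched as a weighted sum, as below. This is a hedged illustration: the nested-dictionary layout, the function names, and the example weights are hypothetical, and the inner product is used as the per-subvector similarity only because the text mentions inner product calculations.

```python
def inner_product(query_sub, doc_sub):
    """Inner product of two sparse subvectors stored as dictionaries."""
    return sum(w * doc_sub.get(k, 0.0) for k, w in query_sub.items())


def extended_similarity(query, doc, weights):
    """Weighted sum of subvector similarities, one weight per concept type."""
    return sum(
        weights[c] * inner_product(query.get(c, {}), doc.get(c, {}))
        for c in weights
    )


# Hypothetical extended vectors with two concept types.
query = {"TRM": {"retrieval": 1.0}, "AUT": {"fox": 1.0}}
doc = {"TRM": {"retrieval": 2.0}, "AUT": {"fox": 1.0}}
overall = extended_similarity(query, doc, {"TRM": 0.5, "AUT": 2.0})
```

With all weights equal to 1.0 the function reduces to a plain sum over concept types, which corresponds to a run without coefficients.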
1.3 Research goals
The goals of this study were to statistically examine the usefulness of the subvectors associated with
different concept types (Fox, 1983a) and to determine if coefficients obtained via multiple
regressions could be used to further enhance SMART's retrieval using the extended vectors. A less
significant goal was to determine if knowledge of relevance as a ranking (from least (1) to most (4)
relevant) would facilitate prediction, thus yielding better coefficients. To accomplish these
objectives, two different research collections were analyzed. These had been loaded into a version of
SMART which has been installed on a VAX-11/785 running UNIX (Fox, 1983b).


The collections, which have been described elsewhere (Fox, 1983b), consist of 3,204 abstracts of
documents which appeared (1958 - 1979) in Communications of the Association for Computing
Machinery (CACM Collection) and 1,460 abstracts of documents from various sources (1969 -
1977) concerning information science, along with citation data obtained from the Institute for
Scientific Information (ISI Collection). The CACM collection has information necessary for all of the
above mentioned extended vectors, but the ISI collection has only two additional information
components (subvectors), author and cocitations. For both collections, a set of queries with known
document-query relevances was used to generate data for the regression studies. Precision (the ratio
of relevant documents retrieved to all documents retrieved for that query) averages were the
principal measure used for determining the effectiveness of all retrieval runs with SMART.
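
The precision measure defined above can be written as a short Python sketch. The function names are hypothetical, and averaging the per-query values with a simple mean is an assumption, since the exact averaging used in SMART is not described here.

```python
def precision(retrieved, relevant):
    """Ratio of relevant documents retrieved to all documents retrieved."""
    if not retrieved:
        return 0.0
    hits = sum(1 for doc_id in retrieved if doc_id in relevant)
    return hits / len(retrieved)


def mean_precision(runs):
    """Simple mean of per-query precision; runs is a list of
    (retrieved, relevant) pairs, one pair per query."""
    values = [precision(ret, rel) for ret, rel in runs]
    return sum(values) / len(values) if values else 0.0
```

For example, a query that retrieves documents 1, 2, 3 and 4 when documents 2, 4 and 9 are relevant has two relevant documents among four retrieved.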


2.0 Methods
2.1 Division of the collections
Both the CACM and ISI collections were divided into two approximately equally sized
subcollections by randomly selecting documents. The collections were split so that all analyses could
be performed on one half of the data and the results obtained could be tested on the other half of
the data. The dividing procedure kept the same proportion of relevant to nonrelevant
documents with regard to sets of queries for each collection. In the description and discussion that
follow, these subcollections are referred to as the CACM1 or CACM2 and ISI1 or ISI2 Collections.
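
One way such a split could be carried out is sketched below. The actual dividing procedure is not given in the text, so this is only an assumed approach: documents are shuffled separately within the relevant and nonrelevant groups and each group is cut in half. Treating a document as "relevant" when it is relevant to at least one query is also an assumption.

```python
import random


def split_collection(doc_ids, relevant_ids, seed=0):
    """Randomly split documents into two halves while keeping the
    proportion of relevant to nonrelevant documents about the same."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    relevant = [d for d in doc_ids if d in relevant_ids]
    nonrelevant = [d for d in doc_ids if d not in relevant_ids]
    rng.shuffle(relevant)
    rng.shuffle(nonrelevant)
    r_cut, n_cut = len(relevant) // 2, len(nonrelevant) // 2
    half1 = relevant[:r_cut] + nonrelevant[:n_cut]
    half2 = relevant[r_cut:] + nonrelevant[n_cut:]
    return sorted(half1), sorted(half2)
```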
2.2 Description of the collections and vectors
SMART was used to prepare data sets, for both the CACM1 and ISI1 collections, which consisted
of query identification number (QID), document identification number (DID), rank in the
probabilistic retrieval, and a similarity measure (SIM). However, because rank was derived from the
value of the similarity, it was not used in this study. A separate data set was obtained for each
concept type, which had the appropriate measure of similarity for all QID - DID pairings.
For use in tables and figures the concept types will be represented by the following abbreviations:
CRC    Computing Reviews' Category
       Date of Publication
BBC    Bibliographic Coupling
LNK    Bibliographic Links
AUT    Authors
TRM    Terms
For the CACM1 collection, there were 7 equal length data sets, one for each concept type, and a
data set which listed the relevance judgment (ranked from 0 to 4) for each QID - DID pairing.
The Statistical Analysis System (SAS, 1985) was used to merge the 7 concept type data sets with
the relevance judgment data set by matching QID and DID. This gave a data set which has QID,
DID, 7 different similarity measures (independent variables) and relevance judgment
(dependent variable) for subsequent analysis.
For the ISI1 collection, only three concept types were available. Thus, only three concept type data
sets with their appropriate similarity measure were obtained from SMART. Also, only binary
relevance judgments were available for the ISI1 collection. SAS was used to merge the three concept
type data sets with the relevance data set by matching a query and document. Thus, a data set with
three similarity measures (independent variables) and relevance judgment was constructed.
Table 1 on page 8 shows the nature of the matrix that resulted from the merger of the 7 concept
type data sets and the relevance data set.
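
The merge by QID and DID was done in SAS; a rough Python equivalent is sketched here for illustration only. The dictionary layout and function name are hypothetical, and filling a missing pairing with 0.0 is an assumption (the text states that the data sets were of equal length, so no pairing should be missing).

```python
def merge_by_pair(concept_sets, relevance):
    """Merge per-concept similarity data with relevance judgments.

    concept_sets: {concept_name: {(qid, did): similarity}}
    relevance:    {(qid, did): relevance judgment}
    Returns one row (a dict) per QID - DID pairing.
    """
    rows = []
    for qid, did in sorted(relevance):
        row = {"QID": qid, "DID": did, "RNKREL": relevance[(qid, did)]}
        for name, sims in concept_sets.items():
            # 0.0 for a missing pairing is an assumption of this sketch
            row[name] = sims.get((qid, did), 0.0)
        rows.append(row)
    return rows
```

Each resulting row holds the independent variables (one similarity per concept type) next to the dependent variable, which is the shape needed for the regressions.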
2.3 Descriptive analysis of the data
SAS, the Statistical Analysis System, Version 5 (1985) was used to produce all statistical and
graphical results. For all statistical tests of significance a threshold of .05 was used.
Procedures MEANS and UNIVARIATE were used to obtain descriptive statistics for all concept
types in the two collections. As the distributions of values for the concept types were quite variable,
they were transformed by taking the natural log of the similarity measure plus one. The log
transformation was chosen because the data were highly positively skewed, due to the large values that
occur in inner product calculations when there is a good match between a query and document.
Log and square root transformations are good choices for positively skewed data, but the log
transformation is more effective at bringing the large values closer to the mean. This made the
distributions much more symmetrical. The variables for the two collections are summarized in
tables and representative histograms in section "Description of the data" on page 11.
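
The log(x + 1) transformation and its effect on skewness can be illustrated with a small sketch. The skewness below is the simple moment-based coefficient, which is an assumption and may differ slightly from the adjusted estimate that SAS reports; the sample values are invented for the example.

```python
import math


def log_transform(values):
    """Natural log of each similarity value plus one."""
    return [math.log(v + 1.0) for v in values]


def skewness(values):
    """Moment-based skewness (third central moment over variance^1.5)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Invented, positively skewed values chosen so the logs are evenly spaced.
raw = [math.exp(k) - 1.0 for k in range(5)]
logged = log_transform(raw)
```

Here the raw values have a long positive tail, while the transformed values are evenly spaced and therefore symmetric about their mean.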
Table 1. Organization of the CACM1 merged data set.
Columns: QID, DID, RNKREL, and one similarity measure for each of the seven concept types.
NOTE: QID = query identification number
DID = document identification number
RNKREL = ranked relevance
Other variables are as previously described.

2.4 Obtaining regression coefficients for use in SMART
Procedure General Linear Model (GLM) was used for most regressions, with models specified so
that regressions were run without an intercept being calculated. The intercept would have been
useless in subsequent runs, which tested the coefficients obtained from regressions. For the CACM
collection, two, four, and all variable models were run on raw and log transformed data. In the ISI
collection studies, two and all variable models were used. An example of one of the regression
equations (two variable model) is as follows:
Relevance = a1 * (Similarity_TRM) + a2 * (Similarity_LNK) + Error
where a1 and a2 are regression coefficients for terms and
bibliographic links respectively.
In addition to the linear regressions performed with GLM, logistic regression (Procedure LOGIST)
and all possible regressions (Procedure REG) were used on a preliminary data set.
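
The two variable, no-intercept model of section 2.4 can be sketched in Python by solving the normal equations directly. This is an illustration rather than the SAS procedure itself: the function names are hypothetical, and the RSQ shown uses the uncorrected total sum of squares, which is assumed here to be the convention for models without an intercept.

```python
def fit_no_intercept(x1, x2, y):
    """Least-squares a1, a2 for y = a1*x1 + a2*x2 (no intercept)."""
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * c for a, c in zip(x1, y))
    s2y = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("predictors are collinear")
    a1 = (s1y * s22 - s2y * s12) / det
    a2 = (s2y * s11 - s1y * s12) / det
    return a1, a2


def rsq_uncorrected(x1, x2, y, a1, a2):
    """1 - SSE / sum(y^2); assumed convention for no-intercept models."""
    sse = sum((c - (a1 * a + a2 * b)) ** 2 for a, b, c in zip(x1, x2, y))
    return 1.0 - sse / sum(c * c for c in y)
```

Here x1 and x2 would be the TRM and LNK similarity values and y the relevance judgment for each QID - DID record; the returned a1 and a2 play the role of the coefficients later used as weights.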
2.5 Testing the usefulness of the coefficients as weights in SMART
Base runs were made in SMART for both halves of both the CACM and ISI collections. These
runs used combinations of concept types that corresponded to the two and three variable models
for the ISI1 Collection and the two, four, and seven variable models for the CACM1 Collection. As
no coefficients were used in these runs, they provided equal weights for all concept types for
comparison with runs having coefficients. The coefficients obtained in the regression runs were then
used as weights in feedback runs to try to improve precision in SMART. The runs using
coefficients were compared against base runs, which had no coefficients but did use the same concept
types. The coefficients that were developed for CACM1 and ISI1 were also tested on CACM2 and
ISI2.


2.6 Threshold techniques
Histograms and graphs of the data were examined for possible threshold values for the concept
types. Threshold values could be useful if it could be shown that a value as high or higher than a
certain percentile for a concept type gave a high probability that the document was relevant.
Accordingly, the 60th, 75th and 90th percentiles were used as thresholds to test their usefulness.
Variables were created for each concept type at each of the above percentiles. The corresponding
threshold variables were given a value of 1.0 when the concept type value exceeded the threshold and
0.0 otherwise. The regressions were then rerun with the additional variables for each threshold.
The results of these runs are reported in "Use of threshold techniques" on page 33.
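
The construction of the threshold variables can be sketched as follows. The nearest-rank percentile used here is an assumption, since several percentile definitions exist and the one used in the study is not stated; the function names are hypothetical.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile (assumed definition) of a list of values."""
    ordered = sorted(values)
    k = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[k - 1]


def threshold_variable(values, pct):
    """1.0 where the value exceeds the pct-th percentile, 0.0 otherwise."""
    cut = percentile(values, pct)
    return [1.0 if v > cut else 0.0 for v in values]
```

One such indicator column per concept type and percentile (60th, 75th, 90th) would then be added to the regression as an extra independent variable.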


3.0 Results
3.1 Description of the data
Descriptive statistics for the CACM1 collection are given in Table 2 on page 13. Vectors for the
seven concept types show considerable variation, as they range from zero to four digit numbers.
All show relatively high values for CV (the coefficient of variation, from 235.0 to 674.4) and all are
highly positively skewed (from 4.22 to 17.3). This is due to the high values that resulted from the
inner product calculations when concept types in the query obtained a good match with a document.
Kurtosis measures were all large (from 8.8 to 381.5) due to peakedness caused by the high
proportion of low or zero values in most concept types that were produced when there was a poor match.
As seen in Table 3 on page 14, the variables in the ISI1 collection were similar. The numbers ranged
from 0 to 1188.1 and had high skewness (from 3.4 to 5.8) and high kurtosis (from 17.1 to 45.4).
Their CV's were somewhat lower (163.8 to 356.7). Although linear regression and the F test are
robust, these data are rather variable and quite different from the kind of examples usually shown
in textbooks on regression analysis. Some of the variation is due to the extremes in values, but
much of the variability is due to the sparseness of the QID - DID array. Evidence of the sparseness
is seen in Table 2 on page 13 and Table 3 on page 14 regarding length, the column that gives the
number of non-zero values for each vector. In fact, five of the seven concept types for CACM1
have non-zero values in only 8.8% (AUT) to 17.9% (BBC) of the data records. In the ISI1
collection, one of the three subvectors (AUT) has non-zero values in only 13% of its records. An
attempt to compensate for the sparseness is discussed under "Linear regressions on the CACM 1
collection" on page 19. To try to minimize the effect of the high variability in the data, a natural
log transformation was used on all variables and is reported in Table 4 on page 15 and Table 5
on page 16 for the CACM1 and ISI1 collections respectively. As can be seen in these two tables,
the transformation does make the distributions considerably more symmetrical and reduces
extremes among the various measures.
In the CACM1 collection, skewness (0 is normal) was reduced to more acceptable levels (-.009 to
2.53), kurtosis (0 is normal) was reduced similarly (-.44 to 10.2) and CV (100 is normal) was tighter
(52.3 to 265.0). In the ISI1 collection, skewness dropped (-.37 to 2.6), kurtosis declined (-1.4 to
5.1), and CV was lowered (39.5 to 270.2). Thus, in both collections, the distributions of the
concept type variables are more evenly matched and closer to normal than for the raw data.
Histograms of raw data and log data further illustrate the effects of the log transformation on the
data. Histograms in Figure 1 on page 17 show the impact of the log transformation on TRM in
the ISI1 collection (sample data). The large value of 4.8 for skewness in the raw data is shown by
the long positive tail. However, the log transformed data show no tail in either direction, as
skewness has been reduced to -.154. Not all of the histograms show such dramatic improvement
as those seen in Figure 1 on page 17, but all of the concept types do show improved distributions.
Figure 2 on page 18, which is of raw and log transformed CRC from the CACM1 collection (all
records), is an example of a variable with only modest improvement.


Table 2. Descriptive statistics and vector length of raw CACM1 data.
NOTE: There were 4035 records in this set.


Table 3. Descriptive statistics and vector length of raw ISI1 data.
NOTE: There were 5456 records in this set.


Table 4. Descriptive statistics of log CACM1 data.
Number of non-zero values and number of records
are given in Table 2 on page 13.


Table 5. Descriptive statistics of log ISI1 data.
Number of non-zero values and number of records
are given in Table 3 on page 14.


Figure 1. Histograms of raw versus log terms in sample ISI1 data.


Figure 2. Histograms of raw versus log Computing Reviews categories in all CACM1 data.


3.2 Linear regressions on the CACM1 collection
SAS Procedure GLM was used to run full models, with all concept types as independent variables and
ranked relevance judgment as the dependent variable. These runs were performed with the raw data
and log transformed data and are summarized in Table 6 on page 22. However, the proportion
of nonrelevant to relevant documents was too high, more than 9 to 1. This problem of unbalanced
groups was partly responsible for the low coefficient of determination (RSQ) for the raw data (.387)
and for the log data (.396). In order to improve the proportion of relevant versus nonrelevant
records, most of the nonrelevant documents were randomly discarded, leaving a data set that had
equal proportions of relevant versus nonrelevant records (766 total records). This modestly
improved the RSQ of the raw data to .445 and considerably improved the RSQ of the log transformed
data to .627, as can also be seen in Table 6 on page 22. The sparseness of many of the concept
type subvectors is also probably contributing to the relatively low RSQ (see "Description of the
data" on page 11), but nothing could be done about that.
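
The balancing step, in which most nonrelevant records are randomly discarded, might look like the following sketch. The record layout (a dictionary with an RNKREL field) and the fixed random seed are assumptions made so the example is self-contained and repeatable.

```python
import random


def balance_sample(records, seed=0):
    """Keep every relevant record and an equal-sized random sample
    of the nonrelevant records."""
    rng = random.Random(seed)  # fixed seed for a repeatable sample
    relevant = [r for r in records if r["RNKREL"] > 0]
    nonrelevant = [r for r in records if r["RNKREL"] == 0]
    kept = rng.sample(nonrelevant, min(len(relevant), len(nonrelevant)))
    return relevant + kept
```

The regressions would then be rerun on the balanced sample rather than on all records.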
Similar runs were made using binary relevance as the dependent variable. The results of these
regressions are given in Table 7 on page 23. The same kind of improvement in RSQ that was seen
in Table 6 on page 22 was found by discarding most of the nonrelevant records and balancing the
relative number of relevant versus nonrelevant documents. However, the improvement between
raw data and log data is a little greater, by approximately 1% to 4%. Furthermore, the binary
relevance data with the log transformed independent variables gave a better RSQ than the ranked
relevance (.6659 versus .6274).
A plot of predicted scores versus residuals for the best log sample model with binary relevance data
is displayed in Figure 3 on page 24 and shows a fair degree of closeness for relevant documents (1.0)
and considerable spread for nonrelevant documents. A similar plot for ranked relevance data is
shown in Figure 4 on page 25, but here the relevant values are divided into values from 1.0 to 4.0.
Again, the relevant documents show less spread than the nonrelevant.
The concept type variables for each regression run were ranked by their Type III Sum of Squares
(SAS, 1985), which gives the sum of squares for each variable independently of its order in the
regression model. From the rankings, some two and four variable models were chosen and run using
the same two dependent variables for all records and for the sample set of records. The coefficients,
RSQ's, and rankings are also provided in Table 7 on page 23 and Table 6 on page 22 for binary
and ranked relevance data respectively.
The two variable model (TRM and LNK) using raw data and all records gave 86% of the RSQ
of the seven variable model for both dependent variables. For sample raw data, 83% and 86% of
the seven variable RSQ was obtained. For log transformed data, all record regressions with the
same independent and dependent variables gave 79% and 81% of the original seven variable RSQ
(ranked and binary relevances respectively). However, the sample of log data gave 98% of the seven
variable RSQ for both ranked and binary relevance data. In fact, the two variable model for log
transformed independent variables and binary relevance data gave a higher RSQ than any of the
other seven variable models. The four variable model (AUT, CRC, TRM, and LNK) gave modest
improvement, but clearly most of the variance is accounted for by TRM and LNK.
All possible two-way interactions were tested (using Procedure GLM in SAS) on ranked relevance and
binary relevance data. Several were found to be significant at the .05 level. Table 8 on page 26
shows interactions for binary

