`
Carolyn J. Crouch
Department of Computer Science
Tulane University
New Orleans, LA 70118

The importance of a thesaurus in the successful operation of an information retrieval system is well recognized. Yet techniques which support the automatic generation of thesauri remain largely undiscovered. This paper describes one approach to the automatic generation of global thesauri, based on the discrimination value model of Salton, Yang, and Yu and on an appropriate clustering algorithm. This method has been implemented and applied to two document collections. Preliminary results indicate that this method, which produces improvements in retrieval performance in excess of 10 and 15 percent in the test collections, is viable and worthy of continued investigation.
`
`INTRODUCTION
`
A major factor in the successful operation of any information retrieval system is the set of dictionaries (e.g., thesauri, statistical and/or syntactic phrase dictionaries, term hierarchies, etc.) available for use by the system. Of these dictionaries, the dictionary having the greatest potential impact on system performance is undoubtedly the thesaurus. Although the benefits accruing from the use of a well constructed thesaurus in terms of increased system performance are well recognized, the methodology for automatically creating such a thesaurus remains unspecified. In fact, virtually all thesauri presently in use are idiosyncratic. Thus a topic of considerable interest to researchers aiming to improve the overall performance of information retrieval systems is automatic thesaurus construction.

This paper describes an approach to the automatic construction of a global thesaurus based on the discrimination value model of Salton, Yang, and Yu [SALTON75a] and on an appropriate clustering algorithm. The discrimination value model itself is based on the vector space model described below.
`
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.

© 1988 ACM 0-89791-274-8 88 0600 0309 $1.50
`
`IPR2017-01039
`Unified EX1022 Page 1
`
`
`
The Vector Space Model
`
One of the major models in information retrieval is the vector space model. This model views each document in the document collection as a set of unique words or word types. Each document may then be regarded as a term vector, and the complete document collection becomes a vector space of dimension m, where m is the number of word types in the collection. In the vector space model, a document vector, dj, is represented by a set of terms, djk, 1 ≤ k ≤ m, where djk represents the frequency (or weight) of term k in document j (i.e., the number of times term k appears in document j). If djk = 0, term k does not appear in document dj. Queries, like documents, are represented by weighted term vectors.
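As an illustration of this representation, a small hypothetical collection (the documents and vocabulary below are invented for the example) can be turned into term-frequency vectors:

```python
from collections import Counter

def term_vector(doc_words, vocabulary):
    """Represent a document as a vector d_j, where d_jk is the
    frequency of vocabulary term k in the document."""
    counts = Counter(doc_words)
    return [counts.get(term, 0) for term in vocabulary]

docs = [["retrieval", "system", "retrieval"],
        ["thesaurus", "system"],
        ["cluster", "thesaurus", "cluster", "cluster"]]

# m word types over the whole collection
vocabulary = sorted({w for d in docs for w in d})
vectors = [term_vector(d, vocabulary) for d in docs]
```

Each vector has one component per word type; a zero component means the corresponding term does not appear in that document.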
Given any two term vectors, the similarity between the vectors may be assumed to be inversely related to the angle between them. If the two vectors coincide, the angle between them is zero, and the vectors are identical. In two dimensions, the vector space may be represented by its envelope. The (normalized) vectors are then viewed as points in the vector space, and the distance between any two points is inversely related to the similarity of the corresponding document vectors. The smaller the distance between two points, the smaller the angle between the corresponding vectors, and the greater the similarity of the vectors in terms of the number of word types they have in common.

Salton et al [SALTON75a, SALTON75b, SALTON76] have shown that the best document space for retrieval purposes is one which maximizes the average separation between documents in the document space. In this space, it is easier to distinguish between documents and thus easier to retrieve documents which are most similar to a given query. The model which allows the terms in a collection to be ranked in order of their effect on space density is called the discrimination value model.
`
The Discrimination Value Model
`
The discrimination value model [SALTON75a] assigns specific roles to single terms, term phrases, and term classes for content analysis purposes and provides a framework within which each potential index term in a collection can be ranked in accordance with its usefulness as a document discriminator. It also offers a reasonable physical interpretation of the indexing process. If we consider a collection of documents, each represented by a set of weighted m-dimensional vectors, then the similarity coefficient computed between any two term vectors can be interpreted as a measure of the closeness or relatedness between the vectors in m-space. If the similarity coefficient is large, the documents are very similar and appear in close proximity to each other in the document space. And if the similarity coefficient is small, the documents exhibit little similarity and are widely separated in the document space.
`
`-310
`
`IPR2017-01039
`Unified EX1022 Page 2
`
`
`
The discrimination value of a term is then defined as a measure of the change in space separation which occurs when a given term is assigned to the document collection. A good discriminator is a term which, when assigned to a document, decreases the space density (i.e., renders the documents less similar to each other). Conversely, the assignment of a poor discriminator increases the space density. By computing the density of the document space before and after the assignment of each term, the discrimination value of the term can be determined. The terms can then be ranked in decreasing order of their discrimination values.

Salton, Yang, and Yu [SALTON75a] have used discrimination value to determine three categories of discriminators, namely, good, poor, and indifferent discriminators. A term with a positive discrimination value has been found to be a good discriminator. Salton et al suggest that these terms be used directly as index terms. Those terms with negative discrimination values are poor discriminators; the retrieval properties of such terms can be transformed by including them in appropriate phrases. The majority of terms are indifferent discriminators with near-zero discrimination values. The retrieval capabilities of these terms can be enhanced by means of their incorporation in appropriate thesaurus classes.

Thus the discrimination value model presents a criterion for the formation of global thesauri. According to this model, a thesaurus is composed of a set of thesaurus classes. A thesaurus class is composed of a group of terms or word types. The terms within a class should be indifferent discriminators (i.e., those with near-zero discrimination values). Thus in order to use the criterion suggested by the discrimination model, the discrimination value of each term in the collection must be calculated and the terms ranked as good, indifferent, and poor discriminators according to their discrimination values.

But the calculation of discrimination value is normally expensive. Two different approaches have been used. One approach, the so-called exact method, involves the calculation of all pairwise similarities between the document vectors of the collection. For a collection of n documents and m word types, the complexity of this algorithm is O(mn2). The second or approximate approach to the calculation of discrimination value involves the construction of an artificial, average document, the centroid, and computes the sum of the similarities of each document with the centroid. The centroid algorithm is O(mn).

Modifications which improve the execution times associated with both the exact and the approximate methods of calculating discrimination value have been suggested [CRAWFORD75, WILLET85, CROUCH88]. Although the discrimination values produced by these two approaches differ significantly for a particular collection, it has been shown that the rankings of the terms are in fact highly compatible [CROUCH88]. Thus of these two methods, the more efficient centroid approach is the obvious method of choice when discrimination values must be calculated.
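A minimal sketch of the approximate, centroid-based calculation follows. It assumes cosine similarity and raw frequency vectors, and measures the change in density when a term is removed from every document; it illustrates the idea rather than reproducing the authors' implementation:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 for a zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def density(vectors):
    """Space density: sum of the similarities of each document with the
    centroid (the artificial 'average document') -- the O(mn) approach."""
    m, n = len(vectors[0]), len(vectors)
    centroid = [sum(v[k] for v in vectors) / n for k in range(m)]
    return sum(cosine(v, centroid) for v in vectors)

def discrimination_value(vectors, k):
    """Change in density when term k is removed from every document.
    A positive value means assigning term k spreads the space
    (a good discriminator under this sign convention)."""
    without_k = [[w for i, w in enumerate(v) if i != k] for v in vectors]
    return density(without_k) - density(vectors)
```

With the exact method, `density` would instead sum all pairwise similarities, raising the cost from O(mn) to O(mn2).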
`
`-31
`
`I-
`
`IPR2017-01039
`Unified EX1022 Page 3
`
`
`
But for a document collection of any size, even the centroid approach to the calculation of discrimination value may be expensive. A reasonable alternative has been provided by Salton, Yang, and Yu, who suggest the use of document frequency as an approximation to discrimination value. For any term k in a document collection, the document frequency of term k, dk, is defined as the number of documents in which term k appears. Empirical results indicate that document frequency and discrimination value are strongly correlated. Let n represent the number of documents in a collection whose terms are ranked by increasing document frequency. According to [SALTON75a], those terms whose document frequency is less than n/100 may be considered low frequency terms. The discrimination values of these terms are normally near-zero, and these terms as a whole may be considered indifferent discriminators. Likewise, terms whose document frequencies are greater than n/10 may be considered high frequency terms. These terms normally have negative discrimination values and are considered poor discriminators. The remaining terms (i.e., n/100 ≤ dk ≤ n/10) make up the set of good discriminators. The discrimination values of these terms are positive.

Thus document frequency may be used as an approximation to discrimination value. Thesaurus classes, which theoretically should consist of groupings of terms with near-zero discrimination values, may instead be constructed of sets of low frequency terms. Since document frequency is readily available for every term in a collection, the cost associated with this approach is minimal.
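The frequency-based partition can be sketched directly from the n/100 and n/10 cutoffs (the term names in the test below are invented):

```python
def partition_by_document_frequency(doc_freq, n):
    """Partition the terms of an n-document collection by document
    frequency d_k, following the [SALTON75a] heuristic:
      d_k <  n/100          -> low frequency  (indifferent discriminators)
      d_k >  n/10           -> high frequency (poor discriminators)
      n/100 <= d_k <= n/10  -> good discriminators
    """
    low, good, high = [], [], []
    for term, dk in sorted(doc_freq.items()):
        if dk < n / 100:
            low.append(term)
        elif dk > n / 10:
            high.append(term)
        else:
            good.append(term)
    return low, good, high
```

The low frequency group is the raw material for thesaurus classes; the cost is a single pass over the document frequencies.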
`
AN APPROACH TO THESAURUS CONSTRUCTION
`
An experiment was designed to investigate the feasibility of constructing a global thesaurus based on low frequency terms. The term "global thesaurus" is used to differentiate this type of thesaurus from the "local thesaurus" described by Attar and Fraenkel [ATTAR77]. In a global approach, thesaurus classes, once constructed, are used to index both documents and queries. The local thesaurus, in contrast, uses information obtained from the documents retrieved in response to a particular query to modify that query, which is then resubmitted to the retrieval system for processing in lieu of the original. Thus a global thesaurus is constructed prior to the indexing process and the thesaurus classes are used to index both documents and queries, whereas a local thesaurus is constructed dynamically during query processing and uses information retrieved in response to a specific query to modify only that query.
`
Constructing Thesaurus Classes
`
Constructing a global thesaurus based on the discrimination value model calls for the generation of thesaurus classes consisting of indifferent discriminators or (as a viable alternative) low frequency terms.
`
`
`
The question of how the classes themselves are to be constructed remains open. Intuitively, a thesaurus class should consist of terms which are closely related in the context of the current collection. (Such terms are not necessarily synonyms in the conventional sense.)

One approach to generating groups of closely related terms is to cluster all the low frequency terms in the collection. The low frequency terms which cluster together might then be considered a thesaurus class. Unfortunately, in an environment where there is little information to exploit, the resultant clusters are seldom meaningful. This is the case in the low frequency domain, where terms are contained in only a small number of the documents in the collection.

An alternative approach is to cluster the documents of the collection and to generate thesaurus classes from the low frequency terms contained in the document clusters. A key question then arises, namely, what type of clustering algorithm should be used? The choice is dictated largely by the type of clusters an algorithm produces. In order to generate meaningful thesaurus classes, the low frequency terms in a class should come from closely related documents. This implies that the document clusters themselves should be small and tight. An algorithm which produces clusters of this type is the complete-link clustering algorithm, one of a class of agglomerative, hierarchical clustering algorithms that has received some attention in the literature [VANRIJSB79, VOORHEES85, VOORHEES86]. Consequently, this was the algorithm used to cluster the documents in our test collections.
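A naive sketch of complete-link merging follows (a quadratic search for the best pair at each step; illustrative only, not the implementation used in the experiment):

```python
def complete_link(sim, threshold):
    """Agglomerative complete-link clustering.
    sim[i][j] is the similarity between documents i and j.  Two clusters
    merge at the similarity of their *least* similar pair (the complete
    link); merging stops once the best link falls below the threshold."""
    clusters = [{i} for i in range(len(sim))]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                link = min(sim[i][j] for i in clusters[a] for j in clusters[b])
                if link > best:
                    best, pair = link, (a, b)
        if best < threshold:
            break
        a, b = pair
        clusters[a] |= clusters[b]
        del clusters[b]
    return clusters
```

Because a cluster's link is its weakest pair, complete link yields the small, tight clusters the thesaurus classes require, at the price of higher cost than single-link methods.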
`
Constructing a Global Thesaurus
`
The following procedure has been utilized to construct a global thesaurus:

1. The document collection is clustered via the complete-link algorithm.
2. The resultant hierarchy is traversed and thesaurus classes are generated, based on specified, user-supplied parameters.
3. The documents and queries are indexed by the thesaurus classes generated in step 2.

The characteristics of the thesaurus classes are determined by the following user-supplied parameters:

(a) THRESHOLD VALUE

Application of the complete-link clustering algorithm produces a cluster hierarchy in which the tightest clusters (i.e., those which cluster at the highest threshold values) lie at the bottom of the cluster tree. These nodes are the leaves of the tree. For example, consider Fig. 1. The squares represent documents and the numbers in the circles represent the levels at which the documents cluster. Documents A and B cluster at a threshold value of 0.089, D and E cluster at a level of 0.149, and document C clusters with the D-E subtree at a level of 0.077. The A-B subtree and the C-D-E subtree cluster at a threshold value of 0.029. The user-supplied threshold value largely determines the documents from which terms are selected for inclusion in a thesaurus class. In Fig. 1, a threshold value of 0.090 would return only the D-E document cluster.
`
`-313-
`
`IPR2017-01039
`Unified EX1022 Page 5
`
`
`
Fig. 1 A sample hierarchy generated by the complete link algorithm
`
The level at which the documents cluster must be greater than or equal to the specified threshold in order for the cluster to be recognized. A threshold value of 0.085, on the other hand, would return two clusters (A-B and D-E). If a threshold value of 0.075 were specified, two clusters would be returned: A-B and either D-E or C-D-E.
`
(b) NUMBER OF DOCUMENTS IN A CLUSTER

Since both D-E and C-D-E meet the threshold value of 0.075 in Fig. 1, another parameter is needed to determine the number of documents to be included in the final cluster. This parameter, supplied by the user, is the number of documents in a cluster. In the example, a value of 2 for this parameter would insure that only the D-E cluster would be returned. A value of 3 or greater would allow the entire C-D-E cluster to be returned.
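Using the merge levels of Fig. 1, the joint effect of these two parameters can be sketched as a filter over (level, members) nodes of the cluster tree (a simplification that ignores how nested subtrees such as C-D-E around D-E would actually be traversed):

```python
# Fig. 1 merge nodes: (threshold level, documents in the cluster)
nodes = [(0.089, {"A", "B"}),
         (0.149, {"D", "E"}),
         (0.077, {"C", "D", "E"}),
         (0.029, {"A", "B", "C", "D", "E"})]

def select_clusters(nodes, threshold, max_docs):
    """Keep clusters formed at or above the threshold whose size does
    not exceed the user-supplied number of documents per cluster."""
    return [members for level, members in nodes
            if level >= threshold and len(members) <= max_docs]
```

At threshold 0.090 only D-E survives; at 0.085 both A-B and D-E are returned; at 0.075 a cluster-size limit of 2 excludes C-D-E in favor of D-E.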
`
(c) MINIMUM DOCUMENT FREQUENCY

Once the document clusters have been selected, thesaurus classes are formed from the low frequency terms contained in those clusters. Before such terms can be selected, the minimum document frequency of each term, i.e., the criterion by which terms are determined to be "low frequency," must be specified. The guidelines in [SALTON75a] may be helpful in setting this parameter.
`
`((3)
`
`OF GLASS FORMA TtON ME’ViOD
`SPECIFiCATlON
`clusters have been determined
`(via the settings
`When
`the document
`from
`the clustered
`first
`two parameters)
`and
`the
`low
`frequency
`terms
`the
`themselves
`can be
`have been gathered,
`the
`thesaurus
`classes
`documents
`generated.
`The simplest
`approach
`is to define
`a thesaurus
`cfass as
`the
`intersection
`of all the
`low
`frequency
`terms
`in a cluster.
`Alternate methods,
`using
`(1) union
`and
`(2)
`restricted
`intersection
`{ranking
`the
`terms
`in
`the
`intersection
`and using
`the
`top
`ranked
`terms
`from
`that set)
`in
`lieu of the
`intersection,
`were also
`implemented.
`the most
`be
`to
`out
`turned
`The most
`successful
`approach
`by using
`simple
`were
`obtained
`The
`best
`results
`straightforward.
`results described
`in this paper are based on
`intersection,
`The experimental
`t hlis approach.
`
`of
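The simple-intersection method can be sketched as follows (the term names are invented for the example):

```python
def thesaurus_class(cluster_docs, low_frequency_terms):
    """A thesaurus class: the low frequency terms common to every
    document in a (small, tight) document cluster."""
    term_sets = [set(doc) & low_frequency_terms for doc in cluster_docs]
    return set.intersection(*term_sets)

low_frequency_terms = {"boolean", "centroid", "dendrogram"}   # invented
cluster = [["boolean", "centroid", "query", "model"],
           ["centroid", "dendrogram", "boolean", "space"]]
```

The union method would replace `set.intersection` with `set.union`; restricted intersection would rank the intersected terms and keep only the top-ranked subset.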
`
`-3 14-
`
`IPR2017-01039
`Unified EX1022 Page 6
`
`
`
Once the thesaurus classes are generated, a collection can be indexed by replacement or augmentation (see step 3 above). Replacement entails replacing a term in a vector by a thesaurus class which contains that term. Augmentation involves adding the thesaurus class to the vector; the original term remains in the vector, although its weight may be modified. (For a discussion of term weighting methods, see [SALTON87].) Both methods were implemented in this experiment. The results obtained by augmentation are in every case vastly superior to those obtained through replacement, and thus this discussion focuses on indexing by augmentation.
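Augmentation can be sketched as follows (dictionary vectors, class identifiers, and weights are invented for the example; the class weights are assumed to be given):

```python
def augment(vector, classes, class_weights):
    """Index by augmentation: each thesaurus class containing one of the
    vector's terms is added as a new entry; the original terms remain."""
    indexed = dict(vector)                      # keep the original terms
    for cid, members in classes.items():
        if members & set(vector):               # class contains a vector term
            indexed[cid] = class_weights[cid]   # add the class to the vector
    return indexed

vector = {"centroid": 2.0, "query": 1.0}
classes = {"TC1": {"centroid", "dendrogram"}, "TC2": {"boolean"}}
weights = {"TC1": 0.4, "TC2": 0.3}
```

Replacement would instead delete the matching term and keep only the class entry.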
Another factor of some interest during the indexing process is the weighting of the new thesaurus classes. The classes themselves must be weighted so as to ensure that they neither override the original vector terms nor have a negligible effect during retrieval. An intuitively appealing approach is to compute the thesaurus class weight as a function of the terms contained in that class. As a result, the thesaurus class weights in this experiment were computed according to the following formula, where tc-wt denotes the thesaurus class weight, tc-con-wti denotes the weight of term i in the thesaurus class, |tc-con| represents the number of terms in the thesaurus class, and avg-tc-con-wt represents the average weight of a term in the thesaurus class. Then

    avg-tc-con-wt = ( sum for i = 1 to |tc-con| of tc-con-wti ) / |tc-con|

and

    tc-wt = ( avg-tc-con-wt / |tc-con| ) * 0.5

Thus a thesaurus class weight is computed by dividing the average weight of a term in the class by the number of terms in the class and downweighting that value by 0.5. The experimental results indicate that this approach is indeed effective.
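The formula above translates directly into a small sketch:

```python
def thesaurus_class_weight(member_weights):
    """tc-wt = (avg-tc-con-wt / |tc-con|) * 0.5, where avg-tc-con-wt is
    the average weight of a term in the class and |tc-con| is the number
    of terms in the class."""
    avg_tc_con_wt = sum(member_weights) / len(member_weights)
    return (avg_tc_con_wt / len(member_weights)) * 0.5
```

Dividing by the class size keeps large classes from overriding the original vector terms, and the 0.5 factor downweights the class relative to a single original term.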
`
THE EXPERIMENT

Methodology
`
Global thesauri were constructed for two test collections using the procedure described in the previous section. To recap: the collections were clustered using the complete-link algorithm. The hierarchies were traversed and thesaurus classes were generated based on user-supplied parameters (threshold value, number of documents in a cluster, minimum document frequency, and specification of class formation method). The documents and queries were then indexed by the thesaurus classes. In all cases, the indexing was performed by augmentation, and the thesaurus class weights were generated based on the weighting scheme
`
`-315-
`
`IPR2017-01039
`Unified EX1022 Page 7
`
`
`
detailed in the previous section. The weights of the original terms were not modified.
`
Setting the Parameters
`
With reference to the parameters used in step 2 of the thesaurus construction procedure:

(a) Threshold value is collection-dependent. After a collection is clustered, appropriate threshold values may be determined by examining the cluster tree. The setting of the threshold value is the most vital factor in forming the thesaurus classes. This setting, in conjunction with the class formation method, determines the size of the thesaurus class. If the setting is too high, thesaurus classes will be generated with either too many terms (using union for class formation) or with too few terms (using intersection). If the threshold is too low, very few classes will be generated.

(b) The number of documents in a cluster obviously also affects thesaurus class size, but the overall effect of this parameter is considerably less than that of the threshold value. An initial premise of this experiment is that the terms in a thesaurus class should come from closely related documents, which in turn implies that the document clusters themselves must be small and tight. For investigative purposes this parameter was allowed to range from 2 to 5 in our test runs, but it is always subject to (i.e., governed by) the threshold value, and in most cases the actual number of documents in a cluster ranges from 2 to 3.

(c) Minimum document frequency can be set by using the guidelines suggested in [SALTON75a].

(d) Class formation method is the parameter which specifies exactly how the thesaurus class itself is formed once the document cluster has been determined. Of the three approaches investigated (union, intersection, and restricted intersection), simple intersection gives the best results in all cases. Thus a thesaurus composed of thesaurus classes consisting of sets of low frequency terms common to a small group of closely related documents appears to have the greatest potential for effecting an improvement in system performance.
`
`Collection Weighting
`
Once the global thesaurus has been constructed and the collection (documents and queries) indexed by the thesaurus classes, the vectors are then weighted appropriately, using the facilities made available through the Smart experimental retrieval system [BUCKLEY85]. (The original vectors, before modification, are weighted by simple term frequency.) Smart offers a number of options with respect to weighting. The most common approach is probably to weight the vectors using "tf-idf" weights (i.e., term frequency multiplied by inverse document frequency). This scheme is appealing because it allows a term to be weighted by a factor dependent upon both its importance within a document (its term frequency)
`
`
`
and its importance within the document collection as a whole (its inverse document frequency). Moreover, tf-idf weighting has proved most effective in comparison with other weighting schemes in numerous experiments.

The collection weighting method used in this experiment is a variation of the tf-idf scheme, wherein the term frequency component is referred to as "augmented normalized term frequency." In this method, the term frequency is normalized and augmented and is thus guaranteed to lie between 0.5 and 1.0. This component then replaces the tf component in the tf-idf scheme to generate the weight for this term in the modified document collection.
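A sketch of this weighting scheme, assuming the common 0.5 + 0.5·tf/max-tf form of the augmented normalized term frequency and log(N/df) for the idf component (the exact Smart variants are not specified here):

```python
import math

def augmented_normalized_tf(tf, max_tf):
    """Normalize tf by the largest tf in the vector and augment, so the
    component lies between 0.5 and 1.0 for any term that appears."""
    return 0.5 + 0.5 * (tf / max_tf)

def weight(tf, max_tf, n_docs, doc_freq):
    """Augmented normalized term frequency times inverse document
    frequency -- the modified tf-idf weight."""
    return augmented_normalized_tf(tf, max_tf) * math.log(n_docs / doc_freq)
```

The augmented component keeps rare-but-present terms from being drowned out, while the idf factor still rewards terms that are selective across the collection.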
`
Similarity Measure

A common measure of the similarity between two vectors dj and dk is the cosine function

    cos(dj, dk) = (dj · dk) / ( ||dj|| ||dk|| )

where dj · dk is the inner product and ||dj||2 = (dj · dj). This is the measure used to compute similarity in this experiment.
`
Collection Characteristics

The thesaurus construction procedure described in this paper was run on two standard test collections, namely, ADI and Medlars. The characteristics of these collections are found in Table I. (For more complete descriptions of these collections, see [FOX83a, FOX83b, VOORHEES85].)
`
Table I
Collection Statistics

Collection    Number of Vectors    Number of Terms    Mean Number of Terms per Vector
ADI                  82                  822                       25.5
Medlars            1033                 6927                       51.6
`
`-317-
`
`IPR2017-01039
`Unified EX1022 Page 9
`
`
`
Empirical Results
`
The results of the experiment, as applied to the ADI and Medlars collections, are displayed in Table II. The normal evaluation measures, precision and recall, are used. (Recall is defined as the proportion of relevant material retrieved, while precision is the proportion of retrieved material that is relevant.) Part (a) gives the statistics in terms of precision and recall when the thesaurus generation procedure is applied to the ADI collection (using in this case the following parameters: threshold level of 0.075, 5 documents per cluster, a minimum document frequency of 20, and intersection as the class formation method). The statistics which apply to the original, unmodified ADI collection are found in column 1. Column 2 pertains to the modified ADI collection after application of the global thesaurus. The average precision at 3 points of recall (0.25, 0.50, and 0.75) is calculated for each collection and shows an overall improvement of 10.6 per cent when the global thesaurus is applied.

Part (b) gives a similar view when the procedure is applied to the Medlars collection (with corresponding parameter settings of 0.120, 3, 50, and intersection). Again, column 1 applies to the original Medlars collection and column 2 to the same collection after application of the global thesaurus. The improvement in precision (based on the 3 point average) of the modified collection over the original is 15.8 per cent.
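The 3-point average and the resulting percentage improvement are computed as follows (the precision values below are invented placeholders, not the Table II figures):

```python
def three_point_average(precision_at_recall):
    """Average precision at the three recall points 0.25, 0.50, 0.75."""
    return sum(precision_at_recall[r] for r in (0.25, 0.50, 0.75)) / 3

def percent_improvement(original, modified):
    """Percentage change of the modified average over the original."""
    return 100.0 * (modified - original) / original

# Hypothetical precision-at-recall values for illustration only:
original = {0.25: 0.60, 0.50: 0.40, 0.75: 0.20}
modified = {0.25: 0.66, 0.50: 0.44, 0.75: 0.22}
```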
`
`SUMMARY
`
An experiment was designed to investigate the feasibility of constructing a global thesaurus based on the discrimination value model and an appropriate clustering algorithm. A thesaurus construction procedure was designed, implemented, and applied to two standard test collections. The preliminary results show that substantial improvements can be effected by making a properly constructed global thesaurus available to a retrieval system during processing. Moreover, as the empirical results indicate, it is possible to construct such a thesaurus automatically. The question of whether it is possible to construct such a thesaurus for other collections remains to be answered.
`
`
`
`
TABLE II
A Comparison of Retrieval Statistics: The Original versus the Thesaurus-Based Collection

(a) ADI collection (threshold .075): precision at 21 recall levels (0.00 to 1.00) and summary statistics (norm recall, norm precision, rank recall, log precision, precision and recall after 10 and 30 docs, and E-measures) for the original collection (column 1) and the thesaurus-based collection (column 2), with the average precision at 3 intermediate points and its percentage change.

(b) Medlars collection (threshold .120): the same statistics for the original (column 1) and thesaurus-based (column 2) collections.
`
`
`
REFERENCES

[ATTAR77] Attar, R., and A.S. Fraenkel, Local Feedback in Full-text Retrieval Systems, Journal of the ACM, 24(3):397-417, 1977.

[BUCKLEY85] Buckley, C., Implementation of the Smart Information Retrieval System, Tech. Report 85-686, Dept. of Computer Science, Cornell University, May 1985.

[CRAWFORD75] Crawford, R., The Computation of Discrimination Values, Information Processing and Management, 11:249-253, 1975.

[CROUCH88] Crouch, C., An Analysis of Approximate Versus Exact Discrimination Values, Information Processing and Management, 24(1):5-16, 1988.

[FOX83a] Fox, E., Extending the Boolean and Vector Space Models of Information Retrieval with P-norm Queries and Multiple Concept Types, Ph.D. Thesis, Cornell University, Aug. 1983.

[FOX83b] Fox, E., Characteristics of Two New Experimental Collections in Computer and Information Science Containing Textual and Bibliographic Concepts, Tech. Report 83-561, Dept. of Computer Science, Cornell University, Sept. 1983.

[SALTON75a] Salton, G., C.S. Yang, and C.T. Yu, A Theory of Term Importance in Automatic Text Analysis, Journal of the ASIS, 26(1):33-44, 1975.

[SALTON75b] Salton, G., A. Wong, and C.S. Yang, A Vector Space Model for Automatic Indexing, Communications of the ACM, 18:613-620, 1975.

[SALTON76] Salton, G., and A. Wong, On the Role of Words and Phrases in Automatic Text Analysis, Computers and the Humanities, 10:69-87, 1976.

[SALTON87] Salton, G., and C. Buckley, Term Weighting Approaches in Automatic Text Retrieval, Tech. Report 87-881, Dept. of Computer Science, Cornell University, Nov. 1987.

[VANRIJSB79] van Rijsbergen, C.J., Information Retrieval, 2nd edition, Butterworths, London, England, 1979.

[VOORHEES85] Voorhees, E., The Effectiveness and Efficiency of Agglomerative Hierarchic Clustering in Document Retrieval, Tech. Report 85-705, Dept. of Computer Science, Cornell University, Oct. 1985.

[VOORHEES86] Voorhees, E., Implementing Agglomerative Hierarchic Clustering