A CLUSTER-BASED APPROACH TO THESAURUS CONSTRUCTION

Carolyn J. Crouch
Department of Computer Science
Tulane University
New Orleans, LA 70118
The importance of a thesaurus in the successful operation of an information retrieval system is well recognized. Yet techniques which support the automatic generation of thesauri remain largely undiscovered. This paper describes one approach to the automatic generation of global thesauri, based on the discrimination value model of Salton, Yang, and Yu and on an appropriate clustering algorithm. This method has been implemented and applied to two document collections. Preliminary results indicate that this method, which produces improvements in retrieval performance in excess of 10 and 15 percent in the test collections, is viable and worthy of continued investigation.
INTRODUCTION
A major factor in the successful operation of any information retrieval system is the set of dictionaries available for use by the system. Of these dictionaries (e.g., statistical and/or syntactic phrase dictionaries, thesauri, term hierarchies, etc.), the dictionary having the greatest potential impact on system performance is undoubtedly the thesaurus. Although the benefits accruing from the use of a well constructed thesaurus in terms of increased system performance are well recognized, the methodology for automatically creating such a thesaurus remains unspecified. In fact, virtually all thesauri presently in use are idiosyncratic. Thus a topic of considerable interest to researchers aiming to improve the overall performance of information retrieval systems is automatic thesaurus construction.

This paper describes an approach to the automatic construction of a global thesaurus, based on the discrimination value model of Salton, Yang, and Yu [SALTON75a] and on an appropriate clustering algorithm. The discrimination value model itself is based on the vector space model described below.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.

© 1988 ACM 0-89791-274-8/88/0600-0309 $1.50
The Vector Space Model
One of the major models in information retrieval is the vector space model. This model views each document in the document collection as a set of unique words or word types. Each document may then be regarded as a term vector, and the complete document collection becomes a vector space of dimension m, where m is the number of word types in the collection. In the vector space model, a document vector, dj, is represented by a set of terms, djk, 1 ≤ k ≤ m, where djk represents the frequency (or weight) of term k in document j (i.e., the number of times term k appears in document j). If djk = 0, term k does not appear in document dj. Queries, like documents, are represented by weighted term vectors.
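As a concrete illustration (added here, not part of the original paper), such term vectors can be built as follows; the whitespace tokenizer and the toy documents are assumptions of the sketch:

```python
# Illustrative sketch: build term-frequency vectors d_jk over the word
# types of a toy collection. The tokenizer and the documents themselves
# are assumptions for the example only.
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat", "cats and dogs run"]
tokenized = [d.split() for d in docs]

# The word types define the m dimensions of the vector space.
vocab = sorted({t for doc in tokenized for t in doc})

def term_vector(tokens):
    # d_jk = number of times word type k occurs in document j
    counts = Counter(tokens)
    return [counts[t] for t in vocab]

vectors = [term_vector(doc) for doc in tokenized]
print(len(vocab), vectors[0])
```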
Given any two term vectors, the similarity between the vectors may be assumed to be inversely related to the angle between them. If the two vectors coincide, the angle between them is zero, and the vectors are identical. In two dimensions, the vector space may be represented by its envelope, and the (normalized) vectors are then viewed as points in the vector space. The distance between any two points is inversely related to the similarity of the corresponding document vectors. The smaller the distance between two points, the smaller the angle between the corresponding vectors, and the greater the similarity of the vectors in terms of the number of word types they have in common.
Salton et al [SALTON75a, SALTON75b, SALTON76] have shown that the best document space for retrieval purposes is one which maximizes the average separation between documents in the document space. In this space, it is easier to distinguish between documents and thus easier to retrieve documents which are most similar to a given query. The model which allows the terms in a collection to be ranked in order of their effect on space density is called the discrimination value model.
The Discrimination Value Model
The discrimination value model [SALTON75a] assigns specific roles to single terms, term phrases, and term classes for content analysis purposes and provides a framework within which each potential index term in a collection can be ranked in accordance with its usefulness as a document discriminator. It also offers a reasonable physical interpretation of the indexing process.

If we consider a collection of documents, each represented by a set of weighted m-dimensional vectors, then the similarity coefficient computed between any two term vectors can be interpreted as a measure of the closeness or relatedness of the vectors in m-space. If the similarity coefficient is large, the documents are very similar and appear in close proximity to each other in the document space. And if the similarity coefficient is small, the documents exhibit little similarity and are widely separated in the document space.
The discrimination value of a term is then defined as a measure of the change in space separation which occurs when a given term is assigned to the document collection. A good discriminator is a term which, when assigned to a document, decreases the space density (i.e., renders the documents less similar to each other). Conversely, the assignment of a poor discriminator increases the space density. By computing the density of the document space before and after the assignment of each term, the discrimination value of the term can be determined. The terms can then be ranked in decreasing order of their discrimination values.
Salton, Yang, and Yu [SALTON75a] have used discrimination value to determine three categories of discriminators, namely, good, poor, and indifferent discriminators. A term with a positive discrimination value has been found to be a good discriminator. Salton et al suggest that these terms be used directly as index terms. Those terms with negative discrimination values are poor discriminators; the retrieval properties of such terms can be transformed by including them in appropriate phrases. The majority of terms are indifferent discriminators with near-zero discrimination values. The retrieval capabilities of these terms can be enhanced by means of their incorporation in appropriate thesaurus classes.
Thus the discrimination value model presents a criterion for the formation of global thesauri. According to this model, a thesaurus is composed of a set of thesaurus classes. A thesaurus class is composed of a group of terms or word types. The terms within a class should be indifferent discriminators (i.e., those with near-zero discrimination values). Thus in order to use the criterion suggested by the discrimination model, the discrimination value of each term in the collection must be calculated and the terms ranked as good, indifferent, and poor discriminators according to their discrimination values.
But the calculation of discrimination value is normally expensive. Two different approaches have been used. One approach, the so-called exact method, involves the calculation of all pairwise similarities between the document vectors of the collection. For a collection of n documents and m word types, the complexity of this algorithm is O(mn²). The second or approximate approach to the calculation of discrimination value involves the construction of an artificial, average document, the centroid, and computes the sum of the similarities of each document with the centroid. The centroid algorithm is O(mn). Modifications which improve the execution times associated with both the exact and the approximate methods of calculating discrimination value have been suggested [CRAWFORD75, WILLET85, CROUCH88]. Although the discrimination values produced by these two approaches differ significantly for a particular collection, it has been shown that the rankings of the terms are in fact highly compatible [CROUCH88]. Thus of these two methods, the more efficient centroid approach is the obvious method of choice when discrimination values must be calculated.
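To make the approximate method concrete, the following is a naive sketch of the centroid calculation (our own simplification: unweighted cosine similarity, and the centroid is recomputed for every term, so this sketch runs in O(m²n) rather than the O(mn) of an incremental implementation):

```python
# Naive sketch of centroid-based discrimination values. Space density
# is the average document-centroid cosine similarity; DV_k is the
# density of the space without term k minus the density with it, so a
# positive DV_k means assigning term k spreads the documents apart
# (a good discriminator). Recomputing the centroid per term is for
# clarity only; an efficient version updates it incrementally.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def density(vectors, skip=None):
    keep = [k for k in range(len(vectors[0])) if k != skip]
    centroid = [sum(v[k] for v in vectors) / len(vectors) for k in keep]
    return sum(cosine([v[k] for k in keep], centroid)
               for v in vectors) / len(vectors)

def discrimination_values(vectors):
    base = density(vectors)
    return [density(vectors, skip=k) - base for k in range(len(vectors[0]))]
```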
But the centroid approach to the calculation of discrimination value may be expensive for a document collection of any size. A reasonable alternative has been provided by Salton, Yang, and Yu, who suggest the use of document frequency as an approximation to discrimination value. For any term k in a document collection, the document frequency of term k, dk, is defined as the number of documents in which term k appears. Empirical results indicate that document frequency and discrimination value are strongly correlated. Let n represent the number of documents in a collection whose terms are ranked by increasing document frequency. According to [SALTON75a], those terms whose document frequency is less than n/100 may be considered low frequency terms. The discrimination values of these terms are normally near-zero, and these terms as a whole may be considered indifferent discriminators. Likewise, terms whose document frequencies are greater than n/10 may be considered high frequency terms. These terms normally have negative discrimination values and are considered poor discriminators. The remaining terms (i.e., n/100 ≤ dk ≤ n/10) make up the set of good discriminators. The discrimination values of these terms are positive.
Thus document frequency may be used as an approximation to discrimination value. Thesaurus classes, which theoretically should consist of groupings of terms with near-zero discrimination values, may instead be constructed of sets of low frequency terms. Since document frequency is readily available for every term in a collection, the cost associated with this approach is minimal.
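A sketch of this frequency-based partition follows (the postings-list input format is our assumption):

```python
# Partition word types by document frequency using the n/100 and n/10
# guidelines of [SALTON75a]. `postings` maps each term to the set of
# documents containing it; this input format is assumed for the sketch.
def partition_by_df(postings, n):
    low, good, high = [], [], []
    for term, doc_set in postings.items():
        df = len(doc_set)
        if df < n / 100:
            low.append(term)    # near-zero DV: thesaurus class candidates
        elif df > n / 10:
            high.append(term)   # negative DV: phrase candidates
        else:
            good.append(term)   # positive DV: usable directly as index terms
    return low, good, high
```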
AN APPROACH TO THESAURUS CONSTRUCTION
An experiment was designed to investigate the feasibility of constructing a global thesaurus based on low frequency terms. The term "global thesaurus" is used to differentiate this type of thesaurus from the "local thesaurus" described by Attar and Fraenkel [ATTAR77]. In a global approach, thesaurus classes, once constructed, are used to index both documents and queries. The local thesaurus, in contrast, uses information obtained from the documents retrieved in response to a particular query to modify that query, which is then resubmitted to the retrieval system for processing in lieu of the original. Thus a global thesaurus is constructed prior to the indexing process and the thesaurus classes are used to index both documents and queries, whereas a local thesaurus is constructed dynamically during query processing and uses information retrieved in response to a specific query to modify only that query.
Constructing Thesaurus Classes
Constructing a global thesaurus based on the discrimination value model calls for the generation of thesaurus classes consisting of indifferent discriminators or (as a viable alternative) low frequency terms. The question of how the classes themselves are to be constructed remains open. Intuitively, a thesaurus class should consist of terms which are closely related in the context of the current collection. (Such terms are not necessarily synonyms in the conventional sense.)
One approach to generating groups of closely related terms is to cluster all the terms in the collection. The low frequency terms which cluster together might then be considered a thesaurus class. Unfortunately, in an environment where there is little information to exploit, the resultant clusters are seldom meaningful. This is the case in the low frequency domain, where terms are contained in only a small number of the documents in the collection.
An alternative approach is to cluster the documents of the collection and to generate thesaurus classes from the low frequency terms contained in the document clusters. A key question then arises, namely, what type of clustering algorithm should be used? The choice is dictated largely by the type of clusters an algorithm produces. In order to generate meaningful thesaurus classes, the low frequency terms in a class should come from closely related documents. This implies that the document clusters themselves should be small and tight. An algorithm which produces clusters of this type is the complete-link clustering algorithm, one of a class of agglomerative, hierarchical clustering algorithms that has received some attention in the literature [VANRIJSB79, VOORHEES85, VOORHEES86]. Consequently, this was the algorithm used to cluster the documents in our test collections.
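For reference, complete-link clustering of document vectors can be sketched with standard library routines (the use of SciPy, and of cosine distance, i.e. one minus similarity, are our assumptions for illustration; the paper does not specify its implementation at this level):

```python
# Sketch: complete-link clustering of documents with SciPy. We cluster
# on cosine distance (1 - similarity), so a similarity threshold t in
# the paper's sense corresponds to a distance cutoff of 1 - t here.
# The toy vectors are illustrative.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

vectors = np.array([[2., 1., 0.], [2., 2., 0.], [0., 1., 3.], [0., 2., 3.]])
dists = pdist(vectors, metric="cosine")      # condensed distance matrix
tree = linkage(dists, method="complete")     # complete-link hierarchy

# Flat clusters in which every pair of documents has distance <= 0.1,
# i.e. pairwise similarity >= 0.9 under complete linkage.
labels = fcluster(tree, t=0.1, criterion="distance")
print(labels)
```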
Constructing a Global Thesaurus
The following procedure has been utilized to construct a global thesaurus:

1. The document collection is clustered via the complete-link algorithm.
2. The resultant hierarchy is traversed and thesaurus classes are generated, based on specified, user-supplied parameters.
3. The documents and queries are indexed by the thesaurus classes generated in step 2.

The characteristics of the thesaurus classes are determined by the following user-supplied parameters:
(a) THRESHOLD VALUE
Application of the complete-link clustering algorithm produces a hierarchy in which the tightest clusters (i.e., those which cluster at the highest threshold values) lie at the bottom of the cluster tree. These nodes are the leaves of the tree. For example, consider Fig. 1. The squares represent documents and the numbers in the circles represent the levels at which the documents cluster. Documents A and B cluster at a threshold value of 0.089, D and E cluster at a level of 0.149, and document C clusters with the D-E subtree at a level of 0.077. The A-B subtree and the C-D-E subtree cluster at a threshold value of 0.029.

[Fig. 1. A sample hierarchy generated by the complete-link algorithm; figure not reproduced.]

The user-supplied threshold value largely determines the documents from which terms are selected for inclusion in a thesaurus class. In Fig. 1, a threshold value of 0.090 would return only the D-E document cluster, since the level at which the documents cluster must be greater than or equal to the specified threshold in order for the cluster to be recognized. A threshold value of 0.085, on the other hand, would return two clusters (A-B and D-E). If a threshold value of 0.075 were specified, two clusters would be returned: A-B and either D-E or C-D-E.
(b) NUMBER OF DOCUMENTS IN A CLUSTER

Since both D-E and C-D-E meet the threshold value of 0.075 in Fig. 1, another parameter is needed to determine the number of documents to be included in the final cluster. This parameter, supplied by the user, is the number of documents in a cluster. In the example, a value of 2 for this parameter would insure that only the D-E cluster would be returned. A value of 3 or greater would allow the entire C-D-E cluster to be returned.
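A sketch of how parameters (a) and (b) might interact: the flat-cluster labels come from a threshold cut as in the earlier sketch, and reading this parameter as a maximum cluster size is our interpretation of the D-E example.

```python
# Keep only clusters containing an acceptable number of documents.
# `labels` assigns a cluster id to each document (e.g., from fcluster).
from collections import defaultdict

def select_clusters(labels, max_docs):
    groups = defaultdict(list)
    for doc_id, cluster_id in enumerate(labels):
        groups[cluster_id].append(doc_id)
    return [docs for docs in groups.values() if 2 <= len(docs) <= max_docs]
```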
(c) MINIMUM DOCUMENT FREQUENCY

Once the document clusters have been selected, thesaurus classes are formed from the low frequency terms contained in those clusters. Before such terms can be selected, the minimum document frequency of each term, i.e., the criterion by which terms are determined to be "low frequency," must be specified. The guidelines in [SALTON75a] may be helpful in setting this parameter.
(d) SPECIFICATION OF CLASS FORMATION METHOD

When the document clusters have been determined (via the settings of the first two parameters) and the low frequency terms from the clustered documents have been gathered, the thesaurus classes themselves can be generated. The simplest approach is to define a thesaurus class as the intersection of all the low frequency terms in a cluster. Alternate methods, using (1) union and (2) restricted intersection (ranking the terms in the intersection and using the top ranked terms from that set) in lieu of the intersection, were also implemented. The most successful approach turned out to be the most straightforward: the best results were obtained by using simple intersection. The experimental results described in this paper are based on this approach.
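The intersection method reduces to a one-line set operation; in this sketch the per-document sets of low frequency terms are an assumed input format:

```python
# Form a thesaurus class as the intersection of the low frequency
# terms of all documents in a cluster. `low_terms` maps each document
# id to its set of low frequency terms (assumed input format).
def thesaurus_class(cluster, low_terms):
    sets = [low_terms[doc_id] for doc_id in cluster]
    return set.intersection(*sets) if sets else set()

# Union and restricted intersection would replace set.intersection
# with set.union or with a ranked cut of the intersection.
```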
Once the thesaurus classes are generated, a collection can be indexed by replacement or augmentation (see step 3 above). Replacement entails replacing a term in a vector by a thesaurus class which contains that term. Augmentation involves adding the thesaurus class to the vector; the original term remains in the vector, although its weight may be modified. (For a discussion of term weighting methods, see [SALTON87].) Both methods were implemented in this experiment. The results obtained by augmentation are in every case vastly superior to those obtained through replacement, and thus this discussion focuses on indexing by augmentation.
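Augmentation can be sketched as follows; the dictionary vector representation and the rule that each class concept is added once, with the class weight, are assumptions of the sketch rather than details given in the paper.

```python
# Sketch of indexing by augmentation: for every vector term that
# belongs to a thesaurus class, the class is added to the vector as a
# new concept; the original terms are kept.
def augment(vector, term_to_class, class_weights):
    augmented = dict(vector)                 # original terms remain
    for term in vector:
        cls = term_to_class.get(term)
        if cls is not None:
            augmented.setdefault(cls, class_weights[cls])
    return augmented
```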
Another factor of some interest during the indexing process is the weighting of the new thesaurus classes. The classes themselves must be weighted so as to ensure that they neither override the original vector terms nor have a negligible effect during retrieval. An intuitively appealing approach is to compute the thesaurus class weight as a function of the terms contained in that class. As a result, the thesaurus class weights in this experiment were computed according to the following formula, where tc-wt denotes the thesaurus class weight, tc-con-wt_i denotes the weight of term i in the thesaurus class, |tc-con| represents the number of terms in the thesaurus class, and avg-tc-con-wt represents the average weight of a term in the thesaurus class. Then
\[ \text{avg-tc-con-wt} = \frac{\sum_{i=1}^{|\text{tc-con}|} \text{tc-con-wt}_i}{|\text{tc-con}|} \]

and

\[ \text{tc-wt} = \frac{\text{avg-tc-con-wt}}{|\text{tc-con}|} \times 0.5 \]
Thus a thesaurus class weight is computed by dividing the average weight of a term in this class by the number of terms in the class and downweighting that value by 0.5. The experimental results indicate that this approach is indeed effective.
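Transcribed directly, the weighting formula is:

```python
# Direct transcription of the class weighting formula: the average
# member weight, divided again by the class size, damped by 0.5.
def tc_weight(member_weights):
    size = len(member_weights)
    avg = sum(member_weights) / size
    return (avg / size) * 0.5

# A class of 4 terms with average weight 1.0 receives weight 0.125.
print(tc_weight([1.0, 1.0, 1.0, 1.0]))
```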
THE EXPERIMENT

Methodology
Global thesauri were constructed for two test collections using the procedure described in the previous section. To recap: the collections were clustered using the complete-link algorithm. The hierarchies were traversed and thesaurus classes were generated based on user-supplied parameters (threshold value, number of documents in a cluster, minimum document frequency, and specification of class formation method). The documents and queries were then indexed by the thesaurus classes. In all cases, the indexing was performed by augmentation, and the thesaurus class weights were generated based on the weighting scheme detailed in the previous section. The weights of the original terms were not modified.
Setting the Parameters
With reference to the parameters used in step 2 of the thesaurus construction procedure:
(a) Threshold value is collection-dependent. After a collection is clustered, appropriate threshold values may be determined by examining the cluster tree. The setting of the threshold value is the most vital factor in forming the thesaurus classes. This setting, in conjunction with the class formation method, determines the size of the thesaurus class. If the setting is too high, thesaurus classes will be generated with either too many terms (using union for class formation) or with too few terms (using intersection). If the threshold is too low, very few classes will be generated.
(b) The number of documents in a cluster obviously also affects thesaurus class size, but the overall effect of this parameter is considerably less than that of the threshold value. An initial premise of this experiment is that the terms in a thesaurus class should come from closely related documents, which in turn implies that the document clusters themselves must be small and tight. For investigative purposes this parameter was allowed to range from 2 to 5 in our test runs, but it is always subject to (i.e., governed by) the threshold value, and in most cases the actual number of documents in a cluster ranges from 2 to 3.
(c) Minimum document frequency can be set by using the guidelines suggested in [SALTON75a].
(d) Class formation method is the parameter which specifies exactly how the thesaurus class itself is formed once the document cluster has been determined. Of the three approaches investigated (union, intersection, and restricted intersection), simple intersection gives the best results in all cases. Thus a thesaurus composed of thesaurus classes consisting of sets of low frequency terms common to a small group of closely related documents appears to have the greatest potential for effecting an improvement in system performance.
Collection Weighting
Once the global thesaurus has been constructed and the collection (documents and queries) indexed by the thesaurus classes, the vectors are then weighted appropriately, using facilities made available through the Smart experimental retrieval system [BUCKLEY85]. (The original vectors, before modification, are weighted by simple term frequency.) Smart offers a number of options with respect to weighting. The most common approach is probably to weight the vectors using "tf-idf" weights (i.e., term frequency multiplied by inverse document frequency). This scheme is appealing because it allows a term to be weighted by a factor dependent upon both its importance within a document (its term frequency) and its importance within the document collection as a whole (its inverse document frequency). Moreover, tf-idf weighting has proved most effective in comparison with other weighting schemes in numerous experiments.
The collection weighting method used in this experiment is a variation of the tf-idf scheme, wherein the term frequency component is referred to as "augmented normalized term frequency." In this method, the term frequency is normalized and augmented and is thus guaranteed to lie between 0.5 and 1.0. This component then replaces the tf component in the tf-idf scheme to generate the weight for this term in the modified document collection.
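The paper does not spell out the exact formula; the conventional form of augmented normalized term frequency, which satisfies the stated 0.5 to 1.0 bound, is 0.5 + 0.5·tf/max_tf, sketched here with an assumed log-based idf:

```python
# Sketch of the modified tf-idf weight. The augmented normalized term
# frequency 0.5 + 0.5 * tf / max_tf lies in [0.5, 1.0] as the paper
# requires; this exact form, and the log(N/df) idf, are conventional
# choices assumed for the sketch.
import math

def atf_idf(tf, max_tf_in_doc, n_docs, df):
    atf = 0.5 + 0.5 * (tf / max_tf_in_doc)
    idf = math.log(n_docs / df)
    return atf * idf
```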
Similarity Measure

A common measure of the similarity between two vectors dj and dk is the cosine function

\[ \cos(d_j, d_k) = \frac{d_j \cdot d_k}{\lVert d_j \rVert \, \lVert d_k \rVert} \]

where \( d_j \cdot d_k \) is the inner product and \( \lVert d_j \rVert^2 = (d_j \cdot d_j) \). This is the measure used to compute similarity in this experiment.
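In code, the cosine measure reads:

```python
# The cosine similarity defined above: inner product over the product
# of the vector norms.
import math

def cos_sim(dj, dk):
    dot = sum(a * b for a, b in zip(dj, dk))
    norm_j = math.sqrt(sum(a * a for a in dj))
    norm_k = math.sqrt(sum(b * b for b in dk))
    return dot / (norm_j * norm_k)

print(cos_sim([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]))   # parallel vectors -> 1.0
```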
Collection Characteristics

The thesaurus construction procedure described in this paper was run on two standard test collections, namely, ADI and Medlars. The characteristics of these collections are found in Table I. (For more complete descriptions of these collections, see [FOX83a, FOX83b, VOORHEES85].)
Table I. Collection Statistics

Collection | Number of Vectors | Number of Terms | Mean Number of Terms per Vector
ADI        | 82                | 822             | 25.5
Medlars    | 1033              | 6927            | 51.6
Empirical Results
The results of the experiment, as applied to the ADI and Medlars collections, are displayed in Table II. The normal evaluation measures, precision and recall, are used. (Recall is defined as the proportion of relevant material retrieved, while precision is the proportion of retrieved material that is relevant.) Part (a) gives the statistics in terms of precision and recall when the thesaurus generation procedure is applied to the ADI collection (using in this case the following parameters: a threshold level of 0.075, 5 documents per cluster, a minimum document frequency of 20, and intersection as the class formation method). The statistics which apply to the original, unmodified ADI collection are found in column 1. Column 2 pertains to the modified ADI collection after application of the global thesaurus. The average precision at 3 points of recall (0.25, 0.50, and 0.75) is calculated for each collection and shows an overall improvement of 10.6 per cent when the global thesaurus is applied.
Part (b) gives a similar view when the procedure is applied to the Medlars collection (with corresponding parameter settings of 0.120, 3, 50, and intersection). Again, column 1 applies to the original Medlars collection and column 2 to the same collection after application of the global thesaurus. The improvement in precision (based on the 3 point average) of the modified collection over the original is 15.8 per cent.
SUMMARY
An experiment was designed to investigate the feasibility of constructing a global thesaurus based on the discrimination value model and an appropriate clustering algorithm. A thesaurus construction procedure was designed, implemented, and applied to two standard test collections. The preliminary results show that substantial improvements can be effected by making a properly constructed global thesaurus available to a retrieval system during processing. Moreover, as the empirical results indicate, it is possible to construct such a thesaurus automatically. The question of whether it is possible to construct such a thesaurus for other collections remains to be answered.
TABLE II. A Comparison of Retrieval Statistics: The Original versus the Thesaurus-Based Collection

(a) ADI collection (threshold 0.075, 5 documents per cluster): precision at recall levels 0.00 through 1.00, together with summary statistics (normalized recall and precision, rank recall, log precision, precision and recall after 10 and 30 documents, and E-measures), for the original collection (column 1) and the thesaurus-based collection (column 2).

(b) Medlars collection (threshold 0.120, 3 documents per cluster): the same statistics for the original and thesaurus-based collections.

[The individual entries of Table II are not legible in the source; the 3-point average precision improves by 10.6 per cent for ADI and 15.8 per cent for Medlars, as noted in the text.]
REFERENCES

[ATTAR77] Attar, R., and A.S. Fraenkel, Local Feedback in Full-text Retrieval Systems, Journal of the ACM, 24(3):397-417, 1977.

[BUCKLEY85] Buckley, C., Implementation of the Smart Information Retrieval System, Tech. Report 85-686, Dept. of Computer Science, Cornell University, May 1985.

[CRAWFORD75] Crawford, R., The Computation of Discrimination Values, Information Processing and Management, 11:249-253, 1975.

[CROUCH88] Crouch, C., An Analysis of Approximate Versus Exact Discrimination Values, Information Processing and Management, 24(1):5-16, 1988.

[FOX83a] Fox, E., Extending the Boolean and Vector Space Models of Information Retrieval with P-norm Queries and Multiple Concept Types, Ph.D. Thesis, Cornell University, Aug. 1983.

[FOX83b] Fox, E., Characteristics of Two New Experimental Collections in Computer and Information Science Containing Textual and Bibliographic Concepts, Tech. Report 83-561, Dept. of Computer Science, Cornell University, Sept. 1983.

[SALTON75a] Salton, G., C.S. Yang, and C.T. Yu, A Theory of Term Importance in Automatic Text Analysis, Journal of the ASIS, 26(1):33-44, 1975.

[SALTON75b] Salton, G., A. Wong, and C.S. Yang, A Vector Space Model for Automatic Indexing, Communications of the ACM, 18:613-620, 1975.

[SALTON76] Salton, G., and A. Wong, On the Role of Words and Phrases in Automatic Text Analysis, Computers and the Humanities, 10:69-87, 1976.

[SALTON87] Salton, G., and C. Buckley, Term Weighting Approaches in Automatic Text Retrieval, Tech. Report 87-881, Dept. of Computer Science, Cornell University, Nov. 1987.

[VANRIJSB79] van Rijsbergen, C.J., Information Retrieval, 2nd edition, Butterworths, London, England, 1979.

[VOORHEES85] Voorhees, E., The Effectiveness and Efficiency of Agglomerative Hierarchic Clustering in Document Retrieval, Tech. Report 85-705, Dept. of Computer Science, Cornell University, Oct. 1985.

[VOORHEES86] Voorhees, E., Implementing Agglomerative Hierarchic Clustering Algorithms for Use in Document Retrieval, Information Processing and Management, 22(6):465-476, 1986.

[WILLET85] Willett, P., An Algorithm for the Calculation of Exact Term Discrimination Values, Information Processing and Management, 21(3):225-232, 1985.
