Query Expansion Using Local and Global Document Analysis

Jinxi Xu and W. Bruce Croft
Center for Intelligent Information Retrieval
Computer Science Department
University of Massachusetts, Amherst
Amherst, MA 01003-4610, USA
xu@cs.umass.edu
croft@cs.umass.edu

Abstract

Automatic query expansion has long been suggested as a technique for dealing with the fundamental issue of word mismatch in information retrieval. A number of approaches to expansion have been studied and, more recently, attention has focused on techniques that analyze the corpus to discover word relationships (global techniques) and those that analyze documents retrieved by the initial query (local feedback). In this paper, we compare the effectiveness of these approaches and show that, although global analysis has some advantages, local analysis is generally more effective. We also show that using global analysis techniques, such as word context and phrase structure, on the local set of documents produces results that are both more effective and more predictable than simple local feedback.

Permission to make digital/hard copy of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or fee.
SIGIR'96, Zurich, Switzerland. ©1996 ACM 0-89791-792-8/96/08 $3.50

1 Introduction

The problem of word mismatch is fundamental to information retrieval. Simply stated, it means that people often use different words to describe concepts in their queries than authors use to describe the same concepts in their documents. The severity of the problem tends to decrease as queries get longer, since there is more chance of some important words co-occurring in the query and relevant documents. In many applications, however, the queries are very short. For example, applications that provide searching across the World-Wide Web typically record average query lengths of two words [Croft et al., 1995]. Although this may be one extreme in terms of IR applications, it does indicate that most IR queries are not long and that techniques for dealing with word mismatch are needed.

An obvious approach to solving this problem is query expansion. The query is expanded using words or phrases with similar meaning to those in the query, and the chances of matching words in relevant documents are therefore increased. This is the basic idea behind the use of a thesaurus in query formulation. There is, however, little evidence that a general thesaurus is of any use in improving the effectiveness of the search, even if words are selected by the searchers [Voorhees, 1994]. Instead, it has been proposed that by automatically analyzing the text of the corpus being searched, a more effective thesaurus or query expansion technique could be produced.

One of the earliest studies of this type was carried out by Sparck Jones [Sparck Jones, 1971], who clustered words based on co-occurrence in documents and used those clusters to expand the queries. A number of similar studies followed, but it was not until recently that consistently positive results have been obtained. The techniques that have been used recently can be described as being based on either global or local analysis of the documents in the corpus being searched. The global techniques examine word occurrences and relationships in the corpus as a whole, and use this information to expand any particular query. Given their focus on analyzing the corpus, these techniques are extensions of Sparck Jones' original approach.

Local analysis, on the other hand, involves only the top ranked documents retrieved by the original query. We have called it local because the techniques are variations of the original work on local feedback [Attar & Fraenkel, 1977, Croft & Harper, 1979]. This work treated local feedback as a special case of relevance feedback where the top ranked documents were assumed to be relevant. Queries were both reweighted and expanded based on this information.

Both global and local analysis have the advantage of expanding the query based on all the words in the query. This is in contrast to a thesaurus-based approach where individual words and phrases in the query are expanded and word ambiguity is a problem. Global analysis is inherently more expensive than local analysis. On the other hand, global analysis provides a thesaurus-like resource that can be used for browsing without searching, and retrieval results with local feedback on small test collections were not promising.

More recent results with the TREC collection, however, indicate that local feedback approaches can be effective and, in some cases, outperform global analysis techniques. In this paper, we compare these approaches using different query sets and corpora. In addition, we propose and evaluate a new technique which borrows ideas from global analysis, such as the use of context and phrase structure, but applies them to the local document set. We call the new technique local context analysis to distinguish it from local feedback.

In the next section, we describe the global analysis procedure used in these experiments, which is the Phrasefinder component of the INQUERY retrieval system [Jing & Croft, 1994]. Section 3 covers the local analysis procedures. The local feedback technique is based on the most successful approaches from the recent TREC conference [Harman, 1996]. Local context analysis is described in detail. The experiments and results are presented in section 4. Both the TREC [Harman, 1995] and WEST [Turtle, 1994] collections are used in order to compare results in different domains. A number of experiments with local context analysis are reported to show the effect of parameter variations on this new technique. The other techniques are run using established parameter settings. In the comparison of global and local techniques, both recall/precision averages and query-by-query results are used. The latter evaluation is particularly useful to determine the robustness of the techniques, in terms of how many queries perform substantially worse after expansion. In the final section, we summarize the results and suggest future work.

IPR2019-01304
BloomReach, Inc. EX1021 Page 1

2 Global Analysis

The global analysis technique we describe here has been used in the INQUERY system in TREC evaluations and other applications [Jing & Croft, 1994, Callan et al., 1995], and was one of the first techniques to produce consistent effectiveness improvements through automatic expansion. Other researchers have developed similar approaches [Qiu & Frei, 1993, Schütze & Pedersen, 1994] and have also reported good results.

The basic idea in global analysis is that the global context of a concept can be used to determine similarities between concepts. Context can be defined in a number of ways, as can concepts. The simplest definitions are that all words are concepts (except perhaps stop words) and that the context for a word is all the words that co-occur in documents with that word. This is the approach used by [Qiu & Frei, 1993], and the analysis produced is related to the representations generated by other dimensionality-reduction techniques [Deerwester et al., 1990, Caid et al., 1993]. The essential difference is that global analysis is only used for query expansion and does not replace the original word-based document representations. Reducing dimensions in the document representation leads to problems with precision. Another related approach uses clustering to determine the context for document analysis [Crouch & Yang, 1992].

In the Phrasefinder technique used with INQUERY, the basic definition for a concept is a noun group, and the context is defined as the collection of fixed length windows surrounding the concepts. A noun group (phrase) is either a single noun, two adjacent nouns or three adjacent nouns. Typical effective window sizes are from 1 to 3 sentences. One way of visualizing the technique, although not the most efficient way of implementing it, is to consider every concept (noun group) to be associated with a pseudo-document. The contents of the pseudo-document for a concept are the words that occur in every window for that concept in the corpus. For example, the concept airline pilot might have the words pay, strike, safety, air, traffic and FAA occurring frequently in the corresponding pseudo-document, depending on the corpus being analyzed. An INQUERY database is built from these pseudo-documents, creating a concept database. A filtering step is used to remove words that are too frequent or too rare, in order to control the size of the database.

To expand a query, it is run against the concept database using INQUERY, which will generate a ranked list of phrasal concepts as output, instead of the usual list of document names. Document and collection-based weighting of matching words are used to determine the concept ranking, in a similar way to document ranking. Some of the top-ranking phrases from the list are then added to the query and weighted appropriately. In the Phrasefinder queries used in this paper, 30 phrases are added into each query and are downweighted in proportion to their rank position. Phrases containing only terms in the original query are weighted more heavily than those containing terms not in the original query.

Figure 1 shows the top 30 concepts retrieved by Phrasefinder for the TREC4 query 214 "What are the different techniques used to create self induced hypnosis". While some of the concepts are reasonable, others are difficult to understand. This is due to a number of spurious matches with noncontent words in the query.

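The pseudo-document view of Phrasefinder described above can be illustrated with a small sketch. This is not the INQUERY implementation: the concept set is supplied by hand (the real system identifies noun groups automatically), `radius` is a hypothetical parameter expressing the window in sentences on each side, and the scoring in `rank_concepts` is a simple summed tf-idf chosen only for illustration, since the actual INQUERY belief function is not given here.

```python
import math
from collections import Counter, defaultdict


def find_concepts(sentence, concepts):
    """Return every known concept (a 1- to 3-token tuple) occurring contiguously in a sentence."""
    found = []
    for size in (1, 2, 3):
        for start in range(len(sentence) - size + 1):
            gram = tuple(sentence[start:start + size])
            if gram in concepts:
                found.append(gram)
    return found


def build_pseudo_documents(sentences, concepts, radius=0):
    """Collect, for each concept, the words of every window surrounding it.

    radius=0 is a one-sentence window, radius=1 a three-sentence window.
    """
    pseudo = defaultdict(Counter)
    for i, sentence in enumerate(sentences):
        window = [tok for s in sentences[max(0, i - radius):i + radius + 1] for tok in s]
        for concept in find_concepts(sentence, concepts):
            # The concept's own tokens are left out of its pseudo-document.
            pseudo[concept].update(tok for tok in window if tok not in concept)
    return pseudo


def rank_concepts(query_terms, pseudo):
    """Rank concepts by summed tf * idf of the query terms in their pseudo-documents.

    Illustrative scoring only; concepts matching no query term are omitted.
    """
    n_docs = len(pseudo)
    doc_freq = Counter(tok for words in pseudo.values() for tok in words)
    scores = {}
    for concept, words in pseudo.items():
        score = sum(words[t] * math.log(1 + n_docs / doc_freq[t])
                    for t in query_terms if words[t] > 0)
        if score > 0:
            scores[concept] = score
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

Running a query against the pseudo-documents returns concepts rather than documents, which is the sense in which the concept database behaves like a thesaurus.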
The main advantages of a global analysis approach like the one used in INQUERY are that it is relatively robust, in that average performance of queries tends to improve using this type of expansion, and that it provides a thesaurus-like resource that can be used for browsing or other types of concept search. The disadvantages of this approach are that it can be expensive in terms of disk space and computer time to do the global context analysis and build the searchable database, and that individual queries can be significantly degraded by expansion.

3 Local Analysis

3.1 Local Feedback

The general concept of local feedback dates back at least to a 1977 paper by Attar and Fraenkel [Attar & Fraenkel, 1977]. In this paper, the top ranked documents for a query were proposed as a source of information for building an automatic thesaurus. Terms in these documents were clustered and treated as quasi-synonyms. In [Croft & Harper, 1979], information from the top ranked documents is used to re-estimate the probabilities of term occurrence in the relevant set for a query. In other words, the weights of query terms would be modified but new terms were not added. This experiment produced effectiveness improvements, but was only carried out on a small test collection.

Experiments carried out with other standard small collections did not give promising results. Since the simple version of this technique consists of adding common words from the top-ranked documents to the original query, the effectiveness of the technique is obviously highly influenced by the proportion of relevant documents in the high ranks. Queries that perform poorly and retrieve few relevant documents would seem likely to perform even worse after local feedback, since most words added to the query would come from non-relevant documents.

In recent TREC conferences, however, simple local feedback techniques appear to have performed quite well. In this paper, we expand using a procedure similar to that used by the Cornell group in TREC 4 & 3 [Buckley et al., 1996]. The most frequent 50 terms and 10 phrases (pairs of adjacent non stop words) from the top ranked documents are added to the query. The terms in the query are reweighted using the Rocchio formula with α : β : γ = 1 : 1 : 0.

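As a rough illustration of the reweighting step, the sketch below applies the standard Rocchio update with α = β = 1 and γ = 0, so the non-relevant component vanishes and the new weight of a term is its original query weight plus its average weight in the top ranked documents. The dictionary representation, the use of raw term counts as document weights, and the function name `rocchio_expand` are assumptions made for this example; phrase selection and the exact weighting of the Cornell procedure are not reproduced.

```python
from collections import defaultdict


def rocchio_expand(query_vec, top_docs, alpha=1.0, beta=1.0, n_terms=50):
    """Expand and reweight a query from the top ranked documents (gamma = 0).

    query_vec: dict term -> weight; top_docs: list of dicts term -> count.
    """
    # Centroid of the top ranked documents, which are assumed to be relevant.
    centroid = defaultdict(float)
    for doc in top_docs:
        for term, weight in doc.items():
            centroid[term] += weight / len(top_docs)

    # Add the n_terms most frequent terms that are not already in the query.
    new_terms = sorted((t for t in centroid if t not in query_vec),
                       key=lambda t: (-centroid[t], t))[:n_terms]

    # Rocchio update: alpha * original weight + beta * centroid weight.
    return {t: alpha * query_vec.get(t, 0.0) + beta * centroid.get(t, 0.0)
            for t in set(query_vec) | set(new_terms)}
```

Because every top ranked document contributes to the centroid whether or not it is actually relevant, the sketch also makes visible why the method depends on the proportion of relevant documents in the high ranks.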
Figure 2 shows terms and phrases added by local feedback to the same query used in the previous section. In this case, the terms in the query are stemmed.

One advantage of local feedback is that it can be relatively efficient to do expansion based on high ranking documents. It may be slightly slower at run-time than, for example, Phrasefinder, but needs no thesaurus construction phase. Local feedback requires an extra search and access to document information. If document information is stored only for this purpose, then this should be counted as a space overhead for the technique, but it is likely to be significantly less than a concept database. A disadvantage currently is that it is not clear how well this technique will work with queries that retrieve few relevant documents.

hypnosis          meditation                 practitioners
dentists          antibodies                 disorders
psychiatry        immunodeficiency-virus     anesthesia
susceptibility    therapists                 dearth
atoms             van-dyke                   self
confession        stare                      proteins
katie             johns-hopkins-university   growing-acceptance
reflexes          voltage                    ad-hoc
correlation       conde-nast                 dynamics
ike               illnesses                  hoffman

Figure 1: Phrasefinder concepts for TREC4 query 214

hypnot               hypnotiz               19960500
psychosomat          psychiatr              immun
mesmer               franz                  suscept
austrian             dyck                   psychiatrist
shesaid              tranc                  professor
hallucin             18th                   centur
hilgard              11th                   unaccept
19820902             syndrom                exper
physician            told                   patient
hemophiliac          strang                 cortic
01                   defic                  muncie
spiegel              diseas                 imagin
suggest              dyke                   feburar
immunoglobulin       reseach                fresco
person               numb                   ktie
psorias              treatment              medicin
17150000             ms                     franz-mesmer
austrian-physician   psychosomat-medicin    intern-congress
hypnot-state         fight-immun            hypnotiz-peopl
late-18th            diseas-fight
ms-ol

Figure 2: Local feedback terms and phrases for TREC4 query 214

3.2 Local Context Analysis

Local context analysis is a new technique which combines global analysis and local feedback. Like Phrasefinder, noun groups are used as concepts and concepts are selected based on co-occurrence with query terms. Concepts are chosen from the top ranked documents, similar to local feedback, but the best passages are used instead of whole documents. The standard INQUERY ranking is not used in this technique.

Below are the steps to use local context analysis to expand a query Q on a collection.

1. Use a standard IR system (INQUERY) to retrieve the top n ranked passages. A passage is a text window of fixed size (300 words in these experiments [Callan, 1994]). There are two reasons that we use passages rather than documents. Since documents can be very long and about multiple topics, a co-occurrence of a concept at the beginning and a term at the end of a long document may mean nothing. It is also more efficient to use passages because we can eliminate the cost of processing the unnecessary parts of the documents.

2. Concepts (noun phrases) in the top n passages are ranked according to the formula

\[ bel(Q, c) = \prod_{t_i \in Q} \left( \delta + \frac{\log(af(c, t_i)) \, idf_c}{\log(n)} \right)^{idf_i} \]

where

\[ af(c, t_i) = \sum_{j=1}^{n} ft_{ij} \, fc_j \]
\[ idf_i = \max(1.0, \log_{10}(N / N_i) / 5.0) \]
\[ idf_c = \max(1.0, \log_{10}(N / N_c) / 5.0) \]

   c        is a concept
   ft_{ij}  is the number of occurrences of t_i in p_j
   fc_j     is the number of occurrences of c in p_j
   N        is the number of passages in the collection
   N_i      is the number of passages containing t_i
   N_c      is the number of passages containing c
   \delta   is 0.1 in this paper to avoid zero bel value

   The above formula is a variant of the tf idf measure used by most IR systems. In the formula, the af part rewards concepts co-occurring frequently with query terms, the idf_c part penalizes concepts occurring frequently in the collection, and the idf_i part emphasizes infrequent query terms. Multiplication is used to emphasize co-occurrence with all query terms.

3. Add m top ranked concepts to Q using the following formula:

   Q_new = #WSUM(1.0 1.0 Q w Q')
   Q'    = #WSUM(1.0 w_1 c_1 w_2 c_2 ... w_m c_m)

   In our experiments, m is set to 70 and w_i is set to 1.0 - 0.9 * i/70. Unless specified otherwise, w is set to 2.0. We call Q' the auxiliary query. #WSUM is an INQUERY query operator which computes a weighted average of its components.

Figure 3 shows the top 30 concepts added by local context analysis to TREC4 query 214.

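Steps 2 and 3 above can be sketched in a few lines, under several stated assumptions. Passages are represented as dictionaries mapping a term or concept to its occurrence count; `passage_freq` is a hypothetical lookup giving the number of collection passages containing each term or concept (N_i and N_c); every non-query key in a passage is treated as a candidate concept, whereas the technique itself restricts candidates to noun phrases. The text does not say how the logarithm is handled when a concept never co-occurs with a query term (af = 0), so the sketch takes the co-occurrence factor as 0 in that case, leaving δ. The weight formula is written with divisor m, which equals the 70 of the experiments when m = 70.

```python
import math


def idf(total_passages, passages_containing):
    """idf as written in the text: max(1.0, log10(N / N_x) / 5.0)."""
    return max(1.0, math.log10(total_passages / passages_containing) / 5.0)


def bel(query_terms, concept, passages, total_passages, passage_freq, delta=0.1):
    """Score a concept against all query terms over the top ranked passages."""
    n = len(passages)  # number of top ranked passages; needs n > 1 so log(n) != 0
    idf_c = idf(total_passages, passage_freq[concept])
    score = 1.0
    for term in query_terms:
        # af(c, t_i): co-occurrence of concept and term, summed over passages.
        af = sum(p.get(term, 0) * p.get(concept, 0) for p in passages)
        # Assumption: no co-occurrence contributes 0, so the factor is delta.
        cooccurrence = math.log(af) * idf_c / math.log(n) if af > 0 else 0.0
        score *= (delta + cooccurrence) ** idf(total_passages, passage_freq[term])
    return score


def expand(query_terms, passages, total_passages, passage_freq, m=70):
    """Return the auxiliary query as (weight, concept) pairs, best concept first."""
    candidates = {key for p in passages for key in p} - set(query_terms)
    ranked = sorted(
        candidates,
        key=lambda c: (-bel(query_terms, c, passages, total_passages, passage_freq), c),
    )
    # w_i = 1.0 - 0.9 * i / m for the concept at rank i (1-based).
    return [(1.0 - 0.9 * i / m, c) for i, c in enumerate(ranked[:m], start=1)]
```

Because the per-term factors are multiplied, a concept that fails to co-occur with even one query term is held down to a factor of δ for that term, which is the behaviour the text attributes to using multiplication.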
Local context analysis has several advantages. It is computationally practical. For each collection, we only need a single pass to collect the collection frequencies for the terms and noun phrases. This pass takes about 3 hours on an Alpha workstation for the TREC4 collection. The major overhead to expand a query is an extra search to retrieve the top ranked passages. On a modern computer system, this overhead is reasonably small. Once the top ranked passages are available, query expansion is fast: when 100 passages are used, our current implementation requires only several seconds of CPU time to expand a TREC4 query. So local context analysis is practical even for interactive applications. For queries containing proximity constraints (e.g. phrases), Phrasefinder may add concepts which co-occur with all query terms but do not satisfy proximity constraints. Local context analysis does not have such a problem because the top ranked passages are retrieved using the original query. Because it does not filter out frequent concepts, local context analysis also has the advantage of using frequent but potentially good expansion concepts. A disadvantage of local context analysis is that it may require more time to expand a query than Phrasefinder.

4 Experiments

4.1 Collections and Query Sets

Experiments are carried out on 3 collections: TREC3, which comprises Tipster 1 and 2 datasets with 50 queries (topics 151-200), TREC4, which comprises Tipster 2 and 3 datasets with 49 queries (topics 202-250), and WEST with 34 queries. TREC3 and TREC4 (about 2 GBs each) are much larger and more heterogeneous than WEST. The average document length of the TREC documents is only 1/7 of that of the WEST documents. The average number of relevant documents per query with the TREC collections is much larger than that of WEST. Table 1 lists some statistics about the collections and the query sets. Stop words are not included.

4.2 Local Context Analysis

Table 2 shows the performance of local context analysis on the three collections. 70 concepts are added into each query using the expansion formula in section 3.2.

Local context analysis performs very well on TREC3 and TREC4. All runs produce significant improvements over the baseline on the TREC collections. The best run on TREC4 (100 passages) is 23.5% better than the baseline. The best run on TREC3 (200 passages) is 24.4% better than the baseline. On WEST, the improvements over the baseline are not as good as on TREC3 and TREC4. With too many passages, the performance is even worse than the baseline. The high baseline of the WEST collection (53.8% average precision) suggests that the original queries are of very good quality and we should give them more emphasis. So we downweight the expansion concepts by 50% by reducing the weight of auxiliary query Q' from 2.0 to 1.0. Table 3 shows that downweighting the expansion concepts does improve performance.

It is interesting to see how the number of passages used affects retrieval performance. To see it more clearly, we plot the performance curve on TREC4 in figure 4. Initially, increasing the number of passages quickly improves performance. The performance peaks at a certain point. After staying relatively flat for a period, the performance curves drop slowly when more passages are used. For TREC3 and TREC4, the optimal number of passages is around 100, while on WEST, the optimal number of passages is around 20. This is not surprising because the first two collections are an order of magnitude larger than WEST. Currently we do not know how to automatically determine the optimal number of passages to use. Fortunately, local context analysis is relatively insensitive to the number of passages used, especially for large collections like the TREC collections. On the TREC collections, between 30 and 300 passages produces very good retrieval performance.

5 Local Context Analysis vs Global Analysis

In this section we compare Phrasefinder and local context analysis in terms of retrieval performance. Tables 4-5 compare the retrieval performance of the two techniques on two collections. On both the TREC collections, local context analysis is much better than Phrasefinder. On TREC3, Phrasefinder is 7.8% better than the baseline while local context analysis using the top ranked 100 passages is 23.3% better than the baseline. On TREC4, Phrasefinder is only 3.4% better than the baseline while local context analysis using the top ranked 100 passages is 23.5% better than the baseline. In fact, all local context analysis runs in table 2 are better than Phrasefinder on TREC3 and TREC4. On both collections, Phrasefinder hurts the high-precision end while local context analysis helps improve precision. The results show that local context analysis is a better query expansion technique than Phrasefinder.

We examine two TREC4 queries to show why Phrasefinder is not as good as local context analysis. For one example, "China" and "Iraq" are very good concepts for TREC4 query "Status of nuclear proliferation treaties - violations and monitoring". They are added into the query by local context analysis but not by Phrasefinder. It appears that they are filtered out by Phrasefinder because they are frequent concepts. For the other example, Phrasefinder added the concept "oil spill" to TREC4 query "As a result of DNA testing, are more defendants being absolved or convicted of crimes". This seems to be strange. It appears that Phrasefinder did this because "oil spill" co-occurs with many of the terms in the query, e.g., "result", "test", "defendant", "absolve" and "crime". But "oil spill" does not co-occur with "DNA", which is a key element of the query. While it is very hard to automatically determine which terms are key elements of a query, the product function used by local context analysis for selecting expansion concepts should be better than the sum function used by Phrasefinder because with the product function it is harder for some query terms to dominate other query terms.

hypnosis        brain-wave    ms.-burns
technique       pulse         reed
brain           ms.-olness    trance
hallucination   process       circuit
van-dyck        behavior      suggestion
case            spiegel       finding
hypnotizable    subject       van-dyke
patient         memory        application
katie           muncie        approach
study           point         contrast

Figure 3: Local Context Analysis concepts for query 214

                                     WEST         TREC3         TREC4
Number of queries                    34           50            49
Raw text size in gigabytes           0.26         2.2           2.07
Number of documents                  11,953       741,856       567,529
Mean words per document              1,970        260           299
Mean relevant documents per query    29           196           133
Number of words in a collection      23,516,042   192,684,738   169,682,351

Table 1: Statistics on text corpora

Number of passages   TREC4           TREC3           WEST
10                   29.5 (+17)      36.6 (+16)      54.8 (+1.9)
20                   29.9 (+18.6)    37.5 (+18.9)    55.4 (+3.0)
30                   30.2 (+19.8)    38.7 (+22.6)    54.5 (+1.3)
40                   30.3 (+20.3)    39.0 (+23.6)    54.6 (+1.6)
50                   30.4 (+20.6)    38.9 (+23.2)    54.2 (+0.7)
100                  31.1 (+23.5)    38.9 (+23.3)    54.2 (+0.8)
200                  31.0 (+23.0)    39.3 (+24.4)    53.1 (-1.3)
300                  30.7 (+21.8)    39.1 (+23.7)    52.7 (-2.0)
500                  29.9 (+18.6)    38.3 (+21.3)    52.1 (-3.2)
1000                 29.0 (+15)      37.6 (+19)      51.7 (-3.9)
2000                 27.9 (+10.7)    36.6 (+16.0)    51.7 (-3.9)

Table 2: Performance of local context analysis using 11 point average precision (%), with the percentage change over the baseline in parentheses

Number of passages   WEST
10                   55.9 (+3.8)
20                   56.5 (+5.0)
30                   55.6 (+3.4)
40                   55.7 (+3.6)
50                   55.8 (+3.7)
100                  55.6 (+3.3)
200                  54.6 (+1.6)
300                  54.4 (+1.2)
500                  53.6 (-0.4)
1000                 53.7 (-0.1)
2000                 53.7 (-0.1)

Table 3: Downweight expansion concepts of local context analysis on WEST. The weight of the auxiliary query is reduced to 1.0

`6
`
`Local
`
`Text
`
`Analysis
`
`vs
`
`Local
`
`Feedback
`
`performances
`Table
`7 shows
`
`of
`
`lo-
`the
`
`the retrieval
`we compare
`section
`In this
`and
`local
`context
`analysis.
`cal
`feedback
`retrieval
`performance
`of
`local
`feedback.
`the expansion
`Table
`8 shows
`the result
`of downweighting
`this
`is to make
`for
`concepts
`by 5070 on WEST.
`The
`reason
`Remember
`analysis.
`a fair
`comparison
`with
`local
`context
`that
`we also
`downweighted
`the
`expansion
`concepts
`of
`local
`context
`analysis
`by 50% on WEST.
`run
`best
`The
`on TREC3.
`Local
`feedback
`does
`very well
`to
`close
`over
`the
`baseline,
`produces
`a 20.5% improvement
`the
`It
`is also
`context
`analysis.
`local
`of
`the best
`run
`of
`24.4y0
`used
`for
`number
`of documents
`relatively
`insensitive
`to the
`feedback
`on TREC3.
`Increasing
`the
`number
`of documents
`from 10 to 50 does
`tiect
`performance
`much.
`not
`produces
`It
`also
`does well
`on TREC4.
`The
`run
`14.070
`improvement
`over
`the
`baseline,
`significant,
`lower
`than
`the
`23.57.
`of
`the best
`run
`context
`
`best
`very
`local
`
`of
`
`a
`
`but
`analy-
`
`8
`
`for
`
`a
`
`local
`
`used
`of documents
`to the number
`sensitive
`is very
`It
`sis.
`number
`of documents
`Increasing
`the
`on TREC4.
`feedback
`in a blg
`performance
`loss.
`In contrast,
`from 5 to 20 results
`is relatively
`insensitive
`to the number
`local
`context
`analysis
`collections.
`of passages
`on all
`three
`at all. Wkh-
`not work
`On WEST,
`local
`feedback
`does
`results
`in
`it
`concepts,
`downweighting
`the
`expansion
`out
`Downweighting
`all
`runs.
`significant
`performance
`loss
`over
`It
`amount
`of
`loss.
`the
`expansion
`concepts
`only
`reduces
`the
`is also
`sensitive
`to the
`number
`of documents
`used
`for
`feed-
`of
`back.
`Increasing
`the
`number
`feedback
`documents
`results
`in significantly
`more
`performance
`loss.
`its
`and
`feedback
`It
`seems
`the
`performance
`of
`that
`feedback
`used
`for
`sensitivity
`y
`number
`of documents
`the
`to
`the
`col-
`depend
`on
`number
`relevant
`documents
`in
`of
`the
`average
`lection
`for
`query.
`From table
`1 we know
`that
`the
`is 196,
`number
`of
`relevant
`documents
`per
`query
`on TREC3
`than
`29
`larger
`than
`133 of TREC4,
`which
`is in turn
`larger
`of WEST.
`This
`corresponds
`to the
`relative
`performance
`of
`local
`feedback
`on the
`collections.
`between
`comparison
`Tables
`4-6 show a side by side
`recall
`at different
`feedback
`and
`local
`context
`analysis
`on the three
`collections.
`10 documents
`are used for
`
`Top
`
`local
`levels
`local
`
`IPR2019-01304
`BloomReach, Inc. EX1021 Page 5
`
`
`
`29
`
`t
`
`F@re
`
`4: Performance
`
`curve
`
`of
`
`local
`
`context
`
`analysis
`
`on TREC4
`
`Recall
`
`o
`10
`20
`30
`40
`50
`60
`70
`80
`90
`100
`average
`
`base
`71.0
`49.3
`40.4
`33.3
`27.3
`21.6
`14.8
`9.5
`6.2
`3.1
`0.4
`25.2
`
`48.6
`40.0
`;.:
`
`23:9
`18.8
`11.8
`8.1
`4.2
`0.6
`26.0
`
`~-1.6j
`(-1.0)
`(+1.8)
`(+2.5)
`(+10.3)
`(+27.1)
`(+24.7)
`(+31.0)
`(+33.6)
`(+24.Oj
`(+3.4)
`
`52.8
`43.2
`;.:
`
`24:5
`19.7
`14.8
`10.8
`6.4
`0.9
`27.9
`
`(+7.Oj
`(+7.0)
`(+8.0)
`(+9.2)
`(+13.2)
`(+33.4)
`(+56.9)
`(+74.7)
`(+104.6)
`(+93.3)
`(+11.0)
`
`Ml
`
`lca-100p
`73.2
`57.1
`46.8
`39.9
`35.3
`29.9
`23.6
`17.9
`11.8
`5.7
`0.8
`31.1
`
`(+3.2)
`(+15.7)
`(+16.0)
`(+19.8)
`(+29.1)
`(+38.4)
`(+59.8)
`(+89.1)
`(+91.0)
`(+80.2)
`(+88.2)
`(+23.5)
`
`Table
`feedback
`
`4: A comparison
`(lf-10doc).
`
`of baseline,
`100 passages
`
`Phrasefinder,
`for
`local
`context
`
`and
`feedback
`local
`analysis
`(lea-100p)
`
`local
`
`context
`
`analvsis
`
`on TREC4.
`
`10 documents
`
`for
`
`local
`
`for
`used
`are
`passages
`6 for WEST,
`In table
`by 5070 for
`both
`
`context
`local
`the expansion
`local
`feedback
`
`100
`top
`and
`feedback
`tables.
`in these
`analysis
`are downweighted
`concepts
`context
`analysis.
`and
`local
We also made a query-by-query comparison of the best run of local feedback and the best run of local context analysis on TREC4. Of 49 queries, local feedback hurts 21 and improves 28, while local context analysis hurts 11 and improves 38. Of the queries hurt by local feedback, 5 have a loss of more than 5% in average precision. The worst case is query 232, whose average precision is reduced from 24.8% to 4.3%. Of those hurt by local context analysis, only one has a loss of more than 5% in average precision. Local feedback also tends to hurt queries with poor performance. Of 9 queries with baseline average precision less than 5%, local feedback hurts 8 and improves 1. In contrast, local context analysis hurts 4 and improves 5.

The tendency of local feedback to hurt "bad" queries and queries with few relevant documents (such as the WEST queries) suggests that it is very sensitive to the number of relevant documents in the top ranked documents. In comparison, local context analysis is not so sensitive.
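
The query-by-query tallies can be reproduced from per-query average precision with a few lines of Python. This is a minimal sketch, not the authors' evaluation code: the function names and the toy values are hypothetical, except for query 232, whose precision figures are taken from the text. It also assumes that a loss of "more than 5%" is measured in absolute percentage points; the paper does not say whether the threshold is absolute or relative.

```python
from typing import Dict, List, Tuple


def tally(baseline: Dict[int, float], expanded: Dict[int, float]) -> Tuple[int, int]:
    """Return (improved, hurt) query counts; unchanged queries count as neither."""
    improved = sum(1 for q in baseline if expanded[q] > baseline[q])
    hurt = sum(1 for q in baseline if expanded[q] < baseline[q])
    return improved, hurt


def large_losses(baseline: Dict[int, float], expanded: Dict[int, float],
                 threshold: float = 5.0) -> List[int]:
    """Query ids whose average precision drops by more than `threshold` points.

    Assumption: the threshold is in absolute percentage points.
    """
    return sorted(q for q in baseline if baseline[q] - expanded[q] > threshold)


# Average precision (%) per query id. Query 232 uses the values quoted in the
# text; queries 1 and 2 are made-up placeholders.
baseline_ap = {232: 24.8, 1: 10.0, 2: 3.0}
feedback_ap = {232: 4.3, 1: 12.5, 2: 2.9}

improved, hurt = tally(baseline_ap, feedback_ap)
severe = large_losses(baseline_ap, feedback_ap)
print(improved, hurt, severe)
```

Restricting `baseline_ap` to queries with baseline precision below 5% before calling `tally` gives the breakdown for poorly performing queries.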
It is interesting to note that although both local context analysis and local feedback find concepts from top ranked passages/documents, the overlap of the concepts chosen by them is very small. On TREC4, the average number of unique terms in the expansion concepts per query is 58 by local feedback and 78 by local context analysis. The average overlap per query is only 17.6 terms. This means local context analysis and local feedback are two quite different query expansion techniques. Some queries expanded quite differently are improved by both methods. For example, the expansion overlap for query 214 of TREC4 ("What are the different techniques used to create self-induced hypnosis") is 19 terms, yet both methods improve the query significantly.

                  Precision (% change) - 50 queries
Recall    base    Phrasefinder          lf-10doc              lca-100p
0         82.2    79.4 (-3.3)           82.5 (+0.4)           87.0 (+5.9)
10        57.3    [illegible] (+4.8)    64.9 (+13.3)          65.5 (+14.3)
20        46.2    [illegible] (+9.1)    56.1 (+21.5)          57.2 (+23.8)
30        39.1    43.3 (+10.7)          48.3 (+23.5)          48.4 (+23.8)
40        32.7    36.9 (+12.8)          41.6 (+26.9)          42.7 (+30.4)
50        27.5    31.8 (+15.9)          36.8 (+34.1)          37.9 (+38.0)
60        22.6    26.1 (+15.1)          30.9 (+36.7)          31.5 (+39.3)
70        18.0    20.6 (+14.?)          25.2 (+40.0)          25.6 (+42.1)
80        13.3    15.8 (+18.6)          19.4 (+45.7)          19.4 (+45.7)
90        7.9     9.4 (+18.7)           11.5 (+44.3)          11.7 (+47.3)
100       0.5     [illegible]           1.2 (+143.5)          [illegible]
average   31.6    [illegible]           38.0 (+20.5)          [illegible]

Table 5: A comparison of baseline, Phrasefinder, local feedback and local context analysis on TREC3. 10 documents for local feedback (lf-10doc). 100 passages for local context analysis (lca-100p).

                  Precision (% change) - 34 queries
Recall    base    lf-10doc-dw0.5        lca-100p-w1.0
0         88.0    81.9 (-7.0)           92.1 (+4.7)
10        80.0    76.9 (-4.0)           [illegible] (+5.4)
20        77.5    71.4 (-7.8)           [illegible]
30        74.1    68.2 (-7.9)           73.9 [illegible]
40        62.9    [illegible] (-3.3)    [illegible] (-1.7)
50        57.5    [illegible] (-1.2)    [illegible] (-1.2)
60        49.7    50.1 (+0.8)           50.7 (+2.2)
70        41.5    42.1 (+1.3)           44.2 (+6.4)
80        32.7    33.1 (+1.1)           36.4 (+11.2)
90        19.3    21.8 (+13.0)          22.6 (+17.1)
100       8.6     9.3 (+7.8)            [illegible]
average   53.8    52.0 (-3.3)           [illegible]

Table 6: A comparison of baseline, local feedback and local context analysis on WEST. 10 documents for local feedback with weights for expansion units downweighted by 50% (lf-10doc-dw0.5). 100 passages for local context analysis with weight for auxiliary query set to 1.0 (lca-100p-w1.0).

7 Conclusion and Future Work

This paper compares the retrieval effectiveness of three automatic query expansion techniques: global analysis, local feedback and local context analysis. Experimental results on three collections show that local document analysis (local feedback and local context analysis) is more effective than global document analysis. The results also show that local context analysis, which uses some global analysis techniques on the local document set, outperforms simple local feedback in terms of retrieval effectiveness and predictability. We will continue our work in these aspects:

1. Local context analysis: automatically determine how many passages to use, how many concepts to add to the query and how to assign the weights to them on a query by query basis. Currently the parameter values are decided experimentally and fixed for all queries.

2. Phrasefinder: a new metric for selecting concepts. Currently Phrasefinder uses Inquery's belief function, which is not designed to select concepts. We hope a better metric will improve the performance of Phrasefinder.
`
8 Acknowledgements
`
[Caid et al., 1993] Caid, B., Gallant, S., Carleton, J., & Sudbeck, D. (1993). HNC Tipster Phase I Final Report. In Proceedings of Tipster Text Program (Phase I), pp. 69-92.
`
We thank Dan Nachbar and James Allan for their help during this research. This research is supported in part by the NSF Center for Intelligent Information Retrieval at University of Massachusetts, Amherst.
`supported
`This
`material
`is based
`on work
`NRaD
`Contract
`Number
`N66001-94-D-6054.
`findings
`and
`conclusions
`or
`recommendations
`this material
`are
`the
`author(