`Weissman et al.
`
`(10) Patent No.:
`(45) Date of Patent:
`
`US 6,453,315 BL
`Sep. 17, 2002
`
`US006453315B1
`
`(54) MEANING-BASED INFORMATION
`ORGANIZATION AND RETRIEVAL
`
`(75)
`
`Inventors: Adam J. Weissman; Gilad Israel
`Elbaz, both of Los Angeles, CA (US)
`
`(73) Assignee: Applied Semantics, Ine., Los Angeles,
`CA (US)
`
`(") Notice:
`
`Subject to any disclaimer, the term ofthis
`patent is extended or adjusted under 35
`US.C. 154(b) by 0 days.
`
`(21) Appl. No: 09/431,760
`(22)
`Filed:
`Noy. 1, 1999
`
`(60)
`
`Related U.S. Application Data
`Provisional application No, 60/155,667, filed on Sep. 22,
`1999,
`
`(SL) Mt Ch seen GO6F 17/30; GO6E 7/00
`
`C32) USD cep pceetssecsccstpeccsetetesesnseeseenenee POTS, 707/93
`(58) Field of Seared occ TOTNES
`
`(56)
`
`References Cited
`U.S. PATENT DOCUMENTS
`
`6/1989 Deerwesteret al... 77/5
`4.539.853 A *
`5,056,021 A * LO/L991 AUSDOM veces TOAD
`
`7/1992 Sadler w..0
`ges. FO4/2
`5,128,865 A *
`5,680,511 A * 10/1997 Baker et al,
`74/257
`
`5,694,523 A * 12/1997 Wieal 446
`pore
`FON/AS
`...,
`we TOTS
`5,778,362 A *
`7/1998 Deerwester
`
`8/1998 Dahlgren et al.
`a TAS
`5,794,050 A ©
`
`...
`oe TOO
`2/1999 Liddyet al.
`5,873,056 A *
`
`a. WIG
`5,940,821 A *
`8/1999 Wical
`.....
`
`« TOTS
`5,953,718 A *
`9/1999 Wieal ....,
`4/2000 Wical .......
`ow. FOS
`6.038560 A *
`
`8/2000 Wieal et al.
`...
`FOT/S3)
`6101515 A *
`6/2001 Shilyama etal. wo FOS
`6,247,009 Bl *
`OTHER PUBLICATIONS
`
`Ferri et al, “Toward a Retrieval of HTML Documents Using
`a Semantic Approach”, IEEE International Conference on
`Multimedia and Expo, vol. 3, 2000, pp. 1571-1574.*
`
`Caudal, P. “Using Complex Lexical Types to Model the
`Polysemy of Collective Nouns within the Generative Lexi-
`con”, Proceedings of the Ninth International Workshop on
`Database and Expert Systems Applications, 1998, pp.
`154-159.*
`Fellbaum, C., ed, “WordNet: An Electronic Lexical Dala-
`base”, Cambridge:MIT Press, Mar. 1998. P325.5D38W67
`L998."
`Meijs, W. “Inferring Grammar from Lexis: Machine—Read~-
`able Dictionaries as Sources of Wholesale Syntactic and
`Semantic Information”, [EEE Colloquium on Grammatical
`Inference: Theory, Applications and Alternatives, 1993, pp.
`P3/1-P3/5*
`
`Budanitsky, A. and Hirst, G. “Semantic Distance in Word-
`Net: An Expenmental, Application—Oriented Evaluation of
`Five Measures”, Proc. of the North Amer. Association for
`Computational Linguistics, WordNet and Other Lexical
`Reasources Workshop, Jun. 2-7, 2001.*
`
`(List continued on next page.)
`
`Primary Examiner—lean R. Womere
`Assistant Exantiner—Luke S Wassum
`(74) Attorney, Agent, or Firm—Pillsbury Winthrop LLP
`
`(57)
`
`ABSTRACT
`
`The present invention rehes on the idea of a meaning-based
`search, allowing users to locale information that is clase in
`meaning,
`to the concepts they are searching. A semantic
`space is created by a
`lexicon of concepts and relations
`between concepts. A query is mapped to a first meaning
`dilferentiator, representing the location of the query in the
`semantic space, Similarly, each data element in the target
`data set being searched is mapped to a second meaning
`differentiator, representing the location ofthe data clement
`in the semantic space. Searching is accomplished by deter-
`mining a semantic distance between the first and second
`meaning differentiator, wherein this distance represents their
`closeness in meaning. Search resulis on the input query are
`presented where the target data elements that are closest in
`meaning, based on their determined semantic distance, are
`ranked higher.
`
`24 Claims, 5 Drawing Sheets-
`
`
`
`PETITIONERS - EXHIBIT 1031
`PETITIONERS- EXHIBIT 1031
`
`IPR2022-00217
`
`IPR2022-00217
`
`
`
`US 6,453,315 BI
`Page 2
`
`
`
`OTHER PUBLICATIONS
`
`Mibalcea, R. and Moldovan, D. “Semantic Indexing using
`WordNet Senses”, Proceedings of the ACL Workshop on IR
`and NLP Oet. 2000.*
`Budanitsky, A. “Lexical Semantic Relatedness and its Appli-
`cation in Natural Language Processing”, Technical Report
`CSRG-390, Computer Systems Research Group, University
`of Toronto, Aug, 1999.*
`Resnick, P. “Semantic Similarity in a Taxonomy: An Infor-
`mation—Based Measure and its Application to Problems of
`Ambiguity in Natural Language”, Journal of Artificial [ntel-
`ligence Research, vol. 11, Jul. 1999.*
`Smeaton, AJ. and Quigley,
`1. “Experiments on Using
`Semantic Distances Between Words in Image Caption
`Retrieval”, Proceedings of the 19”International Conference
`on Research and Development
`in Information Retrieval,
`Aug, 1996, pp. 174-180.*
`Sutcliffe, R.E.E.
`et al. “Beyond Keywords: Accurate
`Reirieval from Full Text Documents”, Proceedings of the
`2" Language Engineering Convention, Oct. 16-18, 1995,*
`Chakravarthy, A.S.
`and Haase, K.B. “NetSerf: Using
`Semantic Knowledge
`to Vind
`Internet
`Information
`Archives", Proceedings of the 18” Annual International
`ACM SIGIR Conference on Research and Developement in
`Information Retrieval, Jul. 1995, pp. 4—L1L.*
`Sutcliffe, RE. et al. “he Automatic Acquisition of a
`Board—Coverage Semantic Lexicon for use in Information
`Retrieval”, Proce of the AAAI Symp ‘Representation and
`Acquisition of Lexical Knowledge: Polysemy. Ambiguity
`and Generativily, Mar, 27-29, 1995,*
`
`Si-Onge, D. “Detecting and Correcting Malapropisms with
`Lexical Chains”, M.S. Thesis, University of Toronto, Mar.
`1995.*
`
`Richardson, R., Smeaton, A.F. and Murphy, J. “Using Word-
`net for Conceptual Distance Measurement”, roc, of the 16
`Research Colloquium of
`the BCS-IRSG,
`1994. pp.
`100-123,*
`
`Buckley, C. et al. “Automatic Query Expansion Using
`SMART: TREC—3”, Proceedingsof the Text Retrieval Con-
`ference (TREC3), Nov. 1994, pp. 69-81.*
`
`Voorhees, E.M. “Query Expansion Using Lexical-Semantic
`Relations”, Proceedings of the 17” Annual International
`ACM-SIGIR Conlerence on Research and Development in
`Information Retrieval, Jul. 1994, pp. 61-69.
`
`Rada, R. et al. “Development and Application ofa Metric on
`Semantic Nets”, IEEE ‘Transactions of Systems, Man and
`Cybernetics, vol. 19, No. 1, Jan./Feb./ 1989, pp. 17-30.*
`
`Brachman, R.J. and Schmolze, J.G. “An Overview of the
`KL-ONE Knowledge Representation System”, Cognitive
`Science, vol. 9, 1985, pp 171-216*
`
`Collins, A.M. and Loftus, ELF. “A Spreading—Activation
`Theory of Semantic Processing”, Psychological Review,
`vol, 82, No. 6, 1975, pp. 407-428.*
`
`* cited by examiner
`
`
`
`U.S. Patent
`
`Sep. 17, 2002
`
`Sheet 1 of 5
`
`US 6,453,315 Bl
`
`110
`
`Update/Develop lexicon in
`orderto define semantic
`space
`
`
`
`
`
`Meanings attached
`to Subject Nodes/
`
`SS Target documents
`positioned in
`semantic space
`
`tpreprocessing
`organiza
` Interpret input
`for these Meanings
`allow user to navigate to
` Show Meanings,allow
`
`
`/
`
`ion
`
`Retrieval
`
`
`
`
`Semantic
`
`
`Distances
`
`
`Calculated/Set and
`
`Meanings related
`Scores for Nodes near
`
`
`
`to each Meaning
`
`to each Meaning
`
`
`Calculated
`Calculated
`
`
`130
`
`Receive user
`
`
`inputof serach
`string as specific
`
`
`terms/meanings
`Meanings
`
`
`
`Collectian and
`
`Ranking of Nodes
`
`
`
`
`Results Presentation
`
` Refine Search —
`user lo refine search
`
`Go to Node ——
`
`Show Noderesults,
`
`\
`
`180
`
`FIGURE1
`
`
`
`U.S. Patent
`
`Sep. 17, 2002
`
`Sheet 2 of 5
`
`US 6,453,315 Bl
`
`Lexicon 200
`
`
`athletic
`
`
` equipment
`
`
`
`
`
`
`
`compan surfboard
`
`a
`
`etl
`
`equipment
`
`LEGEND:
`
`
`
`———_>—_\——__ “pani of" relationship
`
`—————>——__
`
`"kind of" relationship
`
`—$_—_— _faleral bond relationship
`
`FIGURE 2
`
`
`
`U.S. Patent
`
`Sep. 17, 2002
`
`Sheet 3 of 5
`
`US 6,453,315 Bl
`
`ithleti
` af
`
`vm
`]2
`[sr
`
`
`
`te
`
`
`
`
`
`
`bindings
`
`
`
`<
`
`strong relationship
`
`a
`
`medium relationship
`
`—_weakrelationship
`
`x
`
`distance from "ski"
`
`FIGURE3
`
`€
`
`
`U.S. Patent
`
`Sep. 17, 2002
`
`Sheet 4 of 5
`
`US 6,453,315 Bl
`
`Lexicon 400
`
`
`
`—— Sports
`
`Skiing
`Football
`
`Baseball
`
`Tennis
`
`Skiing in California
`
`Skiingin Colorado
`
`Skiing in Alaska
`
`Skiing in Washington
`
`LEGEND:
`
`ayf Directory410
`
`—————_>——— "part of relationship
`
`————————>——_
`
`“kind of” relationship
`
`+-————— lateral bond relationship
`
`FIGURE 4
`
`
`
`U.S. Patent
`
`Sep. 17, 2002
`
`Sheet 5 of 5
`
`US 6,453,315 Bl
`
`COMPUTER SYSTEM
`510
`
`512
`
`511
`
`Processor
`
`Memory
`
`System Bus 513
`
`18
`
`Bridge 514
`
`V/O Bus 515
`
`516
`
`CD-ROM
`
`NIC
`
`NETWORK 500
`
`FIGURE 5
`
`
`
`US 6,453,315 Bl
`
`1
`MEANING-BASED INFORMATION
`ORGANIZATION AND RETRIEVAL
`
`2
`FIG. 5 is a system diagram of one embodiment of the
`invention.
`
`This application claims the benefit of Provisional Appli-
`cation Ser. No. 60/155,667, filed Sep. 22, 1999.
`
`FIELD OF THE INVENTION
`
`The invention relates generally information organization
`and retrieval. More specifically,
`the invention relates to
`search engine technology.
`
`BACKGROUND
`
`‘The Internet, whichis a global network of interconnected
`networks and computers, commonly makes available a wide
`variety of information through a vehicle known as the world
`wide web (WWW), Currently, hundreds of millions of “web
`sites,” that house and format such information in documents
`called web pages are available to users ofthe Internet. Since
`the content of such pages is ungoverned, unregulated and
`largely unorganized between one site and the next, finding
`certain desired information is made difficult.
`
`‘To aid users in finding siles or pages having information
`they desire, search engines were developed. Search engines
`and directories allempt to index pages and/or sites so that
`users can find particular
`information. Typically, search
`engines are initiated by prompting a user to lype in one or
`more keywords of their choosing along with connectors
`(such as “and”) and delimiters. The search engine matches
`the keywords with documents or categories in an index that
`contain those keywords or are indexed by those keywords
`and returns results (either categories or documents or both)
`to the user in the form of URLs (Uniform Resource
`Locators). One predominant web search engine receives
`submissions ofsites and manually assigns them to categories
`within their directory. When the user types in a keyword, a
`literal sub-string match of that keyword with either the
`description of the site in their index or the name of the
`category occurs. The results of this sub-string search will
`contain some sites of interest, but in addition, may contain
`many sites that are not relevant or on point. Though one may
`refine the search with yel more keywords, the same sub-
`string match will be employed, but
`to the result set just
`obtained. Almost all search engines attempt to index sites
`and documents and leave it
`to the user to formulate an
`
`appropriate query, and then to eliminate undesired search
`results themselves. Recently other search engines using
`natural language queries have been developed butthese also
`ofien result in many undesired responses.
`The quality of the resulis obtained varies, but by doing
`essentially sub-string matches or category browsing,
`the
`engines are unable to properly discern what the user actually
`intends or means when a particular keyword is entered.
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`The objects, features and advantages of the method and
`apparatus for the present invention will be apparent from the
`following description in which:
`FIG. Lis a flow diagram of one or more embodiments of
`the invention.
`FIG, 2 illustrates a portion of a relationship based lexicon
`employed in one or more embodiments of the invention.
`FIG, 3 ilhistrates the concept of bond strength and seman-
`lic distance in one or more embodiments of the invention.
`
`FIG. 4 illustrates the application of synsets to categories
`in a subject directory tree,
`
`30)
`
`40
`
`45
`
`50
`
`55
`
`60
`
`65
`
`DETAILED DESCRIPTION
`
`Referring to the figures, exemplary embodiments of the
`invention will now be described, The exemplary embodi-
`ments are provided toillustrate aspects of the invention and
`should not be construed as limiting the scope of the inven-
`tion. The exemplary embodiments are primarily described
`with reference to block diagrams or flowcharts. As to the
`flowcharts, each block within the Howeharts represents both
`a method step and an apparatus element for performing the
`method step, Depending upon the implementation, the cor-
`responding apparatus element may be configured in
`hardware, software, firmware or combinations thereof.
`
`The searching paradigm presented herein relics on an
`unconventional approach to information retrieval; namely,
`ihe idea of a “meaning-based” search. Instead of simply
`indexing words thal appearin target documents, and allow-
`ing users to find desired word instances within documents or
`an index, searches are instead conducted within the realm of
`“semantic space”, allowing users to locate information that
`is “close in meaning” to the concepts they are interested in.
`A search engine and searching paradigm so implemented
`enables Web users Lo easily locate subject categories within
`a large subject directory, such as Netscape’s Open Directory
`(a product of Netscape Communications Corporation) by a
`convenient and meaningful manner. A “target document”
`refers to a single subject page within such a directory, Such
`a subject directory,
`is arranged in a roughly hierarchical
`fashion and consists of many unique topies. By allowing
`users to refine their searches to specific meanings of words,
`ihe invention in its various embodiments enables users to
`quickly filler out undesired responses, and therefore achieve
`more precise and more relevant search results. For example,
`ihe user would be able to filter out results relating to the
`concept of “Bulls” as a basketball team, because they are
`only interested in the concept of “Bulls” as a kind of cattle.
`Because searches conducted using search engines imple-
`mented according to the invention resull
`in presenting
`conceptual areas “near” to a particular meaning, the useris
`also presented with categories that are likely to be of
`interest, yet might have been missed bya traditional search
`approach. An example would be a result of “Cows” to a
`search on “Bulls”, which would come up as a result because
`ihe concepts are deemed “near” to each other in semantic
`space,
`
`FIG. 1 is a flow diagram of one or more embodiments of
`the invention.
`
`The flow diagram in IG. 1 for implementing meaning-
`based information organization and retrieval system may be
`summarized as follows:
`
`Mn
`
`Pre-Processing/Organization
`1. Define a “semantic space” by creating an intercon-
`nected lexicon of meanings.
`. Determine the “semantic distance” between meanings
`In semantic space that describes how close,
`conceptually, one meaning is to another.
`3. Designate a “location” in semantic space [or each target
`document.
`
`4. Pre-calculaic “scores” for each “target document” for
`cach relevant
`input meaning, based on nearness in
`semantic space obiained by measuring semantic dis-
`lance.
`
`
`
`15
`
`20)
`
`30)
`
`Retrieval
`L. Implement a search engine that converts an input siring,
`into a set of probable desired meanings, then locates
`targets based mainly on pre-calculated scores.
`2. Create a user-interface to the engine that allows users
`to easily make “initial” and “refined” searches, and that
`presents results to users ina logical and orderly fashion.
`Referring to FIG, 1
`Block 110
`In the pre-processing or organization stage, the first step
`is develop/update a useful meaning-based lexicon (block
`110). A “lexicon” is a network of interconnected meanings,
`it describes the “semantic space” that is employed by the
`search engine. One such lexicon already in existence con-
`sisis of thousands of meanings, or “synsets”, which are
`connected to one another through two key relationships:
`“kind of? and “part of”, For example, the concept of “table”
`is connected to the concept of “furniture” through a “kind
`of connection. Thus, “table” is a kind of “furniture”,
`Similarly, “California” is a part of “United States”.
`From this basis, the lexicon can be updated and expanded
`to include new meanings or update connections for mean-
`ings already present. New meanings can be updated mainly
`to reflect
`the importance in everyday culture of a large
`number of proper nouns, which are not available in many 2
`lexicons suchas the names of companies, people, towns, etc.
`In addition, new meanings may have to be developed within
`the lexicon in order to cover subject areas of common
`interest with greater specificity. Por example, whereas an
`existing lexicon might define the idea of a “programming
`language’, it may be useful to expandthis concept within the
`lexicon to designate the many hundreds of specific program-
`ming languages thal may exist as “kind ofchildren to this
`meaning.
`Additionally, an innovative type of relationship between
`meanings has been developed as a corollary to the invention,
`in order to convey information that is nottreated in lexicons.
`This relationship, called a “bind”, describes one meaning’s
`closeness lo another meaning in people’s common under-
`standing. For example, “skier” and “skiing” are not closely
`related concepts in existing lexicons. The former is a kind of
`“athlete”, ultimately a kind of “human being”; and thus
`would reside within the “entity” or “living thing” tree. The
`latter is a kind of“sport”, ultimately a kind of “activity”; it
`is in the “actions” tree. Though the subjects are closely
`related in everyday usage, they may be in widely separated
`locations within the lexicon’s network of meanings.
`‘To
`remedy this, a “bind” has been made between the two
`meanings, lo reflect their close proximity in semantic space
`(when you think of one concept, you tend to think of the
`other), This new type of bonding between meanings is
`essential for creating a “semantic space” that will yield
`useful search resulis.
`An extension to this “bind” connection is the concept of
`varying “bond strengths” between meanings. A value can be
`assigned to a connection from one meaning to another that
`signifies how strongly the second meaningrelates to.the first.
`‘These connection strengths are dependent on the direction of
`the bond, so that, for example, “skier” might imply a strong
`connection to “skiing”, whereas “skiing” need not imply
`“skier” to the same degree.
`One other enhancement is the addition of a “common-
`ness” value to each meaning, which reflects a combination
`of “how often does this concept arise in everyday usage” and
`“how specific a concept is this’? This value allows both
`more accurate interpretation of input terms passed to the
`search engine (“what is more likely lo be meant by this”),
`
`40
`
`45
`
`50.
`
`35
`
`60
`
`65
`
`US 6,453,315 Bl
`
`3
`
`4
`as Well as improved ranking of search results, showing users
`ihe results that they are more likely 1o be interested in first
`(the more specific terms are Likely to be more important to
`ihe user).
`Meanings within the lexicon may also be flagged in
`several new ways:
`Meanings thal signify geographical places are marked as
`“locations”. This allows special calculations to come
`into play in the search engine that are unique to these
`meanings.
`Meanings that are “offensive” are marked, indicating thal
`the general public may find these meanings to be
`distasteful or vulgar in some way. This flag allows us to
`give users the option to filter search results that
`they
`may find offensive.
`Meanings that signify specific “instances” of things are
`marked. For example, “compuler company” is a
`generic kind of concept, but “Microsoft” describes a
`unique entity. Knowing when “kind of” children ofa
`meaning are “instances” allows the semantic distance
`calculations to more accurately estimate the precision
`of a parent meaning (this is described in more detail
`later).
`The “currentness” of meanings may be noted. Meanings
`marked as “current” are those that are in some way
`timely; values surrounding them are more likely to vary
`over time than other concepts. For example, the ward
`“Monica” might currently imply the meaning “Monica
`Lewinsky” to a greater degree today than it will a year
`from now, By marking meanings that are uniquely
`likely to change within the near future,
`the Lexicon
`may be more easily kept up to date.
`Meanings may be marked as “pseudosynsets”. This term
`describes meanings that are either not in common usage
`because they are highly specific or technical, or that
`exist within the Lexicon purely for the purpose of
`forming a parent category for a group of child mean-
`ings. An example of the former might be the Latin
`terms that describe phylum, class, or species within the
`biological taxonomy, An example of the latter would be
`the concept of “field sports”, which exists mainly for
`the purpose of grouping similar specific sports together
`cleanly in the Lexicon, rather than because it in itselfis
`actually an off used meaning in common usage. By
`marking “pseudosynsets”, more accurate values for
`semantic distance may be calculated (this is described
`in more detail later).
`Block 115
`
`Each subject, or node, within the directory is given a
`“location” within the semantic space that is established by
`the Lexicon, Before a search engine can lind anything within
`ibis space, targets lo retrieve must be placed there. The
`process is envisioned to be a manual one; editors examine
`each directory subject node, and decide upon the “meaning”
`of the subject. However, this process may also be automated.
`To specify a node’s location in semantic space, one or
`more individual meanings from the Lexicon are “attached”
`io the node, These meanings may be grouped together into
`“synset groups’, each of which, in a sense, describes the
`node’s position within a different dimension of semantic
`space.
`For example, a node within the directory that is about
`“Piano Players” would be assigned the meaning piano
`player. A node about “Piano Players in Australia” would be
`assigned two meanings, piano player and Australia. The two
`concepts represent twodislincl “synsel groups” on the node,
`
`
`
`US 6,453,315 Bl
`
`5
`as each establishes a location for the node within a different
`“dimension” of meaning, Another way to look al
`this
`example is to say that the node represents the “intersection”
`of the concepts of piano player and Australia—il is where
`the two ideas come together within the directory.
`Extending this example, consider a node thal is about
`“Piano Players in Australia and New Zealand”. In this case
`the meanings Australia and New Zealand might both be
`placed on the node, bul grouped together in one “synset
`group”, because the combination of the two describes the
`node’s location within the “geographical location” dimen-
`sion of meaning. Another way to look at this example would
`be to say that
`the node is about
`the intersection of the
`concept of piano player with the union of the concepts
`Australia and New Zealand. The purpose of this synset
`grouping is solely to provide more accurate positioning, of
`the node within semantic space, ancl a more accurate reflec-
`tion of the specificity of the node, both of which result in
`improved search engine retrieval,
`The lexicon is updated and further developed primarily
`when ascribing nodes to a location in semantic space is
`impossible because the needed meaning does not exist in the
`semantic space. ‘Thus, blocks 110 and 115 can be thought of
`as implemented in complementary fashion to one another.
`Block 120
`If the semantic space laid out by the lexicon developed as
`described wiih respect to block 110is to be effectively used,
`the concept of “distance” from one meaning to another
`within this space must be defined. The input to this portion
`of the process is the lexicon itself and the output is a table
`of information that details the distance from each meaning,
`to each other meaning that falls within a certain “radius of
`semantic closeness”.
`The closeness of meaningsis allected to a large degree by
`their perceived “precision”. For example, we can guess at
`how close the concepts of “sports” and “baseball” are based
`on the fact
`that there are many other particular kinds of
`sports under “sports” than baseball. As baseball appears to
`be one of many, it’s connection to the concept of “sports” is
`not as strong as if, say, there were only two sports in the
`world, and baseball was thus one of only two possibilities
`for what Is meant by “sports”, This idea is reflected in an
`algorithm that estimates the “kind of” and “part of” preci-
`sion of a meaning based onthe total countofits descendants,
`following “kind of and “part of” relationships. In these
`calculations, meanings marked as “instances” are biased
`against, as they would tendto incorrectly dilute the precision
`of a concept otherwise.
`Differences in estimates of precision are used to generate
`a semantic distance between two directly connected mean-
`ings only when a connection strength has not been set.
`Manual settings override the calculated estimates; thus the
`semantic distance results come about from a combination of
`automatically estimated connection strengths, and strengths
`that have been manually set.
`The process for discovering meanings thai are semanti-
`cally close lo a specific meaning mvolves a traditional
`breadth-first search outward from the origin meaning.
`Neighboring meanings in the network of nodes are explored
`in an Outward sccking fashion, and distance from the origin
`is tracked, When a certain radius has been reached,
`the
`search stops. Intricacies in this search include the following:
`l. A “sealing factor’, somewhat
`like a “velocity” is
`tracked as the search spreads outward, This scaling
`factor multiplies the perceived distance for a single
`jump. One net effect of this factor is to reduce the
`perceived distance to meanings thal are close, thus the
`
`FU
`
`1)
`
`20)
`
`aon
`
`30
`
`3
`
`40
`
`45
`
`50)
`
`55
`
`60
`
`65
`
`‘hd
`
`6
`drop-olf of distance is not linear as the search expands.
`This is a result of an increase in sealing factor based
`linearly on the previous jumpdistance,
`. The scaling factor is also modified by a change in
`direction of the search within the lexicon hierarchy. Por
`example, a jump down to a child from a parent that was
`previously jumped up to from another child, incurs a
`scale factor increase penalty. Similar penalties arise
`from jumps down then up, from jumps in “kind of” that
`occur after “part of(and vice versa), and from combi-
`nations ofthese,
`
`3, Lateral “bond” type connections also incur scale factor
`penalties, based on the set distance of the jump.
`4, *Psuedosynset” and “instance” meaningsare treated in
`a special way. When usedas the origin, they imply that
`the search for related meanings should be within a
`smaller radius, as their own greater degree of exactness
`imply a more specific kind of search for meanings is
`called for. Thus the search does not expand as far; this
`is controlled by starting the search with a higher scaling
`factor, Additionally, a different measurement of preci-
`sion is used, which includes detailed terms that are
`otherwise excluded [rom the standard precision algo-
`rithm initially. (Alternately, if the orngin meaning is nol
`a pseudo-synset or instance meaning, then the standard
`precision values excluding count of descendant pseu-
`dosynsets are used.)
`Block 130
`Once distances between meanings within the Lexicon
`have been determined, and target nodes within the directory
`have been given fixed positions within the semantic space
`described by the Lexicon, it is possible to generate scores for
`all nodes that fall close to each individual meaning. Pre-
`calculating these scores ahead of time allows a much quicker
`response time to actual searches. The inputs to this process
`are the lexicon,
`the table of relatives of each meaning,
`showing semantic distances, and the data that details what
`meanings have been attached to what nodes. The output of
`this process is information describing what nodes are close
`to each given meaning, and a “score” for that closeness,
`which is a direct reflection ofthe semantic distance from the
`origin meaning lo the meanings that have been attached to
`ihe node. Other factors that affect the pre-calculated node
`scores for a meaning are the number of meanings attached
`to the node, and the “commonness™ value of the meaning in
`question,
`An additional element of this pre-calculation step
`involves the creation oftables of information that allow very
`fast comparison between meaning when determining which
`nodes are hit by multiple meanings simultaneously. For this,
`bitmapped information that reflects a compressedversion of
`the node—meaning score information is generated.
`Essentially, if a meaning has any score whatsoever for a
`particular node, a single bil is set to 1
`in a binary field,
`marking that node as hit. By comparing the “bitmaps” for
`two meanings against each other, a quick assessment of
`“combination hits” can be made—il is simply a process of
`performing a bitwise AND operation on the two bitmaps.
`Because an uncompressed bitmap for every meaning
`would be an unmanageably large amount of data, which
`would also require a lot of processing time to analyze, the
`bitmap data is compressed. There are two levels of
`compression, cach a level & times more condensed than the
`last. In the first level of compression, a single bit represenis
`a set of 8 nodes. If a meaning has a score for any of these
`nodes, the bit is set. In the second level of compression, each
`bit corresponds to a set of 64 nodes, in the same way.
`
`
`
`US 6,453,315 Bl
`
`7
`Storing the bitmap information in this way allows large
`chunks of empty raw bitmap data to be ignored, resulting in
`a much smaller data set. In addition, the bitwise operations
`performed on these bitmaps can be done with greater speed,
`because detailed sections of the bitmap data do not have to
`be examined unless the higher-order compressed version
`indicates that there is information of value present there.
`Retneval
`Block 140
`While pre-processing and information organization may
`be an ongoing process, at anytime where at least a partial
`table of scores for nodes is available, the user can then input
`a specific search term as he/she would when using the
`conventional search engine.
`When users use the search engine to find subjects within
`the directory,
`the search engine conducts two phases of
`processing. The first phase, interpretation, involves analyz-
`ing the user’s input so that the meanings that the user desires
`ean be identified (see block 150 description), The second
`phase, collection and ranking of resulis, involves the col-
`lecting of nodes that have good scores for
`the desired
`meanings, and ordering them based on predicted relevance
`(see block 160 description).
`Block 150: Interpretation Phase
`There are two key inputs possible for a search: an input
`string, and a seLof known meanings. Either, or both, may be 2
`received and processed for results. The input string is simply
`a set of words that the user types into a search box and
`submits for query. An interpretation of these words is made,
`to map words to probable desired meanings. In addition to,
`or instead of, the input string, a set of meanings that are
`known aheadof time to be of interest, may be passed to the
`search engine for query. These require less processing, as
`plain words to not have to be mapped to meanings, as they
`are already predefined, Meanings passed tothe engine in this
`way are called preconceptions.
`The process for interpreting the input string begins with
`stemming, or morphing, of the individual words in the
`string. This involves mainly an attempt to reduce words
`perceived as being plurals to their singular form. This is
`necessary because word mappings to meanings within the
`Lexicon are stored in singular form only, except in special
`cases.
`
`30
`
`40
`
`Next, all possible combinations of words (and their mor-
`phed variants) are examined for possible matches to mean-
`ings. Larger numbers of combined wordsare tested first, to
`give preference over individual words (for example, “United
`States” must take precedence over analysis of “United” and
`“States”). Partial matches with meanings are possible, so
`that “Rocky Horror’ might bring up a match for “Rocky
`Horror Picture Show”, for example.
`As the set possible meanings is being compiled, prob-
`abilities are assigned to each, These values reflect
`the
`likelihood that
`the user really means a ceriain concept.
`Because many words have multiple meanings, probabilities
`for implied meanings for words may be manually or auto-
`matically pre-assigned. These values are used in this phase
`of the engine processing, In order lo estimate whal meanings
`are most likely implied by particular search words. Other
`factors that affect the probabilities given to meanings are:
`was the meaning matched by a morphed word or the word
`in its “pure” form (favor pure forms); was the meaning only
`partially matched the input word(s) (if so,
`reduce
`probability); was the meaning the result of a match on
`multiple words (if so, increase probability); the commonness
`of the meaning implied (favor more common meanings).
`Another kind of “concept induction” is applied to the
`analysis at this point. All implied meanings are examined
`
`45
`
`50
`
`55
`
`60
`
`65
`
`8
`and compared against each other, so that relationships might
`be discovered. If there is a connection between two
`meanings,
`those meanin