`US 6,453,315 B1
`(10) Patent N0.:
`Weissman et al.
`(45) Date of Patent:
`Sep. 17, 2002
`
`US006453315B1
`
`(54) MEANING-BASED INFORMATION
`ORGANIZATION AND RETRIEVAL
`
`(75)
`
`Inventors: Adam J. Weissman; Gilad Israel
`Elbaz, both of Los Angeles, CA (US)
`
`(73) Assignee: Applied Semantics, Inc., Los Angeles,
`CA (US)
`
`( * ) Notice:
`
`Subject to any disclaimer, the term of this
`patent is extended or adjusted under 35
`U.S.C. 154(b) by 0 days.
`
`(21) Appl. No.: 09/431,760
`
`(22)
`
`Filed:
`
`Nov. 1, 1999
`
`(60)
`
`Related US. Application Data
`Provisional application No. 60/155,667, filed on Sep. 22,
`1999.
`
`Int. Cl.7 ............................ G06F 17/30; G06F 7/00
`(51)
`(52) US. Cl.
`................................................ 707/5; 707/3
`(58) Field of Search .......................................... 707/1—5
`
`(56)
`
`References Cited
`U.S. PATENT DOCUMENTS
`
`4,839,853 A *
`6/1989 Deerwester et al.
`........... 707/5
`5,056,021 A * 10/1991 Ausborn ...........
`704/9
`
`5,128,865 A *
`7/1992 Sadler ..........
`704/2
`
`704/257
`5,680,511 A * 10/1997 Baker et al.
`
`5,694,523 A * 12/1997 Wical ................ 706/45
`
`7/1998 Deerwester .......
`5,778,362 A *
`707/5
`.
`5,794,050 A *
`8/1998 Dahlgren et al.
`717/8
`
`......
`5,873,056 A *
`2/1999 Liddy etal.
`704/9
`5,940,821 A *
`8/1999 Wical
`.....
`707/3
`
`5,953,718 A *
`9/1999 Wical
`.....
`.. 707/5
`6,038,560 A *
`4/2000 Wical
`...........
`707/5
`6,101,515 A *
`8/2000 Wical et al.
`......
`707/531
`
`6/2001 Shiiyama et al.
`.............. 707/3
`6,247,009 B1 *
`OTHER PUBLICATIONS
`
`Ferri et al. “Toward a Retrieval of HTML Documents Using
`a Semantic Approach”, IEEE International Conference on
`Multimedia and Expo, vol. 3, 2000, pp. 1571—1574.*
`
`
`Caudal, P. “Using Complex Lexical Types to Model the
`Polysemy of Collective Nouns within the Generative Lexi-
`con”, Proceedings of the Ninth International Workshop on
`Database and Expert Systems Applications, 1998, pp.
`154—159.*
`
`Fellbaum, C., ed. “WordNet: An Electronic Lexical Data-
`base”, Cambridge:MIT Press, Mar. 1998. P325.5D38W67
`1998*
`
`Meijs, W. “Inferring Grammar from Lexis: Machine—Read-
`able Dictionaries as Sources of Wholesale Syntactic and
`Semantic Information”, IEEE Colloquium on Grammatical
`Inference: Theory, Applications and Alternatives, 1993, pp.
`P3/1—P3/5.*
`Budanitsky, A. and Hirst, G. “Semantic Distance in Word-
`Net: An Experimental, Application—Oriented Evaluation of
`Five Measures”, Proc. of the North Amer. Association for
`Computational Linguistics, WordNet and Other Lexical
`Reasources Workshop, Jun. 2—7, 2001.*
`
`(List continued on next page.)
`
`Primary Examiner—Jean R. Homere
`Assistant Examiner—Luke S Wassum
`
`(74) Attorney, Agent, or Firm—Pillsbury Winthrop LLP
`
`(57)
`
`ABSTRACT
`
`The present invention relies on the idea of a meaning-based
`search, allowing users to locate information that is close in
`meaning to the concepts they are searching. A semantic
`space is created by a lexicon of concepts and relations
`between concepts. A query is mapped to a first meaning
`differentiator, representing the location of the query in the
`semantic space. Similarly, each data element in the target
`data set being searched is mapped to a second meaning
`differentiator, representing the location of the data element
`in the semantic space. Searching is accomplished by deter-
`mining a semantic distance between the first and second
`meaning differentiator, wherein this distance represents their
`closeness in meaning. Search results on the input query are
`presented where the target data elements that are closest in
`meaning, based on their determined semantic distance, are
`ranked higher.
`
`24 Claims, 5 Drawing Sheets-
`
`
`
`
` K2310 if
`/ \ 5
`
`st.
`
`0
`
`
`wmmm
`
`12
`
`
`
`
`
`
`5
`
`bmdmgs
`
`4
`
`
`company
`
`
`
`19
`
`
`
`equmm 12
`
`LEGEND
`swung mmmnsmp
` <
`
`e
`s .
`
`i walk mlmonxhip
`X mam from w
`
`ELASTIC - EXHIBIT 1031
`
`ELASTIC - EXHIBIT 1031
`
`
`
`US 6,453,315 B1
`Page 2
`
`OTHER PUBLICATIONS
`
`Mihalcea, R. and Moldovan, D. “Semantic Indexing using
`WordNet Senses”, Proceedings of the ACL Workshop on IR
`and NLP Oct. 2000.*
`
`Budanitsky, A. “Lexical Semantic Relatedness and its Appli-
`cation in Natural Language Processing”, Technical Report
`CSRG—390, Computer Systems Research Group, University
`of Toronto, Aug. 1999.*
`Resnick, P. “Semantic Similarity in a Taxonomy: An Infor-
`mation—Based Measure and its Application to Problems of
`Ambiguity in Natural Language”, Journal of Artificial Intel-
`ligence Research, vol. 11, Jul. 1999.*
`Smeaton, AF. and Quigley,
`I. “Experiments on Using
`Semantic Distances Between Words in Image Caption
`Retrieval”, Proceedings of the 19th International Conference
`on Research and Development in Information Retrieval,
`Aug. 1996. pp. 174—180.*
`“Beyond Keywords: Accurate
`Sutcliffe, R.F.E.
`et
`al.
`Retrieval from Full Text Documents”, Proceedings of the
`2”“ Language Engineering Convention, Oct. 16—18, 1995 .*
`Chakravarthy, AS.
`and Haase, K.B. “NetSerf: Using
`Semantic Knowledge
`to
`Find
`Internet
`Information
`Archives”, Proceedings of the 18th Annual International
`ACM SIGIR Conference on Research and Developement in
`Information Retrieval, Jul. 1995, pp. 4—11.*
`Sutcliffe, R.F.E. et al. “The Automatic Acquisition of a
`Board—Coverage Semantic Lexicon for use in Information
`Retrieval”, Proc of the AAAI Symp ”Representation and
`Acquisition of Lexical Knowledge: Polysemy, Ambiguity
`and Generativity, Mar. 27—29, 1995.*
`
`St—Onge, D. “Detecting and Correcting Malapropisms with
`Lexical Chains”, MS. Thesis, University of Toronto, Mar.
`1995 .*
`
`Richardson, R., Smeaton, AF. and Murphy, J. “Using Word-
`net for Conceptual Distance Measurement”, roc. of the 16‘”
`Research Colloquium of
`the BCS—IRSG,
`1994. pp.
`100—123.*
`
`Buckley, C. et al. “Automatic Query Expansion Using
`SMART: TREC—3”, Proceedings of the Text Retrieval Con-
`ference (TREC3), Nov. 1994, pp. 69—81.*
`
`Voorhees, E.M. “Query Expansion Using Lexical—Semantic
`Relations”, Proceedings of the 17th Annual International
`ACM—SIGIR Conference on Research and Development in
`Information Retrieval, Jul. 1994, pp. 61—69.*
`
`Rada, R. et al. “Development and Application of a Metric on
`Semantic Nets”, IEEE Transactions of Systems, Man and
`Cybernetics, vol. 19, No. 1, Jan./Feb./ 1989, pp. 17—30.*
`
`Brachman, R.J. and Schmolze, J.G. “An Overview of the
`KL—ONE Knowledge Representation System”, Cognitive
`Science, vol. 9, 1985, pp 171—216.*
`
`Collins, AM. and Loftus, BE. “A Spreading—Activation
`Theory of Semantic Processing”, Psychological Review,
`vol. 82, No. 6, 1975, pp. 407—428.*
`
`* cited by examiner
`
`
`
`US. Patent
`
`Sep. 17, 2002
`
`Sheet 1 0f5
`
`US 6,453,315 B1
`
`110
`115
`
`Update/Develop lexicon in
`Meanings attached
`
`
`order to define semantic
`to Subject Nodes/
`space \ Target documents
`positioned in
`semantic space
`
`
`
`
`
`130
`Semantic
`
`Distances
`Calculated/Set and
`
`Meanings related
`Scores for Nodes near
`
`
`
`to each Meaning
`to each Meaning
`
`Calculated
`
`
`Calculated
`
`Receive user
`Interpret input
`Collection and
`
`
`
`inputof serach
`string as specific —> Ranking of Nodes
`
`
`terms/meanings
`for these Meanings
`Meanings
`
`
`
`
`/
`
`Retrieval
`
` preprocessmg
`'Iont
`organlza
`
`
`
`
`
`
`
`Results Presentation
`
`
`Show Meanings. allow
`
`Refine Search— user to refine search
`
`Show Node results,
`
`Go to Node—allow user to navigate to
`Nodes
`
`170
`
`
`
`180
`
`FIGURE 1
`
`
`
`US. Patent
`
`Sep. 17, 2002
`
`Sheet 2 0f5
`
`US 6,453,315 B1
`
`“
`
`Lexicon 200
`
`athletic
`equipment
`
`co
`
`mpany
`
`surfboard
`
`- 1c
`a
`equipment
`coman
`
`slalom skiing
`
`4
`
`p
`
`LEGEND:
`
`———l>— "part ol" relationship
`
`——>——— "kind of" relationship
`
`<—-————-————b
`
`lateral bond relationship
`
`FIGURE 2
`
`
`
`US. Patent
`
`Sep. 17, 2002
`
`Sheet 3 0f5
`
`US 6,453,315 B1
`
`athletic
`
`skiing
`eqUipment
`
`
`
`2
`
`1 9
`
`a
`
`e K:
`
`6
`
`Ski
`
`O
`
`1 2
`
`eqUipment
`com an
`
`1 2
`
`4 m— K2
`
`8
`
`LEGEND:
`
`<
`
`<
`
`strong relationship
`
`medium relationship
`
`<2: weak relationship
`
`X
`
`distance from "ski"
`
`FIGURE 3
`
`
`
`US. Patent
`
`Sep. 17, 2002
`
`Sheet 4 0f5
`
`US 6,453,315 B1
`
`Lexicon 400
`
`
`
`
`
`athletic
`equipment
`
`-——— Sports
`
`Skiing
`Football
`
`Baseball
`
`Tennis
`
`Skiing in California
`
`Skiing in Colorado M Directory 410
`
`Skiing in Alaska
`
`Skiing in Washington
`
`LEGEND:
`
`——{>———— "part of" relationship
`
`—-—>———— "kind oi" relationship
`
`dw—b Ialeral bond relationship
`
`FIGURE 4
`
`
`
`US. Patent
`
`Sep. 17, 2002
`
`Sheet 5 0f5
`
`US 6,453,315 B1
`
`COMPUTER SYSTEM
`
`510
`
`512
`
`511
`
`System Bus 513
`
`18
`
`Bridge 514
`
`Disk
`
`l/O Bus 515
`
`516
`
`517
`
`CD—ROM
`
`NIC
`
`NETWORK 500
`
`FIGURE 5
`
`
`
`1
`MEANING-BASED INFORMATION
`ORGANIZATION AND RETRIEVAL
`
`2
`FIG. 5 is a system diagram of one embodiment of the
`invention.
`
`US 6,453,315 B1
`
`This application claims the benefit of Provisional Appli-
`cation Ser. No. 60/155,667, filed Sep. 22, 1999.
`
`FIELD OF THE INVENTION
`
`The invention relates generally information organization
`and retrieval. More specifically,
`the invention relates to
`search engine technology.
`
`BACKGROUND
`
`The Internet, which is a global network of interconnected
`networks and computers, commonly makes available a wide
`variety of information through a vehicle known as the world
`wide web (WWW). Currently, hundreds of millions of “web
`sites,” that house and format such information in documents
`called web pages are available to users of the Internet. Since
`the content of such pages is ungoverned, unregulated and
`largely unorganized between one site and the next, finding
`certain desired information is made difficult.
`
`To aid users in finding sites or pages having information
`they desire, search engines were developed. Search engines
`and directories attempt to index pages and/or sites so that
`users can find particular information. Typically, search
`engines are initiated by prompting a user to type in one or
`more keywords of their choosing along with connectors
`(such as “and”) and delimiters. The search engine matches
`the keywords with documents or categories in an index that
`contain those keywords or are indexed by those keywords
`and returns results (either categories or documents or both)
`to the user in the form of URLs (Uniform Resource
`Locators). One predominant web search engine receives
`submissions of sites and manually assigns them to categories
`within their directory. When the user types in a keyword, a
`literal sub-string match of that keyword with either the
`description of the site in their index or the name of the
`category occurs. The results of this sub-string search will
`contain some sites of interest, but in addition, may contain
`many sites that are not relevant or on point. Though one may
`refine the search with yet more keywords, the same sub-
`string match will be employed, but to the result set just
`obtained. Almost all search engines attempt to index sites
`and documents and leave it
`to the user to formulate an
`
`appropriate query, and then to eliminate undesired search
`results themselves. Recently other search engines using
`natural language queries have been developed but these also
`often result in many undesired responses.
`The quality of the results obtained varies, but by doing
`essentially sub-string matches or category browsing,
`the
`engines are unable to properly discern what the user actually
`intends or means when a particular keyword is entered.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`The objects, features and advantages of the method and
`apparatus for the present invention will be apparent from the
`following description in which:
`FIG. 1 is a flow diagram of one or more embodiments of
`the invention.
`
`FIG. 2 illustrates a portion of a relationship based lexicon
`employed in one or more embodiments of the invention.
`FIG. 3 illustrates the concept of bond strength and seman-
`tic distance in one or more embodiments of the invention.
`
`FIG. 4 illustrates the application of synsets to categories
`in a subject directory tree.
`
`10
`
`15
`
`20
`
`25
`
`30
`
`35
`
`40
`
`45
`
`50
`
`55
`
`60
`
`65
`
`DETAILED DESCRIPTION
`
`Referring to the figures, exemplary embodiments of the
`invention will now be described. The exemplary embodi-
`ments are provided to illustrate aspects of the invention and
`should not be construed as limiting the scope of the inven-
`tion. The exemplary embodiments are primarily described
`with reference to block diagrams or fiowcharts. As to the
`fiowcharts, each block within the flowcharts represents both
`a method step and an apparatus element for performing the
`method step. Depending upon the implementation, the cor-
`responding apparatus element may be configured in
`hardware, software, firmware or combinations thereof.
`
`The searching paradigm presented herein relies on an
`unconventional approach to information retrieval; namely,
`the idea of a “meaning-based” search. Instead of simply
`indexing words that appear in target documents, and allow-
`ing users to find desired word instances within documents or
`an index, searches are instead conducted within the realm of
`“semantic space”, allowing users to locate information that
`is “close in meaning” to the concepts they are interested in.
`A search engine and searching paradigm so implemented
`enables Web users to easily locate subject categories within
`a large subject directory, such as Netscape’s Open Directory
`(a product of Netscape Communications Corporation) by a
`convenient and meaningful manner. A “target document”
`refers to a single subject page within such a directory. Such
`a subject directory,
`is arranged in a roughly hierarchical
`fashion and consists of many unique topics. By allowing
`users to refine their searches to specific meanings of words,
`the invention in its various embodiments enables users to
`
`quickly filter out undesired responses, and therefore achieve
`more precise and more relevant search results. For example,
`the user would be able to filter out results relating to the
`concept of “Bulls” as a basketball team, because they are
`only interested in the concept of “Bulls” as a kind of cattle.
`Because searches conducted using search engines imple-
`mented according to the invention result
`in presenting
`conceptual areas “near” to a particular meaning, the user is
`also presented with categories that are likely to be of
`interest, yet might have been missed by a traditional search
`approach. An example would be a result of “Cows” to a
`search on “Bulls”, which would come up as a result because
`the concepts are deemed “near” to each other in semantic
`space.
`
`FIG. 1 is a flow diagram of one or more embodiments of
`the invention.
`
`The flow diagram in FIG. 1 for implementing meaning-
`based information organization and retrieval system may be
`summarized as follows:
`
`Pre -Processing/Organization
`1. Define a “semantic space” by creating an intercon-
`nected lexicon of meanings.
`2. Determine the “semantic distance” between meanings
`in semantic space that describes how close,
`conceptually, one meaning is to another.
`3. Designate a “location” in semantic space for each target
`document.
`
`4. Pre-calculate “scores” for each “target document” for
`each relevant
`input meaning, based on nearness in
`semantic space obtained by measuring semantic dis-
`tance.
`
`
`
`US 6,453,315 B1
`
`Retrieval
`
`3
`
`1. Implement a search engine that converts an input string
`into a set of probable desired meanings, then locates
`targets based mainly on pre-calculated scores.
`2. Create a user-interface to the engine that allows users
`to easily make “initial” and “refined” searches, and that
`presents results to users in a logical and orderly fashion.
`Referring to FIG. 1
`Block 110
`
`In the pre-processing or organization stage, the first step
`is develop/update a useful meaning-based lexicon (block
`110). A “lexicon” is a network of interconnected meanings,
`it describes the “semantic space” that is employed by the
`search engine. One such lexicon already in existence con-
`sists of thousands of meanings, or “synsets”, which are
`connected to one another through two key relationships:
`“kind of” and “part of”. For example, the concept of “table”
`is connected to the concept of “furniture” through a “kind
`of” connection. Thus, “table” is a kind of “furniture”.
`Similarly, “California” is a part of “United States”.
`From this basis, the lexicon can be updated and expanded
`to include new meanings or update connections for mean-
`ings already present. New meanings can be updated mainly
`to reflect the importance in everyday culture of a large
`number of proper nouns, which are not available in many
`lexicons such as the names of companies, people, towns, etc.
`In addition, new meanings may have to be developed within
`the lexicon in order to cover subject areas of common
`interest with greater specificity. For example, whereas an
`existing lexicon might define the idea of a “programming
`language”, it may be useful to expand this concept within the
`lexicon to designate the many hundreds of specific program-
`ming languages that may exist as “kind of” children to this
`meaning.
`Additionally, an innovative type of relationship between
`meanings has been developed as a corollary to the invention,
`in order to convey information that is not treated in lexicons.
`This relationship, called a “bind”, describes one meaning’s
`closeness to another meaning in people’s common under-
`standing. For example, “skier” and “skiing” are not closely
`related concepts in existing lexicons. The former is a kind of
`“athlete”, ultimately a kind of “human being”; and thus
`would reside within the “entity” or “living thing” tree. The
`latter is a kind of “sport”, ultimately a kind of “activity”; it
`is in the “actions” tree. Though the subjects are closely
`related in everyday usage, they may be in widely separated
`locations within the lexicon’s network of meanings. To
`remedy this, a “bind” has been made between the two
`meanings, to reflect their close proximity in semantic space
`(when you think of one concept, you tend to think of the
`other). This new type of bonding between meanings is
`essential for creating a “semantic space” that will yield
`useful search results.
`
`An extension to this “bind” connection is the concept of
`varying “bond strengths” between meanings. Avalue can be
`assigned to a connection from one meaning to another that
`signifies how strongly the second meaning relates to the first.
`These connection strengths are dependent on the direction of
`the bond, so that, for example, “skier” might imply a strong
`connection to “skiing”, whereas “skiing” need not imply
`“skier” to the same degree.
`One other enhancement is the addition of a “common-
`
`ness” value to each meaning, which reflects a combination
`of “how often does this concept arise in everyday usage” and
`“how specific a concept is this”? This value allows both
`more accurate interpretation of input terms passed to the
`search engine (“what is more likely to be meant by this?”),
`
`10
`
`15
`
`20
`
`25
`
`30
`
`35
`
`40
`
`45
`
`50
`
`55
`
`60
`
`65
`
`4
`as well as improved ranking of search results, showing users
`the results that they are more likely to be interested in first
`(the more specific terms are likely to be more important to
`the user).
`Meanings within the lexicon may also be flagged in
`several new ways:
`Meanings that signify geographical places are marked as
`“locations”. This allows special calculations to come
`into play in the search engine that are unique to these
`meanings.
`Meanings that are “offensive” are marked, indicating that
`the general public may find these meanings to be
`distasteful or vulgar in some way. This flag allows us to
`give users the option to filter search results that they
`may find offensive.
`Meanings that signify specific “instances” of things are
`marked. For example, “computer company” is a
`generic kind of concept, but “Microsoft” describes a
`unique entity. Knowing when “kind of” children of a
`meaning are “instances” allows the semantic distance
`calculations to more accurately estimate the precision
`of a parent meaning (this is described in more detail
`later).
`The “currentness” of meanings may be noted. Meanings
`marked as “current” are those that are in some way
`timely; values surrounding them are more likely to vary
`over time than other concepts. For example, the word
`“Monica” might currently imply the meaning “Monica
`Lewinsky” to a greater degree today than it will a year
`from now. By marking meanings that are uniquely
`likely to change within the near future, the Lexicon
`may be more easily kept up to date.
`Meanings may be marked as “pseudosynsets”. This term
`describes meanings that are either not in common usage
`because they are highly specific or technical, or that
`exist within the Lexicon purely for the purpose of
`forming a parent category for a group of child mean-
`ings. An example of the former might be the Latin
`terms that describe phylum, class, or species within the
`biological taxonomy. An example of the latter would be
`the concept of “field sports”, which exists mainly for
`the purpose of grouping similar specific sports together
`cleanly in the Lexicon, rather than because it in itself is
`actually an oft used meaning in common usage. By
`marking “pseudosynsets”, more accurate values for
`semantic distance may be calculated (this is described
`in more detail later).
`Block 115
`
`Each subject, or node, within the directory is given a
`“location” within the semantic space that is established by
`the Lexicon. Before a search engine can find anything within
`this space, targets to retrieve must be placed there. The
`process is envisioned to be a manual one; editors examine
`each directory subject node, and decide upon the “meaning”
`of the subject. However, this process may also be automated.
`To specify a node’s location in semantic space, one or
`more individual meanings from the Lexicon are “attached”
`to the node. These meanings may be grouped together into
`“synset groups”, each of which, in a sense, describes the
`node’s position within a different dimension of semantic
`space.
`For example, a node within the directory that is about
`“Piano Players” would be assigned the meaning piano
`player. A node about “Piano Players in Australia” would be
`assigned two meanings, piano player and Australia. The two
`concepts represent two distinct “synset groups” on the node,
`
`
`
`US 6,453,315 B1
`
`5
`as each establishes a location for the node within a different
`“dimension” of meaning. Another way to look at
`this
`example is to say that the node represents the “intersection”
`of the concepts of piano player and Australia—it is where
`the two ideas come together within the directory.
`Extending this example, consider a node that is about
`“Piano Players in Australia and New Zealand”. In this case
`the meanings Australia and New Zealand might both be
`placed on the node, but grouped together in one “synset
`group”, because the combination of the two describes the
`node’s location within the “geographical location” dimen-
`sion of meaning. Another way to look at this example would
`be to say that the node is about the intersection of the
`concept of piano player with the union of the concepts
`Australia and New Zealand. The purpose of this synset
`grouping is solely to provide more accurate positioning of
`the node within semantic space, and a more accurate reflec-
`tion of the specificity of the node, both of which result in
`improved search engine retrieval.
`The lexicon is updated and further developed primarily
`when ascribing nodes to a location in semantic space is
`impossible because the needed meaning does not exist in the
`semantic space. Thus, blocks 110 and 115 can be thought of
`as implemented in complementary fashion to one another.
`Block 120
`
`If the semantic space laid out by the lexicon developed as
`described with respect to block 110 is to be effectively used,
`the concept of “distance” from one meaning to another
`within this space must be defined. The input to this portion
`of the process is the lexicon itself and the output is a table
`of information that details the distance from each meaning
`to each other meaning that falls within a certain “radius of
`semantic closeness”.
`
`The closeness of meanings is affected to a large degree by
`their perceived “precision”. For example, we can guess at
`how close the concepts of “sports” and “baseball” are based
`on the fact that there are many other particular kinds of
`sports under “sports” than baseball. As baseball appears to
`be one of many, it’s connection to the concept of “sports” is
`not as strong as if, say, there were only two sports in the
`world, and baseball was thus one of only two possibilities
`for what is meant by “sports”. This idea is reflected in an
`algorithm that estimates the “kind of” and “part of” preci-
`sion of a meaning based on the total count of its descendants,
`following “kind of” and “part of” relationships. In these
`calculations, meanings marked as “instances” are biased
`against, as they would tend to incorrectly dilute the precision
`of a concept otherwise.
`Differences in estimates of precision are used to generate
`a semantic distance between two directly connected mean-
`ings only when a connection strength has not been set.
`Manual settings override the calculated estimates; thus the
`semantic distance results come about from a combination of
`
`automatically estimated connection strengths, and strengths
`that have been manually set.
`The process for discovering meanings that are semanti-
`cally close to a specific meaning involves a traditional
`breadth-first search outward from the origin meaning.
`Neighboring meanings in the network of nodes are explored
`in an outward seeking fashion, and distance from the origin
`is tracked. When a certain radius has been reached,
`the
`search stops. Intricacies in this search include the following:
`1. A “scaling factor”, somewhat
`like a “velocity” is
`tracked as the search spreads outward. This scaling
`factor multiplies the perceived distance for a single
`jump. One net effect of this factor is to reduce the
`perceived distance to meanings that are close, thus the
`
`10
`
`15
`
`20
`
`25
`
`30
`
`35
`
`40
`
`45
`
`50
`
`55
`
`60
`
`65
`
`6
`drop-off of distance is not linear as the search expands.
`This is a result of an increase in scaling factor based
`linearly on the previous jump distance.
`2. The scaling factor is also modified by a change in
`direction of the search within the lexicon hierarchy. For
`example, a jump down to a child from a parent that was
`previously jumped up to from another child, incurs a
`scale factor increase penalty. Similar penalties arise
`from jumps down then up, from jumps in “kind of” that
`occur after “part of”(and vice versa), and from combi-
`nations of these.
`
`3. Lateral “bond” type connections also incur scale factor
`penalties, based on the set distance of the jump.
`4. “Psuedosynset” and “instance” meanings are treated in
`a special way. When used as the origin, they imply that
`the search for related meanings should be within a
`smaller radius, as their own greater degree of exactness
`imply a more specific kind of search for meanings is
`called for. Thus the search does not expand as far; this
`is controlled by starting the search with a higher scaling
`factor. Additionally, a different measurement of preci-
`sion is used, which includes detailed terms that are
`otherwise excluded from the standard precision algo-
`rithm initially. (Alternately, if the origin meaning is not
`a pseudo-synset or instance meaning, then the standard
`precision values excluding count of descendant pseu-
`dosynsets are used.)
`Block 130
`
`Once distances between meanings within the Lexicon
`have been determined, and target nodes within the directory
`have been given fixed positions within the semantic space
`described by the Lexicon, it is possible to generate scores for
`all nodes that fall close to each individual meaning. Pre-
`calculating these scores ahead of time allows a much quicker
`response time to actual searches. The inputs to this process
`are the lexicon,
`the table of relatives of each meaning,
`showing semantic distances, and the data that details what
`meanings have been attached to what nodes. The output of
`this process is information describing what nodes are close
`to each given meaning, and a “score” for that closeness,
`which is a direct reflection of the semantic distance from the
`
`origin meaning to the meanings that have been attached to
`the node. Other factors that affect the pre-calculated node
`scores for a meaning are the number of meanings attached
`to the node, and the “commonness” value of the meaning in
`question.
`An additional element of this pre-calculation step
`involves the creation of tables of information that allow very
`fast comparison between meaning when determining which
`nodes are hit by multiple meanings simultaneously. For this,
`bitmapped information that reflects a compressed version of
`the node—meaning score information is generated.
`Essentially, if a meaning has any score whatsoever for a
`particular node, a single bit is set to 1 in a binary field,
`marking that node as hit. By comparing the “bitmaps” for
`two meanings against each other, a quick assessment of
`“combination hits” can be made—it is simply a process of
`performing a bitwise AND operation on the two bitmaps.
`Because an uncompressed bitmap for every meaning
`would be an unmanageably large amount of data, which
`would also require a lot of processing time to analyze, the
`bitmap data is compressed. There are two levels of
`compression, each a level 8 times more condensed than the
`last. In the first level of compression, a single bit represents
`a set of 8 nodes. If a meaning has a score for any of these
`nodes, the bit is set. In the second level of compression, each
`bit corresponds to a set of 64 nodes, in the same way.
`
`
`
`US 6,453,315 B1
`
`7
`Storing the bitmap information in this way allows large
`chunks of empty raw bitmap data to be ignored, resulting in
`a much smaller data set. In addition, the bitwise operations
`performed on these bitmaps can be done with greater speed,
`because detailed sections of the bitmap data do not have to
`be examined unless the higher-order compressed version
`indicates that there is information of value present there.
`Retrieval
`Block 140
`While pre-processing and information organization may
`be an ongoing process, at anytime where at least a partial
`table of scores for nodes is available, the user can then input
`a specific search term as he/she would when using the
`conventional search engine.
`When users use the search engine to find subjects within
`the directory,
`the search engine conducts two phases of
`processing. The first phase, interpretation, involves analyz-
`ing the user’s input so that the meanings that the user desires
`can be identified (see block 150 description). The second
`phase, collection and ranking of results, involves the col-
`lecting of nodes that have good scores for the desired
`meanings, and ordering them based on predicted relevance
`(see block 160 description).
`Block 150: Interpretation Phase
`There are two key inputs possible for a search: an input
`string, and a set of known meanings. Either, or both, may be
`received and processed for results. The input string is simply
`a set of words that the user types into a search box and
`submits for query. An interpretation of these words is made,
`to map words to probable desired meanings. In addition to,
`or instead of, the input string, a set of meanings that are
`known ahead of time to be of interest, may be passed to the
`search engine for query. These require less processing, as
`plain words to not have to be mapped to meanings, as they
`are already predefined. Meanings passed to the engine in this
`way are called preconceptions.
`The process for interpreting the input string begins with
`stemming, or morphing, of the individual words in the
`string. This involves mainly an attempt to reduce words
`perceived as being plurals to their singular form. This is
`necessary because word mappings to meanings within the
`Lexicon are stored in singular form only, except in special
`cases.
`
`Next, all possible combinations of words (and their mor-
`phed variants) are examined for possible matches to mean-
`ings. Larger numbers of combined words are tested first, to
`give preference over individual words (for example, “United
`States” must take precedence over analysis of “United” and
`“States”). Partial matches with meanings are possible, so
`that “Rocky Horror” might bring up a match for “Rocky
`Horror Picture Show”, for example.
`As the set possible meanings is being compiled, prob-
`abilities are assigned to each. These values reflect
`the
`likelihood that
`the user really means a certain concept.
`Because many words have multiple meanings, probabilities
`for implied meanings for words may be manually or auto-
`matically pre-assigned. These values are used in this phase
`of the engine processing, in order to estimate what meanings
`are most likely implied by particular search words. Other
`factors that affect the probabilities given to meanings are:
`was the meaning matched by a morphed word or the word
`in its “pure” form (favor pure forms); was the meaning only
`partially matched the input word(s)
`(if so,
`reduce
`probability); was the meaning