`Bushee
`
`US006711569B1
`
`(10) Patent No.: US 6,711,569 B1
`(45) Date of Patent: Mar. 23, 2004
`
`(54) METHOD FOR AUTOMATIC SELECTION
`OF DATABASES FOR SEARCHING
`
`(75) Inventor: William J. Bushee, Sioux Falls, SD
`(US)
`
`(73) Assignee: Bright Planet Corporation, Sioux
`Falls, SD (US)
`
`(*) Notice: Subject to any disclaimer, the term of this
`patent is extended or adjusted under 35
`U.S.C. 154(b) by 239 days.
`
`(21) Appl. No.: 09/911,452
`
`(22) Filed: Jul. 24, 2001
`
`(51) Int. Cl.7 ................................ G06F 17/30
`(52) U.S. Cl. ................................. 707/5; 707/6; 707/10;
`707/104.1; 715/501.1; 715/513
`(58) Field of Search ................................ 707/3, 5, 6, 9,
`707/10, 101, 102, 103 R, 2, 4, 100, 104.1;
`715/501.1, 513; 709/233
`
`(56) References Cited
`
`U.S. PATENT DOCUMENTS
`
`5,257,185 A  * 10/1993 Farley et al. ............... 707/100
`5,321,833 A  *  6/1994 Chang et al. ................ 707/5
`5,338,976 A  *  8/1994 Anwyl et al. ................ 704/2
`5,446,891 A  *  8/1995 Kaplan et al. ............... 707/2
`5,721,902 A  *  2/1998 Schultz ..................... 707/4
`5,778,363 A  *  7/1998 Light ....................... 707/5
`5,826,031 A  * 10/1998 Nielsen ..................... 709/233
`5,835,905 A  * 11/1998 Pirolli et al. .............. 707/3
`6,112,202 A  *  8/2000 Kleinberg ................... 707/5
`6,418,433 B1 *  7/2002 Chakrabarti et al. .......... 707/5
`6,510,427 B1 *  1/2003 Bossemeyer et al. ........... 707/6
`
`FOREIGN PATENT DOCUMENTS
`
`JP  11224292 A    * 8/1999 ........... G06F/17/60
`JP  411224256 A   * 8/1999 ........... G06F/17/30
`WO  WO 9204681 A1 * 3/1992 ........... G06F/15/40
`WO  WO 9712333 A1 * 4/1997 ........... G06F/17/27
`
`* cited by examiner
`Primary Examiner-Shahid Alam
`(74) Attorney, Agent, or Firm-Kaardal & Leonard, LLP
`(57) ABSTRACT
`
`A method for automatic selection of databases for improving
`the efficiency of data capture and management systems. The
`method for automatic selection of databases includes
`obtaining a candidate database listing providing a uniform
`resource locator (URL) for each one of a plurality of
`candidate databases to be considered during selection,
`obtaining a query from a user, matching a subset of
`candidate databases to said query, and storing a listing of selected
`databases to be used for retrieving information relative to
`said query.
`
`2 Claims, 2 Drawing Sheets
`
`
`
`
`U.S. Patent    Mar. 23, 2004    Sheet 1 of 2    US 6,711,569 B1
`
`Figure 1 (schematic of the system). Labeled elements: Network 2;
`Databases 4; Document 70; Computer 20 with Storage Means 22 and
`Communication Means 24; Query Input Means 30; Evaluation Portion 40;
`Candidate Database Listing 50; Selected Database Portion 60 with
`Database Location Number Segment 62, Averaged Score Segment 64,
`Score Segment 66, and Record Segment 68.
`
`
`
`U.S. Patent    Mar. 23, 2004    Sheet 2 of 2    US 6,711,569 B1
`
`Figure 2 (flow diagram of the method). Steps, in order:
`Obtain Query;
`Compare Query to Categorization of Database in Pool;
`Select Databases;
`Pass Query to Selected [Databases];
`Collect Results from Database;
`Pull First N Results from Each Database;
`Score Each of N Results;
`Average Score of N Results for Each Database;
`Assign Average Score;
`Rank Databases by Average Score;
`Present Databases and Results in Ranked Order.
`
`
`
`METHOD FOR AUTOMATIC SELECTION
`OF DATABASES FOR SEARCHING
`
`INCORPORATION BY REFERENCE
`
`This patent application discloses an invention which may
`optionally form a portion of a larger system. Other portions
`of the larger system are disclosed and described in the
`following patent applications, all of which are subject to an
`obligation of assignment to the same person. The disclosures
`of these applications are herein incorporated by reference in
`their entireties.
`METHOD AND SYSTEM FOR AUTOMATIC HARVESTING
`AND QUALIFICATION OF DYNAMIC
`DATABASE CONTENT, William J. Bushee, Thomas
`W. Tiahrt, and Michael K. Bergman, Filed Jul. 24,
`2001, application Ser. No. 09/911,522, now pending.
`AUTOMATIC SYSTEM FOR CONFIGURING TO
`DYNAMIC DATABASE SEARCH FORMS, William
`J. Bushee, Filed Jul. 24, 2001, application Ser. No.
`09/911,435, now pending.
`SYSTEM AND METHOD FOR EFFICIENT CONTROL
`AND CAPTURE OF DYNAMIC DATABASE
`CONTENT, William J. Bushee and Thomas W. Tiahrt,
`Filed Jul. 24, 2001, application Ser. No. 09/911,434,
`now pending.
`SYSTEM FOR AUTOMATICALLY CATEGORIZING
`CONTENT IN HIERARCHICAL SUBJECT
`STRUCTURES, Thomas W. Tiahrt, Michael K.
`Bergman, and William J. Bushee, Filed Jul. 24, 2001,
`application Ser. No. 09/911,433, now pending.
`SYSTEM AND METHOD FOR FLEXIBLE INDEXING
`OF DOCUMENT CONTENT, Thomas W. Tiahrt, Filed
`Jul. 24, 2001, application Ser. No. 09/911,432, now
`pending.
`
`BACKGROUND OF THE INVENTION
`
`1. Field of the Invention
`The present invention relates to search engines and more
`particularly pertains to a new method for automatic selection
`of databases for improving the efficiency of data capture and
`management systems.
`2. Description of the Prior Art
`The Internet is a worldwide system of computer networks
`in which users at any one computer may get information
`located on virtually any other computer with appropriate
`authorization. The Internet uses a set of protocols called
`Transmission Control Protocol/Internet Protocol or TCP/IP.
`The World Wide Web (often abbreviated as WWW) is a
`portion of the Internet using hypertext as a method for rapid
`cross-referencing that links one document or site to another.
`A database is a collection of data, which is organized in
`a manner that allows its contents to be easily accessed,
`managed, and updated. Given this definition an Internet site
`can be viewed as a database with a collection of data that can
`be viewed as pages, or accessible documents. Similarly, any
`network for accessing documents can be considered a
`database, including intranets and extranets. These network
`databases can be either static or dynamic. A static network
`database provides the same set of documents or pages to
`every user. A dynamic network database presents unique
`documents or pages to different users, typically as a response
`to the users' queries. Because of the similarity between web
`sites specifically and databases in general the terms document
`and web page are used synonymously throughout this
`document unless otherwise distinguished by context.
`Similarly, the terms search engine and database are also used
`synonymously throughout this document unless otherwise
`distinguished by context.
`Many enterprises, whether business, governmental, or
`other coordinated undertakings, require large amounts of
`"current" information to be analyzed and available for use in
`the daily execution of their activities. The Internet has made
`the availability of information in near real time a reality.
`However, this very current information is distributed across
`several thousand, if not millions, of computer systems linked
`to the Internet. Additionally, this information may be stored
`in various different formats, such as documents, web pages,
`and other machine readable formats. Locating information
`relevant to a specific query posed by a user often requires
`specific knowledge of the information's location, sophisticated
`search strategies and even professional researchers.
`The use of search engines to locate information related to a
`user's query is well known and has to some extent sped up
`the process of locating related information.
`A significant portion of related information returned by
`search engines may not be considered truly relevant to a
`user's query. The resources required to evaluate all of the
`information identified by a search engine in order to filter out
`non-relevant information can be more than substantial. The
`resources used may include, by way of example and not
`limitation, transmission bandwidth, data storage, and time
`(both of system usage and of personnel) required to filter out
`related but not relevant information. The need to capture and
`organize relevant information can be overwhelming, and an
`automated system is required to effectively solve this problem.
`In these respects, the method for automatic selection of
`databases according to the present invention substantially
`departs from the conventional concepts and designs of the
`prior art, and in so doing provides a system primarily
`developed for the purpose of improving the efficiency of
`data capture and management systems.
`
`SUMMARY OF THE INVENTION
`
`In view of the foregoing disadvantages inherent in the
`known types of search engines now present in the prior art,
`the present invention provides a new method for automatic
`selection of databases construction wherein the same can be
`utilized for improving the efficiency of data capture and
`management systems.
`The invention contemplates a method of selection and
`characterization of search engines and databases which
`includes obtaining a candidate database listing providing a
`uniform resource locator (URL) for each one of a plurality
`of candidate databases to be considered during selection,
`obtaining a query from a user, matching a subset of candidate
`databases to said query, and storing a listing of selected
`databases to be used for retrieving information relative to
`said query.
`There has thus been outlined, rather broadly, the more
`important features of the invention in order that the detailed
`description thereof that follows may be better understood,
`and in order that the present contribution to the art may be
`better appreciated. There are additional features of the
`invention that will be described hereinafter and which will
`form the subject matter of the claims appended hereto.
`In this respect, before explaining at least one embodiment
`of the invention in detail, it is to be understood that the
`invention is not limited in its application to the details of
`construction and to the arrangements of the components set
`forth in the following description or illustrated in the drawings.
`The invention is capable of other embodiments and of
`being practiced and carried out in various ways. Also, it is
`to be understood that the phraseology and terminology
`employed herein are for the purpose of description and
`should not be regarded as limiting.
`As such, those skilled in the art will appreciate that the
`conception, upon which this disclosure is based, may readily
`be utilized as a basis for the designing of other structures,
`methods and systems for carrying out the several purposes
`of the present invention. It is important, therefore, that the
`claims be regarded as including such equivalent constructions
`insofar as they do not depart from the spirit and scope
`of the present invention.
`The objects of the invention, along with the various
`features of novelty which characterize the invention, are
`pointed out with particularity in the claims annexed to and
`forming a part of this disclosure. For a better understanding
`of the invention, its operating advantages and the specific
`objects attained by its uses, reference should be made to the
`accompanying drawings and descriptive matter in which
`there are illustrated preferred embodiments of the invention.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`The invention will be better understood and objects other
`than those set forth above will become apparent when
`consideration is given to the following detailed description
`thereof. Such description makes reference to the annexed
`drawings wherein:
`FIG. 1 is a schematic functional interconnect view of a
`new system for automatic selection of databases according
`to the present invention.
`FIG. 2 is a schematic flow diagram of a method aspect of
`the present invention.
`
`DESCRIPTION OF THE PREFERRED
`EMBODIMENT
`With reference now to the drawings, and in particular to
`FIGS. 1 and 2 thereof, a new method for automatic selection
`of databases embodying the principles and concepts of the
`present invention will be described.
`As best illustrated in FIG. 1, the system 10 for the
`automatic selection of databases generally comprises a
`computer system 20, a query input means 30, and an evaluation
`portion 40.
`The computer system 20 includes a storage means 22 for
`facilitating the retention and recall of dynamic database
`content and a communications means 24 for performing
`bi-directional communication between the computer system
`20 and a network 2.
`The query input means 30 of the system 10 is used for
`receiving a plurality of queries from a user and transferring
`the plurality of queries to a plurality of databases 4.
`The evaluation portion 40 of the system 10 is used for
`capturing, storing and scoring a plurality of responsive
`documents 70 (such as, for example, web pages) returned by
`each one of the plurality of databases 4 in response to the
`user's query.
`A candidate database listing 50 may provide an index of
`uniform resource locators (URLs) for each database 4 to be
`considered for selection in response to the user's query.
`The evaluation portion 40 determines a page score as a
`numerical representation of each one of the responsive
`documents 70 associated with each one of the plurality of
`databases 4. The page score is a numerical representation of
`the relative relevancy of the document 70 to the user's query.
`The evaluation portion further determines an averaged score
`for each one of the plurality of databases 4 based upon an
`average of each one of the page scores. The averaged score
`is used to evaluate the relevancy of the database 4 to the
`user's query.
`A selected database portion 60 provides information
`related to each one of the plurality of databases 4 being
`selected as relevant to the user's query. The selected database
`portion 60 may provide a plurality of fields for storing
`this information about each of the databases.
`In one embodiment of the invention, the plurality of fields
`include a database location number segment 62, an averaged
`score segment 64, a plurality of score segments 66, and a
`plurality of record segments 68. The database location
`number segment 62 provides a cross-reference to a location
`in the candidate database listing 50 of a URL associated with
`the database 4. The averaged score segment 64 records the
`averaged score for the database 4 for the user's query. Each
`one of the plurality of score segments 66 records the page
`score for one of the responsive documents 70 used to
`determine the averaged score for the database 4. Each one of
`the record segments 68 provides a cross-reference to a
`location of each one of the responsive documents 70 in a
`storage medium.
`Each of the database location segments 62 and the record
`segments 68 may comprise a 64-bit representation of location
`for facilitating access to more than 4.3 billion discrete
`locations.
`A listing of candidate databases is provided to the system
`for consideration with respect to a query or series of queries
`provided by a user. The queries may be provided directly by
`the user, or may be passed to the system through a file
`transfer or file access process.
`The subject query being processed is forwarded or passed
`to each of the candidate databases (e.g. from the listing), and
`the system waits for the databases to provide responsive web pages.
`Typically these responsive web pages will provide URLs for
`responsive documents. Each URL may be followed to the
`document and a copy of the associated document is captured
`for evaluation.
`An evaluation parameter may be used to define a maximum
`number of responsive documents to be captured from
`each one of the plurality of databases. In a preferred embodiment
`the evaluation parameter may be set and adjusted by
`the user to a maximum number of responsive documents.
`The evaluation parameter preferably may have a value in the
`range between 2 and 20 (inclusive) documents. More
`preferably, the evaluation parameter has a value falling in
`the range between 4 and 10 documents (inclusive). Most
`preferably, the evaluation parameter has a value of approximately
`5 documents.
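As an illustration of the evaluation parameter described above, the following minimal sketch keeps a user-supplied value inside the stated preferred range. The function name, the default of 5, and the choice to clamp out-of-range values (rather than reject them) are assumptions made for this example; the text only states preferred ranges.

```python
# Hypothetical helper; the names and the clamping behavior are assumptions.
PREFERRED_MIN = 2    # lower bound of the broadest preferred range in the text
PREFERRED_MAX = 20   # upper bound of the broadest preferred range in the text
DEFAULT_DOCS = 5     # "approximately 5 documents" is the most preferred value


def evaluation_parameter(requested=None):
    """Return the maximum number of documents to capture per database."""
    if requested is None:
        return DEFAULT_DOCS
    # Clamp the user's value into the inclusive range [2, 20].
    return max(PREFERRED_MIN, min(PREFERRED_MAX, int(requested)))
```

A caller could pass the result as the per-database capture limit; a stricter variant might raise an error for out-of-range input instead of clamping.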
`A database providing the documents (such as a search
`engine) may also indicate relative scores or rankings for the
`relevancy of each of the documents with respect to the query
`based upon various factors determined by the entity operating
`the database. In a preferred embodiment, the documents
`captured for storage and analysis are the documents
`with the highest associated scores or rankings determined by
`the source database.
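The selection of the highest-ranked documents can be sketched as follows. This is only an illustration: it assumes each result arrives as a hypothetical (URL, source score) pair, which is not a format the text specifies.

```python
import heapq


def select_top_results(results, n=5):
    """Pick up to n URLs with the highest source-assigned scores.

    results: iterable of (url, source_score) pairs, where source_score is the
    relevancy score or rank value reported by the source database (assumed
    here to be "higher is better").
    """
    best = heapq.nlargest(n, results, key=lambda pair: pair[1])
    return [url for url, _score in best]
```

If a source database reports rank positions rather than scores (where 1 is best), `heapq.nsmallest` would be the matching call.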
`Each of the captured document copies is stored on the
`system for recall and analysis without having to return to the
`source databases of the documents.
`Each of the documents (e.g. web pages) is then evaluated
`for the number of occurrences of the term or terms of the
`query in the document and the title of the document. The
`length of the document may also be determined for evaluating
`relevancy. This information is used to determine a
`numerical score for each document. The numerical scores
`for each document retrieved from a database are averaged
`together, and this averaged score is then assigned to the
`database as an indication of relevance of that database to the
`user's query.
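The text names the inputs to the score (query-term occurrences in the document and its title, plus document length) but gives no formula. The sketch below uses one plausible combination purely for illustration: body hits normalized by length, plus a weighted count of title hits. The weight, the tokenizer, and the function names are all assumptions.

```python
import re
from statistics import mean

TITLE_WEIGHT = 3.0  # hypothetical weight; the text does not give one


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def page_score(query, title, body):
    """Score one captured document against the query (illustrative formula)."""
    terms = set(tokenize(query))
    body_words = tokenize(body)
    if not terms or not body_words:
        return 0.0
    body_hits = sum(1 for word in body_words if word in terms)
    title_hits = sum(1 for word in tokenize(title) if word in terms)
    # Dividing by document length keeps long pages from dominating.
    return body_hits / len(body_words) + TITLE_WEIGHT * title_hits


def averaged_score(page_scores):
    """Average the page scores of the documents captured from one database."""
    scores = list(page_scores)
    return mean(scores) if scores else 0.0
```

Any monotone combination of the three inputs would fit the description equally well; the averaging step is the only part the text states explicitly.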
`The databases may be sorted or ranked by averaged score
`such that databases with relatively higher averaged scores
`are presented to the user before databases with relatively
`lower averaged scores.
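A minimal sketch of this ranking step, assuming the averaged scores are held in a dictionary keyed by database URL (a data layout chosen here for illustration only):

```python
def rank_databases(averaged_scores):
    """Return (database_url, averaged_score) pairs, highest score first.

    averaged_scores: dict mapping a database URL to its averaged score.
    Python's sort is stable, so databases with equal scores keep their
    original relative order.
    """
    return sorted(averaged_scores.items(), key=lambda item: item[1], reverse=True)
```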
`An information stream may be created which contains
`multiple information portions. Each information portion is
`associated with one of the databases still under consideration
`after initial screening or filtering.
`In a preferred embodiment, the information portions may
`include a database location number segment, an average
`score segment, a plurality of score segments, and a plurality
`of record segments. The database location number segments
`provide a cross-reference to a location of the database in the
`candidate database listing. Each of the score segments
`provides the numerical scores determined for each of the
`responsive pages used to develop the averaged score. Each
`of the record segments provides a cross-reference to a
`location of each of the captured copies of the responsive
`pages.
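One way to picture an information portion is as a small binary record. The sketch below is an assumption-laden illustration, not the patented layout: the field order, the little-endian encoding, the 32-bit count field, and the class and function names are all invented for the example. It stores location fields as 64-bit unsigned integers, in line with the 64-bit location representation mentioned elsewhere in this document (a 32-bit field tops out at 2**32 = 4,294,967,296 values, roughly 4.3 billion), and orders portions by averaged score as the claims describe.

```python
import struct
from dataclasses import dataclass
from typing import List


@dataclass
class InformationPortion:
    database_location: int       # index into the candidate database listing
    averaged_score: float        # averaged score for this database
    page_scores: List[float]     # one score per captured document
    record_locations: List[int]  # storage location of each captured copy

    def pack(self) -> bytes:
        """Serialize as: location (u64), average (f64), count (u32), scores, records."""
        count = len(self.page_scores)
        if count != len(self.record_locations):
            raise ValueError("each page score needs a matching record location")
        header = struct.pack("<QdI", self.database_location, self.averaged_score, count)
        body = struct.pack(f"<{count}d{count}Q", *self.page_scores, *self.record_locations)
        return header + body


def build_stream(portions) -> bytes:
    """Concatenate portions so higher-scoring databases come first."""
    ordered = sorted(portions, key=lambda p: p.averaged_score, reverse=True)
    return b"".join(p.pack() for p in ordered)
```

The resulting bytes could then be written to a file or other storage medium; a reader would need the same assumed layout to unpack them.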
`Therefore, the foregoing is considered as illustrative only
`of the principles of the invention. Further, since numerous
`modifications and changes will readily occur to those skilled
`in the art, it is not desired to limit the invention to the exact
`construction and operation shown and described, and
`accordingly, all suitable modifications and equivalents may
`be resorted to, falling within the scope of the invention.
`I claim:
`1. A method for the automatic selection and characterization
`of search engines and databases comprising:
`obtaining a candidate database listing providing a uniform
`resource locator (URL) for each one of a plurality of
`candidate databases to be considered during selection;
`obtaining a query from a user;
`submitting the query from the user to each one of said
`plurality of candidate databases;
`obtaining an evaluation parameter providing a predetermined
`number of responsive documents to capture;
`selecting a number of URLs associated with responsive
`documents corresponding to said evaluation parameter,
`said responsive documents being selected according to
`a score provided by said database such that higher
`scoring responsive documents are selected over lower
`scoring responsive documents;
`collecting a document associated with each one of said
`URLs;
`storing each one of said documents for analysis;
`evaluating each responsive documents for occurrence of
`the query term, length of said responsive documents,
`and title of said responsive documents;
`creating a page score for each one of said responsive
`documents associated with each one of said plurality of
`databases;
`calculating an averaged score for each one of said databases
`based upon an average of all of said page scores
`associated with each one of said databases;
`associating said averaged score with said database and the
`user's query;
`sorting said candidate listing of databases by said averaged
`score such that relatively higher scoring databases
`are presented substantially before relatively lower scoring
`databases;
`storing said listing of selected databases associated with
`the user's query;
`creating an information stream having a plurality of
`information portions, each one of said information
`portions being associated with one of said plurality of
`selected databases;
`creating a plurality of fields within each one of said
`plurality of information portions, said plurality of fields
`including a database location number segment providing
`a cross-reference to a location of said database in
`said candidate database listing, an average score segment
`for storing said averaged score for said database
`associated with the user's query, a plurality of score
`segments for storing each one of said page scores
`associated with said database, a plurality of record
`segments for storing a location of each one of said
`responsive documents associated with said database;
`sorting said plurality of information portions such that
`information portions associated with relatively higher
`scoring databases are positioned earlier in said information
`stream than information portions associated
`with relatively lower scoring databases; and
`writing said information stream to a storage medium to
`provide a selected listing of databases associated with
`the user's query to be polled for relative information.
`2. A system for the automatic selection of websites
`comprising:
`a computer system having a storage means for facilitating
`the retention and recall of dynamic database content,
`said computer system having a communications means
`for performing bi-directional communication between
`said computer system and a network;
`a query input means for receiving a plurality of queries
`from a user and transferring the plurality of queries to
`a plurality of databases;
`an evaluation portion for capturing, storing and scoring a
`plurality of responsive documents returned by said
`databases;
`an evaluation parameter defining a maximum number of
`responsive documents to be captured for each one of
`said plurality of databases;
`a candidate database listing providing an index of uniform
`resource locators (URLs) for each database to be considered
`for selection in response to the user's query;
`said evaluation portion determines a page score as a
`numerical representation of each one of said responsive
`documents associated with each one of said plurality of
`databases, said evaluation portion further determining
`an averaged score for each one of said plurality of
`databases based upon an average of each one of said
`page scores, said averaged score being used to evaluate
`relevancy of said database to the user's query;
`a selected database portion providing information related
`to each one of said plurality of databases being selected
`as relevant to the user's query, said selected database
`portion providing a plurality of fields;
`wherein said plurality of fields further comprises:
`a database location number segment providing a cross-reference
`to a location of a URL associated with said
`database in said candidate database listing;
`an averaged score segment recording said averaged
`score for said database associated with the user's
`query;
`a plurality of score segments, each one of said plurality
`of score segments recording said page score for each
`one of said responsive documents used to determine
`said averaged score;
`a plurality of record segments, each one of said record
`segments providing a cross-reference to a location of
`each one of said responsive documents stored for
`determining said page scores;
`wherein each one of said database location segments
`and said record segments comprise 32-bit representations
`of location; and
`wherein each one of said database location segments
`and said record segments comprise 64-bit representations
`of location for facilitating accessing more
`than 4.3 billion discrete locations.
`
`* * * * *
`
`
`