Information Extraction from Web Documents

Mario Magnanelli and Antonia Erni and Moira Norrie
Institute for Information Systems
Swiss Federal Institute of Technology (ETH)
ETH-Zentrum, CH-8092 Zurich, Switzerland
{magnanel, norrie, erni}@inf.ethz.ch

Abstract

We describe an Internet agent which gathers information from the Web in order to maintain a local database and ensure its currency. As a specific application, we detail an agent maintaining a database with information about academic contacts, their projects and publications. Agent operation is driven by an extraction profile which specifies what information is to be extracted from Web documents and how. The agent detects new and updated information and, when the confidence level is above a user-specified threshold, automatically updates the database accordingly.

1 Introduction

The World Wide Web (WWW) has become a major source of information about all areas of interest. Users typically spend many hours searching not only for new Web documents, but also for updates to documents. For example, an academic may look for new technical reports, a financial analyst for new economic data and a computer enthusiast for new software products and versions. Further, it also requires significant time to download information and effort to organize it in a convenient form.
To assist users in the tasks of finding, fetching and working with information published in Web documents, we use an Internet agent to gather information and store it in a local client database, thereby allowing users to browse, query and process that information at their convenience. Agent operation is driven by a combination of an extraction profile, specifying what information is to be extracted from Web documents and how, and the local database, specifying the particular entities of interest. Thus, the user accesses the local database system and it is the responsibility of the agent to maintain this database and ensure its currency.
While the approach is general and the agent dynamically configurable, here we use a specific application system, Academia, to describe the operation of the agent and the information extraction process. Academia is a system to support academics by automatically keeping track of contact information for other researchers, such as telephone numbers and email addresses, and also information on their projects and publications.
The Academia agent runs in the background, periodically searching the Web. The frequency of the search is specified by the user. By creating an entry for each researcher of interest, the user effectively specifies the domain of interest and the agent uses this information to know who or what to search for.
The information extraction process is controlled by an extraction profile which specifies how information is to be extracted from Web documents based on a combination of keyword searches, term matching and proximity measures. Confidence measures are associated with the various extraction patterns, thereby allowing the agent to calculate reliability scores for extracted information items. These reliability scores, along with user-specified confidence thresholds, determine whether, for a given information item, the agent updates the database directly or consults the user.
Academia combines techniques developed in various research areas for extracting information from Web documents. In the database area, systems are being developed to allow querying over dynamically generated Web documents. For example, in [Hammer et al.], a language is proposed for specifying extraction patterns to enable structured objects to be constructed from information contained in HTML documents. These systems only work over fixed Web sites for which patterns have been specified. In contrast, our agent does not base extraction on fixed patterns and can extract information from any form of Web page.
Our agent does use pattern-based extraction mechanisms to extract information on publications and projects. However, the agent itself generates these patterns based on the structure of individual items found in repeating items such as HTML lists and tables. Similar techniques have been used, for example, in comparative shopping agents to extract information from specific sites of on-line stores [Doorenbos et al.]. However, these
agents use training keywords to learn the patterns of announced pages, while our agent finds pages by itself and does not need explicit training keywords.
Work such as [Menczer, 1997] and [Armstrong et al., 1995] uses more complex retrieval functions, but focuses mainly on presenting whole Web pages to the user. In our agent, the extraction profile drives retrieval by specifying how to find possible pages of interest, and the agent's main task is then to extract information from these pages.
Section 2 describes the components and operation of the Academia system and section 3 gives details of the extraction profile and the extraction process. Section 4 describes the specific process of extracting information on publications. Section 5 describes how confidence values are assigned to extracted facts. Finally, concluding remarks are given in section 6.

2 Academia System

Academia is used to reduce the work of an academic in finding and updating information about other researchers. While we use this specific application to explain our general extraction mechanisms, we note that the general concepts of this system may be used in other applications and, with this aim in mind, the agent can be dynamically configured. Figure 1 shows the components of the Academia system and the work flow between them.

Figure 1: The components of Academia

The Academia database is implemented using the OMS object-oriented database management system (DBMS) described in [Norrie and Würgler, 1997; Norrie, 1993]. OMS provides a graphical browser, full query language and methods which are used to support user operations such as downloading documents. Since the system also supports URLs as a base type, viewing Web pages and sending email via an Internet browser can be done directly from OMS. Further, since a generic WWW interface for OMS is available, the Academia database can also be accessed through such a browser.
The key contact information in the database consists of person names and WWW addresses. The name is necessary to identify the person, while the address is a general starting point for the agent to search for updates.
The database also stores general facts about persons such as title, address, photo and information about research activities including the titles of publications, URLs leading to abstracts or a publication file, project titles and URLs of pages containing further information on the project.
The user accesses the database directly to retrieve and process information on academic contacts. The Academia agent provides a value-added service by using information extracted from Web documents to maintain the database and ensure its currency. The agent may either update the database directly, or consult with the user as to whether or not it should perform the updates.
The extraction process of the agent is specified by an extraction profile. For a given application system such as Academia, this profile is provided as part of the system. However, the user could adapt it to search for additional information. In section 3, the profile is explained in detail.
An Academia agent runs in the background according to the periodicity specified by the user. It first reads the name and WWW address of each person in the database to determine the search domain. If the agent does not find a WWW address for a person, it tries to find one by using the AltaVista search engine. In this case, the only search arguments are the first and last name of the person and, of course, it is not certain that relevant documents will be found. The agent performs a search with each of the first ten pages returned by AltaVista and, in the case that information is found, later consults with the user who decides whether this information is reliable and should be stored in the database. We note that other search engines, including those specifically for personal home pages, have been tried and we are investigating which combinations of search engines are best for our application.
Given one or more possible home pages for a person, the agent starts to extract information from these and referenced pages. Searching home pages is done in two basic ways: keyword-based and pattern-based search. In the case of keyword-based search, the agent searches for keywords as specified in the extraction profile. For each keyword, a set of options is specified which tells the agent what information may be found in proximity to the keyword. For example, if a URL follows the keyword "www", it is likely to be a link to another home page. Details of the extraction process and the format of the extraction profile are given in the next section. Although such keyword searching is relatively simple, it has proved effective and is used in Academia to find general information about a person and also potential links to pages containing publication lists or project descriptions.
Pattern-based search is used to find information about publications and projects. In most cases, this information is represented in lists and cannot be extracted by the keyword approach. For example, publications are frequently represented within Web documents as an HTML list with each item giving the authors, title, publication information and one or more URLs to download the document. The keywords "author" or "title" do not occur explicitly. Our agent therefore tries to detect a recurring pattern in the HTML page indicating the occurrence of such a list. This is based on HTML commands around text items and the use of lists, tables and different fonts to structure information. Details of pattern-based search are given in section 4.
As mentioned, the agent starts searching from a potential home page and repeatedly follows interesting links to search for further information. Links are collected in a search list and can be of three types: links that likely lead to general information, to publications or to research topics. The agent searches each link using the corresponding search technique defined for each type of link. A future version of Academia will contain the possibility for the user to alter these search techniques.
After the search for one person, a confidence value (CV) is computed for each piece of information found, based on the reliability of the extraction pattern used to find that item and the level of corroborating and/or contradictory information found. For example, if the same telephone number is found in several places, the level of confidence will be high. However, if different phone numbers are found on different pages, the confidence will be low. These CVs can only be calculated at the end of the search since it is not possible to predict when and where items will be found.
Once the search is complete, the agent starts the interaction with the database. For every fact that has a CV greater than the user-specified threshold, the agent writes the fact in the database and records this action in a log which the user may access to examine the agent's actions. For facts which have CVs below the threshold, the agent will later consult the user, who decides whether the fact will be stored or not. The agent stores the decisions of the user for future reference, thereby avoiding repeatedly asking the user the same questions. As the user gains more confidence in the agent, they may lower the threshold to give the agent greater autonomy.
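
The update step just described can be sketched as follows. This is a minimal illustration rather than the Academia implementation: the `Fact` record, the function `apply_facts`, and the plain dictionary standing in for the OMS database are all assumptions introduced here, and facts whose CV equals the threshold are simply treated like those below it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    person: str      # whose database entry the fact belongs to
    attribute: str   # e.g. "phone" (hypothetical attribute name)
    value: str
    cv: float        # confidence value in percent


def apply_facts(facts, threshold, database, log, decisions):
    """Store facts above the threshold; return those needing user review.

    `decisions` remembers earlier user answers, keyed by
    (person, attribute, value), so the same question is not asked twice.
    """
    pending = []
    for fact in facts:
        key = (fact.person, fact.attribute, fact.value)
        if fact.cv > threshold:
            # Confident enough: update directly and record the action.
            database[(fact.person, fact.attribute)] = fact.value
            log.append(f"stored {fact.attribute} for {fact.person}: {fact.value}")
        elif key in decisions:
            # The user has already answered this question before.
            if decisions[key]:
                database[(fact.person, fact.attribute)] = fact.value
        else:
            pending.append(fact)
    return pending
```

Lowering `threshold` moves more facts into the first branch, which corresponds to giving the agent greater autonomy.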

3 Extraction Profile

In this section, we describe the extraction profile in detail. To assist understanding, we explain some of the issues that led to our solutions by means of examples.

The profile consists of a set of extraction patterns, each of which specifies a keyword to be searched for and also its significance in terms of the form of information to be extracted, the proximity of this information, any supporting keywords and an associated CV.
The general idea behind the extraction process based on these patterns is as follows. The agent first searches the page for a keyword of an extraction pattern. If a keyword is found, it indicates that information we seek may be located in the vicinity. There are several ways in which this information can be found. For example, consider the case of looking for a person's phone number. The keyword "phone" indicates that a phone number may follow. Phone numbers usually consist of digits with a few special characters. Further, it is very likely that this number follows immediately after the keyword. With this background knowledge, it is not difficult to extract a string that is likely to be the phone number if it exists. As another example, consider finding the title of a person. If the keyword "prof" is found followed by the name of the person, it is likely to be their title.
Such reasoning leads to the specification of the various extraction patterns. At any stage, the user can add new patterns or refine existing ones. Part of the extraction profile for the Academia agent is shown in table 1.

Keyword      R  in  D      N      FN     ML     XL     C      Obj
prof         b  x   [...]  [...]  [...]  [...]  [...]  [...]  title
prof         b  x   [...]  [...]  [...]  [...]  [...]  [...]  title
publication  p  l   [...]  [...]  [...]  [...]  [...]  [...]

Table 1: Example of the extraction profile

Each line in the profile corresponds to an extraction pattern specifying a keyword along with additional information to determine if a fact of the appropriate form has been found and how to extract that fact. The extraction pattern is specified in terms of a number of attributes, some of which are optional. We start by explaining the attributes shown in table 1.
Attribute R specifies the type of information to be extracted. The agent distinguishes between two main categories of information: reference and textual information. Textual information consists of facts to be extracted. It may be of several types. If R=s, a string value is to be extracted. If R=b, a Boolean value is returned indicating whether or not a specific term has been located. For example, for the title of a person, we simply want to know whether a specific designation such as "prof" appears in front of that person's name. Other types include email addresses (R=e), dates (d) and images (i).
Reference information consists of links to other Web documents of interest and is used to direct the search. It can be one of the three main link types, general page (l), publication page (p) or research page (r), or it can be a link to a Web page where a "finger" command is performed (f).
The next attribute, in, determines the position of the keyword in the Web document. The keyword may be found in usual text (x), in text belonging to a link (k), in the title of a page (t), in a header (h), inside of an HTML command (c) or in a link reference (l). Thus, in row 3 of table 1, we specify that "publication" has to be found in a link, indicating that a reference to a Web page containing a list of publications has possibly been found.
D determines the locality of the information to be extracted in terms of the maximum distance in characters from the keyword. A value of [...] means the result can be at any distance from the keyword.
The attributes N and FN can be used to specify that the surname or forename of the person must appear in proximity to the keyword. For example, the first extraction pattern of table 1 specifies that the surname of the person must appear at most [...] characters from the keyword "prof". This is used to check that the designation belongs to the person whose information we are seeking. The second line specifies that the forename also occurs. A value of [...] for either of these attributes means that the corresponding name does not have to occur. ML and XL determine the minimum and maximum length, respectively, of the resulting information for types string and email address.
The CV associated with an extraction pattern is specified in attribute C. It is given in percent. More about CVs is given in section 5.
The last attribute, Obj, is used to tell the agent where the extracted information is to be stored in the database. In the case of reference information, no information is stored and therefore Obj is unspecified.
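
One way to picture an extraction pattern with the attributes described above is as a small record, together with a check for a Boolean (R=b) pattern such as "prof". This is a sketch only, not the authors' data structure: the class and function names are hypothetical, the numeric values used with it are invented for illustration, the use of 0 to mean "name need not occur" is an assumed convention, and distances are measured between start positions in plain text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExtractionPattern:
    keyword: str
    r: str                     # result type: "s", "b", "e", "d", "i" or a link type
    where: str                 # attribute "in": "x", "k", "t", "h", "c" or "l"
    d: int = 0                 # max distance of the result from the keyword
    n: int = 0                 # max distance of the surname (assumed: 0 = not required)
    fn: int = 0                # max distance of the forename (assumed: 0 = not required)
    ml: int = 0                # minimum result length
    xl: int = 0                # maximum result length
    c: float = 0.0             # confidence value in percent
    obj: Optional[str] = None  # target in the database, None for references


def name_near(text: str, keyword_pos: int, name: str, max_dist: int) -> bool:
    """True if `name` starts within `max_dist` characters of the keyword."""
    if max_dist == 0:
        return True  # assumed convention: the name does not have to occur
    lowered, target = text.lower(), name.lower()
    start = lowered.find(target)
    while start != -1:
        if abs(start - keyword_pos) <= max_dist:
            return True
        start = lowered.find(target, start + 1)
    return False


def match_boolean(text: str, pattern: ExtractionPattern,
                  surname: str, forename: str) -> bool:
    """Evaluate a Boolean (R=b) pattern against plain text."""
    lowered, key = text.lower(), pattern.keyword.lower()
    pos = lowered.find(key)
    while pos != -1:
        if (name_near(text, pos, surname, pattern.n)
                and name_near(text, pos, forename, pattern.fn)):
            return True
        pos = lowered.find(key, pos + 1)
    return False
```

For instance, `match_boolean("Welcome to the page of Prof. Moira Norrie", ExtractionPattern("prof", "b", "x", n=20, obj="title"), "Norrie", "Moira")` evaluates to `True`, because the surname starts 12 characters after the keyword.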
In table 2, we show other optional attributes for specifying in more detail the format of values to be extracted. Note that we have omitted here the CVs, which happen to all be [...].

Keyword  R  in  D      CharSet  ML     XL     Obj
[...]    e  x   [...]  -        [...]  [...]  [...]
phone    s  x   [...]  [...]    [...]  [...]  phone
finger   f  k   [...]  -        [...]  [...]  finger

Table 2: Example of the extraction profile

CharSet specifies all possible characters allowed to occur in a string value. These are used in table 2 to specify expected forms of telephone numbers. Thus, to find a phone number, the agent looks for a string containing only those characters and starting within a distance of [...] characters after the keyword "phone". The keyword must occur in usual text. The result is a phone number with a length between [...] and [...].
In the case of email and finger information, the character sets are unspecified; however, the result types e and f indicate the format of values to be extracted. Thus, for an email address, the agent automatically looks for a string containing a "@".
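
The string extraction described for phone numbers can be sketched as below. The function name `extract_string`, the character set `PHONE_CHARS`, and the distance and length limits in the usage note are assumptions chosen for illustration; the actual values of the Academia profile are not reproduced here.

```python
PHONE_CHARS = set("0123456789+-/() ")  # assumed CharSet for phone numbers


def extract_string(text, keyword, charset, max_dist, min_len, max_len):
    """Return the first run of `charset` characters that starts within
    `max_dist` characters after `keyword` and has an allowed length,
    or None if no keyword occurrence yields such a run."""
    lowered = text.lower()
    key = keyword.lower()
    pos = lowered.find(key)
    while pos != -1:
        after = pos + len(key)
        limit = min(after + max_dist, len(text) - 1)
        for start in range(after, limit + 1):
            ch = text[start]
            # A result must begin with a non-blank character of the set.
            if ch in charset and not ch.isspace():
                end = start
                while end < len(text) and text[end] in charset:
                    end += 1
                candidate = text[start:end].rstrip()
                if min_len <= len(candidate) <= max_len:
                    return candidate
                break  # wrong length: try the next keyword occurrence
        pos = lowered.find(key, pos + 1)
    return None
```

With the invented limits `max_dist=10`, `min_len=6` and `max_len=20`, the call `extract_string("Phone: +41 1 555 01 23, Fax on request", "phone", PHONE_CHARS, 10, 6, 20)` returns `"+41 1 555 01 23"`, while a sentence such as "My phone is broken" yields `None`.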
Another part of the extraction profile is given in table 3. SK is used to specify a second keyword that has to occur in the same Web document. SKD specifies the maximum distance of the second keyword from the first. A value of [...] means any distance.

Keyword  R  in  D      SK    SKD     ML     XL     C      Obj
home     l  k   [...]  page  [...]   [...]  [...]  [...]  -
my       l  x   [...]  work  [...]   [...]  [...]  [...]  -
project  r  x   [...]  lead  -[...]  [...]  [...]  [...]  -

Table 3: Example of the extraction profile

The first two extraction patterns in table 3 are used to get links to pages with general facts. The first specifies that keywords "home" and "page" must both occur in text that belongs to a link. The distance between them is not specified, but they have to occur in the same link text. The second line specifies that "work" should begin within [...] characters of "my" and both should appear in regular text. The extracted link to a further Web document of possible general interest has to be found within a maximum distance of [...] characters.
The third extraction pattern of table 3 is used to find a link to a page which may specifically contain information about projects. If keywords "project" and "lead" occur in usual text with "lead" appearing at most [...] characters before "project", a link within a distance of no more than [...] characters is assumed to be a possible link to a Web document listing projects. For example, a line of an HTML page may contain the text: "Currently, I'm leading a project called Artemis". If, following Artemis, there is a link to the home page of this project, the extraction pattern would cause the agent to extract this link and search the resulting Web document for project information.
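
The "project"/"lead" pattern above can be illustrated with a short sketch. It is a simplification under stated assumptions: the function `find_links` and the regular expression `LINK_RE` are introduced here, the search runs over the raw HTML string rather than over text classified as "usual text", and the distances in the usage note are invented values.

```python
import re

# Matches the href value of an anchor tag; a deliberately simple pattern.
LINK_RE = re.compile(r'<a\s+[^>]*href\s*=\s*"([^"]*)"', re.IGNORECASE)


def find_links(html, keyword, second_keyword, max_before, max_link_dist):
    """Collect hrefs that start within `max_link_dist` characters after an
    occurrence of `keyword`, provided `second_keyword` occurs in the
    `max_before` characters preceding that occurrence."""
    lowered = html.lower()
    links = []
    pos = lowered.find(keyword)
    while pos != -1:
        window_start = max(0, pos - max_before)
        if second_keyword in lowered[window_start:pos]:
            for match in LINK_RE.finditer(html, pos):
                if match.start() - pos > max_link_dist:
                    break
                links.append(match.group(1))
        pos = lowered.find(keyword, pos + 1)
    return links
```

Applied to `'Currently, I\'m leading a project called Artemis (<a href="http://example.org/artemis/">home page</a>).'` with `find_links(html, "project", "lead", 30, 60)`, the sketch returns `["http://example.org/artemis/"]`; the URL is a placeholder.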
The specific values shown in the example tables were those which, during testing, led to good results. We chose them by analyzing the forms of many WWW pages containing relevant information and then adapting the proximity values based on experience. More detailed information about the extraction profile and the keyword-based extraction process can be found in [Magnanelli].

4 Extraction of Publications

In this section, we describe pattern-based extraction by detailing how the agent extracts information about publications. We start by assuming that the agent has located a document or part of a document that is deemed likely to contain information on publications. The agent then looks for some form of pattern of repeated entries such as an HTML list or table structure. If the agent detects such a recurring pattern, it next tries to find the structure of the items. See figure 2 for an example of a publication list which, with respect to our agent operation, is ideal in terms of extracting information.

<H[...]>Object-Oriented Temporal Databases</H[...]>
<B>A. Steiner and M. C. Norrie.</B>
<I>Institute for Information Systems, ETH Zuerich.</I>
April [...]. <br>
Proc. [...]th Int. Conf. on DASFAA [...], Melbourne, Australia
<br><br> Available files:
<a href="ftp://.../[...]c-dasfaa.abstract">abstract</a>
<a href="ftp://.../[...]c-dasfaa.ps">postscript</a>
<p>
<H[...]>New Programming Environment for Oberon</H[...]>
<B>J. Supcik and M. C. Norrie.</B>
<I>Institute for Information Systems, ETH Zuerich.</I>
March [...]. <br>
Proc. JMLC [...], Linz, Austria
<br><br> Available files:
<a href="ftp://.../[...]b-jmlc.abstract">abstract</a>
<a href="ftp://.../[...]b-jmlc.ps">postscript</a>
<p>

Figure 2: Part of an HTML publication list

Both entries shown contain the same structure of HTML commands. We note that this case occurs seldom, as it may be that not all entries contain the same fields. For example, a particular publication entry may contain no date or proceedings. In fact, typically, the larger a pattern is, the more likely it is that there are small differences between several entries, and our agent takes this into account.
Because of possible irregularities in items, we decided not to use every HTML command to define the pattern of an entry. The tag "<br>", for example, never stands for a significant separation of two parts in an entry. Also, the links beginning with "<a ...>" should not be used because not all publications may have referenced pages or postscript versions.
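
The idea of describing each entry by its HTML commands while skipping "<br>" and "<a ...>" can be sketched as follows. This is an illustration, not the agent's actual algorithm: the names `entry_signature` and `dominant_pattern`, the use of "<p>" as the entry separator (as in the example list), and the choice of the most frequent signature as the pattern are all assumptions made here.

```python
import re
from collections import Counter

TAG_RE = re.compile(r"<\s*(/?)\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*>")
IGNORED = {"br", "a"}  # tags the text above excludes from the pattern


def entry_signature(entry_html):
    """Tuple of opening tag names in document order, skipping ignored tags."""
    signature = []
    for closing, name in TAG_RE.findall(entry_html):
        name = name.lower()
        if not closing and name not in IGNORED:
            signature.append(name)
    return tuple(signature)


def dominant_pattern(page_html, separator="<p>"):
    """Split the page into entries and return (signature, count) for the
    most frequent entry signature, or (None, 0) if there are no entries."""
    parts = re.split(re.escape(separator), page_html, flags=re.IGNORECASE)
    entries = [part for part in parts if part.strip()]
    if not entries:
        return None, 0
    counts = Counter(entry_signature(entry) for entry in entries)
    return counts.most_common(1)[0]
```

Two entries that differ only in whether they offer download links then share one signature, which is the tolerance to small irregularities that the paragraph above motivates.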
The agent first looks for the position of the name of the person in question. For example, in figure 2, we might look for publications of which Supcik is an author. "Supcik" is found between the HTML tags <B> and </B>. The agent therefore assumes that for every item, the names of the authors will be located in the corresponding part.
The agent next tries to extract the title of the publication. For this, we used the observation that the title occurs towards the beginning of an item, either in the first or second position. Also, the title usually contains at least twenty characters and seldom contains commas. The agent uses these statistics to extract the titles from the entries. The associated CV will reflect the reliability that an extracted string is the title, based on whether or not these various observations hold. Thus, if a title is less than twenty characters in length, it may still be extracted, but will have a lower CV.
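
The title heuristics can be made concrete with a small scoring sketch. The three observations (early position, at least twenty characters, no commas) come from the text above; the weights of 40, 40 and 20 points and the names `title_score` and `pick_title` are assumptions for illustration and are not the CVs used by Academia.

```python
def title_score(part, position):
    """Heuristic score (0 to 100) that `part` is a publication title.
    The weights are illustrative, not taken from the Academia profile."""
    score = 0
    if position in (0, 1):   # title is in the first or second position
        score += 40
    if len(part) >= 20:      # titles usually have at least twenty characters
        score += 40
    if "," not in part:      # titles seldom contain commas
        score += 20
    return score


def pick_title(parts, author_index=None):
    """Return (title, score) for the best-scoring part, skipping the part
    already identified as holding the author names."""
    best = None
    for index, part in enumerate(parts):
        if index == author_index:
            continue
        text = part.strip()
        score = title_score(text, index)
        if best is None or score > best[1]:
            best = (text, score)
    return best
```

A short title such as "Oberon" in first position still scores 60 in this sketch, mirroring the remark that short titles may be extracted with a lower CV.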
The agent also examines every link given in an entry. These links are also stored if they appear to be of interest to the user, for example postscript files or abstracts of the topic.

Pattern-based extraction is also used to look for information on research projects. However, in this case, the ability of the agent to extract information is not as good as for publications. The main reason for this is that information given about research projects tends to be less well-structured. In fact, it is often given as free text, without any heading or page name to indicate that it is indeed information about research topics or projects.

5 Confidence Values

Having explained how information is extracted from Web documents, we now describe how the agent determines the reliability of this information. As stated previously, each extraction pattern has an associated CV which gives a measure of the reliability of an extracted information item in isolation. To compute an overall confidence measure, the agent must consider the context in which the information was found and also the occurrence of any corroborating or conflicting information.
We refer to the CVs associated with extraction patterns as conditional possibilities, i.e.

$C_{f|k}$: the possibility that fact f occurs given a keyword k

The idea of CVs is adapted from certainty factors as defined in [Buchanan and Shortliffe]. The main difference between those values is the range. Certainty factors normally range from -1 (complete uncertainty) to +1 (complete certainty). Our CVs range from 0 to infinity, because there exists no complete certainty whether a fact found is reliable. We let the user set a threshold which indicates the CV that a fact has to reach in order to seem reliable to the user.
Mathematically, there is no complete certainty but, in practice, we found many patterns which always led to reliable facts. Therefore, we decided to use percentage values for the CVs to indicate the reliability of the extraction pattern in terms of the number of cases in which the pattern leads to correct facts. Thus, a value of 50 percent means that only in half of all cases a fact extracted using that pattern is, in isolation, considered reliable.
It is however not sufficient to consider only the effectiveness of a given extraction pattern in calculating the CV of a fact. For example, it may indeed be the case that an extraction pattern leads correctly to a phone number, but that we have a low confidence that the page being analyzed contains information about the person in question. Thus, the CV of a fact also depends on the CV associated with the context. Each page is therefore assigned a CV that indicates how likely it is that facts on this page belong to the person being processed.

$C_p$: the possibility that page p contains useful information on the person in focus

An initial page obtained from a URL stored in the database is allocated a CV of [...], indicating that it is certain that this page contains information about the person in question.
To obtain the final CV for an occurrence of a fact, we multiply the CV of the associated keyword extraction pattern by that of the page in which it was found. Of course, the fact found can also be a new page to process further.

$C_f = C_p \cdot C_{f|k}$

At the end of the search of all pages concerning one person, the same fact may be found more than once. In this case, the information is considered more reliable and as a CV for that fact we take the sum of the CVs of all equal facts found:

$C_f = \sum_i C_{f_i}$

Note that, with this rule, it is possible to get CVs above 100 percent.
Also, it is possible that similar, but unequal, facts may be found. In such a case, according to the similarity, we effectively merge the similar facts by selecting that with the highest CV.

$C_f = \max_i C_{f_i}$
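
The three rules above (product with the page CV, sum over equal facts, maximum over similar facts) can be combined in a short sketch. The function names and the similarity test passed in as `similar` are assumptions introduced here; dividing by 100 in the product is one reading of how two percentage values are multiplied, not something the formulas above state explicitly.

```python
from collections import defaultdict


def occurrence_cv(page_cv, pattern_cv):
    """C_f = C_p * C_{f|k}, with both values given in percent
    (assumed scaling: the product is divided by 100)."""
    return page_cv * pattern_cv / 100.0


def combine_equal(occurrences):
    """Sum the CVs of identical values: C_f = sum_i C_{f_i}."""
    totals = defaultdict(float)
    for value, cv in occurrences:
        totals[value] += cv
    return dict(totals)


def merge_similar(totals, similar):
    """Among values judged similar by `similar(a, b)`, keep only the one
    with the highest CV: C_f = max_i C_{f_i}."""
    groups = []
    for value, cv in totals.items():
        for group in groups:
            if similar(value, group[0][0]):
                group.append((value, cv))
                break
        else:
            groups.append([(value, cv)])
    return [max(group, key=lambda item: item[1]) for group in groups]
```

For example, two occurrences of "632 11 11" with CVs of 60 and 40 sum to 100; if "6321111" (CV 30) is judged similar by a comparison that ignores blanks, the merged result keeps "632 11 11" with a CV of 100. The phone numbers are invented for the example.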

It can also be the case that certain facts are dependent on other facts. For example, the title of a publication and the title of an abstract associated with that publication should be the same. What happens if the extraction process yields unequal values? If the two values are similar, for example, the title of the abstract is a substring of that of the publication, the agent will assume that the title of the publication is the correct title and also associate that with the abstract. If the CV associated with the publication title is lower than that of the extracted abstract title, then the CV of the abstract title will also be updated. We introduce this example to show that the calculation of CVs where similar, but unequal, facts are found can be quite complicated and depends on many factors including the context of the search and the type of the facts being extracted. It is beyond the scope of this paper to present all details of the confidence rules for Academia; however, these are fully discussed in [Magnanelli].
It is important to point out that we consider the information in the Web not only as free to use but also as true and up to date. The agent is unable to detect that information is wrong if the correct information is not available.

6 Conclusions and Further Work

This work showed the important role that autonomous agents may take in the future in helping users to benefit from the wealth of information on the Internet without requiring them to invest vast amounts of time and effort. Our experiments with Academia showed that it can handle a large amount of data in comparatively little time. In a test set of [...] persons, it extracted approximately [...] facts, of which more than [...] percent were correct. This involved the agent searching about [...] WWW pages, a total of more than [...] megabytes.
In the future, we want to focus on improving the general operation of the agent in two ways: reducing the search time through improved caching techniques for retrieved WWW pages and improving the reliability of information extracted by means of learning techniques. Additionally, we are currently generalizing the system to further support the rapid development of other application systems through dynamic configuration.

References

[Armstrong et al., 1995] R. Armstrong, D. Freitag, T. Joachims, and T. Mitchell. WebWatcher: A learning apprentice for the world wide web. In Proc. Symp. on Information Gath-
`