ACADEMIA: An Agent-Maintained Database based on
Information Extraction from Web Documents

Mario Magnanelli, Antonia Erni and Moira Norrie
Institute for Information Systems
Swiss Federal Institute of Technology ETH
ETH-Zentrum, CH-8092 Zurich, Switzerland
{magnanel,norrie,erni}@inf.ethz.ch

Abstract

We describe an Internet agent which gathers information from the Web in order to maintain a local database and ensure its currency. As a specific application, we detail an agent maintaining a database with information about academic contacts, their projects and publications. Agent operation is driven by an extraction profile which specifies what and how information is to be extracted from Web documents. The agent detects new and updated information and, when the confidence level is above a user-specified threshold, automatically updates the database accordingly.
1 Introduction

The World Wide Web (WWW) has become a major source of information about all areas of interest. Users typically spend many hours searching not only for new Web documents, but also for updates to documents. For example, an academic may look for new technical reports, a financial analyst for new economic data and a computer enthusiast for new software products and versions. Further, it also requires significant time to download information and effort to organize it in a convenient form.

To assist users in the tasks of finding, fetching and working with information published in Web documents, we use an Internet agent to gather information and store it in a local client database, thereby allowing users to browse, query and process that information at their convenience. Agent operation is driven by a combination of an extraction profile specifying what and how information is to be extracted from Web documents and the local database specifying the particular entities of interest. Thus, the user accesses the local database system and it is the responsibility of the agent to maintain this database and ensure its currency.

While the approach is general and the agent dynamically configurable, here we use a specific application system, ACADEMIA, to describe the operation of the agent and the information extraction process. ACADEMIA is a system to support academics by automatically keeping track of contact information for other researchers, such as telephone numbers and email addresses, and also information on their projects and publications.

The ACADEMIA agent runs in the background, periodically searching the Web. The frequency of the search is specified by the user. By creating an entry for each researcher of interest, the user effectively specifies the domain of interest and the agent uses this information to know who or what to search for.

The information extraction process is controlled by an extraction profile which specifies how information is to be extracted from Web documents based on a combination of keyword searches, term matching and proximity measures. Confidence measures are associated with the various extraction patterns, thereby allowing the agent to calculate reliability scores for extracted information items. These reliability scores, along with user-specified confidence thresholds, determine whether, for a given information item, the agent updates the database directly or consults the user.

ACADEMIA combines techniques developed in various research areas for extracting information from Web documents. In the database area, systems are being developed to allow querying over dynamically generated Web documents. For example, in [Hammer et al., 1997], a language is proposed for specifying extraction patterns to enable structured objects to be constructed from information contained in HTML documents. These systems only work over fixed Web sites for which patterns have been specified. In contrast, our agent does not base extraction on fixed patterns and can extract information from any form of Web page.

Our agent does use pattern-based extraction mechanisms to extract information on publications and projects. However, the agent itself generates these patterns based on the structure of individual items found in repeating items such as HTML lists and tables. Similar techniques have been used, for example, in comparative shopping agents to extract information from specific sites of on-line stores [Doorenbos et al., 1997]. However, these
`1/6
`
`SAMSUNG EX. 1012
agents use training keywords to learn the patterns of announced pages, while our agent finds pages by itself and does not need explicit training keywords. Work such as [Menczer, 1997] and [Armstrong et al., 1995] uses more complex retrieval functions, but focuses mainly on presenting whole Web pages to the user. In our agent, the extraction profile drives retrieval by specifying how to find possible pages of interest and its main task is then to extract information from these pages.

Section 2 describes the components and operation of the ACADEMIA system and Section 3 gives details of the extraction profile and the extraction process. Section 4 describes the specific process of extracting information on publications. Section 5 describes how confidence values are assigned to extracted facts. Finally, concluding remarks are given in Section 6.
2 ACADEMIA System

ACADEMIA is used to reduce the work of an academic in finding and updating information about other researchers. While we use this specific application to explain our general extraction mechanisms, we note that the general concepts of this system may be used in other applications and, with this aim in mind, the agent can be dynamically configured. Figure 1 shows the components of the ACADEMIA system and the work flow between them.

Figure 1: The components of ACADEMIA
The ACADEMIA database is implemented using the OMS object-oriented database management system (DBMS) described in [Norrie and Würgler, 1997; Norrie, 1993]. OMS provides a graphical browser, a full query language and methods which are used to support user operations such as downloading documents. Since the system also supports URLs as a base type, viewing Web pages and sending email via an Internet browser can be done directly from OMS. Further, since a generic WWW interface for OMS is available, the ACADEMIA database can also be accessed through such a browser.
The key contact information in the database consists of person names and WWW addresses. The name is necessary to identify the person, while the address is a general starting point for the agent to search for updates.

The database also stores general facts about persons such as title, address, photo and information about research activities, including the titles of publications, URLs leading to abstracts or a publication file, project titles and URLs of pages containing further information on the project.

The user accesses the database directly to retrieve and process information on academic contacts.
The ACADEMIA agent provides a value-added service by using information extracted from Web documents to maintain the database and ensure its currency. The agent may either update the database directly, or consult with the user as to whether or not it should perform the updates.

The extraction process of the agent is specified by an extraction profile. For a given application system such as ACADEMIA, this profile is provided as part of the system. However, the user could adapt it to search for additional information. In Section 3, the profile is explained in detail.
An ACADEMIA agent runs in the background according to the periodicity specified by the user. It first reads the name and WWW address of each person in the database to determine the search domain. If the agent does not find a WWW address for a person, it tries to find one by using the AltaVista search engine. In this case, the only search arguments are the first and last name of the person and, of course, it is not certain whether relevant documents will be found. The agent performs a search with each of the first ten pages returned by AltaVista and, in the case that information is found, later consults with the user, who decides whether this information is reliable and should be stored in the database. We note that other search engines, including those specifically for personal home pages, have been tried, and we are investigating which combinations of search engines are best for our application.
Given one or more possible home pages for a person, the agent starts to extract information from these and referenced pages. Searching home pages is done in two basic ways: keyword-based and pattern-based search. In the case of keyword-based search, the agent searches for keywords as specified in the extraction profile. For each keyword, a set of options is specified which tells the agent what information may be found in proximity to the keyword. For example, if a URL follows the keyword "www", it is likely to be a link to another home page. Details of the extraction process and the format of the extraction profile are given in the next section. Although such keyword searching is
relatively simple, it has proved effective and is used in ACADEMIA to find general information about a person and also potential links to pages containing publication lists or project descriptions.

Pattern-based search is used to find information about publications and projects. In most cases, this information is represented in lists and cannot be extracted by the keyword approach. For example, publications are frequently represented within Web documents as an HTML list with each item giving the authors, title, publication information and one or more URLs to download the document. The keywords "author" or "title" do not occur explicitly. Our agent therefore tries to detect a recurring pattern in the HTML page indicating the occurrence of such a list. This is based on HTML commands around text items and the use of lists, tables and different fonts to structure information. Details of pattern-based search are given in Section 4.

As mentioned, the agent starts searching from a potential home page and repeatedly follows interesting links to search for further information. Links are collected in a search list and can be of three types: links that likely lead to general information, to publications or to research topics. The agent searches each link using the corresponding search technique defined for each type of link. A future version of ACADEMIA will contain the possibility for the user to alter these search techniques.
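The collection of typed links described above can be sketched as follows. The keyword lists, function names and the (text, url) anchor representation are illustrative assumptions; the real agent reads its link-classification criteria from the extraction profile.

```python
# Hypothetical keyword lists for classifying anchors into the three
# search-list types described in the text.
LINK_KEYWORDS = {
    "publication": ["publication", "papers", "reports"],
    "research": ["project", "research"],
}

def classify_link(link_text):
    """Return the search-list type for a link: 'publication',
    'research', or 'general' if no keyword matches."""
    text = link_text.lower()
    for link_type, keywords in LINK_KEYWORDS.items():
        if any(kw in text for kw in keywords):
            return link_type
    return "general"

def build_search_list(anchors):
    """anchors: list of (link_text, url) pairs found on a page."""
    return [(classify_link(text), url) for text, url in anchors]
```

Each entry in the resulting search list can then be processed with the search technique associated with its type.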
After the search for one person, a confidence value (CV) is computed for each piece of information found, based on the reliability of the extraction pattern used to find that item and the level of corroborating and/or contradictory information found. For example, if the same telephone number is found in several places, the level of confidence will be high. However, if different phone numbers are found on different pages, the confidence will be low. These CVs can only be calculated at the end of the search since it is not possible to predict when and where items will be found.

Once the search is complete, the agent starts the interaction with the database. For every fact that has a CV greater than the user-specified threshold, the agent writes the fact in the database and records this action in a log which the user may access to examine the agent's actions. For facts which have CVs below the threshold, the agent will later consult the user, who decides whether the fact will be stored or not. The agent stores the decisions of the user for future reference, thereby avoiding repeatedly asking the user the same questions. Whenever the user gains more confidence in the agent, he may reduce the threshold to give the agent greater autonomy.
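A minimal sketch of this decision step, assuming a simple in-memory store and hypothetical names for the log and the queue of pending user questions:

```python
# Post-search decision step: facts above the user's confidence
# threshold are written directly; others are queued for the user.
def process_facts(facts, threshold, database, log, pending):
    """facts: list of (attribute, value, cv) triples for one person."""
    for attribute, value, cv in facts:
        if cv > threshold:
            database[attribute] = value          # automatic update
            log.append((attribute, value, cv))   # user-inspectable log
        else:
            pending.append((attribute, value, cv))  # ask the user later
```

Lowering `threshold` hands more of these decisions to the agent, mirroring the autonomy trade-off described above.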
3 Extraction Profile

In this section, we describe the extraction profile in detail. To assist understanding, we explain some of the issues that led to our solutions by means of examples.
The profile consists of a set of extraction patterns, each of which specifies a keyword to be searched for and also its significance in terms of the form of information to be extracted, the proximity of this information, any supporting keywords and an associated CV.

The general idea behind the extraction process based on these patterns is as follows. The agent first searches the page for a keyword of an extraction pattern. If a keyword is found, it indicates that information we seek may be located in the vicinity. There are several ways in which this information can be found. For example, consider the case of looking for a person's phone number. The keyword "phone" indicates that a phone number may follow. Phone numbers usually consist of digits with a few special characters. Further, it is very likely that this number follows immediately after the keyword. With this background knowledge, it is not difficult to extract a string that is likely to be the phone number, if it exists. As another example, consider finding the title of a person. If the keyword "prof" is found followed by the name of the person, it is likely to be their title.

Such reasoning leads to the specification of the various extraction patterns. At any stage, the user can add new patterns or refine existing ones. Part of the extraction profile for the ACADEMIA agent is shown in Table 1.
Keyword      R  in  D  N  FN  ML  XL  C  Obj
prof         b  x   …  …  …   …   …   …  title
prof         b  x   …  …  …   …   …   …  title
publication  p  l   …  …  …   …   …   …

Table 1: Example of the extraction profile
Each line in the profile corresponds to an extraction pattern specifying a keyword along with additional information to determine if a fact of the appropriate form has been found and how to extract that fact. The extraction pattern is specified in terms of a number of attributes, some of which are optional. We start by explaining the attributes shown in Table 1.

Attribute R specifies the type of information to be extracted. The agent distinguishes between two main categories of information: reference and textual information. Textual information consists of facts to be extracted. It may be of several types. If R=s, a string value is to be extracted. If R=b, a Boolean value is returned indicating whether or not a specific term has been located. For example, for the title of a person, we simply want to know whether a specific designation such as "prof" appears in front of that person's name. Other types include email addresses (R=e), dates (d) and images (i).

Reference information consists of links to other Web documents of interest and is used to direct the search. It can be one of the three main link types:
general page (l), publication page (p) or research page (r), or it can be a link to a Web page on which a "finger" command is performed (f).

The next attribute, in, determines the position of the keyword in the Web document. The keyword may be found in usual text (x), in text belonging to a link (k), in the title of a page (t), in a header (h), inside an HTML command (c) or in a link reference (l). Thus, in the last row of Table 1, we specify that "publication" has to be found in a link, indicating that a reference to a Web page containing a list of publications has possibly been found.

D determines the locality of the information to be extracted in terms of the maximum distance in characters from the keyword; an unspecified value means the result can be at any distance from the keyword.

The attributes N and FN can be used to specify that the surname or forename of the person must appear in proximity to the keyword. For example, the first extraction pattern of Table 1 specifies that the surname of the person must appear within a given number of characters of the keyword "prof". This is used to check that the designation belongs to the person whose information we are seeking. The second line specifies that the forename must also occur. An unspecified value for either of these attributes means that the corresponding name does not have to occur. ML and XL determine the minimum, respectively maximum, length of the resulting information for types string and email address.

The CV associated with an extraction pattern is specified in attribute C. It is given in percent. More about CVs is given in Section 5.

The last attribute, Obj, is used to tell the agent where the extracted information is to be stored in the database. In the case of reference information, no information is stored and therefore Obj is unspecified.

In Table 2, we show other optional attributes for specifying in more detail the format of values to be extracted. Note that we have omitted here the CVs, which happen to all have the same value.
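The attributes above can be gathered into a small record and applied to page text. The following is a simplified sketch of our reading of Tables 1 and 2: the field values, the matching logic and the treatment of CharSet as a regular-expression character class are our assumptions, not the system's implementation.

```python
import re
from dataclasses import dataclass

@dataclass
class Pattern:
    keyword: str     # keyword to search for
    rtype: str       # R: 's' string, 'b' boolean, 'e' email, ...
    dist: int        # D: max distance in characters from the keyword
    charset: str     # allowed result characters (regex character class)
    min_len: int     # ML: minimum result length
    max_len: int     # XL: maximum result length
    cv: int          # C: confidence value in percent
    obj: str         # Obj: target attribute in the database

def apply_pattern(pattern, text):
    """Return (value, cv) for the first match of the pattern, or None."""
    pos = text.lower().find(pattern.keyword)
    if pos < 0:
        return None
    start = pos + len(pattern.keyword)
    window = text[start:start + pattern.dist]
    m = re.search(r"[%s]{%d,%d}" % (pattern.charset,
                                    pattern.min_len, pattern.max_len),
                  window)
    if m is None:
        return None
    return m.group(0).strip(), pattern.cv

# A hypothetical phone-number pattern in the spirit of Table 2.
phone = Pattern("phone", "s", 20, r"0-9 +()/-", 7, 20, 90, "phone")
```

Applying `phone` to a line such as `"Phone: +41 1 632 72 42"` would yield the digit string together with the pattern's CV of 90 percent.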
Keyword  R  in  D  CharSet  ML  XL  Obj
email    e  x   …  …        …   …   email
phone    s  x   …  …        …   …   phone
finger   f  k   …  …        …   …   finger

Table 2: Example 2 of the extraction profile
CharSet specifies all possible characters allowed to occur in a string value. These are used in Table 2 to specify the expected forms of telephone numbers. Thus, to find a phone number, the agent looks for a string containing only those characters and starting within a given distance of characters after the keyword "phone". The keyword must occur in usual text. The result is a phone number with a length between the specified minimum and maximum.

In the case of email and finger information, the character sets are unspecified; however, the result types e and f indicate the format of values to be extracted. Thus, for an email address, the agent automatically looks for a string containing an "@".

Another part of the extraction profile is given in Table 3. SK is used to specify a second keyword that has to occur in the same Web document. SKD specifies the maximum distance of the second keyword from the first; an unspecified value means any distance.
Keyword  R  in  D  SK    SKD  ML  XL  C  Obj
home     l  k   …  page  …    …   …   …
my       l  x   …  work  …    …   …   …
project  r  x   …  lead  …    …   …   …

Table 3: Example 3 of the extraction profile
`
`The rst two extraction patterns in table are
`used to get links to pages with general facts(cid:9) The
`rst species that keywords home(cid:7) and page(cid:7)
`must both occur in text that belongs to a link(cid:9) The
`distance between them is not specied(cid:4) but they
`have to occur in the same link text(cid:9) The second
`line species that work(cid:7) should begin within 
`characters of my(cid:7) and both should appear in reg(cid:8)
`ular text(cid:9) The extracted link to a further Web doc(cid:8)
`ument of possible general interest has to be found
`within a maximum distance of  characters(cid:9)
`The third extraction pattern of table is used to
`nd a link to a page which may specically contain
`information about projects(cid:9) If keywords project(cid:7)
`and lead(cid:7) occur in usual text with lead(cid:7) appear(cid:8)
`ing at most  characters before project(cid:7)(cid:4) a link
`within a distance of no more than  characters is
`assumed to be a possible link to a Web document
`listing projects(cid:9) For example(cid:4) a line of an HTML(cid:8)
`page may contain the text(cid:17) Currently(cid:4) I(cid:21)m leading
`a project called Artemis(cid:7)(cid:9) If(cid:4) following Artemis(cid:4)
`there is a link to the home page of this project(cid:4)
`the extraction pattern would cause the agent to
`extract this link and search the resulting Web doc(cid:8)
`ument for project information(cid:9)
`The specic values shown in the example tables
`were those which(cid:4) during testing(cid:4) led to good re(cid:8)
`sults(cid:9) We chose them by analyzing the forms of
`many WWW(cid:8)pages containing relevant informa(cid:8)
`tion and then adapting the proximity values based
`on experience(cid:9) More detailed information about
`the extraction prole and the keyword(cid:8)based ex(cid:8)
`traction process can be found in Magnanelli(cid:4) (cid:9)
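The two-keyword check of the third pattern can be sketched against the "Artemis" example above. The distance values and the (offset, url) link representation are illustrative assumptions:

```python
import re

def find_project_link(text, links, skd=30, link_dist=60):
    """links: list of (char_offset, url) for anchors in the text.
    If "lead" appears within skd characters before "project", return
    the first link starting within link_dist characters after it."""
    for m in re.finditer(r"lead", text, re.IGNORECASE):
        rest = text[m.end(): m.end() + skd]
        p = rest.lower().find("project")
        if p < 0:
            continue  # SK "project" not within SKD of "lead"
        anchor_from = m.end() + p + len("project")
        for offset, url in links:
            if anchor_from <= offset <= anchor_from + link_dist:
                return url
    return None
```

Run on the sentence "Currently, I'm leading a project called Artemis" with a link anchored at "Artemis", this would return that link for the agent to follow.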
4 Extraction of Publications

In this section, we describe pattern-based extraction by detailing how the agent extracts information about publications. We start by assuming that the agent has located a document, or part of a document, that is deemed likely to contain information on publications. The agent then looks for some form of pattern of repeated entries such as an HTML list or table structure. If the agent detects such a recurring pattern, it next tries to find the structure of the items. See Figure 2 for an example of a publication list which, with respect to
our agent operation, is ideal in terms of extracting information.

<H3> Object-Oriented Temporal Databases </H3>
<B> A. Steiner and M. C. Norrie. </B>
<I> Institute for Information Systems, ETH Zuerich. </I>
April 1997. <br>
Proc. 5th Int. Conf. on DASFAA, Melbourne, Australia
<br><br> Available files:
<a href="ftp:...c-dasfaa.abstract"> abstract </a>
<a href="ftp:...c-dasfaa.ps"> postscript </a>
<p>
<H3> New Programming Environment for Oberon </H3>
<B> J. Supcik and M. C. Norrie. </B>
<I> Institute for Information Systems, ETH Zuerich. </I>
March 1997. <br>
Proc. JMLC, Linz, Austria
<br><br> Available files:
<a href="ftp:...b-jmlc.abstract"> abstract </a>
<a href="ftp:...b-jmlc.ps"> postscript </a>
<p>

Figure 2: Part of an HTML publication list
Both entries shown contain the same structure of HTML commands. We note that this case occurs seldom, as it may be that not all entries contain the same fields. For example, a particular publication entry may contain no date or proceedings. In fact, typically, the larger a pattern is, the more likely it is that there are small differences between several entries, and our agent respects that.

Because of possible irregularities in items, we decided not to use every HTML command to define the pattern of an entry. The tag <br>, for example, never stands for a significant separation of two parts in an entry. Also, the links beginning with <a ...> should not be used, because not all publications may have referenced pages or postscript versions.
The agent first looks for the position of the name of the person in question. For example, in Figure 2, we might look for publications of which Supcik is an author. "Supcik" is found between the HTML tags <B> and </B>. The agent therefore assumes that, for every item, the names of the authors will be located in the corresponding part.

The agent next tries to extract the title of the publication. For this, we used the observation that the title occurs towards the beginning of an item, either in the first or second position. Also, usually, the title contains at least twenty characters and seldom contains commas. The agent uses these statistics to extract the titles from the entries. The associated CV will reflect the reliability that an extracted string is the title, based on whether or not these various observations occur. Thus, if a title is less than twenty characters in length, it may still be extracted, but have a lower CV.

The agent also examines every link given in an entry. These links are also stored if they appear to be of interest to the user, for example postscript files or abstracts of the topic.
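The repeated-structure detection described in this section can be sketched as follows, under the simplifying assumptions that candidate entries are separated by <p> and that entries match when their remaining tag sequences are identical; the function names and thresholds are ours, not the system's:

```python
import re

# Tags that, as the text notes, should not define an entry's pattern.
IGNORED = {"br", "a", "/a"}

def tag_signature(entry_html):
    """Sequence of significant HTML tags in one candidate entry."""
    tags = re.findall(r"<\s*(/?\w+)", entry_html)
    return tuple(t.lower() for t in tags if t.lower() not in IGNORED)

def is_publication_list(region_html, min_entries=2):
    """Detect a recurring pattern: split at <p> and require at least
    min_entries entries whose significant tag sequences all agree."""
    entries = [e for e in re.split(r"<\s*p\s*>", region_html) if e.strip()]
    sigs = {tag_signature(e) for e in entries}
    return len(entries) >= min_entries and len(sigs) == 1
```

A production version would tolerate the small per-entry differences discussed above rather than demand exact agreement; exact matching keeps the sketch short.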
Pattern-based extraction is also used to look for information on research projects. However, in this case, the ability of the agent to extract information is not as good as for publications. The main reason for this is that information given about research projects tends to be less well structured. In fact, it is often given as free text, without any heading or page name to indicate that it is indeed information about research topics or projects.
5 Confidence Values

Having explained how information is extracted from Web documents, we now describe how the agent determines the reliability of this information. As stated previously, each extraction pattern has an associated CV which gives a measure of the reliability of an extracted information item in isolation. To compute an overall confidence measure, the agent must consider the context in which the information was found and also the occurrence of any corroborating or conflicting information.

We refer to the CVs associated with extraction patterns as conditional possibilities, i.e.

    Cf|k = the possibility that fact f occurs given a keyword k
The idea of CVs is adapted from certainty factors as defined in [Buchanan and Shortliffe, 1984]. The main difference between those values is the range. Certainty factors normally range from -1 (complete uncertainty) to +1 (complete certainty). Our CVs range from 0 to infinity, because there exists no complete certainty as to whether a fact found is reliable. We let the user set a threshold which indicates the CV that a fact has to reach in order to seem reliable to the user.

Mathematically, there is no complete certainty but, in practice, we found many patterns which always led to reliable facts. Therefore, we decided to use percentage values for the CVs to indicate the reliability of the extraction pattern in terms of the number of cases in which the pattern leads to correct facts. Thus, a value of 50 percent means that in only half of all cases is a fact extracted using that pattern, in isolation, considered reliable.
It is, however, not sufficient to consider only the effectiveness of a given extraction pattern in calculating the CV of a fact. For example, it may indeed be the case that an extraction pattern leads correctly to a phone number, but that we have a low confidence that the page being analyzed contains information about the person in question. Thus, the CV of a fact also depends on the CV associated with the context. Each page is therefore assigned a CV that indicates how likely it is that facts on this page belong to the processed person:

    Cp = the possibility that page p contains useful information on the focused person
An initial page obtained from a URL stored in the database is allocated a CV of 100, indicating that it is certain that this page contains information about the person in question.

To obtain the final CV for an occurrence of a fact, we multiply the CV of the associated keyword extraction pattern by that of the page in which it was found. Of course, the fact found can also be a new page to process further:

    Cf = Cp · Cf|k
At the end of the search of all pages concerning one person, the same fact may be found more than once. In this case, the information is considered more reliable, and as a CV for that fact we take the sum of the CVs of all equal facts found:

    Cf = Σ_i Cf_i

Note that, with this rule, it is possible to get CVs above 100 percent.

Also, it is possible that similar, but unequal, facts may be found. In such a case, according to the similarity, we effectively merge the similar facts by selecting that with the highest CV:

    Cf = max_i Cf_i
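Taken together, the three rules can be sketched as follows. The percentage scaling and the grouping of values are our reading of the formulas above; handling of merely similar (rather than equal) values is simplified to exact grouping:

```python
from collections import defaultdict

def combine_cvs(occurrences):
    """occurrences: list of (value, pattern_cv, page_cv) for one fact,
    with CVs in percent. Equal values corroborate each other (sum);
    among the remaining distinct values, the highest total wins (max).
    Returns (best_value, cv)."""
    totals = defaultdict(float)
    for value, pattern_cv, page_cv in occurrences:
        # Cf = Cp * Cf|k, scaled so a page CV of 100 leaves Cf|k unchanged
        totals[value] += pattern_cv * page_cv / 100.0
    # Cf = max_i Cf_i over the competing candidate values
    return max(totals.items(), key=lambda kv: kv[1])
```

A phone number found twice, once on a fully trusted page and once on a half-trusted one, accumulates a CV above 100 percent, exactly as the summation rule allows.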
It can also be the case that certain facts are dependent on other facts. For example, the title of a publication and the title of an abstract associated with that publication should be the same. What happens if the extraction process yields unequal values? If the two values are similar, for example, the title of the abstract is a substring of that of the publication, the agent will assume that the title of the publication is the correct title and also associate that with the abstract. If the CV associated with the publication title is lower than that of the extracted abstract title, then the CV of the abstract title will also be updated. We introduce this example to show that the calculation of CVs where similar, but unequal, facts are found can be quite complicated and depends on many factors, including the context of the search and the type of the facts being extracted. It is beyond the scope of this paper to present all details of the confidence rules for ACADEMIA; however, these are fully discussed in [Magnanelli].

It is important to point out that we consider the information on the Web not only as free to use but also as true and up to date. The agent is unable to detect that information is wrong in the case that the correct information is not available.
6 Conclusions and Further Work

This work showed the important role that autonomous agents may take in the future in helping users to benefit from the wealth of information on the Internet without requiring them to invest vast amounts of time and effort. Our experiments with ACADEMIA showed that it can handle a large amount of data in comparatively little time. In a test set of … persons, it extracted approximately … facts, of which more than … percent were correct. This involved the agent searching about … WWW pages, a total of more than … megabytes.

In the future, we want to focus on improving the general operation of the agent in two ways: reducing the search time through improved caching techniques for retrieved WWW pages, and improving the reliability of information extracted by means of learning techniques. Additionally, we are currently generalizing the system to further support the rapid development of other application systems through dynamic configuration.
References

[Armstrong et al., 1995] R. Armstrong, D. Freitag, T. Joachims, and T. Mitchell. WebWatcher: A learning apprentice for the World Wide Web.