`
`Fifteenth National Conference on
`
`Artificial Intelligence ,(AAAI—98)
`
`Tenth Conference on Innovative Applications
`of Artificial Intelligence (IAAI—98)
`
`I
`AAAI PRESS / THE MIT PRESS
`Menlo Park, California 9 Cambridge, Massachusetts ° London, England
`
`(cid:51)(cid:79)(cid:68)(cid:76)(cid:71)(cid:3)(cid:20)(cid:19)(cid:19)(cid:23)
`Plaid 1004 g
`
`
`
`
`
`This material may be protected by Copyright law (Title 17 U.S. Code)
`
`
`
`of the many other relevant sites include the NATO site,
`which lists the NATO member countries, as shown in
`Figure 2, and the World Governments site, which lists
`the head of state and other government ofiicers for each
`country (not shown due to space limitations). Consider
`queries such as “What NATO countries have popula-
`tions less than 10 million?”
`and “List the heads of
`state of all the countries in the Middle East”. Since
`these queries span multiple countries and require com-
`bining information from multiple sources, answering
`them by hand is time consuming. Ariadne allows us to
`rapidly put together a new application that can answer
`a wide range of queries by extracting and integrating
`data from prespecified web sources.
`
`Qficuwzéfilfisv
`
`Figure 2: NATO members page
`
`In the following section we describe our basic ap-
`proach to query planning, where a unifying domain
`model
`is used to tie together multiple information
`sources. We then describe the details of our model-
`
`ing approach: how we represent and query individual
`web pages, how we represent the relationships among
`multiple pages in a single site, and how we integrate
`data that spans multiple sites. In each section, we also
`describe the AI methods that are used in modeling
`and query processing, and how the uniform represen-
`tational scheme supports these methods.
`
`Approach to Information Integration
`
`Ariadne’s approach to information integration is based
`heavily on the SIMS mediator architecture (Arens et
`al. 1996; Knoblock 1995). SIMS enables users to ob-
`tain information from multiple heterogeneous informa-
`tion sources. The framework consists of two parts: 1)
`a query planner/executor that determines how to ef-
`ficiently process a query given the set of available in-
`formation sources and 2) wrappers that provide uni-
`form access to the information sources so that they
`can be queried as if they were SQL databases. The
`SIMS framework was designed with specific types of
`information sources in mind, primarily databases and
`knowledge bases (and to some extent programs), but
`as we will explain, the approach can be extended to
`handle web sources.
`
`One of the most important ideas underlying SIMS
`
`is that for each application there is a unifying domain
`model that provides a single ontology for the appli-
`cation. The domain model_is represented using the
`Loom knowledge representation system (MacGregor
`1988) and is used to describe the contents of each infor-
`mation source. Given a query in terms of the domain
`model, the system dynamically selects an appropriate
`set of sources and then generates a plan to efliciently
`produce the requested data.
`To illustrate this, let us first suppose that the in-
`formation in the three web sites described earlier, the
`CIA World Factbook, the World Governments site, and
`the NATO members page, are each available in three
`separate databases, along with a fourth database con-
`taining a map for each country. To define a new infor-
`mation agent, one would first define a domain model
`that contains the set of terms that the user might want
`to query about. An example domain model is shown
`in Figure 3. The model contains four classes with
`some relations between them, e.g., ‘NATO Country’
`is a subclass of ‘Country’, and ‘Country’ has a relation
`called ‘Head—of—State’ which points to a class with the
`same name. We then use the domain model to describe
`
`each of the individual information sources. This pro-
`vides the glue for answering queries that span multiple
`sources. For example, the figure shows that the CIA
`factbook is a source for information about Countries,
`and the World Governments database is a source for
`Heads of State. Each class has a set of attributes (e.g.,
`total area, latitude, population, etc.) which may be
`available from one or more sources.
`
`CIA factbook database
`country nm
`total area
`latitude
`longitude
`population
`etc
`
`Map database
`C0‘-“my Hm
`map
`
`NATO Countries
`database
`country nm
`
`
`
`person nm
`title
`country nm
`World Governments
`database
`
`Figure 3: Domain Model with Database Sources
`
`Query Processing
`Queries are presented to the system in terms of the
`domain model. For example, a query might be “List
`the heads of state of all the countries whose popula-
`tion is less than ten million.”3 The system then de-
`composes the query into subqueries on the individual
`sources, such as the World Governments and Factbook
`
`3In actuality, queries are phrased in the Loom KR lan-
`guage, the same language used to express the domain the-
`ory. We use English translations for clarity.
`
`212 Modeling the Web
`
`
`
`
`
`sources, producing a partially—ordered query plan con-
`sisting of a series of relational operators,
`i.e., joins,
`selects, projects, remote subqueries, etc.
`The SIMS query planner
`(Knoblock 1995) was
`designed primarily for database applications, but
`database applications typically involve only a small
`number of databases, while web applications can in-
`volve accessing many more sources. Since the SIMS
`planner did not scale well to large numbers of sources,
`we developed an approach capable of efficiently con-
`structing very large query plans. We addressed this
`problem by combining preprocessing techniques with
`a local—search method for query planning.
`In Ariadne, query processing is broken into a pre-
`processing phase and a query planning phase.
`In the
`first phase the system determines the possible ways
`of combining the available sources to answer a query.
`Since sources may be overlapping (i.e., an attribute
`may be available from several sources) or replicated,
`the system must determine an appropriate combina-
`tion of sources that can answer the query. The Ari-
`adne source selection algorithm CAmbite et al. 1998)
`preprocesses the domain model so that the system can
`efficiently and dynamically select sources based on the
`classes and attributes mentioned in the query.
`In the second phase, Ariadne generates a plan us-
`ing a method called Planning—by—Rewriting, developed
`by Ambite and Knoblock (Ambite and Knoblock 1997;
`1998). This approach takes an initial, suboptimal plan
`and then attempts to improve it by applying rewrit-
`ing rules.
`In the case of query planning, producing
`an initial, suboptimal plan is straightforward; we can
`generate an initial plan in O(n) time, where 72 is the
`length of the query, based on a depth-first parse of the
`query. The rewriting process iteratively improves the
`query via a local search process that can change both
`the sources used to answer a query and the order of
`the operations on the data.
`Consider
`the processing required to retrieve all
`NATO countries that have a population of less than
`10 million. Using the domain model in Figure 3, the
`source selection step would determine that the NATO
`source is the only source for the class of NATO coun-
`tries. This source provides only the names of the
`NATO countries and not their populations. However,
`the population information can be extracted from the
`Factbook source since it provides data for a superclass
`of NATO countries. The reasoning to combine sources
`in this way is done efficiently by precomputing the way
`sources can be combined before any queries are pro-
`cessed.
`
`The next step in the example is to construct a plan
`for efficiently retrieving and combining the data.
`In
`this case,
`the system might first construct an initial
`plan that retrieves the data from the NATO source,
`separately retrieves the names and population for all
`countries from the Factbook source, and then combines
`the data locally, which is very costly since the Factbook
`source is quite large. This initial, suboptimal plan is
`then improved in a series of rewriting steps that would
`
`order the retrieval of the NATO source before the fact-
`book source so that only population data on the NATO
`countries would need to be retrieved. The optimized
`plan would then be executed, returning only Denmark,
`which has a population of just over 5 million.
`Because Ariadne combines an efficient source selec-
`tion algorithm with an efficient, anytime planning al-
`gorithm, the system can produce query plans for web
`environments in a robust, efficient manner. Ariadne’s
`development was aided by the fact that the relational
`algebra is very simple and well understood, so that we
`could concentrate on the issues involved in searching
`for a plan, rather than on the underlying plan repre-
`sentation, which simply consists of a partially ordered
`set of relational operators. To move to the Web, we
`only needed one extension to the basic representation,
`which was the inclusion of “binding patterns” (Kwok
`and Weld 1996). That is, unlike database sources,
`web sources may have input/output constraints (e.g., a
`stock quote server requires a ticker symbol in order to
`retrieve a stock quote). This is _a small extension that
`is naturally handled by the source selection algorithm
`and planning operators.
`In the remainder of the paper we consider in more
`detail
`the modeling issues involved in creating a
`database—like view of the Web.
`
`Modeling the Information on a Page
`
`The previous section describes how the planner decom-
`poses a complex query into simple queries on individ-
`ual information sources. To treat a web page as an
`information source so that it can be queried, Ariadne
`needs a wrapper that can extract and return the re-
`quested information from that type of page. While we
`cannot currently create such wrappers for unrestricted
`natural language texts, many information sources on
`the Web are semistructured. A web page is semistruc-
`tured if information on the page can be located using a
`concise formal grammar, such as a context—free gram-
`mar. Given such a grammar, the information can be
`extracted from the source without recourse to sophisti-
`cated natural language understanding techniques. For
`example, a wrapper for pages in the CIA factbook
`would be able to extract fields such as the Total Area,
`Population, etc. based on a simple grammar describing
`the structure of factbook pages.
`Our goal is to enable application developers to easily
`create their own wrappers for web—based information
`sources. To construct a wrapper, we need both a se-
`mantic model of the source that describes the fields
`available on that type of page and a syntactic model,
`or grammar, that describes the page format, so the
`fields can be extracted. Requiring developers to de-
`scribe the syntactic structure of a web page by writ-
`ing a grammar by hand is too demanding, since we
`want
`to make it easy for relatively unsophisticated
`users to develop applications. Instead, Ariadne has a
`“demonstration—oriented user interface” (DoUI) where
`users show the system what
`information to extract
`
`Automated Reasoning
`
`21 3
`
`
`
`from example pages. Underlying the interface is a ma-
`chine learning system for inducing grammar rules.
`Figure 4 shows how an application developer uses
`the interface to teach the system about CIA factbook
`pages, producing both a semantic model and a syntac-
`tic model of the source. The screen is divided into two
`
`parts. The upper half shows an example document, in
`this case the Netherlands page. The lower half shows
`a semantic model, which the user is in the midst of
`constructing for this page. The semantic model in the
`figure indicates that the class Country has attributes
`such as Total Area, Coastline, Latitude, Longitude,
`etc. The user constructs the semantic model incremen-
`
`tally, by typing in each attribute name and then filling
`in the appropriate value by cutting and pasting the
`information from the document. In doing so, the user
`actually accomplishes two functions. First, he provides
`a name for each attribute. Notice that he can choose
`
`the same names as used in the document (e.g., “To-
`tal area”) or he can choose new/ different names (e.g.,
`“Latitude”). As we will explain later,
`the attribute
`names have significance, since they are the basis for
`integrating data across sources. '
`
`Nelsco - 3: ‘me world Fuclbool? - n e mi Nelherlands
`'
`
`E}
`
`
`
`Geography
`Map references: Europe
`Area:
`Iota/an-.1.'37,330 :k; km
`Inn;/an9a:33,920 sq km
`aongpatnliwe nnaa.'sligh11y less than twice the size of New Jersey
`Coastline: 451 krn
`Geognphic coordinates: 52 30 N,
`Land boundaries:
`I012/:1 ,027 km
`bo1‘¢#vI'counn1'as'.'Bel "um 450 km Ge
`Total Area: 37,330 sq km
`Coastline: 451 km
`Latitude: 52 30 N
`Longitude:
`
`Cut and Paste
`
`Figure 4: Creating a Wrapper by Demonstration
`
`The second function achieved by the user’s demon-
`stration is to provide examples so that the system can
`induce the syntactic structure of the page. Ideally, af-
`ter the user has picked out a few examples for each
`field, the system will induce a grammar sufficient for
`extracting the required information for all pages of
`this type. Unfortunately, grammar induction methods
`may require many examples, depending on the class of
`grammars being learned. However, we have observed
`that web pages have common characteristics that We
`can take advantage of, so that a class of grammars suf-
`ficient for extraction purposes can be rapidly learned
`in practice.
`More specifically, we can describe most semistruc-
`tured web pages as embedded catalogs. A catalog is
`either a homogeneous list, such as a list of numbers,
`(1,3,5,7,8), or a heterogeneous tuple, such as a 3-
`tuple consisting of a number, a letter, and a string,
`(1,A,“test”).- An embedded catalog is a catalog where
`
`214 Modeling the Web
`
`the items themselves can be catalogs. As an exam-
`ple, consider a CIA factbook page (see Figure 1). The
`top level consists of an 8-tuple distinguished by sec-
`tion headings: Geography, People, etc. The Geogra-
`phy section is a tuple consisting of Map References,
`Area, Coastline, etc. These can be decomposed fur-
`ther if necessary; Coastline is a tuple consisting of a
`number and the string “km”.
`Because web pages are intended to be human read-
`able, special markers often play a role identifying the
`beginning or ending of an item in an embedded cat-
`alog, separating items in a homogeneous list, and so
`on. These distinguishing markers can be used as land-
`marks for locating information on a page. For instance,
`to find the longitude, simply skip down to the heading
`“Geography”, then to “Geographic Coordinatesz”, and
`then skip past the first comma.
`A landmark grammar describes the position of a field
`via a sequence oflandmarks, where each landmark is it-
`self described by a deterministic finite automaton. Our
`recent work (Muslea et a1. 1998) shows that in prac-
`tice, a subclass of landmark grammars (linear, aug-
`mented landmark grammars) can be learned rapidly
`for a variety of web pages using a greedy covering al-
`gorithm. There are several reasons for this. Firstly,
`because web pages are intended to be human readable,
`there is often a single landmark that distinguishes or
`separates each field from its neighbors. Therefore the
`number of landmarks for a field in an embedded cata-
`
`log will generally be equal to its “depth” in the catalog.
`Since most catalogs are very shallow, this means that
`the length of the grammar rules to be learned will be
`very small, and learning will be easy in practice. Sec-
`ondly, during the demonstration process, users traverse
`a page from top-to-bottom, picking out the positive ex-
`amples of each field. Any position on the page that is
`not marked as a positive example is implicitly a nega-
`tive example. Thus, for every positive example identi-
`fied by the user, we obtain a huge number of negative
`examples that the covering algorithm can use to focus
`its search.
`
`The modeling tool we have described enables un-
`sophisticated users to turn web pages into relational
`information sources. But it has a second advantage as
`well. If the format of a web source changes in minor
`respects, the system could induce a new grammar by
`reusing examples from the original learning episode,
`without any human intervention (assuming the under-
`lying content has not changed significantly). This is a
`capability we are currently exploring.
`
`Modeling the Information in a Site:
`Connections between Pages
`
`The previous section showed how Ariadne extracts in-
`formation from a web page to answer a query. How-
`ever, before extracting information from a page, Ari-
`adne must first locate the page in question. Our ap-
`proach, described in this section, is to model the infor-
`mation required to “navigate” through a web site, so
`
`
`
` each wrapper (shown by the small directional arrows in
`
`Figure 6). The country page wrapper takes a country-
`URL, and acts as a source for “total area”, “popula-
`tion”, “latitude”, etc. The index wrapper takes a coun-
`try name5 and acts as a source for “country-URL”.
`Given the domain model and the binding constraints,
`the system can now construct query plans. For in-
`stance, to obtain the population of a country given its
`name,
`the planner determines that the system must
`first use the country name to retrieve the country-
`URL from the index page wrapper, and then use the
`country—URL to retrieve the population data from the
`country page wrapper.
`
`Factbook country page wrapper
`country url—%
`total area 6
`latitude <6
`longitude %
`population <m
`
`Factbook index wrapper
`country nm —.
`/
`country url 4:‘ [
`Map database
`country nm
`map
`NATO page wra
`country nm
`
`er
`
`
`
`:
`
`country nm
`> person nm
`title
`World Governments
`page wrapper
`
`Figure 6: Domain Model with Web Sources
`
`Explicitly modeling ‘navigation’ pages, such as the
`factbook index, as information sources enables us to
`reuse the same modeling tools and planning method-
`ology underlying the rest of the system. The approach
`works well
`in part because there are only two com-
`mon types of navigation strategies used on the Web
`— direct indexing and form-based retrieval. We have
`already seen how index pages are handled; form-based
`navigation is also straightforward. A wrapper for an
`HTML form simply mimics the action of the form, tak-
`ing as input a set of attributes, each associated with
`a form parameter name, and communicating with the
`server specified in the form’s HTML source.
`When the resulting page is returned,
`the Wrapper
`extracts the relevant attributes in the resulting page.
`Imagine, for instance, a form-based front end to the
`factbook, Where the user types in a country name and
`the form returns the requested country page. To create
`a wrapper for this front end, the developer would first
`specify that the parameter associated with the type-
`in box would be filled by a “country-nm”. He would
`then specify how the system should extract information
`from the page returned by the form using the approach
`described in the last section.
`The Factbook example described in this section il-
`lustrates our basic approach to modeling navigation
`pages. Many web sites are more complex than the
`factbook. The approach still works, but the models
`
`5No URL is needed as input to the index page wrapper
`since the URL of the index page is a constant.
`
`Automated Reasoning
`
`215
`
`that the planner can automatically determine how to
`locate a page.
`For example, consider a query to our example infor-
`mation agent asking for the population of the Nether-
`lands. To extract the population from the factbook’s
`page on the Netherlands,
`the system must first find
`the URL for that page. A person faced with the same
`task would look at the index page for the factbook,
`shown in Figure 5, which lists each country by name to-
`gether with a hypertext link to the page in question. In
`our approach, Ariadne does essentially the same thing.
`The index page serves as an information source that
`provides a URL for each country page. These pages in
`turn serve as a source for country~specific information.
`
`
`e_>fghLrIis;m(3£I_<§).
`itfllili).
`:A.1a§_I1'_»1(Z2£l3_).
`American Samoa (25 KB),
` ).
`&€91a(2iIS.B}.
`é‘AIg%(;4_I§i).
`Antarctica (26 KB)
`A__nt_igua and Barbuda (2? KBI
`Arctic Ocean (15 KB}
`A_trgenu‘na (30 KB)
`An__n_en_ia_(&1S§).
`Aruba (24 KB)
`0 Ashmnre and Canier Islands (16 KB).
`
`
`0 Atlantic Ocean (16 KB}
`
`
`Figure 5: CIA Factbook Index
`
`To create a wrapper for the index page, the devel-
`oper uses the approach described in the last section,
`where we illustrated how a wrapper for the factbook’s
`country pages is created. There is only one difference:
`this wrapper only wraps a single page, the index page.
`The developer creates a semantic model indicating that
`the index page contains a list of countries, each with
`two attributes, country-nm and country-URL.4 The
`learning system induces a grammar for the entire page
`after the developer shows how the first few lines in the
`file should be parsed.
`As the wrappers for each source are developed, they
`are integrated into the unifying domain model. Fig-
`ure 6 shows the domain model for the completed geopo-
`litical agent.
`(Notice that we have substituted web
`source wrappers for the hypothetical databases used
`previously.) To create the domain model, the devel-
`oper specifies the relationship between the wrappers
`and the domain concepts. For instance, the developer
`specifies that the Factbook country wrapper and the
`Factbook index wrapper are both information sources
`for “country” information, and he identifies which at-
`tributes are keys (i.e., unique identifiers). In the exam-
`ple, “country—nm” and “country-URL” are both keys.
`Binding constraints specify the input and output of
`
`‘During the demonstration, a special copy command is
`used to obtain a URL from a hyperlink, as opposed to grab-
`bing text.
`'
`
`
`
`source to each of the other sources where a different
`
`naming scheme is used. An advantage of the Ariadne
`architecture is that the mapping itself can be repre-
`sented as simply another wrapped information source.
`One way to do this is to create a mapping table, which
`specifies for each entry in one data source what the
`equivalent entity is called in another data source. Al-
`ternatively, if the mapping is computable, it can be
`represented by a mapping function, which is a program
`that converts one form into another form.
`
`Figure 8 illustrates the role of mapping tables in our
`geopolitical information agent. The Factbook is the
`primary source for a country’s name. A mapping table
`maps each factbook country name into the name used
`in the World Governments source (i.e., WG—country-
`nm). The mapping source contains only two attributes,
`the (factbook) country name and the WG—country-nm.
`The NATO source is treated similarly. So, for exam-
`ple, if someone wanted to find the Heads of State of the
`NATO countries, the query planner would retrieve the
`NATO country names from the NATO wrapper, map
`them into (factbook) country names using the NATO
`mapping table, then into the World Government coun-
`try names using the World Governments mapping ta-
`ble, and finally retrieve the appropriate heads of state
`from the World Governments wrapper.
`
`
`
`World Governments
`mapping table
`Country nm
` WG comm-y nm
`T WG
`,,,,,;".,“,’.§”’ "m
`—> "u
`World Giiveemments
`page wrapper
`r - country nm
`‘T NATO cqunuy nm
`NATO mapping table
`
`Factbook country page wrapper
`country url—>
`total area ’
`latitude 6
`'°"E*"!d° <—
`"°"“”“'°“ 6
`Factbook index wrapp -
`country nm —*
`/
`country url <v
`Map database 8
`counuyniflli,
`NATO page wra
`er
`Nato country nnréefl
`
`[
`
`
`
`become more involved. For instance, indexes can be
`hierarchical, in which case each level of the hierarchy
`must be modeled as an information source.
`Imagine
`the top—level factbook index was a list of letters, so that
`clicking on a letter “C” would produce an index page
`for countries starting with “C” (a “subindex”). We
`would model this top level index as a relation between
`letters and subindex—URL’s. To traverse this index, we
`also need an information source that takes a country
`name and returns the first letter of the name (e.g., a
`string manipulation program). Thus, altogether four
`wrappers would be involved in the navigation process,
`as shown in Figure 7. Given a query asking for the
`Netherlands’ population, the first wrapper would take
`the name “Netherlands”, call the string manipulation
`program, and return the first letter of the name, “N”.
`The second wrapper would take the letter “N”, ac-
`cess the top level index page, and return the subindex-
`URL. The third wrapper would take the subindex-URL
`and the country name, access the subindex page for
`countries starting with “N”, and return the country-
`URL. Finally, the last wrapper would take the country-
`URL and access the Netherlands page. The advan-
`tage of our approach is that all these wrappers are
`treated uniformly as information sources, so the query
`planner can automatically determine how to compose
`the query plan. Furthermore,
`the wrappers can be
`semi—automatically created via the learning approach
`described earlier, except for the string manipulation
`wrapper, which is a common utility.
`
`Factbook country page wrapper
`country url —e—
`total area <—
`latitude %
`longitude %
`population (j
`
`String Manipulation
`program wrapper
`% country nm
`% first letter
`
`SublndexWrapper Top level indexwrapper
`
`subindex ur1T>
`country nm%
`country url%
`
`/
`
`\
`
`% country nm
`% first letter
`> subindex url
`
`Figure 7: Domain Model with Hierarchical Index
`
`Modeling Information Across Sites
`Within a single site, entities (e.g., people, places, coun-
`tries, companies, etc.) are usually named in a consis-
`tent fashion. However, across sites, the same entities
`may be referred to with different names. For example,
`the CIA factbook refers to the “Vatican City” while
`the World Governments site refers to “The Holy See”.
`Sometimes formatting conventions are responsible for
`differences, such as “Denmark” vs. “Denmark, King-
`dom of”. To make sense of data that spans multiple
`sites, we need to be able to recognize and resolve these
`differences.
`
`Our approach is to select a primary source for an
`entity’s name and then provide a mapping from that
`
`216 Modeling the Web
`
`Figure 8: Domain Model with Mapping Tables
`
`Currently, mapping tables and functions must be
`created manually, but we are developing a semi-
`automated method for building mapping tables and
`functions by analyzing the underlying data in advance.
`The basic idea is to use information retrieval
`tech-
`
`(Cohen
`niques to provide an initial mapping (e.g.,
`1998)), and then use additional data in the sources to
`resolve any remaining ambiguities via statistical learn-
`ing methods (e.g., (Huang and Russell 1997)). For ex-
`ample, both the Factbook and the World Governments
`sources list the title of the Heads of State.” This infor-
`mation can help determine that the Factbook’s “North
`Korea” and “South Korea” refer respectively to the
`World Government’s “Democratic Republic of Korea”
`
`“The Factbook lists the name of the Heads of State as
`well but, unlike the VVorld Governments site, the informa-
`tion is often out of date. This is one reason why the World
`Governments site is useful.
`
`
`
`
`
`and “Republic of Korea”, rather than the other way
`around. Our approach can also be; used to. automat-
`ically update mapping tables when I1€W'S.0.ll:E(Z€S
`are
`released. For instance, each year a new version of the
`CIA factbook is released, and sometimes'<zo_u"vntries have
`new names, or countries merge or split. These name
`confusions can often be resolved using geographical in-
`formation (e.g., land area, latitude and longitude).
`
`Applications
`
`Below we list some Ariadne applications we are devel-
`oping, illustrating the generality of our approach:
`
`World-Wide Geographic Information Server:
`We are collaborating with another group that is build-
`ing a geographic information system that integrates
`a variety of map—based information sources. These
`sources include satellite images, detailed street maps,
`parcel data, historical aerial photographs, etc. We
`are using Ariadne to extract geographically referenced
`data from the Web and integrate it with map data.
`We have built a system using‘ Ariadne that extracts
`restaurant data from the Zagats Restaurant Reviews
`site, feeds the restaurant address into a geocoder, and
`then places the restaurant on an aerial map. Other
`web sources that we plan to incorporate include cen-
`sus data, US Geological Survey data, and real estate
`data from the Multiple Listing Service.
`
`Electronic Catalog Access: We are applying Ari-
`adne to provide access to online electronic catalogs for
`the Defense Logistics Agency. One implemented appli-
`cation provides real-time access to pricing and avail-
`ability data from the General Services Administration
`web pages. This application accesses only a single site,
`but retrieves pricing data for parts by extracting and
`integrating data from multiple pages in the site.
`
`Financial Information Agent: We have done ini-
`tial work on an agent that accesses stock quote servers,
`stock exchange sources,
`and the SEC’s EDGAR
`Archives (which contains copies of financial filings,
`such as annual reports, by publicly traded companies
`and mutual funds). By integrating these sources, the
`agent could answer queries such as “Find all airline
`companies whose stock has risen more than thirty per-
`cent in the last year” and “Find all people who serve
`as directors of two or more companies located in Los
`Angeles”. Using the modeling tools described earlier,
`users could also include their own personal financial
`data sources, tailoring the system to their needs.
`
`Related Work
`
`There is large body of relevant literature on infor-
`mation integration (Wiederliold 1996), but the most
`closely related work focuses specifically on the prob-
`lems of information integration on the Web, such as In-
`formation Manifold (Levy et al. 1996), Occam (Kwok
`and Weld 1996), Infomaster (Genesereth et a1. 1997),
`
`and InfoSleuth (Bayardo Jr. et al. 1997). These sys-
`tems focus on a variety of issues, including the prob-
`lems of representing and selecting a relevant set of
`sources to answer a query, handling binding patterns,
`and resolving discrepancies among sources. All of this
`work is directly relevant to Ariadne, but the issue ad-
`dressed in this paper that has not been addressed pre-
`viously is how one represents the information within a
`single page, across pages at a site, and across sites to
`support web-based information integration.
`Another closely related body of work is on the ex-
`traction of data from web sources (Hammer et al. 1997;
`Doorenbos et al. 41997; Kushmerick 1997). The fo-
`cus of all of these systems are on building wrappers
`for semi-structured sources. The systems either take a
`template—based specification of a source, as in (Ham-
`mer et a1. 1997), or learn the structure of the source by
`example and then compile a wrapper that provides ac-
`cess to the source, as in (Kushmerick 1997). Our work
`on inducing wrappers takes the latter approach. The
`induction method is not only very general, but is also
`integrated into the larger Ariadne development system
`so that the learned wrappers can be used directly by
`the query planner.
`
`Discussion
`
`There are many examples of impressive AI systems
`based on relatively simple representational schemes. In
`the realm of planning, recent examples include SAT-
`plan (Kautz and Selman 1996) and Graph—plan (Blum
`and Furst 1995); the former employs a propositional
`CSP approach, the latter, a graph—based search.
`In
`machine learning, propositional learning schemes (e.g.,
`decision trees) have been dominant. Though it is often
`difficult to understand exactly what a simple represen-
`tational scheme buys you computationally, one thing
`seems clear: systems with simple representations are
`often easier to design and understand.
`in terms of
`We believe that Ariadne is successful,
`the broad applicability of the approach, because it
`combines a simple representation scheme with sophisti-
`cated modeling tools that map web information sources
`into this simple representation. Ariadne capitalizes on
`a representation scheme adopted from database sys-
`tems, where the world consists of a set of relations (or
`tables) over objects, and simple relational operators
`(retrieve, join, etc.) are composed to answer queries.
`This representation makes it straightforward to inte-
`grate multiple databases using an AI planner. Ari-
`adne’s planner can efficiently search for a sequence of
`joins, selections, etc. that will produce the desired re-
`sult without needing to do any sophisticated reasoning
`about the information sources themselves.
`than the
`The Web environment
`is much richer
`database world, of course. What makes Ariadne possi-
`ble are the modeling tools that enable a user to cre