throbber
Proceedings W W
`
`Fifteenth National Conference on
`
`Artificial Intelligence ,(AAAI—98)
`
`Tenth Conference on Innovative Applications
`of Artificial Intelligence (IAAI—98)
`
`I
`AAAI PRESS / THE MIT PRESS
`Menlo Park, California 9 Cambridge, Massachusetts ° London, England
`
`(cid:51)(cid:79)(cid:68)(cid:76)(cid:71)(cid:3)(cid:20)(cid:19)(cid:19)(cid:23)
`Plaid 1004 g
`
`

`
`

`
`This material may be protected by Copyright law (Title 17 U.S. Code)
`
`

`
`of the many other relevant sites include the NATO site,
`which lists the NATO member countries, as shown in
`Figure 2, and the World Governments site, which lists
`the head of state and other government ofiicers for each
`country (not shown due to space limitations). Consider
`queries such as “What NATO countries have popula-
`tions less than 10 million?”
`and “List the heads of
`state of all the countries in the Middle East”. Since
`these queries span multiple countries and require com-
`bining information from multiple sources, answering
`them by hand is time consuming. Ariadne allows us to
`rapidly put together a new application that can answer
`a wide range of queries by extracting and integrating
`data from prespecified web sources.
`
`Qficuwzéfilfisv
`
`Figure 2: NATO members page
`
`In the following section we describe our basic ap-
`proach to query planning, where a unifying domain
`model
`is used to tie together multiple information
`sources. We then describe the details of our model-
`
`ing approach: how we represent and query individual
`web pages, how we represent the relationships among
`multiple pages in a single site, and how we integrate
`data that spans multiple sites. In each section, we also
`describe the AI methods that are used in modeling
`and query processing, and how the uniform represen-
`tational scheme supports these methods.
`
`Approach to Information Integration
`
`Ariadne’s approach to information integration is based
`heavily on the SIMS mediator architecture (Arens et
`al. 1996; Knoblock 1995). SIMS enables users to ob-
`tain information from multiple heterogeneous informa-
`tion sources. The framework consists of two parts: 1)
`a query planner/executor that determines how to ef-
`ficiently process a query given the set of available in-
`formation sources and 2) wrappers that provide uni-
`form access to the information sources so that they
`can be queried as if they were SQL databases. The
`SIMS framework was designed with specific types of
`information sources in mind, primarily databases and
`knowledge bases (and to some extent programs), but
`as we will explain, the approach can be extended to
`handle web sources.
`
`One of the most important ideas underlying SIMS
`
`is that for each application there is a unifying domain
`model that provides a single ontology for the appli-
`cation. The domain model_is represented using the
`Loom knowledge representation system (MacGregor
`1988) and is used to describe the contents of each infor-
`mation source. Given a query in terms of the domain
`model, the system dynamically selects an appropriate
`set of sources and then generates a plan to efliciently
`produce the requested data.
`To illustrate this, let us first suppose that the in-
`formation in the three web sites described earlier, the
`CIA World Factbook, the World Governments site, and
`the NATO members page, are each available in three
`separate databases, along with a fourth database con-
`taining a map for each country. To define a new infor-
`mation agent, one would first define a domain model
`that contains the set of terms that the user might want
`to query about. An example domain model is shown
`in Figure 3. The model contains four classes with
`some relations between them, e.g., ‘NATO Country’
`is a subclass of ‘Country’, and ‘Country’ has a relation
`called ‘Head—of—State’ which points to a class with the
`same name. We then use the domain model to describe
`
`each of the individual information sources. This pro-
`vides the glue for answering queries that span multiple
`sources. For example, the figure shows that the CIA
`factbook is a source for information about Countries,
`and the World Governments database is a source for
`Heads of State. Each class has a set of attributes (e.g.,
`total area, latitude, population, etc.) which may be
`available from one or more sources.
`
`CIA factbook database
`country nm
`total area
`latitude
`longitude
`population
`etc
`
`Map database
`C0‘-“my Hm
`map
`
`NATO Countries
`database
`country nm
`
`
`
`person nm
`title
`country nm
`World Governments
`database
`
`Figure 3: Domain Model with Database Sources
`
`Query Processing
`Queries are presented to the system in terms of the
`domain model. For example, a query might be “List
`the heads of state of all the countries whose popula-
`tion is less than ten million.”3 The system then de-
`composes the query into subqueries on the individual
`sources, such as the World Governments and Factbook
`
`3In actuality, queries are phrased in the Loom KR lan-
`guage, the same language used to express the domain the-
`ory. We use English translations for clarity.
`
`212 Modeling the Web
`
`
`
`

`
`sources, producing a partially—ordered query plan con-
`sisting of a series of relational operators,
`i.e., joins,
`selects, projects, remote subqueries, etc.
`The SIMS query planner
`(Knoblock 1995) was
`designed primarily for database applications, but
`database applications typically involve only a small
`number of databases, while web applications can in-
`volve accessing many more sources. Since the SIMS
`planner did not scale well to large numbers of sources,
`we developed an approach capable of efficiently con-
`structing very large query plans. We addressed this
`problem by combining preprocessing techniques with
`a local—search method for query planning.
`In Ariadne, query processing is broken into a pre-
`processing phase and a query planning phase.
`In the
`first phase the system determines the possible ways
`of combining the available sources to answer a query.
`Since sources may be overlapping (i.e., an attribute
`may be available from several sources) or replicated,
`the system must determine an appropriate combina-
`tion of sources that can answer the query. The Ari-
`adne source selection algorithm CAmbite et al. 1998)
`preprocesses the domain model so that the system can
`efficiently and dynamically select sources based on the
`classes and attributes mentioned in the query.
`In the second phase, Ariadne generates a plan us-
`ing a method called Planning—by—Rewriting, developed
`by Ambite and Knoblock (Ambite and Knoblock 1997;
`1998). This approach takes an initial, suboptimal plan
`and then attempts to improve it by applying rewrit-
`ing rules.
`In the case of query planning, producing
`an initial, suboptimal plan is straightforward; we can
`generate an initial plan in O(n) time, where 72 is the
`length of the query, based on a depth-first parse of the
`query. The rewriting process iteratively improves the
`query via a local search process that can change both
`the sources used to answer a query and the order of
`the operations on the data.
`Consider
`the processing required to retrieve all
`NATO countries that have a population of less than
`10 million. Using the domain model in Figure 3, the
`source selection step would determine that the NATO
`source is the only source for the class of NATO coun-
`tries. This source provides only the names of the
`NATO countries and not their populations. However,
`the population information can be extracted from the
`Factbook source since it provides data for a superclass
`of NATO countries. The reasoning to combine sources
`in this way is done efficiently by precomputing the way
`sources can be combined before any queries are pro-
`cessed.
`
`The next step in the example is to construct a plan
`for efficiently retrieving and combining the data.
`In
`this case,
`the system might first construct an initial
`plan that retrieves the data from the NATO source,
`separately retrieves the names and population for all
`countries from the Factbook source, and then combines
`the data locally, which is very costly since the Factbook
`source is quite large. This initial, suboptimal plan is
`then improved in a series of rewriting steps that would
`
`order the retrieval of the NATO source before the fact-
`book source so that only population data on the NATO
`countries would need to be retrieved. The optimized
`plan would then be executed, returning only Denmark,
`which has a population of just over 5 million.
`Because Ariadne combines an efficient source selec-
`tion algorithm with an efficient, anytime planning al-
`gorithm, the system can produce query plans for web
`environments in a robust, efficient manner. Ariadne’s
`development was aided by the fact that the relational
`algebra is very simple and well understood, so that we
`could concentrate on the issues involved in searching
`for a plan, rather than on the underlying plan repre-
`sentation, which simply consists of a partially ordered
`set of relational operators. To move to the Web, we
`only needed one extension to the basic representation,
`which was the inclusion of “binding patterns” (Kwok
`and Weld 1996). That is, unlike database sources,
`web sources may have input/output constraints (e.g., a
`stock quote server requires a ticker symbol in order to
`retrieve a stock quote). This is _a small extension that
`is naturally handled by the source selection algorithm
`and planning operators.
`In the remainder of the paper we consider in more
`detail
`the modeling issues involved in creating a
`database—like view of the Web.
`
`Modeling the Information on a Page
`
`The previous section describes how the planner decom-
`poses a complex query into simple queries on individ-
`ual information sources. To treat a web page as an
`information source so that it can be queried, Ariadne
`needs a wrapper that can extract and return the re-
`quested information from that type of page. While we
`cannot currently create such wrappers for unrestricted
`natural language texts, many information sources on
`the Web are semistructured. A web page is semistruc-
`tured if information on the page can be located using a
`concise formal grammar, such as a context—free gram-
`mar. Given such a grammar, the information can be
`extracted from the source without recourse to sophisti-
`cated natural language understanding techniques. For
`example, a wrapper for pages in the CIA factbook
`would be able to extract fields such as the Total Area,
`Population, etc. based on a simple grammar describing
`the structure of factbook pages.
`Our goal is to enable application developers to easily
`create their own wrappers for web—based information
`sources. To construct a wrapper, we need both a se-
`mantic model of the source that describes the fields
`available on that type of page and a syntactic model,
`or grammar, that describes the page format, so the
`fields can be extracted. Requiring developers to de-
`scribe the syntactic structure of a web page by writ-
`ing a grammar by hand is too demanding, since we
`want
`to make it easy for relatively unsophisticated
`users to develop applications. Instead, Ariadne has a
`“demonstration—oriented user interface” (DoUI) where
`users show the system what
`information to extract
`
`Automated Reasoning
`
`21 3
`
`

`
`from example pages. Underlying the interface is a ma-
`chine learning system for inducing grammar rules.
`Figure 4 shows how an application developer uses
`the interface to teach the system about CIA factbook
`pages, producing both a semantic model and a syntac-
`tic model of the source. The screen is divided into two
`
`parts. The upper half shows an example document, in
`this case the Netherlands page. The lower half shows
`a semantic model, which the user is in the midst of
`constructing for this page. The semantic model in the
`figure indicates that the class Country has attributes
`such as Total Area, Coastline, Latitude, Longitude,
`etc. The user constructs the semantic model incremen-
`
`tally, by typing in each attribute name and then filling
`in the appropriate value by cutting and pasting the
`information from the document. In doing so, the user
`actually accomplishes two functions. First, he provides
`a name for each attribute. Notice that he can choose
`
`the same names as used in the document (e.g., “To-
`tal area”) or he can choose new/ different names (e.g.,
`“Latitude”). As we will explain later,
`the attribute
`names have significance, since they are the basis for
`integrating data across sources. '
`
`Nelsco - 3: ‘me world Fuclbool? - n e mi Nelherlands
`'
`
`E}
`
`
`
`Geography
`Map references: Europe
`Area:
`Iota/an-.1.'37,330 :k; km
`Inn;/an9a:33,920 sq km
`aongpatnliwe nnaa.'sligh11y less than twice the size of New Jersey
`Coastline: 451 krn
`Geognphic coordinates: 52 30 N,
`Land boundaries:
`I012/:1 ,027 km
`bo1‘¢#vI'counn1'as'.'Bel "um 450 km Ge
`Total Area: 37,330 sq km
`Coastline: 451 km
`Latitude: 52 30 N
`Longitude:
`
`Cut and Paste
`
`Figure 4: Creating a Wrapper by Demonstration
`
`The second function achieved by the user’s demon-
`stration is to provide examples so that the system can
`induce the syntactic structure of the page. Ideally, af-
`ter the user has picked out a few examples for each
`field, the system will induce a grammar sufficient for
`extracting the required information for all pages of
`this type. Unfortunately, grammar induction methods
`may require many examples, depending on the class of
`grammars being learned. However, we have observed
`that web pages have common characteristics that We
`can take advantage of, so that a class of grammars suf-
`ficient for extraction purposes can be rapidly learned
`in practice.
`More specifically, we can describe most semistruc-
`tured web pages as embedded catalogs. A catalog is
`either a homogeneous list, such as a list of numbers,
`(1,3,5,7,8), or a heterogeneous tuple, such as a 3-
`tuple consisting of a number, a letter, and a string,
`(1,A,“test”).- An embedded catalog is a catalog where
`
`214 Modeling the Web
`
`the items themselves can be catalogs. As an exam-
`ple, consider a CIA factbook page (see Figure 1). The
`top level consists of an 8-tuple distinguished by sec-
`tion headings: Geography, People, etc. The Geogra-
`phy section is a tuple consisting of Map References,
`Area, Coastline, etc. These can be decomposed fur-
`ther if necessary; Coastline is a tuple consisting of a
`number and the string “km”.
`Because web pages are intended to be human read-
`able, special markers often play a role identifying the
`beginning or ending of an item in an embedded cat-
`alog, separating items in a homogeneous list, and so
`on. These distinguishing markers can be used as land-
`marks for locating information on a page. For instance,
`to find the longitude, simply skip down to the heading
`“Geography”, then to “Geographic Coordinatesz”, and
`then skip past the first comma.
`A landmark grammar describes the position of a field
`via a sequence oflandmarks, where each landmark is it-
`self described by a deterministic finite automaton. Our
`recent work (Muslea et a1. 1998) shows that in prac-
`tice, a subclass of landmark grammars (linear, aug-
`mented landmark grammars) can be learned rapidly
`for a variety of web pages using a greedy covering al-
`gorithm. There are several reasons for this. Firstly,
`because web pages are intended to be human readable,
`there is often a single landmark that distinguishes or
`separates each field from its neighbors. Therefore the
`number of landmarks for a field in an embedded cata-
`
`log will generally be equal to its “depth” in the catalog.
`Since most catalogs are very shallow, this means that
`the length of the grammar rules to be learned will be
`very small, and learning will be easy in practice. Sec-
`ondly, during the demonstration process, users traverse
`a page from top-to-bottom, picking out the positive ex-
`amples of each field. Any position on the page that is
`not marked as a positive example is implicitly a nega-
`tive example. Thus, for every positive example identi-
`fied by the user, we obtain a huge number of negative
`examples that the covering algorithm can use to focus
`its search.
`
`The modeling tool we have described enables un-
`sophisticated users to turn web pages into relational
`information sources. But it has a second advantage as
`well. If the format of a web source changes in minor
`respects, the system could induce a new grammar by
`reusing examples from the original learning episode,
`without any human intervention (assuming the under-
`lying content has not changed significantly). This is a
`capability we are currently exploring.
`
`Modeling the Information in a Site:
`Connections between Pages
`
`The previous section showed how Ariadne extracts in-
`formation from a web page to answer a query. How-
`ever, before extracting information from a page, Ari-
`adne must first locate the page in question. Our ap-
`proach, described in this section, is to model the infor-
`mation required to “navigate” through a web site, so
`
`

`
` each wrapper (shown by the small directional arrows in
`
`Figure 6). The country page wrapper takes a country-
`URL, and acts as a source for “total area”, “popula-
`tion”, “latitude”, etc. The index wrapper takes a coun-
`try name5 and acts as a source for “country-URL”.
`Given the domain model and the binding constraints,
`the system can now construct query plans. For in-
`stance, to obtain the population of a country given its
`name,
`the planner determines that the system must
`first use the country name to retrieve the country-
`URL from the index page wrapper, and then use the
`country—URL to retrieve the population data from the
`country page wrapper.
`
`Factbook country page wrapper
`country url—%
`total area 6
`latitude <6
`longitude %
`population <m
`
`Factbook index wrapper
`country nm —.
`/
`country url 4:‘ [
`Map database
`country nm
`map
`NATO page wra
`country nm
`
`er
`
`
`
`:
`
`country nm
`> person nm
`title
`World Governments
`page wrapper
`
`Figure 6: Domain Model with Web Sources
`
`Explicitly modeling ‘navigation’ pages, such as the
`factbook index, as information sources enables us to
`reuse the same modeling tools and planning method-
`ology underlying the rest of the system. The approach
`works well
`in part because there are only two com-
`mon types of navigation strategies used on the Web
`— direct indexing and form-based retrieval. We have
`already seen how index pages are handled; form-based
`navigation is also straightforward. A wrapper for an
`HTML form simply mimics the action of the form, tak-
`ing as input a set of attributes, each associated with
`a form parameter name, and communicating with the
`server specified in the form’s HTML source.
`When the resulting page is returned,
`the Wrapper
`extracts the relevant attributes in the resulting page.
`Imagine, for instance, a form-based front end to the
`factbook, Where the user types in a country name and
`the form returns the requested country page. To create
`a wrapper for this front end, the developer would first
`specify that the parameter associated with the type-
`in box would be filled by a “country-nm”. He would
`then specify how the system should extract information
`from the page returned by the form using the approach
`described in the last section.
`The Factbook example described in this section il-
`lustrates our basic approach to modeling navigation
`pages. Many web sites are more complex than the
`factbook. The approach still works, but the models
`
`5No URL is needed as input to the index page wrapper
`since the URL of the index page is a constant.
`
`Automated Reasoning
`
`215
`
`that the planner can automatically determine how to
`locate a page.
`For example, consider a query to our example infor-
`mation agent asking for the population of the Nether-
`lands. To extract the population from the factbook’s
`page on the Netherlands,
`the system must first find
`the URL for that page. A person faced with the same
`task would look at the index page for the factbook,
`shown in Figure 5, which lists each country by name to-
`gether with a hypertext link to the page in question. In
`our approach, Ariadne does essentially the same thing.
`The index page serves as an information source that
`provides a URL for each country page. These pages in
`turn serve as a source for country~specific information.
`
`
`e_>fghLrIis;m(3£I_<§).
`itfllili).
`:A.1a§_I1'_»1(Z2£l3_).
`American Samoa (25 KB),
` ).
`&€91a(2iIS.B}.
`é‘AIg%(;4_I§i).
`Antarctica (26 KB)
`A__nt_igua and Barbuda (2? KBI
`Arctic Ocean (15 KB}
`A_trgenu‘na (30 KB)
`An__n_en_ia_(&1S§).
`Aruba (24 KB)
`0 Ashmnre and Canier Islands (16 KB).
`
`
`0 Atlantic Ocean (16 KB}
`
`
`Figure 5: CIA Factbook Index
`
`To create a wrapper for the index page, the devel-
`oper uses the approach described in the last section,
`where we illustrated how a wrapper for the factbook’s
`country pages is created. There is only one difference:
`this wrapper only wraps a single page, the index page.
`The developer creates a semantic model indicating that
`the index page contains a list of countries, each with
`two attributes, country-nm and country-URL.4 The
`learning system induces a grammar for the entire page
`after the developer shows how the first few lines in the
`file should be parsed.
`As the wrappers for each source are developed, they
`are integrated into the unifying domain model. Fig-
`ure 6 shows the domain model for the completed geopo-
`litical agent.
`(Notice that we have substituted web
`source wrappers for the hypothetical databases used
`previously.) To create the domain model, the devel-
`oper specifies the relationship between the wrappers
`and the domain concepts. For instance, the developer
`specifies that the Factbook country wrapper and the
`Factbook index wrapper are both information sources
`for “country” information, and he identifies which at-
`tributes are keys (i.e., unique identifiers). In the exam-
`ple, “country—nm” and “country-URL” are both keys.
`Binding constraints specify the input and output of
`
`‘During the demonstration, a special copy command is
`used to obtain a URL from a hyperlink, as opposed to grab-
`bing text.
`'
`
`

`
`source to each of the other sources where a different
`
`naming scheme is used. An advantage of the Ariadne
`architecture is that the mapping itself can be repre-
`sented as simply another wrapped information source.
`One way to do this is to create a mapping table, which
`specifies for each entry in one data source what the
`equivalent entity is called in another data source. Al-
`ternatively, if the mapping is computable, it can be
`represented by a mapping function, which is a program
`that converts one form into another form.
`
`Figure 8 illustrates the role of mapping tables in our
`geopolitical information agent. The Factbook is the
`primary source for a country’s name. A mapping table
`maps each factbook country name into the name used
`in the World Governments source (i.e., WG—country-
`nm). The mapping source contains only two attributes,
`the (factbook) country name and the WG—country-nm.
`The NATO source is treated similarly. So, for exam-
`ple, if someone wanted to find the Heads of State of the
`NATO countries, the query planner would retrieve the
`NATO country names from the NATO wrapper, map
`them into (factbook) country names using the NATO
`mapping table, then into the World Government coun-
`try names using the World Governments mapping ta-
`ble, and finally retrieve the appropriate heads of state
`from the World Governments wrapper.
`
`
`
`World Governments
`mapping table
`Country nm
` WG comm-y nm
`T WG
`,,,,,;".,“,’.§”’ "m
`—> "u
`World Giiveemments
`page wrapper
`r - country nm
`‘T NATO cqunuy nm
`NATO mapping table
`
`Factbook country page wrapper
`country url—>
`total area ’
`latitude 6
`'°"E*"!d° <—
`"°"“”“'°“ 6
`Factbook index wrapp -
`country nm —*
`/
`country url <v
`Map database 8
`counuyniflli,
`NATO page wra
`er
`Nato country nnréefl
`
`[
`
`
`
`become more involved. For instance, indexes can be
`hierarchical, in which case each level of the hierarchy
`must be modeled as an information source.
`Imagine
`the top—level factbook index was a list of letters, so that
`clicking on a letter “C” would produce an index page
`for countries starting with “C” (a “subindex”). We
`would model this top level index as a relation between
`letters and subindex—URL’s. To traverse this index, we
`also need an information source that takes a country
`name and returns the first letter of the name (e.g., a
`string manipulation program). Thus, altogether four
`wrappers would be involved in the navigation process,
`as shown in Figure 7. Given a query asking for the
`Netherlands’ population, the first wrapper would take
`the name “Netherlands”, call the string manipulation
`program, and return the first letter of the name, “N”.
`The second wrapper would take the letter “N”, ac-
`cess the top level index page, and return the subindex-
`URL. The third wrapper would take the subindex-URL
`and the country name, access the subindex page for
`countries starting with “N”, and return the country-
`URL. Finally, the last wrapper would take the country-
`URL and access the Netherlands page. The advan-
`tage of our approach is that all these wrappers are
`treated uniformly as information sources, so the query
`planner can automatically determine how to compose
`the query plan. Furthermore,
`the wrappers can be
`semi—automatically created via the learning approach
`described earlier, except for the string manipulation
`wrapper, which is a common utility.
`
`Factbook country page wrapper
`country url —e—
`total area <—
`latitude %
`longitude %
`population (j
`
`String Manipulation
`program wrapper
`% country nm
`% first letter
`
`SublndexWrapper Top level indexwrapper
`
`subindex ur1T>
`country nm%
`country url%
`
`/
`
`\
`
`% country nm
`% first letter
`> subindex url
`
`Figure 7: Domain Model with Hierarchical Index
`
`Modeling Information Across Sites
`Within a single site, entities (e.g., people, places, coun-
`tries, companies, etc.) are usually named in a consis-
`tent fashion. However, across sites, the same entities
`may be referred to with different names. For example,
`the CIA factbook refers to the “Vatican City” while
`the World Governments site refers to “The Holy See”.
`Sometimes formatting conventions are responsible for
`differences, such as “Denmark” vs. “Denmark, King-
`dom of”. To make sense of data that spans multiple
`sites, we need to be able to recognize and resolve these
`differences.
`
`Our approach is to select a primary source for an
`entity’s name and then provide a mapping from that
`
`216 Modeling the Web
`
`Figure 8: Domain Model with Mapping Tables
`
`Currently, mapping tables and functions must be
`created manually, but we are developing a semi-
`automated method for building mapping tables and
`functions by analyzing the underlying data in advance.
`The basic idea is to use information retrieval
`tech-
`
`(Cohen
`niques to provide an initial mapping (e.g.,
`1998)), and then use additional data in the sources to
`resolve any remaining ambiguities via statistical learn-
`ing methods (e.g., (Huang and Russell 1997)). For ex-
`ample, both the Factbook and the World Governments
`sources list the title of the Heads of State.” This infor-
`mation can help determine that the Factbook’s “North
`Korea” and “South Korea” refer respectively to the
`World Government’s “Democratic Republic of Korea”
`
`“The Factbook lists the name of the Heads of State as
`well but, unlike the VVorld Governments site, the informa-
`tion is often out of date. This is one reason why the World
`Governments site is useful.
`
`
`
`

`
`and “Republic of Korea”, rather than the other way
`around. Our approach can also be; used to. automat-
`ically update mapping tables when I1€W'S.0.ll:E(Z€S
`are
`released. For instance, each year a new version of the
`CIA factbook is released, and sometimes'<zo_u"vntries have
`new names, or countries merge or split. These name
`confusions can often be resolved using geographical in-
`formation (e.g., land area, latitude and longitude).
`
`Applications
`
`Below we list some Ariadne applications we are devel-
`oping, illustrating the generality of our approach:
`
`World-Wide Geographic Information Server:
`We are collaborating with another group that is build-
`ing a geographic information system that integrates
`a variety of map—based information sources. These
`sources include satellite images, detailed street maps,
`parcel data, historical aerial photographs, etc. We
`are using Ariadne to extract geographically referenced
`data from the Web and integrate it with map data.
`We have built a system using‘ Ariadne that extracts
`restaurant data from the Zagats Restaurant Reviews
`site, feeds the restaurant address into a geocoder, and
`then places the restaurant on an aerial map. Other
`web sources that we plan to incorporate include cen-
`sus data, US Geological Survey data, and real estate
`data from the Multiple Listing Service.
`
`Electronic Catalog Access: We are applying Ari-
`adne to provide access to online electronic catalogs for
`the Defense Logistics Agency. One implemented appli-
`cation provides real-time access to pricing and avail-
`ability data from the General Services Administration
`web pages. This application accesses only a single site,
`but retrieves pricing data for parts by extracting and
`integrating data from multiple pages in the site.
`
`Financial Information Agent: We have done ini-
`tial work on an agent that accesses stock quote servers,
`stock exchange sources,
`and the SEC’s EDGAR
`Archives (which contains copies of financial filings,
`such as annual reports, by publicly traded companies
`and mutual funds). By integrating these sources, the
`agent could answer queries such as “Find all airline
`companies whose stock has risen more than thirty per-
`cent in the last year” and “Find all people who serve
`as directors of two or more companies located in Los
`Angeles”. Using the modeling tools described earlier,
`users could also include their own personal financial
`data sources, tailoring the system to their needs.
`
`Related Work
`
`There is large body of relevant literature on infor-
`mation integration (Wiederliold 1996), but the most
`closely related work focuses specifically on the prob-
`lems of information integration on the Web, such as In-
`formation Manifold (Levy et al. 1996), Occam (Kwok
`and Weld 1996), Infomaster (Genesereth et a1. 1997),
`
`and InfoSleuth (Bayardo Jr. et al. 1997). These sys-
`tems focus on a variety of issues, including the prob-
`lems of representing and selecting a relevant set of
`sources to answer a query, handling binding patterns,
`and resolving discrepancies among sources. All of this
`work is directly relevant to Ariadne, but the issue ad-
`dressed in this paper that has not been addressed pre-
`viously is how one represents the information within a
`single page, across pages at a site, and across sites to
`support web-based information integration.
`Another closely related body of work is on the ex-
`traction of data from web sources (Hammer et al. 1997;
`Doorenbos et al. 41997; Kushmerick 1997). The fo-
`cus of all of these systems are on building wrappers
`for semi-structured sources. The systems either take a
`template—based specification of a source, as in (Ham-
`mer et a1. 1997), or learn the structure of the source by
`example and then compile a wrapper that provides ac-
`cess to the source, as in (Kushmerick 1997). Our work
`on inducing wrappers takes the latter approach. The
`induction method is not only very general, but is also
`integrated into the larger Ariadne development system
`so that the learned wrappers can be used directly by
`the query planner.
`
`Discussion
`
`There are many examples of impressive AI systems
`based on relatively simple representational schemes. In
`the realm of planning, recent examples include SAT-
`plan (Kautz and Selman 1996) and Graph—plan (Blum
`and Furst 1995); the former employs a propositional
`CSP approach, the latter, a graph—based search.
`In
`machine learning, propositional learning schemes (e.g.,
`decision trees) have been dominant. Though it is often
`difficult to understand exactly what a simple represen-
`tational scheme buys you computationally, one thing
`seems clear: systems with simple representations are
`often easier to design and understand.
`in terms of
`We believe that Ariadne is successful,
`the broad applicability of the approach, because it
`combines a simple representation scheme with sophisti-
`cated modeling tools that map web information sources
`into this simple representation. Ariadne capitalizes on
`a representation scheme adopted from database sys-
`tems, where the world consists of a set of relations (or
`tables) over objects, and simple relational operators
`(retrieve, join, etc.) are composed to answer queries.
`This representation makes it straightforward to inte-
`grate multiple databases using an AI planner. Ari-
`adne’s planner can efficiently search for a sequence of
`joins, selections, etc. that will produce the desired re-
`sult without needing to do any sophisticated reasoning
`about the information sources themselves.
`than the
`The Web environment
`is much richer
`database world, of course. What makes Ariadne possi-
`ble are the modeling tools that enable a user to cre

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket