`for Efficient Keyword-Based Search
`
`Qi Su
`Stanford University
`qisu@cs.stanford.edu
`
`Jennifer Widom
`Stanford University
`widom@cs.stanford.edu
`
`Abstract
`
`Information Retrieval systems such as web search en-
`gines offer convenient keyword-based search interfaces. In
`contrast, relational database systems require the user to
`learn SQL and to know the schema of the underlying data
`even to pose simple searches. We propose an architecture
`that supports highly efficient keyword-based search over re-
lational databases: A relational database is "crawled" in
`advance, text-indexing virtual documents that correspond
`to interconnected database content. At query time, the
`text index supports keyword-based searches with interac-
`tive response, identifying database objects corresponding
`to the virtual documents matching the query. Our system,
`EKSO, creates virtual documents from joining relational tu-
`ples and uses the DB2 Net Search Extender for indexing
`and keyword-search processing. Experimental results show
`that index size is manageable and database updates (which
`are propagated incrementally as recomputed virtual docu-
`ments to the text index) do not significantly hinder query
`performance. We also present a user study confirming the
`superiority of keyword-based search over SQL for a range
`of database retrieval tasks.
`
`1. Introduction
`
`Figure 1. Example database instance
`
`Relational Database Management Systems provide a
`convenient data model and versatile query capabilities over
`structured data. However, casual users must learn SQL and
`know the schema of the underlying data even to pose sim-
`ple searches. For example, suppose we have a customer
`order database, shown in Figure 1, which uses the TPC-H
`[4] schema in Figure 2. We wish to search for Bob’s pur-
`chases related to his aquarium fish hobby. To answer this
`query, we must know to join the Customer, Lineitem, Or-
`ders, and Part relations on the appropriate attributes, and
we must know which attributes to constrain for "Bob" and
"fish". The following SQL query is formulated:
`
`SELECT c.custkey, o.orderkey, p.partkey
`FROM Customer c, Lineitem l, Orders o, Part p
`WHERE c.custkey = o.custkey AND l.partkey = p.partkey
`AND o.orderkey = l.orderkey AND c.name LIKE ’%Bob%’
`AND p.name LIKE ’%fish%’
`
In our example database instance, this query returns a
single tuple ⟨custkey, orderkey, partkey⟩ of
⟨100, 10001, 10⟩. In contrast, keyword-based search, exem-
`plified by web search engines, offers an intuitive interface
`for searching textual content that requires little knowledge
`about the underlying data. In this paradigm, for the above
example a user should be able to enter the keywords "Bob
fish" and find the relevant information from the results.
`
`1
`
`Enfish, LLC; IPR2014-00574
`Exhibit 2232
`Page 1 of 13
`
`
`
`The Crawler can construct textual virtual documents in
`any fashion from the database (typically based on in-
`terconnected tuples from multiple relations if the data-
`base is relational), as long as each virtual document
`is associated with some logical or physical database
`object that makes sense when returned to the user.
`The Crawler includes an Update Handler that com-
`putes changes to virtual documents when the under-
`lying database is updated.
`
`• Virtual documents are sent to the Indexer, which
`builds an inverted index over the virtual documents in
`the usual Information Retrieval (IR) style. When the
`Crawler identifies new or modified virtual documents
`as a result of database updates, the Indexer refreshes
`the inverted index incrementally.
`
`• At search time, the Keyword Search Processor takes
`a keyword query and probes the inverted index (again
`in an IR style) to find the virtual documents satisfying
`the query. The database objects associated with these
`documents are returned to the user. The user may then
`choose to explore the database or refine the query.
`
`The key features and advantages of our approach, and
`the contributions of this paper, are summarized as follows:
`
`• Most previous approaches to keyword-based search in
`structured databases (reviewed in Section 2) perform a
`significant amount of database computation at search
`time. Our offline indexing approach does all signifi-
`cant work in advance so it can provide interactive an-
`swers. We show in our performance study that the
potential obstacles to this approach (index construction
and update handling) are manageable. Index construc-
`tion performance depends primarily on DB2, but can
`easily be parallelized. Updates do not have a dramatic
`impact on query performance, although for databases
`with very high update rate, one of the existing online
`approaches may be more appropriate.
`
`• Our approach is extensible. We construct virtual doc-
`uments through user-defined functions (UDFs) that in
`turn execute SQL queries. Currently our virtual docu-
`ments are formed by concatenating text fields in inter-
`connected relational tuples. However, without chang-
`ing the infrastructure we could plug in a different UDF
`to construct virtual documents differently. For exam-
`ple, we could produce XML or some other textual
`format as virtual documents, possibly incorporating
`schema information. If the text search engine provides
`XML search capabilities such as XPath [3], our system
`automatically inherits those capabilities.
`
`Figure 2. TPCH Schema
`
`Figure 3. Architecture
`
`We have developed a system, EKSO (for Efficient Key-
`word Search through Offline Indexing), that supports highly
`efficient keyword-based search on relational databases. For
`the above example, on query ”Bob fish” EKSO returns a re-
`sult equivalent to the SQL query result. EKSO also supports
`much more complex searches over complex data. Our novel
`use of DB2’s Net Search Extender (NSE) [7, 13] enabled us
`to build a powerful system that is extensible, and that can
`incorporate future text indexing and search features in NSE
`without modification. Our system is the first we know of for
`expressive keyword-based search over interconnected rela-
`tional database content that operates within the database en-
`gine and performs full offline indexing for highly efficient
`searches.
`EKSO is an instance of a general architecture we propose
`in this paper, depicted in Figure 3 and consisting of three
`main components:
`
`• The Crawler traverses the entire database offline, con-
`structing virtual documents from database content.
`
• We exploit existing relational database extensions for
text indexing and search that is highly expressive and
extensible in its treatment of database content. By
`using DB2’s Net Search Extender as the back end,
`we completely avoid reimplementing basic IR capa-
`bilities. Furthermore, our method for exploiting the
`text extension ensures that all features are automati-
`cally available in our search tool, including stemming,
`stopword elimination, case-folding, Boolean queries,
`proximity queries, relevance-ranked results, wildcards,
`fuzzy searches, and natural language searches [7].
`
`• The premise of our work is that keyword-based search
`is more intuitive and convenient than SQL for many
`query tasks over relational databases. To verify this as-
`sumption, we conducted the only study we are aware
`of that quantifies the relative performance of SQL
`queries versus keyword searches on a variety of query
`tasks.
`
`1.1. Outline of Paper
`
`We provide a thorough survey of related work in Sec-
`tion 2. Section 3 is the core of the paper. It describes the
`EKSO system, including functionality, algorithms, and sys-
`tem implementation details. Section 4 presents our perfor-
`mance study of the EKSO system, while Section 5 presents
`our user study demonstrating the advantages of keyword-
`based searches over SQL queries. We conclude and suggest
`future directions in Section 6.
`
`2. Related Work
`
`An early paper [12] proposes implementing abstract data
`types, user-defined operators, and specialized indexes for
textual databases, a precursor to current relational database
`vendors’ approaches to integrating full-text search function-
`ality.
`IBM DB2 [13], Microsoft SQL Server [9], Oracle
`[6], PostgreSQL, and MySQL all provide text search en-
`gine extensions that are tightly coupled with the database
`engine. However, in all cases each text index is defined over
`a single column. Using this feature alone to do meaningful
`keyword search over an interconnected database would re-
`quire merging the results from many column text indexes.
`If our interconnected tuples contain N text attributes, and
`our keyword query contains M conjunctive terms, then in
`addition to joining relations to generate the interconnected
`tuples (as we do offline; see Section 3.1), we must invoke
the text search engine up to N × M times to match the query
`keywords to the text columns.
`Three systems share the concept of crawling databases to
`build external indexes. Verity [15] crawls the content of re-
`lational databases and builds an external text index for key-
`word searches, as well as external auxiliary indexes to en-
`able parametric searches. This approach is similar to ours
`
`in terms of the index-all-data-offline paradigm. However,
`by leveraging the database vendor’s integrated text search
`engine tool, we avoid transporting data outside the data-
`base for external processing, which can be expensive and
`difficult to synchronize, and we inherit all DBMS features
`automatically.
`DataSpot [5] extracts database content and builds an
`external, graph-based representation called a hyperbase to
`support keyword search. Graph nodes represent data ob-
`jects such as relations, tuples, and attribute values. Query
`answers are connected subgraphs of the hyperbase whose
`nodes contain all of the query keywords. Similar to our
`concept of a root tuple (Section 3.1.1), each answer is rep-
`resented by a node associated with a tuple in the database,
`from which the user can navigate to related tuples. As with
`Verity, DataSpot requires transporting content to be indexed
`outside the database.
`DbSurfer [18] indexes the textual content of each rela-
`tional tuple as a virtual web page. For querying and navi-
`gation, the database foreign-key constraints between tuples
`are treated as hyperlinks between virtual web pages. Given
`a keyword query, the system computes a ranked set of vir-
`tual web pages that match at least one keyword. Then a
`best-first expansion algorithm finds a ranked set of naviga-
`tion paths originating from the starting web pages. The nav-
`igation path concept is similar to our definition of a text ob-
`ject (Section 3.1.2). However, even though the web pages
`are indexed offline, significant query time computation is
`required to expand the starting web page set to find navi-
`gation paths satisfying the full query, work that we perform
`offline.
`Three systems, DBXplorer, BANKS, and DISCOVER,
`share a similar approach: At query time, given a set of key-
`words, first find tuples in each relation that contain at least
`one of the keywords, usually using auxiliary indexes. Then
`use graph-based approaches to find tuples among those
`from the previous step that can be joined together, such that
`the joined tuple contains all keywords in the query. All three
`systems use foreign-key relationships as edges in the graph,
`and point out that their approach could be extended to more
`general join conditions. We discuss the three systems in
`more detail.
`Given a set of keywords, DBXplorer [1] first finds the
`set of tables that contain at least one of the keywords. Then
`using the undirected schema graph (where each node is a
`relation and each edge is a foreign-key relationship), a set
`of subgraphs is enumerated, such that each subgraph repre-
`sents a joining of relations where the result contains rows
`that potentially contain all of the query keywords. For each
`candidate join subgraph generated, a SQL statement is ex-
`ecuted to join the specified relations and select rows that
`do contain all of the keywords. The system is evaluated
`on TPC-H (100MB to 500MB) and three internal Microsoft
`
`3
`
`Enfish, LLC; IPR2014-00574
`Exhibit 2232
`Page 3 of 13
`
`
`
`Figure 4. EKSO System
`
`databases (130MB to 375MB, from 19 tables to 84 tables).
`The paper focuses on matching keywords to an entire at-
`tribute value. It briefly discusses extending the system to
`work with subattribute granularity matches and to provide
`more sophisticated search options.
`BANKS [2] uses a data graph where each tuple is a node
`and each foreign-key relationship between tuples is repre-
`sented as a bidirectional edge. The keyword query result
`is a connected subgraph of tuples, where each keyword is
`found in at least one tuple node. Dijkstra’s algorithm is used
`to find the shortest path such that the nodes covered collec-
`tively contain all query keywords. The system can rank the
`results based on the sum of the node relevance scores and
`edge weights.
`DISCOVER [11] uses schema graph information to enu-
`merate minimum joining networks, which are sequences of
`tuples that can be joined together (based on foreign-key re-
`lationships) into a single tuple that contains all of the key-
`words. The algorithm models a tuple-set graph. Execution
`plans for the set of joining networks are generated and op-
`timized to share intermediate results, then translated into
`SQL statements to build the join tuples that contain all of
`the query keywords. The performance evaluation focuses
`primarily on two aspects, efficiency of pruning irrelevant
`joining networks, and comparison of several algorithms
`for deciding what intermediate results to build and share.
`Execution time increases significantly with the number of
`candidate joining networks or the number of keywords in
`the query. [10] extends DISCOVER to perform relevance-
`ranked and top-k keyword queries.
`An alternate approach, taken in [14, 16], translates key-
`
`word searches to SQL based on the schema, but does
`not focus on efficient implementation techniques. [8] de-
`scribes a system for keyword and proximity search in graph-
`structured databases based on shortest-path computations,
`focusing on a combination of offline and online computa-
`tion.
`While a direct empirical comparison between our system
`and some of the other approaches mentioned in this sec-
`tion would be very interesting, the systems are not publicly
`available, and any effort to implement them well enough for
`a fair comparison would be prohibitive.
`
`3. The EKSO System
`
`The system we have developed, EKSO, is an instanti-
`ation of the general architecture we propose for keyword-
`based search over structured databases that was shown in
`Figure 3. Figure 4 shows details of the EKSO system,
`which will be described as this section proceeds. EKSO is
`fully implemented and functional; performance evaluation
`is presented in Section 4. In Section 3.1 we define EKSO’s
`virtual documents and discuss in detail their computation
`from the underlying relational database. Sections 3.2, 3.3,
`and 3.4 discuss operation of the Crawler, Indexer, and Key-
`word Search Processor, respectively. Section 3.5 describes
`how the system handles database updates, determining and
`recomputing the affected virtual documents. Section 3.6
`contrasts our system with an alternative approach based on
`materialized views. Finally, Section 3.7 describes the user
`interface to the system.
`
`4
`
`Enfish, LLC; IPR2014-00574
`Exhibit 2232
`Page 4 of 13
`
`
`
Input : Schema graph G where nodes are relations and edges are
        foreign key constraints, and a set of root relations RS
Output: Set of text objects T

T ← {};
foreach root relation R in RS do
    foreach tuple r in R do
        text object t ← {r};
        start at R, do breadth-first traversal on G;
        foreach relation R' traversed do
            add to t the set of R' tuples that foreign-key join
            with existing tuples in t;
        add t to T;
return T;
Figure 5: Text Object Construction (TOC) algorithm
`
Example: Consider the text object construction process for one root
tuple P1 from relation Part in the database instance in Figure 1.
Starting with tuple P1, in the first step of the traversal we select
Partsupp tuple PS1 and Lineitem tuple L1, which both join with
P1 on the partkey attribute. Next we select Supplier tuple S1,
which joins with PS1 on suppkey, and Orders tuple O1, which
joins with L1 on orderkey. Then we select Customer tuple C1,
which joins with O1 on custkey. The text object for P1 consists of
the set of tuples {P1, PS1, L1, S1, O1, C1}.
Figure 6: Text Object Construction example
`
`3.1. Text Objects and Virtual Documents
`
`Recall from Section 1 that our goal is to perform a full
`crawl of the database, generating virtual documents from
`database content that define the granularity of text search
`and retrieval in the system. In EKSO we separate two con-
`cepts:
`
`• Text objects are generated from interconnected tuples
`based on joining relational tables.
`
`• Virtual documents are generated from text objects by
`concatenating their string attributes.
`
`This separation has at least two advantages:
`
`1. We can use SQL queries, which are optimized auto-
`matically by the DBMS, to generate text objects.
`
`2. We can independently plug in different functionality
`for generating text objects from the database, or for
`generating virtual documents from text objects.
`
`In EKSO we define a text object to be the set of tuples
`in the database that are related to a given root tuple. When
`a virtual document D is part of the answer to a keyword
`query, the system returns an identifier for the root tuple as-
`sociated with the text object from which D was generated.
`
`3.1.1 Root Relations and Tuples
`
`Root tuples are defined as all of the tuples in root relations.
`Root relations can be designated automatically or set by the
`database administrator. Designating a relation R as a root
`relation creates text objects and virtual documents associ-
`ated with each tuple of R. In EKSO as a default we select as
root relations all "entity relations", i.e., all relations whose
`primary key is not composed entirely of foreign-keys.
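The default rule can be sketched as follows; this is a minimal illustration in which the schema-metadata dictionary is a hypothetical stand-in for the system catalog, not EKSO's actual interface.

```python
# Sketch of the default root-relation rule: a relation is an "entity
# relation" (hence a root relation) unless its primary key is composed
# entirely of foreign-key attributes. The metadata dict is hypothetical.
def default_root_relations(schema):
    roots = []
    for relation, meta in schema.items():
        pk = set(meta["primary_key"])
        fk = set(meta["foreign_key_attrs"])
        if not pk <= fk:  # some primary-key attribute is not a foreign key
            roots.append(relation)
    return sorted(roots)

# Illustrative TPC-H-style metadata (abridged).
SCHEMA = {
    "Customer": {"primary_key": ["custkey"], "foreign_key_attrs": ["nationkey"]},
    "Orders":   {"primary_key": ["orderkey"], "foreign_key_attrs": ["custkey"]},
    "Part":     {"primary_key": ["partkey"], "foreign_key_attrs": []},
    "Partsupp": {"primary_key": ["partkey", "suppkey"],
                 "foreign_key_attrs": ["partkey", "suppkey"]},
}
```

Here Partsupp's primary key consists entirely of foreign keys, so it is not selected as a root relation.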
`
`3.1.2 Text Objects
`
`We define the text object for a particular root tuple r as the
`set of all tuples in the database that join with r through a se-
quence of zero or more foreign-key joins. We do not back-
track, i.e., the sequence of foreign-key joins visits any
relation at most once. See the example in Figure 6.
`Figure 5 outlines the text object construction (TOC) al-
`gorithm. To construct the text object for a root tuple r of
`root relation R, we start with r. Logically we perform a
`breadth-first traversal of the schema graph. For all neighbor
`relations of R, we select those tuples that join with r, based
`on foreign-key constraints. Next we join those tuples with
`relations two away from the root relation R, and so on, un-
`til we have traversed all relations reachable from R in the
`schema graph. In the schema graph traversal, we treat for-
`eign key constraints as bidirectional. We mark each relation
`during the breadth-first traversal to avoid backtracking and
`cycles in the schema graph.
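The traversal can be sketched concretely on a toy instance modeled after Figure 1. This is a minimal sketch, not the system's UDF-based implementation: the table columns, key values, and the EDGES list of bidirectional foreign-key edges are illustrative assumptions.

```python
import sqlite3

# Toy instance modeled on Figure 1 (illustrative key values).
db = sqlite3.connect(":memory:")
db.row_factory = sqlite3.Row
db.executescript("""
CREATE TABLE Part(partkey INTEGER, name TEXT);
CREATE TABLE Partsupp(partkey INTEGER, suppkey INTEGER);
CREATE TABLE Supplier(suppkey INTEGER, name TEXT, address TEXT);
CREATE TABLE Lineitem(orderkey INTEGER, partkey INTEGER);
CREATE TABLE Orders(orderkey INTEGER, custkey INTEGER);
CREATE TABLE Customer(custkey INTEGER, name TEXT, address TEXT);
INSERT INTO Part VALUES (10, 'Discus fish');
INSERT INTO Partsupp VALUES (10, 1);
INSERT INTO Supplier VALUES (1, 'Acme', 'Wisconsin');
INSERT INTO Lineitem VALUES (10001, 10);
INSERT INTO Orders VALUES (10001, 100);
INSERT INTO Customer VALUES (100, 'Bob', 'Madison');
""")

# Schema graph: bidirectional foreign-key edges (relation, relation, join attr).
EDGES = [
    ("Part", "Partsupp", "partkey"),
    ("Part", "Lineitem", "partkey"),
    ("Partsupp", "Supplier", "suppkey"),
    ("Lineitem", "Orders", "orderkey"),
    ("Orders", "Customer", "custkey"),
]

def text_object(root_rel, key_attr, key_val):
    """Collect the tuples joined (directly or transitively) to one root tuple."""
    rows = {root_rel: db.execute(
        f"SELECT * FROM {root_rel} WHERE {key_attr} = ?", (key_val,)).fetchall()}
    frontier = [root_rel]
    while frontier:                       # breadth-first over the schema graph
        rel = frontier.pop(0)
        for a, b, attr in EDGES:
            if rel not in (a, b):
                continue
            nxt = b if rel == a else a
            if nxt in rows:               # join into each relation at most once
                continue
            vals = sorted({row[attr] for row in rows[rel]})
            marks = ",".join("?" * len(vals)) or "NULL"
            rows[nxt] = db.execute(
                f"SELECT * FROM {nxt} WHERE {attr} IN ({marks})", vals).fetchall()
            frontier.append(nxt)
    return rows
```

For root tuple P1 (partkey 10), `text_object("Part", "partkey", 10)` gathers tuples from all six relations, matching the Figure 6 example.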
`The algorithm as described is not guaranteed to cover the
`entire database when creating text objects, unless certain
`conditions hold. We must have a set of root relations such
`that all relations in the database are reachable from some
`root relation via the schema graph. Otherwise we may have
"dangling relations", in which case we could use a second
`pass to create additional text objects from the dangling con-
`tent. Note that because we follow foreign key edges, there
`will be no dangling tuples within connected relations.
`Alternate definitions could be used for text objects from
`a given root tuple. For example, we could follow equijoins
`as well as foreign-key joins (with a second pass to handle
`dangling tuples), and we could backtrack. In the Figure 6
`example, the text object for Part tuple P1 includes the sup-
`pliers who supply P1, as well as customers who have or-
`dered P1. We could backtrack to include other parts sup-
`plied by those suppliers who supply P1, as well as other
`parts ordered by those customers who ordered P1.
`
`3.1.3 Virtual Documents
`
`After a text object has been computed, we concatenate all
`textual attributes of all tuples in the text object to form the
`virtual document. The order of the tuples selected during
`traversal is nondeterministic due to SQL semantics, and in
`
`5
`
`Enfish, LLC; IPR2014-00574
`Exhibit 2232
`Page 5 of 13
`
`
`
`the current system we are not concerned about the concate-
`nation order. However, if we extend the system to richer
`virtual documents, as we will discuss next, we may need to
`be careful about concatenation order. We can also include
`nontextual fields, such as numbers and dates, by simply us-
`ing their string representation. As an example, we concate-
`nate the textual attributes of the six tuples that comprise the
`text object for root tuple P1 from the Figure 6 example. For
`one possible ordering the virtual document would look like:
`Discus fish Acme Wisconsin Bob Madison.
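The concatenation step itself is simple; the following sketch reproduces the example above (tuple contents follow Figure 6, and per the current-system behavior only string attributes are kept, though nontextual fields could be included via their string representation).

```python
# Sketch of virtual-document generation: concatenate the textual
# attributes of every tuple in a text object.
def virtual_document(text_object_tuples):
    words = []
    for tup in text_object_tuples:
        words.extend(v for v in tup if isinstance(v, str))
    return " ".join(words)

# Root tuple P1's text object, in one possible traversal order
# (key values are illustrative).
P1_TEXT_OBJECT = [
    (10, "Discus fish"),       # Part P1
    (10, 1),                   # Partsupp PS1
    (1, "Acme", "Wisconsin"),  # Supplier S1
    (10001, 10),               # Lineitem L1
    (10001, 100),              # Orders O1
    (100, "Bob", "Madison"),   # Customer C1
]
```

Calling `virtual_document(P1_TEXT_OBJECT)` yields the string shown above, "Discus fish Acme Wisconsin Bob Madison".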
`Note that for now we have chosen one particular defi-
`nition of text objects and virtual documents. We can eas-
`ily plug in different definitions. For example, we might
`generate richer structures in virtual documents. If the In-
`dexer and the Keyword Search Processor supported XML
`content, we could generate an XML fragment that captures
`database structure and uses tags to encode attribute and re-
`lation names. Such virtual documents enable XPath [3] or
`other XML-based queries. The XML fragment for our ex-
`ample might look like:
`
`<PART>
`<NAME>Discus fish</NAME>
`<SUPPLIER>
`<NAME>Acme</NAME>
`<ADDRESS>Wisconsin</ADDRESS>
`</SUPPLIER>
`<CUSTOMER>
`<NAME>Bob</NAME>
`<ADDRESS>Madison</ADDRESS>
`</CUSTOMER>
`</PART>
`
`3.2. Crawler
`
`The Crawler is responsible for traversing a structured
database to produce virtual documents to be indexed. In
Figure 4, the three relations on the left-hand side labeled
`R1, R2, and R3 correspond to the Structured Database in
`Figure 3. The Crawler in Figure 3 maps to the bulk of the
`system diagram in Figure 4, enclosed in the dashed box.
`It centers on a user-defined function (UDF) implemented in
`DB2 that takes as input an identifier for a root tuple, invokes
`a Text Object Construction (TOC) query, and constructs and
`returns a virtual document. The TOC queries are composed
`in advance following the TOC algorithm from Figure 5.
`There is one such query for each root relation. With knowl-
`edge of the schema, SQL WITH queries parameterized by
`each root relation’s primary key values are used. See Ap-
`pendix A for an example TOC query.
`Before indexing, a special table called Text Object Key
Relation (TOKR) is populated with keys that individually
`identify all of the root relations and tuples in those rela-
`tions. At indexing time, the Crawler UDF decodes each key
`to find the root relation that the tuple belongs to, then looks
up the corresponding TOC query, parameterizes it with the
root tuple identifier values, and executes it. The query result
`is the text object, whose textual attributes are concatenated
`into a Binary Large Object (BLOB). At this point, the UDF
`discards the text object and returns the BLOB as the virtual
`document.
`
`3.3. Indexer
`
`The Indexer is the box labeled NSE in the right-hand por-
`tion of Figure 4. It is managed by the DB2 Net Search Ex-
`tender (NSE). NSE is a full-featured text search engine in-
`tegrated with the IBM DB2 DBMS product. NSE supports
`the creation of text indexes on individual columns, and can
`index the output of applying a user-defined function to a
`column, which is the feature we exploit. We create an NSE-
`managed IR-style inverted index over the virtual documents
`returned by the Crawler UDF described in the previous sec-
`tion. NSE indexes all of the virtual documents, associating
`each one with its corresponding text object key.
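What NSE provides here can be pictured as an IR-style inverted index with incremental maintenance. The following is a minimal sketch, not NSE's actual interface, and it omits NSE's stemming, stopword, and ranking features:

```python
from collections import defaultdict

# Toy inverted index over virtual documents keyed by text object key;
# conjunctive keyword search is posting-set intersection.
class InvertedIndex:
    def __init__(self):
        self.postings = defaultdict(set)   # term -> set of text object keys

    def index(self, key, virtual_doc):
        for term in virtual_doc.lower().split():
            self.postings[term].add(key)

    def remove(self, key):                 # used on incremental refresh
        for keys in self.postings.values():
            keys.discard(key)

    def search(self, *terms):
        sets = [self.postings.get(t.lower(), set()) for t in terms]
        return set.intersection(*sets) if sets else set()

idx = InvertedIndex()
idx.index("Part:P1", "Discus fish Acme Wisconsin Bob Madison")
idx.index("Part:P2", "Neon tetra Acme Wisconsin Alice Chicago")
```

With this toy index, `idx.search("Bob", "fish")` returns only P1's key, while `idx.search("Acme")` returns both.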
`
`3.4. Keyword Search Processor
`
`In Figure 4, the Keyword Search Processor from Figure 3
is the box labeled "Translate Keywords to NSE Query".
`A user’s keyword-based query is translated to an NSE in-
`vocation in the form of a SQL query invoking the special
`CONTAINS function. For example, after indexing our data-
`base instance from Figure 1, the conjunctive keyword query
"Bob fish" is translated into the following SQL query:
`
SELECT key FROM Tokr
WHERE CONTAINS(key, '"Bob" & "Fish"') = 1
`
`The CONTAINS function automatically invokes the
`NSE search engine, which probes its inverted index and re-
`turns text object keys as query results, which are then re-
`turned to the user. The user may choose to explore the data-
`base from these keys, or further refine the search. We de-
`scribe our user interface that enables these activities in Sec-
tion 3.7. Currently user exploration begins from scratch; to
speed up user traversal we could consider caching traversal
paths from text object construction.
By building on commercial DBMS text search features,
we automatically inherit all of their capabilities, current as
`well as future. For example, EKSO automatically deliv-
`ers ranked search results based on NSE’s built-in ranking
`functions.
`
`3.5. Update Handler
`
`The Update Handler that is a part of the Crawler com-
ponent in Figure 3 corresponds to the box labeled "TOU
Triggers" at the top right of the system diagram in Figure 4.
`When one or more tuples in the underlying database are
`
`
`
`
Input : Schema graph G where nodes are relations and edges are
        foreign key constraints, set of root relations RS, and an
        inserted, deleted, or updated tuple r from relation R
Output: Root tuples RT whose text objects and virtual
        documents need to be recomputed

RT ← {};
RT' ← {r.old, r.new};
start at R, do breadth-first traversal on G;
foreach relation R' traversed do
    D ← set of tuples from R' that foreign-key join with existing
        tuples in RT';
    add D to RT';
    if R' in RS then
        add D to RT;
return RT;
Figure 7: Text Object Update (TOU) algorithm
`
             | Root tuple r                    | Non-root tuple r
r inserted   | Index TOC(r), Modify TOU(r)     | Modify TOU(r)
r deleted    | Delete index entries for r,     | Modify TOU(r)
             | Modify TOU(r)                   |
r updated    | Delete index entries for r.old, | Modify TOU(r)
             | Index TOC(r.new), Modify TOU(r) |

Table 1. Text Object Update actions
`
`modified (inserted, deleted, or updated), some text objects
`and virtual documents may need to be created, deleted, or
`updated. For a given modified tuple r, the first step is to
`compute the delta set of root tuples whose text object and
`virtual document must be recomputed. We call this compu-
`tation Text Object Update (TOU) and we denote the delta set
`of root tuples as TOU(r). Given a modified tuple r, TOU(r)
`is the union of the following two sets:
`
`• The set of existing root tuples whose text object in-
`cludes the old version of r
`
`• The set of existing root tuples whose text object should
`now include the new version of r
`
`If tuple r is inserted, the old-version set is empty. If tuple
`r is deleted, the new-version set is empty. Once the delta
set TOU(r) is computed, for every root tuple r′ in TOU(r),
we repeat the crawling and indexing procedure described
earlier: The Crawler computes the text object and virtual
document of r′, then the Indexer removes any existing index
entries of r′ and indexes the recomputed virtual document.
`In Figure 7 we present the TOU algorithm to compute the
`delta set of root tuples whose text objects and virtual doc-
`uments must be recomputed as a result of modifications to
`an arbitrary tuple. Note the TOU algorithm is a dual of the
`TOC algorithm, since the TOU algorithm must find affected
`root tuples based on the text object semantics. Hence if the
`
`text object definition were to change (e.g., to permit back-
`tracking), then both the TOC and the TOU algorithms must
`change accordingly. Like the TOC algorithm, the TOU al-
`gorithm can be implemented as a SQL WITH query com-
`posed offline using schema information. There is one TOU
`query per relation in the database.
`Once the delta set of root tuples is identified, our update
`actions depend on whether the modified tuple r is itself a
`root tuple, and whether the original modification to r was an
`insert, delete, or update. Table 1 enumerates all the cases.
`r.old denotes the old version of a modified tuple r, i.e., a
`tuple being deleted, or a tuple before it is updated. r.new
`denotes the new version of a modified tuple r, i.e., a tuple
being inserted, or a tuple after it has been updated. "Modify
TOU(r)" denotes asking the Crawler to recompute the text
`objects and virtual documents of each root tuple in the delta
`set TOU(r), and asking the Indexer to update its index with
`the recomputed virtual documents. When we insert a new
`root tuple r we must also ask the Crawler to execute the
`TOC algorithm for r and ask the Indexer to index the virtual
`document generated by the Crawler, which we denote by
"Index TOC(r)". When we delete a root tuple, we must
`have the Indexer remove index entries associated with the
`deleted root tuple.
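The case analysis of Table 1 can be sketched as a small dispatch function; the action strings mirror the table's cells, and this is an illustration rather than the system's trigger code:

```python
# Sketch of the Update Handler's case analysis (Table 1): which indexing
# actions follow from a modification, depending on whether the modified
# tuple r is a root tuple and on the kind of modification.
def update_actions(is_root, op):
    if op not in ("insert", "delete", "update"):
        raise ValueError(op)
    if not is_root:
        return ["Modify TOU(r)"]
    if op == "insert":
        return ["Index TOC(r)", "Modify TOU(r)"]
    if op == "delete":
        return ["Delete index entries for r", "Modify TOU(r)"]
    return ["Delete index entries for r.old",
            "Index TOC(r.new)", "Modify TOU(r)"]
```
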
`The Update Handler is implemented using triggers. We
`create three TOU triggers (insert, delete and update) on each
`database relation. See Appendix B for an example TOU
`insert trigger. The main trigger action for all three triggers
`is an insert statement that executes the TOU query to find
`the set of root tuples whose virtual document is affected
`by changes to a given tuple, and then inserts into the NSE
`update log table requests to update the NSE index entries
`for these root tuples. At the next NSE index update (either
`a manual NSE index refresh request or automatic periodic
`index refresh), NSE automatically invokes the Crawler UDF
`to reconstruct the requested virtual documents, which are
`used to rebuild the inverted index.
`In the case of the insert trigger on a root relation, a sec-
`ond trigger action inserts a corresponding tuple into the
`TOKR, which automatically creates an NSE update request
`that a new virtual document be generated and indexed. For
`the delete trigger on a root relation, a second trigger action
`deletes the corresponding TOKR tuple, thus generating an
`NSE update request that the appropriate entries in the NSE
`index be removed.
`
`3.6. Materialized View Approach
`
`Our Crawler constructs text objects one at a time, gen-
`erates a virtual document, and discards the text object. The
`Indexer consumes the virtual document to build the inverted
`index, and then discards the virtual document. An alter-
`native approach would be to create a materialized view
`
`7
`
`Enfish, LLC; IPR2014-00574
`Exhibit 2232
`Page 7 of 13
`
`
`
Relation  | Root Reln | #Tuples   | Bulkload File Size
CUSTOMER  | Yes       | 75,000    | 12MB
LINEITEM  | No        | 2,999,671 | 360MB
NATION    | No        | 25        | 3KB
ORDERS    | Yes       | 750,000   | 82MB
PART      | Yes       | 100,000   | 11MB
PARTSUPP  | No        | 400,000   | 56MB
REGION    | No        | 5         | 1KB
SUPPLIER  | Yes       | 15,000    | 685KB

Table 2. TPCH500 dataset characteristics
`
Relation | Data Source | Root Reln | #Tuples   | Bulkload File Size
ITEMS    | Ebay        | Yes       | 500,000   | 795MB
USERS    | TPC dbgen   | Yes       | 26,913    | 4MB
FEEDBACK | TPC dbgen   | No        | 500,000   | 76MB
BIDS     | TPC dbgen   | No        | 1,000,000 | 54MB

Table 3. Auction dataset characteristics
`
`4.1. Datasets
`
`We evaluate our text indexing and search system on two
`datasets, a 500MB TPC-H database, and a 929MB Auction
`database. The TPCH500 dataset follows the complete TPC-
`H schema in Figure 2 and is generated using the official
TPC-H data generator (DBGEN) [4]. Schema information,
including relation names, attribute names, and join condi-
tions, is provided to our TOC Query construction applica-
`tion. We designate Customer, Orders, Part, and Supplier as
`the root relations. The system default would also include
`Region and Nation, but they are of too coarse a granularity
`to be useful to users. The four parameterized TOC Queries
`for the four root relations constitute our workload at index-
`ing time, since each UDF invocation executes one of these
`queries and concatenates the resulting tuples. We optimized
`the indexing process by submitting this four-query work-
`load to the DB2 Design Advisor, which suggested indexes
`to build on the database tables. Table 2 summarizes the
`TPCH500 dataset.
`The Auction dataset consists of a mixture of primarily
`real-world data and some synthetic data. It models a data-
`base for an online auction service and the schema is shown
`in Figure 8. Whereas the TPC-H schema has many rela-
`tively short textual fields and is fairly complex with many
`join relationships, the Auction dataset contains a few long
`textual fields and is very simple in terms of join relation-
`ships. The Items table is populated with 500,000 Ebay
`auction items crawled in 2001, provided courtesy of the
`University of Wisconsin database group. The Users ta-
`ble is populated from a subset of the Supplier table in our
`
`Figure 8. Auction dataset schema
`
`that persists all the virtual documents, and then maintain
`a text index over this view. Since our approach handles
`data updates, the only possible advantage we see to the
`view approach is exploiting existing infr