Indexing Relational Database Content Offline
for Efficient Keyword-Based Search

Qi Su
Stanford University
qisu@cs.stanford.edu

Jennifer Widom
Stanford University
widom@cs.stanford.edu

Abstract

Information Retrieval systems such as web search engines offer convenient keyword-based search interfaces. In contrast, relational database systems require the user to learn SQL and to know the schema of the underlying data even to pose simple searches. We propose an architecture that supports highly efficient keyword-based search over relational databases: A relational database is "crawled" in advance, text-indexing virtual documents that correspond to interconnected database content. At query time, the text index supports keyword-based searches with interactive response, identifying database objects corresponding to the virtual documents matching the query. Our system, EKSO, creates virtual documents by joining relational tuples and uses the DB2 Net Search Extender for indexing and keyword-search processing. Experimental results show that index size is manageable and database updates (which are propagated incrementally as recomputed virtual documents to the text index) do not significantly hinder query performance. We also present a user study confirming the superiority of keyword-based search over SQL for a range of database retrieval tasks.

1. Introduction

Figure 1. Example database instance

Relational Database Management Systems provide a convenient data model and versatile query capabilities over structured data. However, casual users must learn SQL and know the schema of the underlying data even to pose simple searches. For example, suppose we have a customer order database, shown in Figure 1, which uses the TPC-H [4] schema in Figure 2. We wish to search for Bob's purchases related to his aquarium fish hobby. To answer this query, we must know to join the Customer, Lineitem, Orders, and Part relations on the appropriate attributes, and we must know which attributes to constrain for "Bob" and "fish". The following SQL query is formulated:

SELECT c.custkey, o.orderkey, p.partkey
FROM   Customer c, Lineitem l, Orders o, Part p
WHERE  c.custkey = o.custkey AND l.partkey = p.partkey
  AND  o.orderkey = l.orderkey AND c.name LIKE '%Bob%'
  AND  p.name LIKE '%fish%'

In our example database instance, this query returns a single tuple ⟨custkey, orderkey, partkey⟩ of ⟨100, 10001, 10⟩. In contrast, keyword-based search, exemplified by web search engines, offers an intuitive interface for searching textual content that requires little knowledge about the underlying data. In this paradigm, for the above example a user should be able to enter the keywords "Bob fish" and find the relevant information from the results.

Enfish, LLC; IPR2014-00574
Exhibit 2232
Page 1 of 13

Figure 2. TPC-H Schema

Figure 3. Architecture

We have developed a system, EKSO (for Efficient Keyword Search through Offline Indexing), that supports highly efficient keyword-based search on relational databases. For the above example, on query "Bob fish" EKSO returns a result equivalent to the SQL query result. EKSO also supports much more complex searches over complex data. Our novel use of DB2's Net Search Extender (NSE) [7, 13] enabled us to build a powerful system that is extensible, and that can incorporate future text indexing and search features in NSE without modification. Our system is the first we know of for expressive keyword-based search over interconnected relational database content that operates within the database engine and performs full offline indexing for highly efficient searches.

EKSO is an instance of a general architecture we propose in this paper, depicted in Figure 3 and consisting of three main components:

• The Crawler traverses the entire database offline, constructing virtual documents from database content. The Crawler can construct textual virtual documents in any fashion from the database (typically based on interconnected tuples from multiple relations if the database is relational), as long as each virtual document is associated with some logical or physical database object that makes sense when returned to the user. The Crawler includes an Update Handler that computes changes to virtual documents when the underlying database is updated.

• Virtual documents are sent to the Indexer, which builds an inverted index over the virtual documents in the usual Information Retrieval (IR) style. When the Crawler identifies new or modified virtual documents as a result of database updates, the Indexer refreshes the inverted index incrementally.

• At search time, the Keyword Search Processor takes a keyword query and probes the inverted index (again in an IR style) to find the virtual documents satisfying the query. The database objects associated with these documents are returned to the user. The user may then choose to explore the database or refine the query.

The key features and advantages of our approach, and the contributions of this paper, are summarized as follows:

• Most previous approaches to keyword-based search in structured databases (reviewed in Section 2) perform a significant amount of database computation at search time. Our offline indexing approach does all significant work in advance so it can provide interactive answers. We show in our performance study that the potential obstacles to this approach (index construction and update handling) are manageable. Index construction performance depends primarily on DB2, but can easily be parallelized. Updates do not have a dramatic impact on query performance, although for databases with a very high update rate, one of the existing online approaches may be more appropriate.

• Our approach is extensible. We construct virtual documents through user-defined functions (UDFs) that in turn execute SQL queries. Currently our virtual documents are formed by concatenating text fields in interconnected relational tuples. However, without changing the infrastructure we could plug in a different UDF to construct virtual documents differently. For example, we could produce XML or some other textual format as virtual documents, possibly incorporating schema information. If the text search engine provides XML search capabilities such as XPath [3], our system automatically inherits those capabilities.

• We exploit existing relational database extensions for text indexing and search that is highly expressive and
extensible in its treatment of database content. By using DB2's Net Search Extender as the back end, we completely avoid reimplementing basic IR capabilities. Furthermore, our method for exploiting the text extension ensures that all features are automatically available in our search tool, including stemming, stopword elimination, case-folding, Boolean queries, proximity queries, relevance-ranked results, wildcards, fuzzy searches, and natural language searches [7].

• The premise of our work is that keyword-based search is more intuitive and convenient than SQL for many query tasks over relational databases. To verify this assumption, we conducted the only study we are aware of that quantifies the relative performance of SQL queries versus keyword searches on a variety of query tasks.

1.1. Outline of Paper

We provide a thorough survey of related work in Section 2. Section 3 is the core of the paper. It describes the EKSO system, including functionality, algorithms, and system implementation details. Section 4 presents our performance study of the EKSO system, while Section 5 presents our user study demonstrating the advantages of keyword-based searches over SQL queries. We conclude and suggest future directions in Section 6.

2. Related Work

An early paper [12] proposes implementing abstract data types, user-defined operators, and specialized indexes for textual databases, a precursor to current relational database vendors' approaches to integrating full-text search functionality. IBM DB2 [13], Microsoft SQL Server [9], Oracle [6], PostgreSQL, and MySQL all provide text search engine extensions that are tightly coupled with the database engine. However, in all cases each text index is defined over a single column. Using this feature alone to do meaningful keyword search over an interconnected database would require merging the results from many column text indexes. If our interconnected tuples contain N text attributes, and our keyword query contains M conjunctive terms, then in addition to joining relations to generate the interconnected tuples (as we do offline; see Section 3.1), we must invoke the text search engine up to N × M times to match the query keywords to the text columns.

Three systems share the concept of crawling databases to build external indexes. Verity [15] crawls the content of relational databases and builds an external text index for keyword searches, as well as external auxiliary indexes to enable parametric searches. This approach is similar to ours in terms of the index-all-data-offline paradigm. However, by leveraging the database vendor's integrated text search engine tool, we avoid transporting data outside the database for external processing, which can be expensive and difficult to synchronize, and we inherit all DBMS features automatically.

DataSpot [5] extracts database content and builds an external, graph-based representation called a hyperbase to support keyword search. Graph nodes represent data objects such as relations, tuples, and attribute values. Query answers are connected subgraphs of the hyperbase whose nodes contain all of the query keywords. Similar to our concept of a root tuple (Section 3.1.1), each answer is represented by a node associated with a tuple in the database, from which the user can navigate to related tuples. As with Verity, DataSpot requires transporting content to be indexed outside the database.

DbSurfer [18] indexes the textual content of each relational tuple as a virtual web page. For querying and navigation, the database foreign-key constraints between tuples are treated as hyperlinks between virtual web pages. Given a keyword query, the system computes a ranked set of virtual web pages that match at least one keyword. Then a best-first expansion algorithm finds a ranked set of navigation paths originating from the starting web pages. The navigation path concept is similar to our definition of a text object (Section 3.1.2). However, even though the web pages are indexed offline, significant query-time computation is required to expand the starting web page set to find navigation paths satisfying the full query, work that we perform offline.

Three systems, DBXplorer, BANKS, and DISCOVER, share a similar approach: At query time, given a set of keywords, first find tuples in each relation that contain at least one of the keywords, usually using auxiliary indexes. Then use graph-based approaches to find tuples among those from the previous step that can be joined together, such that the joined tuple contains all keywords in the query. All three systems use foreign-key relationships as edges in the graph, and point out that their approach could be extended to more general join conditions. We discuss the three systems in more detail.

Given a set of keywords, DBXplorer [1] first finds the set of tables that contain at least one of the keywords. Then, using the undirected schema graph (where each node is a relation and each edge is a foreign-key relationship), a set of subgraphs is enumerated, such that each subgraph represents a joining of relations whose result contains rows that potentially contain all of the query keywords. For each candidate join subgraph generated, a SQL statement is executed to join the specified relations and select rows that do contain all of the keywords. The system is evaluated on TPC-H (100MB to 500MB) and three internal Microsoft
databases (130MB to 375MB, from 19 tables to 84 tables). The paper focuses on matching keywords to an entire attribute value. It briefly discusses extending the system to work with subattribute-granularity matches and to provide more sophisticated search options.

Figure 4. EKSO System

BANKS [2] uses a data graph where each tuple is a node and each foreign-key relationship between tuples is represented as a bidirectional edge. The keyword query result is a connected subgraph of tuples, where each keyword is found in at least one tuple node. Dijkstra's algorithm is used to find the shortest path such that the nodes covered collectively contain all query keywords. The system can rank the results based on the sum of the node relevance scores and edge weights.

DISCOVER [11] uses schema graph information to enumerate minimum joining networks, which are sequences of tuples that can be joined together (based on foreign-key relationships) into a single tuple that contains all of the keywords. The algorithm models a tuple-set graph. Execution plans for the set of joining networks are generated and optimized to share intermediate results, then translated into SQL statements to build the join tuples that contain all of the query keywords. The performance evaluation focuses primarily on two aspects: efficiency of pruning irrelevant joining networks, and comparison of several algorithms for deciding which intermediate results to build and share. Execution time increases significantly with the number of candidate joining networks or the number of keywords in the query. [10] extends DISCOVER to perform relevance-ranked and top-k keyword queries.

An alternate approach, taken in [14, 16], translates keyword searches to SQL based on the schema, but does not focus on efficient implementation techniques. [8] describes a system for keyword and proximity search in graph-structured databases based on shortest-path computations, focusing on a combination of offline and online computation.

While a direct empirical comparison between our system and some of the other approaches mentioned in this section would be very interesting, the systems are not publicly available, and any effort to implement them well enough for a fair comparison would be prohibitive.

3. The EKSO System

The system we have developed, EKSO, is an instantiation of the general architecture we propose for keyword-based search over structured databases that was shown in Figure 3. Figure 4 shows details of the EKSO system, which will be described as this section proceeds. EKSO is fully implemented and functional; performance evaluation is presented in Section 4. In Section 3.1 we define EKSO's virtual documents and discuss in detail their computation from the underlying relational database. Sections 3.2, 3.3, and 3.4 discuss operation of the Crawler, Indexer, and Keyword Search Processor, respectively. Section 3.5 describes how the system handles database updates, determining and recomputing the affected virtual documents. Section 3.6 contrasts our system with an alternative approach based on materialized views. Finally, Section 3.7 describes the user interface to the system.

input  : Schema graph G where nodes are relations and edges are
         foreign-key constraints, and a set of root relations RS
output : Set of text objects T

T ← {};
foreach root relation R in RS do
    foreach tuple r in R do
        text object t ← {r};
        start at R, do breadth-first traversal on G;
        foreach relation R' traversed do
            add to t the set of R' tuples that foreign-key join
            with existing tuples in t;
        add t to T;
return T;

Figure 5: Text Object Construction (TOC) algorithm

Example: Consider the text object construction process for one root tuple P1 from relation Part in the database instance in Figure 1. Starting with tuple P1, in the first step of the traversal we select Partsupp tuple PS1 and Lineitem tuple L1, which both join with P1 on the partkey attribute. Next we select Supplier tuple S1, which joins with PS1 on suppkey, and Orders tuple O1, which joins with L1 on orderkey. Then we select Customer tuple C1, which joins with O1 on custkey. The text object for P1 consists of the set of tuples {P1, PS1, L1, S1, O1, C1}.

Figure 6: Text Object Construction example

3.1. Text Objects and Virtual Documents

Recall from Section 1 that our goal is to perform a full crawl of the database, generating virtual documents from database content that define the granularity of text search and retrieval in the system. In EKSO we separate two concepts:

• Text objects are generated from interconnected tuples based on joining relational tables.

• Virtual documents are generated from text objects by concatenating their string attributes.

This separation has at least two advantages:

1. We can use SQL queries, which are optimized automatically by the DBMS, to generate text objects.

2. We can independently plug in different functionality for generating text objects from the database, or for generating virtual documents from text objects.

In EKSO we define a text object to be the set of tuples in the database that are related to a given root tuple. When a virtual document D is part of the answer to a keyword query, the system returns an identifier for the root tuple associated with the text object from which D was generated.

3.1.1 Root Relations and Tuples

Root tuples are defined as all of the tuples in root relations. Root relations can be designated automatically or set by the database administrator. Designating a relation R as a root relation creates text objects and virtual documents associated with each tuple of R. In EKSO, as a default, we select as root relations all "entity relations", i.e., all relations whose primary key is not composed entirely of foreign keys.

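The default root-relation selection can be sketched as a predicate over schema metadata. The following is an illustrative sketch, not EKSO's implementation; the schema dictionary and its attribute lists are hypothetical stand-ins for real catalog metadata.

```python
# Hedged sketch of EKSO's default: keep the "entity relations", i.e.,
# relations whose primary key is NOT composed entirely of foreign-key
# attributes. `schema` maps each relation name to a pair
# (primary-key attributes, foreign-key attributes).
def default_root_relations(schema):
    return sorted(
        name for name, (pk, fks) in schema.items()
        if not set(pk) <= set(fks)   # some PK attribute is not a foreign key
    )

# Hypothetical fragment of the TPC-H schema for illustration.
tpch_fragment = {
    "CUSTOMER": (["custkey"], ["nationkey"]),
    "ORDERS":   (["orderkey"], ["custkey"]),
    "PART":     (["partkey"], []),
    "SUPPLIER": (["suppkey"], ["nationkey"]),
    "PARTSUPP": (["partkey", "suppkey"], ["partkey", "suppkey"]),
}
```

On this fragment the predicate keeps CUSTOMER, ORDERS, PART, and SUPPLIER and rejects PARTSUPP, whose primary key consists entirely of foreign keys.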
`
3.1.2 Text Objects

We define the text object for a particular root tuple r as the set of all tuples in the database that join with r through a sequence of zero or more foreign-key joins. We do not backtrack, i.e., the sequence of foreign-key joins visits any relation at most once. See the example in Figure 6.

Figure 5 outlines the text object construction (TOC) algorithm. To construct the text object for a root tuple r of root relation R, we start with r. Logically we perform a breadth-first traversal of the schema graph. For all neighbor relations of R, we select those tuples that join with r, based on foreign-key constraints. Next we join those tuples with relations two away from the root relation R, and so on, until we have traversed all relations reachable from R in the schema graph. In the schema graph traversal, we treat foreign-key constraints as bidirectional. We mark each relation during the breadth-first traversal to avoid backtracking and cycles in the schema graph.

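The traversal just described can be sketched in a few lines over a toy in-memory data model. This is an illustrative sketch of the TOC idea from Figure 5, not EKSO's SQL-based implementation; the graph, tuple, and join representations below are all assumptions made for the example.

```python
from collections import deque

# Toy stand-in for the database of Figure 1: EDGES is the bidirectional
# schema graph, MEMBERS lists the tuples of each relation, and LINKS
# records (symmetrically) which tuples foreign-key join with which.
EDGES = {"Part": ["Partsupp", "Lineitem"], "Partsupp": ["Part", "Supplier"],
         "Lineitem": ["Part", "Orders"], "Supplier": ["Partsupp"],
         "Orders": ["Lineitem", "Customer"], "Customer": ["Orders"]}
MEMBERS = {"Part": ["P1"], "Partsupp": ["PS1"], "Lineitem": ["L1"],
           "Supplier": ["S1"], "Orders": ["O1"], "Customer": ["C1"]}
LINKS = {"P1": {"PS1", "L1"}, "PS1": {"P1", "S1"}, "S1": {"PS1"},
         "L1": {"P1", "O1"}, "O1": {"L1", "C1"}, "C1": {"O1"}}

def joining_tuples(rel, obj):
    """Tuples of `rel` that foreign-key join with a tuple already in obj."""
    return {t for t in MEMBERS[rel] if LINKS.get(t, set()) & obj}

def text_object(root_tuple, root_rel):
    obj = {root_tuple}
    visited = {root_rel}            # mark relations: no backtracking, no cycles
    frontier = deque([root_rel])
    while frontier:                 # breadth-first traversal of the schema graph
        rel = frontier.popleft()
        for nbr in EDGES[rel]:
            if nbr not in visited:
                visited.add(nbr)
                obj |= joining_tuples(nbr, obj)
                frontier.append(nbr)
    return obj
```

On this toy instance, `text_object("P1", "Part")` collects {P1, PS1, L1, S1, O1, C1}, mirroring the Figure 6 example.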
The algorithm as described is not guaranteed to cover the entire database when creating text objects, unless certain conditions hold. We must have a set of root relations such that all relations in the database are reachable from some root relation via the schema graph. Otherwise we may have "dangling relations", in which case we could use a second pass to create additional text objects from the dangling content. Note that because we follow foreign-key edges, there will be no dangling tuples within connected relations.

Alternate definitions could be used for text objects from a given root tuple. For example, we could follow equijoins as well as foreign-key joins (with a second pass to handle dangling tuples), and we could backtrack. In the Figure 6 example, the text object for Part tuple P1 includes the suppliers who supply P1, as well as customers who have ordered P1. We could backtrack to include other parts supplied by those suppliers who supply P1, as well as other parts ordered by those customers who ordered P1.

3.1.3 Virtual Documents

After a text object has been computed, we concatenate all textual attributes of all tuples in the text object to form the virtual document. The order of the tuples selected during traversal is nondeterministic due to SQL semantics, and in
the current system we are not concerned about the concatenation order. However, if we extend the system to richer virtual documents, as we will discuss next, we may need to be careful about concatenation order. We can also include nontextual fields, such as numbers and dates, by simply using their string representation. As an example, we concatenate the textual attributes of the six tuples that comprise the text object for root tuple P1 from the Figure 6 example. For one possible ordering the virtual document would look like: "Discus fish Acme Wisconsin Bob Madison".

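The concatenation step can be sketched as follows. This is an illustrative stand-in for the UDF described in Section 3.2, assuming tuples are modeled as attribute-name to value dictionaries; the attribute names are hypothetical.

```python
# Sketch of virtual-document generation from a text object: concatenate
# the string forms of the attribute values of every tuple in the object.
def virtual_document(text_object_tuples):
    words = []
    for t in text_object_tuples:
        for value in t.values():
            words.append(str(value))   # nontextual fields via their string form
    return " ".join(words)

# Tuples (in one possible traversal order) from the Figure 6 text object
# that carry textual attributes; names and values follow the example.
example = [
    {"name": "Discus fish"},                   # Part P1
    {"name": "Acme", "address": "Wisconsin"},  # Supplier S1
    {"name": "Bob", "address": "Madison"},     # Customer C1
]
```

For the ordering above this yields the virtual document "Discus fish Acme Wisconsin Bob Madison"; a different traversal order would permute the words, which the current system tolerates.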
Note that for now we have chosen one particular definition of text objects and virtual documents. We can easily plug in different definitions. For example, we might generate richer structures in virtual documents. If the Indexer and the Keyword Search Processor supported XML content, we could generate an XML fragment that captures database structure and uses tags to encode attribute and relation names. Such virtual documents enable XPath [3] or other XML-based queries. The XML fragment for our example might look like:

<PART>
  <NAME>Discus fish</NAME>
  <SUPPLIER>
    <NAME>Acme</NAME>
    <ADDRESS>Wisconsin</ADDRESS>
  </SUPPLIER>
  <CUSTOMER>
    <NAME>Bob</NAME>
    <ADDRESS>Madison</ADDRESS>
  </CUSTOMER>
</PART>

3.2. Crawler

The Crawler is responsible for traversing a structured database to produce virtual documents to be indexed. In Figure 4, the three relations on the left-hand side labeled R1, R2, and R3 correspond to the Structured Database in Figure 3. The Crawler in Figure 3 maps to the bulk of the system diagram in Figure 4, enclosed in the dashed box. It centers on a user-defined function (UDF) implemented in DB2 that takes as input an identifier for a root tuple, invokes a Text Object Construction (TOC) query, and constructs and returns a virtual document. The TOC queries are composed in advance following the TOC algorithm from Figure 5. There is one such query for each root relation. With knowledge of the schema, SQL WITH queries parameterized by each root relation's primary key values are used. See Appendix A for an example TOC query.

Before indexing, a special table called the Text Object Key Relation (TOKR) is populated with keys that individually identify all of the root relations and the tuples in those relations. At indexing time, the Crawler UDF decodes each key to find the root relation that the tuple belongs to, then looks up the corresponding TOC query, parameterizes it with the root tuple identifier values, and executes it. The query result is the text object, whose textual attributes are concatenated into a Binary Large Object (BLOB). At this point, the UDF discards the text object and returns the BLOB as the virtual document.

3.3. Indexer

The Indexer is the box labeled NSE in the right-hand portion of Figure 4. It is managed by the DB2 Net Search Extender (NSE). NSE is a full-featured text search engine integrated with the IBM DB2 DBMS product. NSE supports the creation of text indexes on individual columns, and can index the output of applying a user-defined function to a column, which is the feature we exploit. We create an NSE-managed IR-style inverted index over the virtual documents returned by the Crawler UDF described in the previous section. NSE indexes all of the virtual documents, associating each one with its corresponding text object key.

3.4. Keyword Search Processor

In Figure 4, the Keyword Search Processor from Figure 3 is the box labeled "Translate Keywords to NSE Query". A user's keyword-based query is translated to an NSE invocation in the form of a SQL query invoking the special CONTAINS function. For example, after indexing our database instance from Figure 1, the conjunctive keyword query "Bob fish" is translated into the following SQL query:

SELECT key FROM Tokr
WHERE  CONTAINS(key, '"Bob" & "fish"') = 1

The CONTAINS function automatically invokes the NSE search engine, which probes its inverted index and returns text object keys as query results, which are then returned to the user. The user may choose to explore the database from these keys, or further refine the search. We describe the user interface that enables these activities in Section 3.7. Currently user exploration begins from scratch. To speed up user traversal we could consider caching traversal paths from text object construction.

By building on commercial DBMS text search features, we automatically inherit all of their capabilities, current as well as future. For example, EKSO automatically delivers ranked search results based on NSE's built-in ranking functions.

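The translation step above is a simple string transformation. The sketch below illustrates it for conjunctive queries, following the paper's example table and column names; the quoting is simplified for illustration and is not production-grade escaping.

```python
# Hedged sketch of the Keyword Search Processor's translation step: a
# conjunctive keyword query becomes a SQL probe of the TOKR via NSE's
# CONTAINS function. Keywords are AND-ed with NSE's '&' connective.
def keywords_to_sql(keywords):
    # Strip embedded double quotes (naive sanitization for the sketch),
    # then quote each keyword and join with the conjunction operator.
    terms = " & ".join('"{}"'.format(k.replace('"', "")) for k in keywords)
    return "SELECT key FROM Tokr WHERE CONTAINS(key, '{}') = 1".format(terms)
```

For instance, `keywords_to_sql(["Bob", "fish"])` produces the CONTAINS query shown above.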
`
3.5. Update Handler

The Update Handler that is part of the Crawler component in Figure 3 corresponds to the box labeled "TOU Triggers" at the top right of the system diagram in Figure 4. When one or more tuples in the underlying database are

input  : Schema graph G where nodes are relations and edges are
         foreign-key constraints, a set of root relations RS, and an
         inserted, deleted, or updated tuple r from relation R
output : Root tuples RT whose text objects and virtual
         documents need to be recomputed

RT ← {};
RT' ← {r.old, r.new};
start at R, do breadth-first traversal on G;
foreach relation R' traversed do
    D ← set of tuples from R' that foreign-key join with existing
        tuples in RT';
    add D to RT';
    if R' in RS then
        add D to RT;
return RT;

Figure 7: Text Object Update (TOU) algorithm

             Root tuple r                      Non-root tuple r
r inserted   Index TOC(r), Modify TOU(r)      Modify TOU(r)
r deleted    Delete index entries for r,      Modify TOU(r)
             Modify TOU(r)
r updated    Delete index entries for r.old,  Modify TOU(r)
             Index TOC(r.new), Modify TOU(r)

Table 1. Text Object Update actions

modified (inserted, deleted, or updated), some text objects and virtual documents may need to be created, deleted, or updated. For a given modified tuple r, the first step is to compute the delta set of root tuples whose text objects and virtual documents must be recomputed. We call this computation Text Object Update (TOU) and we denote the delta set of root tuples as TOU(r). Given a modified tuple r, TOU(r) is the union of the following two sets:

• The set of existing root tuples whose text object includes the old version of r

• The set of existing root tuples whose text object should now include the new version of r

If tuple r is inserted, the old-version set is empty. If tuple r is deleted, the new-version set is empty. Once the delta set TOU(r) is computed, for every root tuple r' in TOU(r), we repeat the crawling and indexing procedure described earlier: The Crawler computes the text object and virtual document of r', then the Indexer removes any existing index entries of r' and indexes the recomputed virtual document.

In Figure 7 we present the TOU algorithm to compute the delta set of root tuples whose text objects and virtual documents must be recomputed as a result of modifications to an arbitrary tuple. Note the TOU algorithm is a dual of the TOC algorithm, since the TOU algorithm must find affected root tuples based on the text object semantics. Hence if the text object definition were to change (e.g., to permit backtracking), then both the TOC and the TOU algorithms must change accordingly. Like the TOC algorithm, the TOU algorithm can be implemented as a SQL WITH query composed offline using schema information. There is one TOU query per relation in the database.

Once the delta set of root tuples is identified, our update actions depend on whether the modified tuple r is itself a root tuple, and whether the original modification to r was an insert, delete, or update. Table 1 enumerates all the cases. r.old denotes the old version of a modified tuple r, i.e., a tuple being deleted, or a tuple before it is updated. r.new denotes the new version of a modified tuple r, i.e., a tuple being inserted, or a tuple after it has been updated. "Modify TOU(r)" denotes asking the Crawler to recompute the text objects and virtual documents of each root tuple in the delta set TOU(r), and asking the Indexer to update its index with the recomputed virtual documents. When we insert a new root tuple r we must also ask the Crawler to execute the TOC algorithm for r and ask the Indexer to index the virtual document generated by the Crawler, which we denote by "Index TOC(r)". When we delete a root tuple, we must have the Indexer remove the index entries associated with the deleted root tuple.

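The case analysis of Table 1 can be encoded as a small dispatch function. The action strings below are shorthand labels for the operations named in the text, chosen for this illustration.

```python
# Illustrative encoding of Table 1: given whether the modified tuple is
# a root tuple and the kind of modification, return the (symbolic)
# actions to take, in the order Table 1 lists them.
def update_actions(is_root, op):
    assert op in ("insert", "delete", "update")
    actions = []
    if is_root and op in ("delete", "update"):
        actions.append("delete_index_entries")   # for r (delete) / r.old (update)
    if is_root and op in ("insert", "update"):
        actions.append("index_TOC")              # index TOC(r) / TOC(r.new)
    actions.append("modify_TOU")                 # always recompute TOU(r)'s docs
    return actions
```

Non-root modifications always reduce to the single "Modify TOU(r)" action, while root-tuple updates combine all three.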
The Update Handler is implemented using triggers. We create three TOU triggers (insert, delete, and update) on each database relation. See Appendix B for an example TOU insert trigger. The main trigger action for all three triggers is an insert statement that executes the TOU query to find the set of root tuples whose virtual documents are affected by changes to a given tuple, and then inserts into the NSE update log table requests to update the NSE index entries for these root tuples. At the next NSE index update (either a manual NSE index refresh request or an automatic periodic index refresh), NSE automatically invokes the Crawler UDF to reconstruct the requested virtual documents, which are used to rebuild the inverted index.

In the case of the insert trigger on a root relation, a second trigger action inserts a corresponding tuple into the TOKR, which automatically creates an NSE update request that a new virtual document be generated and indexed. For the delete trigger on a root relation, a second trigger action deletes the corresponding TOKR tuple, thus generating an NSE update request that the appropriate entries in the NSE index be removed.

3.6. Materialized View Approach

Our Crawler constructs text objects one at a time, generates a virtual document, and discards the text object. The Indexer consumes the virtual document to build the inverted index, and then discards the virtual document. An alternative approach would be to create a materialized view

Relation    Root Reln   #Tuples     Bulkload File Size
CUSTOMER    Yes         75,000      12MB
LINEITEM    No          2,999,671   360MB
NATION      No          25          3KB
ORDERS      Yes         750,000     82MB
PART        Yes         100,000     11MB
PARTSUPP    No          400,000     56MB
REGION      No          5           1KB
SUPPLIER    Yes         15,000      685KB

Table 2. TPCH500 dataset characteristics

Relation   Data Source   Root Reln   #Tuples     Bulkload File Size
ITEMS      Ebay          Yes         500,000     795MB
USERS      TPC dbgen     Yes         26,913      4MB
FEEDBACK   TPC dbgen     No          500,000     76MB
BIDS       TPC dbgen     No          1,000,000   54MB

Table 3. Auction dataset characteristics

4.1. Datasets

We evaluate our text indexing and search system on two datasets: a 500MB TPC-H database and a 929MB Auction database. The TPCH500 dataset follows the complete TPC-H schema in Figure 2 and is generated using the official TPC-H data generator (DBGEN) [4]. Schema information, including relation names, attribute names, and join conditions, is provided to our TOC Query construction application. We designate Customer, Orders, Part, and Supplier as the root relations. The system default would also include Region and Nation, but they are of too coarse a granularity to be useful to users. The four parameterized TOC Queries for the four root relations constitute our workload at indexing time, since each UDF invocation executes one of these queries and concatenates the resulting tuples. We optimized the indexing process by submitting this four-query workload to the DB2 Design Advisor, which suggested indexes to build on the database tables. Table 2 summarizes the TPCH500 dataset.
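As a concrete illustration of what one parameterized TOC Query does, the sketch below simulates it on a toy schema with Python's sqlite3 (the table and column names are simplified stand-ins, not the actual EKSO schema or code): for a given root Customer tuple, it joins out to the related tuples and concatenates their text fields into a single virtual document.

```python
import sqlite3

# Toy stand-ins for a few TPC-H relations (simplified columns).
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customer (custkey INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders   (orderkey INTEGER PRIMARY KEY,
                           custkey INTEGER, comment TEXT);
    CREATE TABLE lineitem (orderkey INTEGER, partkey INTEGER);
    CREATE TABLE part     (partkey INTEGER PRIMARY KEY, name TEXT);

    INSERT INTO customer VALUES (1, 'Bob');
    INSERT INTO orders   VALUES (10, 1, 'rush order');
    INSERT INTO lineitem VALUES (10, 100);
    INSERT INTO part     VALUES (100, 'aquarium fish food');
""")

def virtual_document(custkey):
    # Parameterized TOC Query for the Customer root relation: join the
    # root tuple to its related tuples, then concatenate the text
    # attributes of every joined row into one virtual document string.
    rows = con.execute("""
        SELECT c.name, o.comment, p.name
        FROM customer c
        JOIN orders   o ON o.custkey  = c.custkey
        JOIN lineitem l ON l.orderkey = o.orderkey
        JOIN part     p ON p.partkey  = l.partkey
        WHERE c.custkey = ?
    """, (custkey,)).fetchall()
    return " ".join(field for row in rows for field in row)
```

The text indexer then indexes the resulting string as one document, so a keyword search for "aquarium" retrieves the root tuple for customer Bob, even though "aquarium" appears only in a joined Part tuple.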
The Auction dataset consists of a mixture of primarily real-world data and some synthetic data. It models a database for an online auction service, and the schema is shown in Figure 8. Whereas the TPC-H schema has many relatively short textual fields and is fairly complex with many join relationships, the Auction dataset contains a few long textual fields and is very simple in terms of join relationships. The Items table is populated with 500,000 Ebay auction items crawled in 2001, provided courtesy of the University of Wisconsin database group. The Users table is populated from a subset of the Supplier table in our

Figure 8. Auction dataset schema

that persists all the virtual documents, and then maintain a text index over this view. Since our approach handles data updates, the only possible advantage we see to the view approach is exploiting existing infrastructure
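The streaming pipeline that the materialized-view alternative of Section 3.6 would replace can be sketched as follows (a simplified simulation; the real Crawler and Indexer are DB2 components, and `crawl` and `build_inverted_index` are illustrative names):

```python
# Simplified sketch of the streaming Crawler/Indexer pipeline from
# Section 3.6: each virtual document is generated, fed to the indexer,
# and then discarded. Only the inverted index and root-tuple keys are
# retained, unlike a materialized-view approach that would persist
# every virtual document.

def crawl(root_keys, make_virtual_document):
    # Generator: produce one (key, document) pair at a time so that no
    # more than one virtual document is ever held in memory.
    for key in root_keys:
        yield key, make_virtual_document(key)

def build_inverted_index(docs):
    index = {}                       # term -> set of root-tuple keys
    for key, text in docs:           # each doc is discarded after use
        for term in text.lower().split():
            index.setdefault(term, set()).add(key)
    return index
```

A keyword search then reduces to a lookup: `index["aquarium"]` returns the keys of the root tuples whose virtual documents contain that term.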

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket