`and Parallel SQL Based Execution
`
`Masaru Kitsuregawa, Takahiko Shintani,
`Takeshi Yoshizawa(cid:2), and Iko Pramudiono
`
`Institute of Industrial Science, The University of Tokyo
`7-22-1 Roppongi, Minato-ku, Tokyo 106, Japan
`{kitsure,shintani,yoshi,iko}@tkl.iis.u-tokyo.ac.jp
`
`Abstract. We performed association rule mining and sequence pattern
`mining against the access log which was accumulated at NTT Software
`Mobile Info Search portal site. Detail web log mining process and the
`rules we derived are reported in this paper. The integration of web data
`and relational database enables better management of web data. Some
`researches have even tried to implement applications such as web mining
`with SQL. Commercial RDBMSs support parallel execution of SQL. Par-
`allelism is key to improve the performance. We showed that commercial
`RDBMS can achieve substantial speed up for web mining.
`
`1 Introduction
`
`The analysis of web log to understand the characteristics of web users has been
`one of the major topics in web mining. The goal is to provide personalized
`information retrieval mechanism that match the need of each individual user.
`”One to one marketing” on the web also has similar objectives. The development
`of the web personalization technologies will certainly benefit e-commerce too.
`In this paper, we focused on the mining access log using association rule
`discovery techniques. We show some mining results from web log of Mobile Info
`Search(MIS), a location-aware search engine [18]. Usage mining of this unique
`site could give some interesting insights into the behavior of mobile device users
`which are the targets of this site.
`Here we report some of the web mining techniques based on association rule
`that can be accomplished by some modified SQL queries on relational database.
`The integration of web with database techniques has drawn attention from re-
`searchers. Some have proposed query languages for the web that is similar with
`SQL such as Squeal[9] and WebSQL[5]. They emphasize better organization of
`web data managed in relation database way. We extend this concept for real
`applications of web mining. We also address the performance problem by paral-
`leling the execution of SQL queries.
`Although the amount of log at MIS is not so large, generally at large portal
`site it tends to be very large. The log can reach several tens of GB per day. Just
`(cid:2) IBM Japan Co.,Ltd. 1-1, Nakase, Mihama-ku, Chiba-shi, Chiba 261-8522, Japan
`
`S. Bhalla (Ed.): DNIS 2000, LNCS 1966, pp. 20–32, 2000.
`c(cid:2) Springer-Verlag Berlin Heidelberg 2000
`
`AMERICAN EXPRESS v. METASEARCH
`CBM2014-00001 EXHIBIT 2010-1
`
`
`
`Web Log Mining and Parallel SQL Based Execution
`
`21
`
`one day log is not enough for mining. If we are going to use several weeks log,
`then we have to handle more than one terabyte of data. Single PC server cannot
`process such huge amount of data with reasonable amount of time.
`On the other hand recently most major commercial database systems have in-
`cluded capabilities to support parallelization although no report available about
`how the parallelization affects the performance of complex query required by
`association rule mining. This fact motivated us to examine how efficiently SQL
`based association rule mining can be parallelized and speeded up using commer-
`cial parallel database system (IBM DB2 UDB EEE). We propose two techniques
`to enhance association rule mining query based on SETM [10]. And we have also
`compared the performance with commercial mining tool (IBM Intelligent Miner).
`Our performance evaluation shows that we can achieve comparable performance
`with commercial mining tool using only 4 nodes. Some considerable works on
`effective SQL queries to mine association rule such didn’t examine the effect of
`parallelization [15][16]. Some of authors have reported a performance evaluation
`on PC cluster as parallel platform [13][14]. Comparison with natively coded pro-
`grams is also reported. However we use currently available commercial products
`for the evaluation here.
`
`2 Web Usage Mining for Portal Site
`
`2.1 Web Access Log Mining
`
`Access log of a web site records every user requests sent to the web server. From
`the access log we can know which pages were visited by the user, what kind of
`CGI request he submitted, when was the access and it also tells to some extent
`where the user come from. Using those information, we can modify the web site to
`satisfy the need of users better by providing better site map, change layout of the
`pages and the placement of links etc. [12] has proposed the concept of adaptive
`web site that dynamically optimize the structure of web pages based on the
`users access pattern. Some data mining techniques has been applicated on web
`logs to predict future user behavior and to derive marketing intelligence[21][7][8].
`Currently many e-commerce applications also provides limited personalization
`based on access log analysis. Some pioneers such as Amazon.com have achieved
`considerable success.
`Here we will show the mining process of real web site. We have collaborated
`with NTT Software to analyze the usage of a unique search engine called Mobile
`Info Search.
`
`2.2 Mobile Info Search(MIS)
`
`Mobile Info Search (MIS) is a research project conducted by NTT Software
`Laboratory whose goal to provide location aware information from the inter-
`net by collecting, structuring, organizing, and filtering in a practicable form[17].
`MIS employs a mediator architecture. Between users and information sources,
`
`AMERICAN EXPRESS v. METASEARCH
`CBM2014-00001 EXHIBIT 2010-2
`
`
`
`22
`
`M. Kitsuregawa et al.
`
`MIS mediates database-type resources such as online maps, internet “yellow-
`pages” etc. using Location-Oriented Meta Search and static files using Location
`Oriented Robot-based Search. Users input their location using address, nearest
`station, latitude-longitude or postal number. If the user has a Personal Handy
`Phone(PHS) or Geo Positioning System(GPS) unit, the user location is auto-
`matically obtained.
`The site is available to the public since 1997. Its URL is http://www.kokono.net.
`In average 500 searches are performed on the site daily. A snapshot of this site
`is shown in Figure 1. 1
`
`Mobile Info Search 2 Ver.2.00
`
`Location Information(2000/09/22 16:23:08)
`
`Tokyo, Chuo-ku, Ginza, 4-chome ZIP 104-0061
`(NL 35.40.4.7 EL 139.46.5.7)
`Nearest station : Ginza, Higashi-Ginza
`Kokono (nearby area) Search
`Shops Information Internet-Townpage
`Keywords [ ! type ]
`Maps
`Train Route
`Train Timetable
`Hotels
`Newspapers
`Weather Report
`TV Guide
`
`Fig. 1. Index page of Mobile Info Search
`
`MIS has two main functionalities :
`
`1. Location Oriented Meta Search
`Many local information on the web are database-type, that is the information
`is stored in backbone database. In contrast to static pages, they are accessed
`through CGI program of WWW server. MIS provides a simple interface for
`local information services which have various search interfaces. It converts
`the location information and picks the suitable wrapper for the requested
`service. Example of database-type resources provided are shown in table 1.
`2. Location-Oriented Robot-Based Search “kokono Search”
`kokono Search provides the spatial search that searches the document close to
`a location. “kokono” is a Japanese word means here. kokono Search also em-
`ploys “robot” to collects static documents from internet. While other search
`engines provide a keyword-based search, kokono Search do a location-based
`spatial search. It displays documents in the order of the distance between
`the location written in the document and the user’s location.
`1 The page is shown in Japanese at http://www.kokono.net
`
`AMERICAN EXPRESS v. METASEARCH
`CBM2014-00001 EXHIBIT 2010-3
`
`
`
`Web Log Mining and Parallel SQL Based Execution
`
`23
`
`Table 1. Database-type resources on the Internet
`
`Location information used for the search
`Service
`longitude-latitude
`Maps
`Yellow Pages
`address ( and categories ... etc)
`Train Time Tables station
`Weather Reports address or region
`Hotel Guides
`nearest station
`
`3 Mining MIS Access Log and Its Derived Rules
`
`3.1 Preprocessing
`
`We analyzed the users’ searches from the access log recorded on the server be-
`tween January and May 1999. There are 1035532 accesses on the log, but the
`log also consists image retrieval, searches without cookie and pages that do not
`have relation with search. Those logs were removed. Finally we had 25731 search
`logs to be mined.
`
`– Access Log Format
`Each search log consists CGI parameters such as location information (ad-
`dress, station, zip), location acquisition method (f rom), resource type (sub-
`mit), the name of resource (shop web, map web, rail web, station web,
`tv web), the condition of search (keyword, shop cond). We treat those pa-
`rameters the same way as items in transaction data of retail sales. In ad-
`dition, we generate some items explaining the time of access (access week,
`access hour).
`Example of a search log is shown in Figure 2.
`– Taxonomy of Location
`Since names of places follow some kind of hierarchy, such as “city is a part of
`prefecture” or “a town is a part of a city”, we introduce taxonomy between
`them. We do this by adding items on part of CGI parameter address. For
`example, if we have an entry in CGI parameters entry [address=Yamanashi-
`ken, Koufu-shi, Oo-satomachi], we can add 2 items as ancestors : [address=
`Yamanashi-ken, Koufu-shi] at city level and [address=Yamanashi-ken] at
`prefecture level. In Japanese, “ken” means prefecture and “shi” means city.
`– Transformation to Transaction Table
`Finally we have the access log is transformed into transaction table ready
`for association rule mining. Part of transaction table that corresponds to log
`entry in Figure 2 is shown in Table 2
`
`3.2 Association Rule Mining
`
`Agrawal et. al.[1] first suggested the problem of finding association rule from
`large database. An example of association rule mining is finding “if a customer
`
`AMERICAN EXPRESS v. METASEARCH
`CBM2014-00001 EXHIBIT 2010-4
`
`
`
`24
`
`M. Kitsuregawa et al.
`
`0000000003 - - [01/Jan/1999:00:30:46 0900] "GET /index.cgi?
`sel_st=0&NL=35.37.4.289&EL=138.33.45.315&address=Yamanashi-ken,
`Koufu-shi,Oosato-machi&station=Kokubo:Kaisumiyoshi:Minami-koufu:
`Jouei&zip=400-0053&from=address&shop_web=townpage&keyword=
`&shop_cond=blank&submit_map=Map&map_web=townpage&rail_web
`=s_tranavi&station_web=ekimae&tv_web=tvguideHTTP/1.1" 200 1389
`"http://www.kokono.net/mis2/mis2-header?date=1999/01/01.00:27:59
`&address=Yamanashi-ken,Koufu-shi,Oosato-machi&NL=35.37.4.289&EL
`=138.33.45.315&station=Kokubo:Kaisumiyoshi:Minami-koufu:Jouei&zip
`=400-0053&from=address&keyword=&shop_web=townpage&shop_cond=blank
`&map_web=townpage&station_web=&tv_web=tvguide""Mozilla/4.0
`(compatible; MSIE 4.01; Windows 98)$B!I(B"LastPoint=NL=35.37.4.289
`&EL=138.33.45.315&address=Yamanashi-ken,Koufu-shi,Oosato-machi&station
`=Kokubo:Kaisumiyoshi:Minami-koufu:Jouei&zip=400-0053; LastSelect
`=shop_web=townpage&shop_cond=blank&keyword=&map_web=townpage&rail_web=
`s_tranavi&station_web=ekimae&tv_web=tvguide;Apache=1; MIS=1" "-"
`
`Fig. 2. Example of an access log
`
`Table 2. Representation of access log in relational database
`
`Log ID User ID
`001
`003
`
`001
`001
`001
`
`001
`001
`001
`001
`
`003
`003
`003
`
`003
`003
`003
`003
`
`Relation LOG
`Item
`address=Yamanashi-ken
`,Koufu-shi,Oosato-machi
`address=Yamanashi-ken,Koufu-shi,
`address=Yamanashi-ken,
`station=Kokubo:
`Kaisumiyoshi:Minami-koufu:Jouei
`zip=400-0053
`from=address
`submit map=Map
`map web=townpage
`
`buys A and B then 90% of them buy also C” in transaction databases of large
`retail organizations. This 90% value is called confidence of the rule. Another
`important parameter is support of an itemset, such as {A,B,C}, which is de-
`fined as the percentage of the itemset contained in the entire transactions. For
`above example, confidence can also be measured as support({A,B,C}) divided
`by support({A,B}).
`We show some results in Table 3 and 4. Beside common parameters such as
`confidence and support, we also use user that indicate the percentage of users
`logs that contain the rule.
`Those rules can be used to improve the value of web site. We can identify from
`the rules some access patterns of users that access this web site. For example,
`from the first rule we know that though Akihabara is a well known place in Tokyo
`
`AMERICAN EXPRESS v. METASEARCH
`CBM2014-00001 EXHIBIT 2010-5
`
`
`
`Web Log Mining and Parallel SQL Based Execution
`
`25
`
`for electronic appliances/parts shopping, user that searches around Akihabara
`station will probably looks for restaurant. From this unexpected result, we can
`prefetch information of restaurant around Akihabara station to reduce access
`time, we can also provide links to this kind of user to make his navigation
`easier or offer proper advertisement banner. In addition, learning users behavior
`provides hint for business chance for example the first rule tell us the shortcoming
`of restaurants in Akihabara area.
`
`Table 3. Some results of MIS log mining regarding search condition
`
`Not so many good restaurants in Akihabara ?
`[keyword=][address=Tokyo,][station=Akihabara] ⇒ [shop cond=restaurant]
`In Hokkaido, people looks for gasoline stand at night from its address
`[access hour=20][address=Hokkaido,][from=address] [shop web=townpage]
`⇒ [shop cond=gasoline]
`People from Gifu-ken quite often searches for restaurants
`[address=Gifu-ken,][shop web=townpage] ⇒ [shop cond=restaurant]
`However people from Gifu-ken search for hotels on Saturday
`[access week=Sat][address=Gifu-ken,] [shop web=townpage] ⇒ [shop cond=hotel]
`People from Gifu-ken must search for hotel around stations
`[address=Gifu-ken,][shop web=townpage] [station=Kouyama] ⇒ [shop cond=hotel]
`
`Table 4. Some results of MIS log mining regarding time and location acquisition
`method
`
`Most frequent searches for restaurants around 16:00 if they start from address on Friday
`[access week=Fri][from=address][shop cond=restaurant] ⇒ [access hour=16]
`Most frequent searches for department store stand at 20:00 if start from address.
`[from=address][shop cond=department] ⇒ [access hour=20]
`Looking for gasoline stand on Sunday ?
`[from=address][shop cond=gasoline][shop web=townpage] ⇒ [access week=Sun]
`Search for hotels often from station if at Kanagawa-ken
`[address=Kanagawa-ken,][shop cond=hotel] ⇒ [from=station]
`People at Osaka start searching convenience stores from ZIP number !
`[address=Osaka,][shop cond=conveni] ⇒ [from=zip]
`People at Hokkaido always search convenience stores from address
`[address=Hokkaido,][shop cond=conveni] [shop web=townpage] ⇒ [from=address]
`
`3.3 Sequential Rule Mining
`
`The problem of mining sequential patterns in a large database of customer trans-
`actions was also introduced by Agrawal et. al.[3]. The transactions are ordered
`by the transaction time. A sequential pattern is an ordered list (sequence) of
`itemsets such as “5% of customers who buy both A and B also buy C in the
`next transaction”.
`
`AMERICAN EXPRESS v. METASEARCH
`CBM2014-00001 EXHIBIT 2010-6
`
`
`
`26
`
`M. Kitsuregawa et al.
`
`We show some sequential patterns that might be interesting in Table 5. Some
`patterns indicate the behavior of users that might be planning to do shop-
`ping. We can derive from second pattern that significant part of users check
`the weather forecast first, then they look for the shops in the yellow-pages ser-
`vice called “Townpage” then look again for additional information in the vicinity
`with kokono Search and finally they confirm the exact location in the map.
`
`Table 5. Some results of sequential pattern mining
`
`After finding a shop, check how to go there and the weather
`[submit shop=Shop Info] → [submit rail=Search Train]
`→ [submit newspaper=Newspaper]
`⇒ [submit weather=Weather Forecast]
`Or decide the plan after checking the weather first
`[submit weather=Weather Forecast]
`→ [submit shop=Shop Info] [shop web=townpage]
`→ [submit kokono=Kokono Search] ⇒ [submit map=Map]
`Looking for shops after closing time
`[submit shop=Shop Info] [access hour=22] [access week=Fri]
`⇒ [submit map=Map] [access hour=22] [access week=Fri]
`
`4 Mining Web Log Using RDBMS
`
`The ability to perform web mining using standard SQL queries is a next challenge
`for better integration of web and RDBMS. The integration is essential since
`better management of web data has become a necessity for large sites.
`The performance issue was a major problem for data mining with RDBMS.
`We will show that commercial RDBMS on parallel platforms could handle the
`task with sufficient performance.
`
`4.1 Association Rule Mining Based on SQL
`
`A common strategy to mine association rule is:
`
`1. Find all itemsets that have transaction support above minimum support,
`usually called large itemsets.
`2. Generate the desired rules using large itemsets.
`
`Since the first step consumes most of processing time, development of mining
`algorithm has been concentrated on this step.
`In our experiments we employed three type of SQL query. First of all the
`standard SQL query using SETM algorithm [10]. It is shown in Figure 3. Sec-
`ond is the enhanced SQL query using view materialization technique. Third is
`another enhanced SQL query using subquery technique.
`
`AMERICAN EXPRESS v. METASEARCH
`CBM2014-00001 EXHIBIT 2010-7
`
`
`
`Web Log Mining and Parallel SQL Based Execution
`
`27
`
`SQL query using SETM algorithm. Transaction data is transformed into
`the first normal form (transaction ID, item). In the first pass we simply gather
`the count of each item. Items that satisfy the minimum support are inserted
`into large itemsets table C 1 that takes form (item, item count). SETM employs
`temporary tables to reuse item combination in next pass. In first pass, transac-
`tions that match large itemsets are preserved in temporary table R 1. In other
`passes for example pass k, we first generate all lexicographically ordered candi-
`date itemsets of length k into another temporary table RTMP k by self-joining
`table R (k-1) that contains k-1 length transaction data. Then we generate the
`count for those itemsets, itemsets that meet minimum support are included into
`large itemset table C k. Finally transaction data R k of length k generated by
`matching items in candidate itemset table RTMP k with items in large itemsets.
`In order to avoid excessive I/O, we disable the log during the execution.
`
`= q.id
`p.id
`p.item 1 = q.item 1
`p.item 2 = q.item 2
`
`...
`
`p.item k-2 = q.item k-2
`p.item k-1 < q.item k-1;
`
`WHERE
`AND
`AND
`
`AND
`AND
`
`INSERT INTO C k
`item 1, item 2, ..., item k,
`SELECT
`COUNT(*)
`RTMP k
`FROM
`GROUP BY item 1, item 2, ..., item k
`HAVING COUNT(*) >= :min support;
`
`INSERT INTO R k
`p.id, p.item 1, p.item 2, ...,
`SELECT
`p.item k
`RTMP k p, C k c
`p.item 1 = c.item 1
`p.item 2 = c.item 2
`
`FROM
`WHERE
`AND
`
`...
`
`AND
`
`p.item k = c.item k;
`
`DROP TABLE R k-1;
`DROP TABLE RTMP k;
`
`CREATE TABLE LOG (id int, item int);
`
`– PASS 1
`
`CREATE TABLE C 1 (item 1 int, cnt int);
`CREATE TABLE R 1 (id int, item 1 int);
`
`INSERT INTO C 1
`item AS item 1, COUNT(*)
`SELECT
`FROM
`LOG
`GROUP BY item
`HAVING COUNT(*) >= :min support;
`
`INSERT INTO R 1
`p.id, p.item AS item 1
`SELECT
`LOG p, C 1 c
`FROM
`p.item = c.item 1;
`WHERE
`
`– PASS k
`CREATE TABLE RTMP k (id int, item 1 int,
`item 2 int, ... , item k int)
`(item 1 int,
`CREATE TABLE C k
`item 2 int, ... , item k int, cnt int)
`(id int, item 1 int,
`CREATE TABLE R k
`item 2 int, ... , item k int)
`
`INSERT INTO RTMP k
`p.id, p.item 1, p.item 2, ... ,
`SELECT
`p.item k-1, q.item k-1
`R k-1 p, R k-1 q
`
`FROM
`
`Fig. 3. SQL query to mine association rule
`
`Enhanced SETM query using view materialize technique. SETM has
`to materialize its temporary tables namely R k and RTMP k. Those temporary
`tables are only required in the next pass and they are not needed for generating
`the rules. In fact, those tables can be deleted after execution of its subsequent
`pass. Based on this observation we could avoid materialization cost of those
`temporary tables by replacing the table creation with view.
`
`Enhanced SETM query using subquery technique. We expect significant
`performance improvement with utilization of view, however view still requires
`time to access the system catalog and are holding locks to the system catalog
`table during creating views so we further use subquery instead of temporary
`
`AMERICAN EXPRESS v. METASEARCH
`CBM2014-00001 EXHIBIT 2010-8
`
`
`
`28
`
`M. Kitsuregawa et al.
`
`tables. Therefore we embed the generation of item combinations into the query
`to generate large itemsets.
`
`4.2 Parallel Execution Environment and Performance Evaluation
`
`In our experiment we employed commercial Parallel RDBMS: IBM DB2 UDB
`EEE version 6.1 on IBM UNIX Parallel Server System: IBM RS/6000 SP. 12
`nodes make this system and using shared nothing architecture. Each node has
`POWER2 77Mhz CPU, 4.4GB SCSI hard disk, 256MB RAM and connected
`by High Performance Switch HPS with 100MB/s network speed. We also used
`commercial data mining tool IBM Intelligent Miner on single node of RS/6000
`SP for performance comparison with the SQL based data mining.
`We show execution time of association rule mining with several minimum
`supports in Figure 4. The data used here is synthetic transaction data generated
`with program described in Apriori algorithm [2] to show that we can handle
`larger data with parallel RDBMS. The number of transactions here is 200000,
`average transaction length 10 and number of items 2000. Transaction data is
`partitioned uniformly by hashing algorithm corresponds to transaction ID among
`processing nodes’ local hard disks. We also show the result of commercial data
`mining program Intelligent Miner from IBM on single node for reference.
`Figure 4 shows the execution time on each degree of parallelization. On av-
`erage, we can derive that View and Subquery SQL is about 6 times faster than
`SETM SQL regardless of the number of nodes. The result is also compared with
`the execution time of Intelligent Miner on single processing node. It is true that
`Intelligent Miner on single node with transaction data stored in flat file is much
`faster than the SQL queries. However, the View and Subquery SQL are 50on
`single node if the transaction data have to be read from RDBMS. We exemplified
`that we can achieve comparable performance of Intelligent Miner on single node
`with flat file by activating only 4 nodes when we used View and Subquery SQL.
`The result gives evidence for the effectiveness of parallelization of SQL query to
`mine association rule.
`The speedup ratio is shown in Figure 5. This is also reasonably good, espe-
`cially View and Subquery SQL are not being saturated as the number of process-
`ing nodes increased. That means they can be parallelized well. The execution
`is 11 times faster with 12 nodes. In parallel environment, network potentially
`becomes botleneck which degrades the speed-up ratio. However our experiments
`suggest that association rule mining using variants of SETM is mostly CPU
`bound and network I/O is negligible.
`Here we give thorough comparison and analysis on the three variations of
`SQL query described before. The performance evaluation is done on 12 nodes.
`The mining is two passes long. It is well known that in most cases the second pass
`generates huge amount of candidate itemsets thus it is the most time consuming
`phase[4][5]. Our results are very much alike. Almost over 80belongs to PASS2 in
`all three SQL queries. Obviously View and Subquery SQL complete their first
`and second passes faster than SETM SQL query. We have recorded the execution
`traces of the three SQL in each PASS. The decomposition of execution time is
`
`AMERICAN EXPRESS v. METASEARCH
`CBM2014-00001 EXHIBIT 2010-9
`
`
`
`Web Log Mining and Parallel SQL Based Execution
`
`29
`
`analysed as shown Figure 6 (PASS2) respectively. Comparing the elapsed time
`with the cpu time at Figure 6, we find that both are close for View SQL and
`Subquery SQL. This means these SQL’s are cpu bound, while SETM SQL is not
`cpu bound. Most of execution time of SETM query is dominated by disk write
`time for creating temporary table such as R k and RTMP k. We can also see
`that sort time is almost equal for all three SQL’s, which represents the cost of
`group by aggregation. In PASS2, SETM reuses item combinations in temporary
`table R1 on the secondary storage that is generated in PASS1. We replace it
`with view or subquery. Then data is transferred directly through memory from
`PASS1 to PASS2. Figure 6 indicates that PASS2 of those modified SQL queries
`only read data from buffer pool. Thus the disk write time of View SQL and
`Subquery SQL is almost negligible, although it is dominant for SETM SQL.
`This analysis clarifies the problem of SETM and how to cost can be reduced for
`View and Subquery SQLs, which is the key to the performance improvement.
`
`Fig. 4. Execution time on parallel database environment
`
`5 Summary
`
`The web has change some paradigms of information retrieval. The challenge
`to provide satisfying answers from web queries faces problems from the chaotic
`nature of web page generation, the phenomenal growth of information on the web
`and the complexity of web structure. The technology of web mining has showed
`promising results to improve the information retrieval quality from the web.
`More sophisticated and complex web mining techniques will be inevitable for the
`future of web. To support that we need powerful systems, yet with reasonable
`cost.
`
`AMERICAN EXPRESS v. METASEARCH
`CBM2014-00001 EXHIBIT 2010-10
`
`
`
`30
`
`M. Kitsuregawa et al.
`
`Fig. 5. Speedup ratio in parallel database environment
`
`Fig. 6. Decomposition of execution time of PASS2 for three types of SQL queries
`
`In this paper, we reported the result of mining web access log of a portal
`site for mobile users called Mobile Info Search. Two techniques are used : the
`association rule mining and sequential pattern mining. We have also examined
`the parallelization of SQL query to mine association rule on commercial RDBMS
`(IBM DB2 UDB EEE). We showed that good speedup ratio can be achieved,
`that means it is parallelized well. We also examined two variations of SETM SQL
`queries to improve performance, which reduce I/O cost by using View materialize
`or Subquery technique.
`We have compared the parallel implementation of SQL based association rule
`mining with commercial data mining tool (IBM Intelligent Miner). Through real
`
`AMERICAN EXPRESS v. METASEARCH
`CBM2014-00001 EXHIBIT 2010-11
`
`
`
`Web Log Mining and Parallel SQL Based Execution
`
`31
`
`implementation, we have showed that our improved SETM query using View or
`Subquery can beat the performance of specialized tool with only 4 nodes while
`original SETM query would need more than 24 processing nodes to achieve
`the same performance. We don’t have to buy expensive data mining tool, since
`parallel execution of SQL comes at no extra cost. It is also easy to implement
`and portable.
`
`Acknowledgements
`
`We would like to thank people from NTT Software, in particular Mr. Katsumi
`Takahashi and Dr. Atsuhiro Goto for providing the log file of MIS and helpful
`discussions.
`
`References
`
`1. R. Agrawal, T. Imielinski, A. Swami. “Mining Association Rules between Sets of
`Items in Large Databases”. In Proc. of the ACM SIGMOD Conference on Man-
`agement of Data, 1993.
`2. R. Agrawal, R. Srikant. “Fast Algorithms for Mining Association Rules”. In Proc.
`of the VLDB Conference, 1994.
`3. R. Agrawal, R. Srikant “Mining Sequential Patterns”. ‘In Proc. of Int. Conf. on
`Data Engineering, March 1995.
`4. R. Srikant, R. Agrawal “Mining Sequential Patterns: Generalizations and perfor-
`mance improvements” ‘In Proc. of 5th Int. Conf. on Extending Database Technol-
`ogy, March 1996.
`5. G. O. Arocena, A. O. Mandelzon, G. A. Mihaila. “Applications of a Web Query
`Language” ‘In Proc. of WWW6, April 1997.
`6. S. Brin, L. Page “The Anatomy of a Large Scale Hypertextual Web Search Engine”.
`In Proc. of WWW7, May 1998.
`7. A. Buchner, M. D. Mulvenna. “Discovering internet marketing intelligence through
`online analytical Web usage mining” In SIGMOD Record (4)27, 1999.
`8. R. Cooley, B. Mobasher, J. Srivistava. “Data preparation for mining World Wide
`Web browsing patterns” In Journal of Knowledge and Information Systems (1)1,
`1999.
`9. E. Spertus, L. A. Stein. “Squel: A Structured Query Language for the Web” In
`Proc. of WWW9, May 2000.
`10. M. Houtsma, A. Swami. “Set-oriented Mining of Association Rules” In Proc. of
`International Conference on Data Engineering, March 1995.
`11. J. Kleinberg. “Authoritive sources in s hyperlinked environment”.
`ACM-SIAM Symposium in Discrete Algorithm, 1998.
`12. M. Perkowitz, O. Etzioni. “Towards Adaptive Web Sites: Conceptual Framework
`and Case Study”, In Proc. of WWW8, May 1999.
`13. I. Pramudiono, T. Shintani, T. Tamura, M. Kitsuregawa. “Parallel SQL Based
`Association Rule Mining on Large Scale PC Cluster: Performance Comparison with
`Directly Coded C Implementation”. In Proc. of Third Pacific-Asia Conference on
`Knowledge Discovery a nd Data Mining (PAKDD99), March 1999.
`
`In Proc. of
`
`AMERICAN EXPRESS v. METASEARCH
`CBM2014-00001 EXHIBIT 2010-12
`
`
`
`32
`
`M. Kitsuregawa et al.
`
`“Mining General-
`14. I. Pramudiono, T. Shintani, T. Tamura, M. Kitsuregawa.
`ized Association Rule using Parallel RDB Engine on PC Cluster”.
`In Proc. of
`First International Conference on Data Warehousing and Knowledege Discovery
`(DAWAK99), September 1999.
`15. S. Sarawagi, S. Thomas, R. Agrawal. “Integrating Association Rule Mining with
`Relational Database Systems : Alternatives and Implications”. In Proc. of the ACM
`SIGMOD Conference on Management of Data, 1998.
`16. S. Thomas, S. Chakravarthy. “Performance Evaluation and Optimization of Join
`Queries for Association Rule Mining”. In Proc. of First International Conference
`on Data Warehousing and Knowledege Discovery (DAWAK99), September 1999.
`17. Katsumi Takahashi, Seiji Yokoji, Nobuyuki Miura “Location Oriented Integra-
`tion of Internet Information - Mobile Info Search”. In Designing the Digital City,
`Springer-Verlag, March 2000.
`18. http://www.kokono.net
`19. Takayuki Tamura, Masato Oguchi, and Masaru Kitsuregawa “Parallel Database
`Processing on a 100 Node PC Cluster: Cases for Decision Support Query Pro-
`cessing and Data Mining”. In Proc. of SC97: High Performance Networking and
`Computing(SuperComputing ’97), November, 1997.
`20. S. Thomas, S. Sarawagi “Mining Generalized Association Rules and Sequential
`Patterns Using SQL Queries” ‘In Proc. of Int. Conf. on Knowledge Discovery and
`Data Mining, March 1998.
`21. T. Yan, M. Jacobsen, H. Garcia-Molina, U. Dayal. “From user access patterns to
`dynamic hypertext linking” In Proc. of WWW5, May 1996.
`
`AMERICAN EXPRESS v. METASEARCH
`CBM2014-00001 EXHIBIT 2010-13