throbber
Web Log Mining
`and Parallel SQL Based Execution
`
`Masaru Kitsuregawa, Takahiko Shintani,
`Takeshi Yoshizawa(cid:2), and Iko Pramudiono
`
`Institute of Industrial Science, The University of Tokyo
`7-22-1 Roppongi, Minato-ku, Tokyo 106, Japan
`{kitsure,shintani,yoshi,iko}@tkl.iis.u-tokyo.ac.jp
`
`Abstract. We performed association rule mining and sequence pattern
`mining against the access log which was accumulated at NTT Software
`Mobile Info Search portal site. Detail web log mining process and the
`rules we derived are reported in this paper. The integration of web data
`and relational database enables better management of web data. Some
`researches have even tried to implement applications such as web mining
`with SQL. Commercial RDBMSs support parallel execution of SQL. Par-
`allelism is key to improve the performance. We showed that commercial
`RDBMS can achieve substantial speed up for web mining.
`
`1 Introduction
`
`The analysis of web log to understand the characteristics of web users has been
`one of the major topics in web mining. The goal is to provide personalized
`information retrieval mechanism that match the need of each individual user.
`”One to one marketing” on the web also has similar objectives. The development
`of the web personalization technologies will certainly benefit e-commerce too.
`In this paper, we focused on the mining access log using association rule
`discovery techniques. We show some mining results from web log of Mobile Info
`Search(MIS), a location-aware search engine [18]. Usage mining of this unique
`site could give some interesting insights into the behavior of mobile device users
`which are the targets of this site.
`Here we report some of the web mining techniques based on association rule
`that can be accomplished by some modified SQL queries on relational database.
`The integration of web with database techniques has drawn attention from re-
`searchers. Some have proposed query languages for the web that is similar with
`SQL such as Squeal[9] and WebSQL[5]. They emphasize better organization of
`web data managed in relation database way. We extend this concept for real
`applications of web mining. We also address the performance problem by paral-
`leling the execution of SQL queries.
`Although the amount of log at MIS is not so large, generally at large portal
`site it tends to be very large. The log can reach several tens of GB per day. Just
`(cid:2) IBM Japan Co.,Ltd. 1-1, Nakase, Mihama-ku, Chiba-shi, Chiba 261-8522, Japan
`
`S. Bhalla (Ed.): DNIS 2000, LNCS 1966, pp. 20–32, 2000.
`c(cid:2) Springer-Verlag Berlin Heidelberg 2000
`
`AMERICAN EXPRESS v. METASEARCH
`CBM2014-00001 EXHIBIT 2010-1
`
`

`
`Web Log Mining and Parallel SQL Based Execution
`
`21
`
`one day log is not enough for mining. If we are going to use several weeks log,
`then we have to handle more than one terabyte of data. Single PC server cannot
`process such huge amount of data with reasonable amount of time.
`On the other hand recently most major commercial database systems have in-
`cluded capabilities to support parallelization although no report available about
`how the parallelization affects the performance of complex query required by
`association rule mining. This fact motivated us to examine how efficiently SQL
`based association rule mining can be parallelized and speeded up using commer-
`cial parallel database system (IBM DB2 UDB EEE). We propose two techniques
`to enhance association rule mining query based on SETM [10]. And we have also
`compared the performance with commercial mining tool (IBM Intelligent Miner).
`Our performance evaluation shows that we can achieve comparable performance
`with commercial mining tool using only 4 nodes. Some considerable works on
`effective SQL queries to mine association rule such didn’t examine the effect of
`parallelization [15][16]. Some of authors have reported a performance evaluation
`on PC cluster as parallel platform [13][14]. Comparison with natively coded pro-
`grams is also reported. However we use currently available commercial products
`for the evaluation here.
`
`2 Web Usage Mining for Portal Site
`
`2.1 Web Access Log Mining
`
`Access log of a web site records every user requests sent to the web server. From
`the access log we can know which pages were visited by the user, what kind of
`CGI request he submitted, when was the access and it also tells to some extent
`where the user come from. Using those information, we can modify the web site to
`satisfy the need of users better by providing better site map, change layout of the
`pages and the placement of links etc. [12] has proposed the concept of adaptive
`web site that dynamically optimize the structure of web pages based on the
`users access pattern. Some data mining techniques has been applicated on web
`logs to predict future user behavior and to derive marketing intelligence[21][7][8].
`Currently many e-commerce applications also provides limited personalization
`based on access log analysis. Some pioneers such as Amazon.com have achieved
`considerable success.
`Here we will show the mining process of real web site. We have collaborated
`with NTT Software to analyze the usage of a unique search engine called Mobile
`Info Search.
`
`2.2 Mobile Info Search(MIS)
`
`Mobile Info Search (MIS) is a research project conducted by NTT Software
`Laboratory whose goal to provide location aware information from the inter-
`net by collecting, structuring, organizing, and filtering in a practicable form[17].
`MIS employs a mediator architecture. Between users and information sources,
`
`AMERICAN EXPRESS v. METASEARCH
`CBM2014-00001 EXHIBIT 2010-2
`
`

`
`22
`
`M. Kitsuregawa et al.
`
`MIS mediates database-type resources such as online maps, internet “yellow-
`pages” etc. using Location-Oriented Meta Search and static files using Location
`Oriented Robot-based Search. Users input their location using address, nearest
`station, latitude-longitude or postal number. If the user has a Personal Handy
`Phone(PHS) or Geo Positioning System(GPS) unit, the user location is auto-
`matically obtained.
`The site is available to the public since 1997. Its URL is http://www.kokono.net.
`In average 500 searches are performed on the site daily. A snapshot of this site
`is shown in Figure 1. 1
`
`Mobile Info Search 2 Ver.2.00
`
`Location Information(2000/09/22 16:23:08)
`
`Tokyo, Chuo-ku, Ginza, 4-chome ZIP 104-0061
`(NL 35.40.4.7 EL 139.46.5.7)
`Nearest station : Ginza, Higashi-Ginza
`Kokono (nearby area) Search
`Shops Information Internet-Townpage
`Keywords [ ! type ]
`Maps
`Train Route
`Train Timetable
`Hotels
`Newspapers
`Weather Report
`TV Guide
`
`Fig. 1. Index page of Mobile Info Search
`
`MIS has two main functionalities :
`
`1. Location Oriented Meta Search
`Many local information on the web are database-type, that is the information
`is stored in backbone database. In contrast to static pages, they are accessed
`through CGI program of WWW server. MIS provides a simple interface for
`local information services which have various search interfaces. It converts
`the location information and picks the suitable wrapper for the requested
`service. Example of database-type resources provided are shown in table 1.
`2. Location-Oriented Robot-Based Search “kokono Search”
`kokono Search provides the spatial search that searches the document close to
`a location. “kokono” is a Japanese word means here. kokono Search also em-
`ploys “robot” to collects static documents from internet. While other search
`engines provide a keyword-based search, kokono Search do a location-based
`spatial search. It displays documents in the order of the distance between
`the location written in the document and the user’s location.
`1 The page is shown in Japanese at http://www.kokono.net
`
`AMERICAN EXPRESS v. METASEARCH
`CBM2014-00001 EXHIBIT 2010-3
`
`

`
`Web Log Mining and Parallel SQL Based Execution
`
`23
`
`Table 1. Database-type resources on the Internet
`
`Location information used for the search
`Service
`longitude-latitude
`Maps
`Yellow Pages
`address ( and categories ... etc)
`Train Time Tables station
`Weather Reports address or region
`Hotel Guides
`nearest station
`
`3 Mining MIS Access Log and Its Derived Rules
`
`3.1 Preprocessing
`
`We analyzed the users’ searches from the access log recorded on the server be-
`tween January and May 1999. There are 1035532 accesses on the log, but the
`log also consists image retrieval, searches without cookie and pages that do not
`have relation with search. Those logs were removed. Finally we had 25731 search
`logs to be mined.
`
`– Access Log Format
`Each search log consists CGI parameters such as location information (ad-
`dress, station, zip), location acquisition method (f rom), resource type (sub-
`mit), the name of resource (shop web, map web, rail web, station web,
`tv web), the condition of search (keyword, shop cond). We treat those pa-
`rameters the same way as items in transaction data of retail sales. In ad-
`dition, we generate some items explaining the time of access (access week,
`access hour).
`Example of a search log is shown in Figure 2.
`– Taxonomy of Location
`Since names of places follow some kind of hierarchy, such as “city is a part of
`prefecture” or “a town is a part of a city”, we introduce taxonomy between
`them. We do this by adding items on part of CGI parameter address. For
`example, if we have an entry in CGI parameters entry [address=Yamanashi-
`ken, Koufu-shi, Oo-satomachi], we can add 2 items as ancestors : [address=
`Yamanashi-ken, Koufu-shi] at city level and [address=Yamanashi-ken] at
`prefecture level. In Japanese, “ken” means prefecture and “shi” means city.
`– Transformation to Transaction Table
`Finally we have the access log is transformed into transaction table ready
`for association rule mining. Part of transaction table that corresponds to log
`entry in Figure 2 is shown in Table 2
`
`3.2 Association Rule Mining
`
`Agrawal et. al.[1] first suggested the problem of finding association rule from
`large database. An example of association rule mining is finding “if a customer
`
`AMERICAN EXPRESS v. METASEARCH
`CBM2014-00001 EXHIBIT 2010-4
`
`

`
`24
`
`M. Kitsuregawa et al.
`
`0000000003 - - [01/Jan/1999:00:30:46 0900] "GET /index.cgi?
`sel_st=0&NL=35.37.4.289&EL=138.33.45.315&address=Yamanashi-ken,
`Koufu-shi,Oosato-machi&station=Kokubo:Kaisumiyoshi:Minami-koufu:
`Jouei&zip=400-0053&from=address&shop_web=townpage&keyword=
`&shop_cond=blank&submit_map=Map&map_web=townpage&rail_web
`=s_tranavi&station_web=ekimae&tv_web=tvguideHTTP/1.1" 200 1389
`"http://www.kokono.net/mis2/mis2-header?date=1999/01/01.00:27:59
`&address=Yamanashi-ken,Koufu-shi,Oosato-machi&NL=35.37.4.289&EL
`=138.33.45.315&station=Kokubo:Kaisumiyoshi:Minami-koufu:Jouei&zip
`=400-0053&from=address&keyword=&shop_web=townpage&shop_cond=blank
`&map_web=townpage&station_web=&tv_web=tvguide""Mozilla/4.0
`(compatible; MSIE 4.01; Windows 98)$B!I(B"LastPoint=NL=35.37.4.289
`&EL=138.33.45.315&address=Yamanashi-ken,Koufu-shi,Oosato-machi&station
`=Kokubo:Kaisumiyoshi:Minami-koufu:Jouei&zip=400-0053; LastSelect
`=shop_web=townpage&shop_cond=blank&keyword=&map_web=townpage&rail_web=
`s_tranavi&station_web=ekimae&tv_web=tvguide;Apache=1; MIS=1" "-"
`
`Fig. 2. Example of an access log
`
`Table 2. Representation of access log in relational database
`
`Log ID User ID
`001
`003
`
`001
`001
`001
`
`001
`001
`001
`001
`
`003
`003
`003
`
`003
`003
`003
`003
`
`Relation LOG
`Item
`address=Yamanashi-ken
`,Koufu-shi,Oosato-machi
`address=Yamanashi-ken,Koufu-shi,
`address=Yamanashi-ken,
`station=Kokubo:
`Kaisumiyoshi:Minami-koufu:Jouei
`zip=400-0053
`from=address
`submit map=Map
`map web=townpage
`
`buys A and B then 90% of them buy also C” in transaction databases of large
`retail organizations. This 90% value is called confidence of the rule. Another
`important parameter is support of an itemset, such as {A,B,C}, which is de-
`fined as the percentage of the itemset contained in the entire transactions. For
`above example, confidence can also be measured as support({A,B,C}) divided
`by support({A,B}).
`We show some results in Table 3 and 4. Beside common parameters such as
`confidence and support, we also use user that indicate the percentage of users
`logs that contain the rule.
`Those rules can be used to improve the value of web site. We can identify from
`the rules some access patterns of users that access this web site. For example,
`from the first rule we know that though Akihabara is a well known place in Tokyo
`
`AMERICAN EXPRESS v. METASEARCH
`CBM2014-00001 EXHIBIT 2010-5
`
`

`
`Web Log Mining and Parallel SQL Based Execution
`
`25
`
`for electronic appliances/parts shopping, user that searches around Akihabara
`station will probably looks for restaurant. From this unexpected result, we can
`prefetch information of restaurant around Akihabara station to reduce access
`time, we can also provide links to this kind of user to make his navigation
`easier or offer proper advertisement banner. In addition, learning users behavior
`provides hint for business chance for example the first rule tell us the shortcoming
`of restaurants in Akihabara area.
`
`Table 3. Some results of MIS log mining regarding search condition
`
`Not so many good restaurants in Akihabara ?
`[keyword=][address=Tokyo,][station=Akihabara] ⇒ [shop cond=restaurant]
`In Hokkaido, people looks for gasoline stand at night from its address
`[access hour=20][address=Hokkaido,][from=address] [shop web=townpage]
`⇒ [shop cond=gasoline]
`People from Gifu-ken quite often searches for restaurants
`[address=Gifu-ken,][shop web=townpage] ⇒ [shop cond=restaurant]
`However people from Gifu-ken search for hotels on Saturday
`[access week=Sat][address=Gifu-ken,] [shop web=townpage] ⇒ [shop cond=hotel]
`People from Gifu-ken must search for hotel around stations
`[address=Gifu-ken,][shop web=townpage] [station=Kouyama] ⇒ [shop cond=hotel]
`
`Table 4. Some results of MIS log mining regarding time and location acquisition
`method
`
`Most frequent searches for restaurants around 16:00 if they start from address on Friday
`[access week=Fri][from=address][shop cond=restaurant] ⇒ [access hour=16]
`Most frequent searches for department store stand at 20:00 if start from address.
`[from=address][shop cond=department] ⇒ [access hour=20]
`Looking for gasoline stand on Sunday ?
`[from=address][shop cond=gasoline][shop web=townpage] ⇒ [access week=Sun]
`Search for hotels often from station if at Kanagawa-ken
`[address=Kanagawa-ken,][shop cond=hotel] ⇒ [from=station]
`People at Osaka start searching convenience stores from ZIP number !
`[address=Osaka,][shop cond=conveni] ⇒ [from=zip]
`People at Hokkaido always search convenience stores from address
`[address=Hokkaido,][shop cond=conveni] [shop web=townpage] ⇒ [from=address]
`
`3.3 Sequential Rule Mining
`
`The problem of mining sequential patterns in a large database of customer trans-
`actions was also introduced by Agrawal et. al.[3]. The transactions are ordered
`by the transaction time. A sequential pattern is an ordered list (sequence) of
`itemsets such as “5% of customers who buy both A and B also buy C in the
`next transaction”.
`
`AMERICAN EXPRESS v. METASEARCH
`CBM2014-00001 EXHIBIT 2010-6
`
`

`
`26
`
`M. Kitsuregawa et al.
`
`We show some sequential patterns that might be interesting in Table 5. Some
`patterns indicate the behavior of users that might be planning to do shop-
`ping. We can derive from second pattern that significant part of users check
`the weather forecast first, then they look for the shops in the yellow-pages ser-
`vice called “Townpage” then look again for additional information in the vicinity
`with kokono Search and finally they confirm the exact location in the map.
`
`Table 5. Some results of sequential pattern mining
`
`After finding a shop, check how to go there and the weather
`[submit shop=Shop Info] → [submit rail=Search Train]
`→ [submit newspaper=Newspaper]
`⇒ [submit weather=Weather Forecast]
`Or decide the plan after checking the weather first
`[submit weather=Weather Forecast]
`→ [submit shop=Shop Info] [shop web=townpage]
`→ [submit kokono=Kokono Search] ⇒ [submit map=Map]
`Looking for shops after closing time
`[submit shop=Shop Info] [access hour=22] [access week=Fri]
`⇒ [submit map=Map] [access hour=22] [access week=Fri]
`
`4 Mining Web Log Using RDBMS
`
`The ability to perform web mining using standard SQL queries is a next challenge
`for better integration of web and RDBMS. The integration is essential since
`better management of web data has become a necessity for large sites.
`The performance issue was a major problem for data mining with RDBMS.
`We will show that commercial RDBMS on parallel platforms could handle the
`task with sufficient performance.
`
`4.1 Association Rule Mining Based on SQL
`
`A common strategy to mine association rule is:
`
`1. Find all itemsets that have transaction support above minimum support,
`usually called large itemsets.
`2. Generate the desired rules using large itemsets.
`
`Since the first step consumes most of processing time, development of mining
`algorithm has been concentrated on this step.
`In our experiments we employed three type of SQL query. First of all the
`standard SQL query using SETM algorithm [10]. It is shown in Figure 3. Sec-
`ond is the enhanced SQL query using view materialization technique. Third is
`another enhanced SQL query using subquery technique.
`
`AMERICAN EXPRESS v. METASEARCH
`CBM2014-00001 EXHIBIT 2010-7
`
`

`
`Web Log Mining and Parallel SQL Based Execution
`
`27
`
`SQL query using SETM algorithm. Transaction data is transformed into
`the first normal form (transaction ID, item). In the first pass we simply gather
`the count of each item. Items that satisfy the minimum support are inserted
`into large itemsets table C 1 that takes form (item, item count). SETM employs
`temporary tables to reuse item combination in next pass. In first pass, transac-
`tions that match large itemsets are preserved in temporary table R 1. In other
`passes for example pass k, we first generate all lexicographically ordered candi-
`date itemsets of length k into another temporary table RTMP k by self-joining
`table R (k-1) that contains k-1 length transaction data. Then we generate the
`count for those itemsets, itemsets that meet minimum support are included into
`large itemset table C k. Finally transaction data R k of length k generated by
`matching items in candidate itemset table RTMP k with items in large itemsets.
`In order to avoid excessive I/O, we disable the log during the execution.
`
`= q.id
`p.id
`p.item 1 = q.item 1
`p.item 2 = q.item 2
`
`...
`
`p.item k-2 = q.item k-2
`p.item k-1 < q.item k-1;
`
`WHERE
`AND
`AND
`
`AND
`AND
`
`INSERT INTO C k
`item 1, item 2, ..., item k,
`SELECT
`COUNT(*)
`RTMP k
`FROM
`GROUP BY item 1, item 2, ..., item k
`HAVING COUNT(*) >= :min support;
`
`INSERT INTO R k
`p.id, p.item 1, p.item 2, ...,
`SELECT
`p.item k
`RTMP k p, C k c
`p.item 1 = c.item 1
`p.item 2 = c.item 2
`
`FROM
`WHERE
`AND
`
`...
`
`AND
`
`p.item k = c.item k;
`
`DROP TABLE R k-1;
`DROP TABLE RTMP k;
`
`CREATE TABLE LOG (id int, item int);
`
`– PASS 1
`
`CREATE TABLE C 1 (item 1 int, cnt int);
`CREATE TABLE R 1 (id int, item 1 int);
`
`INSERT INTO C 1
`item AS item 1, COUNT(*)
`SELECT
`FROM
`LOG
`GROUP BY item
`HAVING COUNT(*) >= :min support;
`
`INSERT INTO R 1
`p.id, p.item AS item 1
`SELECT
`LOG p, C 1 c
`FROM
`p.item = c.item 1;
`WHERE
`
`– PASS k
`CREATE TABLE RTMP k (id int, item 1 int,
`item 2 int, ... , item k int)
`(item 1 int,
`CREATE TABLE C k
`item 2 int, ... , item k int, cnt int)
`(id int, item 1 int,
`CREATE TABLE R k
`item 2 int, ... , item k int)
`
`INSERT INTO RTMP k
`p.id, p.item 1, p.item 2, ... ,
`SELECT
`p.item k-1, q.item k-1
`R k-1 p, R k-1 q
`
`FROM
`
`Fig. 3. SQL query to mine association rule
`
`Enhanced SETM query using view materialize technique. SETM has
`to materialize its temporary tables namely R k and RTMP k. Those temporary
`tables are only required in the next pass and they are not needed for generating
`the rules. In fact, those tables can be deleted after execution of its subsequent
`pass. Based on this observation we could avoid materialization cost of those
`temporary tables by replacing the table creation with view.
`
`Enhanced SETM query using subquery technique. We expect significant
`performance improvement with utilization of view, however view still requires
`time to access the system catalog and are holding locks to the system catalog
`table during creating views so we further use subquery instead of temporary
`
`AMERICAN EXPRESS v. METASEARCH
`CBM2014-00001 EXHIBIT 2010-8
`
`

`
`28
`
`M. Kitsuregawa et al.
`
`tables. Therefore we embed the generation of item combinations into the query
`to generate large itemsets.
`
`4.2 Parallel Execution Environment and Performance Evaluation
`
`In our experiment we employed commercial Parallel RDBMS: IBM DB2 UDB
`EEE version 6.1 on IBM UNIX Parallel Server System: IBM RS/6000 SP. 12
`nodes make this system and using shared nothing architecture. Each node has
`POWER2 77Mhz CPU, 4.4GB SCSI hard disk, 256MB RAM and connected
`by High Performance Switch HPS with 100MB/s network speed. We also used
`commercial data mining tool IBM Intelligent Miner on single node of RS/6000
`SP for performance comparison with the SQL based data mining.
`We show execution time of association rule mining with several minimum
`supports in Figure 4. The data used here is synthetic transaction data generated
`with program described in Apriori algorithm [2] to show that we can handle
`larger data with parallel RDBMS. The number of transactions here is 200000,
`average transaction length 10 and number of items 2000. Transaction data is
`partitioned uniformly by hashing algorithm corresponds to transaction ID among
`processing nodes’ local hard disks. We also show the result of commercial data
`mining program Intelligent Miner from IBM on single node for reference.
`Figure 4 shows the execution time on each degree of parallelization. On av-
`erage, we can derive that View and Subquery SQL is about 6 times faster than
`SETM SQL regardless of the number of nodes. The result is also compared with
`the execution time of Intelligent Miner on single processing node. It is true that
`Intelligent Miner on single node with transaction data stored in flat file is much
`faster than the SQL queries. However, the View and Subquery SQL are 50on
`single node if the transaction data have to be read from RDBMS. We exemplified
`that we can achieve comparable performance of Intelligent Miner on single node
`with flat file by activating only 4 nodes when we used View and Subquery SQL.
`The result gives evidence for the effectiveness of parallelization of SQL query to
`mine association rule.
`The speedup ratio is shown in Figure 5. This is also reasonably good, espe-
`cially View and Subquery SQL are not being saturated as the number of process-
`ing nodes increased. That means they can be parallelized well. The execution
`is 11 times faster with 12 nodes. In parallel environment, network potentially
`becomes botleneck which degrades the speed-up ratio. However our experiments
`suggest that association rule mining using variants of SETM is mostly CPU
`bound and network I/O is negligible.
`Here we give thorough comparison and analysis on the three variations of
`SQL query described before. The performance evaluation is done on 12 nodes.
`The mining is two passes long. It is well known that in most cases the second pass
`generates huge amount of candidate itemsets thus it is the most time consuming
`phase[4][5]. Our results are very much alike. Almost over 80belongs to PASS2 in
`all three SQL queries. Obviously View and Subquery SQL complete their first
`and second passes faster than SETM SQL query. We have recorded the execution
`traces of the three SQL in each PASS. The decomposition of execution time is
`
`AMERICAN EXPRESS v. METASEARCH
`CBM2014-00001 EXHIBIT 2010-9
`
`

`
`Web Log Mining and Parallel SQL Based Execution
`
`29
`
`analysed as shown Figure 6 (PASS2) respectively. Comparing the elapsed time
`with the cpu time at Figure 6, we find that both are close for View SQL and
`Subquery SQL. This means these SQL’s are cpu bound, while SETM SQL is not
`cpu bound. Most of execution time of SETM query is dominated by disk write
`time for creating temporary table such as R k and RTMP k. We can also see
`that sort time is almost equal for all three SQL’s, which represents the cost of
`group by aggregation. In PASS2, SETM reuses item combinations in temporary
`table R1 on the secondary storage that is generated in PASS1. We replace it
`with view or subquery. Then data is transferred directly through memory from
`PASS1 to PASS2. Figure 6 indicates that PASS2 of those modified SQL queries
`only read data from buffer pool. Thus the disk write time of View SQL and
`Subquery SQL is almost negligible, although it is dominant for SETM SQL.
`This analysis clarifies the problem of SETM and how to cost can be reduced for
`View and Subquery SQLs, which is the key to the performance improvement.
`
`Fig. 4. Execution time on parallel database environment
`
`5 Summary
`
`The web has change some paradigms of information retrieval. The challenge
`to provide satisfying answers from web queries faces problems from the chaotic
`nature of web page generation, the phenomenal growth of information on the web
`and the complexity of web structure. The technology of web mining has showed
`promising results to improve the information retrieval quality from the web.
`More sophisticated and complex web mining techniques will be inevitable for the
`future of web. To support that we need powerful systems, yet with reasonable
`cost.
`
`AMERICAN EXPRESS v. METASEARCH
`CBM2014-00001 EXHIBIT 2010-10
`
`

`
`30
`
`M. Kitsuregawa et al.
`
`Fig. 5. Speedup ratio in parallel database environment
`
`Fig. 6. Decomposition of execution time of PASS2 for three types of SQL queries
`
`In this paper, we reported the result of mining web access log of a portal
`site for mobile users called Mobile Info Search. Two techniques are used : the
`association rule mining and sequential pattern mining. We have also examined
`the parallelization of SQL query to mine association rule on commercial RDBMS
`(IBM DB2 UDB EEE). We showed that good speedup ratio can be achieved,
`that means it is parallelized well. We also examined two variations of SETM SQL
`queries to improve performance, which reduce I/O cost by using View materialize
`or Subquery technique.
`We have compared the parallel implementation of SQL based association rule
`mining with commercial data mining tool (IBM Intelligent Miner). Through real
`
`AMERICAN EXPRESS v. METASEARCH
`CBM2014-00001 EXHIBIT 2010-11
`
`

`
`Web Log Mining and Parallel SQL Based Execution
`
`31
`
`implementation, we have showed that our improved SETM query using View or
`Subquery can beat the performance of specialized tool with only 4 nodes while
`original SETM query would need more than 24 processing nodes to achieve
`the same performance. We don’t have to buy expensive data mining tool, since
`parallel execution of SQL comes at no extra cost. It is also easy to implement
`and portable.
`
`Acknowledgements
`
`We would like to thank people from NTT Software, in particular Mr. Katsumi
`Takahashi and Dr. Atsuhiro Goto for providing the log file of MIS and helpful
`discussions.
`
`References
`
`1. R. Agrawal, T. Imielinski, A. Swami. “Mining Association Rules between Sets of
`Items in Large Databases”. In Proc. of the ACM SIGMOD Conference on Man-
`agement of Data, 1993.
`2. R. Agrawal, R. Srikant. “Fast Algorithms for Mining Association Rules”. In Proc.
`of the VLDB Conference, 1994.
`3. R. Agrawal, R. Srikant “Mining Sequential Patterns”. ‘In Proc. of Int. Conf. on
`Data Engineering, March 1995.
`4. R. Srikant, R. Agrawal “Mining Sequential Patterns: Generalizations and perfor-
`mance improvements” ‘In Proc. of 5th Int. Conf. on Extending Database Technol-
`ogy, March 1996.
`5. G. O. Arocena, A. O. Mandelzon, G. A. Mihaila. “Applications of a Web Query
`Language” ‘In Proc. of WWW6, April 1997.
`6. S. Brin, L. Page “The Anatomy of a Large Scale Hypertextual Web Search Engine”.
`In Proc. of WWW7, May 1998.
`7. A. Buchner, M. D. Mulvenna. “Discovering internet marketing intelligence through
`online analytical Web usage mining” In SIGMOD Record (4)27, 1999.
`8. R. Cooley, B. Mobasher, J. Srivistava. “Data preparation for mining World Wide
`Web browsing patterns” In Journal of Knowledge and Information Systems (1)1,
`1999.
`9. E. Spertus, L. A. Stein. “Squel: A Structured Query Language for the Web” In
`Proc. of WWW9, May 2000.
`10. M. Houtsma, A. Swami. “Set-oriented Mining of Association Rules” In Proc. of
`International Conference on Data Engineering, March 1995.
`11. J. Kleinberg. “Authoritive sources in s hyperlinked environment”.
`ACM-SIAM Symposium in Discrete Algorithm, 1998.
`12. M. Perkowitz, O. Etzioni. “Towards Adaptive Web Sites: Conceptual Framework
`and Case Study”, In Proc. of WWW8, May 1999.
`13. I. Pramudiono, T. Shintani, T. Tamura, M. Kitsuregawa. “Parallel SQL Based
`Association Rule Mining on Large Scale PC Cluster: Performance Comparison with
`Directly Coded C Implementation”. In Proc. of Third Pacific-Asia Conference on
`Knowledge Discovery a nd Data Mining (PAKDD99), March 1999.
`
`In Proc. of
`
`AMERICAN EXPRESS v. METASEARCH
`CBM2014-00001 EXHIBIT 2010-12
`
`

`
`32
`
`M. Kitsuregawa et al.
`
`“Mining General-
`14. I. Pramudiono, T. Shintani, T. Tamura, M. Kitsuregawa.
`ized Association Rule using Parallel RDB Engine on PC Cluster”.
`In Proc. of
`First International Conference on Data Warehousing and Knowledege Discovery
`(DAWAK99), September 1999.
`15. S. Sarawagi, S. Thomas, R. Agrawal. “Integrating Association Rule Mining with
`Relational Database Systems : Alternatives and Implications”. In Proc. of the ACM
`SIGMOD Conference on Management of Data, 1998.
`16. S. Thomas, S. Chakravarthy. “Performance Evaluation and Optimization of Join
`Queries for Association Rule Mining”. In Proc. of First International Conference
`on Data Warehousing and Knowledege Discovery (DAWAK99), September 1999.
`17. Katsumi Takahashi, Seiji Yokoji, Nobuyuki Miura “Location Oriented Integra-
`tion of Internet Information - Mobile Info Search”. In Designing the Digital City,
`Springer-Verlag, March 2000.
`18. http://www.kokono.net
`19. Takayuki Tamura, Masato Oguchi, and Masaru Kitsuregawa “Parallel Database
`Processing on a 100 Node PC Cluster: Cases for Decision Support Query Pro-
`cessing and Data Mining”. In Proc. of SC97: High Performance Networking and
`Computing(SuperComputing ’97), November, 1997.
`20. S. Thomas, S. Sarawagi “Mining Generalized Association Rules and Sequential
`Patterns Using SQL Queries” ‘In Proc. of Int. Conf. on Knowledge Discovery and
`Data Mining, March 1998.
`21. T. Yan, M. Jacobsen, H. Garcia-Molina, U. Dayal. “From user access patterns to
`dynamic hypertext linking” In Proc. of WWW5, May 1996.
`
`AMERICAN EXPRESS v. METASEARCH
`CBM2014-00001 EXHIBIT 2010-13

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket