`A Case Study using Data Mining
`
`J. E. Moreira S. P. Midkiff M. Gupta R. Lawrence
`{jmoreira, smidkiff, mgupta, ricklawr}@us.ibm.com
`IBM T. J. Watson Research Center
`P. O. Box 218
`Yorktown Heights, NY 10598-0218

Proceedings of the ACM/IEEE SC99 Conference (SC'99), 1-58113-091-0/99 $17.00 © 1999 IEEE
`
`Abstract
`
`This paper discusses several techniques used in developing a parallel, production quality data mining application
`in Java. We started by developing three sequential versions of a product recommendation data mining application:
`(i) a Fortran 90 version used as a performance reference, (ii) a plain Java implementation that only uses the primitive
`array structures from the language, and (iii) a baseline Java implementation that uses our Array package for Java. This
`Array package provides parallelism at the level of individual Array and BLAS operations. Using this Array package,
`we also developed two parallel Java versions of the data mining application: one that relies entirely on the implicit
`parallelism provided by the Array package, and another that is explicitly parallel at the application level. We discuss
`the design of the Array package, as well as the design of the data mining application. We compare the trade-offs
`between performance and the abstraction level the different Java versions present to the application programmer. Our
`studies show that, although a plain Java implementation performs poorly, the Java implementation with the Array
`package is quite competitive in performance with Fortran. We achieve a single processor performance of 109 Mflops,
`or 91% of Fortran performance, on a 332 MHz PowerPC 604e processor. Both the implicitly and explicitly parallel
`forms of our Java implementations also parallelize well. On an SMP with four of those PowerPC processors, the
`implicitly parallel form achieves 290 Mflops with no effort from the application programmer, while the explicitly
`parallel form achieves 340 Mflops.
`
`1 Introduction
`
`Data mining [25] is the process of discovering useful and previously unknown information in very large data bases.
`Applications of data mining span scientific (e.g., image processing) as well as commercial computing (e.g., fraud de-
`tection). Although typical data mining applications access large volumes of data from either flat files or relational data
`bases, such applications often involve a significant amount of computation above the data access layer. This compu-
`tation can be floating-point intensive in the case of a neural network application, or may involve a significant number
`of integer operations for rule-based algorithms encountered in tree classification [14] and associations algorithms [1].
`High-performance data mining applications, therefore, require both fast data access and fast computation.
`In many ways Java is an ideal language to implement data mining operations. Java supports portable parallel
`programming on many platforms, which is useful for developing applications that will be deployed across several
`organizations with different computing environments. Java is clean and object-oriented, which allows programs to be
`easily maintained. Standard classes that support domain-specific operations can be effectively implemented in Java,
`as demonstrated by its large collection of graphics and database APIs. As an additional advantage, Java offers an
`expanding base of skilled and enthusiastic adepts.
`The primary worry about Java has been its efficiency. Execution speed of computation-intensive code in Java can
`be up to an order of magnitude slower than the equivalent Fortran code [18], even when the best commercially available
`environments are used. In this paper we use a production data mining application as a case study of the performance
`and ease-of-use of Java. In particular, we show that the use of the Array package for Java can lead to performance
`figures that are comparable to what can be achieved with the best Fortran environments. Exploiting parallelism within
`the Array package can further lead to significant improvements in performance with no effort from the application
`programmer.
`We contrast five implementations of the scoring phase of a production quality data mining application. One
`implementation is written in Fortran 90, and is used as the reference for performance. The first Java implementation
`is built using plain Java (i.e., only Java and its standard libraries). The second Java implementation uses an Array
`package [19, 17] that we have developed, and which is being proposed as a standard by the Java Grande Forum [13].
(We emphasize that the Array package is written entirely in Java and requires absolutely no modifications to the language.) The computational core of the program uses a BLAS [7] class that we developed as part of the Array
`package. This BLAS class can perform its individual operations either sequentially or in parallel, and allows programs
`written in a completely sequential style to exploit some parallelism. This strategy is similar to that employed by
`applications built upon the SMP Engineering and Scientific Subroutine Library (ESSL) [11] or parallel DB2 [10]
`subsystems. This second Java version can be considered both a serial version and an implicitly parallel version.
`The final Java implementation is explicitly parallel. It also uses the Array package, but exploits parallelism through
`multiple threads at the application level.
`These different implementations allow us to make a fair assessment of the computational speed of Java relative to
`a language that is tuned for performance. They also allow us to compare the relative benefits of developing explic-
`itly parallel code and building a “sequential” numerical application on top of a parallel subsystem that implements
`individual operations in parallel. Our Java implementation with the Array package achieves 109 Mflops, or 91% of
`Fortran performance, on a 332 MHz PowerPC 604e processor. Parallel execution on a 4-way SMP (using the same
`processors) achieves 290 Mflops for the implicitly parallel version and 340 Mflops for the explicitly parallel version.
`These correspond to speedups of 2.7 and 3.1, respectively.
`The rest of this paper is organized as follows. Section 2 gives a quick overview of the particular data mining
`application in our case study. Section 3 describes the Array package and the BLAS class that we have implemented
`in Java. Section 4 describes in some detail the different implementations of the data mining application. Section 5
`presents the performance results for these implementations, and discusses the implications of the experimental data.
`Section 6 discusses related work in the area, and Section 7 presents our conclusions.
`
`2 Data mining and product recommendation
`
`The specific data mining problem considered here involves recommending new products to customers based on pre-
`vious spending behavior. A number of recommendation systems [20] based on collaborative filtering [24] have been
`developed over the past several years, and used for recommending items such as books and music. Typically, a new
`user of such a recommendation system first rates different items on a numerical scale ranging from very positive to
`very negative. The collaborative filtering system uses this information to establish a user profile, and then finds a
`group of other users with similar profiles. Once this “peer group” is established, predicted ratings for new items are
`computed for the target user as a weighted average of the known ratings of the members of the peer group. Higher
`weights are given to ratings of peer-group members most similar to the target user.
Our methodology differs in several respects from collaborative filtering. As input to our product recommendation system, we have available both detailed purchase data for the customer set and a product taxonomy which
generates assignments of each product to a spending category. A 3-level product taxonomy is used: individual products (e.g., "Brooke Bond Tea, 250g") are assigned to one of 2000 product subgroups (e.g., "Premium Tea Bags - Branded"), which in turn are mapped to one of 100 major product groups (e.g., "Tea"). For each customer, a personalized recommendation list is generated by sorting the list of eligible products according to a customer-specific score computed for each product. The score is defined such that higher scores indicate stronger interest in this particular
`product. In some commercial situations, there may be an incentive to preferentially recommend some products over
`others due to overstocks, supplier incentives, or increased profit margins. This information can be introduced into
`the recommendation strategy via product-dependent scaling factors, with magnitudes reflecting the relative priority of
`recommendation. Details of the scoring algorithm are provided in the following section.
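
As an illustration of the last step only, sorting eligible products by descending score can be sketched in Java. This is our own hypothetical sketch (the names Recommend and rank are not from the system described here), not the application's code:

```java
import java.util.Arrays;
import java.util.Comparator;

// Illustrative only: produce a recommendation ordering of product indices
// from an array of customer-specific scores, highest score first.
public class Recommend {
    public static Integer[] rank(double[] phi) {
        Integer[] order = new Integer[phi.length];
        for (int i = 0; i < phi.length; i++) order[i] = i;
        // Negate the score so the natural ascending sort yields descending scores.
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> -phi[i]));
        return order;
    }
}
```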
`
`2.1 The scoring algorithm
`
`Let there be m products eligible for recommendation, p customers, and n spending categories. Each product i has an
`affinity vector Ai, where Aik is the affinity of product i with spending category k. These affinities can be interpreted as
`the perceived appeal of this product to customers with participation in this spending category. They are determined by
`pre-computing associations [1] at the level of spending categories. The collection of affinity vectors from all products
`forms the m × n affinity matrix A. This matrix is sparse and is organized so that products with the same nonzero
`patterns in their affinity vectors are adjacent. Therefore, matrix A can be represented by a set of groups of products, as
`shown in Figure 1. Each group i is defined by fi and li, the indices of the rows representing the first and last products
`in the group, respectively. All products with the same nonzero pattern belong to the same group. The nonzero pattern
for the products in group i is represented by an index vector Ii, where Iik is the position of the k-th nonzero entry for that group. Let len(Ii) be the length of index vector Ii. Also, let gi = li − fi + 1 be the number of products in group
`i. Then the affinity matrix Gi for group i is the gi × len(Ii) dense matrix with only the nonzero entries from rows fi
`to li of A.
`
[Figure 1 is not reproducible in text form. It shows an example affinity matrix A with seven products in three groups: group 1 (f1 = 1, l1 = 2), group 2 (f2 = 3, l2 = 4), and group 3 (f3 = 5, l3 = 7), together with the dense matrices G1, G2, G3 and index vectors I1, I2, I3 that represent them.]

Figure 1: Structure (a) and representation (b) of the affinity matrix A.
`
`Each customer j has an associated spending vector Sj , where Sjk is the normalized spending of customer j in
`category k. The collection of spending vectors of all customers forms the p × n spending matrix S. This matrix is also
`sparse, but customers are not organized in any particular order. (The affinity matrix is relatively static and can be built
`once, whereas the spending matrix may change between two executions of the scoring algorithm.) We also define a
`p × n normalizing matrix N , which has the same nonzero pattern as S, but with a 1 wherever S has a nonzero:
`
Njk = 1 if Sjk ≠ 0, and Njk = 0 if Sjk = 0.    (1)
`
The structure of matrices S and N is shown in Figure 2(a). The nonzero pattern of row j of matrix S is represented
`by an index vector Jj , where Jjk is the position of the k-th nonzero entry for that row (customer). The actual spending
`values are represented by a dense vector σj , of the same length as Jj , which has only the nonzero entries for row j of
`S. This representation of S is shown in Figure 2(b). Matrix N does not have to be represented explicitly.
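
The compressed representation of one row of S described above can be sketched in plain Java. The class SparseRow and all names in it are our own illustrative choices (using 0-based indices), not code from the application:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: compress one dense row of the spending matrix S into
// an index vector J (positions of nonzeros) and a value vector sigma, and
// scatter it back into expanded vectors (as the text later calls ε and η).
public class SparseRow {
    public final int[] J;        // positions of the nonzero entries
    public final double[] sigma; // the nonzero values themselves

    public SparseRow(double[] dense) {
        List<Integer> idx = new ArrayList<>();
        for (int k = 0; k < dense.length; k++)
            if (dense[k] != 0.0) idx.add(k);
        J = new int[idx.size()];
        sigma = new double[idx.size()];
        for (int q = 0; q < J.length; q++) {
            J[q] = idx.get(q);
            sigma[q] = dense[J[q]];
        }
    }

    // Scatter the compressed row into expanded vectors eps and eta of size n.
    // eta gets a 1 wherever the row has a nonzero, so the indicator matrix N
    // never needs an explicit representation.
    public void expand(double[] eps, double[] eta) {
        for (int q = 0; q < J.length; q++) {
            eps[J[q]] = sigma[q];
            eta[J[q]] = 1.0;
        }
    }
}
```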
`
[Figure 2 is not reproducible in text form. It shows an example spending matrix S and its indicator matrix N for four customers, together with the index vectors J1, . . . , J4 and compressed spending vectors σ1, . . . , σ4 that represent S.]

Figure 2: Structure of the spending matrix S and normalizing matrix N (a), representation of matrix S (b).
`
Let ρi be the product-dependent scaling factor for product i, as discussed in the preceding section. The score Φij of customer j against product i is computed as:
`
Φij = ρi (Σ_{k=1..n} Aik Sjk) / (Σ_{k=1..n} Aik Njk).    (2)
`
That is, we first compute the sum of the spending of customer j on all categories, weighted by the affinity of each category with product i (Σ_{k=1..n} Aik Sjk). We then divide it by the sum of all affinities of product i with categories for which customer j has any spending (Σ_{k=1..n} Aik Njk). The result is an affinity-weighted average of the spending habits of customer j, and therefore a normalized measure of how well product i matches those habits. Finally, we multiply the result by the priority scaling factor to capture product-specific contributions in the actual Φij score.

We note that Σ_{k=1..n} Aik Sjk is the dot-product of row i of A by column j of S^T. Similarly, Σ_{k=1..n} Aik Njk is the dot-product of row i of A by column j of N^T. Let Λp(ρ1, ρ2, . . . , ρm) denote an m × p matrix Λ with Λij = ρi, 1 ≤ i ≤ m, 1 ≤ j ≤ p. Then, Equation (2) can be rewritten as a matrix operation:
`
Φ = Λp(ρ1, ρ2, . . . , ρm) ∗ (A × S^T) ÷ (A × N^T)    (3)
`
`where × denotes matrix multiplication, ÷ denotes two-dimensional array (or element-by-element) division, and ∗
`denotes two-dimensional array (element-by-element) multiplication.
Using the representations of A and S discussed previously, the scores for a customer j against all products in a product group i can be computed in the following way. First, two vectors ε and η of size n are initialized to 0:

ε[1 : n] = 0,  η[1 : n] = 0.    (4)

Then, the compressed vector σj is expanded by assigning its values to the elements of ε indexed by Jj. (This is a scatter operation.) Similarly, an expanded representation of the j-th row of N is stored into η:

ε[Jj] = σj,  η[Jj] = 1.    (5)

Now, the scores of customer j against products fi to li can be computed by:

Φ[fi : li, j] = Λ1(ρfi, ρfi+1, . . . , ρli) ∗ (Gi × ε[Ii]) ÷ (Gi × η[Ii])    (6)

where × now denotes matrix-vector multiplication, and ÷ and ∗ denote one-dimensional array (element-by-element) division and multiplication, respectively. Note that we have two gather operations, one each in ε[Ii] and η[Ii]. If we group several customers together, forming a block from customer j1 to customer j2, the operations in Equation (6) become dense matrix multiplication, array division, and array multiplication:

Φ[fi : li, j1 : j2] = Λ_{j2−j1+1}(ρfi, ρfi+1, . . . , ρli) ∗ (Gi × ε[Ii, j1 : j2]) ÷ (Gi × η[Ii, j1 : j2]).    (7)
`
`Matrix multiplication can be implemented more efficiently than matrix-vector multiplication. This is particularly true
`on cache-based RISC microprocessors, since matrix multiplication can be organized through blocking to better exploit
`the memory hierarchy.
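
The per-group score computation of Equation (6) can be sketched in plain Java as follows. This is our own illustrative code with 0-based indices; G, I, E, H, and rho are hypothetical names standing in for Gi, Ii, ε, η, and ρ:

```java
// Illustrative sketch of Equation (6): scores of one customer against the
// products of one group (names and 0-based indexing are our own choices).
public class GroupScore {
    // G:   dense gi x len(Ii) affinity matrix of one product group
    // I:   index vector for the group's nonzero pattern
    // E,H: expanded spending and indicator vectors of one customer
    // rho: product-dependent scaling factors; f: index of the group's first product
    public static double[] score(double[][] G, int[] I, double[] E,
                                 double[] H, double[] rho, int f) {
        double[] phi = new double[G.length];
        for (int l = 0; l < G.length; l++) {
            double num = 0.0, den = 0.0;
            for (int k = 0; k < I.length; k++) {
                num += G[l][k] * E[I[k]];   // gather from expanded spending vector
                den += G[l][k] * H[I[k]];   // gather from expanded indicator vector
            }
            phi[l] = rho[f + l] * num / den;
        }
        return phi;
    }
}
```

Blocking customers together, as in Equation (7), turns the two inner dot-products into the matrix multiplications that dominate the runtime.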
`
`3 The Array package
`
`Array and matrix operations are building blocks for many numerically intensive applications, including our data min-
`ing code. To address various issues in coding these applications in Java, we have developed an Array package. The
`goal of this package is to introduce in Java the functionality and performance usually associated with Fortran arrays.
`In this section we first discuss the basic features of multidimensional Arrays in the package. (We use the term Array,
`with a capital first A, to denote the multidimensional structures implemented by the Array package.) We then proceed
`to describe the BLAS class of the Array package, which implements several matrix operations, including the matrix-
`multiply we use in the data mining code. Finally, we discuss how to exploit the parallelism that is implicit in array and
`matrix operations.
`
`3.1 The basic package
`
`The Array package for Java is a class library that implements true multidimensional rectangular Arrays. Each class
`name is of the form <type>Array<rank>, where <type> is the type of the Array elements and <rank> is the rank
`(number of axes or dimensions) of the Array. Supported types include all the primitive Java types as well as complex
`numbers. Supported ranks go from 1D (one-dimensional) to 7D (seven-dimensional). We have, however, currently
`implemented only one- through three-dimensional Arrays. A k-dimensional Array A is characterized by its shape
`vector (n0, n1, . . . , nk−1), where nj is the extent of Array A along its j-th axis. The shape of an Array is defined at
`creation time and is immutable during the lifetime of the Array. In this discussion we focus on classes doubleArray1D
`and doubleArray2D (one- and two-dimensional arrays of doubles), since these are the two kinds of arrays used in our
`data mining code. Regular ranges of indices for arrays are implemented by Range objects. Irregular ranges of indices
`are implemented by Index objects. These objects have properties that facilitate bounds checking optimizations.
`
`
`
`
`Java does not directly support multidimensional arrays. Instead, it provides arrays of arrays. To understand the
`difference between a Java double[ ][ ] array of arrays (hereafter called Java arrays) and a doubleArray2D, consider
`the example of Figure 3. Both X and Y are of type double[ ][ ], while Z and T are of type doubleArray2D. The
`structure of Java arrays is fragmented into rows. Accessing an element (e.g., X[4][2]) requires both pointer chasing
`(to find row X[4]) and indexing (to find element X[4][2]). Java arrays are not rectangular structures, since different
`rows of X can have different lengths. Also, a row of an array can be aliased to other rows of the same or another array.
(Rows X[1] and X[2] are aliased to each other in Figure 3(a); the same is true for rows X[4] and Y[3].) Finally, the
`shape of a Java array can be changed at run time. For example, the operation X[4] = new double[n] associates a new
`row with X[4]. This operation can even be performed by one thread while another thread is concurrently performing
`a computation with X. All these properties give great flexibility to Java arrays, but they also hinder performance.
`Aliasing disambiguation becomes extremely difficult for elements of Java arrays, and determining bounds of (pseudo)
`multidimensional Java arrays is, in general, an expensive operation.
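
These properties are easy to demonstrate. The following minimal, self-contained example (our own, not from the paper's code) shows jagged rows, row aliasing, and run-time reshaping:

```java
// Small demonstration of the properties of Java double[][] arrays described
// above: rows of different lengths, aliased rows, and a shape that can
// change while the array is in use.
public class JaggedDemo {
    public static void main(String[] args) {
        double[][] x = new double[3][];
        x[0] = new double[4];
        x[1] = new double[2];        // rows may have different lengths
        x[2] = x[1];                 // rows may alias each other
        x[2][0] = 7.0;               // write through the alias...
        System.out.println(x[1][0]); // prints 7.0: ...is visible through x[1]
        x[1] = new double[8];        // the shape can change at run time
    }
}
```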
`
[Figure 3 is not reproducible in text form. Panel (a) shows two double[ ][ ] arrays X and Y as arrays of row pointers, with jagged and mutually aliased rows; panel (b) shows two doubleArray2D objects Z and T with rectangular storage.]

Figure 3: Examples of different array types: (a) double[ ][ ], (b) doubleArray2D.
`
`The multidimensional Arrays in the Array package fix many of the performance problems associated with Java
`arrays. Although the storage layout is not visible from outside the classes, it can be implemented in a contiguous
`format to avoid fragmentation. Pointer chasing is not necessary to access elements, since the location of an element
`in storage can be found through index arithmetic. Rows cannot be aliased to each other and are all of the same
`length. Another important benefit of the Array package is that it allows bounds checking optimizations [15, 16] to
`be performed without privatizing the array descriptors. Because Java requires a precise exception to be thrown for
`any null-pointer dereference or any out-of-bounds array access, good bounds checking optimization is necessary for
`compilers to perform any transformation that alters data access orders. Finally, the rectangular nature of the data
`structures lends itself nicely to parallelization.
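
For illustration, a toy version of a rectangular 2-D Array with contiguous storage and index arithmetic might look as follows. This is our own sketch, not the actual Array package implementation (whose storage layout is deliberately hidden):

```java
// Toy illustration of the storage scheme described above: one contiguous
// block, an immutable shape, and element access by index arithmetic instead
// of pointer chasing through row objects.
public class Dense2D {
    private final double[] x;  // contiguous storage, no per-row objects
    private final int m, n;    // shape, fixed at creation time

    public Dense2D(int m, int n) {
        this.m = m;
        this.n = n;
        this.x = new double[m * n];
    }
    public int size(int axis) { return (axis == 0) ? m : n; }
    public double get(int i, int j) { return x[i * n + j]; }
    public void set(int i, int j, double v) { x[i * n + j] = v; }
}
```

Because rows cannot alias each other and the shape never changes, alias disambiguation reduces to comparing the underlying storage references, which is the constant-time test discussed below.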
`To illustrate differences in coding styles, we show in Figure 4 an example of how a user might code a matrix
`multiply routine using (a) Java arrays and (b) multidimensional Arrays. The two versions are very similar in structure,
`but they require different optimization strategies. The fixed shapes of a, b, and c in Figure 4(b) allow the bounds
`checking optimizations to be performed without any need for privatization. The high order transformations described
`in [16] can also be applied if we can disambiguate between a, b, and c. (Note that, in Java, a, b, and c are really
`pointers to doubleArray2D objects.) Disambiguation between a, b, and c in Figure 4(b) can be done with a constant
`time test. All we need to show is that c 6= a and c 6= b. (To be precise, we have to show that c.x 6= a.x and c.x 6= b.x,
`where x is an internal field of our array classes which points to the data storage area.) Disambiguation in Figure 4(a)
`requires disambiguation for each and every row of a, b, and c. (We need to show that c[i] 6= a[j] and c[i] 6= b[j] ∀i, j
`and that c[i] 6= c[j] ∀i, j i 6= j.)
`Some of the design principles of the Array package can be understood by examining the method divide of
`class doubleArray2D. This method (shown in Figure 5) computes the array division c = a ÷ b (which would be
`coded in Java as c = a.divide(b)). Before performing any computation, we check that b has the same shape
as a. If the shapes are not the same, we exit the method with an exception. The accesses to the Array elements in the actual division operation are safe with respect to Array bounds. Note that the range of i is from 0 to m − 1
`and the range of j is from 0 to n − 1. Furthermore, a.size(0) = b.size(0) = c.size(0) = m and
`a.size(1) = b.size(1) = c.size(1) = n. This property is used to optimize bounds checking during the
`computation, according to the techniques described in [8, 9].
`
`
`
`
void matmul(double[ ][ ] a,
            double[ ][ ] b,
            double[ ][ ] c,
            int m, int n, int p) {
  for (int i = 0; i < m; i++)
    for (int j = 0; j < p; j++)
      for (int k = 0; k < n; k++)
        c[i][j] += a[i][k] * b[k][j];
}

(a)

void matmul(doubleArray2D a,
            doubleArray2D b,
            doubleArray2D c,
            int m, int n, int p) {
  for (int i = 0; i < m; i++)
    for (int j = 0; j < p; j++)
      for (int k = 0; k < n; k++)
        c.set(i, j, c.get(i, j) + a.get(i, k) * b.get(k, j));
}

(b)

Figure 4: Matrix-multiply code: (a) with double[ ][ ], (b) with doubleArray2D.
`
public doubleArray2D divide(doubleArray2D b)
    throws NonconformingArrayException {

  doubleArray2D a = this;
  doubleArray2D c = null;

  int m = a.size(0);
  int n = a.size(1);

  if (m != b.size(0)) throw new NonconformingArrayException();
  if (n != b.size(1)) throw new NonconformingArrayException();

  c = new doubleArray2D(m,n);

  for (int i=0; i<m; i++) {
    for (int j=0; j<n; j++) {
      c.set(i,j,a.get(i,j)/b.get(i,j));
    }
  }
  return c;
}

Figure 5: Implementation of Array division.
`
`3.2 The BLAS class
`
`The Array package provides a Basic Linear Algebra Subprograms (BLAS) [7] class as a facility for writing efficient
`linear algebra codes. BLAS implements a variety of elementary vector-vector (level 1), matrix-vector (level 2), and
`matrix-matrix (level 3) operations. Note that this is a 100% Java implementation of BLAS, and not an interface to
`already existing native libraries. We chose this approach for two main reasons: (i) to demonstrate that Java can achieve
`high performance in numerical computing and (ii) to have a completely portable parallel implementation of BLAS that
`works well with the Array package.
`The most important of the many BLAS operations, from a performance perspective, is dgemm, which implements
the generic matrix multiplication:

C = α A* B* + β C    (8)

where A, B, and C are matrices, α and β are scalars, A* can be either A or A^T, and B* can be either B or B^T. Linear
`algebra codes are often written to maximize the use of dgemm, since this is typically the highest performing operation
in BLAS. Matrix multiplication is also the main operation in our scoring code, as described in Section 2.1. Because of its importance, we focus this discussion on the implementation of dgemm.
`A simplified version of dgemm from the BLAS class is shown in Figure 6. First, the Arrays a and b are transposed,
`if necessary, as indicated by the two flags transa and transb. This transposition is performed in place at negligible
`cost (no data copying) and allows us to use just one version of code for the actual multiplication. Following the design
`principles for the Array package, all Arrays are tested for conformity before we start any computation. The compu-
`tational loop nest is guaranteed not to throw any exceptions and therefore can be optimized extensively. (For clarity,
`we omit the code that performs runtime alias disambiguation. If c is aliased to a or b, the result is first computed
`into a temporary Array, to guarantee proper semantics.) In the actual code for dgemm (not shown here), the loops are
`blocked and unrolled for better memory behavior, register allocation, and instruction-level parallelism [16]. This leads
`to good sequential performance, as discussed below. The blocked structure is also used for internal parallelization of
`dgemm, using the approach described in [16].
`
public static void dgemm(int transa, int transb, double alpha,
                         doubleArray2D a, doubleArray2D b,
                         double beta, doubleArray2D c)
    throws NonconformingArrayException {

  if (transa == Transpose) a = a.permuteAxes(1,0);
  if (transb == Transpose) b = b.permuteAxes(1,0);

  int m = a.size(0);
  int n = a.size(1);
  int p = b.size(1);

  if (n != b.size(0)) throw new NonconformingArrayException();
  if (p != c.size(1)) throw new NonconformingArrayException();
  if (m != c.size(0)) throw new NonconformingArrayException();

  for (int i=0; i<m; i++) {
    for (int j=0; j<p; j++) {
      double s = 0;
      for (int k=0; k<n; k++) {
        s += a.get(i,k)*b.get(k,j);
      }
      c.set(i,j,alpha*s+beta*c.get(i,j));
    }
  }
}

Figure 6: An implementation of dgemm using the Array classes.
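
The blocking used in the actual dgemm can be illustrated on plain double[][] arrays. The sketch below is our own (it blocks but does not unroll, and the block size 64 is an assumption), not the tuned code from the BLAS class:

```java
// Illustrative cache blocking for C += A*B: iterate over B x B tiles so that
// the working set of each tile fits in cache, reusing a[i][k] across the
// innermost loop.
public class BlockedMatmul {
    public static void matmul(double[][] a, double[][] b, double[][] c) {
        final int B = 64;  // assumed block size, not tuned
        int m = a.length, n = b.length, p = b[0].length;
        for (int ii = 0; ii < m; ii += B)
            for (int kk = 0; kk < n; kk += B)
                for (int jj = 0; jj < p; jj += B)
                    for (int i = ii; i < Math.min(ii + B, m); i++)
                        for (int k = kk; k < Math.min(kk + B, n); k++) {
                            double aik = a[i][k];  // kept in a register
                            for (int j = jj; j < Math.min(jj + B, p); j++)
                                c[i][j] += aik * b[k][j];
                        }
    }
}
```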
`
`Figure 7(a) shows the performance achieved by our Java dgemm implementation executing on an SMP machine
`with 4 × 332 MHz PowerPC 604e processors. For comparison, Figure 7(b) shows the performance for the highly
`optimized dgemm in ESSL (note the different scales). We report Mflops achieved when all three matrices are of size
`n×n, for different values of n. The circles represent results for a single threaded execution, while the crosses represent
`results for an execution with four threads. The single threaded Java dgemm execution achieves consistent performance
`(∼ 180 Mflops) throughout the entire range of problem sizes. The performance of the multithreaded Java execution
`scales rapidly with the problem size in the lower part of that range, and later stabilizes at approximately 3.5 times the
`single threaded performance. We note that for small problem sizes (n = 25 in particular), the multithreaded execution
`actually does worse in terms of performance. With such a small problem size, there really is only one block of matrix
`C to be computed and all the work is done by one thread. Since we still pay the overhead to start and terminate
`execution on all four threads, we see a performance degradation. From the results with n = 25 we can estimate the
`overhead for starting and terminating parallel execution at approximately 100 microseconds. Performance of Java
`dgemm is approximately half of ESSL dgemm. (Which, in turn, achieves approximately 50% of the peak machine
`performance of 664 Mflops per processor.) This is not surprising, since Java cannot use the fma instruction in the
`PowerPC. The net effect of this limitation is a reduction by a factor of two of the peak machine performance, from
`664 Mflops with fmas to 332 Mflops without.
`
`
`
`
[Figure 7 is not reproducible in text form. Both panels plot Mflops versus problem size n (A, B, and C are n × n, with n up to 1000) for 1-thread and 4-thread runs on the 4 × 332 MHz PPC 604e SMP: (a) Java dgemm, with a Mflops axis up to 1000; (b) ESSL dgemm, with a Mflops axis up to 2000.]

Figure 7: Performance of dgemm in (a) the Java BLAS and (b) ESSL.
`
`3.3 Enabling implicit parallelism
`
`We now describe the master/slave mechanism used by the implicitly parallel routines. Our design principles were: (i)
`create threads only once and reuse them in different operations; and (ii) allow the master to control the assignment
of work to helper threads. The first principle reduces the overhead of excessive thread creation and destruction. The
`second principle supports better exploitation of data locality between operations.
Figure 8 shows the basic actions performed when an operation is executed in parallel. Solid boxes indicate mutex operations performed on per-thread data structures, while dashed boxes indicate mutex operations performed on a global shared data structure. Solid arcs indicate flow of control, and dashed arcs indicate notify events. The master
`thread initially creates the number of helper threads specified by an initialization call from the application program.
`It then begins executing the program. Each newly created helper thread marks itself as idle, and waits for work on a
`synchronization object associated with it. This allows the Java notify method to be used to restart a specific thread.
`When the master thread encounters a parallel operation, it computes T , the desired number of threads to perform
`the operation. It then notifies helper threads 1, . . . , T −1 to start work. All of the threads, including the master, perform
`their part of the computation. When each thread finishes work on the parallel operation, it signals that by decrementing
`a shared counter. If the master thread is not the last thread to finish work (as can be determined by examining if the
`shared counter is zero) it marks itself as idle and waits on its synchronization object. If a helper thread is the last thread
`to finish, it notifies the master to continue. All helper threads, when finished with the operation, mark themselves as
`idle, and wait for the next parallel operation. When all threads have finished their part of the parallel computation, the
`master thread proceeds with the serial computation.
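
The master/helper scheme described above can be sketched as follows. This is our own minimal illustration (the class HelperPool and all names in it are hypothetical), not the Array package's internal code; for simplicity it always uses all helpers and omits the per-operation thread-count computation:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Minimal sketch of the reusable helper-thread scheme: helpers are created
// once, each waits on its own lock for work, and a shared counter is used to
// signal completion of a parallel operation.
public class HelperPool {
    private final Object[] locks;      // one synchronization object per helper
    private final boolean[] hasWork;   // guards against lost notifications
    private final AtomicInteger pending = new AtomicInteger();
    private volatile Runnable work;

    public HelperPool(int nHelpers) {
        locks = new Object[nHelpers];
        hasWork = new boolean[nHelpers];
        for (int t = 0; t < nHelpers; t++) {
            final int id = t;
            locks[t] = new Object();
            Thread helper = new Thread(() -> {
                try {
                    while (true) {
                        synchronized (locks[id]) {        // mark as idle, wait for work
                            while (!hasWork[id]) locks[id].wait();
                            hasWork[id] = false;
                        }
                        work.run();                       // do our share of the operation
                        synchronized (pending) {          // signal finished with work
                            if (pending.decrementAndGet() == 0) pending.notify();
                        }
                    }
                } catch (InterruptedException e) {
                    // shut down quietly
                }
            });
            helper.setDaemon(true);
            helper.start();
        }
    }

    public void runParallel(Runnable task) {
        work = task;
        pending.set(locks.length + 1);                    // helpers + master
        for (int t = 0; t < locks.length; t++) {
            synchronized (locks[t]) { hasWork[t] = true; locks[t].notify(); }
        }
        task.run();                                       // master performs its part too
        synchronized (pending) {
            try {
                if (pending.decrementAndGet() > 0) {
                    while (pending.get() > 0) pending.wait();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}
```

The per-helper hasWork flag plays the role of the idle marker in Figure 8: it makes the master's notify safe even if a helper has not yet reached its wait.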
`
`4 The data mining implementations
`
This section discusses five different implementations of the scoring algorithm described in Section 2.1. The first is a plain Java implementation of the scoring algorithm, that is, an implementation that does not make use of the BLAS class or Array package described in Section 3. The second implementation is our baseline Java code, which
`implements the scoring algorithm using those classes. From the baseline we derive implicitly and explicitly parallel
`Java implementations. Finally, we describe a Fortran 90 version of the scoring algorithm. The Fortran 90 version
`serves as a metric to determine how well we are doing with Java in absolute terms.
`
The plain Java implementation: In the plain Java implementation of the scoring algorithm, each Gi matrix is represented as a double[][] G and each Ii index vector is represented as an int[] I. The compressed spending
`
`
`
`
[Figure 8 is not reproducible in text form. It shows the control flow between the master thread and the helper threads: the master creates the helpers once; each helper marks itself as idle and waits; to begin a parallel operation the master notifies each needed idle helper; all threads do their share of the work and signal completion; the last thread to finish notifies the waiting master, which then continues execution of the program.]

Figure 8: Overview of the thread scheduling operations.
`
`vector σj for customer j is expanded into vectors ε and η according to Equation (5), with these vectors represented by
`a double[] E and double[] N, respectively. Scores for customer j against the products in group i are computed
`according to Equation (6) by the code shown in Figure 9. In that code, m and n are the number of rows and columns
`of G, respectively. The index fi of the first product in group i is represented by f[i] and the scaling factor ρq for
`product q is represented by rho[q]. Results are stored in a double[][] Phi structure, which represents Φ.
`
for (int l=0; l<m; l++) {
  double score = 0.0;
  for (int k=0; k<n; k++) {
    score += G[l][k]*E[I[k]];
  }
  double