High Performance Computing with the Array Package for Java:
A Case Study using Data Mining

J. E. Moreira    S. P. Midkiff    M. Gupta    R. Lawrence
{jmoreira, smidkiff, mgupta, ricklawr}@us.ibm.com
IBM T. J. Watson Research Center
P. O. Box 218
Yorktown Heights, NY 10598-0218
Abstract

This paper discusses several techniques used in developing a parallel, production-quality data mining application in Java. We started by developing three sequential versions of a product recommendation data mining application: (i) a Fortran 90 version used as a performance reference, (ii) a plain Java implementation that only uses the primitive array structures from the language, and (iii) a baseline Java implementation that uses our Array package for Java. This Array package provides parallelism at the level of individual Array and BLAS operations. Using this Array package, we also developed two parallel Java versions of the data mining application: one that relies entirely on the implicit parallelism provided by the Array package, and another that is explicitly parallel at the application level. We discuss the design of the Array package, as well as the design of the data mining application. We compare the trade-offs between performance and the abstraction level the different Java versions present to the application programmer. Our studies show that, although a plain Java implementation performs poorly, the Java implementation with the Array package is quite competitive in performance with Fortran. We achieve a single-processor performance of 109 Mflops, or 91% of Fortran performance, on a 332 MHz PowerPC 604e processor. Both the implicitly and explicitly parallel forms of our Java implementations also parallelize well. On an SMP with four of those PowerPC processors, the implicitly parallel form achieves 290 Mflops with no effort from the application programmer, while the explicitly parallel form achieves 340 Mflops.
1 Introduction
Data mining [25] is the process of discovering useful and previously unknown information in very large data bases. Applications of data mining span scientific (e.g., image processing) as well as commercial computing (e.g., fraud detection). Although typical data mining applications access large volumes of data from either flat files or relational data bases, such applications often involve a significant amount of computation above the data access layer. This computation can be floating-point intensive in the case of a neural network application, or may involve a significant number of integer operations for rule-based algorithms encountered in tree classification [14] and associations algorithms [1]. High-performance data mining applications, therefore, require both fast data access and fast computation.

In many ways Java is an ideal language to implement data mining operations. Java supports portable parallel programming on many platforms, which is useful for developing applications that will be deployed across several organizations with different computing environments. Java is clean and object-oriented, which allows programs to be easily maintained. Standard classes that support domain-specific operations can be effectively implemented in Java, as demonstrated by its large collection of graphics and database APIs. As an additional advantage, Java offers an expanding base of skilled and enthusiastic programmers.

The primary concern about Java has been its efficiency. Execution speed of computation-intensive code in Java can be up to an order of magnitude slower than the equivalent Fortran code [18], even when the best commercially available environments are used. In this paper we use a production data mining application as a case study of the performance and ease-of-use of Java. In particular, we show that the use of the Array package for Java can lead to performance figures that are comparable to what can be achieved with the best Fortran environments. Exploiting parallelism within the Array package can further lead to significant improvements in performance with no effort from the application programmer.

We contrast five implementations of the scoring phase of a production-quality data mining application. One implementation is written in Fortran 90, and is used as the reference for performance. The first Java implementation is built using plain Java (i.e., only Java and its standard libraries). The second Java implementation uses an Array package [19, 17] that we have developed, and which is being proposed as a standard by the Java Grande Forum [13]. (We emphasize that the Array package is written entirely in Java and requires absolutely no modifications to the language.)
The computational core of the program uses a BLAS [7] class that we developed as part of the Array package. This BLAS class can perform its individual operations either sequentially or in parallel, and allows programs written in a completely sequential style to exploit some parallelism. This strategy is similar to that employed by applications built upon the SMP Engineering and Scientific Subroutine Library (ESSL) [11] or parallel DB2 [10] subsystems. This second Java version can be considered both a serial version and an implicitly parallel version. The final Java implementation is explicitly parallel. It also uses the Array package, but exploits parallelism through multiple threads at the application level.

These different implementations allow us to make a fair assessment of the computational speed of Java relative to a language that is tuned for performance. They also allow us to compare the relative benefits of developing explicitly parallel code versus building a "sequential" numerical application on top of a parallel subsystem that implements individual operations in parallel. Our Java implementation with the Array package achieves 109 Mflops, or 91% of Fortran performance, on a 332 MHz PowerPC 604e processor. Parallel execution on a 4-way SMP (using the same processors) achieves 290 Mflops for the implicitly parallel version and 340 Mflops for the explicitly parallel version. These correspond to speedups of 2.7 and 3.1, respectively.

The rest of this paper is organized as follows. Section 2 gives a quick overview of the particular data mining application in our case study. Section 3 describes the Array package and the BLAS class that we have implemented in Java. Section 4 describes in some detail the different implementations of the data mining application. Section 5 presents the performance results for these implementations, and discusses the implications of the experimental data. Section 6 discusses related work in the area, and Section 7 presents our conclusions.
2 Data mining and product recommendation
The specific data mining problem considered here involves recommending new products to customers based on previous spending behavior. A number of recommendation systems [20] based on collaborative filtering [24] have been developed over the past several years, and used for recommending items such as books and music. Typically, a new user of such a recommendation system first rates different items on a numerical scale ranging from very positive to very negative. The collaborative filtering system uses this information to establish a user profile, and then finds a group of other users with similar profiles. Once this "peer group" is established, predicted ratings for new items are computed for the target user as a weighted average of the known ratings of the members of the peer group. Higher weights are given to ratings of peer-group members most similar to the target user.

Our methodology differs in several respects from collaborative filtering. As input to our product recommendation system, we have available both detailed purchase data for the customer set, as well as a product taxonomy which assigns each product to a spending category. A 3-level product taxonomy is used: individual products (e.g., "Brooke Bond Tea, 250g") are assigned to one of 2000 product subgroups (e.g., "Premium Tea Bags - Branded"), which in turn are mapped to one of 100 major product groups (e.g., "Tea"). For each customer, a personalized recommendation list is generated by sorting the list of eligible products according to a customer-specific score computed for each product. The score is defined such that higher scores indicate stronger interest in this particular product. In some commercial situations, there may be an incentive to preferentially recommend some products over others due to overstocks, supplier incentives, or increased profit margins. This information can be introduced into the recommendation strategy via product-dependent scaling factors, with magnitudes reflecting the relative priority of recommendation. Details of the scoring algorithm are provided in the following section.
2.1 The scoring algorithm
Let there be m products eligible for recommendation, p customers, and n spending categories. Each product i has an affinity vector $A_i$, where $A_{ik}$ is the affinity of product i with spending category k. These affinities can be interpreted as the perceived appeal of this product to customers with participation in this spending category. They are determined by pre-computing associations [1] at the level of spending categories. The collection of affinity vectors from all products forms the $m \times n$ affinity matrix A. This matrix is sparse and is organized so that products with the same nonzero patterns in their affinity vectors are adjacent. Therefore, matrix A can be represented by a set of groups of products, as shown in Figure 1. Each group i is defined by $f_i$ and $l_i$, the indices of the rows representing the first and last products in the group, respectively. All products with the same nonzero pattern belong to the same group. The nonzero pattern for the products in group i is represented by an index vector $I_i$, where $I_{ik}$ is the position of the k-th nonzero entry for that group.
Let $\mathrm{len}(I_i)$ be the length of index vector $I_i$. Also, let $g_i = l_i - f_i + 1$ be the number of products in group i. Then the affinity matrix $G_i$ for group i is the $g_i \times \mathrm{len}(I_i)$ dense matrix with only the nonzero entries from rows $f_i$ to $l_i$ of A.
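In plain Java terms, this grouped representation amounts to a few parallel structures (a sketch; the names f, G, and I match the plain Java code in Section 4, while the aggregate layout here is our own illustration):

    // Grouped representation of the sparse affinity matrix A (sketch).
    int[] f, l;        // f[i], l[i]: first and last product rows of group i
    double[][][] G;    // G[i]: the dense g_i x len(I_i) block of nonzeros for group i
    int[][] I;         // I[i][k]: spending category of the k-th nonzero column in group i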
[Figure: an example affinity matrix A with three product groups, showing the dense blocks $G_1$, $G_2$, $G_3$ and index vectors $I_1$, $I_2$, $I_3$.]

Figure 1: Structure (a) and representation (b) of the affinity matrix A.
Each customer j has an associated spending vector $S_j$, where $S_{jk}$ is the normalized spending of customer j in category k. The collection of spending vectors of all customers forms the $p \times n$ spending matrix S. This matrix is also sparse, but customers are not organized in any particular order. (The affinity matrix is relatively static and can be built once, whereas the spending matrix may change between two executions of the scoring algorithm.) We also define a $p \times n$ normalizing matrix N, which has the same nonzero pattern as S, but with a 1 wherever S has a nonzero:

$$N_{jk} = \begin{cases} 1 & \text{if } S_{jk} \neq 0 \\ 0 & \text{if } S_{jk} = 0 \end{cases} \qquad (1)$$

The structure of matrices S and N is shown in Figure 2(a). The nonzero pattern of row j of matrix S is represented by an index vector $J_j$, where $J_{jk}$ is the position of the k-th nonzero entry for that row (customer). The actual spending values are represented by a dense vector $\sigma_j$, of the same length as $J_j$, which has only the nonzero entries for row j of S. This representation of S is shown in Figure 2(b). Matrix N does not have to be represented explicitly.
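The compressed form of S can be sketched the same way (the names J and sigma mirror the notation above; they are our illustration, not the paper's code):

    // Compressed row representation of the sparse spending matrix S (sketch).
    int[][] J;         // J[j][k]: category index of the k-th nonzero of customer j
    double[][] sigma;  // sigma[j][k]: the corresponding normalized spending value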
[Figure: example sparse matrices S and N, and the index vectors $J_j$ with value vectors $\sigma_j$ for four customers.]

Figure 2: Structure of the spending matrix S and normalizing matrix N (a), representation of matrix S (b).
Let $\rho_i$ be the product-dependent scaling factor for product i as discussed in the preceding section. The score $\Phi_{ij}$ of customer j against product i is computed as:

$$\Phi_{ij} = \rho_i \, \frac{\sum_{k=1}^{n} A_{ik} S_{jk}}{\sum_{k=1}^{n} A_{ik} N_{jk}} \qquad (2)$$

That is, we first compute the sum of the spending of customer j on all categories, weighted by the affinity of each category with product i ($\sum_{k=1}^{n} A_{ik} S_{jk}$). We then divide it by the sum of all affinities of product i with categories for which customer j has any spending ($\sum_{k=1}^{n} A_{ik} N_{jk}$). The result is an affinity-weighted average of the spending habits of customer j, and therefore a normalized measure of how well product i matches those habits. Finally, we multiply the result by the priority scaling factor to capture product-specific contributions in the actual $\Phi_{ij}$ score.

We note that $\sum_{k=1}^{n} A_{ik} S_{jk}$ in Equation (2) is the dot-product of row i of A by column j of $S^T$. Similarly, $\sum_{k=1}^{n} A_{ik} N_{jk}$ is the dot-product of row i of A by column j of $N^T$. Let $\Lambda_p(\rho_1, \rho_2, \ldots, \rho_m)$ denote an $m \times p$ matrix $\Lambda$ with $\Lambda_{ij} = \rho_i$, $1 \leq i \leq m$, $1 \leq j \leq p$.
Then, Equation (2) can be rewritten as a matrix operation:

$$\Phi = \Lambda_p(\rho_1, \rho_2, \ldots, \rho_m) * (A \times S^T) \div (A \times N^T) \qquad (3)$$

where $\times$ denotes matrix multiplication, $\div$ denotes two-dimensional array (or element-by-element) division, and $*$ denotes two-dimensional array (element-by-element) multiplication.
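Before moving to the sparse formulation, it may help to see Equation (2) written out directly. The following dense, unoptimized sketch (our illustration, not one of the implementations compared in this paper) computes the full score matrix:

    // Dense reference rendering of Equation (2):
    // Phi[i][j] = rho[i] * (sum_k A[i][k]*S[j][k]) / (sum_k A[i][k]*N[j][k]).
    static double[][] score(double[][] A, double[][] S, double[] rho) {
        int m = A.length, n = A[0].length, p = S.length;
        double[][] Phi = new double[m][p];
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < p; j++) {
                double num = 0.0, den = 0.0;
                for (int k = 0; k < n; k++) {
                    num += A[i][k] * S[j][k];
                    // N[j][k] is 1 exactly where S[j][k] is nonzero (Equation (1))
                    den += A[i][k] * (S[j][k] != 0.0 ? 1.0 : 0.0);
                }
                Phi[i][j] = rho[i] * num / den;
            }
        }
        return Phi;
    }

The point of the remainder of this section is that exploiting the sparsity and group structure of A and S turns these inner products into small dense matrix multiplications.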
Using the representations of A and S discussed previously, the scores for a customer j against all products in a product group i can be computed in the following way. First, two vectors $\varepsilon$ and $\eta$ of size n are initialized to 0:

$$\varepsilon[1:n] = 0, \quad \eta[1:n] = 0. \qquad (4)$$

Then, the compressed vector $\sigma_j$ is expanded by assigning its values to the elements of $\varepsilon$ indexed by $J_j$. (This is a scatter operation.) Similarly, an expanded representation of the j-th row of N is stored into $\eta$:

$$\varepsilon[J_j] = \sigma_j, \quad \eta[J_j] = 1. \qquad (5)$$

Now, the scores of customer j against products $f_i$ to $l_i$ can be computed by:

$$\Phi[f_i:l_i, j] = \Lambda_1(\rho_{f_i}, \rho_{f_i+1}, \ldots, \rho_{l_i}) * (G_i \times \varepsilon[I_i]) \div (G_i \times \eta[I_i]) \qquad (6)$$

where $\times$ now denotes matrix-vector multiplication, and $\div$ and $*$ denote one-dimensional array (element-by-element) division and multiplication, respectively. Note that we have two gather operations, one each in $\varepsilon[I_i]$ and $\eta[I_i]$. If we group several customers together, forming a block from customer $j_1$ to customer $j_2$, the operations in Equation (6) become dense matrix multiplication, array division, and array multiplication:

$$\Phi[f_i:l_i, j_1:j_2] = \Lambda_{j_2-j_1+1}(\rho_{f_i}, \rho_{f_i+1}, \ldots, \rho_{l_i}) * (G_i \times \varepsilon[I_i, j_1:j_2]) \div (G_i \times \eta[I_i, j_1:j_2]). \qquad (7)$$
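In plain Java, the initialization and scatter of Equations (4) and (5) take only a few lines (a sketch; E and N are the dense buffers also used by the plain Java code in Section 4, and J[j] and sigma[j] are the compressed row of S sketched earlier):

    // Equations (4) and (5): clear the dense buffers, then scatter customer j's
    // compressed spending row into them.
    java.util.Arrays.fill(E, 0.0);     // epsilon[1:n] = 0
    java.util.Arrays.fill(N, 0.0);     // eta[1:n] = 0
    for (int k = 0; k < J[j].length; k++) {
        E[J[j][k]] = sigma[j][k];      // epsilon[J_j] = sigma_j
        N[J[j][k]] = 1.0;              // eta[J_j] = 1
    }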
Matrix multiplication can be implemented more efficiently than matrix-vector multiplication. This is particularly true on cache-based RISC microprocessors, since matrix multiplication can be organized through blocking to better exploit the memory hierarchy.
3 The Array package
Array and matrix operations are building blocks for many numerically intensive applications, including our data mining code. To address various issues in coding these applications in Java, we have developed an Array package. The goal of this package is to introduce in Java the functionality and performance usually associated with Fortran arrays. In this section we first discuss the basic features of multidimensional Arrays in the package. (We use the term Array, with a capital first A, to denote the multidimensional structures implemented by the Array package.) We then proceed to describe the BLAS class of the Array package, which implements several matrix operations, including the matrix-multiply we use in the data mining code. Finally, we discuss how to exploit the parallelism that is implicit in array and matrix operations.

3.1 The basic package

The Array package for Java is a class library that implements true multidimensional rectangular Arrays. Each class name is of the form <type>Array<rank>, where <type> is the type of the Array elements and <rank> is the rank (number of axes or dimensions) of the Array. Supported types include all the primitive Java types as well as complex numbers. Supported ranks go from 1D (one-dimensional) to 7D (seven-dimensional). We have, however, currently implemented only one- through three-dimensional Arrays. A k-dimensional Array A is characterized by its shape vector $(n_0, n_1, \ldots, n_{k-1})$, where $n_j$ is the extent of Array A along its j-th axis. The shape of an Array is defined at creation time and is immutable during the lifetime of the Array. In this discussion we focus on classes doubleArray1D and doubleArray2D (one- and two-dimensional arrays of doubles), since these are the two kinds of arrays used in our data mining code. Regular ranges of indices for arrays are implemented by Range objects. Irregular ranges of indices are implemented by Index objects. These objects have properties that facilitate bounds checking optimizations.
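A short usage sketch follows (the constructor and the get, set, and size methods appear in Figures 5 and 6; the exact signatures here are otherwise our assumption):

    // Creating and using a two-dimensional Array (sketch).
    doubleArray2D a = new doubleArray2D(m, n);   // shape (m, n), immutable for a's lifetime
    a.set(0, 1, 3.14);                           // store an element
    double v = a.get(0, 1);                      // load an element
    int rows = a.size(0);                        // extent along axis 0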
Java does not directly support multidimensional arrays. Instead, it provides arrays of arrays. To understand the difference between a Java double[ ][ ] array of arrays (hereafter called Java arrays) and a doubleArray2D, consider the example of Figure 3. Both X and Y are of type double[ ][ ], while Z and T are of type doubleArray2D. The structure of Java arrays is fragmented into rows. Accessing an element (e.g., X[4][2]) requires both pointer chasing (to find row X[4]) and indexing (to find element X[4][2]). Java arrays are not rectangular structures, since different rows of X can have different lengths. Also, a row of an array can be aliased to other rows of the same or another array. (Rows X[1] and X[2] are aliased to each other in Figure 3(a); the same is true for rows X[4] and Y[3].) Finally, the shape of a Java array can be changed at run time. For example, the operation X[4] = new double[n] associates a new row with X[4]. This operation can even be performed by one thread while another thread is concurrently performing a computation with X. All these properties give great flexibility to Java arrays, but they also hinder performance. Aliasing disambiguation becomes extremely difficult for elements of Java arrays, and determining bounds of (pseudo) multidimensional Java arrays is, in general, an expensive operation.
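A few lines of code make these hazards concrete (our illustration of the situation in Figure 3(a)):

    // Java arrays of arrays are neither rectangular nor alias-free.
    double[][] X = new double[5][];
    double[][] Y = new double[4][];
    X[0] = new double[3];
    X[1] = new double[7];            // rows may have different lengths
    X[2] = X[1];                     // X[1] and X[2] are aliased
    Y[3] = X[4] = new double[5];     // rows aliased across two different arrays
    X[4] = new double[9];            // the shape of X can change at run time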
[Figure: (a) a fragmented double[ ][ ] structure with ragged and aliased rows; (b) a rectangular doubleArray2D.]

Figure 3: Examples of different array types: (a) double[ ][ ], (b) doubleArray2D.
The multidimensional Arrays in the Array package fix many of the performance problems associated with Java arrays. Although the storage layout is not visible from outside the classes, it can be implemented in a contiguous format to avoid fragmentation. Pointer chasing is not necessary to access elements, since the location of an element in storage can be found through index arithmetic. Rows cannot be aliased to each other and are all of the same length. Another important benefit of the Array package is that it allows bounds checking optimizations [15, 16] to be performed without privatizing the array descriptors. Because Java requires a precise exception to be thrown for any null-pointer dereference or any out-of-bounds array access, good bounds checking optimization is necessary for compilers to perform any transformation that alters data access orders. Finally, the rectangular nature of the data structures lends itself nicely to parallelization.

To illustrate differences in coding styles, we show in Figure 4 an example of how a user might code a matrix multiply routine using (a) Java arrays and (b) multidimensional Arrays. The two versions are very similar in structure, but they require different optimization strategies. The fixed shapes of a, b, and c in Figure 4(b) allow the bounds checking optimizations to be performed without any need for privatization. The high order transformations described in [16] can also be applied if we can disambiguate between a, b, and c. (Note that, in Java, a, b, and c are really pointers to doubleArray2D objects.) Disambiguation between a, b, and c in Figure 4(b) can be done with a constant-time test. All we need to show is that $c \neq a$ and $c \neq b$. (To be precise, we have to show that $c.x \neq a.x$ and $c.x \neq b.x$, where x is an internal field of our array classes which points to the data storage area.) Disambiguation in Figure 4(a) requires disambiguation for each and every row of a, b, and c. (We need to show that $c[i] \neq a[j]$ and $c[i] \neq b[j]$ for all i, j, and that $c[i] \neq c[j]$ for all $i \neq j$.)

Some of the design principles of the Array package can be understood by examining the method divide of class doubleArray2D. This method (shown in Figure 5) computes the array division $c = a \div b$ (which would be coded in Java as c = a.divide(b)). Before performing any computation, we check that b has the same shape as a. If the shapes are not the same we exit the method with an exception. The accesses to the Array elements in the actual division operation are safe with respect to Array bounds. Note that the range of i is from 0 to m − 1 and the range of j is from 0 to n − 1. Furthermore, a.size(0) = b.size(0) = c.size(0) = m and a.size(1) = b.size(1) = c.size(1) = n. This property is used to optimize bounds checking during the computation, according to the techniques described in [8, 9].
void matmul(double[][] a, double[][] b, double[][] c,
            int m, int n, int p) {
    for (int i = 0; i < m; i++)
        for (int j = 0; j < p; j++)
            for (int k = 0; k < n; k++)
                c[i][j] += a[i][k] * b[k][j];
}

(a)

void matmul(doubleArray2D a, doubleArray2D b, doubleArray2D c,
            int m, int n, int p) {
    for (int i = 0; i < m; i++)
        for (int j = 0; j < p; j++)
            for (int k = 0; k < n; k++)
                c.set(i, j, c.get(i, j) + a.get(i, k) * b.get(k, j));
}

(b)

Figure 4: Matrix-multiply code: (a) with double[ ][ ], (b) with doubleArray2D.
public doubleArray2D divide(doubleArray2D b)
        throws NonconformingArrayException {

    doubleArray2D a = this;
    doubleArray2D c = null;

    int m = a.size(0);
    int n = a.size(1);

    // b must have the same shape as a
    if (m != b.size(0)) throw new NonconformingArrayException();
    if (n != b.size(1)) throw new NonconformingArrayException();

    c = new doubleArray2D(m, n);

    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++) {
            c.set(i, j, a.get(i, j) / b.get(i, j));
        }
    }
    return c;
}

Figure 5: Implementation of Array division.
3.2 The BLAS class
The Array package provides a Basic Linear Algebra Subprograms (BLAS) [7] class as a facility for writing efficient linear algebra codes. BLAS implements a variety of elementary vector-vector (level 1), matrix-vector (level 2), and matrix-matrix (level 3) operations. Note that this is a 100% Java implementation of BLAS, and not an interface to already existing native libraries. We chose this approach for two main reasons: (i) to demonstrate that Java can achieve high performance in numerical computing and (ii) to have a completely portable parallel implementation of BLAS that works well with the Array package.

The most important of the many BLAS operations, from a performance perspective, is dgemm, which implements the generic matrix multiplication

$$C = \alpha A^\star B^\star + \beta C \qquad (8)$$

where A, B, and C are matrices, $\alpha$ and $\beta$ are scalars, $A^\star$ can be either A or $A^T$, and $B^\star$ can be either B or $B^T$. Linear algebra codes are often written to maximize the use of dgemm, since this is typically the highest performing operation in BLAS. Matrix multiplication is also the main operation in our scoring code, as described in Section 2.1. Because of its importance, we focus this discussion on the implementation of dgemm.
A simplified version of dgemm from the BLAS class is shown in Figure 6. First, the Arrays a and b are transposed, if necessary, as indicated by the two flags transa and transb. This transposition is performed in place at negligible cost (no data copying) and allows us to use just one version of code for the actual multiplication. Following the design principles for the Array package, all Arrays are tested for conformity before we start any computation. The computational loop nest is guaranteed not to throw any exceptions and therefore can be optimized extensively. (For clarity, we omit the code that performs runtime alias disambiguation. If c is aliased to a or b, the result is first computed into a temporary Array, to guarantee proper semantics.) In the actual code for dgemm (not shown here), the loops are blocked and unrolled for better memory behavior, register allocation, and instruction-level parallelism [16]. This leads to good sequential performance, as discussed below. The blocked structure is also used for internal parallelization of dgemm, using the approach described in [16].
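For example, a call computing C = alpha*A^T*B + beta*C would look roughly like the line below (the constant NoTranspose and the name of the hosting class are our assumptions; only dgemm's signature and the Transpose flag appear in Figure 6):

    // Hypothetical invocation of the BLAS-class dgemm (sketch).
    Blas.dgemm(Blas.Transpose, Blas.NoTranspose, alpha, a, b, beta, c);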
public static void dgemm(int transa, int transb, double alpha,
                         doubleArray2D a, doubleArray2D b,
                         double beta, doubleArray2D c)
        throws NonconformingArrayException {

    if (transa == Transpose) a = a.permuteAxes(1, 0);
    if (transb == Transpose) b = b.permuteAxes(1, 0);

    int m = a.size(0);
    int n = a.size(1);
    int p = b.size(1);

    if (n != b.size(0)) throw new NonconformingArrayException();
    if (p != c.size(1)) throw new NonconformingArrayException();
    if (m != c.size(0)) throw new NonconformingArrayException();

    for (int i = 0; i < m; i++) {
        for (int j = 0; j < p; j++) {
            double s = 0;
            for (int k = 0; k < n; k++) {
                s += a.get(i, k) * b.get(k, j);
            }
            c.set(i, j, alpha * s + beta * c.get(i, j));
        }
    }
}

Figure 6: An implementation of dgemm using the Array classes.
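The blocked version of the loop nest is not shown in the paper; a generic blocking of the same computation, written here with plain Java arrays for brevity, has the following shape:

    // Generic loop blocking for matrix multiply: work on B x B sub-blocks so each
    // block's working set stays in cache. B = 64 is an illustrative choice.
    static void matmulBlocked(double[][] a, double[][] b, double[][] c,
                              int m, int n, int p) {
        final int B = 64;
        for (int ii = 0; ii < m; ii += B)
            for (int jj = 0; jj < p; jj += B)
                for (int kk = 0; kk < n; kk += B)
                    for (int i = ii; i < Math.min(ii + B, m); i++)
                        for (int j = jj; j < Math.min(jj + B, p); j++) {
                            double s = c[i][j];
                            for (int k = kk; k < Math.min(kk + B, n); k++)
                                s += a[i][k] * b[k][j];
                            c[i][j] = s;
                        }
    }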
Figure 7(a) shows the performance achieved by our Java dgemm implementation executing on an SMP machine with 4 × 332 MHz PowerPC 604e processors. For comparison, Figure 7(b) shows the performance for the highly optimized dgemm in ESSL (note the different scales). We report Mflops achieved when all three matrices are of size n × n, for different values of n. The circles represent results for a single-threaded execution, while the crosses represent results for an execution with four threads. The single-threaded Java dgemm execution achieves consistent performance (∼180 Mflops) throughout the entire range of problem sizes. The performance of the multithreaded Java execution scales rapidly with the problem size in the lower part of that range, and later stabilizes at approximately 3.5 times the single-threaded performance. We note that for small problem sizes (n = 25 in particular), the multithreaded execution actually does worse in terms of performance. With such a small problem size, there really is only one block of matrix C to be computed and all the work is done by one thread. Since we still pay the overhead to start and terminate execution on all four threads, we see a performance degradation. From the results with n = 25 we can estimate the overhead for starting and terminating parallel execution at approximately 100 microseconds. Performance of Java dgemm is approximately half that of ESSL dgemm (which, in turn, achieves approximately 50% of the peak machine performance of 664 Mflops per processor). This is not surprising, since Java cannot use the fma (fused multiply-add) instruction in the PowerPC. The net effect of this limitation is a reduction by a factor of two of the peak machine performance: one fma (two floating-point operations) per cycle at 332 MHz yields a peak of 664 Mflops, whereas one floating-point operation per cycle yields only 332 Mflops.
[Figure: Mflops versus problem size n (A, B, and C are n × n) for 1 thread and 4 threads; panel (a) "Performance of Java DGEMM on 4 x 332MHz PPC 604e", panel (b) "Performance of ESSL DGEMM on 4 x 332MHz PPC 604e".]

Figure 7: Performance of dgemm in (a) the Java BLAS and (b) ESSL.
3.3 Enabling implicit parallelism
We now describe the master/slave mechanism used by the implicitly parallel routines. Our design principles were: (i) create threads only once and reuse them in different operations; and (ii) allow the master to control the assignment of work to helper threads. The first principle reduces the overhead of excessive thread creation and destruction. The second principle supports better exploitation of data locality between operations.

Figure 8 shows the basic actions performed when an operation is executed in parallel. Solid boxes indicate mutex operations performed on per-thread data structures, while dashed boxes indicate mutex operations performed on a global shared data structure. Solid arcs indicate flow of control, and dashed arcs indicate notify events. The master thread initially creates the number of helper threads specified by an initialization call from the application program. It then begins executing the program. Each newly created helper thread marks itself as idle, and waits for work on a synchronization object associated with it. This allows the Java notify method to be used to restart a specific thread.

When the master thread encounters a parallel operation, it computes T, the desired number of threads to perform the operation. It then notifies helper threads 1, . . . , T − 1 to start work. All of the threads, including the master, perform their part of the computation. When each thread finishes work on the parallel operation, it signals that by decrementing a shared counter. If the master thread is not the last thread to finish work (as can be determined by examining if the shared counter is zero) it marks itself as idle and waits on its synchronization object. If a helper thread is the last thread to finish, it notifies the master to continue. All helper threads, when finished with the operation, mark themselves as idle, and wait for the next parallel operation. When all threads have finished their part of the parallel computation, the master thread proceeds with the serial computation.

[Figure: master and helper threads exchanging work assignments and completion notifications through per-thread synchronization objects and a shared counter.]

Figure 8: Overview of the thread scheduling operations.
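A minimal sketch of this protocol in Java follows (our reconstruction from the description above and Figure 8, not the paper's code; it uses one synchronization object per helper plus a shared counter):

    // Master/helper scheme: helpers are created once, wait for work on their own
    // synchronization object, and the last thread to finish wakes the master.
    class HelperPool {
        private final Worker[] workers;
        private int pending;                    // threads still working on the current operation
        private final Object done = new Object();

        HelperPool(int nHelpers) {
            workers = new Worker[nHelpers];
            for (int t = 0; t < nHelpers; t++) {
                workers[t] = new Worker();
                workers[t].start();             // created once, reused for every operation
            }
        }

        // Called by the master: run task[0..T-1] on this thread plus T-1 helpers.
        void parallel(Runnable[] task, int T) {
            synchronized (done) { pending = T; }
            for (int t = 1; t < T; t++) workers[t - 1].assign(task[t]);
            task[0].run();                      // master performs its own share
            finished();
            synchronized (done) {               // if not last, wait for the last helper
                while (pending > 0) {
                    try { done.wait(); } catch (InterruptedException e) { return; }
                }
            }
        }

        private void finished() {               // decrement the shared counter
            synchronized (done) {
                if (--pending == 0) done.notify();  // last finisher restarts the master
            }
        }

        private class Worker extends Thread {
            private Runnable work;              // guarded by this Worker's monitor

            synchronized void assign(Runnable r) {
                work = r;
                notify();                       // restart this specific helper
            }

            public void run() {
                for (;;) {
                    Runnable r;
                    synchronized (this) {
                        while (work == null) {  // mark as idle, wait for work
                            try { wait(); } catch (InterruptedException e) { return; }
                        }
                        r = work;
                        work = null;
                    }
                    r.run();                    // perform this thread's share
                    finished();
                }
            }
        }
    }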
4 The data mining implementations
This section discusses five different implementations of the scoring algorithm described in Section 2.1. The first is a plain Java implementation of the scoring algorithm, that is, an implementation that does not make use of the BLAS class or Array package described in Section 3. The second implementation is our baseline Java code, which implements the scoring algorithm using those classes. From the baseline we derive implicitly and explicitly parallel Java implementations. Finally, we describe a Fortran 90 version of the scoring algorithm. The Fortran 90 version serves as a metric to determine how well we are doing with Java in absolute terms.
The plain Java implementation: In the plain Java implementation of the scoring algorithm, each $G_i$ matrix is represented as a double[][] G and each $I_i$ index vector is represented as an int[] I.
The compressed spending vector $\sigma_j$ for customer j is expanded into vectors $\varepsilon$ and $\eta$ according to Equation (5), with these vectors represented by a double[] E and a double[] N, respectively. Scores for customer j against the products in group i are computed according to Equation (6) by the code shown in Figure 9. In that code, m and n are the number of rows and columns of G, respectively. The index $f_i$ of the first product in group i is represented by f[i] and the scaling factor $\rho_q$ for product q is represented by rho[q]. Results are stored in a double[][] Phi structure, which represents $\Phi$.
for (int l = 0; l < m; l++) {
    double score = 0.0;
    for (int k = 0; k < n; k++) {
        score += G[l][k] * E[I[k]];       // numerator of Equation (6)
    }
    double norm = 0.0;
    for (int k = 0; k < n; k++) {
        norm += G[l][k] * N[I[k]];        // denominator of Equation (6)
    }
    Phi[f[i] + l][j] = rho[f[i] + l] * score / norm;
}

Figure 9: Plain Java code for scoring customer j against the products in group i.
