The Impact of Performance Asymmetry in Emerging Multicore Architectures

Saisanthosh Balakrishnan†    Konrad Lai‡    Mike Upton§    Ravi Rajwar‡

†Computer Sciences Department
University of Wisconsin-Madison
sai@cs.wisc.edu

‡Microarchitecture Research Lab, §Digital Enterprise Group
Intel Corporation
{ravi.rajwar, mike.upton, konrad.lai}@intel.com
Abstract

Performance asymmetry in multicore architectures arises when individual cores have different performance. Building such multicore processors is desirable because many simple cores together provide high parallel performance while a few complex cores ensure high serial performance. However, application developers typically assume computational cores provide equal performance, and performance asymmetry breaks this assumption. This paper is concerned with the behavior of commercial applications running on performance-asymmetric systems. We present the first study investigating the impact of performance asymmetry on a wide range of commercial applications using a hardware prototype. We quantify the impact of asymmetry on an application's performance variance when run multiple times, and the impact on the application's scalability.

Performance asymmetry adversely affects the behavior of many workloads. We study ways to eliminate these effects. In addition to asymmetry-aware operating system kernels, the application itself often needs to be aware of performance asymmetry for stable and scalable performance.
1. Introduction

A multicore processor provides increased total computational capability on a single chip without requiring a complex microarchitecture. As a result, simple multicore processors have better performance per watt and area characteristics than complex single-core processors. The diminishing returns on serial performance with increasingly complex cores make multicore organizations particularly attractive.

A performance-asymmetric multicore organization, where individual cores have different compute capabilities, is attractive because a few high-performance complex cores can provide good serial performance, and many low-performance simple cores can provide high parallel performance. Simple cores provide efficient use of transistors for computation in addition to meeting power and thermal budgets. Complex cores, while inefficient, provide computational power for single threads that require it. Researchers have proposed numerous such asymmetric processor organizations for power and performance efficiency, and have investigated the behavior of multi-programmed single-threaded applications on them [5, 8, 9, 11, 17].
Performance asymmetry in multicore systems breaks a long-standing assumption made by multi-threaded application developers. These developers typically assume all computational cores provide equal performance when they write their parallel algorithms and applications. However, no study has investigated the impact, if any, of computational asymmetry on the behavior of these multi-threaded applications. For example, does computational asymmetry result in unpredictable performance characteristics of a commercial server that must meet certain performance guarantees? Does the asymmetry expose an application's scalability problem that otherwise would not have manifested? Ensuring that applications run as expected on a new architecture is crucial for the architecture's adoption. Answering questions about application behavior predictability and scalability is therefore important for understanding the implications of asymmetric architectures for the software that runs on them. This paper tries to answer these questions.
We present the first study investigating the impact of performance asymmetry on the behavior of numerous multithreaded applications. We use a hardware prototype of a multiprocessor system to perform our study. We approximate performance asymmetry by varying the individual processor frequencies in a multiprocessor. Varying frequency is an effective way to create the property of performance asymmetry on real hardware.
We focus on two questions:

1. Does performance asymmetry in a multiprocessor system have a negative impact on an application's performance characteristics? Can we predict the performance of an application on an asymmetric system? Does the application scale as expected, or does it experience scalability bottlenecks that were otherwise not present on a symmetric system?

2. For applications that do suffer due to performance asymmetry, what methods can help alleviate the problem? Is exposing the asymmetry to the operating system sufficient, or do the applications also need to consider asymmetry at an algorithmic level?
We establish a baseline performance behavior by studying the applications on a performance-symmetric system and ensuring their performance is predictable on such systems, and then vary individual frequencies (in ¼-fraction increments) to study the impact of performance asymmetry. We determine whether an application provides predictable performance by running it multiple times on the same asymmetric setup. We also determine whether application performance scales predictably in proportion to the total compute power in the system (even if the cores are performance asymmetric).

We then investigate ways to eliminate anomalous behaviors, if any. Does restructuring the operating system's scheduler help? Do we also need to modify the application?

Our benchmarks include commercial managed runtime servers (SPECjbb and SPECjAppServer), database servers (TPC-H), web servers (Zeus and Apache), scientific applications with closely coupled synchronization (SPEC OMP), a media application (H.264), and an application development tool (PMAKE).

Using the above workloads, the paper quantitatively makes the following four key points:
1. Performance asymmetry in systems adversely affects the predictability of a number of commercial workloads, and makes them less scalable. This effect increases with increasing concurrency.

2. An asymmetry-aware operating system helps eliminate unpredictability in some applications. In others, the application also needs to be asymmetry-aware.

3. An asymmetric multiprocessor gives higher performance than a multiprocessor in which all cores are slow, because the fast core is effective for serial portions of the threaded program.

4. Mechanisms for exchanging asymmetry information between hardware and software, and application design methods tolerant of asymmetry, need investigation.

Section 2 discusses the experimental methodology and Section 3 presents workload description and analysis. Section 4 summarizes the results, Section 5 presents related work, and Section 6 concludes.
2. Experimental methodology

Our experimental platform comprises a 4-way 2.8 GHz Intel® Xeon™ multiprocessor (Shasta series). Our benchmark under test runs on this platform. We disable Hyper-Threading in all processors through the BIOS. The system has a 2-MB unified Level-3 cache. We use the Windows Server 2003 and Linux operating systems.

Intel Xeon processors allow software to change the active duty cycle of processors for thermal management [6]. The duty cycle is the time period during which the clock signal drives the processor chip. A stop-clock mechanism disables the processor clock, during which time everything on the processor stops. This does not affect any modules outside the processor: the coherence network and memory (DRAM), for example, are unaffected, and only the processor appears to slow down. We vary the duty cycle in multiple steps: 12.5%, 25%, 37.5%, 50%, 62.5%, 75%, and 87.5%.
We developed a device driver to control asymmetry in Windows Server 2003. The driver, running in privileged mode, controls the duty cycle by changing the clock modulation register. Linux kernel 2.4 requires a new module to read and write the clock modulation register. Linux kernel 2.6 provides a thermal monitoring infrastructure and does not require the user to write to the clock modulation register directly. A process-affinity API selects specific processors.
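The paper does not reproduce the driver source. Purely as an illustrative sketch, user-level control of this kind on Linux could look roughly as follows, assuming the msr module exposes /dev/cpu/N/msr and that the clock modulation register is the IA32_CLOCK_MODULATION MSR (0x19A), whose documented layout for Xeon processors of this era has an enable bit (bit 4) and a 3-bit duty-cycle field (bits 3:1):

```c
/* Illustrative sketch: set on-demand clock modulation for one CPU
 * via the Linux msr driver, then pin the current process to a core.
 * Assumes IA32_CLOCK_MODULATION (0x19A): bit 4 = enable,
 * bits 3:1 = duty-cycle step (1..7 => 12.5%..87.5%). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static int set_duty_cycle(int cpu, unsigned step /* 1..7 */)
{
    char path[64];
    uint64_t val = (1u << 4) | ((step & 0x7u) << 1); /* enable + duty field */
    snprintf(path, sizeof path, "/dev/cpu/%d/msr", cpu);
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    /* For the msr device, the file offset selects the register number. */
    ssize_t n = pwrite(fd, &val, sizeof val, 0x19A);
    close(fd);
    return n == (ssize_t)sizeof val ? 0 : -1;
}

static int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof set, &set); /* 0 = calling process */
}

int main(void)
{
    /* Example: emulate a 2f-2s/8 configuration on a 4-way system:
     * CPUs 0-1 at full speed, CPUs 2-3 throttled to 12.5% duty. */
    if (set_duty_cycle(2, 1) || set_duty_cycle(3, 1)) {
        perror("msr write");
        return 1;
    }
    return pin_to_cpu(0); /* run the measurement thread on a fast core */
}
```

Under this encoding, duty-cycle step 1 corresponds to the 12.5% setting used for the /8 configurations below.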
Using the above duty cycle modulation to emulate performance asymmetry is effective for this paper's study because the unpredictability of performance and the limitations in scalability arise due to differences in the computation power of the individual cores, not due to communication latencies. The results therefore should hold when the communication network and latencies change.
3. Results and analysis

All applications were set up according to the rules provided by the respective organizations (SPEC and TPC). An industrial performance evaluation group ensured strict compliance. Many of these setups were multi-tier, and only the application whose behavior we were studying ran on the system described in Section 2.
We focus on two key predictability metrics:

1. Is the application's performance stable?

2. Is the application's performance scalable?

An nf-ms/scale label means n fast cores and m slow cores running at 1/scale the speed of the fast cores. The total compute power of this system is (n + m/scale); for example, 2f-2s/8 has two fast cores and two cores at one-eighth speed, for a compute power of 2 + 2/8 = 2.25 fast-core equivalents. Symmetric configurations are 4f-0s, 0f-4s/4, and 0f-4s/8, and asymmetric configurations are 3f-1s/4, 3f-1s/8, 2f-2s/4, 2f-2s/8, 1f-3s/4, and 1f-3s/8. Performance asymmetry was validated using runtimes of computationally intensive microbenchmarks.
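As a worked check of this notation (no such code appears in the paper; it is purely illustrative), the effective compute power of any configuration can be computed directly:

```c
#include <stdio.h>

/* Effective compute power of an nf-ms/scale configuration,
 * in fast-core equivalents: n + m/scale. */
static double compute_power(int n_fast, int m_slow, int scale)
{
    return n_fast + (double)m_slow / scale;
}

int main(void)
{
    printf("4f-0s   = %.3f\n", compute_power(4, 0, 1)); /* 4.000 */
    printf("2f-2s/8 = %.3f\n", compute_power(2, 2, 8)); /* 2.250 */
    printf("1f-3s/4 = %.3f\n", compute_power(1, 3, 4)); /* 1.750 */
    return 0;
}
```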
3.1 SPECjbb

SPECjbb2000 [19] is an online transaction processing business Java application emulating a three-tier system. A thread represents an individual terminal, and maps to a specific warehouse. Increasing the number of warehouses increases concurrency in the workload. A probability distribution determines queries to the system. The throughput of business operations per second is the primary performance metric. The backend database is memory resident (25 MB) and has 20 warehouses. The workload focuses on the middle tier running the Java application server.
[Figure 1. SPECjbb performance predictability. Transactions per second (thousands) vs. warehouses (1-20). (a) BEA JRockit JVM with parallel GC (3 runs) and Sun HotSpot JVM with generational concurrent GC (3 runs), both on 2f-2s/8. (b) BEA JRockit JVM with generational concurrent GC on 4f-0s (2 runs) and on 2f-2s/8 (4 runs).]
Our application server operating system is Linux 2.6, and we study two virtual machines for the application server: BEA WebLogic JRockit (8.1) and Sun HotSpot (1.4.2). The virtual machine flags were set for the highest machine optimization.

Since garbage collection (GC) is integral to managed runtime systems, we include its effect in this study. We use two garbage collectors: a parallel collector and a generational concurrent collector. The parallel collector interrupts all application threads prior to performing collection, and is well suited for high-throughput, long-running workloads. The generational concurrent collector runs concurrently with the application, reclaiming objects. This collector is well suited for applications requiring minimal pause times and those that are unaffected by the collector's interference.
3.1.1 Analysis

Figure 1(a) shows SPECjbb throughput (increasing warehouses increases concurrency) with two different virtual machines (BEA JRockit with parallel GC and Sun HotSpot with a generational concurrent GC) running on a 2f-2s/8 asymmetric configuration. Multiple runs are shown for each configuration to determine predictability. The absolute performance variance for the HotSpot configuration is higher. Minor instability exists with JRockit.

Figure 1(b) shows SPECjbb throughput (with increasing warehouses) with BEA JRockit and a generational concurrent GC (instead of the parallel GC). The new collector has a significant negative impact on application behavior. Instability in asymmetric configurations increases significantly across multiple runs, and increases as concurrency in the system increases. This increased instability is due to the concurrently running garbage collector interfering with the main application.
Figure 2(a) shows scalability and predictability (when run multiple times) as computational power is varied. For symmetric configurations (4f-0s, 0f-4s/4, and 0f-4s/8), performance decreases predictably and linearly with computational power. The workload is inherently stable and predictably scalable on a symmetric system. While asymmetric configurations (3f-1s/4, 3f-1s/8, 2f-2s/4, 2f-2s/8, 1f-3s/4, and 1f-3s/8) scale, they show significant variability (as shown by the error bars in the figure).

We investigated various potential sources of instability: scheduling, locking and synchronization, and cache thrashing.
[Figure 2. SPECjbb. (a) Scalability and predictability: average throughput (thousands of transactions per second) across configurations 4f-0s through 0f-4s/8; error bars indicate variability in throughput for multiple runs. (b) Predictability with an asymmetry-aware kernel scheduler: BEA JRockit JVM (4 runs).]
[Figure 3. SPECjAppServer. (a) Performance scalability: throughput per second (hundreds) for Manufacturing and NewOrder transactions across configurations 4f-0s through 0f-4s/8. (b) Performance predictability measured as response time: manufacturing response time (secs) at injection rates 250, 290, and 320, with Avg, 90%, and Max bars per configuration.]
We identified the operating system scheduler as the primary source of instability. Instability is often due to work imbalance among threads, and whether a computationally important thread is scheduled on a slow processor or a fast processor varies performance significantly. A kernel scheduler typically aims to keep all processors occupied with approximately the same load.

To better balance computational power, we implemented a new kernel scheduler. We made the scheduler aware of the underlying hardware's performance asymmetry.
In the new algorithm, the kernel scheduler ensures faster cores never go idle before slower cores. A process is explicitly migrated from a slow core to an idle fast core, if one is available. Figure 2(b) shows the throughput using the new kernel scheduler. The new scheduler eliminates the application instability (compare this with Figure 1, which showed significant instability for the same configuration). For SPECjbb, exposing performance asymmetry to the operating system for smart scheduling is effective in fixing instability.
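The paper describes the policy but not its implementation. The following is a minimal user-space sketch of the balancing rule, using a toy model of per-core speeds and run queues rather than actual kernel code:

```c
/* Illustrative sketch of the policy in Section 3.1.1: a fast core
 * must never go idle while a slower core still has runnable work.
 * Core speeds and run queues here are a toy model, not kernel code. */
#include <stdio.h>

#define NCPUS 4

static int speed[NCPUS]  = {8, 8, 1, 1};  /* e.g., a 2f-2s/8 system */
static int queued[NCPUS] = {0, 0, 2, 1};  /* runnable tasks per core */

static void asymmetric_rebalance(void)
{
    for (int fast = 0; fast < NCPUS; fast++) {
        if (queued[fast] > 0)
            continue;                       /* core is already busy */
        for (int slow = 0; slow < NCPUS; slow++) {
            if (speed[slow] < speed[fast] && queued[slow] > 0) {
                queued[slow]--;             /* migrate one task ...  */
                queued[fast]++;             /* ... to the idle fast core */
                printf("migrate: cpu%d -> cpu%d\n", slow, fast);
                break;
            }
        }
    }
}

int main(void)
{
    asymmetric_rebalance();
    for (int c = 0; c < NCPUS; c++)
        printf("cpu%d: speed=%d queued=%d\n", c, speed[c], queued[c]);
    return 0;
}
```

The key design point is that the migration is asymmetric: work is pulled from slow cores to idle fast cores, never the reverse, so serial phases always land on the fastest available core.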
3.1.2 Discussion

Garbage collection has a significant impact on throughput in an asymmetric system. The current study used a single-threaded concurrent garbage collector and a parallel multi-threaded non-concurrent collector. Future garbage collector designs should take underlying performance asymmetry into account to ensure stable application behavior.
3.2 SPECjAppServer

SPECjAppServer2002 [19] is a complex client/server business J2EE™ application that measures the scalability and performance of J2EE servers and EJB containers. The setup consists of three machines: a front-end driver, a middle-tier jAppServer, and a backend database server. The study focuses on the jAppServer and its interaction with asymmetry.

SPECjAppServer models four business domains, each with its own database and applications. These domains interact as needed. We focus on two of these domains: manufacturing and customer. The customer domain focuses on order processing, and the manufacturing domain handles production scheduling.
A driver generates requests for orders at a specific injection rate to the jAppServer using a pre-defined transaction mix. A complex sequence follows involving all domains, and the request must be serviced within a response time requirement. If the jAppServer cannot respond within a fixed time, the driver is informed, and the injection rate of requests is scaled down. This feedback loop is an integral part of the workload. SPEC rules require specified and actual injection rates to be identical for conformance. Our baseline setup, where all cores have equal performance, satisfies this requirement. Introducing asymmetry makes some runs non-conforming, but correct. Such runs are acceptable for this paper since they provide intuition on how asymmetry affects application behavior.
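The feedback loop belongs to the workload driver; we did not implement it. Purely as an illustration of the mechanism, a driver of this kind can be approximated by a loop that scales its injection rate down whenever the measured response time exceeds the requirement (the capacity model and back-off factor below are hypothetical):

```c
/* Toy sketch of an injection-rate feedback loop like the one the
 * SPECjAppServer driver applies. The server model is hypothetical. */
#include <stdio.h>

/* Hypothetical stand-in: response time grows once the offered rate
 * approaches what the server (capacity 300/s here) can sustain. */
static double measured_response_time(double rate)
{
    double capacity = 300.0;
    return rate < capacity ? 0.5 : 0.5 + (rate - capacity) / 10.0;
}

int main(void)
{
    double rate = 320.0;        /* specified injection rate */
    const double limit = 2.0;   /* response-time requirement, secs */

    for (int interval = 0; interval < 10; interval++) {
        double resp = measured_response_time(rate);
        if (resp > limit)
            rate *= 0.9;        /* driver backs off: the run becomes
                                   non-conforming but still correct */
        printf("interval %d: rate %.0f, resp %.2f s\n",
               interval, rate, resp);
    }
    return 0;
}
```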
The front-end driver runs on a 4-way 2.8 GHz Intel Xeon multiprocessor with 4 GB memory, and the back-end database server is Microsoft SQL Server with Windows Server 2003 running on a 4-way Pentium III multiprocessor. These machines are powerful enough to stress test the jAppServer, and are not bottlenecks. A gigabit network connects all machines.

The jAppServer uses the BEA WebLogic (8.1) application server and a BEA JRockit (8.1) virtual machine. Run parameters conforming to SPEC are used. We configure two ordering and two manufacturing agents, and the runs include a 600-second ramp-up, an 1800-second steady state, and a 300-second ramp-down.
3.2.1 Analysis

Figure 3(a) shows SPECjAppServer throughput for transactions in the manufacturing and customer (NewOrder) domains. Average throughput for 4f-0s, 3f-1s/4, and 3f-1s/8 is mostly constant, and sees a linear reduction for the remaining configurations. The workload is predictably scalable on symmetric systems. The throughput specified for the configurations 2f-2s/8, 1f-3s/4, and 1f-3s/8 actually reflects a slower injection rate, because the slow system cannot sustain a high rate and the feedback provided to the driver slows it down. The configurations 3f-1s/4 and 3f-1s/8 can sustain the specified injection rates and thus provide the same throughput.
[Figure 4. TPC-H runtime in seconds across configurations 4f-0s through 0f-4s/8, with parallelization degree 4 and optimization degree 7. (a) Runtime for the power run (all queries), 13 runs. (b) Runtime for one query (Query 3), runs 1-4.]
Throughput shown in Figure 3(a) demonstrates scalability, but does not directly demonstrate stability. We investigated numerous secondary metrics and found no instability under symmetric and asymmetric configurations. Figure 3(b) shows one such metric, the manufacturing domain response time (for three different injection rates). Three bars plot average, maximum, and 90%ile response times. The response times are not constant, mainly because of complex interactions in the system, but they scale well. The 90%ile response is close to the average, indicating that a significant number of transactions take close to the average time to complete. Further, the difference between the 90%ile and the average response times scales and is stable across the various configurations. These bars would have reflected any significant instability introduced by asymmetry.
3.2.2 Discussion

SPECjAppServer adapts to dynamic performance variability by automatically scaling back and performing load balancing. This allows for stability and prevents the system from overloading. In contrast, SPECjbb, discussed earlier, suffered from significant instability due to performance asymmetry. This suggests that application design is an important consideration when dealing with performance asymmetry. Exposing asymmetry to software developers and application optimizers can ensure application stability.
3.3 TPC-H

TPC-H [20] is a decision support benchmark consisting of 22 non-trivial queries. The queries vary in complexity and perform concurrent data updates on a common database. The database server is IBM DB2 (8.2) running on Linux 2.4. We use a memory-resident setup (1 GB with a scale factor of 1) to isolate the impact of processor execution on the database server.

The database server uses the degree of intra-query parallelization parameter to parallelize a query into sub-queries and execute them in parallel. It also uses the degree of optimization parameter to optimize the query plan and its execution. Parallelization and optimization of TPC-H queries significantly improve their performance.

We focus on TPC-H query execution time during a power run. The power run measures raw query execution time with a single active user. We discard the first few power runs, which warm up the buffer spaces.
3.3.1 Analysis

We first run the benchmark with the highest optimization degree (seven) and a parallelization degree of four. Figure 4(a) shows the runtime for the power run (where all queries are run in series to completion). Multiple such runs are shown. The symmetric configurations 4f-0s, 0f-4s/4, and 0f-4s/8 show stability across multiple runs; these points are closely clustered. TPC-H also shows good scalability as compute power is varied. However, the asymmetric configurations (3f-1s/4, 3f-1s/8, 2f-2s/4, 2f-2s/8, 1f-3s/4, and 1f-3s/8) display significant variability across multiple runs.

To understand the behavior of individual queries on performance-asymmetric systems, we looked at various individual queries. Figure 4(b) shows multiple runtimes for one specific query, query number 3. As with the power runs, symmetric configurations are stable but asymmetric configurations show significant instability.
We explicitly turned off intra-query parallelization (graph not shown). The query showed two distinct runtimes over multiple runs: one where the runtime corresponds to the fastest processor, and another where it corresponds to the slowest processor. Scheduling decisions thus affect application stability, and this information needs to be available to the DB2 server to help it make scheduling decisions.
[Figure 5. TPC-H power run, runtime in seconds across configurations 4f-0s through 0f-4s/8, runs 1-4. (a) High parallelization degree (parallelization degree 8, optimization degree 7). (b) Low optimization degree (parallelization degree 4, optimization degree 2).]
We expected that increasing the degree of parallelization might reduce asymmetry effects by forcing more load onto certain processors. Figure 5(a) shows the impact of increasing the degree of parallelization to eight. Contrary to our expectation, the runtimes for the various configurations show even more variance, at times twice the variance of the configuration with the parallelization set to four.

Since the parallelized queries execute on an asymmetric processor setup, the scheduling decisions that place parallel sub-queries on slow or fast processors over multiple runs affect the stability of the application. Our modified kernel scheduler did not fix the instability: the DB2 server controls the scheduling of query execution on server processes, which the server binds to various processors, making our kernel fix ineffective.
We approximated changes to the program itself by controlling the query optimization degree; the higher the degree, the more aggressive the query plan. The results discussed so far used the highest optimization degree. Figure 5(b) shows the impact of reducing the optimization level significantly. As expected, the runtimes slow down for the various configurations compared to the highest optimization level. However, the instability decreases significantly for the asymmetric configurations, at times by nearly a factor of 10 compared to the high query optimization.
3.3.2 Discussion

The query optimization experiment strongly suggests that the application itself, and not the operating system scheduler, contributes to the instability. This suggests exposing hardware performance asymmetry to the application. Query plan generators already take memory and disk access latencies into account when computing cost and the target plan. Incorporating performance asymmetry and the compute power of the available processors would ensure stable performance and execution behavior for these queries.
3.4 Apache and Zeus web servers

We evaluate two commonly used webservers: an open-source webserver, Apache (2.0.40), and a commercial webserver, Zeus (4.3).

Apache maintains several idle processes waiting for incoming requests. A single control process launches child processes, and these processes wait for incoming requests. Properly selecting the number of such pre-forked processes, and the maximum number of processes allowed, prevents system thrashing. A process handles a pre-defined number of requests, then terminates and is recycled. The control process also terminates excessively idle processes.

Zeus utilizes a small, fixed number of single-threaded I/O multiplexing processes, and these processes handle tens of thousands of simultaneous connections. The two process models are contrasted in the sketch below.
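Neither server's source is shown here; as a rough illustration of the two process models (with all details hypothetical), an Apache-style pre-fork design blocks each child process in accept(), while a Zeus-style design multiplexes many connections inside one process with an event interface such as epoll:

```c
/* Illustrative contrast only; neither server's real code.
 * Model A (Apache pre-fork): N children each block in accept().
 * Model B (Zeus-style): one process multiplexes all connections. */
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

void prefork_model(int listen_fd, int nchildren)
{
    for (int i = 0; i < nchildren; i++) {
        if (fork() == 0) {                        /* child process */
            for (;;) {
                int c = accept(listen_fd, 0, 0);  /* one request at a time */
                /* ... parse request, send static file ... */
                close(c);
            }
        }
    }
}

void multiplex_model(int listen_fd)
{
    int ep = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
    epoll_ctl(ep, EPOLL_CTL_ADD, listen_fd, &ev);
    for (;;) {
        struct epoll_event ready[64];
        int n = epoll_wait(ep, ready, 64, -1);
        for (int i = 0; i < n; i++) {
            if (ready[i].data.fd == listen_fd) {  /* new connection */
                int c = accept(listen_fd, 0, 0);
                struct epoll_event cev = { .events = EPOLLIN, .data.fd = c };
                epoll_ctl(ep, EPOLL_CTL_ADD, c, &cev);
            } else {
                /* ... read request, send response, close when done ... */
            }
        }
    }
}
```

The distinction matters for asymmetry: in the pre-fork model the kernel scheduler decides which core each child runs on, whereas a multiplexing process concentrates all of its work on whichever core it happens to occupy.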
We use ApacheBench to drive these webservers. We measure the processing performance for a single static file. This allows us to focus on multi-threaded webserver behavior in the presence of performance asymmetry, and removes the dependency of the results on the web-caching infrastructure that is otherwise necessary for complex setups. We emulate two modes: (a) heavy load and full utilization, processing 60 requests concurrently with up to 1,000,000 requests in total, and (b) light load, with 10 concurrent requests and up to 100,000 requests in total.
3.4.1 Analysis

Figure 6(a) shows system throughput for Apache under heavy and light load, and Figure 7 shows the same for Zeus. We plot six runs for each configuration to determine stability. Performance-symmetric configurations (4f-0s, 0f-4s/4, and 0f-4s/8) are scalable and show stability for both light and heavy load; the throughput of these runs clusters together for both Apache and Zeus.
[Figure 6. Apache throughput, requests per second across configurations 4f-0s through 0f-4s/8, 6 runs each. (a) Light load. (b) With two techniques to reduce asymmetry impact: an asymmetry-aware kernel and fine-grained threads.]
However, as can be seen from the vertical spread of the data points in Figure 6(a) for a given asymmetric configuration, Apache performance under light load on asymmetric setups is significantly unstable. The process scheduling decisions and the server design contribute to the instability. Apache forks processes in advance to handle incoming web serving requests. Since these processes are visible to the kernel, the kernel scheduler's decisions affect their scheduling. The kernel scheduler sometimes places processes on slower cores even though a faster core is available, because it is agnostic to the relative speeds of the processors.

We found Apache performance under heavy load to be stable, with no variance over multiple runs. The throughput across configurations is stable, scalable, and a function of the underlying computational power of the system, because in a throughput benchmark under heavy load each processor is always busy. Instability with performance asymmetry typically arises in throughput-oriented applications when some processors are idle. In such situations, threads may randomly be scheduled on fast or slow processors.
Unlike the Apache webserver, Zeus displays significant variance and instability for both heavily loaded and lightly loaded systems. However, Zeus provides significantly higher throughput than Apache, up to a factor of 2.5. Due to the significant instability, we cannot determine whether Zeus is predictably scalable. Since Zeus is a commercial product and we do not have access to its source code, we cannot isolate the reasons for the instability.

We replaced the Linux kernel scheduler with our modified scheduler described earlier in Section 3.1.1. Our new scheduler is aware of the performance asymmetry of the underlying system and makes scheduling decisions accordingly. Figure 6(b) shows the result for Apache with light load. As can be seen, the kernel fix solves the instability problem and the runs are now repeatable.

We also ran Zeus with our modified Linux kernel scheduler. The scheduler did not have any effect on the significant instability, suggesting that Zeus runs its own thread scheduler. This again demonstrates that simply exposing the asymmetry to the operating system is not sufficient; the webserver also needs to be aware of the asymmetry.
[Figure 7. Zeus throughput, requests per second (thousands) across configurations 4f-0s through 0f-4s/8, runs 1-6. (a) Light load. (b) Heavy load.]
[Figure 8. SPEC OMP runtimes in seconds for wupwise, swim, mgrid, applu, galgel, equake, apsi, fma3d, art, and ammp on configurations 4f-0s, 2f-2s/8, 0f-4s/4, and 0f-4s/8. (a) Unmodified source. (b) Source modified to use parallelization directives.]
3.4.2 Discussion

The degree of instability varies as the asymmetry is varied. Higher instability exists for the 3f-1s/8 and 3f-1s/4 configurations than for the 2f-2s/8 and 2f-2s/4 configurations. Introducing slight asymmetry (e.g., a system with mostly fast processors but one slow processor) seems to introduce more instability than a system in which the …