`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2126, p. 1
`
`
`
`The Impact of Performance Asymmetry in Emerging Multicore Architectures
`
`Saisanthosh Balakrishnan†
`†Computer Sciences Department
`University of Wisconsin-Madison
`sai@cs.wisc.edu
`
`Konrad Lai‡
`Mike Upton§
`Ravi Rajwar‡
`‡Microarchitecture Research Lab
`§Digital Enterprise Group
`Intel Corporation
`{ravi.rajwar, mike.upton, konrad.lai}@intel.com
`
`Abstract
Performance asymmetry in multicore architectures arises when individual cores have different performance. Building such multicore processors is desirable because many simple cores together provide high parallel performance while a few complex cores ensure high serial performance. However, application developers typically assume computational cores provide equal performance, and performance asymmetry breaks this assumption. This paper is concerned with the behavior of commercial applications running on performance-asymmetric systems. We present the first study investigating the impact of performance asymmetry on a wide range of commercial applications using a hardware prototype. We quantify the impact of asymmetry on an application's performance variance when run multiple times, and the impact on the application's scalability.
Performance asymmetry adversely affects the behavior of many workloads. We study ways to eliminate these effects. In addition to asymmetry-aware operating system kernels, the application itself often needs to be aware of performance asymmetry for stable and scalable performance.
`1. Introduction
A multicore processor provides increased total computational capability on a single chip without requiring a complex microarchitecture. As a result, simple multicore processors have better performance-per-watt and per-area characteristics than complex single-core processors. The diminishing returns on serial performance with increasingly complex cores make multicore organizations particularly attractive.
A performance-asymmetric multicore organization, where individual cores have different compute capabilities, is attractive because a few high-performance complex cores can provide good serial performance, and many low-performance simple cores can provide high parallel performance. Simple cores provide efficient use of transistors for computation in addition to meeting power and thermal budgets. Complex cores, while inefficient, provide computational power for single threads that require it. Researchers have proposed numerous such asymmetric processor organizations for power and performance efficiency, and have investigated the behavior of multi-programmed single-threaded applications on them [5, 8, 9, 11, 17].
Performance asymmetry in multicore systems breaks a long-standing assumption made by multi-threaded application developers. These developers typically assume all computational cores provide equal performance when they write their parallel algorithms and applications. However, no study has investigated the impact, if any, of computational asymmetry on the behavior of these multi-threaded applications. For example, does computational asymmetry result in unpredictable performance characteristics of a commercial server, which must meet certain performance guarantees? Does the asymmetry expose an application's scalability problem that otherwise would not have manifested? Ensuring that applications run as expected on a new architecture is crucial for the architecture's adoption. Answering questions about application behavior predictability and scalability is therefore important for understanding the implications of asymmetric architectures on the software that runs on them. This paper tries to answer these questions.
We present the first study investigating the impact of performance asymmetry on the behavior of numerous multithreaded applications. We use a hardware prototype of a multiprocessor system to perform our study. We approximate performance asymmetry by varying the individual processor frequencies in a multiprocessor. Varying frequency is an effective way to create the property of performance asymmetry on real hardware.
`We focus on two questions:
`
1. Does performance asymmetry in a multiprocessor system have a negative impact on an application's performance characteristics? Can we predict the performance of an application on an asymmetric system? Does the application scale as expected, or does it experience scalability bottlenecks that were otherwise not present on a symmetric system?

2. For applications that do suffer due to performance asymmetry, what methods can help alleviate the problem? Is exposing the asymmetry to the operating system sufficient, or do the applications also need to consider asymmetry at an algorithmic level?
`
`Proceedings of the 32nd International Symposium on Computer Architecture (ISCA’05)
`1063-6897/05 $20.00 © 2005 IEEE
`
`
`
`
We establish a baseline performance behavior by studying the applications on a performance-symmetric system and ensuring their performance is predictable on such systems, and then vary individual frequencies (in ¼-fraction increments) to study the impact of performance asymmetry. We determine whether the application provides predictable performance by running the same application multiple times on the same asymmetric setup. We also determine whether the application performance scales predictably in proportion to the total compute power in the system (even if the cores are performance asymmetric).
We then investigate ways to eliminate anomalous behaviors, if any. Does restructuring the operating system's scheduler help? Do we also need to modify the application?
Our benchmarks include commercial managed-runtime servers (SPECjbb and SPECjAppServer), database servers (TPC-H), web servers (Zeus and Apache), scientific applications with closely coupled synchronization (SPEC OMP), a media application (H.264), and an application development tool (PMAKE).
Using the above workloads, the paper quantitatively makes the following four key points:
`
1. Performance asymmetry in systems adversely affects the predictability of a number of commercial workloads, and makes them less scalable. This effect increases with increasing concurrency.

2. An asymmetry-aware operating system helps eliminate unpredictability in some applications. In others, the application also needs to be asymmetry-aware.

3. An asymmetric multiprocessor gives higher performance than a multiprocessor in which all cores are slow, because the fast core is effective for serial portions of the threaded program.

4. Mechanisms for exchanging asymmetry information between hardware and software, and application design methods tolerant of asymmetry, need investigation.
`
Section 2 discusses the experimental methodology and Section 3 presents the workload descriptions and analysis. Section 4 summarizes the results, Section 5 presents related work, and Section 6 concludes.
`2. Experimental methodology
Our experimental platform comprises a 4-way 2.8 GHz Intel® Xeon™ multiprocessor (Shasta series). Our benchmark under test runs on this platform. We disable Hyper-Threading in all processors through the BIOS. The system has a 2-MB unified Level-3 cache. We use the Windows Server 2003 and Linux operating systems.
Intel Xeon processors allow software to change the active duty cycle of processors for thermal management [6].

The duty cycle is the time period during which the clock signal drives the processor chip. A stop-clock mechanism disables the processor clock, during which time everything on the processor stops. This does not affect any modules outside the processor; e.g., the coherence network and memory (DRAM) are unaffected, and only the processor appears to slow down. We vary the duty cycle in multiple steps: 12.5%, 25%, 37.5%, 50%, 62.5%, 75%, and 87.5%.
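To first order, duty-cycle modulation scales a core's effective compute speed linearly with its active fraction. The sketch below is our illustration of that approximation, not the driver's code; real slowdowns are workload-dependent because memory outside the processor is not throttled:

```python
BASE_GHZ = 2.8  # the paper's 4-way Xeon platform

def effective_ghz(duty_cycle: float, base_ghz: float = BASE_GHZ) -> float:
    """Approximate effective speed under clock modulation: the clock
    drives the chip only during the active fraction of the duty cycle."""
    if not 0.0 < duty_cycle <= 1.0:
        raise ValueError("duty cycle must be in (0, 1]")
    return duty_cycle * base_ghz

# modulation steps in 12.5% increments, as used in the experiments
STEPS = [0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875]
```

Under this linear model, a 25% duty cycle makes a 2.8 GHz processor behave roughly like a 0.7 GHz one for compute-bound work.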
`
We developed a device driver to control asymmetry in Windows Server 2003. The driver, running in privileged mode, controls the duty cycle by changing the clock modulation register. Linux kernel 2.4 requires a new module to read and write the clock modulation register. Linux kernel 2.6 provides a thermal monitoring infrastructure and does not require the user to write to the clock modulation register directly. A process-affinity API selects specific processors.
Using the above duty-cycle modulation to emulate performance asymmetry is effective for this paper's study because the unpredictability of performance and the limitations in scalability arise from differences in the computation power of the individual cores, not from communication latencies. The results therefore should hold when the communication network and latencies change.
`3. Results and analysis
All applications were set up according to the rules provided by the respective organizations (SPEC and TPC). An industrial performance evaluation group ensured strict compliance. Many of these setups were multi-tier, and only the application whose behavior we were studying ran on the system described in Section 2.
`We focus on two key predictability metrics:
`
`1. Is the application’s performance stable?
`
`2. Is the application’s performance scalable?
An nf-ms/scale label means n fast cores and m slow cores running at 1/scale the speed of the fast cores. The total compute power of this system is (n + m/scale). Symmetric configurations are 4f-0s, 0f-4s/4, and 0f-4s/8; asymmetric configurations are 3f-1s/4, 3f-1s/8, 2f-2s/4, 2f-2s/8, 1f-3s/4, and 1f-3s/8. Performance asymmetry was validated using runtimes of computationally intensive microbenchmarks.
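The configuration labels above can be computed with directly. A minimal parsing sketch, assuming labels of the form just described (the helper name is ours):

```python
def total_compute_power(config: str) -> float:
    """Parse an 'nf-ms/scale' label (e.g. '2f-2s/8') and return the
    total compute power n + m/scale, with a fast core's speed as 1.0."""
    fast_part, slow_part = config.split("-")
    n = int(fast_part.rstrip("f"))
    if "/" in slow_part:
        m_str, scale_str = slow_part.split("/")
        m, scale = int(m_str.rstrip("s")), int(scale_str)
    else:
        m, scale = int(slow_part.rstrip("s")), 1  # '4f-0s': no slow cores
    return n + m / scale
```

For example, 2f-2s/8 yields 2 + 2/8 = 2.25 fast-core equivalents, which is why it sits between 0f-4s/4 (1.0) and 4f-0s (4.0) in the scalability plots.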
`3.1 SPECjbb
SPECjbb2000 [19] is an online-transaction-processing business Java application emulating a three-tier system. A thread represents an individual terminal and maps to a specific warehouse. Increasing the number of warehouses increases concurrency in the workload. A probability distribution determines queries to the system. The throughput of business operations per second is the primary performance metric. The backend database is memory resident
`
`
`
`
`
Figure 1. SPECjbb performance predictability. Transactions per second (thousands) vs. warehouses (1-20). (a) BEA JRockit JVM with parallel GC on 2f-2s/8 (3 runs) and Sun HotSpot JVM with generational concurrent GC on 2f-2s/8 (3 runs). (b) BEA JRockit JVM with generational concurrent GC on 4f-0s (2 runs) and on 2f-2s/8 (4 runs).
`
(25MB) and has 20 warehouses. The workload focuses on the middle tier running the Java application server.
Our application server operating system is Linux 2.6, and we study two virtual machines for the application server: BEA Weblogic JRockit (8.1) and Sun HotSpot (1.4.2). The virtual machine flags were set for the highest optimization.
Since garbage collection (GC) is integral to managed runtime systems, we include its effect in this study. We use two garbage collectors: a parallel collector and a generational concurrent collector. A parallel collector interrupts all application threads prior to performing collection, and is well suited for high-throughput long-running workloads. The generational concurrent collector runs concurrently with the application, reclaiming objects. This collector is well suited for applications requiring minimal pause times and those that are unaffected by the collector's interference.
`3.1.1 Analysis
Figure 1(a) shows SPECjbb throughput (increasing warehouses increases concurrency) with two different virtual machines (BEA JRockit with parallel GC and Sun HotSpot with a generational concurrent GC) running on a 2f-2s/8 asymmetric configuration. Multiple runs are shown for each configuration to determine predictability. The absolute performance variance for the HotSpot configuration is higher; minor instability exists with JRockit. Figure 1(b) shows SPECjbb throughput (with increasing warehouses) with BEA JRockit and a generational concurrent GC (instead of the parallel GC). The new collector has a significant negative impact on application behavior. Instability in asymmetric configurations increases significantly across multiple runs, and grows as concurrency in the system increases. This increased instability is due to the concurrently running garbage collector interfering with the main application.
Figure 2(a) shows scalability and predictability (when run multiple times) as computational power is varied. For the symmetric configurations (4f-0s, 0f-4s/4, and 0f-4s/8), performance decreases predictably and linearly with computational power. The workload is inherently stable and predictably scalable on a symmetric system. While the asymmetric configurations (3f-1s/4, 3f-1s/8, 2f-2s/4, 2f-2s/8, 1f-3s/4, and 1f-3s/8) scale, they show significant variability (as shown by the error bars in the figure).
We investigated various potential sources of instability: scheduling, locking and synchronization, and cache
Figure 2. SPECjbb. (a) Scalability & predictability: average throughput (thousands of transactions per second) across configurations 4f-0s through 0f-4s/8; error bars indicate variability in throughput for multiple runs. (b) Predictability with an asymmetry-aware kernel scheduler (BEA JRockit JVM, 4 runs).
`
`
`
`
`
Figure 3. SPECjAppServer. (a) Performance scalability: throughput per second (hundreds) for Manufacturing and NewOrder transactions across configurations 4f-0s through 0f-4s/8. (b) Performance predictability measured as response time: average, 90th-percentile, and maximum manufacturing response times (seconds) at injection rates 250, 290, and 320.
`
thrashing. We identified the operating system scheduler as the primary source of instability. Instability is often due to work imbalance among the various threads, and a computationally important thread's placement (whether it runs on a slow processor or a fast processor) can vary performance significantly. A kernel scheduler typically aims to keep all processors occupied with approximately the same load.
To better balance computational power, we implemented a new kernel scheduler that is aware of the underlying hardware's performance asymmetry. In the new algorithm, the kernel scheduler ensures faster cores never go idle before slower cores: a process is explicitly migrated from a slow core to an idle fast core, if one is available. Figure 2(b) shows the throughput using the new kernel scheduler. The new scheduler eliminates the application instability (compare with Figure 1, which showed significant instability for the same configuration). For SPECjbb, exposing performance asymmetry to the operating system for smart scheduling is effective in fixing instability.
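The policy just described can be summarized as two rules, sketched below as a toy model (function names are ours; this is not the actual kernel patch):

```python
def pick_core(idle_cores, speeds):
    """Dispatch rule: of the currently idle cores, always pick the
    fastest, so a fast core never sits idle while a slow one works."""
    return max(idle_cores, key=speeds.get)

def migrate_target(current_core, idle_cores, speeds):
    """Migration rule: move a running task from a slow core to a
    strictly faster idle core, if one is available."""
    faster = [c for c in idle_cores if speeds[c] > speeds[current_core]]
    return max(faster, key=speeds.get) if faster else current_core
```

On a 2f-2s/8 machine (relative speeds {1.0, 1.0, 0.125, 0.125}), a task running on a 1/8-speed core migrates to a fast core as soon as one goes idle, which is the behavior that removed the SPECjbb variance.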
`3.1.2 Discussion
Garbage collection has a significant impact on throughput in an asymmetric system. The current study used a single-threaded concurrent garbage collector and a parallel multi-threaded non-concurrent collector. Future garbage collector designs should take the underlying performance asymmetry into account to ensure stable application behavior.
`3.2 SPECjAppServer
SPECjAppServer2002 [19] is a complex client/server business J2EE™ application that measures the scalability and performance of J2EE servers and EJB containers. The setup consists of three machines: a front-end driver, a middle-tier jAppServer, and a backend database server. The study focuses on the jAppServer and its interaction with asymmetry.
SPECjAppServer models four business domains, each with its own database and applications. These domains interact as needed. We focus on two of these domains: manufacturing and customer. The customer domain focuses on order processing, and the manufacturing domain handles production scheduling.
A driver generates requests for orders at a specific injection rate to the jAppServer using a pre-defined transaction mix. A complex sequence follows involving all domains, and the request must be serviced within a response time requirement. If the jAppServer cannot respond within a fixed time, the driver is informed, and the injection rate of requests is scaled down. This feedback loop is an integral part of the workload. SPEC rules require specified and actual injection rates to be identical for conformance. Our baseline setup, where all cores have equal performance, satisfies this requirement. Introducing asymmetry makes some runs non-conforming, but still correct. Such runs are acceptable for this paper since they provide intuition on how asymmetry affects application behavior.
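The feedback loop can be sketched as a simple back-off controller. This is our illustration under assumed parameters (the 10% back-off and the response-time model are invented; SPEC's actual conformance rules differ in detail):

```python
def settle_injection_rate(rate, response_time, limit_s, backoff=0.9, min_rate=1):
    """Scale the injection rate down until the modeled response time
    fits the limit. `response_time(rate)` stands in for the server."""
    while rate > min_rate and response_time(rate) > limit_s:
        rate = max(min_rate, int(rate * backoff))
    return rate
```

With a toy linear model response_time(r) = r/100 and a 2.5 s limit, a specified rate of 320 settles well below 320: a non-conforming but still correct run, which is exactly the situation the asymmetric configurations produce.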
The front-end driver runs on a 4-way 2.8 GHz Intel Xeon multiprocessor with 4 GB of memory, and the back-end database server is Microsoft SQL Server with Windows Server 2003 running on a 4-way Pentium III multiprocessor. These machines are powerful enough to stress-test the jAppServer and are not bottlenecks. A gigabit network connects all the machines.
The jAppServer uses the BEA Weblogic (8.1) application server and a BEA JRockit (8.1) virtual machine. Run parameters conforming to SPEC are used. We use two ordering and two manufacturing agents, and the runs include a 600-second ramp-up, a steady state of 1800 seconds, and a 300-second ramp-down.
`3.2.1 Analysis
Figure 3(a) shows SPECjAppServer throughput for transactions in the manufacturing and customer (NewOrder) domains. Average throughput for 4f-0s, 3f-1s/4, and 3f-1s/8 is mostly constant, and sees a linear reduction for the remaining configurations. The workload is predictably scalable on symmetric systems. The throughput specified for the configurations 2f-2s/8, 1f-3s/4, and 1f-3s/8 actually reflects a slower injection rate, because the slow system cannot sustain a high rate and the feedback provided to the driver slows it down. The configurations 3f-1s/4 and 3f-1s/8 can sustain the specified injection rates and thus provide the same throughput.

Figure 4. TPC-H. (a) Runtime for power run (all queries; 13 runs). (b) Runtime for one query (Query 3; runs 1-4). Parallelization degree = 4, optimization degree = 7; runtime in seconds across configurations 4f-0s through 0f-4s/8.
The throughput shown in Figure 3(a) demonstrates scalability, but does not directly demonstrate stability. We investigated numerous secondary metrics and found no instability under symmetric and asymmetric configurations. Figure 3(b) shows one such metric, the manufacturing domain response time (for three different injection rates). Three bars plot the average, maximum, and 90th-percentile response times. The response times are not constant, mainly because of complex interactions in the system, but they scale well. The 90th-percentile response is close to the average, indicating that a significant number of transactions take about the average time to complete. Further, the difference between the 90th-percentile and the average response times scales and is stable across the various configurations. These bars would have reflected any significant instability introduced by asymmetry.
`3.2.2 Discussion
SPECjAppServer adapts to dynamic performance variability by automatically scaling back and performing load balancing. This allows for stability and prevents the system from overloading. In contrast, SPECjbb, discussed earlier, suffered from significant instability due to performance asymmetry. This suggests that application design is an important consideration when dealing with performance asymmetry. Exposing asymmetry to software developers and application optimizers can help ensure application stability.
`3.3 TPC-H
TPC-H [20] is a decision support benchmark consisting of 22 non-trivial queries. Each query has varying complexity and performs concurrent data updates on a common database. The database server used is IBM DB2 (8.2) running on Linux 2.4. We use a memory-resident setup (1 GB with a scale factor of 1) to isolate the impact of processor execution on the database server.
The database server uses the degree-of-intra-query-parallelization parameter to parallelize a query into sub-queries and execute them in parallel. It also uses the degree-of-optimization parameter to optimize the query plan and its execution. Parallelization and optimization of TPC-H queries significantly improve their performance.
We focus on TPC-H query execution time during a power run. The power run measures the raw query execution time with a single active user. We discard the first few power runs, which warm up the buffer spaces.
`3.3.1 Analysis
We first run the benchmark with the highest optimization degree (seven) and a parallelization degree of four. Figure 4(a) shows the runtime for the power run (where all queries are run in series to completion). Multiple such runs are shown. The symmetric configurations 4f-0s, 0f-4s/4, and 0f-4s/8 show stability across multiple runs; these points are closely clustered. TPC-H also shows good scalability when compute power is varied. However, the asymmetric configurations (3f-1s/4, 3f-1s/8, 2f-2s/4, 2f-2s/8, 1f-3s/4, and 1f-3s/8) display significant variability across multiple runs.
To understand the behavior of individual queries on performance-asymmetric systems, we looked at various individual queries. Figure 4(b) shows multiple runtimes for one specific query, query number 3. As with the power runs, symmetric configurations are stable but asymmetric configurations show significant instability.
We explicitly turned off intra-query parallelization (graph not shown). The query showed two distinct runtimes over multiple runs: one where the runtime corresponds to the fastest processor, and another where the runtime corresponds to the slowest processor. Thus, scheduling decisions have an impact on application stability, and this information needs to be available to the DB2 server to help it make scheduling decisions.
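With parallelization off, a query is serial, so its runtime is just its work divided by the speed of whichever core the server process lands on. A hypothetical illustration (the 30 s figure is invented):

```python
def query_runtime(work_on_fast_s: float, core_speed: float) -> float:
    """Runtime of a serial query bound to one core; speed is relative
    to a fast core (1.0)."""
    return work_on_fast_s / core_speed

# On 2f-2s/8, a query needing 30 s on a fast core takes either 30 s or
# 240 s depending on placement: the bimodal behavior described above.
```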
`
`
`
`
`
Figure 5. TPC-H power run. (a) High parallelization degree (parallelization degree = 8, optimization degree = 7). (b) Low optimization degree (parallelization degree = 4, optimization degree = 2). Runs 1-4; runtime in seconds across configurations 4f-0s through 0f-4s/8.
We expected that increasing the degree of parallelization might reduce asymmetry effects by forcing more load onto certain processors. Figure 5(a) shows the impact of increasing the degree of parallelization to eight. Contrary to our expectation, the runtimes for the various configurations show even more variance, at times twice the variance of the configuration with the parallelization set to four. Since the parallelized queries execute on an asymmetric processor setup, the scheduling decisions to run certain parallel sub-queries on slow or fast processors over multiple runs affect the stability of the application. Our modified kernel scheduler did not fix the instability: the DB2 server controls the scheduling of query execution on server processes, which the server binds to various processors, making our kernel fix ineffective.
We approximated changes to the program itself by controlling the query optimization degree; the higher the degree, the more aggressive the query plan. The results discussed so far used the highest optimization degree. Figure 5(b) shows the impact of reducing the optimization level significantly. As expected, the runtimes slow down for the various configurations compared to the highest optimization level. However, the instability has decreased significantly for the asymmetric configurations, at times by nearly a factor of 10 compared to the high query optimizations.
`3.3.2 Discussion
The query optimization experiment strongly suggests that the application itself, and not the operating system scheduler, contributes to the instability. This suggests exposing hardware performance asymmetry to the application. Query plan generators already take into account the latency of memory and disk accesses when computing the cost and the target plan. Incorporating performance asymmetry and the compute power of the available processors would help ensure stable performance and execution behavior for these queries.
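One concrete form that suggestion could take: fold the executing core's relative speed into the plan cost alongside the existing memory and disk terms. The formula and constants below are our illustration, not DB2's cost model:

```python
def plan_cost(cpu_ops, io_ops, core_speed, cpu_unit=1.0, io_unit=100.0):
    """Estimated plan cost on a core of relative speed `core_speed`
    (fast core = 1.0): CPU work inflates on slow cores, while the I/O
    term is unaffected because memory and disk are not throttled."""
    return cpu_ops * cpu_unit / core_speed + io_ops * io_unit
```

A CPU-heavy plan that is cheapest on a fast core can lose to a more I/O-leaning plan on a quarter-speed core, so a speed-aware optimizer may pick different plans for different placements.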
`
`3.4 Apache and Zeus web servers
We evaluate two commonly used webservers: an open-source webserver, Apache (2.0.40), and a commercial webserver, Zeus (4.3).
Apache maintains several idle processes waiting for incoming requests. A single control process launches child processes, and these processes wait for incoming requests. Optimally selecting the number of such pre-forked processes, and the maximum number of processes allowed, prevents system thrashing. A process handles a pre-defined number of requests, then terminates and is recycled. The control process also terminates excessively idle processes.
Zeus utilizes a small, fixed number of single-threaded I/O-multiplexing processes, and these processes handle tens of thousands of simultaneous connections.
We use ApacheBench to drive these webservers. We measure the processing performance for a single static file. This allows us to focus on multi-threaded webserver behavior in the presence of performance asymmetry, and removes the dependency of the results on the web-caching infrastructure that is otherwise necessary for complex setups. We emulate two modes: (a) heavy load and full utilization, processing 60 requests concurrently with up to 1,000,000 requests in total, and (b) light load, with 10 concurrent requests and up to 100,000 requests in total.
`3.4.1 Analysis
Figure 6(a) shows system throughput for Apache under heavy and light load, and Figure 7 shows the same for Zeus. We plot six runs for each configuration to determine stability. The performance-symmetric configurations (4f-0s, 0f-4s/4, and 0f-4s/8) are scalable and show stability for both light and heavy load; the throughput of these runs clusters together for both Apache and Zeus.
`
`
`
`
`
Figure 6. Apache throughput. (a) Light load (runs 1-6; requests per second, thousands). (b) With two techniques to reduce asymmetry impact: asymmetry-aware kernel (6 runs) and fine-grained threads (6 runs); requests per second (hundreds). Configurations 4f-0s through 0f-4s/8.
`
However, as can be seen from the vertical spread of the data points in Figure 6(a) for a given asymmetric configuration, Apache performance under light load on asymmetric setups is significantly unstable. The process scheduling decisions and the server design contribute to the instability. Apache forks processes in advance to handle incoming web-serving requests. Since these processes are visible to the kernel, the kernel scheduler's decisions affect their scheduling. Sometimes the kernel scheduler places processes on slower cores even though a faster core is available, because it is agnostic to the relative speed of the processors.
We found Apache performance under heavy load to be stable, with no variance over multiple runs. The throughput with varying configurations is stable, scalable, and a function of the underlying computational power of the system, because in a throughput benchmark under heavy load, each processor is always busy. Instability with performance asymmetry typically arises in throughput-oriented applications when some processors are idle. In such situations, threads may randomly schedule on fast or slow processors.
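The two regimes can be captured in a toy model (helper names and rates are ours): under saturation every core is always busy, so aggregate throughput simply tracks total compute power, while under light load a single request's latency depends on whichever idle core it happens to land on.

```python
def heavy_load_throughput(core_speeds, per_core_rate=1000.0):
    """All cores saturated: throughput is proportional to the sum of
    relative core speeds (total compute power)."""
    return per_core_rate * sum(core_speeds)

def light_load_latency(work_s, idle_core_speed):
    """One request served by one idle core: latency depends entirely
    on which core was idle."""
    return work_s / idle_core_speed
```

On 2f-2s/8 the heavy-load throughput is a stable 2.25x the per-core rate, but a light-load request is 8x slower when a 1/8-speed core picks it up, which matches the run-to-run spread seen with Apache.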
`
Unlike the Apache webserver, Zeus displays significant variance and instability for both heavily loaded and lightly loaded systems. However, Zeus provides significantly higher throughput than Apache, up to a factor of 2.5. Due to the significant instability, we cannot determine whether Zeus is predictably scalable. Since Zeus is a commercial product and we do not have access to its source code, we cannot isolate the reasons for the instability.
We replaced the Linux kernel scheduler with our modified scheduler described in Section 3.1.1. Our new scheduler is aware of the performance asymmetry of the underlying system and makes scheduling decisions accordingly. Figure 6(b) shows the result for Apache with light load. The kernel fix solves the instability problem, and the runs are now repeatable.
We also ran Zeus with our modified Linux kernel scheduler. The scheduler did not have any effect on the significant instability, suggesting that Zeus runs its own threading scheduler. This again demonstrates that simply exposing the asymmetry to the operating system is not sufficient; the webserver also needs to be aware of the asymmetry.
`
Figure 7. Zeus throughput. (a) Light load. (b) Heavy load. Runs 1-6; requests per second (thousands) across configurations 4f-0s through 0f-4s/8.
`
`
`
`
`
Figure 8. SPEC OMP runtimes. (a) Unmodified source. (b) Source modified to use parallelization directives. Runtime in seconds per benchmark (wupwise, swim, mgrid, applu, galgel, equake, apsi, fma3d, art, ammp) for configurations 4f-0s, 2f-2s/8, 0f-4s/4, and 0f-4s/8.
`
`3.4.2 Discussion
The degree of instability varies as the asymmetry is varied. Higher instability exists for the 3f-1s/8 and 3f-1s/4 configurations than for the 2f-2s/8 and 2f-2s/4 configurations. Introducing slight asymmetry (e.g., a system with mostly fast processors but one slow processor) seems to introduce more instability than a system in which the
`