`
`Reference 12
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2124, p. 1
`
`
`
`2017 IEEE 35th International Conference on Computer Design
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`!
`
`
`
`
`"
` #
`
`$
`%&#
`'(()*+
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
` !
`
`
`
`
`
`
`
`
` #
` #
`
`#
`
`
`#
`
`
`
`
`
`
`
` &
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
` %
`
`
`
`
`
`
`
`
`
`
`
`
` #
`
`
` "
`
`
`
`
`
`
`
`
`
`
`
`
`
` + ,
`
`
`
`
`
`#
`
` #
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`,+ ,-./!#-,/
`
`
`
`
`
`
`
` 1
`
`
`
`
`
` 0
`
`
`1
`2% 1
`0
`
`
`
`+
`
`
`
`
`
`
`2
`
`
`
`
`
`
`456+1
`
`
`
`
` 1
` 1
`
`
`
`
` 4(789:6+
`
`
`
` %
`
`
`
`45555:9;6+
`
`
`
`
`1
`
`
` 2
`
`
`
`,< 0,4=6. >2+ ,-- 4?6 -
`
`( 4:6
`
`
` 2
` 1 1
`1
`
` +
`
`
`
`
`
`
` 2
`2 1
`
`
`
`
`
` +
` 2
`
`
`
`+-
`
`
` $ 45:6 -$% 45?6
`
`45556+ , 4555:6
`
`
`1
`2
`
`
` 2
`
`
`
`
`+ -
` 456
`
` (! 2
`
`
`2
` 1
`
`
`
`
` 2 2 0
`
`
`
` @+ -
`
`
`
` 1
`
`
` 2
` 2
`
`
`
`
`
` + A
`
`
`
`
`
`
`
`
`
`
`
`
` 0
`
`2+
` %
`
`
`4595(5:6+1
`
`
`
`
`
`
`
`
`
`
`
`
`2
`
`22
`
`
`
`
`
`
`
`+$
`
`
` 11
`
`
`
`
`2
`
`
`
`
`
`
`-
`
`
`
`
`
`
`1
`
` + -
`
`
`
`
`
`
` 1
`
`
` B
`@ 2
`
`
`+
`A
`
` 2 2
`
`
`
`
`
`2
`
`
`
` 1 @
`
`+ ,
`
`
`
`
`
`
`
`
` 5C
`
`
` 9C
`
`
` 1
`
`2
`
`
` % (C
`
`
`2
`
`
`
` >
`
`
`4;895996
`
`1063-6404/17 $31.00 © 2017 IEEE
`DOI 10.1109/ICCD.2017.28
`
`129
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2124, p. 2
`
`
`
`Prior Work
`/ 0 1
`
`
`0 Thread Count: [5, 9, 21, 22]
` &'$$$($'$"##%
`Core Type: [4,8,11,17,18,19,22]
`
`
`0 Thread Count & Core Type: [9,22]
`
`) #*%
`Voltage/Frequency & Core Type: [2,3]
`+
`0 Application Behavior: [2, 3,4, 5, 9, 11, ...]
`This Work
` 0 1
`
`
`0 Thread Count & Core Type &
`
`)+
`. O Voltage/Frequency & Application Behavior
`
`.
`
`Multithreaded applications are comprised of number of
`
`
`
` 2
`
`parallel regions which are separated by serial regions. In this
`
`
` 1
`
` 2
`
`+ ,
`work, we refer to these parallel regions as Region of Interest
`1
`0 1
`
`
`
` . ,
`
`(ROI).
`In a homogenous multi-core architecture and for
`E./,C+ ,
`
`
`
`
`conventional scheduling, all of these regions are processed on
`
`
`
`
`the same core type, same voltage and fiequency, and number
`
`
`@ 2
`
`of threads,
`though not all
`regions may have the same
`
`
`
`preferences. Some of these ROIs may benefit fiom different
`
`
` + ./, 2
`
`
`configurations than the others to obtain the maximized energy-
`
`
`2%3
`
`efficiency. As mentioned earlier,
`the main challenge for
` +
`
`
`
`scheduling is to effectively tune system, architecture and
`
`
`
`application level parameters in CCA for the entire application
`
`
`
`
`
`as well as
`the intermediate parallel
`regions
`to achieve
` 1
`
`
`
`maximum energy-efficiency. In this work, similar to [1,2], we
`%
`
`0
`45961
`are assuming that big cores (composed) are constructed by
`
` 2
` E C
`
` 2
`composing multiple little (base) cores.
` E2C
`+
`Fig. 2 illustrates an overview of optimal configurations for
`$+9
`
` 1
`
`
`application selected fiom SPLASH-2 multithreaded
`an
`
`
`
`
`benchmark suite with three parallel regions. In each ROI the
`2
`0 1
`
`
`+ , ./,
`best possible configuration for core type, operating fiequency
`2 2
`
`
`
`
`@
`and thread counts that results in maximized energy-efficiency
`
`
` %3
`
`is specified. As can be seen, R011 needs to be run on the base
` + 2./,52
`2
`core with 2.4GHz fiequency and 8 threads, whereas for R012
`
`19+7"3
`@ ?
`1
`
`./,9
`in order to achieve the best energy-efficiency, we need to
`
`
` 2
`
`compose two small cores to make a big core. Moreover, the
` 1
` 0 2
`+
`
`
`optimal operating frequency increases to 2.8GHz. The ability
`
`
`@
` 9+?"3+ - 2
`to accurately predict the optimal application, system and even
`
`
`
`microarchitecture parameters and adapt them to achieve the
`
`
`
`
`
`
`maximum energy-efficiency within different parallel regions of
`%
`
`
`
`
`an application is the main motivation for this work. We
`
` 1
`0+ A
`develop several machine leaming-based approaches which are
`
`
`
` 1
`
`able to predict the optimal setting of tuning parameters and
`2
`
`
`
`change them to suit the specific requirement of each parallel
`
`@
`
`
`region within the multithreaded application.
`
`1
` +
`Fig. 3 depicts our three-stage approach for predicting
`$+ (
`
`
`
`
`
`the right core type and application configuration when running
`
`
`
`1
`
`a multithreaded application on composite cores architecture.
`
`
`
`
`+
`Our machine leaming-based approach begins fiom extracting
`/
`
`
` 2
` %
`
`microarchitectural data (referred as feature extraction), fiom
`
`
`
` E
`
`
`
` %
` C
`
`different parallel regions of application to characterize the
`
`
`
`
`
`3
`multithreaded workload. These data (or features) include the
`
` 1
`0+ - E
`
`C
`hardware performance counter data, which are representative
`
`1
`
`
`
` 1
`
`
`
`of application behavior at run-time. Next, a machine leaming-
` 2
`
`
`
`based predictor (that is built ofi-line) takes in these features
`2
`
` E 2
`
`and predicts the best configuration settings for a given parallel
`
` 2
`
`
`
`region. For this purpose, we have implemented five well-
`
`+ $
`
` 1 1
`known machine learning algorithms and compare them in
`01
`
`
`
`terms of accuracy, power and area overhead to find to the most
`
`
` 1
`
`
`
`effective learning model which yields in optimized energy-
`
` 1 3
`
`efficiency. Finally, we configure the processor and schedule
` + $ 1
`
`
`
`the application to run on the predicted configuration. In this
`
`
`
`+ ,
`work, the metric that we use to characterize energy-efficiency
`1
`0
` 1
`
`3
`
`is the Energy Delay Product (EDP) which aims to balance
`
` !
` E!C 1 2
`performance and power consumption. We construct and
`
`
` 1
` + A
`
`
`
`#
`$
`*
`0 i 0
`*
`-
`0
`Extraction
`Feature
`Predictor
`Scheduling
`—>I—>
`
`LEI—b
`/!
`Multithreaded
`+
`Processor Carrjl'g.
`
`
`
`Feature Values
`Applications
`
`
`
`
`(Core Type, Freq, #Threads)
`
`
`Fig. 3. An overview ofour approach for predicting the optimal configuration
`$+(+
` 1
`
`
`
`
`
`and scheduling the multithreaded application
`
`
`
`
`
`
`
`
`—’-—>
`
`
`
`
`
`
`
`
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2124, p. 3
`
`voltage/fiequency [2,3], or core type [4,8,11,17,18,19,22] and
` B
`@ 49(6
`
` 47?555:5?58996
`they have ignored the interplay among all these parameters.
`
`
`
`
`+
`This study indicates that these parameters individually, while
`-
`
` 1
`important, do not make a truly optimum configuration to
`
` 0
`
`
`achieve the best energy-efficiency on a CCA. The best
` 2
`
`configuration for a multithreaded application can be effectively
`
`
`
` 2
`found, only when these parameters are jointly optimized. Fig. 1
`1
`
`
`D3+$+5
`illustrates the tuning parameters influencing in scheduling
`
`
`
`
`decision in a CCA. Also, recent prior works as well as the
` +
`
`
` 1
`0 1
`contribution of our work is shown in this figure.
`
`2
`1
`01
`+
`In this paper, through methodical investigation of power
`,
`
` 1
`
`and performance, and comprehensive system and micro-
`
`
`
`
`
`architectural level analysis, we first characterize multithreaded
`
`
` 1
`
`
`3
`
`applications on CCA to understand the power and performance
`
`1
`
`
`
`trade-ofis offered by various configuration parameters and to
`
`
` 2
`
`
`
`
`find how the interplay of these parameters affects the energy-
`1
`
`
`
`
`efficiency. Our study is focused on a CCA where many little
` + /
` 1
`
`cores (base) can be configured into few big cores (composed)
`
`E2C 2
`12
`E C
`and vice versa. The experimental results support that there is
`
`+ - %
`
`
`
`
`no unique solution for the best configuration for different
` @
` 2
`
`
`
`applications. Given the
`dispersed pattern of optimum
` + "
`
`
`
`configuration, we develop various machine learning models to
`
`1
`
`
`predict the energy-efficiency of parallel regions, and guide
`
`
`
`
`
`scheduling and fine-tuning parameters to maximize the energy-
`
`
`%3
`
`efficiency. As behavior of applications changes at run-time, we
` +2
`
`
`applied our prediction and tuning method at a fine-grained
`
`
`
`
`level of individual parallel region within an application.
`
`
`1 +
`We use five well-known machine learning models to build
`A 1
`2
`predictors based on the knowledge extracted fiom an extensive
`
`
`201%
`
`%
`set of hardware performance data which are
`a good
`
`1
`
`
` 1
`
`representative of application behavior for each parallel region
`
`
` 2
`
`
`
`
`within the training phase. The models are then used at run-time
`1
`+-
`
`
`to predict the optimal processor configuration for each parallel
`
`
`
`
`
`
`
`region of a multithreaded workload to maximize the energy-
`
`
` 1
`0 %3
`
`efficiency. We analyze the proposed machine leaming-based
` + A 3
`
`
`models in terms of their prediction accuracy, power overhead
`
`
`
`
` 1
`
`
`and implementation cost to understand their cost effectiveness.
`
`
` +
`11. MOTIVATION AND OVERVIEW OF OUR APPROACH
`,,+ /-,&-,/!/&.&,A/$/#../
`
`Optimal Configurations
`l"""""""""""""""""""""""""""""""""""""""""""""""""u
`_ _ _
`_ — _ 1
`— _ _ _
`'
`l Base Core
`I
`l Composed Core I
`l Composed Core I
`g
`E
`|
`24611:
`|
`28st
`I
`|
`24GHz
`I
`:
`:
`l
`8threads
`I
`l_ 8threads
`l
`l
`8threads
`l
`i
`I
`:
`— 7 z- —
`— F —
`— 7/ ,_ —
`:
`
` ’“w4—) 4—} 4—)
`m l/
`
`ROI1
`ROI2
`ROB
`
`.u
`
`Time
`
`Fig. 2. Optimal configurations in different parallel regions ofan application
`$+9+/
`
`
`
`
`
`130
`130
`
`
`
`
`
`
`Energy
`
`Efficiency
`
`
`CE,” me
`
`l
`'3’ '
`“l
`
`
`
`
`
`
`
`
`Application
`+
`
`Behavior (l/O,
`,- 2
`
`
`CPU, ...)
`/3...
`
`Fig.1. Tuning parameters influencing energy-efficiency in composite cores
`$+5+-
`
`
`
`
`architecture Prior work on scheduling using these tuning parameters
`
`
`+
`
`1
`0
`
`+
`
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2124, p. 3
`
`
`
`Big (Composed)
`
`TABLL‘ l. ARCHITECTURAL SPECIFICATION
`
` .,--#. ,$,-,/
`- ,+
`
`
`Mieroareh. Parameter
`Base
`Composed
`+
`1
`
`
`
`
`2
`lssue-Commit width
`4
`
`,
`9
`7
`lNT instruction queue
`16 entries
`32 entries
`
`,-
` @
`5=
`
`(9
`
`FP instruction queue
`16 entries
`32 entries
`
`Reorder buffer
`32 entries
`64 entries
`$
` @
`5=
`
`(9
`
`
`.
`
`2
`
`(9
`
`=7
`
`Branch penalty
`7cyc
`14cyc
`
`
`:
`57
`iLl-dLl Cache
`16KB/4—way/2cyc
`32KB/4—way/2cyc
`
` 5
`5=M B7
`(9M B7
`L2 Cache
`4MB/8-way/32cyc
`4MB/8-way/32cyc
` 9
`7 B?
`7 B?
`
`
`
`
`
`
`
`4#
`
`Little (Base)
`4
`
`4$56
`
`4$56
`
`4$56
`
`4$56
`
`
`
`
`
`
`
`
`
`, 7
`
`4$52
`
`4$52
`
`4$52
`
`4$52
`
`Fig. 4. Conceptual structure of a four-core composite architecture
`$+7+
`
`
`
`
`
`
`
`compare five predictors using the machine learning algorithms
`
`
`
`
`
`
`described in the section V to guide the scheduling in a CCA.
`
`2 & +
`111. EXPERIMENTAL SETUP AND METHODOLOGY
`,,,+ F., - -#! -/!/ /"G
`This section provides the details of our experimental setup.
` -
`
`%
`+
`We use Sniper [13] version 6.1, a parallel, high speed and
`A
` 45(6
` =+5
`
`cycle-accurate x86 simulator for multi-core systems. McPAT
`
` %?=
`
`
` + -
`[15] is integrated with Sniper and is used to obtain power
`45;6
` 1
` 2 1
`
`consumption results. We study SPLASH-2 [14] and PARSEC
`
`+A
`[26] multithreaded benchmark suites
`for simulation. For
`49=6
` 2
`0
` + $
`
`architectural
`simulation, we modeled a
`heterogeneous
`
`
` 1
`
`composite cores architecture based on the recently proposed
`
`
`
` 2
`
`
`work in [17, 18]. Our study is focusing on a CCA where two
`1
`045:5?6+/
` 1
`1
`little cores
`(base) can be configured into one big core
`
` E2C 2
` 2
`
`(composed) and vice versa. Fig. 4 provides a conceptual
`E C
`+ $+ 7
`
`overview of a four-core CCA. In this paper, we investigate two
`
` 1
`
`+,
`1 1
`baseline heterogeneous CCAs which consist of multiple base
`2
` 1 2
`and composed cores: 1) 8base/4comp, and 2) 4base/2comp. It
`
`H 5C?2B7 9C 72B9 + ,
`is also important to note that for benchmark simulation we
`
`
` 2
`0 1
`applied the binding (one-thread—per—core) model with #threads
`2E
`
`
`C1I
`
`= #cores to maximize the performance of multithreaded
`JJ I
` %3
`
`
`
`applications [4, 5].
` 47;6+
`Table I shows the microarchitectural configuration of base
` -2,1
`
`
`
`2
`(little) and composed (big) core of CCA in our experiment. We
`EC E2C
`
`%
`+A
`collect performance counters data on each architecture for
`
`
`
`
`
`
`
`characterization and drive
`the
`scheduling and mapping
`
`
`3
`
`
`algorithms. We use these data to extract and evaluate the actual
`
`+A%
`
`behavior of applications (I/O, CPU or memory intensive) for
`2
` E,B/ #
`
` C
`
`predicting energy-efficiency and assisting scheduling decision.
`
`
`
`IV. CHARACTERIZATION RESULTS
`,&+ .-.,K-,/.# -
`In this section, we evaluate the application performance and
`, 1
`
`
`energy-efficiency sensitivity to the tuning parameters of
`
`
`
`
`operating fiequency, number of running threads, and the choice
`
`
`@ 2
`
`
`
`of microarchitectures (base vs. composed) in heterogeneous
`
`
`
` E2 + C
`
`composite cores architecture. Note that
`the entire set of
`
`
`
`+
`
`benchmark analysis results is quite extensive. Therefore, due to
`2
`0
`@% +-
`
`
`space limitation we only present the results for a limited
` 1
`
`
`
`number of representative benchmarks shown in Fig. 5. This
`2
`
`
` 2
`0 1 $+ ;+ -
`figure depicts the overall performance in terms of execution
`
`
`
`
`
` %
`time (represented as a bar graph) and EDP results (represented
`E
`
`2
`
`C!
`E
`
`
`as a line graph) for four different multithreaded benchmarks
`
`C
`
`
`
` 2
`0
`across different core types, fiequencies and number of threads.
`
`
`
`
`@ 2
`
`+
`
`
`A. Frequency Sensitivity
`For this analysis, all benchmarks were simulated using a
`$
` 2
`0 1
`
`baseline composed core running with only a single thread. The
`2
`
`1
`+-
`operating fiequency is swept fi'om 1.6 GHz to 2.8GHz with a
`
`
`@ 1
`5+="39+?"31
`step of 400MHz and the voltage is changed between 0.7, 0.8,
` 7LL 3 21L+: L+?
`0.9, and 1V, respectively. As can be seen in Fig. 5, some
`L+8 5&
` + 2 $+ ;
`benchmarks are very sensitive to changing the fiequency. For
`2
`0
`
`
`@ + $
`
` !!
`"
`
`@
`instance, in finm and cholesky reducing the fiequency almost
`linearly reduces the overall performance. Overall, as expected,
`
`
`
`
`
` +/
`%
`as
`the
`fiequency increases,
`the performance
`increases
`
`
`@
`
`
`
`
`
`
`
`accordingly. The EDP results show that a higher fiequency
`
`+ - !
` 1
`
`@
`
`results in higher EDP.
`Increasing the number of threads
`
`
` !+ ,
` 2
`
`
`interestingly reduces the sensitivity to fi'equency. In other
`
`
`
`@ + ,
`
`words, increasing the number of running threads increases the
`1
`
`2
`
`
`
`
`performance gain due to parallelization. Consequently,
`the
`
`
`
`3+ @
`overall performance as the number of threads increases is more
`
`
`
` 2
`
`
`
`
`influenced by the speedup gain as a results of parallelism rather
` 2
`
`
`
`
`than operating at higher fi'equency. Moreover, the results show
`
`
`
`@ +
`
`
`1
`that the base core is more sensitive to fiequency scaling than
` 2
`
`
`@
`the composed cores. This is also an interesting observation as
`
`+-
`2
`
`the composed core has a large pipeline, allowing it to tolerate
`
`
`1
`
`performance cost due to alterations in access latency to cache
`
`
`
`
`subsystem as a result of fi'equency scaling. Note that changing
`2
`
`@ +
`clock fiequency changes the number of cycles it takes for the
` 0
`@ 2
` 0
`
`processor to communicate with the cache.
`
`
` 1 +
`#
`
`B. Core Type Sensitivity
`In this section,
`the results are reported for a baseline
`
`,
`
`
`
`
` 2
`configuration with a core running a single thread at the highest
`
`1
`
`
`
`fiequency of 2.8 GHz and operating voltage of IV. The
`
`@ 9+? "3
` 5&+ -
`changing parameter is the core type, which varies between a
`
`
`
` 1
` 21
`base (little) core and a composed (big) core architecture. Core
`2EC
` E2C
`
`
`+
`
`type demonstrates constant behavior with regards to EDP. As
`
` 2
` 1
`
` !+
`shown in Fig. 5, there is a clear gap between the big composed
`1$+;
`
`212
`and little base cores (in Threadl and F28), with composed
` 2
` E -
`5 $9+?C 1
`core having lower EDP.
`In these cases,
`the performance
`
` 1
` !+ ,
`
`
`benefits of the composed core outweigh the energy savings of
`2
`1
`
`the base core.
`2
`+
`
`
`C. Thread Count Sensitivity
`Finally, each benchmark is simulated with varying the
`$ 2
`0 1
`
`numbers of threads.
`In this
`step, each simulation was
`2
`
`+ , 1
`performed at the same fi'equency of 2.8 GHz and operating
`
`
`
`@ 9+? "3
`
`voltage of 1V, when changing the number of threads fiom 1 to
` 5&1 2
`
`
`5
`8. As shown in Fig. 5, increasing the thread counts leads to
`?+ 1 $+ ;
`
`
`better performance at the cost of higher power. Moreover, there
`2
`
`
`
`1
`+
`
`
`
`is a large gap between the EDP values of base and composed
`
`21! 2
`core in lower number of threads. In particular, we observe that
`
`1
`2
`
`+,
`
`12
`
`by increasing the application thread counts, the corresponding
`2
`
`
`
`
`gap between different core types diminishes and makes the
` 21
`
` 0
`base core competitive to the composed core in terms of EDP.
`2
`
`
`!+
`$ %
`
`
`D. Joint Analysis (Core Type, Frequency, Thread Count)
`To understand the
`interplay among various
`tuning
`-
`
`
`
`
`
`
`parameters and find the optimum configuration for maximizing
`
`
`
`
`%3
`the energy-efiiciency, in this section all permutations of the
`
`
`
`parameters were simulated. We test four voltage/fiequency
`
`
` 1
` + A
` B
`@
`settings on two core types and execute each multithreaded
` 1
` %