`
`Reference 8
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2120, p. 1
`
`
`
Asymmetric Multi-Processor Architecture for Reconfigurable System-on-Chip and Operating System Abstractions

Xin Xie, John Williams, Neil Bergmann
School of Information Technology and Electrical Engineering,
The University of Queensland
Brisbane, Australia
{xxie; jwilliams; n.bergmann}@itee.uq.edu.au
`
Abstract- We propose an asymmetric multi-processor reconfigurable SoC architecture comprised of a master CPU running embedded Linux and loosely-coupled slave CPUs executing dedicated software processes. The slave processes are mapped into the host OS as ghost processes, and are able to communicate with each other and the master via standard operating system communication abstractions. Custom hardware accelerators can also be added to the slave or master CPUs. We describe an architectural case study of an MP3 decoding application across 12 different single- and multi-CPU configurations, with and without custom hardware. Analysis of system performance under master CPU load (computation and IO), together with a time-area cost model, reveals the counter-intuitive result that multiple CPUs and appropriate software partitioning can lead to a more efficient and load-resilient architecture than a single CPU with custom hardware offload capabilities, at a lower design cost.
`
I. INTRODUCTION
Modern embedded systems are increasingly required to meet the competing requirements of real-time or near real-time performance, while satisfying tight time-to-market and interoperability requirements. Real-time performance is typically offered by dedicated custom hardware acceleration or a microprocessor running specific firmware or a microkernel, but the rapid development and interoperability requirements are more readily provided by commonly available operating systems (OS) such as embedded Linux.
`
One approach to meet these requirements is to virtualize the OS by running it as a low priority process on top of a real-time kernel. This approach is complicated, error-prone and requires an ongoing maintenance and porting effort. Another is to use custom hardware to accelerate critical sections of an application. Custom hardware design increases non-recurring costs and software/hardware co-design complexity.
`
Instead of multiplexing multiple software environments onto a single CPU or using custom hardware, we propose an asymmetric, reconfigurable System-on-Chip (SoC) multi-processor architecture, implemented on commodity FPGA resources. The proposed approach employs multiple CPUs, dual-port on-chip memory and FIFO-type communication links. This architecture is expected to provide a low memory-latency environment for parallel process execution, in a generic platform suitable for a wide variety of applications.
`
The major challenge of implementing an asymmetric multiprocessor architecture (ASMP) suitable for SoC is the programming model - how can the hardware resources be exposed to the developer in a productive way without compromising performance? Our approach is to represent processes running on slave CPUs as processes in the central (master) operating system. Communication between ghost processes and other software processes on the master CPU is achieved transparently by extending the standard Unix/Linux FIFO/pipe Inter-Process Communication (IPC) concept.
`
The main contribution of this research is to provide a software framework based on an existing OS suitable for FPGA-based multiprocessor architectures. This paper presents a detailed description of the operating system integration model, as well as a comprehensive architectural case study. The reconfigurability of FPGAs is not used in a run-time sense, but rather as a flexible implementation fabric allowing the system architecture to be readily mapped to the application at hand.
`
The paper is organised as follows: a review of existing multi-processor SoC related research and techniques is presented in Section II. Section III describes the hardware subsystems and their interconnection, while Section IV introduces the OS integration model. Section V is a comprehensive case study of the architecture applied to an MP3 decoding application, accompanied by analysis and discussion. Lastly, Section VI presents our conclusions and future research plans.
`
II. BACKGROUND

A. Acceleration techniques for FPGA-based SoC
One technique to improve performance on FPGA-based SoC is to use customized processor cores, e.g. adding a reactive Instruction Set Architecture (ISA) and direct signal lines to existing CPUs [1]. However this technique requires modifications to the compiler to recognize the new features. It is also rare that a sufficiently large portion of the task can be mapped into a limited custom opcode set with fixed instruction format, to escape the implications of Amdahl's law.
A more common approach in the FPGA community is to design a custom hardware core to replace the critical section of a specific application, while the CPU carries other less-critical
`
`1-4244-1472-5/07/$25.00 © 2007 IEEE
`
`FPT 2007
`
sections. However, this approach requires significant effort in hardware/software co-design. At the same time, there is generally no kernel-level OS support for the custom hardware, which can result in portability and programmability problems on platforms other than the one initially targeted.

From the software perspective, a real-time OS or microkernel can be used on an FPGA-based SoC, which offers guarantees on metrics such as interrupt latency. Operating systems such as Linux do not offer such hard real-time performance natively, but instead require significant modifications or extensions. Examples of this approach include RTLinux [2] and RTAI [3]. Porting such a microkernel OS attracts a non-trivial software development cost, which can be hard to justify, especially if the hardware platform changes in the future.
`
B. Multiprocessors on FPGAs
The most common multiple-processor implementations are based on a Symmetric Multi-Processor (SMP) architecture, as widely used in workstation and server environments for High Performance Computing (HPC). Like conventional SMP systems, embedded SMP systems require OS integration [4] and hardware cache management [5]; both can be costly for an embedded system. Coherent caches are particularly expensive to implement in commodity FPGA architectures. In real terms the performance improvement in SMP is likely to be marginal due to OS-associated overhead and the growing gap between processor speeds and memory latency [6, 7].
SMP is a one-size-fits-all approach, appropriate for fixed silicon when a designer must hedge their bets against all possible future uses of the architecture. FPGAs, on the other hand, achieve their potential only when the system architecture is created to match the application. ASMP systems can dedicate the required computing resources to computing-intensive tasks, and distributed memories for each CPU avoid memory and bus bandwidth scalability issues. Existing (single-CPU) operating systems can be used on an ASMP system with fewer OS kernel changes (avoiding issues in IRQ/load balancing, cache coherence etc.)
`
A related asymmetric system was proposed with a master CPU and up to eight co-processors connected with FIFOs [8]. The OS-level integration of the hardware FIFO was inspirational for this research. However the co-processor is an 8-bit microcontroller programmed in assembly language, and its performance was such that the architecture was better suited to real-time control applications than high-speed data processing.
C. MicroBlaze and Linux
The hardware is based primarily on vendor-provided IP, including the Xilinx MicroBlaze soft-core CPU, hardware FIFO and on-chip dual-port memory primitives. MicroBlaze is a 32-bit RISC-type soft-core CPU from Xilinx and takes on the role of both master and slave CPUs in the proposed architecture. The CPUs can be customized in different areas, including bus interfaces, cache size and hardware multiplier support [9]. At the same time, it has three different bus interfaces: the Local Memory Bus (LMB) for fast on-chip memory access, the On-chip Peripheral Bus (OPB) for the various shared on-chip peripherals, and the Fast Simplex Link (FSL) for direct connection to peripherals or processors.

uClinux [10] is used as the OS on the master CPU. uClinux refers to a configuration of the regular Linux kernel that supports CPU architectures lacking hardware memory management support (such as the MicroBlaze). Many existing applications from the Linux ecosystem are readily available for the development environment. Running a conventional OS on an FPGA-based SoC is a realistic approach that can utilize existing software methodology while retaining the advantages of the custom computing platform.
`
III. HARDWARE ARCHITECTURE FOR FPGA-BASED ASMP
This section contains an overview of the hardware architecture; more complete details can be found in our earlier work [11]. The architecture is designed with generality, ease of use and performance as its primary objectives. Figure 1 shows an overview.
`
A. Master CPU subsystem
The Master CPU subsystem (left-most region of Fig. 1) is a fairly typical Linux-capable MicroBlaze system. Off-chip memory is required due to the memory footprint of the Linux kernel and the file system, which implies the use of CPU caches to achieve reasonable performance.
`
The master CPU communicates with slave CPUs through the specific FIFO connections needed to support the intended applications. In addition, the master CPU can also control the running state of the slave CPUs and reprogram the slave CPUs at run time, as described in Section III.C.
`
B. Slave CPU subsystem
The primary motivation for having multiple slave CPUs is to benefit from parallel process execution, and to provide a dedicated environment for computing-intensive processes. Typical SoC systems running applications inside the OS suffer the bottleneck of operating system overheads and non-deterministic memory caches. By giving slave CPUs their own low-latency local memory busses and running an exclusive single process, we can avoid the cost of an operating system for these tasks. The basic structure of the slave CPU subsystems is illustrated in the lower-right region of Fig. 1.
`
We utilize MicroBlaze's Harvard architecture and attach separate dual-port memories to the instruction and data memory interfaces. The other port of each dual-port memory is connected to the global bus, which can be accessed from the master MicroBlaze. By keeping the main data transport on the application-specific FIFO network, with only infrequent code updates on the shared global bus, bus traffic is significantly reduced.
`
By executing from fast on-chip memory, slave CPUs do not require instruction or data caches. In contrast to the 7-10 cycles required for external memory accesses, slave CPUs have fast (1-2 cycle) and predictable memory accesses, ideal for real-time tasks.
`
[Fig. 1. FPGA-based ASMP architecture, showing the master CPU subsystem, standalone slave CPUs and custom hardware. Legend: Fast Simplex Link (FSL), Data and Instruction Local Memory Bus (LMB), On-chip Peripherals Bus (OPB), off-chip connection.]
`
A total of eight slave CPUs can be connected to the master CPU directly, a limitation imposed by the number of FSL ports on the MicroBlaze architecture. However, if not all slave CPUs require a direct data connection to the master, then the number of CPUs is limited primarily by the available logic resources. Table I shows the resource usage for a single slave CPU with 8KB each of on-chip instruction and data memory (Xilinx Virtex4 LX25 device). This on-chip memory is potentially the strongest limiting factor; however, the amount of memory allocated to each CPU can be carefully tuned to match the process that it will host.
`
The software environment of the slave CPUs is kept deliberately simple, with no support for multithreading or making master OS system calls. The current intention is to offload primarily computational tasks with high-speed data IO requirements.
`
C. Integration of slave CPUs to master CPU
An application-specific architecture is created by connecting these CPUs with hardware FIFOs in a network topology that matches the data flow requirements of the application (upper right, Fig. 1). The Xilinx FSL bus is ideal for this purpose. To send data to another CPU, it is written out on the appropriate FSL port. FSL is also used to attach any
`
TABLE I. SLAVE MICROBLAZE FPGA RESOURCE USAGE

Selected Device: Virtex-4 LX25
Number of Slices:           1332 out of 10752   12%
Number of Slice Flip Flops:  724 out of 21504    3%
Number of 4-input LUTs:     1822 out of 21504    9%
Number of BRAMs*:              8 out of 72      20%
*used as 8kB+8kB I/D memory for application
`
hardware accelerator units to CPUs that require them.
`
To reprogram a slave CPU, the master can write directly into its code and data memories. However, for predictable systems the ability to halt and reset these slaves is required. A controller (not shown in Fig. 1) was created for that purpose, which can manage up to 32 slave CPUs. Reprogramming a slave CPU is achieved by the following steps: (1) the master CPU transmits a halt signal to the slave CPU, (2) the slave CPU program code is transferred into the corresponding slave CPU's data and instruction memory from the external storage device, and (3) the halt signal of the slave CPU is cleared.
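The halt/load/resume ordering above is the essential contract. The following Python sketch simulates it only as a toy state machine; the class, its fields and the image bytes are hypothetical stand-ins for illustration, not the actual controller registers or memory map:

```python
class SlaveController:
    """Toy model of the slave-CPU reprogramming controller."""

    def __init__(self):
        self.halted = False
        self.memory = b""          # combined I/D memory image

    def reprogram(self, image: bytes) -> None:
        # (1) assert the halt signal so the slave stops executing
        self.halted = True
        # (2) copy the new program image into the slave's memories
        self.memory = image
        # (3) clear the halt signal; the slave restarts with new code
        self.halted = False

ctrl = SlaveController()
ctrl.reprogram(b"\x00\x01\x02\x03")   # hypothetical program image
```

The point of the simulation is only that the image transfer happens strictly between halt and resume, so the slave never executes a half-written program.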
`
IV. OS INTEGRATION FOR FPGA-BASED ASMP
The software abstraction of the communication channels in the proposed FPGA-based ASMP is critical to system programmability and performance. The approach adopted is to extend the Unix/Linux software FIFO implementation to the ASMP communication channel.
`
A. Utilities for slave CPU control
Section III.C described the methods used to reprogram and
`
TABLE II. UTILITY SOFTWARE

Pause slave CPU
# slave_control -pause slave1
Start slave CPU
# slave_control -start slave1
Read back the slave CPU memory and save it to an output memory image
# slave_control -read slave1 > slave1_image
Write back the slave CPU memory from an input memory image
# slave_control -write slave1 < slave1_image
Create the slave CPU memory image from the executable file
# slave_control -create slave1 < slave1.exe > slave1_image
`
[Fig. 2. Interrupt-based FSL driver for Linux, spanning user space (the read/write interface), kernel space (kernel buffers and interrupt handlers) and hardware (the FSL input/output channels).]
`
control the slave CPUs. These methods were implemented in a utility program ('slave_control') to perform the various actions. Table II describes the usage of this program.
`
B. FSL FIFO driver for Linux
FSL is the hardware channel for master-to-slave and slave-to-slave communication. On the slave side, the blocking versions of the native MicroBlaze FSL instructions can be used, since there is only a single process of execution. However on the master side, such blocking read/write operations can cause the system to freeze or stutter by stalling the entire OS, including interrupts, until the FSL transaction is complete. Operating system design principles forbid direct access to hardware by application software, motivating the use of a device driver to mediate this access. Finally, a device driver can implement buffering, which improves overall throughput by preventing applications from blocking when the relatively small hardware FIFO channels are full.
`
We kept the general design and structure of the original FSL FIFO driver [8], and extended it to operate in an interrupt-driven rather than polled mode. This gives much better performance when the master CPU is under heavy load and reduces overall kernel overhead.
`
Fig. 2 illustrates the simplified FSL driver internals, including the read/write interface, kernel buffer and the interrupt handler. It illustrates only a single channel pair - in reality all 8 FSL channel pairs are managed by the same driver. With these changes the FSL driver is capable of sending/receiving multiple channels of FSL data with coherence and high performance under different user-kernel space system loads. The driver presents a standard Linux character device interface, supporting standard system calls such as open(), read(), write() and close().
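Because the driver exposes a character device, ordinary file I/O is all an application needs. The sketch below imitates that usage pattern with a named FIFO standing in for the FSL device node; the path, producer role and 4-byte read size are illustrative assumptions, not the driver's actual interface:

```python
import os
import tempfile
import threading

# A named FIFO stands in for the FSL character device node.
fifo = os.path.join(tempfile.mkdtemp(), "fsl0")
os.mkfifo(fifo)

def producer(path: str) -> None:
    # In the real system this side would be the slave CPU
    # pushing words into the hardware FIFO.
    with open(path, "wb") as f:
        f.write(b"\x01\x02\x03\x04")

t = threading.Thread(target=producer, args=(fifo,))
t.start()

# The master-side application just uses ordinary file I/O,
# exactly as it would against the driver's /dev node.
with open(fifo, "rb") as f:
    word = f.read(4)
t.join()
```

This is the property the driver buys: applications stay portable because they see only open()/read()/write()/close(), never the hardware.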
C. Ghost process and pipe redirection
A common multiprocessing methodology in Unix/Linux is to split an application into multiple independent processes, connected by software-based FIFO channels, called pipes.
`
[Fig. 3. Ghost process pipe redirection: (a) data forwarded through the ghost process in user space; (b) in-kernel pipe redirection. Legend: Linux user space, Linux kernel space, Linux data (user and kernel space), FSL data, hardware FSL connection, blocking process, ghost process.]
`
Similarly, the slave processors can be connected to the master processor's processes through an agent process inside the Linux user space. This methodology provides a process model that supports additional slave processors inside the same operating system, including support for commonly used Linux inter-process communication methodologies. We call this process model the "ghost process", and the details are described as follows.
`
For clarity and brevity, the following examples use conventional command-line shell syntax and constructs for process and pipe creation. However, system programming calls such as fork(), exec() and mkfifo() could also be used in a controlling (master CPU) application to achieve the same effect. Consider three processes P1, P2 and P3 connected by pipes, executed by the following command line:
# P1 < input | P2 | P3 > output
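Such a pipeline can equally be built programmatically, in the spirit of the fork()/exec() route mentioned above. The Python sketch below chains two child processes with OS pipes; the two stages (uppercase, reverse) are placeholder transformations standing in for real pipeline stages, not MP3 decoding code:

```python
import subprocess
import sys

# Stage playing the role of P2: uppercase its stdin.
p2 = subprocess.Popen(
    [sys.executable, "-c",
     "import sys; sys.stdout.write(sys.stdin.read().upper())"],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)

# Stage playing the role of P3: reverse its stdin, fed from P2's stdout.
p3 = subprocess.Popen(
    [sys.executable, "-c",
     "import sys; sys.stdout.write(sys.stdin.read()[::-1])"],
    stdin=p2.stdout, stdout=subprocess.PIPE, text=True)

p2.stdin.write("frame")        # playing the role of P1 < input
p2.stdin.close()
p2.stdout.close()              # so P3 sees EOF once P2 exits
out = p3.communicate()[0]      # "frame" -> "FRAME" -> "EMARF"
```

Closing the parent's copy of p2.stdout is the detail that lets EOF propagate down the pipeline, mirroring how the shell wires `|`.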
`
In the event that the P2 process workload must be migrated onto the slave CPU, a ghost process called P2g is created inside user space, acting as the agent for the slave CPU execution. The command-line execution of the ghost process is similar to the previous command; the only difference is that P2g, the ghost process, replaces the original P2:
# P1 < input | P2g | P3 > output
Using this approach, P2g does not do any real computation work; it just transfers the data to and from the slave CPU through the FSL device driver. The slave CPU executes the actual process P2, and P1 and P3 remain entirely ignorant of the fact that P2 is now being executed on a
`
TABLE III. CUSTOM HARDWARE VS SOFTWARE

(cycles)   IMDCT36    DCT32
SW            1507     2132
HW            1119      723
Speedup    134.67%  294.88%
`
different CPU. From a software design perspective the value of this abstraction is obvious; however, this approach has a large performance penalty caused by unnecessary buffer copying and user-space to kernel-space switching, as shown in Fig. 3(a).
`
In response, we developed a pipe redirection mechanism inside the kernel (as a Linux system call) for the ghost process data transfer without action or intervention by the user-space application. The method is inspired by the Linux sendfile() and splice() system calls, which allow zero-copy data transfer between file descriptors. Fig. 3(b) illustrates the concept. The ghost process P2g in Fig. 3(b) is required to initiate the redirection mechanism at start-up; after that, P2g can either be kept alive for monitoring and controlling the slave CPUs, or it can be terminated once the redirection is initiated.
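To make the cost being removed concrete, the sketch below shows the user-space forwarding loop of Fig. 3(a): a ghost-process stand-in shuttles bytes from an upstream pipe into a second pipe that plays the role of the FSL channel. The in-kernel redirection of Fig. 3(b) moves exactly this copy loop into the kernel (compare sendfile()/splice()), so no user-space buffer is touched; the pipe pair here is purely illustrative:

```python
import os
import threading

up_r, up_w = os.pipe()     # upstream pipe: P1 -> ghost process
fsl_r, fsl_w = os.pipe()   # stand-in for the FSL channel to the slave

def ghost_forward(src: int, dst: int) -> None:
    # User-space copy loop; the kernel redirection removes this.
    while True:
        chunk = os.read(src, 4096)
        if not chunk:       # upstream EOF
            break
        os.write(dst, chunk)
    os.close(dst)

t = threading.Thread(target=ghost_forward, args=(up_r, fsl_w))
t.start()
os.write(up_w, b"pcm-samples")
os.close(up_w)
received = os.read(fsl_r, 4096)
t.join()
```

Every byte here crosses the kernel/user boundary twice per hop, which is the overhead the kernel-side redirection eliminates.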
`
One other feature not detailed here is the passing of an End-of-file (EOF) token through the pipelines. This allows the termination of data streams to be detected by master and slave processes. With the ghost process scheme described above, the other connected processes inside the OS user space are unaware of where the computation task is executed. Further, a developer can first prototype the whole application as multiple processes running on the master OS, and only later migrate the critical processes onto the slave CPUs without a communication overhead penalty.
`
V. CASE STUDY AND RESULTS
Using the previously described hardware and software architecture, a complete application is tested with different
`
[Fig. 4. MP3 decoding functions and program execution time profile.]
`
master-slave CPU and custom hardware configurations. The results demonstrate that FPGA-based ASMP architectures not only improve application execution time but also offer performance invariance against system load.
`
A. MP3 decoding mapping to FPGA-based ASMP
We chose MP3 audio decoding as a case study due to its moderate computational complexity and the wide availability of software implementations; libmad was selected for the implementation. One characteristic of libmad is its use of highly optimised, architecture-independent integer operations to perform the DSP required for MP3 decoding. Using such highly optimised software as a starting point makes for a fairer comparison against custom hardware acceleration than using, for example, a naive floating-point implementation.
`
The original monolithic implementation was analysed with conventional software profiling tools to identify critical sections. Fig. 4 shows the program's internal functions and profile. The time-domain and frequency-domain functions each represent approximately half of the total 98.9% of execution time, and the remaining 1.1% is consumed by disk I/O functions. The separation of the frequency-domain and time-domain functions inside libmad offers a natural pipeline-style implementation.
`
From the above analysis, we divided the original libmad into four separate processes: P0 for reading the input MP3 file, P1 for the frequency-domain functions, P2 for the time-domain functions and P3 for writing the output PCM file. P1 and P2 can be executed on either the master CPU or the slave CPUs (with ghost processes P1g and P2g respectively). If P1 and P2 are both mapped onto slave CPUs, we have the option of connecting them directly or using the master CPU to join them together. Even with only four processes there is a large number of possible mappings.
`
We further chose two functions, imdct36() and dct32(), which together represent 19.2% of execution time, as candidates for custom hardware acceleration. The custom hardware for imdct36() and dct32() can be attached to the P1 and P2 processes either through the master CPU or the slave CPUs, depending on where each process is mapped. Other functions like Huffman decoding and anti-alias filtering might also be suitable for custom hardware implementation; however this was not investigated further.
`
Table III shows the performance gain from using custom hardware over libmad's corresponding software subroutines. libmad uses imdct36() and dct32() implementations optimized for the lossy audio decoding algorithm, while the custom hardware used here did not use the optimized algorithm.
`
Using the mapping methods described above, we have a total of 12 configurations, comprised of 6 major variants, each either with or without custom hardware. Table IV shows the six configurations with custom hardware, and the labels given to describe each. These configurations include: (a) ORG_HW, which is the original monolithic program (ORG) but with direct access to the custom HW on the master CPU; (b) PIPE_HW, which splits the original program into four processes connected by Linux pipes; (c) PIPE_SLAVE1_HW,
`
[Fig. 5. Performance of MP3 decoding for the different configurations.]
`
which migrates P1 only into the first slave processor; (d) PIPE_SLAVE2_HW, which migrates P2 into the second slave processor; and in (e) and (f), both configurations have P1 and P2 running in the slave processors but use different P1-to-P2 data routing. We also implemented all of the above 6 configurations without custom hardware (their labels have no "_HW" suffix).
`
Using the ghost processes and the utility software introduced in the previous section, the configurations described in Table IV can be reconfigured and executed at runtime. For example, to test the configuration labelled PIPE_SLAVE1_PIPE_SLAVE2, the following commands can be used:

# slave_control -write slave1 < P1g_image
# slave_control -write slave2 < P2g_image
# P0 < input.MP3 | P1g | P2g | P3 > output.pcm
B. Baseline Performance
Each configuration was tested with a fixed-rate 128 kbps, 20-second duration stereo MP3 file. Other bitrates were also tested but showed no meaningful deviation from the selected MP3 file.
`
Fig. 5 shows the runtime results of all 12 configurations. The horizontal line at 20 seconds is the minimum performance which results in flawless playback. Three configurations do not satisfy this constraint; ORG and PIPE are both single-CPU only with no custom hardware. ORG is executed as a single process and PIPE as 4 processes connected by pipes. The difference between these two is largely the OS overhead associated with multi-tasking and the OS FIFO buffers.
`
The other configurations all meet the requirement with different margins. This demonstrates that the ASMP architecture and custom hardware are both useful for improving performance. The best performance was achieved by combining the two slave CPUs and the custom hardware, yielding a performance increase of 348% over the original program.
`
Note that the greatest relative performance improvement comes from the move to the three-CPU ASMP architecture (software only), rather than the addition of any specific piece of custom hardware. Using the custom hardware on the single
`
`T ABLE JV. C ONFIGURATIONS FOR ASMP MP3 DECODING USING CUSTOM HARDWARE
`
`(b)PIPE HW
`
`(c) PIPE SLAVE/ HW
`
`d PIPE SLAVE2 HW
`
`... ,--,---,---,---"";:'.:::::::::~===::c:==~
`I ~ ~~=~~. I
`,,.
`
`-
`
`-
`
`-
`
`"
`"
`T
`" !
`,.
`i
`"
`"
`
`!
`
`I
`
`y
`T
`
`1. PIP£
`
`2 °"°
`
`3: PIPE_HW
`• ORG_ HW
`5: PIP'E_SLAVE1
`5: PIPE_SLAI/E2
`7 PIPE_SLAVEI_HW
`8: PIPE,_SLAVE2._tfW
`9- PIP£_SLAVEI_SlAVE2
`10' PIPE_SLAVE1_PIPE_SLAI/E2
`11·PIPE_Sl.AVE1_ SV.~HW
`12: PIPE:_SLAVE1_PIPE_SLAv£2_HW
`
`T
`'
`
`lo
`J
`
`~
`~
`
`o'---~--~--~--~--~--~---'
`
`,000
`
`,000
`" " "
`FPGALUTaUHCI
`
`Fig. 6. Normalised Time-Area Products
`
`Fig. 7. Time vs Area for all architectures
`
CPU only increases the system performance by 54%, while two additional slave CPUs with no custom hardware increase the performance by a factor of 262%. Further improvement from adding custom hardware is marginal.
It is certainly possible to achieve further improvement by moving more functions into hardware, or by using more efficient hardware implementations. Even assuming we could achieve a 100 times improvement in all critical sections of the MP3 decoding with hardware, this would reduce the execution time of the Huffman decoding, alias reduction, IMDCT and the filter bank from 60% to 0.6% of total execution time. At the same time the total speedup would still be just 250% - less than what was achieved by using two slave CPUs at significantly less development cost.
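The 250% figure above follows directly from Amdahl's law; a quick check of the arithmetic (the 60% fraction is taken from the profiling discussion above):

```python
def amdahl_speedup(fraction: float, factor: float) -> float:
    """Overall speedup when `fraction` of the runtime is
    accelerated by `factor` (Amdahl's law)."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)

# 60% of decoding accelerated 100x: the remaining serial 40%
# dominates, capping the overall gain at roughly 2.5x (250%).
overall = amdahl_speedup(0.60, 100.0)
```

The accelerated 60% shrinks to 0.6% of the original runtime, so total time is 0.4 + 0.006 = 0.406 of the original, i.e. a speedup of about 2.46x.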
`
C. Performance vs. resource usage
The area-time product is a long-established metric for assessing the relative merit of competing system implementations. In modern FPGA architectures this is somewhat complicated by the presence of discrete blocks or hard macros such as on-chip BRAMs and the DSP48 multiply-accumulate units in Xilinx FPGAs. For simplicity, we have counted only the Look-Up Table (LUT) usage reported by the synthesis tools. Logic consumed by common SoC infrastructure (buses, memory controllers, I/O devices etc.) is counted for all systems.
For the 12 configurations, we calculated a normalised area-time product (with the ORG software implementation as 100%), with the results shown in Fig. 6. The relative efficiency of the three-CPU ASMP architectures is significantly better than any other, and once again the presence of custom hardware brings only marginal further benefit. Fig. 7 plots the raw time vs. area data.
Note finally that the DSP48 FPGA logic blocks, heavily used by both the dct32 and imdct hardware accelerators, are not counted in this analysis. Modelling a conservative "equivalent LUT count" of 100 LUTs per DSP48 completely erodes any time-area benefit of using custom hardware in this scenario.
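The normalisation itself is simple; the sketch below shows the calculation, including the equivalent-LUT adjustment for DSP48 blocks just described. All the numbers in the example are made-up placeholders, not the measured results from Fig. 6:

```python
def normalised_area_time(time_s: float, luts: int, dsp48s: int,
                         base_time_s: float, base_luts: int,
                         luts_per_dsp48: int = 100) -> float:
    """Area-time product as a percentage of a baseline system,
    folding DSP48 usage in as an equivalent LUT count."""
    area = luts + dsp48s * luts_per_dsp48
    return 100.0 * (time_s * area) / (base_time_s * base_luts)

# Hypothetical variant: twice as fast as the baseline but using
# 1000 extra LUTs plus 4 DSP48 blocks.  Halving the time does not
# halve the score once the extra area is charged for.
score = normalised_area_time(10.0, 6000, 4, 20.0, 5000)
```

With the 100-LUT equivalence the hypothetical accelerator's area grows from 6000 to 6400 LUTs, so its score is 64% of the baseline rather than the 60% the raw LUT count would suggest, illustrating how the adjustment erodes the apparent benefit.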
`
D. Performance under system load
Embedded systems are increasingly called upon to support a number of ancillary functions beyond their prime application. For this reason, their performance under varying loads is an important metric. We further distinguish between two broad classes - CPU load and I/O load. CPU load arises from tasks competing for access to the ALU, while I/O load arises from requests to external devices. Real applications will of course be some mixture of both.
`
The program dhrystone is a benchmark which is computationally intensive but performs no I/O. Fig. 8 shows the performance of the different architectural configurations against the load induced by running multiple dhrystone instances on the master CPU. Only those configurations using both slave CPUs (with or without custom hardware) show resilience against increased load. Using custom hardware does improve performance consistently, but it does not offer immunity against system load as long as significant code portions still execute on the master CPU.
`
To simulate I/O load we implemented a program that continually reads and writes random data over a network file system. This benchmark is almost entirely I/O dominant, causing very high I/O loads and intensive interrupt activity. Fig. 9 shows the performance of the architectures under multiple instances of this load. One notable feature is the saturation effect beyond a certain load - at this point the system is fully loaded. Once again the configuration using both slave CPUs has no real performance drop against the I/O load, while all other configurations increase their decoding time by up to a factor of 200%.
`
Avoiding system load effects entirely in a single CPU with only custom hardware would require the entire MP3 decoding function to be moved into custom hardware. However, this may not be practical or cost-effective. Using multiple slave CPUs and custom hardware together offers a realistic and effective solution.
`
VI. CONCLUSIONS AND FUTURE RESEARCH
This research does not describe any novel algorithms or achieve extraordinary speedup by using FPGA technology for a fixed application. Instead the focus is on the application implementation and the performance gain at a system level. Amdahl's Law is still applicable even with the emergence of new technology. For example, using custom hardware on
`
`= ,.,
`"'
`"'
`..
`"'
`I "" .,
`.,
`"
`
`_,_
`_,_
`- ·-
`-·-
`=·-= ·-
`= ·-
`
`90
`
`.,
`
`70
`
`"'
`j 50
`"
`
`30
`
`,,
`
`ONfSRN/
`-
`INFSRN/
`-
`2NFSRN/
`-
`-3NFSRNJ
`c=l4NFSRN/
`c::==::::::J SNFSRN/
`c:=:::=:J 6NfSRNi
`
`Fig. 8 Performance under CPU load
`
`Fig. 9 Perfo