Asymmetric Multi-Processor Architecture for Reconfigurable System-on-Chip and Operating System Abstractions

Xin Xie, John Williams, Neil Bergmann
School of Information Technology and Electrical Engineering,
The University of Queensland
Brisbane, Australia
{xxie; jwilliams; n.bergmann}@itee.uq.edu.au

Abstract- We propose an asymmetric multi-processor reconfigurable SoC architecture comprised of a master CPU running embedded Linux and loosely-coupled slave CPUs executing dedicated software processes. The slave processes are mapped into the host OS as ghost processes, and are able to communicate with each other and the master via standard operating system communication abstractions. Custom hardware accelerators can also be added to the slave or master CPUs. We describe an architectural case study of an MP3 decoding application across 12 different single and multi-CPU configurations, with and without custom hardware. Analysis of system performance under master CPU load (computation and I/O), and a Time-Area cost model, reveals the counter-intuitive result that multiple CPUs and appropriate software partitioning can lead to a more efficient and load-resilient architecture than a single CPU with custom hardware offload capabilities, at a lower design cost.

I. INTRODUCTION

Modern embedded systems are increasingly required to meet the competing requirements of real-time or near real-time performance, while satisfying tight time-to-market and interoperability requirements. Real-time performance is typically offered by dedicated custom hardware acceleration or a microprocessor running specific firmware or a microkernel, but the rapid development and interoperability requirements are more readily provided by commonly available operating systems (OS) such as embedded Linux.

One approach to meeting these requirements is to virtualize the OS by running it as a low-priority process on top of a real-time kernel. This approach is complicated, error-prone and requires an ongoing maintenance and porting effort. Another is to use custom hardware to accelerate critical sections of an application. Custom hardware design increases non-recurring costs and software/hardware co-design complexity.

Instead of multiplexing multiple software environments onto a single CPU or using custom hardware, we propose an asymmetric, reconfigurable System-on-Chip (SoC) multi-processor architecture, implemented on commodity FPGA resources. The proposed approach employs multiple CPUs, dual-port on-chip memory and FIFO-type communication links. This architecture is expected to achieve a low memory latency environment for parallel process execution, in a generic platform suitable for a wide variety of applications.

The major challenge of implementing an asymmetric multiprocessor architecture (ASMP) suitable for SoC is the programming model: how can the hardware resources be exposed to the developer in a productive way without compromising performance? Our approach is to represent processes running on slave CPUs as processes in the central (master) operating system. Communication between ghost processes and other software processes on the master CPU is achieved transparently by extending the standard Unix/Linux process FIFO/pipe Inter-Process Communication (IPC) concept.

The main contribution of this research is to provide a software framework based on an existing OS suitable for FPGA-based multiprocessor architectures. This paper presents a detailed description of the operating system integration model, as well as a comprehensive architectural case study. The reconfigurability of FPGAs is not used in a run-time sense, but rather as a flexible implementation fabric allowing the system architecture to be readily mapped to the application at hand.

The paper is organised as follows: a review of existing multi-processor SoC related research and techniques is presented in Section II, Section III describes the hardware subsystems and their interconnection, while Section IV introduces the OS integration model. Section V is a comprehensive case study of the architecture applied to an MP3 decoding application, accompanied by analysis and discussion. Lastly, Section VI presents our conclusions and future research plans.

II. BACKGROUND

A. Acceleration techniques for FPGA based SoC
One technique to improve performance on FPGA based SoC is to use customized processor cores, e.g. adding a reactive Instruction Set Architecture (ISA) and direct signal lines to existing CPUs [1]. However this technique requires modifications to the compiler to recognize the new features. It is also rare that a sufficiently large portion of the task can be mapped into a limited custom opcode set with fixed instruction format, to escape the implications of Amdahl's law.
A more common approach in the FPGA community is to design a custom hardware core to replace the critical section of a specific application, while the CPU carries other less-critical sections.
However, this approach requires significant effort in hardware/software co-design. At the same time, there is generally no kernel-level OS support for the custom hardware, which can result in portability and programmability problems on platforms other than the one initially implemented.

From the software perspective, a real-time OS or microkernel can be used on an FPGA based SoC, which offers guarantees on metrics such as interrupt latency. Operating systems such as Linux do not offer such hard real-time performance natively, but instead require significant modifications or extensions. Examples of this approach include RTLinux [2] and RTAI [3]. Porting such a microkernel OS attracts non-trivial software development cost, which can be unsuitable and unjustified, especially in the case of future hardware platform changes.

B. Multiprocessors on FPGAs
The most common multiple processor implementations are based on a Symmetric Multi-Processor (SMP) architecture as widely used in workstation and server environments for High Performance Computation (HPC). Similar to the conventional SMP system, embedded SMP systems require OS integration [4] and hardware cache management [5]; both can be costly for an embedded system. Coherent caches are particularly expensive to implement in commodity FPGA architectures. In real terms the performance improvement in SMP is likely to be marginal due to OS associated overhead and the growing gap between processor speeds and memory latency [6, 7].
SMP is a one-size-fits-all approach, appropriate for fixed silicon when a designer must hedge their bets against all possible future uses of the architecture. FPGAs on the other hand achieve their potential only when the system architecture is created to match the application. ASMP systems can dedicate the required computing resources to computing-intensive tasks, and distributed memories for each CPU avoid memory and bus bandwidth scalability issues. Existing (single CPU) operating systems can be used on an ASMP system with fewer OS kernel changes (avoiding issues in IRQ/load balancing, cache coherence etc.).

A related asymmetric system was proposed with a master CPU and up to eight co-processors connected with FIFOs [8]. The OS-level integration of the hardware FIFO was inspirational for this research. However the co-processor is an 8-bit microcontroller programmed in assembly language, and its performance was such that the architecture was better suited to real-time control applications than high-speed data processing.

C. MicroBlaze and Linux
The hardware is based primarily on vendor-provided IPs including the Xilinx MicroBlaze soft-core CPU, hardware FIFO and on-chip dual-port memory primitives. MicroBlaze is a 32-bit RISC-type soft-core CPU from Xilinx and takes on the role of both master and slave CPUs in the proposed architecture. The CPUs can be customized in different areas, including bus interfaces, cache size and hardware multiplier support [9]. At the same time, it has three different bus interfaces: the Local Memory Bus (LMB) for fast on-chip memory access, the On-chip Peripheral Bus (OPB) for the various shared on-chip peripherals, and the Fast Simplex Link (FSL) for direct connection to peripherals or processors.

uClinux [10] is used as the OS on the master CPU. uClinux refers to a configuration of the regular Linux kernel that supports CPU architectures lacking hardware memory management support (such as the MicroBlaze). Many existing applications from the Linux ecosystem are readily available for the development environment. Running a conventional OS on an FPGA based SoC is a realistic approach that can utilize existing software methodology while retaining the advantages of the custom computing platform.

III. HARDWARE ARCHITECTURE FOR FPGA BASED ASMP

This section contains an overview of the hardware architecture, while more complete details can be found in our earlier work [11]. The architecture is designed with generality, ease of use and performance as its primary objectives. Figure 1 shows an overview.

A. Master CPU subsystem
The Master CPU subsystem (left-most region of Fig. 1) is a fairly typical Linux-capable MicroBlaze system. Off-chip memory is required due to the memory footprint of the Linux kernel and the file system, which implies the use of CPU caches to achieve reasonable performance.

The master CPU communicates with slave CPUs through the specific FIFO connections intended to support the applications. In addition, the master CPU can also control the running state of the slave CPUs and reprogram the slave CPUs at run time as described in Section III.C.

B. Slave CPU subsystem
The primary motivation for having multiple slave CPUs is to benefit from parallel process execution, and to provide a dedicated environment for computing-intensive processes. Typical SoC systems running applications inside the OS suffer the bottleneck of operating system overheads and non-deterministic memory caches. By giving slave CPUs their own low-latency local memory busses and running an exclusive single process, we can avoid the cost of an operating system for these tasks. The basic structure of the slave CPU subsystems is illustrated in the lower-right region of Fig. 1.

We utilize MicroBlaze's Harvard architecture and attach separate dual-port memories to the instruction and data memory interfaces. The other port of each dual-port memory is connected to the global bus, which can be accessed from the master MicroBlaze. By keeping the main data transport on the application-specific FIFO network, and only infrequent code updates on the shared global bus, bus traffic is significantly reduced.

By executing from fast on-chip memory, slave CPUs do not require instruction or data caches. In contrast to the 7-10 cycles required for external memory accesses, slave CPUs have fast (1-2 cycle) and predictable memory accesses, ideal for real-time tasks.

Fig. 1. FPGA based ASMP architecture (master CPU subsystem; standalone slave CPUs and custom hardware; interconnect legend: Fast Simplex Link (FSL), instruction and data Local Memory Bus (LMB), On-chip Peripherals Bus (OPB), off-chip connection).

A total of eight slave CPUs can be connected to the master CPU directly, a limitation imposed by the number of FSL ports on the MicroBlaze architecture. However, if not all slave CPUs require a direct data connection to the master, then the number of CPUs is limited primarily by the available logic resources. Table I shows the resource usage for a single slave CPU with 8KB each of on-chip instruction and data memory (Xilinx Virtex4 LX25 device). This on-chip memory is potentially the strongest limiting factor; however, the amount of memory allocated to each CPU can be carefully tuned to match the execution process that it will host.

The software environment of the slave CPUs is kept deliberately simple, with no support for multithreading or making master OS system calls. The current intention is to offload primarily computational tasks with high-speed data IO requirements.

C. Integration of slave CPUs to master CPU
An application-specific architecture is created by connecting these CPUs with hardware FIFOs in a network topology that matches the data flow requirements of the application (upper right, Fig. 1). The Xilinx FSL bus is ideal for this purpose. To send data to another CPU, it is written out on the appropriate FSL port. FSL is also used to attach any hardware accelerator units to CPUs that require them.

TABLE I. SLAVE MICROBLAZE FPGA RESOURCE USAGE

Selected Device: Virtex-4 LX25
Number of Slices:             1332 out of 10752   (12%)
Number of Slice Flip Flops:    724 out of 21504    (3%)
Number of 4-input LUTs:       1822 out of 21504    (9%)
Number of BRAMs*:                8 out of 72       (20%)
*used as 8KB+8KB I/D memory for application

To reprogram a slave CPU, the master can write directly into its code and data memories. However, for predictable systems the ability to halt and reset these slaves is required. A controller (not shown in Fig. 1) was created for that purpose, which can manage up to 32 slave CPUs. Reprogramming a slave CPU is achieved by the following steps: (1) the master CPU transmits a halt signal to the slave CPU, (2) transfers the slave CPU program code into the corresponding slave CPU's data and instruction memory from the external storage device, and (3) clears the halt signal of the slave CPU.

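As an illustration of this sequence, the following minimal sketch shows how a master-side routine might carry out the three steps, assuming a memory-mapped halt-control register and slave memory apertures on the global bus. The register layout and the base addresses (SLAVE_CTRL_BASE, SLAVE_MEM_BASE) are hypothetical placeholders, not values from the actual design.

#include <stddef.h>
#include <stdint.h>

/* Hypothetical memory map: real addresses depend on the SoC design. */
#define SLAVE_CTRL_BASE ((volatile uint32_t *)0x80000000u) /* halt bits, one per slave  */
#define SLAVE_MEM_BASE  ((volatile uint8_t  *)0x81000000u) /* slave I/D memory aperture */
#define SLAVE_MEM_SIZE  (16 * 1024)                        /* 8KB I + 8KB D per slave   */

/* Reprogram one slave CPU following the three steps above. */
static void reprogram_slave(unsigned slave_id, const uint8_t *image, size_t len)
{
    volatile uint32_t *halt = SLAVE_CTRL_BASE;
    volatile uint8_t *mem = SLAVE_MEM_BASE + slave_id * SLAVE_MEM_SIZE;

    *halt |= (1u << slave_id);         /* (1) assert the halt signal          */
    for (size_t i = 0; i < len; i++)   /* (2) copy the program image over the */
        mem[i] = image[i];             /*     shared global bus               */
    *halt &= ~(1u << slave_id);        /* (3) clear the halt signal; restart  */
}
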
IV. OS INTEGRATION FOR FPGA BASED ASMP

The software abstraction of the communication channel in the proposed FPGA based ASMP is critical to system programmability and performance. The approach adopted is to extend the Unix/Linux software FIFO implementation to the ASMP communication channel.

A. Utilities for Slave CPU control
Section III.C described the methods used to reprogram and control the slave CPUs. These methods were implemented in a utility program ('slave_control') to perform the various actions. Table II describes the usage of this program.

TABLE II. UTILITY SOFTWARE

Pause slave CPU:
# slave_control -pause slave1
Start slave CPU:
# slave_control -start slave1
Read back the slave CPU memory and save it to an output memory image:
# slave_control -read slave1 > slave1_image
Write back the slave CPU memory from an input memory image:
# slave_control -write slave1 < slave1_image
Create the slave CPU's memory image from the execution file:
# slave_control -create slave1 < slave1.exe > slave1_image

Fig. 2. Interrupt based FSL driver for Linux (user-space read/write interface, kernel-space buffers and interrupt handlers, hardware FSL channels).

B. FSL FIFO driver for Linux
FSL is the hardware channel for master-to-slave and slave-to-slave communication. On the slave side, the blocking versions of the native MicroBlaze FSL instructions can be used, since there is only a single process of execution. However on the master side, such blocking read/write operations can cause the system to freeze or stutter by causing the entire OS, including interrupts, to be stalled until the FSL transaction is complete. Operating system design principles forbid direct access to hardware by application software, motivating the use of a device driver to mediate this access. Finally, a device driver can implement buffering, which improves overall throughput by preventing applications from blocking when the relatively small hardware FIFO channels are full.

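On the slave side, the program structure can be as simple as a blocking read-process-write loop. The sketch below assumes the blocking getfsl()/putfsl() macros from Xilinx's mb_interface.h; the per-word process_word() computation is a hypothetical placeholder, not part of the paper's application.

/* Slave-side sketch: standalone MicroBlaze program, no OS. */
#include <mb_interface.h>

static unsigned int process_word(unsigned int w)
{
    return w ^ 0xdeadbeef;          /* stand-in for the real per-word computation */
}

int main(void)
{
    unsigned int w;
    for (;;) {
        getfsl(w, 0);               /* blocking read from FSL channel 0  */
        w = process_word(w);
        putfsl(w, 0);               /* blocking write back on channel 0  */
    }
    return 0;
}
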
We kept the general design and structure of the original FSL FIFO driver [8], and extended it to operate in an interrupt-driven rather than polled mode. This gives much better performance when the master CPU is under heavy load and reduces overall kernel overhead.

Fig. 2 illustrates the simplified FSL driver internals including the read/write interface, kernel buffer and the interrupt handler. It only illustrates a single channel pair; in reality all 8 FSL channel pairs are managed by the same driver. With these changes the FSL driver is capable of sending/receiving multiple channels of FSL data with coherence and high performance under different user-kernel space system loads. The driver presents a standard Linux character device interface, supporting standard system calls such as open(), read(), write() and close().

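From user space on the master, a slave channel is then used like any other character device. A minimal sketch follows; the device path /dev/fsl0 is an assumed name, since the paper does not specify the node naming scheme.

/* Master-side sketch: exchange one word with a slave CPU through the
 * FSL character device. Error handling is deliberately minimal. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/fsl0", O_RDWR);      /* assumed device node */
    if (fd < 0) { perror("open /dev/fsl0"); return 1; }

    uint32_t out = 42, in = 0;
    if (write(fd, &out, sizeof out) != sizeof out)   /* send one word   */
        perror("write");
    if (read(fd, &in, sizeof in) != sizeof in)       /* wait for result */
        perror("read");
    printf("slave returned 0x%08x\n", in);

    close(fd);
    return 0;
}
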
C. Ghost process and pipe redirection
A common multiprocessing methodology in Unix/Linux is to split an application into multiple independent processes, connected by software based FIFO channels, called pipes.

Fig. 3. Ghost process pipe redirection (working and ghost processes across Linux user space, kernel space and hardware FSL connections).

Similarly, the slave processors can be connected to the master processor's processes through an agent process inside the Linux user space. This methodology can provide a process model that supports additional slave processors inside the same operating system, including support for commonly used Linux inter-process communication methodologies. We call this process model the "ghost process", and the details are described as follows.

For clarity and brevity, the following examples use conventional command line shell script syntax and constructs for process and pipe creation. However, system programming calls such as fork(), exec() and mkfifo() could also be used in a controlling (master CPU) application to achieve the same effect. Consider three processes P1, P2 and P3 connected by pipes, executed by the following command line:
# P1 < input | P2 | P3 > output
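As a minimal sketch of that programmatic alternative (using anonymous pipes via pipe() rather than named FIFOs via mkfifo(), for brevity), the following C program builds the same three-stage pipeline; the binary paths ./P1, ./P2 and ./P3 are placeholders.

/* Sketch: build "P1 < input | P2 | P3 > output" programmatically. */
#include <fcntl.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int in  = open("input", O_RDONLY);
    int out = open("output", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    int p12[2], p23[2];
    pipe(p12);
    pipe(p23);

    if (fork() == 0) {                       /* P1: input -> pipe12 */
        dup2(in, STDIN_FILENO);
        dup2(p12[1], STDOUT_FILENO);
        close(in); close(out);
        close(p12[0]); close(p12[1]); close(p23[0]); close(p23[1]);
        execl("./P1", "P1", (char *)0);
        _exit(127);
    }
    if (fork() == 0) {                       /* P2: pipe12 -> pipe23 */
        dup2(p12[0], STDIN_FILENO);
        dup2(p23[1], STDOUT_FILENO);
        close(in); close(out);
        close(p12[0]); close(p12[1]); close(p23[0]); close(p23[1]);
        execl("./P2", "P2", (char *)0);
        _exit(127);
    }
    if (fork() == 0) {                       /* P3: pipe23 -> output */
        dup2(p23[0], STDIN_FILENO);
        dup2(out, STDOUT_FILENO);
        close(in); close(out);
        close(p12[0]); close(p12[1]); close(p23[0]); close(p23[1]);
        execl("./P3", "P3", (char *)0);
        _exit(127);
    }

    /* parent closes its copies and reaps all three children */
    close(in); close(out);
    close(p12[0]); close(p12[1]); close(p23[0]); close(p23[1]);
    while (wait(0) > 0)
        ;
    return 0;
}
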
In the event that the P2 process workload must be migrated onto the slave CPU, a ghost process called P2g is created inside the user space, representing the agent for the slave CPU execution. The command line execution of the ghost process is similar to the previous command, where the difference is that P2g, the ghost process, replaces the original P2:
# P1 < input | P2g | P3 > output
Using this approach, P2g does not do any real computational work, instead just transferring the data to and from the slave CPU's P2 through the FSL device driver. The slave CPU will execute the actual process P2, and P1 and P3 are still entirely ignorant of the fact that P2 is now being executed on a
TABLE III. CUSTOM HARDWARE VS SOFTWARE

(cycles)    IMDCT36    DCT32
SW          1507       2132
HW          1119       723
Speedup     134.67%    294.88%

different CPU. From a software design perspective the value of this abstraction is obvious; however, this approach has a large performance penalty caused by unnecessary buffer copying and user-space to kernel-space switching, as shown in Fig. 3(a).

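For concreteness, a user-space ghost process agent of the kind shown in Fig. 3(a) can be sketched as below: it merely shuttles bytes between its standard pipes and the FSL device driver. The select() loop structure and the /dev/fsl0 path are illustrative assumptions, not the paper's implementation, and the EOF-token handling mentioned later is omitted.

/* Ghost process sketch: forward stdin -> FSL and FSL -> stdout. */
#include <fcntl.h>
#include <sys/select.h>
#include <unistd.h>

int main(void)
{
    int fsl = open("/dev/fsl0", O_RDWR);    /* assumed device node */
    char buf[4096];
    ssize_t n;

    for (;;) {
        fd_set rd;
        FD_ZERO(&rd);
        FD_SET(STDIN_FILENO, &rd);
        FD_SET(fsl, &rd);
        if (select(fsl + 1, &rd, 0, 0, 0) < 0)
            break;

        if (FD_ISSET(STDIN_FILENO, &rd)) {  /* upstream pipe -> slave CPU */
            n = read(STDIN_FILENO, buf, sizeof buf);
            if (n <= 0) break;              /* upstream EOF */
            write(fsl, buf, n);
        }
        if (FD_ISSET(fsl, &rd)) {           /* slave CPU -> downstream pipe */
            n = read(fsl, buf, sizeof buf);
            if (n <= 0) break;
            write(STDOUT_FILENO, buf, n);
        }
    }
    close(fsl);
    return 0;
}
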
In response, we developed a pipe redirection mechanism inside the kernel (as a Linux system call) for the ghost process data transfer without action or intervention of the user space application. The method is inspired by the Linux sendfile() and splice() system calls, which allow zero-copy data transfer between file descriptors. Fig. 3(b) illustrates the concept. The ghost process P2g in Fig. 3(b) is required to initiate the redirection mechanism at start time; after which P2g can either be kept alive for monitoring and controlling the slave CPUs, or it can be terminated once the redirection is initiated.

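The paper's redirection is a custom kernel system call, so no portable user-space equivalent exists; the sketch below only illustrates the zero-copy pattern with the standard splice() call for the upstream direction, under the hypothetical assumption that the FSL device supports splicing.

/* Zero-copy illustration only: move bytes from the upstream pipe
 * (stdin) to the FSL device without a user-space buffer copy.
 * The slave-to-downstream direction would be spliced similarly. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fsl = open("/dev/fsl0", O_RDWR);    /* assumed device node */
    ssize_t n;

    while ((n = splice(STDIN_FILENO, 0, fsl, 0, 65536, SPLICE_F_MOVE)) > 0)
        ;                                   /* loop until upstream EOF */
    close(fsl);
    return 0;
}
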
One other feature not detailed here is the passing of an End-of-file (EOF) token through the pipelines. This allows the termination of data streams to be detected by master and slave processes. With the ghost process scheme described above, the other connected processes inside the OS user space are unaware of where the computation task is executed. Further, a developer can first prototype the whole application as multiple processes running on the master OS, and only later migrate the critical processes onto the slave CPUs without communication overhead penalty.

V. CASE STUDY AND RESULTS

Using the previously described hardware and software architecture, a complete application is tested with different
Fig. 4. MP3 decoding functions and program profiling (frequency-domain functions approximately 53.6% and time-domain functions approximately 46.3% of execution time).

master-slave CPU and custom hardware configurations. The results demonstrate that FPGA based ASMP architectures not only improve application execution but also offer performance invariance against system load.

A. MP3 decoding mapping to FPGA based ASMP
We chose MP3 audio decoding as a case study due to its moderate computational complexity and wide availability of software implementations; libmad was selected for the implementation. One characteristic of libmad is its use of highly optimised, architecture-independent integer operations to perform the DSP required for MP3 decoding. Using such highly optimised software as a starting point makes for a fairer comparison against custom hardware acceleration than using, for example, a naïve floating point implementation.

The original monolithic implementation was analysed with conventional software profiling tools to identify critical sections. Fig. 4 shows the program's internal functions and profile. The time domain and frequency domain functions each represent approximately half of the total 98.9% execution time, and the remaining 1.1% is consumed by disk I/O functions. The separation of the frequency domain and the time domain functions inside libmad offers a natural pipeline-style implementation.

From the above analysis, we divided the original libmad into four separate processes: P0 for the reading of the input MP3 file, P1 for frequency domain functions, P2 for the time domain functions and P3 for the writing of output PCM files. P1 and P2 can be executed on either the master CPU or the slave CPUs (with ghost processes P1g and P2g respectively). If P1 and P2 are both mapped onto slave CPUs, we have the options of connecting them directly or using the master CPU to join them together. Even with only four processes there is a large number of possible mappings.

We further chose two functions, imdct36() and dct32(), which represented a total of 19.2% of execution time, as candidates for custom hardware acceleration. The custom hardware for imdct36() and dct32() can be attached to the P1 and P2 processes either through the master CPU or the slave CPUs, depending on where each process is to be mapped. Other functions like Huffman decoding and anti-alias filtering might also be suitable for custom hardware implementation; however this was not investigated further.

Table III shows the performance gain from using custom hardware over libmad's corresponding software subroutines. libmad uses optimized imdct36() and dct32() implementations that take advantage of the lossy audio decoding algorithm, while the custom hardware used here did not use the optimized algorithm.

Using the mapping methods described above, we have a total of 12 configurations, comprised of 6 major variants, each of those either with or without custom hardware. Table IV shows the six configurations with custom hardware, and the labels given to describe each. These configurations include: (a) ORG_HW, which is the original monolithic program (ORG) but with direct access to the custom HW on the master CPU; (b) PIPE_HW, which splits the original program into four processes connected by Linux pipes; (c) PIPE_SLAVE1_HW,
Fig. 5. Performance of MP3 decoding of different configurations.

which migrates P1 only into the first slave processor; (d) PIPE_SLAVE2_HW, which migrates P2 into the second slave processor; and in (e) and (f), both configurations have P1 and P2 running in the slave processors but use different P1 to P2 data routing configurations. We also implemented all of the above 6 configurations without custom hardware (their labels have no "_HW" suffix).

Using the ghost processes and the utility software introduced in the previous section, the configurations described in Table IV can be reconfigured and executed at runtime. For example, to test the configuration labelled PIPE_SLAVE1_PIPE_SLAVE2, the following commands can be used:

# slave_control -write slave1 < P1g_image
# slave_control -write slave2 < P2g_image
# P0 < input.mp3 | P1g | P2g | P3 > output.pcm
B. Baseline Performance
Each configuration was tested with a fixed rate 128 kbps, 20 second duration stereo MP3 file. Other bitrates were also tested but showed no meaningful deviation from the selected MP3 file.

Fig. 5 shows the runtime results of all 12 configurations. The horizontal line at 20 seconds is the minimum performance which results in flawless playback. Three configurations do not satisfy this constraint: ORG and PIPE are both single CPU-only with no custom hardware; ORG is executed as a single process and PIPE as 4 processes connected by pipes. The difference between these two is largely the OS system overhead associated with multi-tasking and the OS FIFO buffers.

The other configurations all meet the requirement with different margins. This demonstrates that the ASMP architecture and custom hardware are both useful for improving performance. Best performance was achieved by combining the two slave CPUs and the custom hardware, yielding a performance increase of 348% against the original program.

Note that the greatest relative performance improvement comes from the move to the three-CPU ASMP architecture (software only), rather than the addition of any specific piece of custom hardware. Using the custom hardware on the single
TABLE IV. CONFIGURATIONS FOR ASMP MP3 DECODING USING CUSTOM HARDWARE

[Block diagrams of the six custom-hardware configurations: (a) ORG_HW, (b) PIPE_HW, (c) PIPE_SLAVE1_HW, (d) PIPE_SLAVE2_HW, (e) PIPE_SLAVE1_SLAVE2_HW, (f) PIPE_SLAVE1_PIPE_SLAVE2_HW.]

Fig. 6. Normalised Time-Area products (configurations 1: PIPE, 2: ORG, 3: PIPE_HW, 4: ORG_HW, 5: PIPE_SLAVE1, 6: PIPE_SLAVE2, 7: PIPE_SLAVE1_HW, 8: PIPE_SLAVE2_HW, 9: PIPE_SLAVE1_SLAVE2, 10: PIPE_SLAVE1_PIPE_SLAVE2, 11: PIPE_SLAVE1_SLAVE2_HW, 12: PIPE_SLAVE1_PIPE_SLAVE2_HW).

Fig. 7. Time vs. Area (FPGA LUTs used) for all architectures.

CPU only increases the system performance by 54%, while two additional slave CPUs with no custom hardware increase the performance by a factor of 262%. Further improvement from adding custom hardware is marginal.
It is certainly possible to achieve further improvement by moving more functions into hardware, or using more efficient hardware implementations. Even assuming we could achieve a 100 times improvement in all critical sections of the MP3 decoding with hardware, this would improve the execution time of the Huffman decoding, alias reduction, IMDCT, and the filter bank from a total of 60% to 0.6% of execution time. At the same time the total speedup would still be just 250% - less than what was achieved by using two slave CPUs at significantly less development cost.

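This estimate is a direct application of Amdahl's law. Written out (our notation; f is the accelerable fraction, s the hardware speedup):

\[
S_{\text{total}} \;=\; \frac{1}{(1-f) + f/s}
\;=\; \frac{1}{(1-0.6) + 0.6/100}
\;=\; \frac{1}{0.406} \;\approx\; 2.46
\]

i.e. roughly the 250% figure quoted above.
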
C. Performance vs. resource usage
The area-time product is a long-established metric for assessing the relative merit of competing system implementations. In modern FPGA architectures this is somewhat complicated by the presence of discrete blocks or hard macros such as on-chip BRAMs and the DSP48 multiply-accumulate units in Xilinx FPGAs. For simplicity, we have counted only the Look Up Table (LUT) usage reported by the synthesis tools. Logic consumed by common SoC infrastructure (buses, memory controllers, I/O devices etc.) is counted for all systems.

For the 12 configurations, we calculated a normalised area-time product (with the ORG software implementation as 100%), with the results shown in Fig. 6. The relative efficiency of the three-CPU ASMP architectures is significantly better than any other, and once again the presence of custom hardware brings only marginal further benefit. Fig. 7 plots the raw time vs. area data.

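In symbols, the normalisation used here can be written as (our notation, consistent with the description above):

\[
\widehat{AT}_i \;=\; \frac{T_i \, A_i}{T_{\mathrm{ORG}} \, A_{\mathrm{ORG}}} \times 100\%
\]

where $T_i$ is the decoding time and $A_i$ the LUT count of configuration $i$.
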
Note finally that the DSP48 FPGA logic blocks, heavily used by both the dct32 and imdct hardware accelerators, are not counted in this analysis. Modelling a conservative "equivalent LUT count" of one DSP48 to 100 LUTs completely erodes any time-area benefit of using custom hardware in this scenario.

D. Performance under system load
Increasingly, embedded systems are called upon to support a number of ancillary functions beyond their prime application. For this reason, their performance under varying loads is an important metric. We further distinguish between two broad classes - CPU load and I/O load. CPU load arises from tasks competing for access to the ALU, while I/O load arises from requests to external devices. Real applications will of course be some mixture of both.

The program dhrystone is a benchmark which is computationally intensive but performs no I/O. Fig. 8 shows the performance of the different architectural configurations against the load induced by running multiple dhrystone instances on the master CPU. Only those configurations using both slave CPUs (with or without custom hardware) show resilience against increased load. Using custom hardware does improve the performance consistently, but it does not offer immunity against the system load as long as there are still significant code portions executed on the master CPU.

To simulate I/O load we implemented a program that continually reads and writes random data over a network file system. This benchmark is almost entirely I/O dominant, causing very high I/O loads and intensive interrupt activity. Fig. 9 shows the performance of the architectures under multiple instances of this load. One notable feature is the saturation effect beyond a certain load - at this point the system is fully loaded. Once again the configuration using both slave CPUs has no real performance drop against the I/O load, while all other configurations increase the decoding time by up to a factor of 200%.

Avoiding system load effects in a single CPU with only custom hardware would require the entire MP3 decoding function to be moved to custom hardware. However, this may not be practical or cost effective. Using multiple slave CPUs and the custom hardware together offers a realistic and effective solution.

VI. CONCLUSIONS AND FUTURE RESEARCH

This research does not describe any novel algorithms or achieve extraordinary speedup by using FPGA technology for a fixed application. Instead the focus is on the application implementation and the performance gain at a system level. Amdahl's Law is still applicable even with the emergence of new technology. For example, using custom hardware on
Fig. 8. Performance under CPU load.

Fig. 9. Performance under I/O load.
