Streaming consistency: a model for efficient MPSoC design

J.W. van den Brand and M. Bekooij
NXP Research
contact: jan.willem.v.d.brand@nxp.com

Abstract

Multiprocessor systems-on-chip (MPSoC) with distributed shared memory and caches are flexible when it comes to inter-processor communication, but require an efficient memory consistency and cache coherency solution.

In this paper we present a novel consistency model, streaming consistency, for the streaming domain in which tasks communicate through circular buffers. The model allows more reordering than release consistency and, among other optimizations, enables an efficient software cache coherency solution and posted writes.

We also present a software cache coherency implementation and discuss a software circular buffer administration that does not need an atomic read-modify-write instruction. A small experiment demonstrates the potential performance increase of posted writes in MPSoCs with high communication latencies.

Keywords: streaming, memory consistency, cache coherency, MPSoC, NoC
Track, topic area: T1 Systems-on-a-chip/in-a-package, multi-processors
1 Introduction
In this paper, we consider heterogeneous Multiprocessor Systems-on-Chip (MPSoCs) with distributed shared memory (DSM) and caches. In DSM, a single shared address space is distributed over multiple physical memories. Processors communicate through shared memory.

An architecture with a number of processors P with caches ($) and with multiple physical memories is shown in Figure 1. The processors communicate through an interconnect, which can be a bus or a Network-on-Chip (NoC). Examples of real architectures with a similar structure are the TI OMAP platform [5], Philips Semiconductors' Viper [8] and Silicon Hive architectures [3].

Figure 1. Multiprocessor system with background memory.

Multiprocessor systems in which processors communicate through shared memory via caches require a cache coherency solution and a memory consistency model. Cache coherency assures that processors observe up-to-date data in the cache. We are not aware of efficient hardware cache coherency solutions for MPSoCs with NoCs. Therefore, we use a software cache coherency solution for such systems.

A memory consistency model defines the order in which memory operations from one processor appear to other processors. It affects both performance and the programming model. Many consistency models have been proposed for high-performance computers. Sequential consistency (SC) [15] is a model that allows no reordering of memory operations. This is natural from a programmer's perspective. Relaxing the ordering constraints enables pipelining of shared memory accesses, which can significantly improve performance, especially for high communication latencies.

The release consistency model (RC) [10] relaxes many of the SC ordering constraints. RC requires a programmer to use acquire and release synchronization sections. It allows many hardware optimizations compared to SC [1].

In this paper, we propose a new consistency model, streaming consistency (StrC), which is targeted at the streaming domain. Examples of streaming applications are MPEG4 [7], Digital Radio Mondiale [13] and face recognition [14]. StrC allows more reordering and thus more pipelining than RC. Furthermore, it enables optimizations such as efficient software cache coherency and posted writes. These optimizations are desirable for MPSoCs, especially when a NoC interconnect is used.
10th Euromicro Conference on Digital System Design Architectures, Methods and Tools (DSD 2007)
0-7695-2978-X/07 $25.00 © 2007
The key contribution of this paper is the introduction of a new consistency model, StrC, which allows more reordering than RC and which enables optimizations.

This paper is organized as follows. In Section 2 we discuss related work. Then, in Section 3 we give a brief introduction to cache coherency and memory consistency. Existing consistency models and hardware solutions are discussed and we show why these solutions are not well suited for MPSoCs. In Section 4 we present the StrC model that fits such systems. In Section 5 we present software solutions for cache coherency and circular buffer administration. Section 6 shows the potential performance increase of StrC by means of an experiment. Section 7 concludes.
2 Related work

Memory consistency for MPSoCs is discussed in [17]. Release consistency (RC) is chosen as the consistency model, but it is not made clear why this model fits their purposes. Our consistency model allows more reordering than RC and enables several optimizations.

In [18], a heterogeneous architecture is presented to which applications modeled as Kahn Process Networks (KPN) are mapped. Buffers are mapped to background memory. The work allows caches to be placed in so-called shells and argues, as we do, that a more efficient cache coherency mechanism is possible due to explicit communication. The caches used there are dedicated to streaming data. We use a generic cache with minor adjustments that is used for streaming data as well as program data and instructions.

Experimental results in [11] show, in the context of high-performance computing, that the performance gain of using relaxed models can be significant (10-40%) compared to strict models in systems with networks.

In [6], the coupling between memory abstraction and interconnect is discussed. Cache coherency for NoC-based MPSoCs is identified as a challenging issue. Snooping and directory-based solutions are mentioned as unlikely candidates. We also discard these hardware solutions (see Section 3.4). We use a software cache coherency solution.

Experiments in [1] show that software cache coherency solutions perform comparably to hardware solutions for a broad class of programs. It is shown that for well-structured programs the software-based approaches outperform hardware-based approaches.
3 Cache coherency and memory consistency

In this section we give a short introduction to cache coherency and memory consistency. We give examples of potential coherency and consistency problems and we discuss existing consistency models. Finally, we describe widely used hardware solutions for cache coherency and memory consistency and we explain why we think that they are not well suited for MPSoCs with NoCs.

3.1 Cache coherency

Processors in a cached shared memory multiprocessor environment can observe different data for the same memory address. This occurs when writes from one processor are not propagated to the caches of other processors. For instance, in the case of a write-back cache, write data goes by default to the cache and not to background memory. Without a cache coherency policy, this data is only visible to the processor that performed the write.

Caches that have a write-through policy propagate all writes to shared memory. That does not mean that no cache coherency policy is required. On a read, the cache controller marks the fetched line associated with the read address as valid. A write by another processor to the same address is not visible to the first processor without a cache coherency policy.

We illustrate cache coherency with the following example. Consider the communicating tasks in Figure 2, which are mapped on two cached processors, P1 and P2. Note that the print instruction implies a read operation. First the task on P1 writes to A. The value of A is propagated to shared memory if the cache of P1 has a write-through policy. Next, on the read action of P2, its cache fetches the value of A from shared memory, marking the associated cache line as valid, and the task prints the result A = 1. Then P1 writes another value to A, which is again propagated to shared memory. However, on the next read, the cache of P2 will return the old value of A, A = 1, because its line was marked valid.

A similar problem occurs if the cache of P1 has a write-back policy. The value of A would then not be propagated to shared memory in the first place and the cache of P2 will read an unknown value from memory.

Figure 2. Cache coherency problem.

In a cache coherent system, data of all memory addresses that involve interprocessor communication must be observed with the same value by all involved processors. A cache coherency policy is necessary to assure this.
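The stale-read scenario above can be sketched with a toy cache model (our own illustration; the class and its names are not from the paper). A write-through cache propagates writes to memory, yet a valid line still shields the second reader from the update:

```python
# Toy model of the Figure 2 scenario: a write-through cache without
# invalidation returns a stale value once its line is marked valid.
class Cache:
    def __init__(self, mem):
        self.mem = mem          # shared background memory
        self.lines = {}         # address -> cached value ("valid" lines)

    def write(self, addr, value):
        self.lines[addr] = value
        self.mem[addr] = value  # write-through: propagate to memory

    def read(self, addr):
        if addr in self.lines:  # valid line: memory is NOT consulted
            return self.lines[addr]
        value = self.mem[addr]  # miss: fetch and mark the line valid
        self.lines[addr] = value
        return value

mem = {"A": 0}
p1, p2 = Cache(mem), Cache(mem)

p1.write("A", 1)
first = p2.read("A")   # fetches A = 1, line now valid in P2's cache
p1.write("A", 2)       # reaches memory, but P2's line stays "valid"
second = p2.read("A")  # stale: still returns the old value 1
```

The model deliberately omits any coherency policy; the fix the paper proposes is to invalidate the line before the second read.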
3.2 Memory consistency

Every multiprocessor system in which processors communicate through a shared memory should have a memory consistency model that defines how the ordering of memory accesses is handled. Shared memory accesses can be conflicting and non-conflicting. Accesses to a shared address are said to be conflicting if they come from different
[Figure 2 code: P1 runs A = 1; then A = 2; P2 runs print A; after each write.]
processors and at least one of them is a write. The behavior of a program on a conflicting access depends on the order in which the accesses are observed between processors. In order to write a program that exhibits correct functional behavior, a programmer must know what the ordering behavior of the system is. Reordering of memory operations is only allowed if the local program order of the processor is maintained (e.g., a write followed by a read from the same address can never be reordered).

The memory consistency model of a shared memory system specifies the order in which memory operations will appear to execute to the programmer. Consider for example the tasks in Figure 3, taken from [4], where the ordering of operations influences functional behavior. The intention of this program is clear; processor P1 communicates variable A to processor P2 through shared memory and uses a flag (shared variable) for synchronization. The underlying assumption is that the write of A becomes visible to P2 before the write to flag. If this assumption fails, the program will not fulfill the programmer's expectation.

The consistency model determines the programmer's view on shared memory and therefore has a major influence on the programming model.
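The flag protocol of Figure 3 can be sketched with two threads (our own illustration). CPython's interpreter happens to provide the ordering the program relies on; on a weakly ordered machine, the write to A could become visible after the write to flag, which is exactly the failure the paper describes:

```python
# Sketch of the Figure 3 flag protocol. The producer writes the data
# and only then sets the flag; the consumer spins on the flag and then
# reads the data, relying on the writes becoming visible in order.
import threading

shared = {"A": 0, "flag": 0}
result = []

def producer():
    shared["A"] = 1     # data write...
    shared["flag"] = 1  # ...assumed visible no later than the flag write

def consumer():
    while shared["flag"] == 0:   # spin until the flag is set
        pass
    result.append(shared["A"])   # expects to observe A = 1

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t2.start(); t1.start()
t1.join(); t2.join()
```

Under a model without this ordering guarantee, result could hold the stale value 0.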
Figure 3. Event synchronization through a flag.

3.3 Existing consistency models

There are many different memory consistency models proposed in the literature (see for instance [2, 4]). These consistency models target off-chip multiprocessor systems such as high-performance computers with limited resource and energy consumption constraints.

Here we discuss sequential consistency (SC) [15], processor consistency (PC) [1] and release consistency (RC) [10]. The models differ in the amount of reordering that is allowed. Reordering can enable pipelining of shared memory accesses. A model is relaxed compared to another model if it allows more reordering and thus enables more pipelining of shared memory accesses. More relaxed models are introduced to allow more pipelining.

SC requires program order to be maintained among all operations. Every task appears to issue complete memory operations one at a time and atomically in program order [4]. Writes issued by different processors appear in the same order to all processors (write atomicity). This is very natural from a programmer's point of view. For instance, the example from Figure 3 executes perfectly in an SC system. Unfortunately, this very intuitive model poses restrictions on hardware and compiler optimizations [2].

The PC model relaxes the ordering rules defined by the SC model. In this model, writes issued by a single processor are observed at another processor in the same order as they are issued. However, writes issued by different processors do not appear in the same order to all processors, and therefore PC does not satisfy write atomicity. It reflects the reality of complex interconnects, where the latencies between nodes differ for different processor pairs. Because writes in the PC model are guaranteed to appear in order, the example in Figure 3 will work correctly. However, programs written with the SC model in mind are not guaranteed to work in a processor consistent system. The example in Figure 4, taken from [4], shows a program that can fail under PC but works under SC due to write atomicity. P2 reads A, which is written by P1, and then writes B, which in turn is used by P3 as a synchronization flag. Without write atomicity there is no guarantee that the latest version of A, A = 1, is visible to P3 before B = 1 is visible to P3. The wrong value is printed even if all processors receive write data in the order issued by the processors.

PC also allows reads that follow a write to a different address on the same processor to be reordered with respect to each other. Reads can be issued while a write is still in transfer.

Figure 4. Importance of write atomicity for SC.
RC, introduced in [10], is an even more relaxed model. For this model, shared memory accesses are categorized and different ordering requirements can apply to different categories. In RC terminology, a conflicting access is competing if there is a chance that a read and a write occur simultaneously; otherwise it is a non-competing access. A conflicting access is made non-competing by using synchronization. There are two types of synchronization accesses: acquire and release.

A system is RC if the following conditions hold (taken from [10]):
(a) before a non-competing load or store access is allowed to perform with respect to another processor, all previous acquire accesses must be performed, and
(b) before a release access is allowed to perform with respect to any other processor, all previous non-competing load and store accesses must be performed, and
(c) competing accesses (e.g. acquire and release accesses) are processor consistent with respect to one another.

The conditions give ordering requirements between non-competing and competing accesses and between competing accesses. There are no ordering requirements between non-competing accesses.

[Figure 3 code: P1: A = 1; flag = 1;  P2: while (flag == 0); print A;]
[Figure 4 code: P1: A = 1;  P2: while (A == 0); B = 1;  P3: while (B == 0); print A;]
A program written with PC in mind is not guaranteed to work on a system with RC. The example of Figure 3, which executed perfectly under PC, will have consistency problems because the shared memory accesses are not categorized and no synchronization is added to resolve undesired competing accesses.

The diagram in Figure 5, taken from [12], shows the ordering requirements for shared memory accesses to different addresses from a processor in a multiprocessor system for different consistency models. For each model, a program is shown with shared memory accesses. The arrows denote ordering constraints. SC allows no reordering; PC allows reordering of writes followed by a read. RC allows reordering of all shared memory accesses to different addresses outside the synchronization section and inside synchronization sections.

Figure 5. Comparison of three different consistency models.

RC offers extensive pipelining possibilities. In Section 4 we present a consistency model targeted at streaming applications that is more relaxed than RC to allow even more pipelining.
3.4 Hardware solutions

Shared memory architectures that constantly monitor a bus for transactions (snooping bus) have an advantage when it concerns cache coherency and memory consistency solutions. The bus provides a global view on all memory operations. Caches can observe all memory transactions by snooping the bus and take appropriate action when a transaction takes place that concerns data in the cache. Also, ordering of reads and writes according to the selected consistency model can be assured by stalling bus transactions.

However, buses pose limitations on bandwidth. In practice no more than 16 processors are connected to a single shared bus. Moreover, we consider MPSoCs with NoC interconnects. Such interconnects do not provide global observability of memory transactions. To the best of our knowledge, there are no snooping solutions for NoCs and it seems unlikely that efficient solutions will be found.

Directory-based cache coherency approaches are better suited for network interconnects. In a directory-based system, processors and caches assure that they do not violate cache coherency and memory consistency by issuing requests and notifications to a storage place (directory) that maintains the state of all relevant transactions. For cache coherency, the directory is notified of all relevant changes of cache state by the processors and is therefore capable of determining which cache contains the requested data.

The request and notification communication consumes a significant amount of bandwidth, especially when a strict consistency model such as SC is used. The directory memory and control make the hardware significantly more expensive than bus snooping hardware. Finally, the communication from cache to directory and back, and then from cache to cache or from memory to cache, introduces more latency. This latency results in additional processor stall cycles.

Therefore we conclude that both bus snooping and directory-based approaches are not well suited for MPSoC embedded systems with NoC interconnects. In the next section we present a consistency model that enables an efficient software solution.
4 Streaming consistency

The previous section discussed existing consistency models, which were designed for high-performance computers. This section presents a novel consistency model, streaming consistency (StrC), targeted at the streaming domain. It allows more pipelining than RC and enables optimizations such as an efficient software cache coherency solution, which fits MPSoCs with NoCs. First we give a definition of StrC. Then, in Section 4.2, optimizations enabled by StrC are presented.

4.1 Streaming consistency model

StrC targets systems that run streaming applications. Inter-processor communication in such systems is performed by sharing units of data through circular buffers that are located in shared memory. These circular buffers can have multiple producers and consumers.

As for RC, StrC only has ordering constraints with respect to acquire and release calls. However, StrC associates these synchronization variables with circular buffers. A system is streaming consistent if the following conditions hold:
(a) before an access to a circular buffer b is allowed to be performed with respect to any other processor, the associated acquire access, acquire(b), must be performed, and
(b) before a release(b) access is allowed to perform with respect to any other processor, the access to the circular buffer b to which the release is associated must be performed, and
(c) acquire and release accesses for a circular buffer are processor consistent with respect to each other, and
(d) circular buffers are only allowed to be accessed within synchronization sections.
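Conditions (a)-(d) can be illustrated with a minimal sketch of per-buffer acquire/release bookkeeping. All names below are ours; the paper does not prescribe an API, and a real implementation would block rather than return a boolean:

```python
# Sketch of StrC's per-buffer synchronization sections: every access
# to buffer b happens between acquire(b) and release(b) on that same b.
class CircularBuffer:
    def __init__(self, size):
        self.data = [None] * size
        self.size = size
        self.writes = 0   # total elements produced
        self.reads = 0    # total elements consumed

    def space(self):      # free slots seen by the producer
        return self.size - (self.writes - self.reads)

    def available(self):  # filled slots seen by the consumer
        return self.writes - self.reads

def acquire(buf, need_space=0, need_data=0):
    # condition (a): performed before any access to buf
    return buf.space() >= need_space and buf.available() >= need_data

def release(buf, wrote=0, read=0):
    # condition (b): performed after the accesses it covers
    buf.writes += wrote
    buf.reads += read

b = CircularBuffer(4)
assert acquire(b, need_space=1)        # producer's section
b.data[b.writes % b.size] = "token"    # condition (d): access inside it
release(b, wrote=1)
assert acquire(b, need_data=1)         # consumer's section
value = b.data[b.reads % b.size]
release(b, read=1)
```

Sections on different CircularBuffer instances share no state here, which mirrors why StrC imposes no ordering between them.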
StrC allows reordering of synchronization sections that are associated with different buffers, i.e. such synchronization sections are allowed to overtake each other. This is different from RC, where synchronization sections can overlap but can never overtake each other. The RC conditions allow overlap as long as all accesses are performed before the following release and after the preceding acquire. Figure 6(a), taken from [10], shows the overlap possibilities for RC. Synchronization sections can never overtake preceding synchronization sections.
Figure 6. Overlap possibilities for SC, RC and StrC: (a) RC, (b) StrC.

The example in Figure 7 illustrates why these ordering constraints are crucial for RC. The example shows two processes, P1 and P2. P1 writes a value to a outside a synchronization section and writes a value to b inside a synchronization section. The RC conditions guarantee that the write to a is visible to other processors after the release. P2 uses a and b. Availability of a after the release is guaranteed by the RC conditions. This kind of implicit synchronization is not allowed in StrC, as every write or read to shared memory must take place inside a synchronization section and is associated with that section. Therefore, StrC has no ordering constraints between synchronization sections that are associated with different buffers, as shown in Figure 6(b).

Figure 7. Implicit synchronization in RC.

An example of an application that can exploit the overtake possibility of StrC is shown in Figure 8. It shows a streaming application where data is read from buffer1, followed by a computation with the obtained data, followed by a write of the new value to buffer2, over and over again. The read from buffer1 for the second iteration can start immediately after the first read. The read from buffer1 can be pipelined with the computation and the write to buffer2. On an RC system, the acquire of buffer1 for the second iteration has to wait for the acquire of buffer2 of the first iteration. Also, buffer1 cannot be released before buffer2 is released. This limits the pipelining possibilities.

Figure 8. StrC overlap example: (a) code, (b) overlap.
StrC relaxes RC. In StrC, the ordering constraints only have to be obeyed with respect to a certain circular buffer. A program written for StrC will run properly on an RC system, because a program written for a more relaxed model executes properly on a system with a stricter model.

Concerning the programming model, StrC requires programmers to explicitly program shared memory communication through circular buffers in synchronization sections. Streaming applications expose this explicit communication. StrC therefore does not complicate programming.

Besides more pipelining possibilities, StrC also enables optimizations, which are presented in the next section.
[Figure 6 content: for RC (a), sequences of acquire, memory access and release with arrows forcing each section to complete before the next; for StrC (b), sections acquire(S1)...release(S1) and acquire(S2)...release(S2) with no ordering arrows between them.]

Figure 7 code:

P1:              P2:
a = 1            acquire(S1)
acquire(S1)      c = b
b = 2            release(S1)
release(S1)      d = a + c

Figure 8 code:

while (true)
{
    acquire(buffer1); // data available?
    a = read(buffer1);
    release(buffer1); // release space
    b = computation(a);
    acquire(buffer2); // space available?
    write(buffer2, b);
    release(buffer2); // release data
}

[Figure 8 overlap: successive comp. phases pipelined with the r(buf1) and w(buf2) communication phases over time.]
4.2 Optimization possibilities

StrC enables several optimizations. These optimizations require StrC and have additional requirements. In this section we discuss three optimizations: efficient software cache coherency, deterministic functional behavior and posting of writes.

4.2.1 Efficient software cache coherency

In Section 3.4 we discussed the limitations of hardware cache coherency solutions. StrC enables efficient software cache coherency. Software cache coherency adds cache line flush calls after writes to shared memory and cache line invalidate calls before reads from shared memory. These calls can be added to acquires and releases. Invalidates of all shared addresses that are read after an acquire must be executed before the reads that follow the acquire. Flushes of all shared addresses that are written prior to a release must be executed before that release.

RC allows programmers to program shared memory communication anywhere in a program. Every memory access is potentially a shared memory access. Shared memory accesses are not directly linked to acquire and release calls, which makes it hard to determine which cache lines should be invalidated on acquire and flushed on release. In StrC, shared data is communicated in synchronization sections through circular buffers that are linked to these sections. Therefore, flush and invalidate calls can easily be added to acquire and release calls.

Caches communicate at the granularity of cache lines, which typically consist of multiple words. Multiple processors can access different words in the same cache line. This is known as false sharing. Flushing of a line by one processor affects data in shared memory for another processor. False sharing can be eliminated by not allowing multiple processors to access different words of the same cache line. For StrC, false sharing can be prevented by only allowing a single circular buffer per cache line. If data is scattered in shared memory, this may result in partially used cache lines. To prevent this, a circular buffer should be located in a consecutive address range. For RC, it is difficult to prevent false sharing because shared accesses are not explicit in the program.

In Section 5 we discuss a software cache coherency solution.
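The flush-on-release and invalidate-on-acquire idea can be sketched as follows. The toy write-back cache, the line size and the call names are our own assumptions, not the paper's implementation:

```python
# Sketch of software coherency: acquire invalidates the lines the
# section will read, release flushes the lines the section wrote.
LINE = 4  # words per cache line (assumed)

class CoherentCache:
    def __init__(self, mem):
        self.mem = mem
        self.lines = {}              # line index -> list of cached words

    def invalidate(self, lo, hi):    # before reads (done on acquire)
        for ln in range(lo // LINE, hi // LINE + 1):
            self.lines.pop(ln, None)

    def flush(self, lo, hi):         # after writes (done on release)
        for ln in range(lo // LINE, hi // LINE + 1):
            if ln in self.lines:
                self.mem[ln * LINE:(ln + 1) * LINE] = self.lines[ln]

    def write(self, addr, value):    # write-back: stays in the cache
        ln = addr // LINE
        if ln not in self.lines:     # write-allocate: fetch the line
            self.lines[ln] = self.mem[ln * LINE:(ln + 1) * LINE]
        self.lines[ln][addr % LINE] = value

    def read(self, addr):
        ln = addr // LINE
        if ln not in self.lines:     # miss: fetch and cache the line
            self.lines[ln] = self.mem[ln * LINE:(ln + 1) * LINE]
        return self.lines[ln][addr % LINE]

mem = [0] * 16
writer, reader = CoherentCache(mem), CoherentCache(mem)

writer.write(5, 42)       # lands in the writer's cache only
writer.flush(5, 5)        # "release": now visible in shared memory
reader.invalidate(5, 5)   # "acquire": drop any stale copy
value = reader.read(5)    # fresh fetch observes the new data
```

Because the whole line is flushed, two buffers sharing one line would interfere; that is the false-sharing hazard the text addresses by allotting one circular buffer per cache line.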
4.2.2 Deterministic functional behavior

The functional behavior of a task often depends on the received data. We assume that tasks that run on different processors are scheduled independently. The circular buffer in Figure 9(a) has multiple producing and multiple consuming tasks. All consuming tasks access the same data before the space is released. What data is put in the buffer by the producers, and in what order, depends on the execution order of the producing tasks and on the communication latency (i.e. data from a task that is scheduled earlier than another task can arrive later at the buffer than data from the second task, due to communication latency). This limits the extent to which one can reason about the functional behavior of an application at design time.

Figure 9. Number of producers and consumers per circular buffer: (a) multiple producers, (b) single producer.

Reasoning about functional behavior at design time is enabled for the StrC model by only allowing a single producer per circular buffer, as depicted in Figure 9(b). For RC, it is hard to determine whether a shared address has multiple producers because shared accesses are not explicit.
4.2.3 Posting of writes

A posted write is a write that allows operations that follow the write to be executed without confirmation of the write. The SC model does not support posting of writes because reordering of memory transactions is not allowed. A write has to be completed before the following operation is allowed to execute.

For RC, all writes that precede a release call have to be finished before the release call is allowed to finish, because writes are not allowed to be reordered with respect to a following release. In-order delivery at shared memory of the writes to shared data and the write to the synchronization variable (i.e. FIFO behavior) is sufficient. However, RC says nothing about where the different addresses that are being written to are physically located. In a DSM system, it is likely that these addresses are distributed over different physical memories. If this is the case, in-order delivery of the writes to shared data and the synchronization variable is hard to guarantee. In this case RC requires writes to be confirmed and does not support posted writes.

For StrC, a synchronization section is associated with a circular buffer. If the circular buffer is located in the same memory as its synchronization variable, in-order delivery is feasible. StrC then supports posted writes.

Furthermore, a write followed by a read from the same address requires a system to assure that the write has performed before the read in order to continue with the read. RC allows a combination of reads and writes to the same
shared address in- and outside synchronization sections. Therefore, a mechanism is required to assure that correct data is returned on a read following a write.

StrC allows a combination of reads and writes from and to the same circular buffer in a synchronization section. Special hardware support (e.g. write buffers) is not necessary if we do not allow a combination of reads and writes from/to the same circular buffer inside a synchronization section.
5 Software solution

In this section we present a software cache coherency solution. This approach is enabled by StrC, presented in the previous section. As opposed to the hardware bus snooping and directory-based solutions described in Section 3.4, this solution is well suited for NoC-based MPSoCs.

We also discuss a solution for implementing the circular buffer administration in software.

5.1 Cache coherency

In StrC, the interprocessor communication is more explicit than in RC. It is known what address range is communicated, and the moments before and after communication of shared data are clearly marked by acquire and release. This allows us to add invalidate and flush instructions when buffer data is read or written. On an acquire, before data is read, all cache lines that contain old data for the involved shared addresses are marked as invalid. This ensures that the latest data is observed. The software cache coherency approach flushes cache lines that contain data of writes to shared memory before the release call, to ensure that the shared memory has the most recent copy.

The approach requires caches to have a means to invalidate and flush cache lines by address. Such functionality is present in many caches; the ARM11 and TriMedia are examples of cached processors that support this.
`5.2 Circular buffer administration
`We use a software circular buffer administration to as-
`sure that no data is read if no data is available and that no
`data is written if no space is available. For this purpose, we
`use the C-HEAP protocol [9, 16]. In this protocol a buffer
`administration is maintained for number of reads and writes
`from and to the buffer (read counter and write counter).
`No hardware support is needed and no atomic read-modify-
`write operation is required. The read and write pointers are
`increased after data is read or written. Wrap around of a
`pointer occurs if a pointer would fall outside the memory
`range of the buffer. The amount of data or space in a buffer
`can be deduced from the values of the read and write pointer
`and the size of the buffer.
Figure 10 shows the steps that are performed on a buffer write and a buffer read for software cache coherency (see the previous section) and circular buffer administration. A write starts by polling the buffer administration to check whether there is space available in the buffer (1). If so, the data is written (2) and then flushed from the cache (3). Finally, the buffer administration is updated (4). A read from a buffer starts by checking whether there is data in the buffer (5). Next, the cache lines associated with the data are invalidated (6). Then the data is read from shared memory (7) and the buffer administration is updated (8).
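The eight steps can be put together in a sketch of the two sides. The `flush` and `invalidate` calls are no-op stand-ins for the cache-maintenance instructions of Section 5.1 (this sketch accesses the buffer directly, without a cache model), and the counter arithmetic again assumes counters modulo twice the buffer size.

```python
# Sketch of a buffer write (steps 1-4) and a buffer read (steps 5-8).
# flush/invalidate are stand-ins for real cache-maintenance
# instructions; here the buffer is accessed uncached, so they are no-ops.

N = 4                      # buffer capacity in tokens
buf = [None] * N           # buffer in shared memory
wc = rc = 0                # write and read counters, modulo 2*N

def flush(addr):           # stand-in for a cache flush by address
    pass

def invalidate(addr):      # stand-in for a cache invalidate by address
    pass

def buffer_write(token):
    global wc
    if (wc - rc) % (2 * N) == N:   # (1) poll: is there space?
        return False
    buf[wc % N] = token            # (2) write the data
    flush(wc % N)                  # (3) flush it from the cache
    wc = (wc + 1) % (2 * N)        # (4) update the administration
    return True

def buffer_read():
    global rc
    if (wc - rc) % (2 * N) == 0:   # (5) poll: is there data?
        return None
    invalidate(rc % N)             # (6) invalidate stale cache lines
    token = buf[rc % N]            # (7) read from shared memory
    rc = (rc + 1) % (2 * N)        # (8) update the administration
    return token

assert buffer_write("t0")
assert buffer_read() == "t0"
assert buffer_read() is None       # buffer is empty again
```

Note the ordering: the administration is updated only after the flush (write side) and the data is touched only after the invalidate (read side), so the counters never advertise data or space that is not yet coherent.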

Figure 10. Software cache coherency and memory consistency.

Performing cache coherency instructions adds cycles. However, the software approach removes hardware cost by transferring the cost of detecting coherency problems (protocol overhead) from hardware to software.

6 Experimental results
In this section we show the potential performance increase of using the StrC model by means of an experiment. We focus on the posted-write optimization from Section 4.2.3. The experiment shows the negative effect of NoC latency on systems that do not allow posting of writes; SC does not support posted writes, and RC does not support them by default.
Our test application consists of a producer and a consumer task. Each task runs on its own ARM7 processor with local memory, clocked at 100 MHz. The produced token (data) is put in a buffer that is located in the local memory of the consumer processor. We measure the execution time with and without posted writes; in the system without posted writes, each write has to be acknowledged.
Figure 11 shows a bar plot of the execution times of the producer with and without posted writes for different NoC latencies and for two diff
