J.W. van den Brand and M. Bekooij
NXP Research
contact: jan.willem.v.d.brand@nxp.com
Abstract

Multiprocessor systems-on-chip (MPSoC) with distributed shared memory and caches are flexible when it comes to inter-processor communication but require an efficient memory consistency and cache coherency solution.

In this paper we present a novel consistency model, streaming consistency, for the streaming domain, in which tasks communicate through circular buffers. The model allows more reordering than release consistency and, among other optimizations, enables an efficient software cache coherency solution and posted writes.

We also present a software cache coherency implementation and discuss a software circular buffer administration that does not need an atomic read-modify-write instruction. A small experiment demonstrates the potential performance increase of posted writes in MPSoCs with high communication latencies.
`
Keywords: streaming, memory consistency, cache coherency, MPSoC, NoC
Track, topic area: T1 Systems-on-a-chip/in-a-package, multi-processors
`
1 Introduction

In this paper, we consider heterogeneous Multiprocessor Systems-on-Chip (MPSoCs) with distributed shared memory (DSM) and caches. In DSM, a single shared address space is distributed over multiple physical memories. Processors communicate through shared memory.

An architecture with a number of processors P with caches $ and with multiple physical memories is shown in Figure 1. The processors communicate through an interconnect, which can be a bus or a Network-on-Chip (NoC). Examples of real architectures with a similar structure are the TI OMAP platform [5], Philips Semiconductors' Viper [8] and Silicon Hive architectures [3].

Figure 1. Multiprocessor system with background memory.

Multiprocessor systems in which processors communicate through shared memory via caches require a cache coherency solution and a memory consistency model. Cache coherency assures that processors observe up-to-date data in the cache. We are not aware of efficient hardware cache coherency solutions for MPSoCs with NoCs. Therefore, we use a software cache coherency solution for such systems.

A memory consistency model defines the order in which memory operations from one processor appear to other processors. It affects both performance and the programming model. Many consistency models have been proposed for high performance computers. Sequential consistency (SC) [15] is a model that allows no reordering of memory operations. This is natural from a programmer's perspective. Relaxing the ordering constraints enables pipelining of shared memory accesses, which can significantly improve performance, especially for high communication latencies.

The release consistency model (RC) [10] relaxes many of the SC ordering constraints. RC requires a programmer to use acquire and release synchronization sections. It allows many hardware optimizations compared to SC [1].

In this paper, we propose a new consistency model, streaming consistency (StrC), which is targeted at the streaming domain. Examples of streaming applications are MPEG4 [7], Digital Radio Mondiale [13] and face recognition [14]. StrC allows more reordering and thus more pipelining than RC. Furthermore, it enables optimizations such as efficient software cache coherency and posted writes. These optimizations are desirable for MPSoCs, especially when a NoC interconnect is used.
`
`
10th Euromicro Conference on Digital System Design Architectures, Methods and Tools (DSD 2007)
0-7695-2978-X/07 $25.00 © 2007
`
The key contribution of this paper is the introduction of a new consistency model, StrC, which allows more reordering than RC and which enables optimizations.

This paper is organized as follows. In Section 2 we discuss related work. Then, in Section 3 we give a brief introduction to cache coherency and memory consistency. Existing consistency models and hardware solutions are discussed, and we show why these solutions are not well suited for MPSoCs. In Section 4 we present the StrC model that fits such systems. In Section 5 we present software solutions for cache coherency and circular buffer administration. Section 6 shows the potential performance increase of StrC by means of an experiment. Section 7 concludes.
`
2 Related work

Memory consistency for MPSoCs is discussed in [17]. Release consistency (RC) is chosen as the consistency model, but it is not made clear why this model fits their purposes. Our consistency model allows more reordering than RC and enables several optimizations.

In [18], a heterogeneous architecture is presented to which applications modeled as Kahn Process Networks (KPN) are mapped. Buffers are mapped to background memory. The work allows caches to be placed in so-called shells and argues, as we do, that a more efficient cache coherency mechanism is possible due to explicit communication. The caches used are dedicated to streaming data. We use a generic cache with minor adjustments that is used for streaming data as well as program data and instructions.

Experimental results in [11] show, in the context of high performance computing, that the performance gain of using relaxed models can be significant (10-40%) compared to strict models in systems with networks.

In [6], the coupling between memory abstraction and interconnect is discussed. Cache coherency for NoC-based MPSoCs is identified as a challenging issue. Snooping and directory-based solutions are mentioned as unlikely candidates. We also discard these hardware solutions (see Section 3.4). We use a software cache coherency solution.

Experiments in [1] show that software cache coherency solutions perform comparably to hardware solutions for a broad class of programs. It is shown that for well-structured programs the software-based approaches outperform hardware-based approaches.
`
3 Cache coherency and memory consistency

In this section we give a short introduction to cache coherency and memory consistency. We give examples of potential coherency and consistency problems and we discuss existing consistency models. Finally, we describe widely used hardware solutions for cache coherency and memory consistency and we explain why we think they are not well suited for MPSoCs with NoCs.

3.1 Cache coherency

Processors in a cached shared memory multiprocessor environment can observe different data for the same memory address. This occurs when writes from one processor are not propagated to the caches of other processors. For instance, in the case of a write-back cache, write data goes by default to the cache and not to background memory. Without a cache coherency policy, this data is only visible to the processor that performed the write.

Caches with a write-through policy propagate all writes to shared memory. That does not mean that no cache coherency policy is required. On a read, the cache controller marks the fetched line associated with the read address as valid. A write by another processor to the same address is then not visible to the first processor without a cache coherency policy.
We illustrate cache coherency with the following example. Consider the communicating tasks in Figure 2, which are mapped on two cached processors, P1 and P2. Note that the print instruction implies a read operation. First the task of P1 writes to A. The value of A is propagated to shared memory if the cache of P1 has a write-through policy. Next, on the read action of P2, its cache fetches the value of A from shared memory, marking the associated cache line as valid, and the task prints the result A = 1. Then P1 writes another value to A, which is again propagated to shared memory. However, on the next read, the cache of P2 will return the old value of A, A = 1, because its line was marked valid.

A similar problem occurs if the cache of P1 has a write-back policy. The value of A would then not be propagated to shared memory in the first place and the cache of P2 will read an unknown value from memory.

    P1          P2
    A = 1;
                print A;
    A = 2;
                print A;

Figure 2. Cache coherency problem.

In a cache coherent system, data of all memory addresses that involve interprocessor communication must be observed with the same value by all involved processors. A cache coherency policy is necessary to assure this.

3.2 Memory consistency

Every multiprocessor system in which processors communicate through a shared memory should have a memory consistency model that defines how the ordering of memory accesses is handled. Shared memory accesses can be conflicting and non-conflicting. Accesses to a shared address are said to be conflicting if they come from different processors and at least one of them is a write. The behavior of a program on a conflicting access depends on the order in which the accesses are observed between processors. In order to write a program that exhibits correct functional behavior, a programmer must know what the ordering behavior of the system is. Reordering of memory operations is only allowed if the local program order of the processor is maintained (e.g. a write followed by a read from the same address can never be reordered).
The memory consistency model of a shared memory system specifies the order in which memory operations will appear to execute to the programmer. Consider for example the tasks in Figure 3, taken from [4], where the ordering of operations influences functional behavior. The intention of this program is clear: processor P1 communicates variable A to processor P2 through shared memory and uses a flag (shared variable) for synchronization. The underlying assumption is that the write of A becomes visible to P2 before the write to flag. If this assumption fails, the program will not fulfill the programmer's expectation.

The consistency model determines the programmer's view on shared memory and therefore has a major influence on the programming model.
`
3.3 Existing consistency models

There are many different memory consistency models proposed in the literature (see for instance [2, 4]). These consistency models target off-chip multiprocessor systems such as high performance computers with limited resource and energy consumption constraints.

Here we discuss sequential consistency [15] (SC), processor consistency [1] (PC) and release consistency [10] (RC). The models differ in the amount of reordering that is allowed. Reordering can enable pipelining of shared memory accesses. A model is relaxed compared to another model if it allows more reordering and thus enables more pipelining of shared memory accesses. More relaxed models are introduced to allow more pipelining.

SC requires program order to be maintained among all operations. Every task appears to issue complete memory operations one at a time and atomically in program order [4]. Writes issued by different processors appear in the same order to all processors (write atomicity). This is very natural from a programmer's point of view. For instance, the example from Figure 3 executes perfectly in an SC system. Unfortunately, this very intuitive model poses restrictions on hardware and compiler optimizations [2].

Figure 3. Event synchronization through a flag.

The PC model relaxes the ordering rules defined by the SC model. In this model, writes issued by a single processor are observed at another processor in the same order as they are issued. However, writes issued by different processors do not appear in the same order to all processors, and therefore PC does not satisfy write atomicity. It reflects the reality of complex interconnects, where the latencies between nodes differ for different processor pairs. Because writes in the PC model are guaranteed to appear in order, the example in Figure 3 will work correctly. However, programs written with the SC model in mind are not guaranteed to work in a processor consistent system. The example in Figure 4, taken from [4], shows a program that can fail under PC but works under SC due to write atomicity. P2 reads A, which is written by P1, and then writes B, which in turn is used by P3 as a synchronization flag. Without write atomicity there is no guarantee that the latest version of A, A = 1, is visible to P3 before B = 1 is visible to P3. The wrong value is printed even if all processors receive write data in the order issued by the processors.

PC also allows a read that follows a write to a different address on the same processor to be reordered with respect to that write. Reads can be issued while a write is still in transfer.

Figure 4. Importance of write atomicity for SC.
`
RC, introduced in [10], is an even more relaxed model. For this model, shared memory accesses are categorized and different ordering requirements can apply to different categories. In RC terminology, a conflicting access is competing if there is a chance that a read and a write occur simultaneously; otherwise it is a non-competing access. A conflicting access is made non-competing by using synchronization. There are two types of synchronization accesses: acquire and release.
A system is RC if the following conditions hold (taken from [10]):
(a) before a non-competing load or store access is allowed to perform with respect to another processor, all previous acquire accesses must be performed, and
(b) before a release access is allowed to perform with respect to any other processor, all previous non-competing load and store accesses must be performed, and
(c) competing accesses (e.g. acquire and release accesses) are processor consistent with respect to one another.

The conditions give ordering requirements between non-competing and competing accesses and between competing accesses. There are no ordering requirements between non-competing accesses.

Figure 3 code:
    P1           P2
    A = 1;       while (flag == 0);
    flag = 1;    print A;

Figure 4 code:
    P1           P2                 P3
    A = 1;       while (A == 0);    while (B == 0);
                 B = 1;             print A;

A program written with PC in mind is not guaranteed to work on a system with RC. The example of Figure 3, which executed perfectly under PC, will have consistency problems because the shared memory accesses are not categorized and no synchronization is added to resolve undesired competing accesses.
The diagram in Figure 5, taken from [12], shows the ordering requirements for shared memory accesses to different addresses from a processor in a multiprocessor system for different consistency models. For each model, a program is shown with shared memory accesses. The arrows denote ordering constraints. SC allows no reordering; PC allows reordering of a write followed by a read. RC allows reordering of all shared memory accesses to different addresses outside synchronization sections and inside synchronization sections.

Figure 5. Comparison of three different consistency models.

RC offers extensive pipelining possibilities. In Section 4 we present a consistency model targeted at streaming applications that is more relaxed than RC, to allow even more pipelining.
`
3.4 Hardware solutions

Shared memory architectures that constantly monitor a bus for transactions (snooping bus) have an advantage when it concerns cache coherency and memory consistency solutions. The bus provides a global view on all memory operations. Caches can observe all memory transactions by snooping the bus and take appropriate action when a transaction takes place that concerns data in the cache. Also, ordering of reads and writes according to the selected consistency model can be assured by stalling bus transactions.

However, buses pose limitations on bandwidth. In practice, no more than 16 processors are connected to a single shared bus. Moreover, we consider MPSoCs with NoC interconnects. Such interconnects do not provide global observability of memory transactions. To the best of our knowledge, there are no snooping solutions for NoCs and it seems unlikely that efficient solutions will be found.

Directory-based cache coherency approaches are better suited for network interconnects. In a directory-based system, processors and caches assure that they do not violate cache coherency and memory consistency by issuing requests and notifications to a storage place (the directory) that maintains the state of all relevant transactions. For cache coherency, the directory is notified of all relevant changes of cache state by processors and is therefore capable of determining which cache contains the requested data.

The request and notification communication consumes a significant amount of bandwidth, especially when a strict consistency model such as SC is used. The directory memory and control make the hardware significantly more expensive than bus snooping hardware. Finally, the communication from cache to directory and back, and then from cache to cache or from memory to cache, introduces more latency. This latency results in additional processor stall cycles.

Therefore we conclude that both bus snooping and directory-based approaches are not well suited for MPSoC embedded systems with NoC interconnects. In the next section we present a consistency model that enables an efficient software solution.
4 Streaming consistency

The previous section discussed existing consistency models, which were designed for high performance computers. This section presents a novel consistency model, streaming consistency (StrC), targeted at the streaming domain. It allows more pipelining than RC and enables optimizations such as an efficient software cache coherency solution, which fits MPSoCs with NoCs. First we give a definition of StrC. Then, in Section 4.2, optimizations enabled by StrC are presented.

4.1 Streaming consistency model

StrC targets systems that run streaming applications. Inter-processor communication in such systems is performed by sharing units of data through circular buffers that are located in shared memory. These circular buffers can have multiple producers and consumers.

As for RC, StrC only has ordering constraints with respect to acquire and release calls. However, StrC associates these synchronization variables with circular buffers. A system is streaming consistent if the following conditions hold:
(a) before an access to a circular buffer b is allowed to be performed with respect to any other processor, the associated acquire access, acquire(b), must be performed, and
`
`
(b) before a release(b) access is allowed to perform with respect to any other processor, the access to the circular buffer b to which the release is associated must be performed, and
(c) acquire and release accesses for a circular buffer are processor consistent with respect to each other, and
(d) circular buffers are only allowed to be accessed within synchronization sections.

StrC allows reordering of synchronization sections that are associated with different buffers, i.e. such synchronization sections are allowed to overtake each other. This is different from RC, where synchronization sections can overlap but can never overtake each other. The RC conditions allow overlap as long as all accesses are performed before the following release and after the preceding acquire. Figure 6(a), taken from [10], shows the overlap possibilities for RC. Synchronization sections can never overtake preceding synchronization sections.
`
Figure 7. Implicit synchronization in RC.

The example in Figure 7 illustrates why these ordering constraints are crucial for RC. The example shows two processes, P1 and P2. P1 writes a value to a outside a synchronization section and writes a value to b inside a synchronization section. The RC conditions guarantee that the write to a is visible to other processors after the release. P2 uses a and b. Availability of a after the release is guaranteed by the RC conditions. This kind of implicit synchronization is not allowed in StrC, as every write or read to shared memory must take place inside a synchronization section and is associated with that section. Therefore, StrC has no ordering constraints between synchronization sections that are associated with different buffers, as shown in Figure 6(b).

(a) RC    (b) StrC

Figure 6. Overlap possibilities for SC, RC and StrC.

An example of an application that can exploit the overtake possibility of StrC is shown in Figure 8. It shows a streaming application where, over and over again, data is read from buffer1, followed by a computation with the obtained data, followed by a write of the new value to buffer2. The read from buffer1 for the second iteration can start immediately after the first read. The read from buffer1 can be pipelined with the computation and the write to buffer2.

On an RC system, the acquire of buffer1 for the second iteration has to wait for the acquire of buffer2 of the first iteration. Also, buffer1 cannot be released before buffer2 is released. This limits the pipelining possibilities.

(a) Code    (b) Overlap

Figure 8. StrC overlap example.
`
StrC relaxes RC. In StrC, the ordering constraints only have to be obeyed with respect to a certain circular buffer. A program written for StrC will run properly on an RC system because a program written for a more relaxed model executes properly on a system with a stricter model.

Concerning the programming model, StrC requires programmers to explicitly program shared memory communication through circular buffers in synchronization sections. Streaming applications expose this explicit communication. StrC therefore does not complicate programming.

Besides more pipelining possibilities, StrC also enables optimizations, which are presented in the next section.
`
Figure 6 content: (a) RC: acquire; memory access; release; memory access; acquire; memory access; release. (b) StrC: acquire(S1); acquire(S2); shared mem access; shared mem access; release(S1); release(S2).

Figure 7 code:
    P1              P2
    a = 1           acquire(S1)
    acquire(S1)     c = b
    b = 2           release(S1)
    release(S1)     d = a + c

Figure 8(a) code:
    while (true)
    {
        acquire(buffer1);  // data available?
        a = read(buffer1);
        release(buffer1);  // release space
        b = computation(a);
        acquire(buffer2);  // space available?
        write(buffer2, b);
        release(buffer2);  // release data
    }

Figure 8(b): computation (comp. ... comp.) pipelined with communication (r(buf1) ... w(buf2)) over time.
`
`
`
4.2 Optimization possibilities

StrC enables several optimizations. These optimizations require StrC and have additional requirements. In this section we discuss three optimizations: efficient software cache coherency, deterministic functional behavior and posting of writes.
`
4.2.1 Efficient software cache coherency

In Section 3.4 we discussed the limitations of hardware cache coherency solutions. StrC enables efficient software cache coherency. Software cache coherency adds cache line flush calls after writes to shared memory and cache line invalidate calls before reads from shared memory. These calls can be added to acquires and releases. Invalidates of all shared addresses that are read after an acquire must be executed before the reads that follow the acquire. Flushes of all shared addresses that are written prior to a release must be executed before that release.

RC allows programmers to program shared memory communication anywhere in a program. Every memory access is potentially a shared memory access. Shared memory accesses are not directly linked to acquire and release calls, which makes it hard to determine which cache lines should be invalidated on acquire and flushed on release. In StrC, shared data is communicated in synchronization sections through circular buffers that are linked to these sections. Therefore, flush and invalidate calls can easily be added to acquire and release calls.
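As a minimal sketch of how such calls could be attached to the synchronization primitives, consider the following. The hook names cache_invalidate_range and cache_flush_range are our own assumptions (a real port would issue the processor's cache maintenance operations); here they only record their call order so the sketch is host-runnable.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical cache hooks (names are ours, not from any real API):
   these stubs record the call order; a real port would perform
   invalidate/flush-by-address cache maintenance operations. */
static char trace[8];
static size_t tn;
static void cache_invalidate_range(void *a, size_t n) { (void)a; (void)n; trace[tn++] = 'I'; }
static void cache_flush_range(void *a, size_t n)      { (void)a; (void)n; trace[tn++] = 'F'; }

typedef struct {
    void  *data;   /* acquired window inside the circular buffer */
    size_t len;
} window_t;

/* acquire for reading: synchronize first, then invalidate the acquired
   window so the reads that follow observe up-to-date data. */
static void acquire_read(window_t *w) {
    trace[tn++] = 'a';                        /* stands in for the sync step */
    cache_invalidate_range(w->data, w->len);
}

/* release after writing: flush the written window before the release
   itself, so shared memory holds the most recent copy. */
static void release_write(window_t *w) {
    cache_flush_range(w->data, w->len);
    trace[tn++] = 'r';                        /* stands in for the sync step */
}
```

The point of the sketch is the ordering: invalidation happens between the acquire and the reads, and the flush happens between the writes and the release, exactly as the conditions above require.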
Caches communicate at the granularity of cache lines, which typically consist of multiple words. Multiple processors can access different words in the same cache line. This is known as false sharing. Flushing of a line by one processor then affects data in shared memory for another processor. False sharing can be eliminated by not allowing multiple processors to access different words of the same cache line. For StrC, false sharing can be prevented by allowing only a single circular buffer per cache line. If data is scattered in shared memory, this may result in partially used cache lines. To prevent this, a circular buffer should be located in a consecutive address range. For RC, it is difficult to prevent false sharing because shared accesses are not explicit in the program.
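One way to realize the one-buffer-per-line rule in C is to give each circular buffer storage that is aligned to the cache line size; the compiler then also pads the type to whole lines. This is a sketch under the assumption of a 32-byte line size, which varies between real targets.

```c
#include <stdalign.h>
#include <stddef.h>

/* Assumed cache line size, for illustration only; real targets differ. */
#define CACHE_LINE 32

/* Line-aligned buffer storage: because the struct's alignment equals the
   line size, its size is also padded to whole lines, so no other buffer
   or unrelated variable can share a cache line with it. This rules out
   false sharing between the processors that access different buffers. */
typedef struct {
    alignas(CACHE_LINE) int words[60];   /* 240 B payload, padded to 256 B */
} stream_buf_t;
```

Allocating each buffer as a stream_buf_t also keeps it in a consecutive address range, as the text recommends.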
In Section 5 we discuss a software cache coherency solution.
`
4.2.2 Deterministic functional behavior

The functional behavior of a task often depends on the received data. We assume that tasks that run on different processors are scheduled independently. The circular buffer in Figure 9(a) has multiple producing and multiple consuming tasks. All consuming tasks access the same data before the space is released. What data is put in the buffer by the producers, and in what order, depends on the execution order of the producing tasks and on the communication latency (i.e. data from a task that is scheduled earlier than another task can arrive later at the buffer than data from the second task, due to communication latency). This limits the extent to which one can reason about the functional behavior of an application at design time.

(a) Multiple producers    (b) Single producer

Figure 9. Number of producers and consumers per circular buffer.

Reasoning about functional behavior at design time is enabled for the StrC model by allowing only a single producer per circular buffer, as depicted in Figure 9(b). For RC, it is hard to determine whether a shared address has multiple producers because shared accesses are not explicit.
`
4.2.3 Posting of writes

A posted write is a write that allows operations that follow the write to be executed without confirmation of the write. The SC model does not support posting of writes because reordering of memory transactions is not allowed. A write has to be completed before the following operation is allowed to execute.

For RC, all writes that precede a release call have to be finished before the release call is allowed to finish, because writes are not allowed to be reordered with respect to a following release. In-order delivery at shared memory of the writes to shared data and the write to the synchronization variable (i.e. FIFO behavior) is sufficient. However, RC says nothing about where the different addresses that are being written to are physically located. In a DSM system, it is likely that these addresses are distributed over different physical memories. If this is the case, in-order delivery of the writes to shared data and the synchronization variable is hard to guarantee. In this case RC requires writes to be confirmed and does not support posted writes.

For StrC, a synchronization section is associated with a circular buffer. If the circular buffer is located in the same memory as its synchronization variable, in-order delivery is feasible. StrC then supports posted writes.

Furthermore, a write followed by a read from the same address requires a system to assure that the write has performed before the read, in order to continue with the read. RC allows a combination of reads and writes to the same
`
`
`
shared address in- and outside synchronization sections. Therefore, a mechanism is required to assure that correct data is returned on a read following a write.

StrC allows a combination of reads and writes from and to the same circular buffer in a synchronization section. Special hardware support (e.g. write buffers) is not necessary if we do not allow a combination of reads and writes from/to the same circular buffer inside a synchronization section.
`
5 Software solution

In this section we present a software cache coherency solution. This approach is enabled by StrC, presented in the previous section. As opposed to the hardware bus snooping and directory-based solutions described in Section 3.4, this solution is well suited for NoC-based MPSoCs.

We also discuss a solution for implementing the circular buffer administration in software.
`
5.1 Cache coherency

In StrC, the interprocessor communication is more explicit than in RC. It is known what address range is communicated, and the moments before and after communication of shared data are clearly marked by acquire and release.

This allows us to add invalidate and flush instructions when buffer data is read or written. On an acquire, before data is read, all cache lines that contain old data for the involved shared addresses are marked as invalid. This ensures that the latest data is observed. The software cache coherency approach flushes cache lines that contain data of writes to shared memory before the release call, to ensure that the shared memory has the most recent copy.

The approach requires caches to have a means to invalidate and flush cache lines by address. Such functionality is present in many caches; the ARM11 and TriMedia are examples of cached processors that support this.
`
5.2 Circular buffer administration

We use a software circular buffer administration to assure that no data is read if no data is available and that no data is written if no space is available. For this purpose, we use the C-HEAP protocol [9, 16]. In this protocol, a buffer administration is maintained that tracks the number of reads and writes from and to the buffer (a read counter and a write counter). No hardware support is needed and no atomic read-modify-write operation is required. The read and write pointers are increased after data is read or written. A pointer wraps around if it would fall outside the memory range of the buffer. The amount of data or space in a buffer can be deduced from the values of the read and write pointers and the size of the buffer.
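The pointer arithmetic can be sketched as follows. This is our own minimal illustration, not the C-HEAP specification: the convention of keeping one slot free, so that a full buffer can be told apart from an empty one using the two pointers alone, is an assumption; the protocol itself is defined in [9, 16].

```c
#include <stdint.h>

/* Illustrative buffer size. One slot is deliberately kept free so that
   "full" (data == BUF_SIZE - 1) and "empty" (rp == wp) are
   distinguishable from the two pointers alone -- an assumed convention,
   not taken from the C-HEAP papers. */
#define BUF_SIZE 8u

/* Amount of readable data, deduced from the read pointer, the write
   pointer and the buffer size, as described in the text. */
static uint32_t cb_data(uint32_t rp, uint32_t wp) {
    return (wp - rp + BUF_SIZE) % BUF_SIZE;
}

/* Amount of writable space: buffer size minus data minus the free slot. */
static uint32_t cb_space(uint32_t rp, uint32_t wp) {
    return BUF_SIZE - 1u - cb_data(rp, wp);
}
```

Because each pointer is written by exactly one side (the producer advances wp, the consumer advances rp), these checks need no atomic read-modify-write, matching the claim above.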
Figure 10 shows the steps that are performed on a buffer write and read for software cache coherency (see the previous section) and circular buffer administration. A write starts by polling the buffer administration to check whether there is space available in the buffer (1). If there is, the data is written (2). The data is flushed from the cache after it is written (3). Then, the buffer administration is updated (4). A read from a buffer starts by checking whether there is data in the buffer (5). Next, the cache lines that are associated with the data are invalidated (6). Then the data is read from shared memory (7) and the buffer administration is updated (8).
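The eight numbered steps can be sketched in C as follows. This is a host-runnable single-producer, single-consumer illustration of our own: the cache hooks are no-op stubs standing in for the flush/invalidate-by-address operations of Section 5.1, and the one-slot-free buffer convention is an assumption rather than part of C-HEAP.

```c
#define BUF_SIZE 8u

static int shared_mem[BUF_SIZE];   /* circular buffer in background memory */
static unsigned rp, wp;            /* software buffer administration       */

/* Stub cache hooks: no-ops here so the sketch runs on a host; a real
   port would flush/invalidate the line by address (Section 5.1). */
static void cache_flush(void *addr)      { (void)addr; }
static void cache_invalidate(void *addr) { (void)addr; }

static unsigned cb_data(void)  { return (wp - rp + BUF_SIZE) % BUF_SIZE; }
static unsigned cb_space(void) { return BUF_SIZE - 1u - cb_data(); } /* one slot free */

int cb_write(int v) {
    if (cb_space() == 0) return 0;      /* (1) poll admin: space available? */
    shared_mem[wp] = v;                 /* (2) write the data               */
    cache_flush(&shared_mem[wp]);       /* (3) flush the written line       */
    wp = (wp + 1) % BUF_SIZE;           /* (4) update the administration    */
    return 1;
}

int cb_read(int *v) {
    if (cb_data() == 0) return 0;       /* (5) poll admin: data available?  */
    cache_invalidate(&shared_mem[rp]);  /* (6) invalidate the stale line    */
    *v = shared_mem[rp];                /* (7) read from shared memory      */
    rp = (rp + 1) % BUF_SIZE;           /* (8) update the administration    */
    return 1;
}
```

Note that the administration update is the last step on each side, so the other side never observes a pointer that points at unflushed or stale data.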
`
Figure 10. Software cache coherency and memory consistency.

Performing cache coherency instructions adds cycles. However, software approaches remove hardware cost by transferring the cost of detecting coherency problems (protocol overhead) from hardware to software.
`
6 Experimental results

In this section we show the potential performance increase of using the StrC model by means of an experiment. We focus on the posted write optimization from Section 4.2.3. The experiment shows the negative effect of NoC latency for systems that do not allow posting of writes. SC does not support posting of writes and RC does not support it by default.

Our test application consists of a producer and a consumer task. Each task runs on its own ARM7 processor with local memory. The processors run at 100 MHz. The produced token (data) is put in a buffer that is located in the local memory of the consumer processor. We measure the execution time with and without posted writes. In the system without posted writes, each write has to be acknowledged.
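A first-order cycle model makes the expected effect concrete. This is our own illustration, not the paper's measurement: it assumes n one-word tokens, an interconnect latency of L cycles each way, and one write issued per cycle.

```c
/* Acknowledged writes: each of the n writes stalls for a full round
   trip of 2*L cycles before the next write may issue. */
static long acked_cycles(long n, long L)  { return n * 2 * L; }

/* Posted writes: writes are pipelined, so the producer pays one issue
   cycle per write plus a single one-way latency for the last write. */
static long posted_cycles(long n, long L) { return n + L; }
```

In this model the acknowledged system degrades linearly with the interconnect latency, while the posted-write system pays the latency only once per burst, which is the behavior the experiment below sets out to demonstrate.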
Figure 11 shows a bar plot of the execution times of the producer with and without posted writes for different NoC latencies and for two diff