`Applications
`
`Simon Segars, Manager CPU Development, ARM Ltd.
`
`Abstract
`
Portable applications such as mobile phones, pagers,
and PDAs are continually growing in sophistication.
This places an increasing burden on the embedded
microprocessor to provide high performance while
retaining low power consumption and small die size.
`
The ARM7TDMI microprocessor has been highly
successful in these application areas. However, as
products grow in complexity more processing power is
required while the expectation on battery life also
increases. This has led to the introduction of the ARM9
family, a range of high performance, low power embedded
microprocessors targeted at next generation embedded
applications.

This paper focuses on the implementation of 2
members of the ARM9 family, the ARM9TDMI integer
core and the ARM940T cached processor. These offer
performance in excess of 150 MIPS while retaining low
power consumption. The evolution from the ARM7 to the
ARM9 microarchitecture is described and the trade-offs
between low power consumption and high performance
are discussed.
`
`Introduction
`
ARM designs high performance, low power
microprocessors targeted at embedded applications. To
date most of the ARM design wins have been with the
ARM7TDMI [1,2] processor. This product incorporates
the Thumb instruction set [3], providing industry leading
code density, and typically achieves around 60MHz and
only 1.5 mW/MHz power consumption on a 0.35um
process. Coupled with a small die size and integral debug
features, this product is ideal for many
medium-performance embedded applications.

ARM7TDMI has been successful in many
portable applications. Examples include GSM mobile
phones such as the Panasonic G650 [4]. ARM7 based
cores have also been integrated with cache memories and
peripherals in ASSPs such as the ARM7100 [5] as used in
the PSION 5 PDA [6]. The ARM7 family owes its
success to the combination of low power, low cost and
high performance.

However, as applications become more complex
and integrate more and more functionality, the processor
is required to provide more and more performance. A
classic example of such an application is the so-called
“Smart Phone”. This is a cellular phone and PDA rolled
into one. Initial smart phones have used multiple
processors in order to meet the performance needs - one to
run the PDA, another to run the cellular protocol stack
and a DSP to process the data traffic.

Applications such as this epitomize Moore’s law
and have led ARM to develop the ARM9 family of
microprocessors [7]. These devices build on the
architecture of the ARM7 family and provide higher
levels of performance. ARM9 processors are specifically
targeted to meet the needs of the next generation of highly
integrated portable applications while at the same time
keeping power consumption and die size to a minimum.
While the ARM9 family rises to meet this
challenge, the ARM7 family will live on servicing the
needs of low-end applications.
`
The ARM9TDMI embedded core

The first member of the ARM9 family is the
ARM9TDMI integer core. The goal of this product was
to produce a high performance Thumb compatible
processor, providing a performance upgrade path from
the ARM7TDMI. The processor was specified to be used
either stand alone, or within a cached processor such as
the ARM940T.

Higher performance has been achieved by
increasing the depth of the pipeline from 3 stages, as in the
ARM7TDMI, to 5 stages. This allows the device to be
clocked at a higher rate than the ARM7TDMI.
Forwarding paths have also been introduced to the
pipeline in order to reduce the number of interlock cases
and hence reduce the average number of clocks per
instruction, CPI.
`
`Authorizedlicensed uselimited to: Fish & Richardson PC. Downloaded on May 10,2022 at 19:42:09 UTC from IEEE Xplore. Restrictions apply.
`
`SAMSUNG 1038
`
Load and store operations account for around
25% of all instructions in the ARM instruction flow (a
more detailed breakdown of ARM instruction distribution
is shown in Table 1). In ARM7TDMI a basic load takes 3
cycles and a store takes 2 cycles, and these contribute
significantly to the average overall CPI. It is therefore
important to optimise for these instructions in high
performance processors. This has been achieved in the
ARM9TDMI processor core by adopting a Harvard
memory architecture (ARM7TDMI used a von Neumann
architecture for simplicity), thereby allowing instruction
fetches to occur in parallel with data accesses.
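
The weight of these instructions can be bounded from the figures quoted above alone. The following is a rough sketch, not a measurement: it uses only the 25% share and the 3-cycle load and 2-cycle store costs stated in the text, and because the split between loads and stores is not given here it reports a range. The function name is hypothetical.

```python
def cpi_contribution(share, cycles):
    """Cycles per average instruction contributed by one instruction class."""
    return share * cycles

LOAD_STORE_SHARE = 0.25   # "around 25% of all instructions" (from the text)
LOAD_CYCLES = 3           # basic ARM7TDMI load (from the text)
STORE_CYCLES = 2          # ARM7TDMI store (from the text)

# The load/store split is an unknown here, so bound it:
# all stores gives the minimum, all loads gives the maximum.
low = cpi_contribution(LOAD_STORE_SHARE, STORE_CYCLES)
high = cpi_contribution(LOAD_STORE_SHARE, LOAD_CYCLES)
print(f"load/store contribution to ARM7TDMI CPI: {low:.2f} to {high:.2f}")
```

Under these assumptions loads and stores alone account for between half and three quarters of a cycle per average instruction on ARM7TDMI, which is why they are worth optimising.
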
The ARM9TDMI pipeline consists of the stages
fetch, decode, execute, memory access and write-back.
The main operations performed in each stage are
described in Figure 1, which compares the pipelines of the
ARM7TDMI and ARM9TDMI.
`
The extra stages allow the work performed in the
somewhat congested execute stage of the ARM7TDMI to
be spread more evenly, thereby permitting a higher
maximum operating frequency. ARM7TDMI performs a
Thumb to ARM instruction format conversion during the
first phase of the decode cycle. In the 5 stage pipeline of
ARM9TDMI, ARM and Thumb instructions are decoded
in parallel. ARM9TDMI performs register address
decode in the first half of the decode cycle and register
reads in the second half. This means that there is no spare
phase in which to perform a Thumb to ARM instruction
conversion as there was in the ARM7TDMI. This leads to
two parallel decode units, one which is only active when
the processor is in ARM state and the other active only
when the processor is in Thumb state, in order to save
power.
`
The datapath of the ARM9TDMI is shown in
Figure 2. The register bank has 3 read ports and 2 write
ports. The A and B read ports feed the execution units in
the datapath. The C port is used exclusively for reading
store data. Store data is read during the execute stage of
the pipeline. This reduces the number of forwarding paths
and also removes the need for holding latches which
would be required if the data was read during decode
along with the other registers.

The shifter and ALU perform the same functions
as those found in ARM7TDMI. The main difference in
the ALU is that the arithmetic and logic units are
separated so that during an instruction only the required
functional unit is activated. It has been found that the
ALU contributes significantly to the power consumption
of the ARM7TDMI. In that device the simple nature of
the ALU means that both an arithmetic and a logic result
are calculated each cycle and the required result is then
selected. This is inefficient from a power consumption
perspective and so the two units have been partitioned in
the ARM9TDMI design.

The forwarding paths in the ARM9TDMI allow
back to back data processing instructions to execute in the
pipeline without stall cycles. Load data, which becomes
available at the end of the memory cycle, is also
forwarded into the pipeline. If the data from the load is
required in the very next cycle then there is a one cycle
interlock, since the data is not returned until the end of the
memory cycle, and the load instruction occupies 2 execute
cycles in the datapath. However, if the data is not
required until the next but one instruction then the data is
forwarded, there is no interlock and the instruction has
only occupied one execute cycle in the datapath. These
two cases are depicted in Figure 3 (a) and (b).

There is a third case where a load specifies a
rotation or sign extension of the data as it is fetched, and
here the forwarding paths cannot be used. The loaded data
must be passed through the Byte Rotator block and written
back to the register bank before being used by subsequent
instructions. Therefore, instructions such as these can
cause up to two cycles of interlock depending on when the
data is required.

The ARM9TDMI microarchitecture described
above results in an average CPI of 1.5. A breakdown of
the number of cycles for each instruction class is shown in
Table 1 along with that of ARM7TDMI for comparison.
The new microarchitecture results in a 21% increase in
instruction throughput relative to ARM7TDMI and 1.1
MIPS/MHz, compared to 0.9 MIPS/MHz.

ARM7TDMI Pipeline Operation
  Fetch:     Instruction Fetch
  Decode:    Convert Thumb to ARM, Main Decode, Register Address Decode
  Execute:   Register Read, Shifter, ALU, Writeback

ARM9TDMI Pipeline Operation
  Fetch:     Instruction Fetch
  Decode:    ARM/Thumb Decode, Register Address Decode, Register Read
  Execute:   Shifter, ALU
  Memory:    Memory Data access
  Writeback: ALU Result and/or Load data Register Writeback

Figure 1 : ARM7TDMI and ARM9TDMI Pipelines

[Block diagram. Legible labels include REGBANK, SHIFTER, PSR, Byte Rot / Sign Ex. and Vectors.]

Figure 2 : The ARM9TDMI Datapath

(a) Single Cycle Interlock
    LDR R1, [R0]
    ADD R2, R1, R1

(b) No Interlock
    LDR R1, [R0]
    ADD R2, R3, R4
    ADD R5, R6, R1

Figure 3 : ARM9TDMI Load Behaviour
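
The load interlock cases described above can be reproduced with a very small timing model. This is a sketch under stated assumptions, not a description of the actual ARM9TDMI control logic: each instruction is reduced to its destination and source registers, and `RESULT_DELAY`, `Instr` and `count_interlocks` are hypothetical names. The delays for data processing instructions and plain loads follow the text; the delay for a load with rotation or sign extension is an assumption chosen so that the worst case matches the quoted "up to two cycles".

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Cycles after an instruction's execute cycle before its result can feed a
# later instruction's execute cycle. "load_ext" (load with rotation or sign
# extension) is an assumed value, see the lead-in above.
RESULT_DELAY = {"alu": 1, "load": 2, "load_ext": 3}

@dataclass
class Instr:
    text: str
    dest: Optional[str]
    srcs: Tuple[str, ...]
    kind: str = "alu"

def count_interlocks(program):
    """Return the total interlock (stall) cycles for an in-order sequence."""
    ready_at = {}    # register -> first execute cycle that can use its value
    next_slot = 0    # execute cycle the next instruction would get unstalled
    stalls = 0
    for ins in program:
        needed = max((ready_at.get(r, 0) for r in ins.srcs), default=0)
        start = max(next_slot, needed)
        stalls += start - next_slot
        if ins.dest is not None:
            ready_at[ins.dest] = start + RESULT_DELAY[ins.kind]
        next_slot = start + 1
    return stalls

# The two sequences of Figure 3.
case_a = [Instr("LDR R1, [R0]", "R1", ("R0",), "load"),
          Instr("ADD R2, R1, R1", "R2", ("R1", "R1"))]
case_b = [Instr("LDR R1, [R0]", "R1", ("R0",), "load"),
          Instr("ADD R2, R3, R4", "R2", ("R3", "R4")),
          Instr("ADD R5, R6, R1", "R5", ("R6", "R1"))]
print(count_interlocks(case_a), count_interlocks(case_b))
```

With these delays the model gives one interlock cycle for case (a) and none for case (b), and back to back data processing instructions never stall.
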
The re-pipelining allows the clock rate to be
increased significantly compared to ARM7TDMI. On the
same process, ARM9TDMI may be clocked at twice the
rate of ARM7TDMI. The increase in complexity required
to achieve the increase in performance requires around
50% more transistors, and the area has increased by almost
90%. This area increase is accounted for by an increase in
the number of routing channels in the datapath (to permit
the additional forwarding paths) and a relative increase in
the standard cell control logic to custom datapath ratio. A
comparative summary between the two processor cores is
shown in Table 2.
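
The combined effect of the higher per-clock throughput and the higher clock rate can be estimated with simple arithmetic. The sketch below uses only figures quoted in the text (0.9 and 1.1 MIPS/MHz, and typical clock rates of 60MHz and 120MHz on a 0.35um process); the function name is hypothetical and the result is an estimate, not a benchmark.

```python
def peak_mips(mips_per_mhz, clock_mhz):
    """Estimated MIPS rating at a given clock rate."""
    return mips_per_mhz * clock_mhz

arm7tdmi = peak_mips(0.9, 60)     # figures quoted in the text
arm9tdmi = peak_mips(1.1, 120)    # figures quoted in the text
speedup = arm9tdmi / arm7tdmi

print(f"ARM7TDMI: {arm7tdmi:.0f} MIPS, ARM9TDMI: {arm9tdmi:.0f} MIPS")
print(f"estimated speedup: {speedup:.2f}x")
```

On these numbers the doubling of clock rate contributes most of the gain, with the improved MIPS/MHz adding a little over 20% on top.
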
`
`The ARM940T cached processor
`
The ARM9TDMI processor core has been
integrated with caches, a write buffer and a protection unit
in the ARM940T processor. This system has many
advantages for the system designer. Firstly, it allows the
processor to operate at its maximum frequency since
memory accesses are to the local, high performance cache.
Secondly, since main memory is accessed infrequently,
system power is reduced. Also, the main memory system
may now be used for other tasks, such as DMA, while the
processor is executing from its caches. A block diagram of
the ARM940T design is shown below in Figure 4.

[Table rows: Data processing with PC, Branch / Branch with link, Load register, Store register, Load multiple registers, Store multiple registers. Cycle counts illegible in source.]

Table 1 : ARM9TDMI vs ARM7TDMI CPI Analysis
The Harvard caches in the ARM940T are both
4KB in size in the first implementation. The caches are
constructed in a modular manner using 1KB cache blocks.
Through the use of these blocks the size of the cache can
easily be varied with minimal impact on the rest of the
design. The cache blocks are built using a CAM-RAM
structure comprising 64 lines each with 4 words of data.
Each cache block has 64 way associativity and the CAMs
are designed to compare a maximum of 27 address bits. A
cache read involves 3 basic steps.

• Firstly, the 32 bit address from the processor is
decoded to determine which of the 4 segments the
addressed data might be in. In the 4KB ARM940T
design, bits 5:4 of the address are used for the segment
decode.

• Secondly, the upper 26 bits of the address are then
passed into the CAM where they are compared with
the CAM contents.

• Finally, if there is a CAM match, then a data access in
the cache RAM occurs. Each RAM line is 4 words
long and bits 3:2 of the address are used to select the
desired word. If the cache lookup fails then the
processor is stalled by the cache control logic and an
external access occurs to fetch the required data.

Although the high associativity helps little with cache
hit rates, the design has a number of advantages. These
include low power, short cycle time and simple
modularity, allowing the ARM940T to be extended to have
larger caches by utilising more cache segments. For
example, if the caches were 8KB in size, then an
additional bit must be used for the segment decode and
one less bit passed into the CAM for address look-up. In
this case, bits 6:4 would be used for segment decode and
bits 31:7 of the address would be passed into the CAMs.
The unused column of the CAM would simply be tied off.
In fact, the CAMs are designed to cope with cache sizes
down to 2KB and in the initial design the least significant
CAM column is tied off.

                                      ARM7TDMI   ARM9TDMI
Typical Max Clock rate, MHz (0.35um)  60         120
Power (mW/MHz @ 3.0V)                 [illegible in source]
MIPS/MHz                              0.9        1.1

Table 2 : ARM9TDMI vs ARM7TDMI Comparison Summary
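
The three read steps and the 8KB variation can be illustrated with a small software model. This is only a sketch of the address decomposition described above, not the hardware design: `split_address` and `CamRamCache` are hypothetical names, each segment's CAM is modelled as a dictionary from tag to line, and the replacement choice in `fill` is a placeholder because the replacement policy is not described in the text.

```python
LINES_PER_SEGMENT = 64   # each CAM-RAM block holds 64 lines (from the text)
WORDS_PER_LINE = 4       # each line holds 4 words (from the text)

def split_address(addr, cache_kb=4):
    """Split a 32 bit address into (tag, segment, word) fields."""
    segments = cache_kb                      # one 1KB block per segment
    seg_bits = segments.bit_length() - 1
    if segments < 1 or (1 << seg_bits) != segments:
        raise ValueError("cache size must be a power of two in KB")
    addr &= 0xFFFFFFFF
    word = (addr >> 2) & (WORDS_PER_LINE - 1)    # bits 3:2
    segment = (addr >> 4) & (segments - 1)       # bits 5:4 for 4KB
    tag = addr >> (4 + seg_bits)                 # bits 31:6 for 4KB
    return tag, segment, word

class CamRamCache:
    def __init__(self, cache_kb=4):
        self.cache_kb = cache_kb
        self.segments = [dict() for _ in range(cache_kb)]   # tag -> line

    def fill(self, addr, line_words):
        """Install a 4 word line (models a linefill after a miss)."""
        tag, segment, _ = split_address(addr, self.cache_kb)
        lines = self.segments[segment]
        if tag not in lines and len(lines) >= LINES_PER_SEGMENT:
            lines.pop(next(iter(lines)))     # placeholder replacement choice
        lines[tag] = list(line_words)

    def read(self, addr):
        """Return the cached word, or None on a miss."""
        tag, segment, word = split_address(addr, self.cache_kb)
        line = self.segments[segment].get(tag)   # CAM match on the tag
        return None if line is None else line[word]

cache = CamRamCache()
cache.fill(0x1000, [10, 11, 12, 13])
print(cache.read(0x1008), cache.read(0x2008))
```

Passing `cache_kb=8` moves the segment field to bits 6:4 and the tag to bits 31:7, matching the 8KB example in the text.
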
In order to increase performance, the ARM940T
contains a write buffer. Without this, any store
operation to main memory would have to stall the
processor until the memory bus was free. The write buffer
allows the processor to be isolated from the traffic on the
system memory bus and so stall cycles are minimised.
The write buffer allows storage for up to 8 words of data
and 4 address values.
`
[Block diagram. Legible blocks include the ARM9TDMI Processor Core (Integral EmbeddedICE), Instruction Cache, Data Cache, I-Cache Control, D-Cache Control, Protection Unit / CP15, External Coprocessor Interface, Write Buffer, AMBA Interface, TAP Controller and JTAG Interface, with external buses BA[31:0] and BD[31:0].]

Figure 4 : The ARM940T Processor

As a further power and bus activity saving
feature, the ARM940T data cache may be operated in a
write-back mode. In this mode of operation, whenever a
store occurs which hits in the cache, the cache is updated
but the write is not passed to the external memory system.
At this time the cache and main memory have lost
coherency and the cache line containing the incoherent
data is said to be ‘dirty’. Subsequently, if as a result of a
later cache miss the dirty line is selected to be overwritten
with new data, the dirty data must be written back to main
memory. When this occurs, the processor is stalled while
the dirty data is copied from the cache into the write
buffer. The linefill of the new data is then performed and
written into the cache. At that point processor execution
is resumed. The data in the write buffer is written back to
memory when the system bus is free and before any
further read operations, ensuring memory coherency.

The benefit of a write-back cache is that many
store operations may occur to the cache before they are
copied to the main memory. Therefore the total number
of main memory accesses, and hence system power, is
reduced. This is especially useful for data regions where
program variables are to be stored. There are cases when
main memory and the cache must be kept coherent at all
times, for example for a video frame buffer. Consequently
ARM940T also supports a write-through mode of
operation where all cache updates are written through to
memory as they happen.
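
The difference in main memory traffic between the two modes can be illustrated with a simple count. This is a sketch under several simplifying assumptions that are not taken from the text: every store is a word store that hits in the cache, each dirty line is eventually written back exactly once, and a write-back transfers the whole 4 word line. The function name is hypothetical.

```python
LINE_BYTES = 16   # 4 words of 4 bytes per cache line (from the text)

def memory_word_writes(store_addrs, mode):
    """Count words written to main memory for a sequence of word stores."""
    if mode == "write-through":
        # Every cache update is passed through to memory as it happens.
        return len(store_addrs)
    if mode == "write-back":
        # Stores only mark lines dirty; each dirty line is copied out once.
        dirty_lines = {addr // LINE_BYTES for addr in store_addrs}
        return len(dirty_lines) * (LINE_BYTES // 4)
    raise ValueError(f"unknown mode: {mode}")

# A program variable updated 100 times at the same address.
counter_updates = [0x8000] * 100
print(memory_word_writes(counter_updates, "write-through"))
print(memory_word_writes(counter_updates, "write-back"))
```

For a frequently updated variable the write-back count is far lower. The same model also shows the opposite case: isolated stores that each touch a different line cost more words in write-back mode under the whole-line assumption.
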
`
ARM940T is targeted at a class of embedded
applications referred to as closed applications. Closed
applications are those where all the software the processor
will execute is present in the system when it is shipped by
the OEM. Since the software can be considered reliable and
safe, the memory protection provided by the processor
may be minimised. Consequently, ARM940T does not
support virtual memory and does not contain an MMU or
TLB. Instead, a simple Protection Unit, PU, is provided.
The PU is programmed via accesses to the system
coprocessor, CP15.
The protection unit allows the system designer to
partition memory into 16 regions, 8 on the instruction side
and another 8 unique regions on the data side. Each
region is specified by a base address pointer and a size
field. The size can be any power of 2 from 4KB
to 4GB. The address of the start of the region must be a
multiple of the region’s size. Each region has a number of
properties associated with it specifying how the cache and
write buffer behave in that region, e.g. cacheable,
non-cacheable, write-through or write-back, and also what
type of access, e.g. supervisor only, can occur within the
region.

The regions are labeled 0-7 and may be
programmed such that they overlap. If a memory access
occurs which corresponds to 2 or more regions then the
attributes for the highest numbered region are used (i.e.
region 7 has the highest priority and region 0 the lowest).
`
The advantage of overlapping regions is that the
flexibility with which the regions may be used is
increased, since silicon area restrictions only permit 8 for
each side.
`
By way of an example, consider a system with
16KB of RAM where there is 4KB of supervisor code and
12KB of user code. Without overlapping regions, 3
protection regions would have to be specified, an 8KB and
a 4KB region for the user code, and another 4KB region
for the supervisor code. With the overlapping facility
only 2 regions have to be used, one 16KB region
programmed for user access and one overlapping 4KB
region, with a higher region number, for the supervisor
code.
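
The region rules and the overlapping example can be expressed as a short model. This is an illustrative sketch only and does not reflect the CP15 register encoding: `Region` and `lookup` are hypothetical names, the access labels are plain strings, and placing the supervisor code in the first 4KB of RAM at address 0 is an assumption made for the example.

```python
from dataclasses import dataclass

KB = 1024
MIN_SIZE = 4 * KB          # smallest region size (from the text)
MAX_SIZE = 4 * KB ** 3     # 4GB, largest region size (from the text)

@dataclass
class Region:
    number: int    # 0-7, the higher number wins where regions overlap
    base: int
    size: int
    access: str    # illustrative label, e.g. "user" or "supervisor"

    def __post_init__(self):
        if not 0 <= self.number <= 7:
            raise ValueError("region number must be 0-7")
        if self.size & (self.size - 1) or not MIN_SIZE <= self.size <= MAX_SIZE:
            raise ValueError("size must be a power of 2 from 4KB to 4GB")
        if self.base % self.size:
            raise ValueError("base must be a multiple of the region size")

    def contains(self, addr):
        return self.base <= addr < self.base + self.size

def lookup(regions, addr):
    """Return the highest numbered region covering addr, or None."""
    hits = [r for r in regions if r.contains(addr)]
    return max(hits, key=lambda r: r.number) if hits else None

# The 16KB example: one user region plus an overlapping supervisor region.
regions = [Region(0, 0x0000, 16 * KB, "user"),
           Region(1, 0x0000, 4 * KB, "supervisor")]
print(lookup(regions, 0x0800).access, lookup(regions, 0x2000).access)
```

Attempting `Region(0, 0, 12 * KB, "user")` raises an error, which is the reason 12KB of user code needs two regions when overlap is not used.
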
`
The ARM940T design was taped out at the same
time as ARM9TDMI and first silicon has been evaluated.
With two 4KB caches, the device contains 800K transistors
and measures 13.0mm on a 0.35um process.
Measurements of the silicon show power consumption of
400mW while operating with a core clock rate of 120MHz
and a memory bus clock of 16MHz.
`
`Conclusions
`
The ARM9TDMI and ARM940T have met their
design goals of providing high performance Thumb
compatible processors with small die size and low power
consumption. These products and their derivatives will
serve the needs of next generation applications while
ARM7TDMI continues to serve the needs of the low end.
ARM9TDMI and ARM940T devices have been licensed
to a number of ARM’s semiconductor partners and silicon
has been produced. Several products utilising these
devices are expected to be announced later this year.
`
`References
`
[1] : S Segars, K Clarke, and L Goudge, “Embedded Control
Problems, Thumb and the ARM7TDMI”, IEEE Micro, Oct
1995, p.22-30
[2] : S Segars, “ARM7TDMI Power Consumption”, IEEE
Micro, July/Aug 1997, p.12-19
[3] : D V Jaggar, “Advanced RISC Machines Architecture
Reference Manual”, Prentice Hall, London, 1996, ISBN 0 13
736299 4
[4] : http://www.arm.com/Markets/ARMapps/Panasonic/
[5] : G Budd, and G Milne, “ARM7100 - A High-Integration,
Low-Power Microcontroller for PDA Applications”,
Proceedings of COMPCON ’96, p.182-187
[6] : http://www.psion.com/series5/index.html
[7] : I Devereux, “ARM9 Family”, Proceedings of
Microprocessor Forum, 1997
`