The ARM9 Family - High Performance Microprocessors for Embedded Applications

Simon Segars, Manager CPU Development, ARM Ltd.

Abstract

Portable applications such as mobile phones, pagers, and PDAs are continually growing in sophistication. This places an increasing burden on the embedded microprocessor to provide high performance while retaining low power consumption and small die size.

The ARM7TDMI microprocessor has been highly successful in these application areas. However, as products grow in complexity more processing power is required while the expectation on battery life also increases. This has led to the introduction of the ARM9 family, a range of high performance, low power embedded microprocessors targeted at next generation embedded applications.

This paper focuses on the implementation of 2 members of the ARM9 family, the ARM9TDMI integer core and the ARM940T cached processor. These offer performance in excess of 150 MIPS while retaining low power consumption. The evolution from the ARM7 to the ARM9 microarchitecture is described and the trade offs between low power consumption and high performance discussed.
`
Introduction

ARM designs high performance, low power microprocessors targeted at embedded applications. To date most of the ARM design wins have been with the ARM7TDMI [1,2] processor. This product incorporates the Thumb instruction set [3], providing industry leading code density, and typically achieves around 60MHz and only 1.5 mW/MHz power consumption on a 0.35um process. Coupled with a small die size and integral debug features, this product is ideal for many medium-performance embedded applications.

ARM7TDMI has been successful in many portable applications. Examples include GSM mobile phones such as the Panasonic G650 [4]. ARM7 based cores have also been integrated with cache memories and peripherals in ASSPs such as the ARM7100 [5] as used in the PSION 5 PDA [6]. The ARM7 family owes its success to the combination of low power, low cost and high performance.

However, as applications become more complex and integrate more and more functionality, the processor is required to provide more and more performance. A classic example of such an application is the so-called “Smart Phone”. This is a cellular phone and PDA rolled into one. Initial smart phones have used multiple processors in order to meet the performance needs - one to run the PDA, another to run the cellular protocol stack and a DSP to process the data traffic.

Applications such as this epitomize Moore’s law and have led ARM to develop the ARM9 family of microprocessors [7]. These devices build on the architecture of the ARM7 family and provide higher levels of performance. ARM9 processors are specifically targeted to meet the needs of the next generation of highly integrated portable applications while at the same time keeping power consumption and die size to a minimum.

While the ARM9 family rises to meet this challenge, the ARM7 family will live on servicing the needs of low-end applications.
`
The ARM9TDMI embedded core

The first member of the ARM9 family is the ARM9TDMI integer core. The goal of this product was to produce a high performance Thumb compatible processor, providing a performance upgrade path from the ARM7TDMI. The processor was specified to be used either stand alone, or within a cached processor such as the ARM940T.

Higher performance has been achieved by increasing the depth of the pipeline from 3 stages as in the ARM7TDMI to 5 stages. This allows the device to be clocked at a higher rate than the ARM7TDMI. Forwarding paths have also been introduced to the pipeline in order to reduce the number of interlock cases and hence reduce the average number of clocks per instruction, CPI.
`
Authorized licensed use limited to: Fish & Richardson PC. Downloaded on May 10, 2022 at 19:42:09 UTC from IEEE Xplore. Restrictions apply.
`
Load and store operations account for around 25% of all instructions in the ARM instruction flow (a more detailed breakdown of ARM instruction distribution is shown in Table 1). In ARM7TDMI a basic load takes 3 cycles and a store takes 2 cycles and these contribute significantly to the average overall CPI. It is therefore important to optimise for these instructions in high performance processors. This has been achieved in the ARM9TDMI processor core by adopting a Harvard memory architecture (ARM7TDMI used a von Neumann architecture for simplicity), thereby allowing instruction fetches to occur in parallel with data accesses.

The ARM9TDMI pipeline consists of the stages fetch, decode, execute, memory access and write-back. The main operations performed in each stage are described in Figure 1, which compares the pipelines of the ARM7TDMI and ARM9TDMI.

The extra stages allow the work performed in the somewhat congested execute stage of the ARM7TDMI to be spread more evenly, thereby permitting a higher maximum operating frequency. ARM7TDMI performs a Thumb to ARM instruction format conversion during the first phase of the decode cycle. In the 5 stage pipeline of ARM9TDMI, ARM and Thumb instructions are decoded in parallel. ARM9TDMI performs register address decode in the first half of the decode cycle and register reads in the second half. This means that there is no spare phase in which to perform a Thumb to ARM instruction conversion as there was in the ARM7TDMI. This leads to two parallel decode units, one of which is active only when the processor is in ARM state and the other only when the processor is in Thumb state, in order to save power.
`
The datapath of the ARM9TDMI is shown in Figure 2. The register bank has 3 read ports and two write ports. The A and B read ports feed the execution units in the datapath. The C port is used exclusively for reading store data. Store data is read during the execute stage of the pipeline. This reduces the number of forwarding paths and also removes the need for holding latches which would be required if the data was read during decode along with the other registers.

The shifter and ALU perform the same functions as those found in ARM7TDMI. The main difference in the ALU is that the arithmetic and logic units are separated so that during an instruction only the required functional unit is activated. It has been found that the ALU contributes significantly to the power consumption of the ARM7TDMI. In that device the simple nature of the ALU means that both an arithmetic and a logic result are calculated each cycle and the required result is then selected. This is inefficient from a power consumption perspective and so the two units have been partitioned in the ARM9TDMI design.

The forwarding paths in the ARM9TDMI allow back to back data processing instructions to execute in the pipeline without stall cycles. Load data, which becomes available at the end of the memory cycle, is also forwarded into the pipeline. If the data from the load is required in the very next cycle then there is a one cycle interlock, since the data is not returned until the end of the memory cycle, and the load instruction occupies 2 execute cycles in the datapath. However, if the data is not required until the next but one instruction then the data is forwarded, there is no interlock and the instruction has only occupied one execute cycle in the datapath. These two cases are depicted in Figure 3 (a) and (b).

There is a third case where a load specifies a rotation or sign extension of the data as it is fetched and here the forwarding paths cannot be used. The loaded data must be passed through the Byte Rotator block and written back to the register bank before being used by subsequent instructions. Therefore, instructions such as these can cause up to two cycles of interlock depending on when the data is required.

The ARM9TDMI microarchitecture described above results in an average CPI of 1.5. A breakdown of the number of cycles for each instruction class is shown in Table 1 along with that of ARM7TDMI for comparison. The new microarchitecture results in a 21% increase in
instruction throughput relative to ARM7TDMI and 1.1 MIPS/MHz, compared to 0.9 MIPS/MHz.

[Figure 1 : ARM7TDMI and ARM9TDMI Pipelines. ARM7TDMI pipeline operation (stages Fetch, Decode, Execute), with the labels Instruction Fetch, Convert Thumb to ARM, Main Decode, Register Address Decode, Register Read, Shifter, ALU and Writeback. ARM9TDMI pipeline operation (stages Fetch, Decode, Execute, Memory, Writeback), with the labels Instruction Fetch, ARM Decode, Register Read, Shifter, ALU, Memory Data Access, and ALU Result and/or Load Data Register Writeback.]

[Figure 2 : The ARM9TDMI Datapath. Legible block labels include the register bank, shifter, byte rotate / sign extend, PSR and vectors.]

[Figure 3 : ARM9TDMI Load Behaviour. (a) Single Cycle Interlock: LDR R1, [R0] followed by ADD R2, R1, R1. (b) No Interlock: LDR R1, [R0] followed by ADD R2, R3, R4 and ADD R5, R6, R1. Each instruction is drawn passing through the stages F, D, E, M, W.]
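The load interlock cases described in the text can be approximated with a small scoreboard model. This is a minimal sketch under stated assumptions, not the ARM9TDMI control logic: the instruction tuples, the `EXTRA_LATENCY` table and the function name are hypothetical, and the model tracks only when a register value is first usable in the execute stage.

```python
# Extra cycles a consumer must wait beyond normal forwarding (assumed model).
EXTRA_LATENCY = {
    "alu": 0,       # result forwarded to the very next instruction
    "load": 1,      # data returns at the end of the memory cycle
    "load_ext": 2,  # rotated / sign-extended data goes via the register bank
}


def count_interlocks(program):
    """Return total interlock cycles for a list of (kind, dest, sources)."""
    ready = {}   # register -> first cycle in which a consumer may execute
    cycle = 0    # execute cycle of the next instruction if nothing stalls
    stalls = 0
    for kind, dest, sources in program:
        start = max([cycle] + [ready.get(reg, 0) for reg in sources])
        stalls += start - cycle
        ready[dest] = start + 1 + EXTRA_LATENCY[kind]
        cycle = start + 1
    return stalls


# Figure 3 (a): the ADD needs R1 in the very next cycle.
case_a = [("load", "R1", ["R0"]), ("alu", "R2", ["R1", "R1"])]
# Figure 3 (b): R1 is not needed until the next but one instruction.
case_b = [
    ("load", "R1", ["R0"]),
    ("alu", "R2", ["R3", "R4"]),
    ("alu", "R5", ["R6", "R1"]),
]

print(count_interlocks(case_a))  # 1
print(count_interlocks(case_b))  # 0
```

Under the same assumptions, a `load_ext` followed immediately by its consumer gives two interlock cycles, matching the "up to two cycles" case for loads that rotate or sign extend.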
The re-pipelining allows the clock rate to be increased significantly compared to ARM7TDMI. On the same process, ARM9TDMI may be clocked at twice the rate of ARM7TDMI. Achieving this increase in performance requires around 50% more transistors and the area has increased by almost 90%. This area increase is accounted for by an increase in the number of routing channels in the datapath (to permit the additional forwarding paths) and a relative increase in the standard cell control logic to custom datapath ratio. A comparative summary of the two processor cores is shown in Table 2.
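As a rough worked example, the quoted clock rates and MIPS/MHz figures can be combined into a throughput estimate. The sketch below assumes that throughput is simply clock rate multiplied by MIPS/MHz and that the memory system never limits the core; the dictionary layout and function name are illustrative only.

```python
# Figures quoted in the text for a 0.35um process.
ARM7TDMI = {"clock_mhz": 60, "mips_per_mhz": 0.9}
ARM9TDMI = {"clock_mhz": 120, "mips_per_mhz": 1.1}


def peak_mips(core):
    """Estimate throughput as clock rate times MIPS/MHz (assumed model)."""
    return core["clock_mhz"] * core["mips_per_mhz"]


arm7 = peak_mips(ARM7TDMI)
arm9 = peak_mips(ARM9TDMI)
print(f"ARM7TDMI: {arm7:.0f} MIPS")
print(f"ARM9TDMI: {arm9:.0f} MIPS")
print(f"speed-up: {arm9 / arm7:.2f}x")
```

On these numbers the estimate is about 54 MIPS against about 132 MIPS, so most of the gain comes from the doubled clock rate and the remainder from the lower CPI.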
`
The ARM940T cached processor

The ARM9TDMI processor core has been integrated with caches, a write buffer and a protection unit in the ARM940T processor. This system has many advantages for the system designer. Firstly, it allows the processor to operate at its maximum frequency since memory accesses are to the local, high performance cache. Secondly, since main memory is accessed infrequently, system power is reduced. Also, the main memory system may now be used for other tasks, such as DMA, while the
processor is executing from its caches. A block diagram of the ARM940T design is shown below in Figure 4.

[Table 1 : ARM9TDMI vs ARM7TDMI CPI Analysis. Instruction classes still legible: data processing with PC, branch / branch with link, load register, store register, load multiple registers, store multiple registers.]

[Table 2 : ARM9TDMI vs ARM7TDMI Comparison Summary. Typical max clock rate (0.35um): ARM7TDMI 60, ARM9TDMI 120. The table also has Power (mW/MHz @ 3.0V) and MIPS/MHz rows.]

The Harvard caches in the ARM940T are both 4KB in size in the first implementation. The caches are constructed in a modular manner using 1KB cache blocks. Through the use of these blocks the size of the cache can easily be varied with minimal impact on the rest of the design. The cache blocks are built using a CAM-RAM structure comprising 64 lines each with 4 words of data. Each cache block has 64 way associativity and the CAMs are designed to compare a maximum of 27 address bits. A cache read involves 3 basic steps.

• Firstly, the 32 bit address from the processor is decoded to determine which of the 4 segments the addressed data might be in. In the 4KB ARM940T design, bits 5:4 of the address are used for the segment decode.

• Secondly, the upper 26 bits of the address are then passed into the CAM where they are compared with the CAM contents.

• Finally, if there is a CAM match, then a data access in the cache RAM occurs. Each RAM line is 4 words long and bits 3:2 of the address are used to select the desired word. If the cache lookup fails then the processor is stalled by the cache control logic and an external access occurs to fetch the required data.

Although the high associativity helps little with cache hit rates, the design has a number of advantages. These include low power, short cycle time and simple modularity, allowing the ARM940T to be extended to have larger caches by utilising more cache segments. For example, if the caches were 8KB in size, then an additional bit must be used for the segment decode and one less bit passed into the CAM for address look-up. In this case, bits 6:4 would be used for segment decode and bits 31:7 of the address would be passed into the CAMs. The unused column of the CAM would simply be tied off. In fact, the CAMs are designed to cope with cache sizes down to 2KB and in the initial design the least significant CAM column is tied off.
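The address split used in the cache lookup can be sketched as a few shifts and masks. This is a minimal illustration assuming the bit positions given in the text (word select in bits 3:2, segment decode starting at bit 4, the remaining upper bits passed to the CAM); the function and constant names are hypothetical.

```python
SEGMENT_BYTES = 1024  # each CAM-RAM block: 64 lines x 4 words x 4 bytes


def split_address(addr, cache_bytes=4096):
    """Split a 32-bit address into (segment, cam_tag, word) fields."""
    segments = cache_bytes // SEGMENT_BYTES
    if cache_bytes % SEGMENT_BYTES or segments < 2 or segments & (segments - 1):
        raise ValueError("cache size must be a power of 2, 2KB or larger")
    seg_bits = segments.bit_length() - 1        # 2 bits for 4KB, 3 for 8KB
    word = (addr >> 2) & 0x3                    # bits 3:2 select the word
    segment = (addr >> 4) & (segments - 1)      # bits 5:4 (4KB) or 6:4 (8KB)
    cam_tag = (addr & 0xFFFFFFFF) >> (4 + seg_bits)  # upper bits go to the CAM
    return segment, cam_tag, word


for size in (4096, 8192):
    segment, cam_tag, word = split_address(0x12345678, size)
    print(size, segment, hex(cam_tag), word)
```

For a 4KB cache the CAM tag is 26 bits wide and for a 2KB cache it is 27 bits, which is the maximum compare width the text gives for the CAMs.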
In order to increase performance, the ARM940T contains a write buffer. Without this, any store operation to main memory would have to stall the processor until the memory bus was free. The write buffer allows the processor to be isolated from the traffic on the system memory bus and so stall cycles are minimised. The write buffer allows storage for up to 8 words of data and 4 address values.
`
[Figure 4 : The ARM940T Processor. Block diagram showing the ARM9TDMI processor core (integral EmbeddedICE), the instruction cache and data cache with their control logic, the protection unit / CP15, the write buffer, the external coprocessor interface, the AMBA interface, and the TAP controller with its JTAG interface.]

As a further power and bus activity saving feature, the ARM940T data cache may be operated in a write-back mode. In this mode of operation, whenever a store occurs which hits in the cache, the cache is updated but the write is not passed to the external memory system. At this time the cache and main memory have lost coherency and the cache line containing the incoherent data is said to be ‘dirty’. Subsequently, if as a result of a later cache miss the dirty line is selected to be overwritten with new data, the dirty data must be written back to main memory. When this occurs, the processor is stalled while the dirty data is copied from the cache into the write buffer. The linefill of the new data is then performed and written into the cache. At that point processor execution is resumed. The data in the write buffer is written back to memory when the system bus is free and before any further read operations, ensuring memory coherency.

The benefit of a write-back cache is that many store operations may occur to the cache before they are copied to the main memory. Therefore the total number of main memory accesses, and hence system power, is reduced. This is especially useful for data regions where program variables are to be stored. There are cases when main memory and the cache must be kept coherent at all times, for example for a video frame buffer. Consequently ARM940T also supports a write-through mode of operation where all cache updates are written through to memory as they happen.
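The difference in main memory traffic between the two modes can be illustrated with a simple count. This sketch assumes that every store hits in the cache and that each dirty line is eventually evicted once and written back as a whole 4-word line; both are simplifications, and the function name is hypothetical.

```python
LINE_WORDS = 4  # each cache line holds 4 words
WORD_BYTES = 4


def memory_words_written(store_addresses, write_back):
    """Count words sent to main memory for a run of stores that all hit."""
    if not write_back:
        # Write-through: every store is passed on to memory as it happens.
        return len(store_addresses)
    # Write-back: a line is written out once, however often it was stored to.
    line_bytes = LINE_WORDS * WORD_BYTES
    dirty_lines = {addr // line_bytes for addr in store_addresses}
    return len(dirty_lines) * LINE_WORDS


# A loop that repeatedly updates two program variables in the same line.
stores = [0x1000] * 100 + [0x1004] * 50
print(memory_words_written(stores, write_back=False))  # 150
print(memory_words_written(stores, write_back=True))   # 4
```

The model also shows why write-through remains useful: with stores spread one per line, as when filling a frame buffer, write-back saves no traffic and only delays the point at which memory becomes coherent.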
`
ARM940T is targeted at a class of embedded applications referred to as closed applications. Closed applications are those where all the software the processor will execute is present in the system when it is shipped by the OEM. Since the software can be considered reliable and safe, the memory protection provided by the processor may be minimised. Consequently, ARM940T does not support virtual memory and does not contain an MMU or TLB. Instead, a simple Protection Unit, PU, is provided. The PU is programmed via accesses to the system coprocessor, CP15.

The protection unit allows the system designer to partition memory into 16 regions, 8 on the instruction side and another 8 unique regions on the data side. Each region is specified by a base address pointer and a size field. The size can be any power of 2 from 4KB to 4GB. The address of the start of the region must be a multiple of the region’s size. Each region has a number of properties associated with it specifying how the cache and write buffer behave in that region, e.g. cacheable, non-cacheable, write-through or write-back, and also what type of access, e.g. supervisor only, can occur within the region.

The regions are labeled 0-7 and may be programmed such that they overlap. If a memory access occurs which corresponds to 2 or more regions then the attributes for the highest numbered region are used (i.e. region 7 has the highest priority and region 0 the lowest).
The advantage of overlapping regions is that the flexibility with which the regions may be used is increased, since silicon area restrictions only permit 8 for each side.

By way of an example, consider a system with 16KB of RAM where there is 4KB of supervisor code and 12KB of user code. Without overlapping regions, 3 protection regions would have to be specified, an 8KB and a 4KB region for the user code, and another 4KB region for the supervisor code. With the overlapping facility only 2 regions have to be used, one 16KB region programmed for user access and one overlapping 4KB region, with a higher region number, for the supervisor code.
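The overlapping region example can be modelled in a few lines. This is a sketch of the priority rule only, not of the CP15 programming interface: the `Region` class, the `access` strings and the choice to place the supervisor code at address 0 are all assumptions made for illustration.

```python
from dataclasses import dataclass

KB = 1024
GB = 1024 ** 3


@dataclass(frozen=True)
class Region:
    base: int
    size: int
    access: str  # e.g. "user" or "supervisor"; attribute names are illustrative

    def __post_init__(self):
        power_of_two = (self.size & (self.size - 1)) == 0
        if not (4 * KB <= self.size <= 4 * GB and power_of_two):
            raise ValueError("size must be a power of 2 from 4KB to 4GB")
        if self.base % self.size:
            raise ValueError("base must be a multiple of the region size")

    def contains(self, addr):
        return self.base <= addr < self.base + self.size


def lookup(regions, addr):
    """Return the highest-numbered region containing addr, or None."""
    for region in reversed(regions):  # list index stands for region number
        if region.contains(addr):
            return region
    return None


regions = [
    Region(base=0x0000, size=16 * KB, access="user"),       # region 0
    Region(base=0x0000, size=4 * KB, access="supervisor"),  # region 1
]

print(lookup(regions, 0x0800).access)  # supervisor
print(lookup(regions, 0x2000).access)  # user
```

Without overlap, the 12KB of user code could not be covered by one region, because 12KB is not a power of 2; this is why the non-overlapping layout in the text needs separate 8KB and 4KB user regions.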
`
The ARM940T design was taped out at the same time as ARM9TDMI and first silicon has been evaluated. With two 4KB caches, the device contains 800K transistors and measures 13.0mm on a 0.35um process. Measurements of the silicon show power consumption of 400mW while operating with a core clock rate of 120MHz and a memory bus clock of 16MHz.
`
Conclusions

The ARM9TDMI and ARM940T have met their design goals of providing high performance Thumb compatible processors with small die size and low power consumption. These products and their derivatives will serve the needs of next generation applications while ARM7TDMI continues to serve the needs of the low end. ARM9TDMI and ARM940T devices have been licensed to a number of ARM’s semiconductor partners and silicon has been produced. Several products utilising these devices are expected to be announced later this year.
`
References

[1] : S Segars, K Clarke, and L Goudge, “Embedded Control Problems, Thumb and the ARM7TDMI”, IEEE Micro, Oct 1995, p.22-30
[2] : S Segars, “ARM7TDMI Power Consumption”, IEEE Micro, July/Aug 1997, p.12-19
[3] : D V Jaggar, “Advanced RISC Machines Architecture Reference Manual”, Prentice Hall, London, 1996, ISBN 0 13 736299 4
[4] : http://www.arm.com/Markets/ARMapps/Panasonic/
[5] : G Budd, and G Milne, “ARM7100 - A High-Integration, Low-Power Microcontroller for PDA Applications”, Proceedings of COMPCON '96, p.182-187
[6] : http://www.psion.com/series5/index.html
[7] : I Devereux, “ARM9 Family”, Proceedings of Microprocessor Forum, 1997