Garde

US005922076A

(11) Patent Number: 5,922,076
(45) Date of Patent: Jul. 13, 1999

(54) CLOCKING SCHEME FOR DIGITAL SIGNAL PROCESSOR SYSTEM
(75) Inventor: Douglas Garde, Dover, Mass.
(73) Assignee: Analog Devices, Inc., Norwood, Mass.
(21) Appl. No.: 08/931,665
(22) Filed: Sep. 16, 1997
(51) Int. Cl.: G06F 1/04
(52) U.S. Cl.: 713/600
(58) Field of Search: 395/553, 555, 556, 559; 713/500, 501, 600; 709/400, 248

(56) References Cited

U.S. PATENT DOCUMENTS
5,611,075    3/1997    Garde           395/480
5,619,720    4/1997    Garde et al.    395/800
5,685,005    11/1997   Garde et al.    395/800

Primary Examiner: Thomas M. Heckler
Attorney, Agent, or Firm: Wolf, Greenfield & Sacks, P.C.

(57) ABSTRACT

A digital signal processing system includes a cluster of processors and a host. The host can access each of the processors through an external bus system that interconnects the host with each of the processors. An external port of each of the processors operates at one of a local clock frequency and host clock frequency, the local clock frequency and host clock frequency being asynchronous with one another. The host operates at the host clock frequency. Upon a host access of one of the processors, the clock frequency of operation of the external parallel port of each processor automatically is controlled to operate at the host clock frequency. In an embodiment, each processor also includes a core processor that operates at a core clock frequency that is a multiple of the local clock frequency, asynchronous with the host clock frequency. Thus, the speed of operation of the core processor and that of the external parallel port can be optimized independently.

12 Claims, 6 Drawing Sheets
`
`IPR2023-00037
`Apple EX1028 Page 1
`
`
`
U.S. Patent    Jul. 13, 1999    Sheet 1 of 6    5,922,076

FIG. 1 [drawing omitted; legible labels: LCLK, HCLK, ANALOG SWITCH, BUFFERS, HCLK' (HCLKs to each destination), LCLK' (LCLKs to each destination)]

FIG. 2 [drawing omitted; legible labels: LCLK, HCLK, MUX, MCLK]
`
U.S. Patent    Jul. 13, 1999    Sheet 2 of 6    5,922,076

FIG. 3 [drawing omitted; labels not recoverable from the scan]
`
U.S. Patent    Jul. 13, 1999    Sheet 3 of 6    5,922,076

FIG. 4 [drawing omitted; legible labels: LCLK, HCLK, PERIPH. (126), FREQ MULT (128), CCLK (130), CORE (132)]
`
U.S. Patent    Jul. 13, 1999    Sheet 4 of 6    5,922,076

FIG. 5 [drawing omitted; legible labels: DELAY LOCKED LOOP, PHASE CONTROL, COPY OF DISTRIBUTION TREE, COPY OF DELAY, INVERTER CHAIN, UPDATEN, TRIEN, OUTPUT PAD]
`
U.S. Patent    Jul. 13, 1999    Sheet 5 of 6    5,922,076

FIG. 6 [drawing omitted; legible labels: IDFIFO, OAFIFO, DIRECT WRITE, ADR DEST, OBUF, DIRECT READ (SLAVE), EXT ADR, INTERNAL DATA BUSES]
`
U.S. Patent    Jul. 13, 1999    Sheet 6 of 6    5,922,076

FIG. 7 [drawing omitted; labels not recoverable from the scan]
`
CLOCKING SCHEME FOR DIGITAL SIGNAL PROCESSOR SYSTEM

FIELD OF THE INVENTION

The present invention relates to digital signal processors, and more specifically, to a digital signal processor system and method having a unique asynchronous clocking scheme.

BACKGROUND OF THE INVENTION

A digital signal processor (DSP) is a special purpose computer that is designed to optimize performance for digital signal processing applications such as, for example, fast Fourier transforms, digital filtering, image processing and speech recognition. Digital signal processing applications typically are characterized by real time operation, high interrupt rates and intensive numeric computations. In addition, digital signal processing applications tend to be intensive in memory access operations and to require the input and output of large quantities of data. Thus, designs of digital signal processors may be quite different from those of general purpose computers.

A typical digital signal processor includes at least one memory for storing instructions for digital signal processing operations as well as operands used in those operations, and a core processor, connected to the memory, for carrying out such operations. A digital signal processor also typically includes a peripheral input/output (I/O) device enabling communication with, and the transfer of data to/from, other processors and/or external devices. The core processor includes some type of computation unit for performing the digital signal processing operations (i.e., computations) on the operands based on the instructions. Many different computational schemes as well as data storage and transferring schemes have been developed for optimizing speed, accuracy, size and performance of digital signal processors.

A digital signal processor commonly operates based upon receipt of a single input clock. From this single input clock are derived a core processor clock, on which the core processor operates, and an I/O clock, on which the I/O device operates. It is not uncommon for the input clock and the I/O clock to be maintained at the same frequency.

The core processor clock may be a multiple of this input clock such that the core processor operates at a different (typically greater) clock frequency than that of the I/O device. The speed of the I/O device is limited by the speed of the external signals upon which it operates. The speed of such external signals may be limited by physical constraints and by the capacitances and inductances of external devices and buses. The core processor is not so limited. Therefore, it is preferable to have the core processor operate at a different, and more optimal, clock frequency.

Some digital signal processors allow the user to select a ratio (e.g., ×2, ×2.5, ×3, ×3.5, ×4 ...) by which the input clock will be multiplied to produce the core processor clock. This enables the user to select, within a limited range, a core processor frequency that is best for the particular processor.

As the geometries of processors shrink, internal speed paths improve, enabling faster operation. For a particular processor, therefore, there is an optimal speed at which the processor can operate. A limitation in currently available processors is that the core processor frequency is limited by the input clock and the user-selectable core clock ratios available.

In a digital signal processing system, a cluster (e.g., four, six or eight) of processors may be interconnected by an external bus system. A host computer, connected to each of the processors in the system through the bus system, may access any of the processors. The host computer operates at a host clock frequency that may be unrelated (asynchronously related) to the input clock frequency (I/O clock frequency) of each of the processors in the cluster.

When the host wishes to access any of the processors, either the host clock and the processor I/O clock must be synchronized, or asynchronous access must be enabled. Synchronization would require some type of external synchronizing interface between the host and each processor in the cluster. Alternatively, the provision of asynchronous access would require an additional, asynchronous processor I/O interface. To date, each of the approaches aimed at enabling an asynchronously operating host to access a processor requires complex and expensive circuitry. In addition, each of such approaches may be difficult for a user to implement and use.

It is a general object of the present invention to provide an improved processor clocking scheme.

SUMMARY OF THE INVENTION

One embodiment of the invention is directed to a digital signal processor. The digital signal processor receives a local clock and a system clock, wherein the local clock frequency and the system clock frequency may be asynchronous with one another. A core processor operates at a core clock frequency that is a multiple of the local clock frequency. An external parallel port, coupled to the core processor, is operable at the system clock frequency or at the local clock frequency.

In an embodiment of the invention, the digital signal processor further includes a resynchronization circuit, coupled between the external parallel port and the core processor, that receives an input command signal and latches in the command signal when valid.

Another embodiment of the invention is directed to a digital signal processing system. The system includes a plurality of processors, each connected to another by an external bus system through an external port. A host, connected to each of the plurality of processors through the external bus system, operates at a host clock frequency. The host can access each processor through the external bus system. The external port of each of the processors operates either at a local clock frequency or at the host clock frequency, or at a multiple of either the local clock frequency or host clock frequency. Upon a host access, the clock frequency of the external port of each processor automatically is controlled to operate at the host clock frequency.

In one embodiment, the system further includes an external memory unit, connected to the host and to at least one of the processors through the external bus system. The memory also operates either at the local clock frequency or at the host clock frequency. Upon a host access of either one of the processors or of the memory unit, the clock frequency of the memory unit also automatically is controlled to operate at the host clock frequency.

In an embodiment, the clock frequency of operation of the external port of each processor is user-controlled.

In an embodiment, each processor includes a switch that receives a local clock and a host clock and selects one for operation of the external parallel port. In one embodiment, the switch includes a multiplexer.

In an embodiment, the clock frequency of the memory unit is controlled by a master processor to which it is connected.

In an embodiment of the system, each processor of the system includes a core processor that operates at a multiple of the local clock frequency, wherein the local clock frequency may be asynchronous with the host clock frequency. In this embodiment, each processor further includes a resynchronization circuit, coupled between the core processor and the external port, that latches in a received command signal when valid.

A further embodiment of the invention is directed to a method of digital signal processing. The method includes: connecting a host to a plurality of digital signal processors through a bus system; operating an external port of each processor at a local clock frequency, a host clock frequency, or a multiple of either the local clock frequency or host clock frequency; and automatically switching operation of the external port of each processor to the host clock frequency upon an access by the host of one of the processors.

In an embodiment, the method further includes the step of operating a core processor of each digital signal processor at a multiple of the local clock frequency, which may be asynchronous with the system clock frequency.

The features and advantages of the present invention will be more readily understood and apparent from the following detailed description of the invention, which should be read in conjunction with the accompanying drawings and from the claims which are appended to the end of the detailed description.

BRIEF DESCRIPTION OF THE DRAWING

For a better understanding of the present invention, reference is made to the accompanying drawings, which are incorporated herein by reference.

FIG. 1 is a block diagram of a system including a cluster of processors according to one embodiment of the invention.

FIG. 2 is a block diagram of an alternate embodiment of the system shown in FIG. 1.

FIG. 3 is a block diagram of the internal components of an exemplary processor that may be used with the present invention.

FIG. 4 is a part functional, part structural block diagram of certain processor components and the different clock signals on which the components operate.

FIG. 5 is a block diagram of an exemplary delay calibration circuit that may be used with a processor of the invention.

FIG. 6 is a block diagram of an exemplary external port block that may be employed within a processor of the invention.

FIG. 7 is a part functional, part structural block diagram of a resynchronization circuit that may be employed within a processor of the invention.

DETAILED DESCRIPTION

One embodiment of the present invention is directed to a cluster of digital signal processors interconnected by a bus system, and a host that can access any of the processors through the bus system. A periphery of each of the processors, connected to the bus system, operates at one of a local clock frequency and a host clock frequency. The host operates at the host clock frequency and, when the host accesses one of the processors, the clock frequency of operation of the periphery of each of the processors automatically is switched to the host clock frequency.

Another embodiment of the present invention is directed to a digital signal processor having a core processor that may operate asynchronously with the periphery of the processor. In particular, the periphery of the processor, such as an external parallel port, may operate at either a local clock frequency or a host clock frequency, wherein a user may select between the two. A core processor of the digital signal processor operates at a multiple of the local clock frequency. The local clock frequency and the host clock frequency may be independently generated and may be asynchronous with one another.

FIG. 1 is a block diagram showing an exemplary embodiment of the present invention including a cluster of digital signal processors P1-P4. The system shown also includes a host 100 and a memory 102. The host 100, memory 102, and processors P1-P4 are interconnected by a bus system 104. The host may include an external computer that communicates with each of processors P1-P4 and external memory 102. External memory 102 may be any suitable external memory that operates with such a digital signal processing system, such as Synchronous Dynamic Random Access Memory (SDRAM). Data may be written to or read from each of the processors, as well as to/from the memory.

Preferably, the external bus operates as a pipelined bus. In other words, the data may arrive one, two or three cycles after an address is issued, corresponding to a pipeline delay of one, two or three cycles respectively. Addresses may be issued on every cycle. Preferably, all signals are sampled on the rising edge of the clock signal and must meet a set-up time and a hold-time requirement.
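
By way of illustration only, the pipelined bus behavior described above may be sketched in Python as follows. The sketch is not part of the original disclosure: the function name `pipelined_reads` and the dictionary standing in for the addressed device are assumptions made for the example. One address is issued per cycle, and the data for each address arrives a fixed pipeline delay of one, two or three cycles later.

```python
def pipelined_reads(memory, addresses, pipeline_delay):
    """Return (cycle, address, data) tuples for a pipelined read burst.

    One address is issued per cycle starting at cycle 0. The data for the
    address issued on cycle n is returned on cycle n + pipeline_delay.
    """
    if pipeline_delay not in (1, 2, 3):
        raise ValueError("the text describes delays of one, two or three cycles")
    return [
        (issue_cycle + pipeline_delay, addr, memory[addr])
        for issue_cycle, addr in enumerate(addresses)
    ]


if __name__ == "__main__":
    # Hypothetical memory contents, used only to exercise the model.
    mem = {0x10: "a", 0x11: "b", 0x12: "c"}
    for cycle, addr, data in pipelined_reads(mem, [0x10, 0x11, 0x12], 2):
        print(f"cycle {cycle}: data {data!r} for address {addr:#x}")
```

In this model, with a pipeline delay of two, addresses issued on cycles 0, 1 and 2 return data on cycles 2, 3 and 4, so one data item is delivered per cycle once the pipeline has filled.
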
During operation, host 100 may access any one of processors P1-P4 or memory 102 through bus 104. Host 100 operates on a host clock HCLK at a host clock frequency. Each processor P1-P4 receives the host clock HCLK and a local clock LCLK. In one embodiment, as explained in greater detail below, the host clock HCLK and local clock LCLK are independently generated and may be asynchronous with one another.

A periphery of each processor (that portion of the processor, such as an external parallel port, which couples the internal components of the processor to the external bus system 104) may operate at either the local clock LCLK frequency or the host clock HCLK frequency. In one embodiment, as explained below, this operation is user selectable. Similarly, the memory may operate at either the local clock LCLK frequency or the host clock HCLK frequency.

In this embodiment, a buffer 110, having multiple series-terminated outputs, provides the host clock HCLK signal to each destination, which, in this embodiment, includes host 100, each processor P1-P4, and memory 102. Similarly, buffer 112, also having multiple series-terminated outputs, provides the local clock LCLK signal to each destination, which, in this embodiment, includes each processor P1-P4 and memory 102. Each clock signal is provided on a separate trace, output from the buffer. The buffers ensure that the same clock signal timing is provided to each destination.

During operation, a periphery of each processor P1-P4 and memory 102 may be operating at the local clock LCLK frequency. When host 100 is to access one of processors P1-P4 or memory 102, the clock frequency of operation of the periphery of each processor P1-P4 automatically is switched from that of the local clock LCLK to that of the host clock HCLK. At the same time, the clock frequency of operation of the memory also is switched automatically from that of the local clock LCLK to that of the host clock HCLK.
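
The switching behavior just described may be modeled, purely as a sketch, by the following Python fragment. It is not taken from the disclosure: the class `BusDevice` and the functions `make_cluster` and `host_access` are hypothetical names, and the model assumes every device starts on the local clock. The point illustrated is that a host access to any one device moves the periphery of every processor, and the memory, onto HCLK.

```python
from dataclasses import dataclass


@dataclass
class BusDevice:
    """A device on the external bus whose bus-side clock can be switched."""
    name: str
    clock: str = "LCLK"  # assumption: each periphery starts on the local clock


def make_cluster():
    """Build the FIG. 1 arrangement: processors P1-P4 plus memory 102."""
    return [BusDevice(name) for name in ("P1", "P2", "P3", "P4", "memory 102")]


def host_access(devices):
    """Model a host access: every periphery and the memory move to HCLK."""
    for device in devices:
        device.clock = "HCLK"
    return devices


if __name__ == "__main__":
    cluster = make_cluster()
    host_access(cluster)  # the host addresses one device, but all switch
    for device in cluster:
        print(device.name, device.clock)
```
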
In one embodiment, the switching occurs when a Host Bus Request (HBR) or Host Bus Grant (HBG) control signal is asserted by the host. Such control signal may be provided to each processor, causing an internal switch (not shown) in each processor to switch the clock frequency from the local clock LCLK to the host clock HCLK. The switch internal to each processor may include a multiplexer, or the like. Glitch suppression is required for any clock signal switch to the processor. For example, glitch suppression can be attained by waiting for one clock to go low, and holding the clock output until the other clock goes low, and then driving the output with the first clock at that point.
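
A sampled-waveform sketch of such glitch suppression is given below in Python, for illustration only. It is not the circuit of the disclosure. The function name `glitch_free_switch` and the representation of clocks as lists of 0/1 samples are assumptions, and the sketch adopts one reading of the sentence above: after the switch is requested, the output follows the currently selected clock until that clock is low, is then held low until the newly selected clock is also low, and only then follows the newly selected clock.

```python
def glitch_free_switch(old_clk, new_clk, request_at):
    """Switch the output from old_clk to new_clk without truncated pulses.

    old_clk and new_clk are equal-length lists of 0/1 samples; request_at is
    the sample index at which the switch is requested.
    """
    out = []
    state = "old"  # progresses "old" -> "hold" -> "new"
    for i, (o, n) in enumerate(zip(old_clk, new_clk)):
        if state == "old" and i >= request_at and o == 0:
            state = "hold"   # old clock is low: stop following it
        if state == "hold" and n == 0:
            state = "new"    # new clock is low as well: safe to hand over
        if state == "old":
            out.append(o)
        elif state == "hold":
            out.append(0)    # hold the output low while waiting
        else:
            out.append(n)
    return out


if __name__ == "__main__":
    old = [1, 1, 0, 0] * 3        # clock with a period of four samples
    new = [1, 1, 1, 0, 0, 0] * 2  # unrelated clock, period of six samples
    print(glitch_free_switch(old, new, request_at=5))
```

Because the hand-over happens only while both clocks are low, every high pulse on the output is a complete high pulse of one clock or the other. A plain multiplexer switched at sample 5 in the usage example would instead cut the old clock's high pulse short.
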
In one embodiment, an external analog switch 108 selects one of host clock HCLK or local clock LCLK to clock the memory. A master processor P3 provides a control signal along line 106, at the appropriate time, causing analog switch 108 to select the host clock HCLK signal and provide such signal to memory 102. Switch 108 preferably is a low-resistance analog switch, such that the switching delay is maintained to be less than 0.2 nanoseconds. For example, the switch may be made from a low-resistance Field Effect Transistor. For external switch 108, the switching from the local clock LCLK to the host clock HCLK does not have to be glitch-free because no memory access is occurring during the switchover.

In an alternate embodiment of the system shown in FIG. 1, switch 108 of FIG. 1 is replaced by an internal multiplexer 124, shown in FIG. 2. Such a system includes four processors P1-P4, host 100, and memory 102 (see FIG. 1). Like the system of FIG. 1, the host operates at a host clock HCLK frequency and a periphery (I/O port) of each of the processors P1-P4 operates at a periphery clock PCLK frequency which may be equal to either the host clock HCLK frequency or the local clock LCLK frequency. Memory 102 operates at a memory clock MCLK frequency which also may be equal to either the host clock HCLK frequency or the local clock LCLK frequency. As in the embodiment of FIG. 1, upon a host access (of memory or a processor), periphery clock PCLK and memory clock MCLK automatically are switched to host clock HCLK. The switching may be performed internally of each processor by multiplexer 124. Multiplexer 124 is controlled to switch automatically to the host clock HCLK upon a host bus access or grant. The output of multiplexer 124 includes the periphery clock PCLK signal and the memory clock MCLK signal. One of processors P1-P4 may be selected as a master processor to provide the memory clock MCLK signal along bus 116 to memory 102.

Each processor shown in the systems of FIGS. 1 and 2 may be implemented having the components shown in FIG. 3. As shown, the principal components of DSP 10 are computation blocks 12 and 14, a memory 16, a control block 24, link port buffers 26, an external port 28, a DRAM controller 30, an instruction alignment buffer (IAB) 32 and a primary instruction decoder 34. Computation blocks 12 and 14, instruction alignment buffer 32, primary instruction decoder 34 and control block 24 constitute a core processor which performs the main computation and data processing functions of DSP 10. External port 28 controls external communications via an external address bus 58 and an external data bus 68. External port 28 may constitute the periphery of DSP 10. Link port buffers 26 control external communication via communication ports 36. DSP 10 is preferably configured as a single monolithic integrated circuit.

Memory 16 includes three independent, large capacity memory banks 40, 42 and 44. In an embodiment, each of memory banks 40, 42 and 44 has a capacity of 64K words of 32 bits each. Each of the memory banks 40, 42 and 44 may have a 128-bit data bus. Up to four consecutive aligned data words of 32 bits each can be transferred to or from each memory bank in a single clock cycle.

The elements of DSP 10 are interconnected by buses for efficient, high speed operation. Each of the buses includes multiple lines for parallel transfer of binary information. A first address bus 50 (MA0) interconnects memory bank 40 (M0) and control block 24. A second address bus 52 (MA1) interconnects memory bank 42 (M1) and control block 24. A third address bus 54 (MA2) interconnects memory bank 44 (M2) and control block 24. Each of the address buses 50, 52 and 54 may be 16 bits wide. An external address bus 56 (MAE) interconnects external port 28 and control block 24. External address bus 56 is connected through external port 28 to external address bus 58. Each of the external address buses 56 and 58 may be 32 bits wide. A first data bus 60 (MD0) interconnects memory bank 40, computation blocks 12 and 14, control block 24, link port buffers 26, IAB 32 and external port 28. A second data bus 62 (MD1) interconnects memory bank 42, computation blocks 12 and 14, control block 24, link port buffers 26, IAB 32 and external port 28. A third data bus 64 (MD2) interconnects memory bank 44, computation blocks 12 and 14, control block 24, link port buffers 26, IAB 32 and external port 28. The data buses 60, 62 and 64 are connected through external port 28 to external data bus 68. Each of the data buses 60, 62 and 64 may be 128 bits wide, and external data bus 68 may be 64 bits wide.

The first address bus 50 and the first data bus 60 comprise a first bus for transfer of data to and from memory bank 40. The second address bus 52 and the second data bus 62 comprise a second bus for transfer of data to and from memory bank 42. The third address bus 54 and the third data bus 64 comprise a third bus for transfer of data to and from memory bank 44. Since each of memory banks 40, 42 and 44 has a separate bus, memory banks 40, 42 and 44 may be accessed simultaneously. As used herein, "data" refers to binary words, which may represent either instructions or operands that are associated with the operation of DSP 10. In a typical operating mode, program instructions are stored in one of the memory banks, and operands are stored in the other two memory banks. Thus, at least one instruction and two operands can be provided to computation blocks 12 and 14 in a single clock cycle. As described below, each of memory banks 40, 42, and 44 is configured to permit reading and writing of multiple data words in a single clock cycle. The simultaneous transfer of multiple data words from each memory bank in a single clock cycle is accomplished without requiring an instruction cache or a data cache.

The control block 24 includes a program sequencer 70, a first integer ALU 72 (J ALU), a second integer ALU 74 (K ALU), a first DMA address generator 76 (DMAG A) and a second DMA address generator 78 (DMAG B). Integer ALUs 72 and 74, at different times, execute integer ALU instructions and perform data address generation. During execution of a program, program sequencer 70 supplies a sequence of instruction addresses on one of address buses 50, 52, 54 and 56, depending on the memory location of the instruction sequence. Typically, one of memory banks 40, 42 or 44 is used for storage of the instruction sequence. Each of integer ALUs 72 and 74 supplies a data address on one of address buses 50, 52, 54 and 56, depending on the location of the operand required by the instruction. Assume, for example, that an instruction sequence is stored in memory bank 40 and that the required operands are stored in memory banks 42 and 44. In this case, the program sequencer supplies instruction addresses on address bus 50 and the accessed instructions are supplied to the instruction alignment buffer 32, as described below. Integer ALUs 72 and 74 may, for example, output addresses of operands on address buses 52 and 54, respectively. In response to the addresses generated by integer ALUs 72 and 74, memory banks 42 and 44 supply operands on data buses 62 and 64, respectively, to either or both of computation blocks 12 and 14. Memory banks 40, 42 and 44 are interchangeable with respect to storage of instructions and operands.

Program sequencer 70 and the integer ALUs 72 and 74 may access an external memory (not shown) via external port 28. The desired external memory address is placed on address bus 56. The external address is coupled through external port 28 to external address bus 58. The external memory supplies the requested data word or data words on external data bus 68. The external data is supplied via external port 28 and one of the data buses 60, 62 and 64 to one or both of computation blocks 12 and 14. The DRAM controller 30 controls the external memory.

As indicated above, each of the memory banks 40, 42 and 44 may have a capacity of 64K words of 32 bits each. Each memory bank may be connected to a data bus that is 128 bits wide. In an alternative embodiment, each data bus may be 64 bits wide, and 64 bits are transferred on each of clock phase 1 and clock phase 2, thus providing an effective bus width of 128 bits. Multiple data words can be accessed in each memory bank in a single clock cycle. Specifically, data can be accessed as single, dual or quad words of 32 bits each. Dual and quad accesses require the data to be aligned in memory. Typical applications for quad data accesses are the fast Fourier transform (FFT) and complex FIR filters. Quad accesses also assist double precision operations. Preferably, instructions are accessed as quad words. However, as discussed below, instructions are not required to be aligned in memory.
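
The alignment requirement for dual and quad data accesses can be illustrated with a short Python check. This is a sketch under a stated assumption: the text does not define the alignment rule, so the example assumes natural alignment, in which the word address of an access must be a multiple of the number of 32-bit words accessed. The names `ACCESS_WORDS` and `is_aligned` are hypothetical.

```python
# Number of 32-bit words moved by each access type named in the text.
ACCESS_WORDS = {"single": 1, "dual": 2, "quad": 4}


def is_aligned(word_address, access):
    """Return True if a data access of the given type is naturally aligned.

    Assumption (not stated in the text): an access of N words is aligned
    when its word address is a multiple of N.
    """
    return word_address % ACCESS_WORDS[access] == 0


if __name__ == "__main__":
    for address, access in [(8, "quad"), (6, "quad"), (6, "dual"), (7, "single")]:
        print(address, access, is_aligned(address, access))
```
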
Using quad word transfers, four instructions and eight operands, each of 32 bits, can be supplied to computation blocks 12 and 14 in a single clock cycle. The number of data words transferred and the computation block or blocks to which the data words are transferred are selected by control bits in the instruction. The single, dual, or quad data words can be transferred to computation block 12, to computation block 14, or to both. Dual and quad data word accesses improve the performance of DSP 10 in many applications by allowing several operands to be transferred to the computation blocks 12 and 14 in a single clock cycle. The ability to access multiple instructions in each clock cycle allows multiple operations to be executed in each cycle, thereby improving performance. If operands can be supplied faster than they are needed by the computation blocks 12 and 14, then there are memory cycles left over that can be used by DMA address generators 76 and 78 to provide new data to the memory banks 40, 42 and 44 during those unused cycles, without stealing cycles from the core processor. Finally, the ability to access multiple data words makes it possible to utilize two or more computation blocks and to keep them supplied with operands. The ability to access single or dual data words reduces power consumption in comparison with a configuration where only quad data words are accessed.
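
The figure of four instructions and eight operands per clock cycle follows from the bus widths given earlier, as the following Python arithmetic sketch shows. The function name `words_supplied_per_cycle` is hypothetical. The split of one instruction bank and two operand banks is the typical operating mode described above, not a fixed property of the device.

```python
WORD_BITS = 32        # width of one data word
BANK_BUS_BITS = 128   # width of each memory bank data bus


def words_supplied_per_cycle(instruction_banks=1, operand_banks=2):
    """Count the 32-bit words delivered in one cycle using quad word transfers."""
    words_per_bank = BANK_BUS_BITS // WORD_BITS  # a quad word is 4 words
    return {
        "instructions": instruction_banks * words_per_bank,
        "operands": operand_banks * words_per_bank,
    }


if __name__ == "__main__":
    print(words_supplied_per_cycle())
```
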
In processor 10 shown in FIG. 3, external port 28 may comprise a periphery of the processor and would operate at a periphery clock PCLK. The remaining components of DSP 10, in one embodiment of the invention, would operate at a core clock CCLK, which is a multiple of local clock LCLK, as described below.

FIG. 4 is a part structural, part functional block diagram of some components of processor P1 and the clock signals on which they operate. The processor P1 shown includes a core processor 132, operating at a core clock CCLK frequency, and a periphery 126, operating at either a local clock LCLK frequency or a host clock HCLK frequency, or a multiple of either LCLK or HCLK. Periphery 126 may consist of external port 28 that communicates with external data bus 68 and external address bus 58, shown in FIG. 3. Processor 132 receives both the local clock LCLK signal and the host clock HCLK signal as inputs. Not shown in FIG. 4 is a delay calibration circuit through which each input clock signal is run to account for propagation delays, as described in greater detail hereinafter with reference to FIG. 5. Both clock signals are provided to switch 124, which selects one as the periphery clock PCLK to periphery 126, as described above with reference to FIGS. 1 and 2.

The local clock LCLK signal also is provided to a frequency multiplier 128. Frequency multiplier 128 multiplies the local clock LCLK signal by a ratio selected by the user and outputs the product, which is the core clock signal CCLK, on line 130 to core processor 132. Frequency multiplier 128 may, for example, offer the ratios ×2, ×2.5, ×3, ×3.5 and ×4, one of which is selected by a user to produce the core clock CCLK.
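
The relationship between the local clock and the core clock can be written out as a small Python sketch. It is illustrative only: the function name `core_clock_mhz` is hypothetical, the ratio set is the example set listed above, and the 100 MHz input used in the usage lines is an assumed example value.

```python
# Example ratios listed in the text; an actual device may offer a different set.
ALLOWED_RATIOS = (2, 2.5, 3, 3.5, 4)


def core_clock_mhz(lclk_mhz, ratio):
    """Return the core clock CCLK frequency for a given LCLK and user ratio."""
    if ratio not in ALLOWED_RATIOS:
        raise ValueError(f"ratio {ratio} is not one of {ALLOWED_RATIOS}")
    return lclk_mhz * ratio


if __name__ == "__main__":
    for ratio in ALLOWED_RATIOS:
        print(f"LCLK 100 MHz x{ratio} -> CCLK {core_clock_mhz(100, ratio)} MHz")
```
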
This embodiment of the invention enables the frequency of operation of the core processor 132 to be optimized independently of the frequency of operation of the periphery 126. The frequency of operation of the periphery 126 may be limited by the external bus should such periphery consist of the external parallel port. Such a limitation would not, however, affect the speed of the core processor. The invention also enables the frequency of operation of the periphery to be optimized independently of the speed of operation of the core.

As stated, the host clock HCLK and local clock LCLK are generated independently and may be asynchronous with one another. For example, host clock HCLK may be 66 MHz and local clock LCLK may be 100 MHz. When periphery 126 operates on the local clock LCLK, it appears to operate synchronously with core processor 132. As described above, with reference to FIGS. 1 and 2, the switch to operating on host clock HCLK occurs automatically upon an access request by the host. Because the core clock CCLK is related to the local clock LCLK and because the local clock LCLK may be asynchronously related to the host clock HCLK, periphery 126 may appear to operate asynchronously with core processor 132 (when operating at the host clock HCLK frequency). To account for such operation, an asynchronous interface (not shown in FIG. 4) exists between periphery 126 and core processor 132, and will be described in greater detail below.
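
To make the 66 MHz and 100 MHz example concrete, the following Python sketch computes where successive HCLK rising edges fall within the LCLK cycle. It is an idealized illustration, not part of the disclosure: the function name `edge_offsets_ns` is hypothetical, and the sketch assumes exact frequencies and coincident edges at time zero, whereas independently generated clocks would also drift relative to one another.

```python
from fractions import Fraction


def edge_offsets_ns(hclk_mhz, lclk_mhz, n_edges):
    """Offset in ns of each HCLK rising edge from the preceding LCLK rising edge.

    Idealized model: both clocks have exact frequencies and share an edge at t = 0.
    """
    h_period = Fraction(1000, hclk_mhz)  # clock period in nanoseconds
    l_period = Fraction(1000, lclk_mhz)
    return [(k * h_period) % l_period for k in range(n_edges)]


if __name__ == "__main__":
    for k, offset in enumerate(edge_offsets_ns(66, 100, 6)):
        print(f"HCLK edge {k}: {float(offset):.3f} ns after an LCLK edge")
```

Even in this idealized model the HCLK edges land at different points of the 10 ns LCLK cycle from one edge to the next, which is why signals crossing between a periphery running on HCLK and a core running on a multiple of LCLK call for the asynchronous interface mentioned above.
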
Given the high speeds at which the bus operates, skew in the periphery clock PCLK and the memory clock MCLK should be minimized. In addition, skew in the core clock CCLK should be removed in the frequency multiplier. In one embodiment of the present