United States Patent [19]
Garde

[11] Patent Number: 5,922,076
[45] Date of Patent: Jul. 13, 1999

US005922076A

[54] CLOCKING SCHEME FOR DIGITAL SIGNAL PROCESSOR SYSTEM

[75] Inventor: Douglas Garde, Dover, Mass.

[73] Assignee: Analog Devices, Inc., Norwood, Mass.

[21] Appl. No.: 08/931,665

[22] Filed: Sep. 16, 1997

[51] Int. Cl.6: G06F 1/04
[52] U.S. Cl.: 713/600
[58] Field of Search: 395/553, 555, 556, 559; 713/500, 501, 600; 709/400, 248

[56] References Cited

U.S. PATENT DOCUMENTS

5,611,075   3/1997   Garde           395/480
5,619,720   4/1997   Garde et al.    395/800
5,685,005   11/1997  Garde et al.    395/800

Primary Examiner-Thomas M. Heckler
Attorney, Agent, or Firm-Wolf, Greenfield & Sacks, P.C.

[57] ABSTRACT

A digital signal processing system includes a cluster of processors and a host. A host can access each of the processors through an external bus system that interconnects the host with each of the processors. An external port of each of the processors operates at one of a local clock frequency and host clock frequency, the local clock frequency and host clock frequency being asynchronous with one another. The host operates at the host clock frequency. Upon a host access of one of the processors, the clock frequency of operation of the external parallel port of each processor automatically is controlled to operate at the host clock frequency. In an embodiment, each processor also includes a core processor that operates at a core clock frequency that is a multiple of the local clock frequency, asynchronous with the host clock frequency. Thus, the speed of operation of the core processor and that of the external parallel port can be optimized independently.

12 Claims, 6 Drawing Sheets
`
[Front-page drawing (same labels as FIG. 4): processor P1 receiving LCLK and HCLK; MUX 124; PCLK to PERIPH. 126; LCLK to FREQ. MULT. 128; CCLK on 130 to CORE 132; reference numerals 118 and 120.]

U.S. Patent   Jul. 13, 1999   Sheet 1 of 6   5,922,076

[FIG. 1: HOST 100, MEM. 102, bus 104, line 106, ANALOG SWITCH 108, processors P1-P4 each receiving LCLK and HCLK, BUFFERS 110 (HCLK's to each destination) and BUFFERS 112 (LCLK's to each destination).]

[FIG. 2: processor P1 with MUX 124 receiving LCLK and HCLK and outputting PCLK and MCLK.]

[FIG. 3 (Sheet 2 of 6): CONTROL BLOCK 24 with J ALU 72, K ALU 74, DMAG A 76, DMAG B 78 and PROGRAM SEQUENCER 70; LINK PORT BUFFERS and COMMUNICATION PORTS (4) 36; IAB; PRIM INSTR DEC 34; address buses MA0 50, MA1 52, MA2 54, MAE 56; data buses MD0 60, MD1 62, MD2 64; EXT PORT; external buses 58 and 68; MEMORY BANKs M0 40, M1 42, M2 44 (2 Mbit each, 64K x 32 or 16K x 128); COMPUTATION BLOCK X and COMPUTATION BLOCK Y.]

[FIG. 4 (Sheet 3 of 6): processor P1 receiving LCLK and HCLK; MUX 124; PCLK to PERIPH. 126; LCLK to FREQ. MULT. 128; CCLK on 130 to CORE 132; reference numerals 118 and 120.]

[FIG. 5 (Sheet 4 of 6): DELAY LOCKED LOOP 136 with PHASE DET, DELAY CONTROL, INVERTER CHAIN and I/O CLOCK; DISTRIBUTION TREE and COPY OF DISTRIBUTION TREE; OUTPUT AND LATCHING; OUTPUT PAD; delays T1 and T2; signals LATEN, UPDATEN, TRIEN; reference numerals 138-168.]

[FIG. 6 (Sheet 5 of 6): external port 28 with INPUT FIFO 170 (IFIFO, IDFIFO, IAFIFO), output FIFO (OFIFO, ODFIFO, OAFIFO), OBUF, PACK, DIRECT WRITE, DIRECT READ (SLAVE), MUXES AND DRIVERS, INTERNAL DATA BUSES 60, 62, 64, EXT AD 58 and EXT DA 68.]

[FIG. 7 (Sheet 6 of 6): HCLK RE-SYNCHRONIZATION LATCHES; WRITE DECODER and COUNTER; READ DECODER and COUNTER; COMPARE; EMPTY; WRITE FROM EXT BUS (32 or 64); LATCH ON RISING EDGE; EARLY CCLK; CCLK (RE-SYNCHRONIZED); (t=3ns); reference numerals 170 and 190-206.]
`
CLOCKING SCHEME FOR DIGITAL SIGNAL PROCESSOR SYSTEM

FIELD OF THE INVENTION

The present invention relates to digital signal processors, and more specifically, to a digital signal processor system and method having a unique asynchronous clocking scheme.

BACKGROUND OF THE INVENTION

A digital signal processor (DSP) is a special purpose computer that is designed to optimize performance for digital signal processing applications such as, for example, fast Fourier transforms, digital filtering, image processing and speech recognition. Digital signal processing applications typically are characterized by real time operation, high interrupt rates and intensive numeric computations. In addition, digital signal processing applications tend to be intensive in memory access operations and to require the input and output of large quantities of data. Thus, designs of digital signal processors may be quite different from those of general purpose computers.

A typical digital signal processor includes at least one memory for storing digital signal processing operations instructions as well as operands used in the digital signal processing operations, and a core processor, connected to the memory, for carrying out such operations. A digital signal processor also typically includes a peripheral input/output (I/O) device enabling communication with, and the transfer of data to/from, other processors and/or external devices. The core processor includes some type of computation unit for performing the digital signal processing operations (i.e., computations) on the operands based on the instructions. Many different computational schemes as well as data storage and transferring schemes have been developed for optimizing speed, accuracy, size and performance of digital signal processors.

A digital signal processor commonly operates based upon receipt of a single input clock. From this single input clock are derived a core processor clock, on which the core processor operates, and an I/O clock, on which the I/O device operates. It is not uncommon for the input clock and the I/O clock to be maintained at the same frequency.

The core processor clock may be a multiple of this input clock such that the core processor operates at a different (typically greater) clock frequency than that of the I/O device. The speed of the I/O device is limited by the speed of the external signals upon which it operates. The speed of such external signals may be limited by physical constraints and capacitances and inductances of external devices and buses. The core processor is not so limited. Therefore, it is preferable to have the core processor operate at a different, and more optimal clock frequency.

Some digital signal processors allow the user to select a ratio (e.g., ×2, ×2.5, ×3, ×3.5, ×4 ...) by which the input clock will be multiplied to produce the core processor clock. This enables the user to select, within a limited range, a core processor frequency that is best for the particular processor.

As the geometries of processors shrink, internal speed paths improve, enabling faster operation. For a particular processor, therefore, there is an optimal speed at which the processor can operate. A limitation in currently available processors is that the core processor frequency is limited by the input clock and the user-selectable core clock ratios available.

In a digital signal processing system, a cluster (i.e., four, six or eight) of processors may be interconnected by an external bus system. A host computer, connected to each of the processors in the system through the bus system, may access any of the processors. The host computer operates at a host clock frequency that may be unrelated (asynchronously related) to the input clock frequency (I/O clock frequency) of each of the processors in the cluster.

When the host wishes to access any of the processors, either the host clock and the processor I/O clock must be synchronized, or asynchronous access must be enabled. Synchronization would require some type of external synchronizing interface between the host and each processor in the cluster. Alternatively, the provision of asynchronous access would require an additional, asynchronous processor I/O interface. To date, each of the approaches aimed at enabling an asynchronously operating host to access a processor requires complex and expensive circuitry. In addition, each of such approaches may be difficult for a user to implement and use.

SUMMARY OF THE INVENTION

It is a general object of the present invention to provide an improved processor clocking scheme.

One embodiment of the invention is directed to a digital signal processor. The digital signal processor receives a local clock and a system clock, wherein the local clock frequency and the system clock frequency may be asynchronous with one another. A core processor operates at a core clock frequency that is a multiple of the local clock frequency. An external parallel port, coupled to the core processor, is operable at the system clock frequency or at the local clock frequency.

In an embodiment of the invention, the digital signal processor further includes a resynchronization circuit, coupled between the external parallel port and the core processor, that receives an input command signal and latches in the command signal when valid.

Another embodiment of the invention is directed to a digital signal processing system. The system includes a plurality of processors, each connected to another by an external bus system through an external port. A host, connected to each of the plurality of processors through the external bus system, operates at a host clock frequency. The host can access each processor through the external bus system. The external port of each of the processors operates either at a local clock frequency or at the host clock frequency, or at a multiple of either the local clock frequency or host clock frequency. Upon a host access, the clock frequency of the external port of each processor automatically is controlled to operate at the host clock frequency.

In one embodiment, the system further includes an external memory unit, connected to the host and to at least one of the processors through the external bus system. The memory also operates either at the local clock frequency or at the host clock frequency. Upon a host access of either one of the processors or of the memory unit, the clock frequency of the memory unit also automatically is controlled to operate at the host clock frequency.

In an embodiment, the clock frequency of operation of the external port of each processor is user-controlled.

In an embodiment, each processor includes a switch that receives a local clock and a host clock and selects one for operation of the external parallel port. In one embodiment, the switch includes a multiplexer.

In an embodiment, the clock frequency of the memory unit is controlled by a master processor to which it is connected.
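The switching behavior summarized above can be illustrated with a small behavioral model. This is only a sketch under stated assumptions, not circuitry from the patent: the class names `Processor` and `Cluster`, the string labels for the clocks, and the return to the local clock once the host access ends are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass

LOCAL = "LCLK"  # label for the local clock (illustrative)
HOST = "HCLK"   # label for the host clock (illustrative)


@dataclass
class Processor:
    """Hypothetical model of one processor's external-port clock switch."""
    name: str
    port_clock: str = LOCAL  # clock currently driving the external port

    def select_port_clock(self, host_access: bool) -> str:
        # Behaves like a 2:1 multiplexer: host clock during a host access,
        # local clock otherwise (the return to LCLK is an assumption).
        self.port_clock = HOST if host_access else LOCAL
        return self.port_clock


@dataclass
class Cluster:
    """Hypothetical model of the cluster plus its external memory unit."""
    processors: list
    memory_clock: str = LOCAL

    def host_access(self, active: bool) -> None:
        # A host access switches the port of EVERY processor, and the
        # memory unit, to the host clock.
        for proc in self.processors:
            proc.select_port_clock(active)
        self.memory_clock = HOST if active else LOCAL


if __name__ == "__main__":
    cluster = Cluster([Processor(f"P{i}") for i in range(1, 5)])
    cluster.host_access(True)
    print([p.port_clock for p in cluster.processors], cluster.memory_clock)
```

The model deliberately ignores timing: it captures only which clock is selected, not how the handover between two asynchronous clocks is made safe.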
`
`
`
In an embodiment of the system, each processor of the system includes a core processor that operates at a multiple of the local clock frequency, wherein the local clock frequency may be asynchronous with the host clock frequency. In this embodiment, each processor further includes a resynchronization circuit, coupled between the core processor and the external port, that latches in a received command signal when valid.

A further embodiment of the invention is directed to a method of digital signal processing. The method includes connecting a host to a plurality of digital signal processors through a bus system; operating an external port of each processor at a local clock frequency, a host clock frequency, or a multiple of either the local clock frequency or host clock frequency; and automatically switching operation of the external port of each processor to the host clock frequency upon an access by the host of one of the processors.

In an embodiment, the method further includes the step of operating a core processor of each digital signal processor at a multiple of the local clock frequency, which may be asynchronous with the system clock frequency.

The features and advantages of the present invention will be more readily understood and apparent from the following detailed description of the invention, which should be read in conjunction with the accompanying drawings and from the claims which are appended to the end of the detailed description.

BRIEF DESCRIPTION OF THE DRAWING

For a better understanding of the present invention, reference is made to the accompanying drawings, which are incorporated herein by reference.

FIG. 1 is a block diagram of a system including a cluster of processors according to one embodiment of the invention.

FIG. 2 is a block diagram of an alternate embodiment of the system shown in FIG. 1.

FIG. 3 is a block diagram of the internal components of an exemplary processor that may be used with the present invention.

FIG. 4 is a part functional, part structural block diagram of certain processor components and the different clock signals on which the components operate.

FIG. 5 is a block diagram of an exemplary delay calibration circuit that may be used with a processor of the invention.

FIG. 6 is a block diagram of an exemplary external port block that may be employed within a processor of the invention.

FIG. 7 is a part functional, part structural block diagram of a resynchronization circuit that may be employed within a processor of the invention.

DETAILED DESCRIPTION

One embodiment of the present invention is directed to a cluster of digital signal processors interconnected by a bus system, and a host that can access any of the processors through the bus system. A periphery of each of the processors, connected to the bus system, operates at one of a local clock frequency and a host clock frequency. The host operates at the host clock frequency and, when the host accesses one of the processors, the clock frequency of operation of the periphery of each of the processors automatically is switched to the host clock frequency.

Another embodiment of the present invention is directed to a digital signal processor having a core processor that may operate asynchronously with the periphery of the processor. In particular, the periphery of the processor, such as an external parallel port, may operate at either a local clock frequency or a host clock frequency, wherein a user may select between the two. A core processor of the digital processor operates at a multiple of the local clock frequency. The local clock frequency and the host clock frequency may be independently generated and may be asynchronous with one another.

FIG. 1 is a block diagram showing an exemplary embodiment of the present invention including a cluster of digital signal processors P1-P4. The system shown also includes a host 100 and a memory 102. The host 100, memory 102, and processors P1-P4 are interconnected by a bus system 104. The host may include an external computer that communicates with each of processors P1-P4 and external memory 102. External memory 102 may be any suitable external memory that operates with such a digital signal processing system such as Synchronous Dynamic Random Access Memory (SDRAM). Data may be written to or read from each of the processors, as well as to/from the memory.

Preferably, the external bus operates as a pipelined bus. In other words, the data may arrive one, two or three cycles after an address is issued, corresponding to a pipeline delay of one, two or three cycles respectively. Addresses may be issued on every cycle. Preferably, all signals are sampled on the clock signal rising edge and must meet a set-up time and a hold-time requirement.

During operation, host 100 may access any one of processors P1-P4 or memory 102 through bus 104. Host 100 operates on a host clock HCLK at a host clock frequency. Each processor P1-P4 receives the host clock HCLK and a local clock LCLK. In one embodiment, as explained in greater detail below, the host clock HCLK and local clock LCLK are independently generated and may be asynchronous with one another.

A periphery of each processor, that portion of the processor, such as an external parallel port, which couples the internal components of the processor to the external bus system 104, may operate at either the local clock LCLK frequency or the host clock HCLK frequency. In one embodiment, as explained below, this operation is user selectable. Similarly, the memory may operate at either the local clock LCLK frequency or the host clock HCLK frequency.

In this embodiment, a buffer 110, having multiple series-terminated outputs, provides the host clock HCLK signal to each destination, which, in this embodiment, includes host 100, each processor P1-P4, and memory 102. Similarly, buffer 112, also having multiple series-terminated outputs, provides local clock LCLK signal to each destination, which, in this embodiment, includes each processor P1-P4 and memory 102. Each clock signal is provided on a separate trace, output from the buffer. The buffers ensure that the same clock signal timing is provided to each destination.

During operation, a periphery of each processor P1-P4 and memory 102 may be operating at the local clock LCLK frequency. When host 100 is to access one of processors P1-P4 or memory 102, the clock frequency of operation of the periphery of each processor P1-P4 automatically is switched from that of the local clock LCLK to that of the host clock HCLK. At the same time, the clock frequency of operation of the memory also is switched automatically from that of the local clock LCLK to that of the host clock HCLK.

In one embodiment, the switching occurs when a Host Bus Request (HBR) or Host Bus Grant (HBG) control signal
`
`
`
is asserted by the host. Such control signal may be provided to each processor causing an internal switch (not shown) in each processor to switch the clock frequency from the local clock LCLK to the host clock HCLK. The switch internal to each processor may include a multiplexer, or the like. Glitch suppression is required for any clock signal switch to the processor. For example, glitch suppression can be attained by waiting for one clock to go low, and holding the clock output until the other clock goes low, and then driving the output with the first clock at that point.

In one embodiment, an external analog switch 108 selects one of host clock HCLK or local clock LCLK to clock the memory. A master processor P3 provides a control signal along line 106, at the appropriate time, causing analog switch 108 to select the host clock HCLK signal and provides such signal to memory 102. Switch 108 preferably is a low-resistance analog switch, such that the switching delay is maintained to be less than 0.2 nanoseconds. For example, the switch may be made from a low-resistance Field Effect Transistor. For external switch 108, the switching from the local clock LCLK to the host clock HCLK does not have to be glitch-free because no memory access is occurring during the switch over.

In an alternate embodiment of the system shown in FIG. 1, switch 108 of FIG. 1 is replaced by an internal multiplexer 124, shown in FIG. 2. Such a system includes four processors P1-P4, host 100, and memory 102 (see FIG. 1). Like the system of FIG. 1, the host operates at a host clock HCLK frequency and a periphery (I/O port) of each of the processors P1-P4 operates at a periphery clock PCLK frequency which may be equal to either the host clock HCLK frequency or at the local clock LCLK frequency. Memory 102 operates at a memory clock MCLK frequency which also may be equal to either the host clock HCLK frequency or at the local clock LCLK frequency. As in the embodiment of FIG. 1, upon a host access (of memory or a processor), periphery clock PCLK and memory clock MCLK automatically are switched to host clock HCLK. The switching may be performed internally of each processor by multiplexer 124. Multiplexer 124 is controlled to switch automatically to the host clock HCLK upon a host bus access or grant. The output of multiplexer 124 includes periphery clock PCLK signal and memory clock MCLK signal. One master processor P1-P4 may be selected to provide memory clock MCLK signal along bus 116 to memory 102.

Each processor shown in the systems of FIGS. 1 and 2 may be implemented having the components shown in FIG. 3. As shown, the principal components of DSP 10 are computation blocks 12 and 14, a memory 16, a control block 24, link port buffers 26, an external port 28, a DRAM controller 30, an instruction alignment buffer (IAB) 32 and a primary instruction decoder 34. Computation blocks 12 and 14, instruction alignment buffer 32, primary instruction decoder 34 and control block 24 constitute a core processor which performs the main computation and data processing functions of DSP 10. External port 28 controls external communications via an external address bus 58 and an external data bus 68. External port 28 may constitute the periphery of DSP 10. Link port buffers 26 control external communication via communication ports 36. DSP 10 is preferably configured as a single monolithic integrated circuit.

Memory 16 includes three independent, large capacity memory banks 40, 42 and 44. In an embodiment, each of memory banks 40, 42 and 44 has a capacity of 64K words of 32 bits each. Each of the memory banks 40, 42 and 44 may have a 128-bit data bus. Up to four consecutive aligned data words of 32 bits each can be transferred to or from each memory bank in a single clock cycle.

The elements of DSP 10 are interconnected by buses for efficient, high speed operation. Each of the buses includes multiple lines for parallel transfer of binary information. A first address bus 50 (MA0) interconnects memory bank 40 (M0) and control block 24. A second address bus 52 (MA1) interconnects memory bank 42 (M1) and control block 24. A third address bus 54 (MA2) interconnects memory bank 44 (M2) and control block 24. Each of the address buses 50, 52 and 54 may be 16 bits wide. An external address bus 56 (MAE) interconnects external port 28 and control block 24. External address bus 56 is connected through external port 28 to external address bus 58. Each of the external address buses 56 and 58 may be 32 bits wide. A first data bus 60 (MD0) interconnects memory bank 40, computation blocks 12 and 14, control block 24, link port buffers 26, IAB 32 and external port 28. A second data bus 62 (MD1) interconnects memory bank 42, computation blocks 12 and 14, control block 24, link port buffers 26, IAB 32 and external port 28. A third data bus 64 (MD2) interconnects memory bank 44, computation blocks 12 and 14, control block 24, link port buffers 26, IAB 32 and external port 28. The data buses 60, 62 and 64 are connected through external port 28 to external data bus 68. Each of the data buses 60, 62 and 64 may be 128 bits wide, and external data bus 68 may be 64 bits wide.

The first address bus 50 and the first data bus 60 comprise a bus for transfer of data to and from memory bank 40. The second address bus 52 and the second data bus 62 comprise a second bus for transfer of data to and from memory bank 42. The third address bus 54 and the third data bus 64 comprise a third bus for transfer of data to and from memory bank 44. Since each of memory banks 40, 42 and 44 has a separate bus, memory banks 40, 42 and 44 may be accessed simultaneously. As used herein, "data" refers to binary words, which may represent either instructions or operands that are associated with the operation of DSP 10. In a typical operating mode, program instructions are stored in one of the memory banks, and operands are stored in the other two memory banks. Thus, at least one instruction and two operands can be provided to computation blocks 12 and 14 in a single clock cycle. As described below, each of memory banks 40, 42, and 44 is configured to permit reading and writing of multiple data words in a single clock cycle. The simultaneous transfer of multiple data words from each memory bank in a single clock cycle is accomplished without requiring an instruction cache or a data cache.

The control block 24 includes a program sequencer 70, a first integer ALU 72 (J ALU), a second integer ALU 74 (K ALU), a first DMA address generator 76 (DMAG A) and a second DMA address generator 78 (DMAG B). Integer ALU's 72 and 74, at different times, execute integer ALU instructions and perform data address generation. During execution of a program, program sequencer 70 supplies a sequence of instruction addresses on one of address buses 50, 52, 54 and 56, depending on the memory location of the instruction sequence. Typically, one of memory banks 40, 42 or 44 is used for storage of the instruction sequence. Each of integer ALU's 72 and 74 supplies a data address on one of address buses 50, 52, 54 and 56, depending on the location of the operand required by the instruction. Assume, for example, that an instruction sequence is stored in memory bank 40 and that the required operands are stored in memory banks 42 and 44. In this case, the program sequencer supplies instruction addresses on address bus 50 and the accessed instructions are supplied to the instruction alignment buffer 32, as described below. Integer ALU's 72 and 74
`
`
`
may, for example, output addresses of operands on address buses 52 and 54, respectively. In response to the addresses generated by integer ALU's 72 and 74, memory banks 42 and 44 supply operands on data buses 62 and 64, respectively, to either or both of computation blocks 12 and 14. Memory banks 40, 42 and 44 are interchangeable with respect to storage of instructions and operands.

Program sequencer 70 and the integer ALU's 72 and 74 may access an external memory (not shown) via external port 28. The desired external memory address is placed on address bus 56. The external address is coupled through external port 28 to external address bus 58. The external memory supplies the requested data word or data words on external data bus 68. The external data is supplied via external port 28 and one of the data buses 60, 62 and 64 to one or both of computation blocks 12 and 14. The DRAM controller 30 controls the external memory.

As indicated above, each of the memory banks 40, 42 and 44 may have a capacity of 64K words of 32 bits each. Each memory bank may be connected to a data bus that is 128 bits wide. In an alternative embodiment, each data bus may be 64 bits wide, and 64 bits are transferred on each of clock phase 1 and clock phase 2, thus providing an effective bus width of 128 bits. Multiple data words can be accessed in each memory bank in a single clock cycle. Specifically, data can be accessed as single, dual or quad words of 32 bits each. Dual and quad accesses require the data to be aligned in memory. Typical applications for quad data accesses are the fast Fourier transform (FFT) and complex FIR filters. Quad accesses also assist double precision operations. Preferably, instructions are accessed as quad words. However, as discussed below, instructions are not required to be aligned in memory.

core processor 132, operating at a core clock CCLK frequency, and a periphery 126, operating at either a local clock LCLK frequency or a host clock HCLK frequency, or a multiple of either LCLK or HCLK. Periphery 126 may consist of external port 28 that communicates with external data bus 68 and external address bus 58, shown in FIG. 3.

Processor 132 receives both the local clock LCLK signal and the host clock HCLK signal as inputs. Not shown in FIG. 4 is a delay calibration circuit through which each input clock signal is run to account for propagation delays, as described in greater detail hereinafter with reference to FIG. 5. Both are provided to switch 124 which selects one as the periphery clock PCLK to periphery 126, as described above with reference to FIGS. 1 and 2.

The local clock LCLK signal also is provided to a frequency multiplier 128. Frequency multiplier 128 multiplies the local clock LCLK signal by a ratio selected by the user and outputs the product, which is the core clock signal CCLK, on line 130 to core processor 132. Frequency multiplier may, for example, include the ratios ×2, ×2.5, ×3, ×3.5, ×4, one of which is selected by a user to produce the core clock CCLK.

This embodiment of the invention enables the frequency of operation of the core processor 132 to be optimized independently of the frequency of operation of the periphery 126. The frequency of operation of the periphery 126 may be limited by the external bus should such periphery consist of the external parallel port. Such a limitation would not, however, affect the speed of the core processor. The invention also enables the frequency of operation of the periphery to be optimized independently of the speed of operation of the core.
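The relationship between the local clock and the core clock described for frequency multiplier 128 amounts to a single multiplication by a user-selected ratio. The following is a minimal sketch of that arithmetic; the function name `core_clock_mhz`, the use of MHz as the unit, and the 100 MHz local clock in the usage loop are illustrative assumptions rather than anything specified by the patent.

```python
from fractions import Fraction

# Ratios named in the text: x2, x2.5, x3, x3.5, x4.
ALLOWED_RATIOS = (
    Fraction(2), Fraction(5, 2), Fraction(3), Fraction(7, 2), Fraction(4),
)


def core_clock_mhz(lclk_mhz, ratio):
    """Return CCLK = ratio * LCLK (hypothetical helper, frequencies in MHz)."""
    ratio = Fraction(ratio)  # 2.5 and 3.5 convert exactly to 5/2 and 7/2
    if ratio not in ALLOWED_RATIOS:
        raise ValueError(f"unsupported core clock ratio: {float(ratio)}")
    return float(Fraction(lclk_mhz) * ratio)


if __name__ == "__main__":
    # Illustrative local clock value only.
    for r in (2, 2.5, 3, 3.5, 4):
        print(f"LCLK=100 MHz, ratio x{r}: CCLK={core_clock_mhz(100, r)} MHz")
```

Because the core clock is derived only from LCLK in this sketch, changing which clock drives the periphery has no effect on the result, which mirrors the independence of core and periphery speeds described above.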
`
As stated, the host clock HCLK and local clock LCLK are generated independently and may be asynchronous with one another. For example, host clock HCLK may be 66 MHz and local clock LCLK may be 100 MHz. When periphery 126

Using quad word transfers, four instructions and eight operands, each of 32 bits, can be supplied to computation blocks 12 and 14 in a single clock cycle. The number of data
`words transferred and the computati