`
(12) United States Patent
Babaian et al.

(10) Patent No.:     US 7,895,587 B2
(45) Date of Patent: Feb. 22, 2011
`
`
`(75)
`
`(54) SINGLE-CHIP MULTIPROCESSOR WITH
`CLOCK CYCLE-PRECISE PROGRAM
`SCHEDULING OF PARALLEL EXECUTION
`.
`.
`.
`.
`.
`Inventors: Boris A. Babaian, Moscow (RU);Yuli
`Kh. Sakhin, Moscow (RU); Vladimir
`Yu. Volkonskiy, Moscow (RU); Sergey
`A. Rozhkov, Moscow (RU); VladimirV.
`Tikhorsky, Moscow (RU); Feodor A.
`Gruzdov, Moscow (RU); Leonid N.
`Nazarov, Moscow (RU); Mikhail L.
`Chudakov, Moscow (RU)
`
(73) Assignee: Elbrus International, Moscow (RU)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 1202 days.

(21) Appl. No.: 11/518,038
`
`(22)
`
`Filed:
`
`Sep. 8, 2006
`
`(65)
`
`Prior Publication Data
`
`US 2007/0006193 Al
`
`Jan. 4, 2007
`
`(51)
`
`Related U.S. Application Data
`(62) Division of application No. 09/789,850,filed on Feb.
`20, 2001, now Pat. No. 7,143,401.
`(60) Provisional application No. 60/183,176, filed on Feb.
`17. 2000.
`°
`Int. Cl.
`(2006.01)
`GOOF 9/45
`(2006.01)
`G06F 9/30
`717/161: 717/150: 717/153:
`(52) US. Cl
`eee 77/1 50: 712/220: 712/228
`58)
`Field of Classification
`h
`°
`°
`N
`S
`Sox
`° li
`ten le for nol tesearebhistor
`one
`(58)
`ee applcation
`hie
`tor comprete search)
`story:
`(56)
`References Cited
`U.S. PATENT DOCUMENTS
`
`5,613,146 A
`5,761,516 A *
`5,913,925 A *
`6,058,465 A *
`6,282,583 BL*
`6,347,344 B1*
`6,360,243 B1*
`7,143,401 B2
`
`3/1997 Goveet al.
`....0...... 710/260
`6/1998 Rostoker etal.
`6/1999 Kahleetal. ne. 712/206
`5/2000 Nguyen... eeeeeeee 712/7
`8/2001 Pincus etal. .s.s.sesseese 713/375
`
`we. 710/20
`2/2002 Baker etal. .....
`............ 718/103
`3/2002 Lindsley etal.
`11/2006 Babaian etal.
`
OTHER PUBLICATIONS

Soohong et al., "Across Multiple Superscalar Processors on a Single Chip", Nov. 1997, Proceedings of PACT '97, pp. 1-10.*
Gupta et al., "The Design of RISC based Multiprocessor Chip", 1990, Proceedings of the 1990 conference on Supercomputing, pp. 920-929.*

* cited by examiner
`
Primary Examiner—Tuan Q Dam
Assistant Examiner—Isaac T Tecklu
(74) Attorney, Agent, or Firm—Kilpatrick Townsend & Stockton LLP
`
`(57)
`
`ABSTRACT
`
`Asingle-chip multiprocessor system and operation method of
`this system based on a static macro-scheduling of parallel
`streams for multiprocessor parallel execution. The single-
`chip multiprocessor system has buses for direct exchange
`between the processorregister files and access to their store
`addresses and data. Each explicit parallelism architecture
`processorof this system has an interprocessorinterface pro-
`viding the synchronization signals exchange, data exchange
`at the registerfile level and accessto store addresses and data
`of other processors. The single-chip multiprocessor system
`uses ILP to increase the performance. Synchronization ofthe
`streamsparallel execution is ensured using special operations
`setting a sequence ofstreams and stream fragments execution
`prescribed by the program algorithm.
`
4,414,624 A * 11/1983 Summer et al. .......... 712/21

17 Claims, 12 Drawing Sheets
`
`530
`
`540
`
`Processor A
`
`Processor B
`
`510
`
`“
`Stream
`t
`AO
`Al
`
`Stream
`J YX 20
`BO
`Bl
`
`A2
`
`522 B2
`
`524-526
`
`
`A3
`walt pea
`event Bi B3
`
`
`‘sat pec
`wyAd 514
`516
`528
`B3
`
`
`
`
`event A; AS o 1
`Bz
`feet een
`AG — B3
`P=st fee0
`ee
`AT
`B4
`
`
`
`
`1
`
SAMSUNG 1055
SAMSUNG v. SMART MOBILE
IPR2022-01004
`
`
`
U.S. Patent    Feb. 22, 2011    Sheet 1 of 12    US 7,895,587 B2

[FIGS. 1 and 2: tables of four wide instructions 110, 120, 130, 140 (operation slots 112-118, syllables i1, i2) scheduled in cycles N to N+3, and the same program with collision delays 236, 248, etc. stretching execution to cycle N+7.]
`
`
`
U.S. Patent    Feb. 22, 2011    Sheet 2 of 12    US 7,895,587 B2

[FIG. 3: schedule of four program streams 310, 320, 330, 340 on CPU0-CPU3 starting at cycle N, with delays 312, 324, 336, 348.]
`
`
`
U.S. Patent    Feb. 22, 2011    Sheet 3 of 12    US 7,895,587 B2

[FIG. 4: streams A 410 and B 420 on a time axis, with synchronization points Ai 412 and Bj 422 and wait/permit bit pairs 414, 424.]
`
`
`
U.S. Patent    Feb. 22, 2011    Sheet 4 of 12    US 7,895,587 B2

[FIG. 5: diagram of stream A 510 (wide instructions A0-A7) and stream B 520 (wide instructions B0-B4) executed by Processors A 530 and B 540 in cycles N to N+7, with wait/permit synchronization points 512-528.]
`
`
`
U.S. Patent    Feb. 22, 2011    Sheet 5 of 12    US 7,895,587 B2

[FIG. 6: sequential pass of synchronization points Ai 612, Bj 622 and Ak 616 of streams A 610 and B 620 on a time axis, with wait/permit bits 614, 624 and permit signals 618, 626.]
`
`
`
U.S. Patent    Feb. 22, 2011    Sheet 6 of 12    US 7,895,587 B2

[FIG. 7: block diagram of Single-chip Multiprocessor System 700, with connections to I/O, main memory and other single-chip multiprocessor systems.]
`
`
`
U.S. Patent    Feb. 22, 2011    Sheet 7 of 12    US 7,895,587 B2

[FIG. 8: processor block diagram (labels rotated in the scan); recoverable labels include Predicate File, Instruction Cache, Array Prefetch Buffer, Bypass Buses, To Shared Cache 790, Interprocessor Exchange Subsystem, and To Interprocessor Connect 780.]
`
`
`
`
`
`
`
`
`
U.S. Patent    Feb. 22, 2011    Sheet 8 of 12    US 7,895,587 B2

[FIGS. 9a, 9b and 9c: diagrams of a wide instruction format with header 904 and syllables, an MP synchronization syllable, and an MP register exchange syllable; most labels rotated or garbled in the scan.]
`
`
`
`
`
`
`
`
U.S. Patent    Feb. 22, 2011    Sheet 9 of 12    US 7,895,587 B2

[FIG. 10: block diagram of the Interprocessor Exchange Subsystem: configuration registers 802, Synchronization Exchange Unit 804, Register Exchange Unit 806 and Cache Exchange Unit 808, connected to CPU Control Unit 724, MMU 712, Bypass Bus 748, and the Synchronization, Register and Cache Exchange Interconnects 781, 782, 783.]
`
`
`
U.S. Patent    Feb. 22, 2011    Sheet 10 of 12    US 7,895,587 B2

[FIG. 11: schematic of the Synchronization Exchange Unit 804: permit counters 830-0 to 830-2 with Perm_count_overflow_out_0..2 and Perm_count_overflow_in_0..2 signals, Permit_in_0..2 and Permit_out_0..2 signals, Wait_0..Wait_3 inputs 833-0..833-3, and gating logic 831-x to 839-x driven by configuration registers 802; connected to Synchronization Exchange Interconnect 781 and CPU Control Unit 724.]
`
`
`
U.S. Patent    Feb. 22, 2011    Sheet 11 of 12    US 7,895,587 B2

[FIG. 12: schematic of the Register Exchange Unit 806: receive FIFOs with Data_in_0..2 and Data_out_0..2 ports, Overflow_0..2 and Overflow_in signals, a Transmit Control Switch, a Receive Control Decoder, and transmit/receive request and lock signals, controlled from Configuration Unit 802.]
`
`
`
U.S. Patent    Feb. 22, 2011    Sheet 12 of 12    US 7,895,587 B2

[FIG. 13: schematic of the Cache Exchange Unit 808: a Store Queue and comparator/multiplexer logic 882-x, 890-x, 892-x, 895-x, 896-x, controlled from Configuration Unit 802.]
`
`
`
`US 7,895,587 B2
`
`1
`SINGLE-CHIP MULTIPROCESSOR WITH
`CLOCK CYCLE-PRECISE PROGRAM
`SCHEDULING OF PARALLEL EXECUTION
`
`CROSS-REFERENCE TO RELATED
`APPLICATIONS
`
`20
`
`25
`
`35
`
`40
`
`45
`
This application claims priority from U.S. Provisional Application No. 60/183,176, entitled "SINGLE-CHIP MULTIPROCESSOR WITH CYCLE-PRECISE PROGRAM SCHEDULING OF PARALLEL EXECUTION", filed Feb. 17, 2000, the disclosure of which is incorporated herein by reference. This application also claims priority to U.S. application Ser. No. 09/789,850, entitled "SINGLE-CHIP MULTIPROCESSOR WITH CYCLE-PRECISE PROGRAM SCHEDULING OF PARALLEL EXECUTION", filed Feb. 20, 2001, the disclosure of which is incorporated herein by reference.
`
BACKGROUND OF THE INVENTION

1. Field of the Invention
The present invention relates to shared memory multiprocessors and, more specifically, to a single-chip macro-schedule multiprocessor architecture providing program-controlled cooperation of the processors with explicit use of instruction-level parallelism.

2. Description of the Prior Art
Today's fast growth of the transistor-per-chip number raises the question of how to gain a respectively higher performance. One alternative is to build larger on-chip memories, but this approach can be efficient only to a certain point, after which adding more cache provides only a minor performance improvement. Thus, a preferred alternative at this point is to exploit more parallelism. There are generally two approaches: instruction-level parallelism (ILP) and thread-level parallelism (TLP).
Use of instruction-level parallelism (ILP) involves parallel execution of instruction groups, which helps performance growth. There are dynamic (superscalar) and static (Very Long Instruction Word—VLIW and Explicit Parallel Instruction Computing—EPIC) approaches to ILP use. With the dynamic approach, parallel instruction groups are hardware-generated at the program run, and with the static approach, at the program compilation. An example of the dynamic approach is provided by the microprocessor Pentium IV of Intel (see "Pentium 4 (Partially) Previewed", Peter N. Glaskowsky, Microprocessor Report, Aug. 28, 2000). An example of the static approach is provided by the microprocessor Itanium of Intel (see "Merced Shows Innovative Design", Linley Gwennap, Microprocessor Report, volume 13, number 13, Oct. 6, 1999).
In the dynamic approach, there is a big dynamic hardware window for viewing the executed instructions (in the Pentium IV the window views over 100 instructions) where all possible operation collisions are resolved. In the case of the static approach, the compiler forms instruction groups for their parallel execution and schedules their optimal execution with regard to each instruction's execution time and possible inter-instruction collisions. In this case, the instruction groups include only independent instructions. This approach simplifies the processor hardware. The size of a parallel execution instruction group for modern superscalar architecture microprocessors generally reaches 4-6 instructions, with future increases up to 8 instructions (see microprocessor Power 4 of IBM, "IBM's Power 4 Unveiling Continues", Microprocessor Report, Nov. 20, 2000). For static architecture microprocessors it generally reaches from 6-12 instructions (see IA-64, Itanium, McKinley) to over 20 instructions (see Keith Diefendorff, "The Russians Are Coming", Microprocessor Report, pp. 1, 6-11, vol. 13, number 2, Feb. 15, 1999).
Further increase of the parallel execution instruction group size leads to physically large monolithic cores and complex control mechanisms, which are limiting factors for increases in the clock frequency. The number of access ports to register files and internal caches is growing. The hardware for resolving inter-instruction dependencies in superscalar microprocessors is becoming complicated. The probability of collisions unaccounted for in a static architecture microprocessor during compilation is growing, which results in violations of the schedule made at compile time, causing additional delays at the program run. Moreover, design and verification become too complicated and time-consuming.
Thread-level parallelism (TLP) is a prospective method of further performance increases for dynamic and static architectures. Use of thread-level parallelism (TLP) involves parallel execution of many program threads in a multiprocessor environment. Threads are weakly coupled or just independent fragments of one program allowing their parallel execution on different processors with small overheads for control and synchronization, which are performed by the operating system and by means of semaphores. However, not all applications can be parallelized in such a way. A major difficulty is posed by parallelization of integer applications, which have data dependencies and short parallel branches, because synchronization using semaphores is very costly for them.
Static architectures have a potential for performance growth in a multiprocessor system due to a more aggressive use of ILP and application of the static scheduling method to parallel execution on many processors. Examples of ILP use can be really independent in-program computations (separate expressions, procedures, loop iterations, etc.), as well as compiler optimizations aimed at speeding up the computations due to parallel execution of possible alternatives (the so-called speculative and predicative computations). This may allow an increase in the utilization of ILP in programs by up to 63%. (See Gary Tyson et al., "Quantifying Instruction Level Parallelism Limits on an EPIC Architecture", ISPASS-2000, Austin, Tex., 2000.)
The compiler for a static macro-schedule architecture performs a global scheduling of the program execution taking into account the available data and control dependences. In this case the number of instructions in a group intended for parallel execution (a super-wide instruction) is equal to the total number of instructions in all instruction groups (wide instructions) in all processors of the multiprocessor system. That is, the compiler makes a schedule for a synchronous execution of the super-wide instructions in all processors of the system. A sequence of wide instructions to be executed in one processor forms a wide instruction stream, or simply a stream. Thus, the schedule for the whole program execution is divided into a multitude of streams in compliance with the available number of processors.
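The division of a super-wide instruction schedule into per-processor streams can be sketched in a few lines. This is an illustrative Python model, not from the patent; the representation of a schedule as a list of cycles, each holding one wide instruction (a tuple of operations) per processor, is an assumption for clarity.

```python
# Illustrative model (assumed data layout, not the patent's): a schedule is a
# list of super-wide instructions, one per clock cycle; each super-wide
# instruction holds one wide instruction (a tuple of operations) per processor.
def split_into_streams(schedule, num_processors):
    """Divide a super-wide instruction schedule into per-processor streams."""
    streams = [[] for _ in range(num_processors)]
    for super_wide in schedule:          # one entry per scheduled cycle
        for cpu in range(num_processors):
            streams[cpu].append(super_wide[cpu])
    return streams

# A 2-cycle schedule for a hypothetical 2-processor system:
schedule = [
    [("add", "load"), ("mul", "store")],  # cycle N
    [("shift",), ("logic",)],             # cycle N+1
]
streams = split_into_streams(schedule, 2)
# streams[0] is the wide-instruction stream executed by processor 0
```

Cycle-by-cycle, the streams remain aligned: the k-th wide instruction of every stream came from the same super-wide instruction, which is what makes the synchronous schedule clock cycle-precise.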
While making a schedule for parallel operation of all processors in a multiprocessor system, the compiler forms streams for each processor so as to minimize data and control dependencies between different streams. This shortens the delays caused by a necessity to access the context of another stream executed in another processor. The streams can be executed independently of each other until an explicitly specified synchronization instruction appears in the instruction sequence. During the program run the static schedule can be violated, which is caused by collisions arising in different processors, which cannot be accounted for at the compilation
`14
`
`14
`
`
`
`US 7,895,587 B2
`
`4
`The compiler for sucha system performsstatic clock cycle-
`precise scheduling of executing the program by super-wide
`instructions and then divides this schedule into a few separate
`streamsfor parallel execution on a multiprocessor. Streaming
`is performed to minimize the interstream dependencies. In
`this way, the effect of a super-wide instruction issue in the
`on-chip multiprocessor system is attained.
`Fast synchronization of the parallel program streamsis
`performed, if necessary,
`in compliance with the program
`algorithm using special synchronization operations maintain-
`ing the sequence of execution for separate fragments of dif-
`ferent streams.
`
`25
`
`3
`stage. Examples of such collisions may be a cache miss,
`data-dependent divide and multiply operations, etc. For this
`reason it is necessary to have synchronization means, 1.e.,
`maintenanceofthe specified sequence of executing separate
`fragments in different streams with the aim to properly
`resolve the data and control dependencies. Theefficiency of
`the macro-schedule multiprocessor system dependslargely
`on the efficiency of the interstream context access and syn-
`chronization means implementation.
`A single-chip multiprocessor is generally most suited for
`static macro-schedule execution. A single-processor chip has
`a limited numberof external connections caused by the con-
`The single-chip multiprocessor system can be tightly-
`strained package abilities. A single-processor chip typically
`coupled and include high performance buses for data
`has only system interface for access to main memory, other
`exchange between the processors at the register file level,
`processors and 1/O. Unlikethis, the single-chip multiproces-
`buses for supporting processor cache memories coherence
`sor besides the system interface may include very fast and
`and communication links for transmitting control signals
`wide interprocessor connections data exchange,
`internal
`ensuring the required synchronization (execution sequence of
`caches coherence support and synchronization of the streams
`executed in parallel.
`critical parts of different streams) in the multiprocessor sys-
`20
`
`A single-chip multiprocessor may haveavirtual processor tem, or weakly-coupled, when the data exchange between the
`numbering, which allows for simultaneously performing
`program streamsis executed using the memory access opera-
`tions.
`independent
`programs
`providing
`sufficient processor
`resources. Further performance increases may be attained in
`The explicit parallelism architecture processor comprises
`a multi-ship system comprising single-chip multiprocessors,
`synchronization means, means for data exchangeat the reg-
`in which interchip access and synchronization may be
`ister file level and data cache level for close processors inter-
`action.
`handled in a traditional way using a semaphore method,etc.
`Static macro-schedule architecture efficiently uses TLP,
`since in this case the threads may be considered as streams
`with weak data and control dependencies.
`ExpLicit Basic Resource Utilization Scheduling (EL-
`BRUS) microprocessor architecture (see Keith Diefendorff,
`“The Russians Are Coming”, Microprocessor Report, pp. 1,
`6-11, vol. 13, number2, Feb. 15, 1999) is mostly suited for the
`single-chip multiprocessor using static macro-schedule,
`because ELBRUSarchitecture is oriented to the execution of
`
`The above-described static macro-scheduling method of
`the program compilation and on-chip multiprocessor system
`has a number of advantages. The static macro-scheduling
`methodof the program compilation makes a most aggressive
`use of ILP due to a super-wide instruction schedule. An on-
`chip multiprocessor system due to asynchronousexecution of
`parallel streams substantially reduces the effect of collisions
`dynamically arising in the processor hardware during the
`programrun.
`system substantially
`The single-chip multiprocessor
`increases the performance dueto parallel and macro-sched-
`uled execution of a multitude of program streams, high per-
`formance and efficiently synchronized data exchange
`between the processors.
`Each processorin a single-chip multiprocessor system may
`have a more simple structure and limited parallel execution
`abilities, which promotes to the clock frequency increase,
`reduces timefor parallel processing, and reduces time for the
`processor design andverification.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`FIGS. 1 and 2 are tablesillustrating the effect of summa-
`rizing delays caused by collisions unaccounted for by a com-
`piler in a quad-issue explicit parallelism architecture proces-
`SOL;
`FIG.3 illustrates a real schedule for execution of a program
`in four 1-issue explicit parallelism architecture processors;
`FIG.4 is a schematic illustration of an exampleofinitial-
`izing a sequential path of a synchronization point;
`FIG. 5 is a diagram of executing streams by processors;
`FIG.6 is a schematic illustration ofan example of a sequen-
`tial path of synchronization points;
`FIG.7 is a block diagram of a single-chip multiprocessor
`system in accordance with the present invention;
`FIG.8 is an overview block diagram of an embodimentof
`a processor in accordance with the present invention;
`FIG.9a is a diagram of an embodimentof a wide instruc-
`tion;
`FIG. 96 is a diagram of a format of an MP synchronization
`syllable;
`
`40
`
`45
`
`the static clock cycle-precise scheduled program with explicit
`parallelism.
`AnELBRUSmicroprocessor wide instruction maycontain
`over 20 operations (simple instructions of the type:
`load,
`store, add, multiply, shift, logic and others). An ELBRUS
`microprocessor has additionally speculative and predicative
`mode operations, which increases its potentialities to effi-
`ciently use ILP. A scoreboarding feature allows automatic
`correction of the static schedule, when dynamic collisions
`arise during the program run.
`An object of the present invention is therefore a method of
`synchronization and control of parallel execution of streams
`of a macro-scheduled program without addressing the opera-
`tion system, based on the static macro-scheduling of the
`program. Anotherobject ofthe present inventionis to provide
`a single-chip multiprocessor with interprocessor connections
`for fast registers’ data exchange, acceleration of cache coher-
`ency support and synchronization of parallel streams execu-
`tion. A further object of the present invention is an ExpLicit
`Basic Resource Utilization Scheduling (ELBRUS)micropro-
`cessor with means for interprocessor synchronization and
`interprocessor exchange of data and addresses through
`above-mentioned interprocessor connections.
`
`SUMMARYOF THE INVENTION
`
`60
`
`In accordance with the present invention, a single-chip
`multiprocessor system with explicit parallelism architecture
`processors and an operation methodofthis system is based on
`the static macro-scheduling of the program for the multipro-
`cessor parallel execution to help ensure a high-level use of
`ILP.
`
`15
`
`15
`
`
`
FIG. 9c is a diagram of an MP register exchange syllable;
FIG. 10 is a block diagram of an interprocessor exchange subsystem in accordance with the present invention;
FIG. 11 is a schematic illustration of a synchronization exchange unit in accordance with the present invention;
FIG. 12 is a schematic illustration of a register exchange unit in accordance with the present invention; and
FIG. 13 is a schematic illustration of a cache exchange unit in accordance with the present invention.
`
`DETAILED DESCRIPTION OF THE PREFERRED
`EXEMPLARY EMBODIMENTS
`
`
`
FIGS. 1 and 2 show the effect of summarizing delays caused by the collisions unaccounted for by the compiler in a quad-issue explicit parallelism architecture processor. It is assumed that each wide instruction consists of four operations 112, 114, 116, 118. The execution time of each wide instruction is equal to one cycle, and the time of collision resolution is also equal to one cycle. Assume the program consists of 16 independent operations and the compiler allocates them to four wide instructions 110, 120, 130, 140.
The compiler makes a schedule of executing four wide instructions, which will take four cycles from n to n+3 (FIG. 1). Let the collisions unaccounted for by the compiler take place in each wide instruction: op1 212, op6 224, op11 236 and op16 248 during the program run time. Thus, the real-time run of this program takes eight cycles from n to n+7, since each wide instruction 210, 220, 230, 240 is delayed for an additional one cycle and the total delay unaccounted for by the compiler is equal to the sum of all delays and makes up four cycles (see FIG. 2).
FIG. 3 presents a real schedule for execution of the same program in four 1-issue explicit parallelism architecture processors with the same assumptions about the instruction execution time, collision resolution time and the time of collision occurrence in the operations. The compiler sets up four program streams 310, 320, 330, 340 for execution on four processors. In this case, the whole program is executed within five cycles without the effect of accumulated delays caused by the collisions. The total delay that is unaccounted for is equal to the longest of the delays that occurred in the four program streams. In this simple example, the delay 312, 324, 336, 348 in all four streams is equal to one cycle, which is why the total delay is equal to one cycle.
Examples presented in FIGS. 1, 2 and 3 illustrate that with equal issue width in monoprocessor and multiprocessor systems, the latter is less affected by the collisions unaccounted for in static scheduling at the compilation stage.
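The arithmetic behind this comparison, delays summing in one wide processor versus only the maximum delay mattering across parallel streams, can be checked with a short sketch. This is illustrative Python, not part of the patent; it assumes one extra cycle per unaccounted collision, as in the figures.

```python
# Illustrative check of the FIGS. 1-3 example: in one quad-issue processor
# the unaccounted collision delays of consecutive wide instructions add up,
# while across four 1-issue processors only the slowest stream matters.
def monoprocessor_cycles(scheduled_cycles, delays):
    return scheduled_cycles + sum(delays)   # delays accumulate serially

def multiprocessor_cycles(scheduled_cycles, delays):
    return scheduled_cycles + max(delays)   # streams run in parallel

delays = [1, 1, 1, 1]   # one extra cycle per wide instruction / per stream
print(monoprocessor_cycles(4, delays))   # 8 cycles, as in FIG. 2
print(multiprocessor_cycles(4, delays))  # 5 cycles, as in FIG. 3
```

The same model shows why the advantage grows with stream count: the sum grows with every collision, while the maximum grows only when the worst stream gets worse.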
In accordance with an embodiment, synchronization is implemented using special operations, which along with other operations are a part of wide instructions, and which are located in the synchronization points of the streams. The synchronization operation, with the help of a set of bit pairs "wait" and "permit", specifies in a synchronization point the relationship between the given stream and each other stream. Presented below are the possible states of the bit relationship:
`
    wait    permit
    0       0         don't care
    0       1         permit another stream
    1       0         wait for another stream
    1       1         wait for another stream, then permit another stream
`
`6
`Code 00 has no synchronization influence. Code 01 per-
`mits anotherstream to pass its synchronization point. Code 10
`prescribes waiting for another stream passing of its synchro-
`nization point. Code 11 determines the relation to two syn-
`chronization points of another stream: waiting for another
`stream passing its first synchronization point and then per-
`mitting another stream to pass its second synchronization
`point.
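The four codes can be modeled directly. This is an illustrative Python decoder, not the patent's hardware; the action strings are assumptions chosen to mirror the table above.

```python
# Illustrative decoder for the (wait, permit) bit pair of a synchronization
# operation. The returned list is ordered: for code 11 the stream first
# waits for the other stream, then permits it.
def decode_sync_bits(wait, permit):
    actions = []
    if wait:
        actions.append("wait for another stream")
    if permit:
        actions.append("permit another stream")
    return actions if actions else ["don't care"]

print(decode_sync_bits(0, 1))  # code 01: permit only
print(decode_sync_bits(1, 1))  # code 11: wait first, then permit
```

The ordering encoded here is the essential point of code 11: the permit is issued only after the local wait has been satisfied, which is what chains synchronization points into a sequential pass as in FIG. 6.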
FIG. 4 illustrates an example of initializing a sequential pass of the synchronization point Ai 412 by a stream A 410 and point Bj 422 by a stream B 420, and the state of the "wait" and "permit" bits 414, 424 of the synchronization operation in both streams. The synchronization operation in point Bj has the state "wait" and may be executed only after the synchronization operation in point Ai, issuing a "permit" 414 signal, is executed.
FIG. 5 presents a diagram of executing the streams A 510 and B 520 by processors 530 and 540 in the case when an event in point Bj 528 occurs earlier than an event in point Ai 518. In this case, the whole wide instruction B3 522, containing the synchronization operation, is delayed until the wide instruction A5 512 is executed.
FIG. 6 presents an example of a sequential pass of the synchronization points Ai 612 of the stream A 610, Bj 622 of the stream B 620, and Ak 616 of the stream A 610, and the state of the "wait" and "permit" bits in the synchronization operations of both streams. The synchronization operation in point Bj 622 has the state "wait" and "permit" 624 and may be executed only provided the synchronization operation in point Ai 612 is executed and the "permit" signal 618 is issued. Only then does the synchronization operation in point Bj 622 issue a "permit" signal 626 for the synchronization operation in point Ak 616.
A reverse counter may be used to count "permit" and "wait" events. This allows for setting up the relation of the execution sequence to groups of events in the synchronized streams. The "permit" signal from another stream increments the counter, and the execution of the synchronization operation containing the "wait" bit in its own stream decrements the counter. Overflow of the counter locks the stream which has an intention to transmit the "permit" signals in the point of execution of its synchronization operation. The zero state of the reverse counter locks the stream which is waiting for the "permit" signals, as shown in FIGS. 4 and 6.
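A minimal model of this reverse (up/down) counter follows. This is an illustrative Python sketch, not the patent's hardware; the capacity value and the method/return names are assumptions made for the example.

```python
# Illustrative model of the reverse counter: a "permit" from the other
# stream increments it, a "wait" in the local stream decrements it.
# CAPACITY is a hypothetical limit, not taken from the patent.
CAPACITY = 3

class PermitCounter:
    def __init__(self):
        self.count = 0

    def permit(self):
        """Another stream issues a "permit"; a full counter locks that stream."""
        if self.count == CAPACITY:
            return "permitting stream locked"   # overflow: sender must stall
        self.count += 1
        return "permit recorded"

    def wait(self):
        """The local stream executes a "wait"; a zero counter locks this stream."""
        if self.count == 0:
            return "waiting stream locked"      # no permit yet: receiver stalls
        self.count -= 1
        return "wait satisfied"

c = PermitCounter()
print(c.wait())    # counter is zero, so the waiting stream is locked
print(c.permit())  # a permit arrives and is recorded
print(c.wait())    # now the wait is satisfied
```

Because permits are counted rather than single-shot, one stream can run several synchronization points ahead of its partner, up to the counter capacity, before either side stalls.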
A method of synchronization of the streams' parallel execution in accordance with this embodiment is intended to ensure the order of data accesses in compliance with the program algorithm during the program streams' parallel execution.
`
The contents of each processor register file may be transmitted through the external interface into the register files of all other processors. This is the fastest way to transfer the computation results from stream to stream.
Store addresses and store data of each processor are accessible to all other processors through the external interface. This method speeds up the traditional mechanism of internal cache coherence support in a multiprocessor system.
An alternative embodiment provides data exchange only through the memory access operations.
Each processor may transmit target addresses for stream branching to all other processors through the external interface. This method provides a means of controlling the streams' parallel execution directly from the executed program without addressing the operating system.
An alternative embodiment, the description of which follows hereafter, provides the target address exchange only through the memory access operations.
`
`10
`
`35
`
`45
`
`60
`
`65
`
`16
`
`16
`
`
`
In FIG. 7, a block diagram of a Single-chip Multiprocessor System 700 suited for parallel execution based on static macro-scheduling is illustrated. Included in this embodiment of Single-chip Multiprocessor System 700 are four Processors 710, an Interprocessor Connect 780, a Shared Cache Unit 790, and a System Interface Unit 792. In addition, the multiprocessor system 700 may include some sets of units, such as I/O controllers 796, memory controllers 798, specialized co-processors, and so on.

The Processors 710 are preferably ELBRUS processors, whose internal structure will be further described below. (ELBRUS architecture is well suited for use in a Single-chip Multiprocessor System, but other embodiments can use other types of processors.) In accordance with the present invention, an Interprocessor Exchange Subsystem is preferably included in the processor structure, and the Instruction Set is preferably extended to explicitly support static macro-scheduling on the basis of fast interprocessor synchronization and data and address exchange.

Shared Cache Unit 790 preferably includes a four-way set associative cache memory, address and data buffers and appropriate control and switching hardware. The cache memory has an interleaved multi-bank structure and can execute up to four read or write operations per cycle. The control hardware provides data and address switching between cache banks and four processors and non-blocking execution of read and write operations. In case of a cache miss or a non-cached request (such as a request to an I/O area), the cache control hardware requests the System Interface Unit 792 for main memory or I/O access. Besides that, a responsibility of the cache control hardware is cache coherency support between the Shared Cache Unit 790 and the internal caches of the Processors 710.

which is not advantageous to exploit in existing multiprocessor systems because of the big overheads for processor interaction. The main signals of the Synchronization Exchange Interconnect 781 are "permit" signals, which processors exchange over a full-cross interconnection; i.e., in a four-processor system each processor transmits and receives three "permit" signals. Additionally, the interface includes "permit counter overflow" signals prohibiting "permit" signal issue; the principles of signal generation and interaction will be described below.

As previously mentioned, the provided system allows the same arrangement of a multiprocessor system operation as that of a super-wide instruction issue processor and utilizes program parallelism of any kind possible for a given program with the utmost efficiency. To do this, besides the fast synchronization signals exchange, each processor needs extremely fast access to the data generated by other processors. The highest performance may be attained if a shared register file and shared data cache are available for all processors. However, it is obvious that such solutions are acceptable only for small, for example two-processor, systems consisting of processors with an issue width of no more than 2-4. Otherwise register files and cache memories with a very large number of ports have to be built, which results in an unreasonable growth of hardware volume and, what is more important, in an inevitable decrease of the processor clock frequency. Actually, such an approach implies building a wide instruction issue processor with an internal clustered structure instead of the multiprocessor system. Examples of such solutions can be MAJC [4] or the below-described Processor 710, where the cluster structure is used to attain the utmost parallelism in one processor, and its further extension may be implemented only by building a single-chip multiprocessor s