US007895587B2

(12) United States Patent                          (10) Patent No.:     US 7,895,587 B2
     Babaian et al.                                (45) Date of Patent: Feb. 22, 2011
(54) SINGLE-CHIP MULTIPROCESSOR WITH CLOCK CYCLE-PRECISE PROGRAM SCHEDULING OF PARALLEL EXECUTION

(75) Inventors: Boris A. Babaian, Moscow (RU); Yuli Kh. Sakhin, Moscow (RU); Vladimir Yu. Volkonskiy, Moscow (RU); Sergey A. Rozhkov, Moscow (RU); Vladimir V. Tikhorsky, Moscow (RU); Feodor A. Gruzdov, Moscow (RU); Leonid N. Nazarov, Moscow (RU); Mikhail L. Chudakov, Moscow (RU)

(73) Assignee: Elbrus International, Moscow (RU)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 1202 days.

(21) Appl. No.: 11/518,038
(22) Filed: Sep. 8, 2006

(65) Prior Publication Data
     US 2007/0006193 A1     Jan. 4, 2007

Related U.S. Application Data

(62) Division of application No. 09/789,850, filed on Feb. 20, 2001, now Pat. No. 7,143,401.

(60) Provisional application No. 60/183,176, filed on Feb. 17, 2000.

(51) Int. Cl.
     G06F 9/45     (2006.01)
     G06F 9/30     (2006.01)
(52) U.S. Cl. ......... 717/161; 717/150; 717/153; 712/220; 712/228
(58) Field of Classification Search ......................... None
     See application file for complete search history.
(56)                References Cited

          U.S. PATENT DOCUMENTS

     4,414,624 A *  11/1983  Summer et al. ............ 712/21
     5,613,146 A     3/1997  Gove et al. .............. 710/260
     5,761,516 A *   6/1998  Rostoker et al.
     5,913,925 A *   6/1999  Kahle et al. ............. 712/206
     6,058,465 A *   5/2000  Nguyen ................... 712/7
     6,282,583 B1 *  8/2001  Pincus et al. ............ 713/375
     6,347,344 B1 *  2/2002  Baker et al. ............. 710/20
     6,360,243 B1 *  3/2002  Lindsley et al. .......... 718/103
     7,143,401 B2   11/2006  Babaian et al.

            OTHER PUBLICATIONS

Soohong et al., "Across Multiple Superscalar Processors on a Single Chip", Nov. 1997, Proceedings of PACT'97, pp. 1-10.
Gupta et al., "The Design of RISC based Multiprocessor Chip", 1990, Proceedings of the 1990 Conference on Supercomputing, pp. 920-929.

* cited by examiner
Primary Examiner—Tuan Q Dam
Assistant Examiner—Isaac T Tecklu
(74) Attorney, Agent, or Firm—Kilpatrick Townsend & Stockton LLP

(57)                  ABSTRACT

A single-chip multiprocessor system and operation method of this system based on a static macro-scheduling of parallel streams for multiprocessor parallel execution. The single-chip multiprocessor system has buses for direct exchange between the processor register files and access to their store addresses and data. Each explicit parallelism architecture processor of this system has an interprocessor interface providing the synchronization signals exchange, data exchange at the register file level and access to store addresses and data of other processors. The single-chip multiprocessor system uses ILP to increase the performance. Synchronization of the streams parallel execution is ensured using special operations setting a sequence of streams and stream fragments execution prescribed by the program algorithm.

17 Claims, 12 Drawing Sheets
[Representative drawing (FIG. 5): executing streams A (510) and B (520) by Processor A (530) and Processor B (540) over cycles N to N+7, with "wait"/"permit" synchronization between the streams.]
[Drawing sheets 1-12 (U.S. Patent, Feb. 22, 2011, US 7,895,587 B2). Only the figure captions and major labels are recoverable from the drawings:]

Sheet 1 of 12, FIGS. 1 and 2: tables illustrating the effect of summarizing delays caused by collisions unaccounted for by a compiler in a quad-issue explicit parallelism architecture processor (operations Op1-Op8, cycles N to N+3).
Sheet 2 of 12, FIG. 3: a real schedule for execution of a program in four 1-issue explicit parallelism architecture processors (CPU0-CPU3, delays 312, 324, 336, 348).
Sheet 3 of 12, FIG. 4: initializing a sequential pass of a synchronization point ("wait"/"permit" bits 414, 424 at points Ai 412 and Bj 422 of streams A 410 and B 420).
Sheet 4 of 12, FIG. 5: diagram of executing streams A (510) and B (520) by Processor A (530) and Processor B (540) over cycles N to N+7.
Sheet 5 of 12, FIG. 6: a sequential path of synchronization points Ai 612, Bj 622 and Ak 616 of streams A 610 and B 620.
Sheet 6 of 12, FIG. 7: block diagram of Single-chip Multiprocessor System 700, connected to I/O, main memory and other single-chip multiprocessor systems.
Sheet 7 of 12, FIG. 8: overview block diagram of a processor, including an instruction cache, bypass buses and an interprocessor exchange subsystem, with connections to Shared Cache 790 and Interprocessor Connect 780.
Sheet 8 of 12, FIGS. 9a, 9b and 9c: formats of a wide instruction, an MP synchronization syllable and an MP register exchange syllable.
Sheet 9 of 12, FIG. 10: interprocessor exchange subsystem with Synchronization Exchange Unit 804, Register Exchange Unit 806, Cache Exchange Unit 808, Control Unit 724, MMU 712, Synchronization Exchange Interconnect 781, Register Exchange Interconnect 782, Cache Exchange Interconnect 783 and Bypass Bus 748.
Sheet 10 of 12, FIG. 11: synchronization exchange unit with Permit_in/Permit_out signals, permit counters and Perm_count_overflow_in/out signals, connected to Synchronization Exchange Interconnect 781 and CPU Control Unit 724.
Sheet 11 of 12, FIG. 12: register exchange unit 806 with per-channel FIFOs, transmit and receive control, Data_in/Data_out and Overflow signals.
Sheet 12 of 12, FIG. 13: cache exchange unit 808 with store queues and connections to Configuration Unit 802.
SINGLE-CHIP MULTIPROCESSOR WITH CLOCK CYCLE-PRECISE PROGRAM SCHEDULING OF PARALLEL EXECUTION

CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims priority from U.S. Provisional Application No. 60/183,176, entitled "SINGLE-CHIP MULTIPROCESSOR WITH CYCLE-PRECISE PROGRAM SCHEDULING OF PARALLEL EXECUTION", filed Feb. 17, 2000, the disclosure of which is incorporated herein by reference. This application also claims priority to U.S. application Ser. No. 09/789,850, entitled "SINGLE-CHIP MULTIPROCESSOR WITH CYCLE-PRECISE PROGRAM SCHEDULING OF PARALLEL EXECUTION", filed Feb. 20, 2001, the disclosure of which is incorporated herein by reference.
BACKGROUND OF THE INVENTION
1. Field of the Invention

The present invention relates to shared memory multiprocessors and, more specifically, to a single-chip macro-schedule multiprocessor architecture providing program-controlled cooperation of the processors with explicit use of instruction-level parallelism.
2. Description of the Prior Art

Today's fast growth in the number of transistors per chip raises the question of how to gain correspondingly higher performance. One alternative is to build larger on-chip memories, but this approach can be efficient only to a certain point, after which adding more cache provides only a minor performance improvement. Thus, a preferred alternative at this point is to exploit more parallelism. There are generally two approaches: instruction-level parallelism (ILP) and thread-level parallelism (TLP).
Use of instruction-level parallelism (ILP) involves parallel execution of instruction groups, which helps performance growth. There are dynamic (superscalar) and static (Very Long Instruction Word—VLIW and Explicit Parallel Instruction Computing—EPIC) approaches to ILP use. With the dynamic approach, parallel instruction groups are hardware-generated at program run time, and with the static approach, at program compilation. An example of the dynamic approach is provided by the Intel Pentium IV microprocessor (see "Pentium 4 (Partially) Previewed", Peter N. Glaskowsky, Microprocessor Report, Aug. 28, 2000). An example of the static approach is provided by the Intel Itanium microprocessor (see "Merced Shows Innovative Design", Linley Gwennap, Microprocessor Report, volume 13, number 13, Oct. 6, 1999).
In the dynamic approach, there is a big dynamic hardware window for viewing the executed instructions (in the Pentium IV the window views over 100 instructions), where all possible operation collisions are resolved. In the case of the static approach, the compiler forms instruction groups for their parallel execution and schedules their optimal execution with regard to each instruction's execution time and possible inter-instruction collisions. In this case, the instruction groups include only independent instructions. This approach simplifies the processor hardware. The size of a parallel execution instruction group for modern superscalar architecture microprocessors generally reaches 4-6 instructions, with future increases up to 8 instructions (see the IBM Power 4 microprocessor, "IBM's Power 4 Unveiling Continues", Microprocessor Report, Nov. 20, 2000). For static architecture microprocessors it generally reaches from 6-12 instructions (see IA-64, Itanium, McKinley) to over 20 instructions (see Keith Diefendorff, "The Russians Are Coming", Microprocessor Report, pp. 1, 6-11, vol. 13, number 2, Feb. 15, 1999).
Further increase of the parallel execution instruction group size leads to physically large monolithic cores and complex control mechanisms, which are limiting factors for increases in the clock frequency. The number of access ports to register files and internal caches grows. The hardware for resolving inter-instruction dependencies in superscalar microprocessors becomes complicated. The probability of collisions unaccounted for at compilation in a static architecture microprocessor grows, which results in violations of the schedule made at compile time, causing additional delays at program run time. Moreover, design and verification become too complicated and time-consuming.
Thread-level parallelism (TLP) is a promising method of further performance increase for dynamic and static architectures. Use of thread-level parallelism (TLP) involves parallel execution of many program threads in a multiprocessor environment. Threads are weakly coupled or simply independent fragments of one program, allowing their parallel execution on different processors with small overheads for control and synchronization, which are performed by the operating system and by means of semaphores. However, not all applications can be parallelized in such a way. A major difficulty is posed by parallelization of integer applications, which have data dependencies and short parallel branches, because synchronization using semaphores is very costly for them.
Static architectures have a potential for performance growth in a multiprocessor system due to a more aggressive use of ILP and application of the static scheduling method to parallel execution on many processors. Examples of ILP use can be truly independent in-program computations (separate expressions, procedures, loop iterations, etc.), as well as compiler optimizations aimed at speeding up the computations through parallel execution of possible alternatives (the so-called speculative and predicative computations). This may allow utilization of ILP in programs to be increased by up to 63%. (See Gary Tyson et al., "Quantifying Instruction Level Parallelism Limits on an EPIC Architecture", ISPASS-2000, Austin, Tex., 2000.)
The compiler for a static macro-schedule architecture performs global scheduling of the program execution taking into account the available data and control dependences. In this case the number of instructions in a group intended for parallel execution (a super-wide instruction) is equal to the total number of instructions in all instruction groups (wide instructions) in all processors of the multiprocessor system. That is, the compiler makes a schedule for a synchronous execution of the super-wide instructions in all processors of the system. A sequence of wide instructions to be executed in one processor forms a wide instruction stream, or simply a stream. Thus, the schedule for the whole program execution is divided into a multitude of streams in compliance with the available number of processors.
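To make the terminology concrete, the following sketch models a super-wide instruction as one wide instruction per processor and a stream as the per-processor column of the schedule. It is a hypothetical illustration only; the type names, processor count and issue width are assumptions, not taken from the patent.

/* Hypothetical sketch of the terminology above: a super-wide instruction
 * holds one wide instruction per processor, and the stream of processor p
 * is the column of the schedule that belongs to p. The processor count and
 * issue width are illustrative assumptions. */
#include <stddef.h>

#define NUM_PROCESSORS 4
#define ISSUE_WIDTH    6                 /* operations per wide instruction */

typedef struct {
    int ops[ISSUE_WIDTH];                /* operation codes; 0 = empty slot */
} wide_instruction;

typedef struct {
    wide_instruction per_cpu[NUM_PROCESSORS];
} super_wide_instruction;

/* Divide the super-wide schedule into one stream per processor:
 * streams[p][cycle] is the wide instruction processor p executes at 'cycle'. */
static void split_into_streams(const super_wide_instruction *schedule,
                               size_t n_cycles,
                               wide_instruction *streams[NUM_PROCESSORS])
{
    for (size_t cycle = 0; cycle < n_cycles; cycle++)
        for (int p = 0; p < NUM_PROCESSORS; p++)
            streams[p][cycle] = schedule[cycle].per_cpu[p];
}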
While making a schedule for parallel operation of all processors in a multiprocessor system, the compiler forms streams for each processor so as to minimize data and control dependencies between different streams. This shortens the delays caused by the necessity to access the context of another stream executed in another processor. The streams can be executed independently of each other until an explicitly specified synchronization instruction appears in the instruction sequence. During the program run the static schedule can be violated, which is caused by collisions arising in different processors that cannot be accounted for at the compilation
stage. Examples of such collisions may be a cache miss, data-dependent divide and multiply operations, etc. For this reason it is necessary to have synchronization means, i.e., maintenance of the specified sequence of executing separate fragments in different streams with the aim to properly resolve the data and control dependencies. The efficiency of the macro-schedule multiprocessor system depends largely on the efficiency of the interstream context access and synchronization means implementation.

A single-chip multiprocessor is generally most suited for static macro-schedule execution. A single-processor chip has a limited number of external connections caused by the constrained package abilities. A single-processor chip typically has only a system interface for access to main memory, other processors and I/O. Unlike this, the single-chip multiprocessor, besides the system interface, may include very fast and wide interprocessor connections for data exchange, internal caches coherence support and synchronization of the streams executed in parallel.

A single-chip multiprocessor may have a virtual processor numbering, which allows for simultaneously performing independent programs providing sufficient processor resources. Further performance increases may be attained in a multi-chip system comprising single-chip multiprocessors, in which interchip access and synchronization may be handled in a traditional way using a semaphore method, etc.

Static macro-schedule architecture efficiently uses TLP, since in this case the threads may be considered as streams with weak data and control dependencies.

ExpLicit Basic Resource Utilization Scheduling (ELBRUS) microprocessor architecture (see Keith Diefendorff, "The Russians Are Coming", Microprocessor Report, pp. 1, 6-11, vol. 13, number 2, Feb. 15, 1999) is most suited for the single-chip multiprocessor using static macro-scheduling, because the ELBRUS architecture is oriented to the execution of a static, clock cycle-precise scheduled program with explicit parallelism.

An ELBRUS microprocessor wide instruction may contain over 20 operations (simple instructions of the type: load, store, add, multiply, shift, logic and others). An ELBRUS microprocessor additionally has speculative and predicative mode operations, which increases its ability to efficiently use ILP. A scoreboarding feature allows automatic correction of the static schedule when dynamic collisions arise during the program run.

An object of the present invention is therefore a method of synchronization and control of parallel execution of streams of a macro-scheduled program, without addressing the operating system, based on the static macro-scheduling of the program. Another object of the present invention is to provide a single-chip multiprocessor with interprocessor connections for fast exchange of register data, acceleration of cache coherency support and synchronization of parallel streams execution. A further object of the present invention is an ExpLicit Basic Resource Utilization Scheduling (ELBRUS) microprocessor with means for interprocessor synchronization and interprocessor exchange of data and addresses through the above-mentioned interprocessor connections.

SUMMARY OF THE INVENTION

In accordance with the present invention, a single-chip multiprocessor system with explicit parallelism architecture processors and an operation method of this system is based on the static macro-scheduling of the program for the multiprocessor parallel execution to help ensure a high-level use of ILP.

The compiler for such a system performs static clock cycle-precise scheduling of executing the program by super-wide instructions and then divides this schedule into a few separate streams for parallel execution on a multiprocessor. Streaming is performed to minimize the interstream dependencies. In this way, the effect of a super-wide instruction issue in the on-chip multiprocessor system is attained.

Fast synchronization of the parallel program streams is performed, if necessary, in compliance with the program algorithm, using special synchronization operations maintaining the sequence of execution for separate fragments of different streams.

The single-chip multiprocessor system can be tightly-coupled and include high performance buses for data exchange between the processors at the register file level, buses for supporting processor cache memories coherence and communication links for transmitting control signals ensuring the required synchronization (execution sequence of critical parts of different streams) in the multiprocessor system, or weakly-coupled, when the data exchange between the program streams is executed using the memory access operations.

The explicit parallelism architecture processor comprises synchronization means and means for data exchange at the register file level and data cache level for close processors interaction.

The above-described static macro-scheduling method of the program compilation and on-chip multiprocessor system has a number of advantages. The static macro-scheduling method of the program compilation makes a most aggressive use of ILP due to a super-wide instruction schedule. An on-chip multiprocessor system, due to asynchronous execution of parallel streams, substantially reduces the effect of collisions dynamically arising in the processor hardware during the program run.

The single-chip multiprocessor system substantially increases the performance due to parallel and macro-scheduled execution of a multitude of program streams and high performance, efficiently synchronized data exchange between the processors.

Each processor in a single-chip multiprocessor system may have a simpler structure and limited parallel execution abilities, which promotes the clock frequency increase, reduces time for parallel processing, and reduces time for the processor design and verification.

BRIEF DESCRIPTION OF THE DRAWINGS

FIGS. 1 and 2 are tables illustrating the effect of summarizing delays caused by collisions unaccounted for by a compiler in a quad-issue explicit parallelism architecture processor;
FIG. 3 illustrates a real schedule for execution of a program in four 1-issue explicit parallelism architecture processors;
FIG. 4 is a schematic illustration of an example of initializing a sequential path of a synchronization point;
FIG. 5 is a diagram of executing streams by processors;
FIG. 6 is a schematic illustration of an example of a sequential path of synchronization points;
FIG. 7 is a block diagram of a single-chip multiprocessor system in accordance with the present invention;
FIG. 8 is an overview block diagram of an embodiment of a processor in accordance with the present invention;
FIG. 9a is a diagram of an embodiment of a wide instruction;
FIG. 9b is a diagram of a format of an MP synchronization syllable;
FIG. 9c is a diagram of an MP register exchange syllable;
FIG. 10 is a block diagram of an interprocessor exchange subsystem in accordance with the present invention;
FIG. 11 is a schematic illustration of a synchronization exchange unit in accordance with the present invention;
FIG. 12 is a schematic illustration of a register exchange unit in accordance with the present invention; and
FIG. 13 is a schematic illustration of a cache exchange unit in accordance with the present invention.
DETAILED DESCRIPTION OF THE PREFERRED EXEMPLARY EMBODIMENTS
FIGS. 1 and 2 show the effect of summarizing delays caused by the collisions unaccounted for by the compiler in a quad-issue explicit parallelism architecture processor. It is assumed that each wide instruction consists of four operations 112, 114, 116, 118. The execution time of each wide instruction is equal to one cycle, and the time of collision resolution is also equal to one cycle. Assume the program consists of 16 independent operations and the compiler allocates them to four wide instructions 110, 120, 130, 140.

The compiler makes a schedule of executing four wide instructions, which will take four cycles from n to n+3 (FIG. 1). Let the collisions unaccounted for by the compiler take place in each wide instruction: op1 212, op6 224, op11 236 and op16 248 during the program run time. Thus, the real run time of this program takes eight cycles from n to n+7, since each wide instruction 210, 220, 230, 240 is delayed for an additional one cycle and the total delay unaccounted for by the compiler is equal to the sum of all delays and makes up four cycles (see FIG. 2).
FIG. 3 presents a real schedule for execution of the same program in four 1-issue explicit parallelism architecture processors with the same assumptions about the instruction execution time, collision resolution time and the time of collision occurrence in the operations. The compiler sets up four program streams 310, 320, 330, 340 for execution on four processors. In this case, the whole program is executed within five cycles without the effect of accumulated delays caused by the collisions. The total delay that is unaccounted for is equal to the longest of the delays that occurred in the four program streams. In this simple example, the delay 312, 324, 336, 348 in all four streams is equal to one cycle, which is why the total delay is equal to one cycle.

Examples presented in FIGS. 1, 2 and 3 illustrate that with the equal issue width in monoprocessor and multiprocessor systems, the latter is less affected by the collisions unaccounted for in static scheduling at the compilation stage.
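The arithmetic of this comparison can be checked with a short sketch in C. It is only an illustrative model of the example of FIGS. 1-3, not part of the patent; the function names and the one-cycle delay values are assumptions taken from the scenario above.

/* Sketch of the delay accounting in the FIGS. 1-3 example.
 * Single wide-issue processor: every delayed wide instruction stalls the
 * whole schedule, so the unaccounted delays accumulate.
 * Multiprocessor: each stream absorbs its own delay, so the schedule is
 * stretched only by the longest per-stream delay. */
#include <stdio.h>

static int single_processor_cycles(int scheduled_cycles, const int *delays, int n)
{
    int total = scheduled_cycles;
    for (int i = 0; i < n; i++)
        total += delays[i];                  /* delays sum up */
    return total;
}

static int multiprocessor_cycles(int scheduled_cycles, const int *stream_delays, int n_streams)
{
    int max_delay = 0;
    for (int i = 0; i < n_streams; i++)      /* only the longest delay is visible */
        if (stream_delays[i] > max_delay)
            max_delay = stream_delays[i];
    return scheduled_cycles + max_delay;
}

int main(void)
{
    int delays[4] = {1, 1, 1, 1};            /* one-cycle collision per wide instruction */
    printf("quad-issue processor:    %d cycles\n", single_processor_cycles(4, delays, 4)); /* 8 */
    printf("four 1-issue processors: %d cycles\n", multiprocessor_cycles(4, delays, 4));   /* 5 */
    return 0;
}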
In accordance with an embodiment, synchronization is implemented using special operations, which along with other operations are a part of wide instructions, and which are located in the synchronization points of the streams. The synchronization operation, with the help of a set of bit pairs "wait" and "permit", specifies in a synchronization point the relationship between the given stream and each other stream. Presented below are the possible states of the bit pair:
    wait   permit   meaning
    ----   ------   -------
     0       0      don't care
     0       1      permit another stream
     1       0      wait for another stream
     1       1      wait for another stream, then permit another stream
Code 00 has no synchronization influence. Code 01 permits another stream to pass its synchronization point. Code 10 prescribes waiting for another stream's passing of its synchronization point. Code 11 determines the relation to two synchronization points of another stream: waiting for another stream passing its first synchronization point and then permitting another stream to pass its second synchronization point.
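The semantics of the bit pair can be sketched in software as follows. This is a behavioral model only, not the hardware described later (which exchanges and counts "permit" signals in a synchronization exchange unit); the type and function names, and the single pending-permit flag per direction, are assumptions made for illustration.

/* Behavioral sketch (not the hardware implementation) of the "wait"/"permit"
 * bit pair carried by a synchronization operation. One pending-permit flag
 * per direction is assumed here; the hardware uses counters (see the counter
 * sketch further below). */
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_bool permit_pending;   /* set by the other stream's "permit" */
} sync_channel;

/* Executed as part of a wide instruction at a synchronization point.
 * 'from_other' carries permits from the other stream, 'to_other' carries
 * permits to it. */
static void sync_op(bool wait_bit, bool permit_bit,
                    sync_channel *from_other, sync_channel *to_other)
{
    if (wait_bit) {
        /* code 10 or 11: stall this stream until the other stream has
         * passed its synchronization point and issued a "permit" */
        while (!atomic_exchange(&from_other->permit_pending, false))
            ;                                 /* spin: the stream is locked */
    }
    if (permit_bit) {
        /* code 01 or 11: allow the other stream to pass its point */
        atomic_store(&to_other->permit_pending, true);
    }
    /* code 00: no synchronization influence */
}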
FIG. 4 illustrates an example of initializing a sequential pass of the synchronization point Ai 412 by a stream A 410 and point Bj 422 by a stream B 420, and the state of the "wait" and "permit" bits 414, 424 of the synchronization operation in both streams. The synchronization operation in point Bj has the state "wait" and may be executed only after the synchronization operation in point Ai, issuing a "permit" signal 414, is executed.
FIG. 5 presents a diagram of executing the streams A 510 and B 520 by processors 530 and 540 in the case when an event in point Bj 528 occurs earlier than an event in point Ai 518. In this case, the whole wide instruction B3 522, containing the synchronization operation, is delayed until the wide instruction A5 512 is executed.
FIG. 6 presents an example of a sequential pass of the synchronization points Ai 612 of the stream A 610, Bj 622 of the stream B 620, and Ak 616 of the stream A 610, and the state of the "wait" and "permit" bits in the synchronization operations of both streams. The synchronization operation in point Bj 622 has the state "wait" and "permit" 624 and may be executed only provided the synchronization operation in point Ai 612 is executed and the "permit" signal 618 is issued. Only then does the synchronization operation in point Bj 622 issue a "permit" signal 626 for the synchronization operation in point Ak 616.
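For illustration, the chain of FIG. 6 maps onto the hypothetical sync_op() sketch above as follows: point Ai uses code 01, point Bj uses code 11, and point Ak uses code 10. The channel and function names are again assumed, and the two streams would run concurrently, one per processor.

/* Continuation of the sync_op() sketch above, applied to the FIG. 6 chain. */
static sync_channel a_to_b, b_to_a;           /* permits A->B and B->A */

static void stream_a(void)                    /* stream A: points Ai, then Ak */
{
    sync_op(false, true,  &b_to_a, &a_to_b);  /* Ai: code 01, permit B        */
    /* ... wide instructions of stream A ... */
    sync_op(true,  false, &b_to_a, &a_to_b);  /* Ak: code 10, wait for B      */
}

static void stream_b(void)                    /* stream B: point Bj */
{
    sync_op(true, true, &a_to_b, &b_to_a);    /* Bj: code 11, wait for A's
                                                 permit, then permit A        */
}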
A reverse counter may be used to count "permit" and "wait" events. This allows the execution-sequence relation to be set up for groups of events in the synchronized streams. The "permit" signal from another stream increments the counter, and the execution of a synchronization operation containing the "wait" bit in its own stream decrements the counter. Overflow of the counter locks the stream that intends to transmit the "permit" signals in the point of execution of its synchronization operation. A zero state of the reverse counter locks the stream that is waiting for the "permit" signals, as shown in FIGS. 4 and 6.
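The counter behavior just described can be modeled as follows. This is a software sketch only; the counter capacity and the function names are assumptions (the hardware counters belong to the synchronization exchange unit shown in FIG. 11).

/* Behavioral sketch of the reverse counter described above. An incoming
 * "permit" increments the counter; executing a synchronization operation
 * with the "wait" bit decrements it. A zero counter locks the waiting
 * stream; a full counter locks the stream that wants to transmit further
 * "permit" signals. The capacity is an assumed value. */
#define PERMIT_COUNT_MAX 3           /* assumed counter capacity */

typedef struct {
    int count;                       /* buffered "permit" events */
} permit_counter;

/* Called when a "permit" arrives from the other stream.
 * Returns 0 on success, -1 if the counter would overflow: the permitting
 * stream must stall at its synchronization operation. */
static int on_permit_received(permit_counter *c)
{
    if (c->count == PERMIT_COUNT_MAX)
        return -1;                   /* lock the permitting stream */
    c->count++;
    return 0;
}

/* Called when this stream executes a synchronization operation whose "wait"
 * bit is set. Returns 0 if the stream may proceed, -1 if it must stall
 * until a "permit" arrives. */
static int on_wait_executed(permit_counter *c)
{
    if (c->count == 0)
        return -1;                   /* lock the waiting stream */
    c->count--;
    return 0;
}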
A method of synchronization of the streams' parallel execution in accordance with this embodiment is intended to ensure the order of data accesses in compliance with the program algorithm during the program streams' parallel execution.
The contents of each processor register file may be transmitted through the external interface into the register files of all other processors. This is the fastest way to transfer the computation results from stream to stream.
Store addresses and store data of each processor are accessible to all other processors through the external interface. This method speeds up the traditional mechanism of the internal cache coherence support in a multiprocessor system.

An alternative embodiment provides data exchange only through the memory access operations.
Each processor may transmit target addresses for stream branching to all other processors through the external interface. This method provides a means of controlling the streams' parallel execution directly from the executed program without addressing the operating system.

An alternative embodiment, the description of which follows hereafter, provides the target address exchange only through the memory access operations.
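As a rough software model of the register-level exchange path described above, the exchange can be viewed as a bounded queue per processor pair: the sending stream pushes a register value and the receiving stream pops it into its own register file, stalling when the queue is full or empty. The FIFO depth and all names below are assumptions for illustration; FIG. 12 indicates only that the register exchange unit buffers transfers in FIFOs.

/* Minimal model of register-to-register exchange between streams, sketched
 * as a bounded FIFO per processor pair. The depth and the names are
 * assumptions, not the hardware interface. */
#include <stdint.h>
#include <stdbool.h>

#define XFIFO_DEPTH 8                          /* assumed FIFO depth */

typedef struct {
    uint64_t data[XFIFO_DEPTH];
    int head, tail, count;
} reg_exchange_fifo;

static bool reg_send(reg_exchange_fifo *f, uint64_t value)
{
    if (f->count == XFIFO_DEPTH)
        return false;                          /* transmit is locked (overflow) */
    f->data[f->tail] = value;
    f->tail = (f->tail + 1) % XFIFO_DEPTH;
    f->count++;
    return true;
}

static bool reg_receive(reg_exchange_fifo *f, uint64_t *value)
{
    if (f->count == 0)
        return false;                          /* receiving stream must wait */
    *value = f->data[f->head];
    f->head = (f->head + 1) % XFIFO_DEPTH;
    f->count--;
    return true;
}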
In FIG. 7, a block diagram of a Single-chip Multiprocessor System 700 suited for parallel execution based on static macro scheduling is illustrated. Included in this embodiment of Single-chip Multiprocessor System 700 are four Processors 710, an Interprocessor Connect 780, a Shared Cache Unit 790, and a System Interface Unit 792. In addition, the multiprocessor system 700 may include some sets of units, such as I/O controllers 796, memory controllers 798, specialized co-processors, and so on.

The Processors 710 are preferably ELBRUS processors, whose internal structure will be further described below. (ELBRUS architecture is well suited for use in a Single-chip Multiprocessor System, but other embodiments can use other types of processors.) In accordance with the present invention, an Interprocessor Exchange Subsystem is preferably included in the processor structure, and the Instruction Set is preferably extended to explicitly support static macro-scheduling on the basis of fast interprocessor synchronization and data and address exchange.

Shared Cache Unit 790 preferably includes a four-way set associative cache memory, address and data buffers and appropriate control and switching hardware. The cache memory has an interleaved multi-bank structure and can execute up to four read or write operations per cycle. The control hardware provides data and address switching between cache banks and four processors and non-blocking execution of read and write operations. In case of a cache miss or non-cached request (such as a request to an I/O area), the cache control hardware requests the System Interface Unit 792 for main memory or I/O access. Besides that, the responsibility of the cache control hardware is cache coherency support between the Shared Cache Unit 790 and the internal caches of the Processors 710.

... which is not advantageous to exploit in existing multiprocessor systems because of the big overheads for the processors interaction. The main signals of the Synchronization Exchange Interconnect 781 are "permit" signals, which processors exchange over full-cross interconnection; i.e., in a four-processor system each processor transmits and receives three "permit" signals. Additionally, the interface includes "permit counter overflow" signals prohibiting "permit" signals issue; the principles of signal generation and interaction will be described below.

As previously mentioned, the provided system allows the same arrangement of a multiprocessor system operation as that of a super-wide instruction issue processor and utilizes program parallelism of any kind possible for a given program with the utmost efficiency. To do this, besides the fast synchronization signals exchange, each processor needs an extremely fast access to the data generated by other processors. The highest performance may be attained if a shared register file and shared data cache are available for all processors. However, it is obvious that such solutions are acceptable only for small, for example two-processor, systems consisting of processors with an issue width of no more than 2-4. Otherwise register files and cache memories with a very large number of ports have to be built, which results in an unreasonable growth of hardware volume and, which is more important, in the inevitable decrease of the processor clock frequency. Actually, such an approach implies building a wide instruction issue processor with internal clustered structure instead of the multiprocessor system. Examples of such solutions can be MAJC [4] or the below-described Processor 710, where the cluster structure is used to attain the utmost parallelism in one processor, and its further extension may be implemented only by building a single-chip multiprocessor s