(12) Patent Application Publication          (10) Pub. No.: US 2010/0185719 A1
     Howard                                  (43) Pub. Date: Jul. 22, 2010

(54) APPARATUS FOR ENHANCING PERFORMANCE OF A PARALLEL PROCESSING ENVIRONMENT, AND ASSOCIATED METHODS

(76) Inventor: Kevin D. Howard, Tempe, AZ (US)

     Correspondence Address:
     LATHROP & GAGE LLP
     4845 PEARL EAST CIRCLE, SUITE 201
     BOULDER, CO 80301 (US)

(21) Appl. No.: 12/750,338

(22) Filed: Mar. 30, 2010

     Related U.S. Application Data

(63) Continuation of application No. 12/197,881, filed on Aug. 25, 2008, now Pat. No. 7,730,121; continuation-in-part of application No. 12/197,881, filed on Aug. 25, 2008, now Pat. No. 7,730,121, which is a division of application No. 10/340,524, filed on Jan. 10, 2003, now Pat. No. 7,418,470, which is a continuation-in-part of application No. 09/603,020, filed on Jun. 26, 2000, now Pat. No. 6,857,004.

(60) Provisional application No. 61/165,301, filed on Mar. 31, 2009; provisional application No. 61/166,630, filed on Apr. 3, 2009; provisional application No. 60/347,325, filed on Jan. 10, 2002.

     Publication Classification

(51) Int. Cl.
     G06F 15/16    (2006.01)
(52) U.S. Cl. ............................ 709/201

(57)                    ABSTRACT

Parallel Processing Communication Accelerator (PPCA) systems and methods for enhancing performance of a Parallel Processing Environment (PPE). In an embodiment, a Message Passing Interface (MPI) devolver enabled PPCA is in communication with the PPE and a host node. The host node executes at least a parallel processing application and an MPI process. The MPI devolver communicates with the MPI process and the PPE to improve the performance of the PPE by offloading MPI process functionality to the PPCA. Offloading MPI processing to the PPCA frees the host node for other processing tasks, for example, executing the parallel processing application, thereby improving the performance of the PPE.
[Front-page representative drawing: FIG. 1, reproduced as Sheet 1 and described below.]

[Sheet 1 of 24, FIG. 1: parallel processing environment 101 with nodes 100(1)-100(8) communicating through switch 116; node 100(1) detailed with disk storage 122 (parallel application 104, MPI 106, parallel data 105), host CPU 120, host RAM 126 (parallel application 104, MPI 106, parallel data 105 shown in dashed outline), host N/S bridge 124, bus 140, and PPCA 128.]

[Sheet 2, FIG. 2: PPCA 128 with host N/S bridge interface 216, N/S bridge 218, CPU 212, Ethernet connect 220 on Ethernet channel 221, RAM 224 (software 214, data 215), SSD 226 (data SSD 228), and NVM 222 (firmware 223, data 225).]

[Sheet 3, FIG. 2A: alternative PPCA embodiment with host N/S bridge interface 216, CPU 212, N/S bridge 218, RAM 224 (software 214, data 215), NVM 222 (firmware 223, data 225), and Ethernet connects 220, 230, 232 on Ethernet channels 221, 231, 233.]

[Sheet 4: no drawing text recovered; per the brief description below, this sheet presumably carries FIGS. 2B and 2C.]

[Sheet 5, FIG. 3A: node 100 with parallel application 104, parallel data 105, and MPI 106 issuing MPI collective-commands over Ethernet channel 221; PPCA 128 with MPI devolver 314 handling MPI collective-commands, MPI blocking commands, MPI group commands, and MPI topology.]

[Sheet 6, FIG. 3B: table 350/360 comparing, for MPI collective operations on 1024 servers (scatter-gather; reduction, scan, all-reduce, and reduce-scatter with vector operations; broadcast; all-to-all), the communication technique 372 and estimated completion time 374 of a PPCA utilizing a PPCA MPI library 370 (serial exchange and FAAX exchange, typically one exposed protocol latency per operation) against the communication technique 382 and estimated completion time 384 of a standard 10 Gb/s NIC with standard MPI library 380 (master-slave, binomial tree, and sequential binomial tree broadcast variants, with up to thousands of exposed protocol latencies); most numeric entries were not recoverable.]

[Sheet 7, FIG. 4A: PPCA 128 with LLP select function 400 using protocol list 420, current cluster topology 410, and state data 412 (protocol list 420, cluster configuration 414, cluster topology 416, selected protocol 418); connected to switch 116.]

[Sheet 8, FIG. 4B: LLP selection flowchart 450: PPCA detects cluster topology 452; first start-up? 454; detect changes in cluster configuration since shutdown 456; changes detected? 458; PPCA detects for homogeneity 460; homogeneous? 462; LLC Select selects a lowest latency protocol 464; PPCA detects for TOE 466; is TOE available? 468; PPCA selects TOE 470; PPCA selects TCP/IP 472; end.]

[Sheet 9, FIG. 5: node 100 with host RAM 126 (paging code 506, page frame) and PPCA-based paging: paging file 502 with page frame 514, PPCA paging code 526 in firmware 223.]

[Sheet 10, FIG. 6: virtual disk array; PPCA 128(A) with data storage location, MD updater, and virtual disk array functionality 610; PPCAs 128(B)-128(H), each holding metadata 603; SSD 226 at PPCA 128(F).]

[Sheet 11, FIG. 7A: parallel application 104 with NAD cache functionality 710 and cache 740; I/O requests 720, 722 to NAD 704.]

[Sheet 12, FIG. 7B: NAD caching flowchart 7000: receive data to write to NAD 7002; store data in cache 7004; send store request to NAD 7006; wait for NAD response 7008; NAD ready? 7010; write from PPCA cache to NAD 7012.]

[Sheet 13, FIG. 8: holographic checkpointing in environment 101; node 100(1) with parallel application 104, parallel data 105, and checkpoint state 846; PPCA 128(A) with checkpoint store 850, checkpoint state 846(1), and holographic checkpoint functionality 810; nodes 100(2)-100(4) with PPCAs 128(B)-128(D), each holding checkpoint state 846.]

[Sheet 14, FIGS. 9A-9C: steps of an all-to-all exchange among nodes 100(1)-100(4) with PPCAs 128(A)-128(D); only the FIG. 9A panel label was recovered.]

[Sheet 15, FIG. 10: holographic checkpoint restart; failed communications 1010 with failed node 100(3) (PPCA 128(C)); spare select command 1020 and restart commands 1030(1), 1030(2) among remaining nodes 100(1), 100(2) with PPCAs 128(A), 128(B).]

[Sheet 16, FIG. 11: node 100 compression/decompression; C/D functionality 1110 and C/D store 1124 holding data 1120, compressed data 1122 with flag 1130, received data 1121 with flag 1131, estimated data size 1123, and uncompressed data; Ethernet channel 221.]

[Sheet 17, FIG. 12A: auto protocol selection; node 100(1) PPCA 128(A) with APS functionality 1210, data 1222, APS store 1224, and topology data 1223; protocols 1202, 1204, 1206 linking nodes 100(4), 100(5), 100(6) with PPCAs 128(D), 128(E), 128(F).]

[Sheet 18, FIG. 12B: auto protocol selection flowchart 1250: does transmission cross multiple networks? 1256; does dataset require robust protocol? 1260; select lowest latency protocol to next node 1266; select robust protocol 1268.]

[Sheet 19, FIG. 13A: SDR enabled PPCA 1328 with RAM 1324 (SDR table 1330 mapping node ID 1342 to r-channel 1344; SDR software), host N/S bridge connect 1316, N/S bridge 1318, SDR components (SDR controller 1334, SDR hardware 1336, SDR antenna 1338), and NVM 1322 (SDR table 1330, SDR software 1332).]

[Sheet 20, FIG. 13B: SDR enabled system 1350; node 1360(1) PPCA 1328(A) with SDR table 1330, RF processing, SDR functionality 1352, and SDR antenna 1326; r-channels 1371-1375 to PPCAs 1328(B)-1328(E) of nodes 1360(2)-1360(5).]

[Sheet 21, FIG. 14: system 1350; node 1360(1) PPCA 1328(A) on r-channel 1371 with SDR table 1330 mapping node IDs 1360(1)-1360(4) to r-channels 1371-1374; PPCAs 1328(B)-1328(D) of nodes 1360(2)-1360(4) on r-channels 1372-1374.]

[Sheet 22, FIGS. 15A and 15B: gather and scatter steps for distributing SDR table 1330; node 1360(1) PPCA 1328(A) with possible r-channels 1530 and SDR table 1330; PPCAs 1328(B)-1328(E) of nodes 1360(2)-1360(5) on r-channel 1371, each receiving SDR table 1330 in the scatter step.]

[Sheet 23, FIG. 16: all-to-all exchange for distributing SDR tables in system 1350; each of PPCAs 1328(B)-1328(E) at nodes 1360(2)-1360(5) holds base table 1620, assignment algorithm 1622, and SDR table 1626.]

[Sheet 24, FIG. 17: single time-step checkpoint/restart system 1701; checkpoint data 1720 exchanged among nodes 100(1)-100(4) with PPCAs 128(A)-128(D).]
APPARATUS FOR ENHANCING PERFORMANCE OF A PARALLEL PROCESSING ENVIRONMENT, AND ASSOCIATED METHODS

RELATED APPLICATIONS
[0001] This application claims priority to U.S. Patent Application Ser. No. 61/165,301, filed Mar. 31, 2009, and U.S. Patent Application Ser. No. 61/166,630, filed Apr. 3, 2009, both of which are incorporated herein by reference. This application is also a continuation-in-part of U.S. patent application Ser. No. 12/197,881, filed Aug. 25, 2008, which is a divisional application of U.S. patent application Ser. No. 10/340,524, filed Jan. 10, 2003 (now U.S. Pat. No. 7,418,470), which claims priority to U.S. Patent Application Ser. No. 60/347,325, filed Jan. 10, 2002, and is a continuation-in-part of U.S. patent application Ser. No. 09/603,020 (now U.S. Pat. No. 6,857,004), filed on Jun. 26, 2000, all of which are incorporated herein by reference.

BACKGROUND

[0002] A parallel processing computer cluster is made up of networked computers that form nodes of the cluster. Each node of the cluster can contain one or more processors, each including one or more cores. A computational task, received from a requesting system, is broken down into sub-tasks that are distributed to one or more nodes for processing. If there are multiple processors and/or cores, the computational task is further decomposed. Processing results from the cores are collected by the processors, and then collected by the node. From the node level, results are transmitted back to the requesting system. The methods of breaking down and distributing these sub-tasks, and then collecting results, vary based upon the type and configuration of the computer cluster as well as the algorithm being processed.
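For concreteness, the following is a minimal sketch of this decompose-distribute-collect pattern using the standard MPI C API. It is illustration only, not code from this disclosure; the chunk size is an invented value, and rank 0 stands in for the node that receives the task from the requesting system:

    /* Minimal sketch of decompose-distribute-collect with standard MPI.
     * Illustration only; CHUNK is an invented sub-task size. */
    #include <mpi.h>
    #include <stdlib.h>

    #define CHUNK 1024

    int main(int argc, char **argv) {
        int rank, nprocs;
        double part[CHUNK], partial = 0.0;
        double *task = NULL, *results = NULL;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        if (rank == 0) {  /* node holding the full computational task */
            task = malloc(sizeof(double) * CHUNK * nprocs);
            results = malloc(sizeof(double) * nprocs);
            for (int i = 0; i < CHUNK * nprocs; i++) task[i] = 1.0;
        }
        /* break down and distribute sub-tasks to the nodes */
        MPI_Scatter(task, CHUNK, MPI_DOUBLE, part, CHUNK, MPI_DOUBLE,
                    0, MPI_COMM_WORLD);
        for (int i = 0; i < CHUNK; i++) partial += part[i];  /* sub-task work */
        /* collect sub-task results back at the distributing node */
        MPI_Gather(&partial, 1, MPI_DOUBLE, results, 1, MPI_DOUBLE,
                   0, MPI_COMM_WORLD);

        if (rank == 0) { free(task); free(results); }
        MPI_Finalize();
        return 0;
    }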
[0003] One constraint of current parallel processing computer clusters is presented by inter-node, inter-processor and inter-core communication. Particularly, within each computer node, a processor or core that is used to process a sub-task is also used to process low-level communication operations and make communication decisions. The time cost of these communication decisions directly impacts the performance of the processing cores and processors, which directly impacts the performance of the node.
[0004] Within a computer system, such as a personal computer or a server, a PCIe bus, known in the art, provides point-to-point multiple serial communication lanes with faster communication than a typical computer bus, such as the peripheral component interconnect standard bus. For example, the PCIe bus supports simultaneous send and receive communications, and may be configured to use an appropriate number of serial communication lanes to match the communication requirements of an installed PCIe-format computer card. A low speed peripheral may require one PCIe serial communication lane, while a graphics card may require sixteen PCIe serial communication lanes. The PCIe bus may include zero, one or more PCIe format card slots, and may provide one, two, four, eight, sixteen or thirty-two serial communication lanes. PCIe communication is typically designated by the number of serial communication lanes used for communication (e.g., "x1" designates a single serial communication lane PCIe channel and "x4" designates a four serial communication lane PCIe channel), and by the PCIe format, for example PCIe 1.1 or PCIe 2.0.
[0005] Regarding the PCIe formats, PCIe 1.1 is the most commonly used PCIe format; PCIe version 2.0 was launched in 2007. PCIe version 2.0 is twice as fast as version 1.1. Compared to a PCI standard bus, PCIe 2.0 has nearly twice the bi-directional transfer rate of 250 MB/s (250 million bytes per second). A 32-bit PCI standard bus has a peak transfer rate of 133 MB/s (133 million bytes per second) and is half-duplex (i.e., it can only transmit or receive at any one time).
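As a worked example (assuming the per-lane rates standard for these formats, which are not stated explicitly above: 250 MB/s per lane per direction for PCIe 1.1 and 500 MB/s per lane per direction for PCIe 2.0): an x4 PCIe 1.1 link carries 4 x 250 MB/s = 1,000 MB/s in each direction simultaneously, and the same x4 link at PCIe 2.0 rates carries 4 x 500 MB/s = 2,000 MB/s each way, while the 32-bit PCI standard bus offers a single shared, half-duplex peak of 133 MB/s.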
[0006] Within a parallel application, a message-passing interface (MPI) may include routines for implementing message passing. The MPI is typically called to execute the message passing routines of low-level protocols using hardware of the host computer to send and receive messages. Typically, MPI routines execute on the processor of the host computer.
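To make this conventional host-executed path concrete, here is a minimal sketch using the standard MPI C API (illustration only, not code from this disclosure); every call below, including the low-level protocol work behind MPI_Send and MPI_Recv, runs on the host processor:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, value = 42;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            /* host CPU drives the send-side protocol */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* host CPU drives the receive-side protocol */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }
        MPI_Finalize();
        return 0;
    }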
[0007] In high performance computer clusters, cabling and switching between nodes or computers of a computer cluster may create significant issues. One approach to simplify cabling between nodes is blade technology, well known in the art, which uses a large backplane to provide connectivity between nodes. Blade technology has high cost and requires special techniques, such as grid technology, to interconnect large numbers of computer nodes. When connecting large numbers of nodes, however, grid technology introduces data transfer bottlenecks that reduce cluster performance. Furthermore, issues related to switching technology, such as costs and interconnect limitations, are not resolved by blade technology.

SUMMARY

[0008] Disclosed are Parallel Processing Communication Accelerator (PPCA) systems and methods for enhancing performance of a Parallel Processing Environment (PPE). The PPCA includes a micro-processing unit (MPU), a memory, a PPE connection for communicating with other nodes within the parallel processing environment, a host node connection for communicating with a host node, and a Message Passing Interface (MPI) devolver. The MPI devolver communicates with a host node executed MPI process for optimizing communication between a host node executed parallel application and the parallel processing environment. In addition, the MPI devolver processes at least a portion of the MPI process, including one or more of MPI collective-commands, MPI blocking commands, MPI group commands, and MPI topology.

BRIEF DESCRIPTION OF THE EMBODIMENTS

[0009] FIG. 1 shows exemplary apparatus for enhancing performance within a parallel processing environment.
[0010] FIG. 2 shows the PPCA of FIG. 1 in further detail.
[0011] FIG. 2A shows an alternative embodiment of the PPCA of FIG. 2.
[0012] FIG. 2B shows an embodiment of a system using the PPCA of FIG. 2A coupled in parallel-star configuration.
[0013] FIG. 2C shows an embodiment of a system using the PPCA of FIG. 2A with one port of each PPCA in star configuration to a switch, and three or more ports coupled in tree configuration.
[0014] FIG. 3A shows one exemplary MPI devolver enabled system.
[0015] FIG. 3B shows one exemplary chart comparing estimated completion time of MPI collective operations between one exemplary PPCA, utilizing a PPCA optimized MPI library, and a standard 10 Gb/s NIC, utilizing a standard MPI library.
[0016] FIG. 4A shows one exemplary low latency protocol (LLP) enabled system.
[0017] FIG. 4B shows one exemplary low latency protocol (LLP) selection method.
[0018] FIG. 5 shows one exemplary PPCA based paging enabled system.
[0019] FIG. 6 shows the parallel processing environment of FIG. 1 implementing a virtual disk array (VDA) using the PPCA within each of the nodes.
[0020] FIG. 7A shows one exemplary network attached device (NAD) caching enabled system.
[0021] FIG. 7B shows one exemplary NAD caching method.
[0022] FIG. 8 illustrates one step in one exemplary all-to-all exchange in a holographic checkpoint enabled parallel processing environment with one detailed node.
[0023] FIGS. 9A-C illustrate three steps of one exemplary all-to-all exchange in a holographic checkpoint enabled system.
[0024] FIG. 10 shows one exemplary illustrative representation of one exemplary holographic checkpoint restart operation enabled system.
[0025] FIG. 11 shows one exemplary compression enabled system.
[0026] FIG. 12A shows one exemplary auto protocol selection enabled system.
[0027] FIG. 12B is one exemplary auto protocol selection method.
[0028] FIG. 13A is one exemplary software defined radio (SDR) enabled PPCA.
[0029] FIG. 13B is one exemplary SDR enabled system.
[0030] FIG. 14 shows one exemplary SDR fixed channel node assignment (FCNA) enabled system utilizing a centrally located r-channel look-up table.
[0031] FIG. 15A shows one exemplary gather step for an SDR-FCNA enabled system utilizing a gather-scatter method for distributing a distributed r-channel look-up table.
[0032] FIG. 15B shows one exemplary scatter step for an SDR-FCNA enabled system utilizing a gather-scatter method for distributing a distributed r-channel look-up table.
[0033] FIG. 16 shows one exemplary SDR-FCNA enabled system utilizing an all-to-all exchange method for distributing a distributed r-channel look-up table.
[0034] FIG. 17 shows one exemplary single time-step checkpoint/restart enabled system.

DETAILED DESCRIPTION OF THE EMBODIMENTS

[0035] In a parallel processing environment that includes a cluster having several computing nodes, a parallel computing task is divided into two or more sub-tasks, each of which is assigned to one or more of the computing nodes. A measure of efficiency of the parallel processing environment is the time taken to process the parallel computing task, and the time taken to process each sub-task within the compute nodes.
[0036] Each compute node includes one or more processors that process assigned tasks and sub-tasks in as short a time as possible. However, each computing node must also communicate with other computing nodes within the cluster to receive assigned sub-tasks and to return results from processing sub-tasks. This communication imposes an overhead within the compute node that can delay completion of the assigned sub-task. To reduce this delay, certain low-level operations may be devolved from the one or more processors of the computing node to a devolving engine. The devolving engine, in an embodiment, is located on an accelerator card, having some functions similar to a network interface card (NIC), that is installed in the computing node and provides communication between networked computing nodes of the cluster.
[0037] The devolving engine allows the host computer to offload low-level communication operations to the devolving engine while maintaining control of high-level operations and high-level communication decisions.
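As a hedged illustration of this division of labor (a sketch only: ppca_request_t, ppca_submit_bcast, ppca_wait, and devolved_bcast are names invented here for illustration, not an API disclosed by this application), a host-side collective routine might hand its low-level work to the accelerator and reclaim the host processor until the result is needed:

    #include <mpi.h>

    /* Hypothetical PPCA driver interface -- invented for this sketch;
     * not part of this disclosure or of any real driver. */
    typedef struct { int handle; } ppca_request_t;
    int ppca_submit_bcast(ppca_request_t *req, void *buf, int count,
                          MPI_Datatype type, int root, MPI_Comm comm);
    int ppca_wait(ppca_request_t *req);

    /* Host-side wrapper: enqueue the broadcast on the accelerator, so
     * the host CPU keeps high-level control while shedding the
     * low-level protocol work. */
    int devolved_bcast(void *buf, int count, MPI_Datatype type,
                       int root, MPI_Comm comm) {
        ppca_request_t req;
        ppca_submit_bcast(&req, buf, count, type, root, comm);
        /* ... host processor is free for application work here ... */
        return ppca_wait(&req);  /* completes when the PPCA finishes */
    }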
[0038] FIG. 1 shows an exemplary Parallel Processing Communication Accelerator (PPCA) 128 for enhancing performance within a parallel processing environment 101 formed of a plurality of computing nodes 100 and a switch 116. PPCA 128 is preferably included within each computing node 100 of environment 101.
[0039] In an embodiment, at least one of computing nodes 100 represents a host node as used within a Howard Cascade (see U.S. Pat. No. 6,857,004, incorporated herein by reference). In the example of FIG. 1, environment 101 has eight computing nodes 100(1-8) that communicate through switch 116. Environment 101 may have more or fewer nodes without departing from the scope hereof. Each node 100(1-8) includes a PPCA 128(A-H) that provides devolving and communication functionality.
[0040] In FIG. 1, only node 100(1) is shown in detail for clarity of illustration. Nodes 100 are similar to each other and may include components and functionality of conventional computer systems. For example, nodes 100 may also include components and functionality found in personal computers and servers. Node 100(1) has host central processing unit (CPU) 120, a host north/south (N/S) bridge 124, a host random access memory (RAM) 126, and disk storage 122. Nodes 100 may include other hardware and software, for example as found in personal computers and servers, without departing from the scope hereof.
[0041] Host N/S bridge 124 may support one or more busses within node 100 to provide communication between host CPU 120, disk storage 122, host RAM 126 and PPCA 128(A). For example, host N/S bridge 124 may implement a bus 140 that allows one or more computer cards (e.g., graphics adapters, network interface cards, and the like) to be installed within node 100. In an embodiment, bus 140 is a peripheral component interconnect express (PCIe) bus. In the example of FIG. 1, PPCA 128 connects to bus 140 when installed within node 100, and provides a communication interface to communicate with other PPCA 128 equipped nodes 100 via switch 116.
[0042] When configured in the form of a PCIe card, PPCA 128 may be installed in a computer system supporting the PCIe bus to form node 100. Although PPCA 128 is shown connecting within node 100(1) using bus 140, PPCA 128 may be configured to connect to node 100(1) using other computer busses without departing from the scope hereof. In an alternate embodiment, PPCA 128 is incorporated into a motherboard of node 100.
[0043] Disk storage 122 is shown storing a parallel application 104, a message passing interface (MPI) 106 and parallel data 105. Disk storage 122 may store other information and functionality, such as an operating system, executable computer code, computational tasks, sub-tasks, sub-task results, computation task results, and other information and data of node 100, without departing from the scope hereof. Parallel application 104 may represent a software program that includes instructions for processing parallel data 105. MPI 106 represents a software interface that provides communications for a parallel application 104 running on nodes 100 of environment 101. MPI 106 may include one or more interface routines that instruct PPCA 128 to perform one or more operations that provide communications between node 100(1) and other nodes 100, and may implement additional functionality, as described below.
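Continuing the invented sketch above (devolved_bcast and the ppca_* names are hypothetical, for illustration only), such an interface routine could present the standard MPI signature to parallel application 104 while routing the operation to PPCA 128, in the style of the standard MPI profiling-interface (PMPI) interposition pattern; the application's calls would then remain unchanged:

    /* Hypothetical MPI-106-style interface routine: the application
     * keeps calling the standard MPI name; this library definition
     * (linked in place of the default one, as in PMPI interposition)
     * forwards the work to the PPCA via the sketched wrapper. */
    int MPI_Bcast(void *buf, int count, MPI_Datatype type,
                  int root, MPI_Comm comm) {
        return devolved_bcast(buf, count, type, root, comm);
    }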
[0044] Host CPU 120 is shown as a single processing unit, but host CPU 120 may represent a plurality of processing units, for example, a central processing unit, an arithmetic logic unit and a floating-point unit.
[0045] In one example of operation, at least part of each of parallel application 104, MPI 106 and parallel data 105 are loaded into host RAM 126 for execution and/or access by host CPU 120. Parallel application 104, MPI 106 and parallel data 105 are illustratively shown in dashed outline within host RAM 126. Parallel data 105 may be all, or a portion of, a data set associated with a parallel processing task or sub-task. Host RAM 126 may store other programs, software routines, information and data for access by host CPU 120, without departing from the scope hereof.
[0046] In an embodiment where bus 140 is a PCIe bus with one or more card slots that accept PCIe format computer cards, PPCA 128 is a PCIe format computer card that plugs into one of these card slots. Further, PPCA 128 is configured to use one or more serial communication lanes of bus 140, and it is preferred that bus 140 provide sufficient serial communication lanes to match or exceed the requirements of PPCA 128. The greater the number of serial communication lanes used by PPCA 128, the greater the communication bandwidth between PPCA 128 and host N/S bridge 124.
[0047] PPCA 128 functions to devolve certain parallel processing tasks from host CPU 120 to PPCA 128, thereby increasing the availability of host CPU 120 for task processing. PPCA 128 provides enhanced communication performance between node 100 and switch 116, and in particular, provides enhanced communication between nodes 100 of environment 101.
[0048] FIG. 2 illustrates PPCA 128 of FIG. 1 in further detail. PPCA 128 includes a Microprocessor Unit (MPU) 212, a N/S bridge 218, an Ethernet connect 220, a non-volatile memory (NVM) 222, and a random access memory (RAM) 224. PPCA 128 may also include a solid-state drive (SSD) 226. N/S Bridge 218 provides communication between MPU 212, host N/S bridge interface 216, Ethernet connect 220, NVM 222, RAM 224, and optional SSD 226. Host node connection, host N/S bridge interface 216, provides connectivity between N/S Bridge 218 and bus 140, thereby providing communication between PPCA 128 and components of node 100, into which PPCA 128 is installed and/or configured.
[0049] Parallel processing environment connection, Ethernet connect 220, connects t
