`Switchingg
`
`(cid:2)(cid:13)(cid:11)(cid:12)(cid:1)(cid:3)(cid:9)(cid:17)(cid:10)(cid:16)(cid:17)(cid:14)(cid:6)(cid:15)(cid:7)(cid:9)
`(cid:5)(cid:20)(cid:13)(cid:18)(cid:7)(cid:12)(cid:13)(cid:15)(cid:11)(cid:1)(cid:6)(cid:15)(cid:8)(cid:1)(cid:4)(cid:16)(cid:19)(cid:18)(cid:13)(cid:15)(cid:11)
`
`(cid:11)(cid:14)(cid:17)(cid:14)(cid:13)(cid:20)(cid:18)(cid:1)(cid:9)(cid:14)(cid:19)(cid:24)(cid:14)(cid:22)(cid:1)(cid:12)(cid:20)(cid:22)(cid:16)(cid:23)(cid:15)(cid:20)(cid:21)(cid:8)(cid:1)(cid:1)(cid:10)(cid:14)(cid:21)(cid:24)(cid:1)(cid:5)(cid:2)(cid:1)(cid:4)(cid:7)(cid:7)(cid:6)(cid:3)
`
`Nickk McKeown
`Assistantt Professorr off Electricall Engineeringg
`andd Computerr Science,, Stanfordd University
`nickm@stanford.edu
`http://www.stanford.edu/~nickm
`
`Exhibit 1015
`IPR2023-00581
`U.S. Patent 8,886,772
`
`Page 1 of 58
`
`
`
`Sir William Preece, Chief of the British
`Postal System, 1876:
`
`“The Americans may have need of the
`telephone, but we do not. We have plenty of
`messenger boys.”
`
`Page 2 of 58
`
`
`
`OOutline
`
`• Introduction
`What is a packet-switch?
`• The Memory Bandwidth Problem
`• Input-Queued Switches
`Reducing memory bandwidth requirements
`• Combined Input-Output Queued Switches
`Making input-queued switches useful
`• Parallel Packet Switches
`Further reducing memory b/width requirements
`
`Page 3 of 58
`
`
`
`IIntroduction
`What is a Packet Switch?
`
`• Introduction
`What is a packet-switch?
`– Basic Architectural Components
`– Some Example Packet Switches
`– The Evolution of IP Routers
`• The Memory Bandwidth Problem
`• Input-Queued Switches
`Reducing memory bandwidth requirements
`• Combined Input-Output Queued Switches
`Making input-queued switches useful
`• Parallel Packet Switches
`Further reducing memory b/width requirements
`
`Page 4 of 58
`
`
`
`BBasic Architectural Components
`
`Admission
`Control
`
`Congestion
`Control
`Routing
`
`Reservation
`
`Control
`
`Policing
`
`Switching
`
`Output
`Scheduling
`
`Datapath:
`per-packet
`processing
`
`Page 5 of 58
`
`
`
`Basic Architectural Components
`Datapath: per-packet processing
`3.
`Output
`Scheduling
`
`1.
`Forwarding
`Table
`
`2.
`Interconnect
`
`Forwarding
`Decision
`
`Forwarding
`Table
`
`Forwarding
`Decision
`
`Forwarding
`Table
`
`Forwarding
`Decision
`
`Page 6 of 58
`
`
`
`WWhere high performance packet
`switches are used
`
`- Carrier Class Core Router
`- ATM Switch
`- Frame Relay Switch
`
`The Internet Core
`
`Edge Router
`
`Enterprise WAN access
`& Enterprise Campus Switch
`
`Page 7 of 58
`
`
`
`IIntroduction
`What is a Packet Switch?
`
`• Introduction
`What is a packet-switch?
`– Basic Architectural Components
`– Some Example Packet Switches
`– The Evolution of IP Routers
`• The Memory Bandwidth Problem
`• Input-Queued Switches
`Reducing memory bandwidth requirements
`• Combined Input-Output Queued Switches
`Making input-queued switches useful
`• Parallel Packet Switches
`Further reducing memory b/width requirements
`
`Page 8 of 58
`
`
`
`AATM Switch
`
`• Lookup cell VCI/VPI in VC table.
`• Replace old VCI/VPI with new.
`• Forward cell to outgoing interface.
`• Transmit cell onto link.
`
`Page 9 of 58
`
`
`
`EEthernet Switch
`
`• Lookup frame DA in forwarding table.
`– If known, forward to correct port.
`– If unknown, broadcast to all ports.
`• Learn SA of incoming frame.
`• Forward frame to outgoing interface.
`• Transmit frame onto link.
`
`Page 10 of 58
`
`
`
`IIP Router
`
`• Lookup packet DA in forwarding
`table.
`– If known, forward to correct port.
`– If unknown, drop packet.
`• Decrement TTL, update header
`Cksum.
`• Forward packet to outgoing interface.
`• Transmit packet onto link.
`
`Page 11 of 58
`
`
`
`IIntroduction
`What is a Packet Switch?
`
`• Introduction
`What is a packet-switch?
`– Basic Architectural Components
`– Some Example Packet Switches
`– The Evolution of IP Routers
`• The Memory Bandwidth Problem
`• Input-Queued Switches
`Reducing memory bandwidth requirements
`• Combined Input-Output Queued Switches
`Making input-queued switches useful
`• Parallel Packet Switches
`Further reducing memory b/width requirements
`
`Page 12 of 58
`
`
`
`FFirst Generation Packet Switches
`
`Fixed length “DMA” blocks
`Shared Backplane
`or cells. Reassembled on egress
`linecard
`
`CPU
`
`Buffer
`Memory
`
`CPU
`
`Lin
`e I
`
`nterface
`
`DMA
`Line
`Interface
`MAC
`
`DMA
`Line
`Interface
`MAC
`
`DMA
`Line
`Interface
`MAC
`
`Fixed length cells or
`variable length packets
`
`Memory
`
`Page 13 of 58
`
`
`
`SSecond Generation Packet Switches
`
`CPU
`
`Buffer
`Memory
`
`DMA
`Line
`Card
`Local
`Buffer
`Memory
`
`DMA
`Line
`Card
`Local
`Buffer
`Memory
`
`DMA
`Line
`Card
`Local
`Buffer
`Memory
`
`MAC
`
`MAC
`
`MAC
`
`Page 14 of 58
`
`
`
`Third Generation Packet Switches
`
`Switched Backplane
`
`CPU
`Card
`
`Line
`Card
`
`Local
`Buffer
`Memory
`
`MAC
`
`Line
`Card
`
`Local
`Buffer
`Memory
`
`MAC
`
`Lin
`CPU
`
`e Interface
`
`Memory
`
`Page 15 of 58
`
`
`
`Fourth Generation Packet Switches
`
`Page 16 of 58
`
`
`
`OOutline
`
`• Introduction
`What is a packet-switch?
`• The Memory Bandwidth Problem
`• Input-Queued Switches
`Reducing memory bandwidth requirements
`• Combined Input-Output Queued Switches
`Making input-queued switches useful
`• Parallel Packet Switches
`Further reducing memory b/width requirements
`
`Page 17 of 58
`
`
`
`TTwoo Basicc Techniques
`
`1+1 = 2 operations per cell time
`
`N+N = 2N operations per cell time
`
`Input-queued Crossbar
`
`Shared Memory
`
`Page 18 of 58
`
`
`
`SShared Memory
`The Ideal
`
`KTD
`
`ZPI
`
`A
`
`AAAAAA
`
`HBAD
`
`FX
`
`A
`
`ZZ
`
`A
`
`A
`
`ZZZ
`
`A
`
`A
`
`Z
`
`Numerous work has proven and
`made possible:
`– Fairness
`– Delay Guarantees
`– Delay Variation Control
`– Loss Guarantees
`– Statistical Guarantees
`
`Page 19 of 58
`
`
`
`AA Comparison
`Memory speeds for 32x32 switch
`
`Shared-Memory
`Memory
`Access Time
`BW
`Per cell
`6.4 Gb/s
`80 ns
`64 Gb/s
`8 ns
`160 Gb/s
`3.2 ns
`0.8 ns
`640 Gb/s
`
`Input-queued
`Memory
`Access Time
`BW
`200 Mb/s
`2 Gb/s
`5 Gb/s
`20 Gb/s
`
`2.12 μs
`212 ns
`84.8 ns
`21.2 ns
`
`Line Rate
`
`100 Mb/s
`1 Gb/s
`2.5 Gb/s
`10 Gb/s
`
`Page 20 of 58
`
`
`
`BBuffer Memory
`How Fast Can I Make a Packet Buffer?
`
`5ns SRAM
`
`Buffer
`Memory
`
`64-byte wide bus
`
`64-byte wide bus
`
`Rough Estimate:
`– 5ns per memory operation.
`– Two memory operations per
`packet.
`– Therefore, maximum 51.2Gb/s.
`In practice, closer to 40Gb/s.
`–
`
`Page 21 of 58
`
`
`
`BBuffer Memory
`Is It Going to Get Better?
`
`Specmarks,
`Memory size,
`Gate density
`
`Memory
`Bandwidth
`(to core)
`
`time
`
`time
`
`Page 22 of 58
`
`
`
`Parallel
`Packet
`Switches
`
`Progression
`
`Shared
`Memory
`
`Input
`Queued
`
`Combined
`Input and
`Output Queued
`
`Self-Routing Network
`
`000
`001
`010
`011
`100
`101
`110
`111
`
`76543210
`76453202
`
`Batcher Sorter
`
`74560312
`
`70513426
`
`75231064
`
`72356104
`
`37526014
`
`Multi
`stage
`
`Page 23 of 58
`
`
`
`OOutline
`
`• Introduction
`What is a packet-switch?
`• The Memory Bandwidth Problem
`• Input-Queued Switches
`Reducing memory bandwidth requirements
`• Combined Input-Output Queued Switches
`Making input-queued switches useful
`• Parallel Packet Switches
`Further reducing memory b/width requirements
`
`Page 24 of 58
`
`
`
`Input Queueing
`
`(cid:5)(cid:8)(cid:9)(cid:10)(cid:11)(cid:13)(cid:1)(cid:7)(cid:2)(cid:12)(cid:1)(cid:4)(cid:1)(cid:3)(cid:6)
`
`(cid:1)(cid:2)(cid:5)(cid:4)(cid:3)(cid:8)(cid:6)(cid:4)(cid:7)
`
`Data In
`
`configuration
`
`Data Out
`
`Page 25 of 58
`
`
`
`IInput Queueing
`Head of Line Blocking
`
`Delay
`
`Load
`
`58.6%
`
`100%
`
`Page 26 of 58
`
`
`
`HHead of Line Blocking
`
`Head of Line Blocking
`
`Page 27 of 58
`
`
`
`
`
`Page 28 of 58
`
`
`
`
`
`Page 29 of 58
`
`
`
`IInput Queueing
`Virtual output queues
`
`Page 30 of 58
`
`
`
`IInput Queues
`Virtual Output Queues
`
`Delay
`
`Load
`
`100%
`Proof by Lyapunov function
`
`Page 31 of 58
`
`
`
`OOutline
`
`• Introduction
`What is a packet-switch?
`• The Memory Bandwidth Problem
`• Input-Queued Switches
`Reducing memory bandwidth requirements
`• Combined Input-Output Queued Switches
`Making input-queued switches useful
`• Parallel Packet Switches
`Further reducing memory b/width requirements
`
`Page 32 of 58
`
`
`
`TThe Speedup Problem
`
`Find a compromise: 1 < Speedup << N
`
`- to get the performance of a shared memory switch
`- close to the cost of an IQ switch
`
`Page 33 of 58
`
`
`
`SSome Early Approaches
`
`Probabilistic Analyses
`- assume traffic models (Bernoulli, Markov-modulated,
`non-uniform loading, “friendly correlated”)
`- obtain mean throughput and delays, bounds on tails
`- analyze different fabrics (crossbar, multistage, etc)
`
`Numerical Methods
`- use actual and simulated traffic traces
`- run different algorithms
`- set the “speedup dial” at various values
`
`Page 34 of 58
`
`
`
`TThe findings
`
`Very tantalizing ...
`- under different settings
`(traffic, loading, algorithm, etc)
`- and even for varying switch sizes
`
`A speedup of between 2 and 5 was sufficient!
`
`Page 35 of 58
`
`
`
`(cid:1)
`
`(cid:1)(cid:2)
`
`(cid:2)
`
`(cid:1)
`
`UUsing Speedup
`
`Page 36 of 58
`
`
`
`N
`
`TThe Ideal Solution
`
`Output Queued Switch
`
`1
`
`N
`
`N
`
`= ?
`
`Combined Input-Output Queued Switch
`1
`
`N
`
`Page 37 of 58
`
`
`
`IInteresting Result
`
`Theorem:
`For a switch with combined input and output queueing
`to exactlymimic an output queued switch, for all
`types of traffic, a speedup of 2-1/N is necessary
`and sufficient.
`
`Joint work with Balaji Prabhakar, Ashish Goel and
`Shang-tse Chuang.
`
`Page 38 of 58
`
`
`
`OOutline
`
`• Introduction
`What is a packet-switch?
`• The Memory Bandwidth Problem
`• Input-Queued Switches
`Reducing memory bandwidth requirements
`• Combined Input-Output Queued Switches
`Making input-queued switches useful
`• Parallel Packet Switches
`Further reducing memory b/width requirements
`
`Page 39 of 58
`
`
`
`OOptical Physical Layers…
`…are Going to Make Things “Worse”
`
`DWDM:
`– More λ’s per fiber (cid:2) more ports per switch.
`– # ports: 16, …, 1000’s.
`
`Data rate:
`– More b/s per λ (cid:2) higher capacity.
`– Data rates: 2.5Gb/s, 10Gb/s, 40Gb/s, 160Gb/s, …
`
`Page 40 of 58
`
`
`
`AApproach #1: Ping-pong Buffering
`
`64-byte wide bus
`
`Buffer
`Memory
`
`Buffer
`Memory
`
`64-byte wide bus
`
`Page 41 of 58
`
`
`
`AApproach #1: Ping-pong Buffering
`
`64-byte wide bus
`
`Buffer
`Memory
`
`Buffer
`Memory
`
`64-byte wide bus
`
`Memory bandwidth doubled to ~80 Gb/s
`
`Page 42 of 58
`
`
`
`AApproach #2:
`Multiple Parallel Buffers
`aka Banking, Interleaving
`
`Buffer
`Memory
`Buffer
`Memory
`Buffer
`Memory
`Buffer
`Memory
`
`Page 43 of 58
`
`
`
`TThe Fork Join Router
`Router
`
`rate, R
`
`rate, R
`
`1
`
`N
`
`1 2
`
`k
`
`Bufferless
`
`rate, R
`
`rate, R
`
`1
`
`N
`
`Page 44 of 58
`
`
`
`TThe Fork Join Router
`
`• Advantages
`– k(cid:4) (cid:2) memory bandwidth (cid:5)
`– k(cid:4) (cid:2) lookup/classification rate (cid:5)
`– k(cid:4) (cid:2) routing/classification table size (cid:5)
`• Problems
`– How to demultiplex prior to
`lookup/classification?
`– How does the system perform/behave?
`– Can we predict/guarantee performance?
`
`Page 45 of 58
`
`
`
`AA Parallel Packet Switch
`
`rate, R
`
`rate, R
`
`1
`
`N
`
`1
`Output
`Queued
`Switch
`
`2
`Output
`Queued
`Switch
`
`k
`Output
`Queued
`Switch
`
`rate, R
`
`rate, R
`
`1
`
`N
`
`Page 46 of 58
`
`
`
`PParallel Packet Switch
`Questions
`1. Can it be work-conserving?
`2. Can it emulate a single big shared
`memory switch?
`3. Can it support delay guarantees,
`strict-priorities, WFQ, …?
`
`Page 47 of 58
`
`
`
`PParallel Packet Switch
`Work Conservation
`1
`
`rate, R
`
`1
`
`R/k
`
`R/k
`
`2
`
`R/k
`
`k
`
`R/k
`
`R/k
`
`R/k
`
`rate, R
`
`1
`
`Input Link
`Constraint
`
`Output Link
`Constraint
`
`Page 48 of 58
`
`
`
`PParallel Packet Switch
`Work Conservation
`1
`
`1234115
`rate, R
`
`1
`
`1 2 3
`
`R/k
`
`R/k
`
`2
`
`45
`
`1
`
`R/k
`
`4
`
`2
`
`R/k
`
`R/k
`
`k
`
`R/k
`
`3
`
`Output Link
`Constraint
`
`rate, R
`
`1
`
`Page 49 of 58
`
`
`
`PParallel Packet Switch
`Work Conservation
`1
`Output
`Queued
`Switch
`
`S(R/k)
`
`S(R/k)
`
`rate, R
`
`S(R/k)
`
`1
`
`2
`Output
`Queued
`Switch
`
`S(R/k)
`
`1
`
`rate, R
`
`rate, R
`
`N
`
`k
`Output
`Queued
`Switch
`
`S(R/k)
`
`S(R/k)
`
`rate, R
`
`N
`
`Page 50 of 58
`
`
`
`PParallel Packet Switch
`Theorems
`1. If S > 2k/(k+2) ≅ 2 then a parallel
`packet switch can be work-
`conserving for all traffic.
`
`2. If S > 2k/(k+2) ≅ 2 then a parallel
`packet switch can preciselyemulate
`a FCFS output-queued switch for all
`traffic.
`
`Page 51 of 58
`
`
`
`PParallel Packet Switch
`Theorems
`
`3. If S > 3k/(k+3) ≅ 3 then a parallel
`packet switch can be precisely
`emulate a switch with WFQ, strict
`priorities, and other types of QoS,
`for all traffic.
`
`With Sundar Iyer and Amr Awadallah
`
`Page 52 of 58
`
`
`
`PPrecise Emulation of an FCFS
`Shared Memory Switch
`
`N
`
`Shared Memory
`
`1
`
`N
`
`N
`
`= ?
`
`Parallel Packet Switch
`
`1
`
`N
`
`1
`
`N
`
`Page 53 of 58
`
`
`
`AAn aside
`Unbuffered Clos Circuit Switch
`
`Expansion factor required = 2-1/N
`
`Page 54 of 58
`
`
`
`CClos Network
`O1 O2 O3 Ox
`b
`
`I1 I2 I3 Ix
`
`O1
`
`}m
`
`OX
`
`}m
`
`a b
`
`c
`
`m {
`
`I1
`
`m {
`
`IX
`
`R middle
`stage switches
`
`<= min(R,m) entries in each row
`<= min(R,m) entries in each column
`
`Page 55 of 58
`
`
`
`CClos Network
`
`O1 O2 O3 Ox
`b
`
`I1 I2 I3 Ix
`
`<= min(R,m) entries in each row
`<= min(R,m) entries in each column
`
`O1
`
`}m
`
`OX
`
`}m
`
`ab
`
`c
`R middle
`stage switches
`
`m {
`
`m {
`
`I1
`
`IX
`
`Define: UIL(Ii) = used links at switch Ii to connect to middle stages.
`UOL(Oi) = used links at switch Oi to connect to middle stages.
`
`If we wish to connect Ii to Oi:
`When adding connection: |UIL(Ii)| <= m-1 and |UOL(Oi)| <= m-1
`Worst-case: |UIL(Ii) U UOL(Oi)| = 2m -2
`Therefore, if R >= 2m-2 there are always enough middle stages.
`
`Page 56 of 58
`
`
`
`AAn aside
`Unbuffered Clos Circuit Switch
`
`Expansion factor required = 2-1/N
`
`Page 57 of 58
`
`
`
`OOutline
`
`• Introduction
`What is a packet-switch?
`• The Memory Bandwidth Problem
`• Input-Queued Switches
`Reducing memory bandwidth requirements
`• Combined Input-Output Queued Switches
`Making input-queued switches useful
`• Parallel Packet Switches
`Further reducing memory b/width requirements
`
`Page 58 of 58
`
`