`
`Chips, Systems, Software, and Applications
`
`STARFIRE:
`Extending the SMP Envelope
`
`Copyright © 1998 Institute of Electrical and Electronics Engineers.
Reprinted, with permission, from the January/February 1998 issue of IEEE Micro.
`
This material is posted here with permission of the IEEE. Such permission of the IEEE does not in any way imply IEEE endorsement of any of Sun Microsystems' products or services. Internal or personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution must be obtained from the IEEE by sending a blank email message to info.pub.permission@ieee.org.
`
`By choosing to view this document, you agree to all provisions of the copyright laws protecting it.
`
`
`
`
`
`
`STARFIRE:
`
`Extending the SMP Envelope
`
`Alan Charlesworth
`
`Sun Microsystems
`
Point-to-point routers and an active centerplane with four address routers are key components of today's largest uniform-memory-access symmetric multiprocessor system.
New and faster processors get all the glory, but the interconnect "glue" that cements the processors together should get the headlines. Without high-bandwidth and low-latency interconnects, today's high-powered servers wouldn't exist.

The current mainstream server architecture is the cache-coherent symmetric multiprocessor (SMP), illustrated in Figure 1. The technology was developed first on mainframes and then moved to minicomputers. Sun Microsystems and other server vendors brought it to microprocessor-based systems in the early 1990s.
In an SMP, one or more levels of cache for each processor retain recently accessed data that can be quickly reused, thereby avoiding contention for further memory accesses. When a processor can't find needed data in its cache, it sends the address of the memory location it wants to read onto the system bus. All processors check the address to see if they have a more up-to-date copy in their caches. If another processor has modified the data, it tells the requester to get the data from it rather than from memory. For a processor to modify a memory location, it must gain ownership of it. Upon modification, any other processor that has a cached copy of the data being modified must invalidate its copy.
`
This checking of the address stream on a system bus is called snooping, and a protocol called cache coherency keeps track of where everything is. For more background on cache-coherent multiprocessing systems, see Hennessy and Patterson (chapter 8).1

A central system bus provides the quickest snooping response and thus the most efficient system operation, especially if many different processors frequently modify shared data. Since all memory locations are equally accessible by all processors, there is no concern about trying to have applications optimally place data in one memory module or another. This feature of a traditional SMP system is called uniform memory access.
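To make the snooping flow concrete, here is a toy Python sketch of a read miss being resolved against the other processors' caches. It is only an illustration of the idea described above, not Sun's implementation: it ignores timing, arbitration, and the write/invalidate side of the protocol, and all names are hypothetical.

# Toy model of a snoopy read miss (illustrative only).
from dataclasses import dataclass

@dataclass
class CacheLine:
    data: int
    dirty: bool = False          # True if this cache has modified the block

class Processor:
    def __init__(self, name):
        self.name = name
        self.cache = {}          # address -> CacheLine

def read(requester, address, processors, memory):
    """Return the data at `address`, snooping the other caches on a miss."""
    if address in requester.cache:
        return requester.cache[address].data            # cache hit
    for cpu in processors:                              # broadcast the address
        if cpu is not requester and address in cpu.cache:
            line = cpu.cache[address]
            if line.dirty:                              # modified elsewhere:
                line.dirty = False                      # the owner supplies the data
                requester.cache[address] = CacheLine(line.data)
                return line.data
    data = memory[address]                              # no dirty copy: read memory
    requester.cache[address] = CacheLine(data)
    return data

memory = {0x40: 7}
p0, p1 = Processor("p0"), Processor("p1")
p1.cache[0x40] = CacheLine(42, dirty=True)              # p1 holds a modified copy
print(read(p0, 0x40, [p0, p1], memory))                 # prints 42, supplied by p1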
`
Keeping pace
A snoopy bus is a good solution only so long as it is not overloaded by too many processors making too many memory requests. Processor speeds have been increasing by 55% per year.1 With increased levels of component integration, we can squeeze more processors into a single cabinet. Sun has pushed the envelope of SMP systems through three generations of snoopy-bus-based uniform-memory-access interconnects: MBus2 and XDBus,2 and the Ultra Port Architecture.3 Table 1 shows the characteristics of these architectures.
`
[Figure 1. Symmetric multiprocessor: processors, each with its own cache, plus memory modules and an I/O system, all connected by a snoopy system bus.]
`
`
Our system architects have made the following cumulative improvements in bus capacity:

• Increased bus clock rates from 40 MHz to 100 MHz by using faster bus-driving logic.4
• Changed from a circuit-switched protocol to a packet-switched protocol. In a circuit-switched organization, each processor's bus request must complete before the next can begin. Packet switching separates the requests from the replies, letting bus transactions from several processors overlap.
• Separated the addresses and data onto distinct wires so that addresses and data no longer have to compete with each other for transmission.
• Interleaved multiple snoop buses. Using four address buses allows four addresses to be snooped in parallel. The physical memory space is divided into quarters, and each address bus snoops a different quarter of the memory.
• Doubled the cache block size from 32 bytes to 64. Since each cache block requires a snoop, doubling the cache line width allows twice as much data bandwidth for a given snoop rate.
• Widened the data wires from 8 bytes to 16 bytes to move twice as much data per clock.

The combined effect of these improvements has been to increase bus snooping rates from 2.5 million snoops per second on our 600MP in 1991 to 167 million snoops per second on the Starfire in 1997—a 66-fold increase in six years. Combined with a two-times wider cache line, this has allowed data bandwidths to increase by 133 times.
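The arithmetic behind these figures is straightforward. The short Python sketch below, using values quoted in the text and Table 1, reproduces the 80-MBps and roughly 10,667-MBps peak data bandwidths and the 133-fold ratio between them; it is a back-of-the-envelope check, not a performance model.

# Quick check of the quoted snoop-rate and bandwidth figures.
def data_bandwidth_mbps(snoops_per_second, cache_line_bytes):
    return snoops_per_second * cache_line_bytes / 1e6

mp600    = data_bandwidth_mbps(2.5e6, 32)          # 1991 600MP: 2.5 M snoops/s, 32-byte lines
starfire = data_bandwidth_mbps(2 * 83.33e6, 64)    # 1997 Starfire: 2 snoops/clock at 83.3 MHz, 64-byte lines

print(f"600MP:    {mp600:8.0f} MBps")              # ~80 MBps
print(f"Starfire: {starfire:8.0f} MBps")           # ~10,667 MBps
print(f"Bandwidth increase: ~{starfire / mp600:.0f}x")   # ~133x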
`
Ultra Port Architecture interconnect
All current Sun workstations and servers use Sun's Ultra Port Architecture.3 The UPA provides writeback MOESI (exclusive modified, shared modified, exclusive clean, shared clean, and invalid) coherency on 64-byte-wide cache blocks. The UPA uses packet-switched transactions with separate address and 18-byte-wide data lines, including two bytes of error-correcting code (ECC).
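For reference, the five coherency states behind the MOESI acronym correspond to the names in the parentheses above. The small sketch below simply enumerates them; the one-line glosses are my paraphrase, not UPA documentation.

# The five MOESI states, using the names given in the text.
from enum import Enum

class MOESI(Enum):
    MODIFIED  = "exclusive modified"   # only this cache holds the block, and it is dirty
    OWNED     = "shared modified"      # dirty here, but other caches may hold shared copies
    EXCLUSIVE = "exclusive clean"      # only this cache holds the block; it matches memory
    SHARED    = "shared clean"         # clean copy, possibly present in several caches
    INVALID   = "invalid"              # the block is not usable in this cache

print([state.name for state in MOESI])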
We have developed small, medium, and large implementations of the Ultra Port Architecture, as shown in Figure 2, to optimize for different parts of the price spectrum.
Table 1. Sun interconnect generations.

Architecture                            MBus (1990)        XDBus (1993)       Ultra Port Architecture (1996)

Bus improvements
  Bus clock                             40 MHz             40-55 MHz          83.3-100 MHz
  Bus protocol                          Circuit switched   Packet switched    Packet switched
  Address and data information          Multiplexed on     Multiplexed on     Separate wires
                                        same wires         same wires
  Maximum number of interleaved buses   1                  4                  4
  Cache block size                      32 bytes           64 bytes           64 bytes
  Data port width                       8 bytes            8 bytes            16 bytes

Maximum interconnect performance
  Snoops/bus/clock                      1/16               1/11               1/2
  Maximum snooping rate                 2.5 million/s      20 million/s       167 million/s
  Corresponding maximum data bandwidth  80 MBps            1,280 MBps         10,667 MBps
`
[Figure 2. Three Ultra Port Architecture implementations: (a) a small system (Ultra 450, 1-4 processors) consisting of a single board with four processors, I/O bridges, and memory on a 5 × 5 data crossbar with a system controller; (b) a medium-sized system (Ultra 6000, 6-30 processors) with one address bus and a 32-byte-wide data bus between boards; and (c) a large system (Starfire Ultra 10000, 24-64 processors) with four address buses and a 16 × 16 data crossbar between boards.]
`
`
`
`
The small system's centralized coherency controller and small data crossbar provide the lowest possible cost and memory latency within the limited expansion needs of a single-board system. The medium-sized system's Gigaplane bus4 provides a broad range of expandability and the lowest possible memory latency. In the large system, Starfire's multiple-address broadcast routers and data crossbar extend the UPA family's bandwidth by four times and provide the unique ability to dynamically repartition and hot swap the system boards.

Starfire design choices
In the fall of 1993, we set out to implement the largest centerplane-connected, Ultra Port Architecture-based system that could be squeezed into one large cabinet. Our goals were to

• increase system address and data bandwidth by four times over the medium-sized Ultra 6000 system,
• provide a new dimension of Unix server flexibility with Dynamic System Domains, and
• improve system reliability, availability, and serviceability.

Our team had already implemented three previous generations of enterprise servers. When we started designing the Starfire, our top-of-the-line product was the 64-processor CS6400, which used the SuperSparc processor and the XDBus interconnect. Since the scale of the CS6400 worked well, we decided to carry over many of its concepts to the new UltraSparc/UPA technology generation.

We made the following design choices:

• Four-way interleaved address buses for the necessary snooping bandwidth. This approach had worked well on our 64-processor XDBus-generation system. Each address bus covers 1/4 of the physical address space. The buses snoop on every other cycle and update the duplicate tags in alternate cycles. At an 83.3-MHz system clock, Starfire's coherency rate is 167 million snoops per second. Multiplied by the Ultra Port Architecture's 64-byte cache line width, this is enough for a 10,667-megabyte-per-second (MBps) data rate.
• A 16 × 16 data crossbar. To match the snooping rate, we chose a 16 × 16 interboard data crossbar having the same 18-byte width as the UPA data bus. Figure 3 shows how the snooping and data bandwidths relate as the system is expanded. Since the snooping rate is a constant two snoops per clock, while the data crossbar capacity expands as boards are added, there is only one point of exact balance, at about 13 boards. For 12 and fewer boards, the data crossbar governs the interconnect capacity; for 14 to 16 boards, the snoop rate sets the ceiling. (A short calculation following Figure 3 below works through this balance point.)
• Point-to-point routing. We wanted to keep failures on one system board from affecting other system boards, and we wanted the capability to dynamically partition the system. To electrically isolate the boards, we used point-to-point router ASICs (application-specific integrated circuits) for the entire interconnect—data, arbitration, and the four address buses. Also, across a large cabinet, point-to-point wires can be clocked faster than bused signals.
• An active centerplane. The natural place to mount the router ASICs was on the centerplane, which is physically and electrically in the middle of the system.
• A system service processor (SSP). On our previous system it was very useful to have a known-good system that was physically separate from the server. We connected the SSP via Ethernet to Starfire's control boards, where it has access to internal ASIC status information.

[Figure 3. Snooping and data interconnect capacity. The plot shows bandwidth at an 83.3-MHz clock (MBps) versus the number of system boards for memory bandwidth, snooping capacity, and data-crossbar capacity with random addresses; the system is data-crossbar limited up to about 13 boards and snoop limited from there to 16 boards.]
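To see roughly where the balance point in Figure 3 comes from, the sketch below uses the clock rate, snoop rate, and 16-byte data-port payload from the article, together with a random-traffic contention factor of 1 - (1 - 1/n)^n for an n-port crossbar. That factor is my own simplifying assumption, not necessarily the model behind the figure; with it, the crossover falls between 12 and 13 boards, close to the "about 13 boards" cited above.

# Illustrative model of snoop-limited versus crossbar-limited bandwidth.
CLOCK_MHZ = 83.3
SNOOP_LIMIT_MBPS = 2 * CLOCK_MHZ * 64       # ~10,667 MBps, independent of board count
PORT_MBPS = 16 * CLOCK_MHZ                  # raw data payload of one crossbar port

def crossbar_capacity_mbps(boards):
    """Aggregate crossbar throughput with uniformly random destinations (assumed model)."""
    contention = 1 - (1 - 1 / boards) ** boards   # fraction of output ports kept busy
    return boards * PORT_MBPS * contention

for boards in range(4, 17):
    crossbar = crossbar_capacity_mbps(boards)
    limiter = "data-crossbar" if crossbar < SNOOP_LIMIT_MBPS else "snoop"
    print(f"{boards:2d} boards: crossbar ~{crossbar:6.0f} MBps, "
          f"snoop limit {SNOOP_LIMIT_MBPS:6.0f} MBps -> {limiter} limited")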
`
Starfire interconnect
Like most multiboard systems, Starfire has a two-level interconnect. The on-board interconnect conveys traffic from the processors, SBus cards, and memory to the off-board address and data ports. The centerplane interconnect transfers addresses and data between the boards.

Memory accesses always traverse the global interconnect, even if the requested memory location is physically on the same board. Addresses must be sent off board anyway to accomplish global snooping. Data transfers are highly pipelined, and local shortcuts to save a few cycles would have unduly complicated the design. As with the rest of the Ultra server family, Starfire's uniform-memory-access time is independent of the board where memory is located.

Address interconnect. Table 2 characterizes the address interconnect. Address transactions take two cycles. The two low-order cache-block address bits determine which address bus to use.
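A minimal sketch of the four-way address interleave, assuming the two bus-select bits sit immediately above the 64-byte block offset (the exact bit positions are my inference from the block size, not stated in the article):

# Which of the four snoop buses carries a given physical address (assumed layout).
CACHE_BLOCK_BYTES = 64

def address_bus_for(physical_address):
    """Drop the 6 byte-offset bits, then use the two low-order block-address bits."""
    block_address = physical_address // CACHE_BLOCK_BYTES
    return block_address & 0b11

# Consecutive 64-byte blocks rotate across buses 0, 1, 2, 3, 0, ...
for block in range(6):
    addr = block * CACHE_BLOCK_BYTES
    print(f"address {addr:#06x} -> address bus {address_bus_for(addr)}")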
`
Data interconnect. Table 3 characterizes the data interconnect. Data packets take four cycles. In the case of a load-miss, the missed-upon 16 bytes are sent first.
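The "missed-upon 16 bytes are sent first" behavior is the familiar critical-chunk-first ordering. A small sketch follows; the wraparound order of the remaining three 16-byte beats is an assumption for illustration, not taken from the article.

# Order in which the four 16-byte beats of a 64-byte line might be delivered.
LINE_BYTES = 64
BEAT_BYTES = 16                       # one 16-byte data payload per clock

def beat_order(miss_offset):
    """Return the line offsets of the four beats, critical 16-byte chunk first."""
    first = (miss_offset // BEAT_BYTES) * BEAT_BYTES
    return [(first + i * BEAT_BYTES) % LINE_BYTES for i in range(LINE_BYTES // BEAT_BYTES)]

print(beat_order(0x28))               # a miss at byte 0x28 -> beats at offsets [32, 48, 0, 16]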
`
`
`
`
Table 2. Address interconnect.

Unit                             ASIC type              Purpose                                          ASICs per       ASICs per
                                                                                                         system board    centerplane
Port controller                  PC                     Controls two UPA address-bus ports               3               0
Coherency interface controller   CIC                    Maintains duplicate tags to snoop local caches   4               0
Memory controller                MC                     Controls four DRAM banks                         1               0
UPA to SBus                      SYSIO                  Bridges UPA to SBus                              2               0
Local address arbiter            LAARB mode of XARB     Arbitrates local address requests                1               0
Global address arbiter           GAARB mode of XARB     Arbitrates global requests for an address bus    0               4
Global address bus               GAB mode of 4 XMUXes   Connects a CIC on every board                    0               16
`
Table 3. Data interconnect.

Unit                     ASIC type                Purpose                                                     ASICs per       ASICs per
                                                                                                              system board    centerplane
UltraSparc data buffer   UDB                      Buffers data from the processor; generates and checks ECC   8               0
Pack/unpack              Pack mode of 2 XMUXes    Assembles and disassembles data into 72-byte memory blocks  4               0
Data buffer              DB                       Buffers data from two UPA data-bus ports                    4               0
Local data arbiter       LDARB mode of XARB       Arbitrates on-board data requests                           1               0
Local data router        LDR mode of 4 XMUXes     Connects four Starfire data buffers to a crossbar port      4               0
Global data arbiter      GDARB                    Arbitrates requests for the data crossbar                   0               2
Global data router       GDR mode of 12 XMUXes    16 × 16 × 18-byte crossbar between the boards               0               12
`
The Starfire data buffer ASICs provide temporary storage for packets that are waiting their turn to be moved across the centerplane. The local and global routers are not buffered, and transfers take a fixed eight clocks from the data buffer on the sending board to the data buffer on the receiving board.
Interconnect operation. An example of a load-miss to memory illustrates Starfire's interconnect operation. The interconnect diagram in Figure 4 shows the steps listed in Table 4.
Buses versus point-to-point routers. Starfire's pin-to-pin latency for a load-miss is 38 clocks (468 nanoseconds), counting from the cycle when the address request leaves the processor through the cycle when data arrives at the processor. The medium-sized Ultra 6000's bus takes only 18 clocks (216 ns).

Buses have lower latencies than routers: a bus takes only 1 clock to move information from one system component to another. A router, on the other hand, takes 3 clocks to move information: a cycle on the wires to the router, a cycle inside the routing chip, and another cycle on the wires to the receiving chip. Buses are the preferred interconnect topology for small and medium-sized systems.

Ultra 10000 designers used routers with point-to-point interconnects to emphasize bandwidth, partitioning, and reliability, availability, and serviceability. Ultra 6000 designers used a bus to optimize for low latency and economy over a broad product range.
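The hop-cost difference can be summarized with two constants; the accounting below just restates the 1-clock bus hop versus the 3-clock routed hop described above.

# Illustrative hop-cost comparison (clocks per chip-to-chip transfer).
BUS_HOP_CLOCKS = 1       # one clock to move information between components on a bus
ROUTER_HOP_CLOCKS = 3    # wire to the router + a clock inside the chip + wire out

for hops in (1, 2, 3):
    print(f"{hops} transfer(s): bus {hops * BUS_HOP_CLOCKS} clock(s), "
          f"point-to-point routed {hops * ROUTER_HOP_CLOCKS} clock(s)")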
`
Starfire packaging
Starfire's cabinet is 70 inches tall × 34 inches wide × 46 inches deep. Inside are two rows of eight system boards mounted on either side of a centerplane. Starfire is our fourth generation of centerplane-based systems.

Besides the card cage, power supply, and cooling system, the cabinet has room for three disk trays. The remaining peripherals are housed separately in standard Sun peripheral racks.

Starfire performs two levels of power conversion. Up to eight N + 1 redundant bulk supplies convert from 220 Vac to 48 Vdc, which is then distributed to each board. On-board supplies convert from 48 Vdc to 3.3 and 5 Vdc. Having local power supplies facilitates the hot swap of system boards.

Starfire uses 12 hot-pluggable fan trays, half above and half below the card cage. Fan speed is automatically controlled to reduce noise in normal environmental conditions.

Centerplane. The centerplane holds the 20 address ASICs and 14 data ASICs that route information between the 16 system-board sockets. It is 27 inches wide × 18 inches tall × 141 mils thick, with 14 signal layers and 14 power layers. The net density utilization is nearly 100%. We routed approximately 95% of the 14,000 nets by hand. There are approximately 2 miles of wire etch and 43,000 holes.

Board spacing is 3 inches to allow enough airflow to cool the four 45-watt processor modules on each system board. Signal lengths had to be minimized to run a 10-ns system clock across 16 boards. The maximum etched wire length is approximately 20 inches.
`
`
`
`
[Figure 4. Interconnect steps for a load-miss from memory. The diagram traces the numbered steps of Table 4 across a requesting board and a responding board: the address phase through the port controllers, local address arbiters, four global address arbiters and buses, and the coherency interface controllers with their Dtag SRAMs; the read phase through the memory controller, memory banks, and pack/unpack units; the data phase through the data buffers, local data routers, global data arbiter, and the centerplane data crossbar; and the write phase back to the requesting UltraSparc module.]
`
`
`
`
Table 4. Interconnect sequence for a load-miss to memory.

Send address and establish coherency (13 clocks):
1. Processor makes a request to its port controller.
2. Port controller sends the address to a coherency interface controller, and sends the request to the local address arbiter.
3. Local address arbiter requests a global address bus cycle.
4. Global address arbiter grants an address bus cycle.
5. Coherency interface controller sends the address through the global address bus to the rest of the coherency interface controllers on the other boards.
6. All coherency interface controllers relay the address to their memory controllers and snoop the address in their duplicate tags.
7. All coherency interface controllers send their snoop results to the global address arbiter.
8. Global address arbiter broadcasts the global snoop result.
9. Memory is not aborted by its coherency interface controller because the snoop did not hit.

Read from memory (13 clocks):
1. Memory controller recognizes that this address is for one of its memory banks.
2. Memory controller orchestrates a DRAM cycle and requests a data transfer from its local data arbiter.
3. Memory sends 72 bytes of data to the unpack unit.
4. Unpack splits the data into four 18-byte pieces.
5. Unpack sends data to the data buffer to be buffered for transfer.

Transfer data (8 clocks):
1. Local data arbiter requests a data transfer.
2. Global data arbiter grants a data transfer and notifies the receiving local data arbiter that data is coming.
3. Sending local data arbiter tells the data buffer to begin the transfer.
4. Sending data buffer sends data to the local data router.
5. Data moves through the local data router to the centerplane crossbar.
6. Data moves through the centerplane crossbar to the receiving board's local data router.
7. Data moves through the receiving local data router to the receiver's data buffer.

Write data (4 clocks):
1. Port controller tells the data buffer to send the data packet.
2. Data buffer sends data to the UltraSparc data buffer on the processor module.
3. UltraSparc data buffer sends data to the processor.
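Summing the per-phase budgets in Table 4 recovers the 38-clock pin-to-pin load-miss figure quoted in the latency comparison; a quick tally (values taken directly from the table):

# Cumulative clock count for the load-miss sequence of Table 4.
load_miss_phases = [
    ("Send address and establish coherency", 13),
    ("Read from memory",                     13),
    ("Transfer data",                         8),
    ("Write data",                            4),
]

elapsed = 0
for phase, clocks in load_miss_phases:
    elapsed += clocks
    print(f"{phase:<40s} +{clocks:2d} clocks (cumulative {elapsed})")
# Total: 38 clocks.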
`
After extensive cross-talk analysis, we developed and implemented a novel method for absolute cross-talk minimization on long, minimally coupled lines.

We distributed clock sources through the centerplane to each system board ASIC. Routing all traces on identical topologies minimizes skew between clock arrivals at the ASICs.
We used unidirectional, point-to-point, source-terminated CMOS to implement the 144-bit-wide 16 × 16 data crossbar and the four 48-bit-wide address-broadcast routers. We designed and tested the system to run at 100 MHz at worst-case temperatures and voltages. However, the UltraSparc-II processor constrains the system clock to be 1/3 or 1/4 of the processor clock. As of this writing, the processor clock is 250 MHz, so the system clock is 250/3 = 83.3 MHz.

System boards. The system boards, shown in Figure 5, each hold six mezzanine modules on the top side: the memory module, the I/O module, and four processor modules. The bottom side has nine address ASICs, nine data ASICs, and five 48-volt power converters. The boards are 16 × 20 inches with 24 layers.
`
Processor module. Starfire uses the same UltraSparc processor module as the rest of Sun's departmental and data center servers. As of this writing, the server modules have a 250-MHz processor with 4 Mbytes of external cache.

Memory module. The memory module contains four 576-bit-wide banks of memory composed of 32 standard 168-pin ECC DIMMs (dual in-line memory modules). It also has four ASICs, labeled Pk, which pack and unpack 576-bit-wide memory words into 144-bit-wide data-crossbar blocks. The memory module contains 4 Gbytes of memory using 64-Mbit DRAMs.
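A small sketch of the pack/unpack step: one 576-bit (72-byte) memory word is carried over the 144-bit (18-byte) crossbar path as four pieces. The sequential piece order shown here is an assumption for illustration; as noted earlier, the missed-upon chunk is actually sent first on a load-miss.

# Splitting a 72-byte memory word into four 18-byte crossbar-width pieces.
MEMORY_WORD_BYTES = 72        # 576 bits: 64 data bytes plus 8 ECC bytes
CROSSBAR_WIDTH_BYTES = 18     # 144 bits: 16 data bytes plus 2 ECC bytes per clock

def unpack(memory_word):
    """Cut one memory word into crossbar-width pieces."""
    assert len(memory_word) == MEMORY_WORD_BYTES
    return [memory_word[i:i + CROSSBAR_WIDTH_BYTES]
            for i in range(0, MEMORY_WORD_BYTES, CROSSBAR_WIDTH_BYTES)]

pieces = unpack(bytes(range(MEMORY_WORD_BYTES)))
print([len(p) for p in pieces])   # [18, 18, 18, 18]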
I/O module. The current I/O module interfaces between the UPA and two SBuses and provides four SBus card slots. Each SBus has an achievable bandwidth of 100 MBps.

ASIC types. We designed seven unique ASIC types. Six of them implement an entire functional unit on a single chip, while the seventh is a multiplexer part used to implement the local and global routers, as well as the pack/unpack function. The ASICs are fabricated in 3.3-V, 0.5-micron CMOS technology. The largest die is 9.95 × 10 mm, with five metal layers. The ASICs are all packaged in 32 × 32-mm ceramic ball grid arrays with 624 pins.
`
`
`
`
[Figure 5. Bottom (left) and top (right) of the system boards, with the centerplane shown in the middle. The top side carries the memory module (32 DIMMs and four pack/unpack "Pk" ASICs), four UltraSparc processor modules, and the I/O module with four SBus cards. The bottom side carries the nine address ASICs (PC, CIC with duplicate tags, MC, LAARB), the nine data ASICs (DB, LDARB, LDR), and the 48-V power converters. The centerplane holds the four address buses (20 ASICs: GAARB and GAR) and the 16 × 16 data crossbar (GDARB and GDR).]
`
For more details on Starfire's physical implementation, see Charlesworth et al.5
Interconnect reliability. In addition to the ECC for data that is generated and checked by the processor module, Starfire ASICs also generate and check ECC for address packets. To help isolate faults, the Starfire data-buffer chips check data-packet ECC along the way through the interconnect.

Failed components. If an UltraSparc module, DIMM, SBus board, memory module, I/O module, system board, control board, centerplane support board, power supply, or fan fails, the system tries to recover without service interruption. Later, the failed component can be hot swapped out of the system and replaced.

Redundant components. Customers can optionally configure a Starfire to have 100% hardware redundancy of configurable components: control boards, support boards, system boards, disk storage, bulk power subsystems, bulk power supplies, cooling fans, peripheral controllers, and system service processors. If the centerplane experiences a failure, it can operate in a degraded mode. If one of the four address buses fails, the remaining buses will allow access to all system resources. The data crossbar is divided into separate halves, so it can operate at half-bandwidth if an ASIC fails.

Crash recovery. A fully redundant system can always recover from a system crash, utilizing standby components or operating in degraded mode. Automatic system recovery enables the system to reboot immediately following a failure, automatically disabling the failed component. This approach prevents a faulty hardware component from causing the system to crash again or from keeping the entire system down.
`
Dynamic System Domains
Dynamic System Domains make Starfire unique among Unix servers. Starfire can be dynamically subdivided into multiple computers, each consisting of one or more system boards. System domains are similar to partitions on a mainframe. Each domain is a separate shared-memory SMP system that runs its own local copy of Solaris and has its own disk storage and network connections. Because individual system domains are logically isolated from other system domains, hardware and software errors are confined to their respective domain and do not affect the rest of the system. Thus, a system domain can be used to test device drivers, updates to Solaris, or new application software without impacting production usage.

Dynamic System Domains can serve many purposes, enabling the site to manage the Starfire resources effectively:
`
• LAN consolidation. A single Starfire can replace two or more smaller servers. It is easier to administer because it uses a single system service processor (SSP), and it is more robust because it has better reliability, availability, and serviceability features. Starfire offers the flexibility to shift resources quickly from one "server" to another. This is beneficial as applications grow, or when demand reaches peak levels and requires rapid reassignment of computing resources.
• Development, production, and test environments. In a production environment, most sites require separate development and test facilities. With Starfire, those functions can safely coexist in the same box. Having isolated facilities enables development work to continue on a regular schedule without impacting production.
• Software migration. Dynamic System Domains may be used as a way to migrate systems or application software to updated versions. This applies to the Solaris operating system, database applications, new administrative environments, and applications.
• Special I/O or network functions. A system domain can be established to deal with specific I/O devices or functions. For example, a high-end tape device could be attached to a dedicated system domain, which is alternately merged into other system domains that need to use the device for backup or other purposes.
• Departmental systems. Multiple projects or departments can share a single Starfire system, simplifying cost-justification and cost-accounting requirements.
`
`
`
`
Many domain schemes are possible with Starfire's 64 processors and 16 boards. For example, we could make domain 1 a 12-board (48-processor) production domain running the current release of Solaris. Domain 2 could be a two-board (8-processor) domain for checking out an early version of the next Solaris release. Domain 3 could be a two-board (8-processor) domain running a special application—for instance, proving that the application is fully stable before allowing it to run in the production domain. Each domain has its own boot disk and storage, as well as its own network connection.

Domain administration. System administrators can dynamically switch system boards between domains or remove them from active domains for upgrade or servicing. After service, boards can be reintroduced into one of the active domains, all without interrupting system operation. Each system domain is administered from the SSP, which services all the domains.

The system service processor is a Sparc workstation that runs standard Solaris plus a suite of diagnostics and management programs. It is connected via Ethernet to a Starfire control board. The control board has an embedded control processor that interprets the TCP/IP Ethernet traffic and converts it to JTAG control information. Figure 6 shows an example of Starfire's hardware and domain status in the Hostview main screen. In this instance there are four domains, plus an additional board being tested in isolation.

[Figure 6. System service processor Hostview main screen. The display maps the 16 system-board slots (SB0-SB15) with their 64 processors, the four address buses (ABUS0-ABUS3), the two halves of the data crossbar (DBUS0, DBUS1), and the control and centerplane support boards (CB0/CB1, CSB0/CSB1), along with failure indicators and power, temperature, and fan buttons. In this example, domain #1 spans 5 boards, domain #2 spans 3 boards, domain #3 spans 1 board, domain #4 spans 6 boards, and one board is being tested in isolation.]

Domain implementation. Domain protection is implemented at two levels: in the centerplane arbiters and in the coherency interface controllers on each board.

Centerplane filtering. Global arbiters provide the top-level separation between unrelated domains. Each global arbiter contains a 16 × 16-bit set of domain control registers. For each system board there is a register that, when the bits are set to one, establishes the set of system boards in a particular board's domain group.
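A hypothetical sketch of how such per-board domain-control masks can gate address broadcasts follows. The register layout, class, and method names are illustrative, not taken from the ASIC documentation.

# Hypothetical model of the centerplane domain-control registers.
NUM_BOARDS = 16

class GlobalArbiter:
    def __init__(self):
        # One 16-bit domain-control register per system board (16 x 16 bits in all).
        self.domain_group = [0] * NUM_BOARDS

    def set_group(self, board, member_boards):
        mask = 0
        for b in member_boards:
            mask |= 1 << b
        self.domain_group[board] = mask

    def may_broadcast_to(self, source_board, target_board):
        """An address from source_board is forwarded only to boards in its group."""
        return bool(self.domain_group[source_board] & (1 << target_board))

arb = GlobalArbiter()
arb.set_group(0, [0, 1, 2])         # boards 0-2 form one domain
arb.set_group(3, [3])               # board 3 is isolated, for example under test
print(arb.may_broadcast_to(0, 1))   # True: same domain group
print(arb.may_broadcast_to(0, 3))   # False: unrelated domains never see each other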
Board-level filtering. Board-level filtering lets a group of domains view a region of each other's memory to facilitate interdomain networking—a fast form of communication between a group of domains. As Figure 7 illustrates, all four coherency interface controllers on a system board have identical copies of the following registers (a brief filtering sketch follows the list):
`
`
• Domain mask. Sixteen bits identify which other system boards are in the board's domain.
• Group memory mask. Sixteen bits identify which other boards are in a board's domain group, to facilitate memory-based networking between domains.
• Group memory base and limit registers. These registers contain the lower and upper physical addresses of the board's memory that are visible to other domains in a group of domains. The granularity of these addresses is 64 Kbytes.

[Figure 7. Domain registers in the Starfire interconnect. Each of the four address routers and the data crossbar arbiter holds 16 domain control registers; each system board's four coherency interface controllers hold identical copies of the domain mask, group memory mask, group memory base, and group memory limit registers.]
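The sketch mentioned above: a hypothetical model of the board-level check using the three registers just listed. The names, the 64-Kbyte window arithmetic shown, and the order of the checks are my assumptions, not the coherency interface controller's actual logic.

# Hypothetical board-level filter held in each coherency interface controller.
GRANULARITY = 64 * 1024             # base/limit registers have 64-Kbyte granularity

class BoardFilter:
    def __init__(self, domain_mask, group_mask, group_base, group_limit):
        self.domain_mask = domain_mask    # boards in this board's own domain
        self.group_mask = group_mask      # boards allowed to use the shared window
        self.group_base = group_base      # lowest local address visible to the group
        self.group_limit = group_limit    # highest local address visible to the group

    def accepts(self, requesting_board, local_address):
        if self.domain_mask & (1 << requesting_board):
            return True                   # same domain: full access to this board
        if self.group_mask & (1 << requesting_board):
            # another domain in the group sees only the exported memory window
            return self.group_base <= local_address <= self.group_limit
        return False                      # unrelated domain: no access at all

f = BoardFilter(domain_mask=0b0000000000000111,            # boards 0-2
                group_mask=0b0000000000111000,             # boards 3-5 share the window
                group_base=0x40000000,
                group_limit=0x40000000 + 16 * GRANULARITY - 1)
print(f.accepts(1, 0x1234))        # True: board 1 is in the same domain
print(f.accepts(4, 0x40000000))    # True: board 4 reads the shared window
print(f.accepts(4, 0x0))           # False: outside the exported window
print(f.accepts(9, 0x40000000))    # False: board 9 is in an unrelated domain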
`
Dynamic reconfiguration. With dynamic reconfiguration, system administrators can logically move boards between running domains on the fly. The process has two phases: attach and detach.

Attach. This phase connects a system board to a domain and makes it possible to perform online upgrades, redistribute system resources for load balancing, or reintroduce a board after it has been repaired. Attach diagnoses and configures the candidate system board so that it can be introduced safely into the running Solaris operating system. There are two steps:

1. The board is added to the target domain's board list in the domain configuration files on the SSP. Power-on self-test (POST) executes, testing and configuring the board. POST also creates a single-board domain group that isolates the candidate board from the centerplane's other system boards. The processors shift from a reset state into a spin mode, preparing them for code execution. The centerplane and board-level domain registers are configured to include the candidate board in the target domain.
`
`
`
`
`
`
`
`
`
2. When these operations are complete, the Solaris