`
Data center evolution: A tutorial on state of the art, issues, and challenges
`
`Krishna Kant
`
Intel Corporation, Hillsboro, Oregon, USA. E-mail address: krishna.kant@intel.com
`
Keywords:
Data center
Virtualization
InfiniBand
Ethernet
Solid state storage
Power management

Abstract

`Data centers form a key part of the infrastructure upon which a variety of information tech-
`nology services are built. As data centers continue to grow in size and complexity, it is
`desirable to understand aspects of their design that are worthy of carrying forward, as well
`as existing or upcoming shortcomings and challenges that would have to be addressed. We
`envision the data center evolving from owned physical entities to potentially outsourced,
`virtualized and geographically distributed infrastructures that still attempt to provide the
`same level of control and isolation that owned infrastructures do. We define a layered
`model for such data centers and provide a detailed treatment of state of the art and emerg-
`ing challenges in storage, networking, management and power/thermal aspects.
© 2009 Published by Elsevier B.V.
`
`1. Introduction
`
`Data centers form the backbone of a wide variety of ser-
`vices offered via the Internet including Web-hosting, e-
`commerce, social networking, and a variety of more gen-
`eral services such as software as a service (SAAS), platform
`as a service (PAAS), and grid/cloud computing. Some exam-
`ples of these generic service platforms are Microsoft’s
Azure platform, Google App Engine, Amazon’s EC2 plat-
`form and Sun’s Grid Engine. Virtualization is the key to
`providing many of these services and is being increasingly
`used within data centers to achieve better server utiliza-
`tion and more flexible resource allocation. However, virtu-
`alization also makes many aspects of data center
`management more challenging.
`As the complexity, variety, and penetration of such ser-
`vices grows, data centers will continue to grow and prolif-
`erate. Several forces are shaping the data center landscape
and we expect future data centers to be a lot more than sim-
`ply bigger versions of those existing today. These emerging
`
`
`trends – more fully discussed in Section 3 – are expected to
`turn data centers into distributed, virtualized, multi-lay-
`ered infrastructures that pose a variety of difficult
`challenges.
`In this paper, we provide a tutorial coverage of a vari-
`ety of emerging issues in designing and managing large
`virtualized data centers. In particular, we consider a lay-
ered model of virtualized data centers and discuss stor-
age, networking, management, and power/thermal
issues for such a model. Because of the vastness of the
space, we shall avoid detailed treatment of certain well-
researched issues. In particular, we do not delve into the
intricacies of virtualization techniques, virtual machine
migration, and scheduling in virtualized environments.
`The organization of the paper is as follows. Section 2
`discusses the organization of a data center and points out
`several challenging areas in data center management. Sec-
`tion 3 discusses emerging trends in data centers and new
issues posed by them. Subsequent sections then discuss
specific issues in detail, including storage, networking,
management, and power/thermal issues. Finally, Section 8
`summarizes the discussion.
`
`
`2. Data center organization and issues
`
`2.1. Rack-level physical organization
`
A data center is generally organized in rows of "racks"
where each rack contains modular assets such as servers,
switches, storage "bricks", or specialized appliances, as
shown in Fig. 1. A standard rack is 78 in. high, 23–25 in.
wide, and 26–30 in. deep. Typically, each rack takes a num-
ber of modular "rack mount" assets inserted horizontally
into the rack. The asset thickness is measured using a
unit called "U", which is 44.45 mm (1.75 in.). An over-
whelming majority of servers are single- or dual-socket
machines that fit in a 1U size, but larger ones (e.g., 4-socket
multiprocessors) may require 2U or larger sizes. A standard
rack can take a total of 42 1U assets
`when completely filled. The sophistication of the rack itself
`may vary greatly – in the simplest case, it is nothing more
`than a metal enclosure. Additional features may include
`rack power distribution, built-in KVM (keyboard–video–
`mouse) switch, rack-level air or liquid cooling, and perhaps
even a rack-level management unit.

Fig. 1. Physical organization of a data center.
`For greater compactness and functionality, servers can
`be housed in a self-contained chassis which itself slides
`into the rack. With 13 in. high chassis, six chassis can fit
`into a single rack. A chassis comes complete with its own
`power supply, fans, backplane interconnect, and manage-
`ment infrastructure. The chassis provides standard size
`slots where one could insert modular assets (usually
`known as blades). A single chassis can hold up to 16 1U
`servers, thereby providing a theoretical rack capacity of
`96 modular assets.
The substantial increase in server density achievable by
using the blade form factor results in a corresponding in-
crease in per-rack power consumption which, in turn,
can seriously tax the power delivery infrastructure. In par-
ticular, many older data centers are designed with about
a 7 kW per-rack power rating, whereas racks loaded with
blade servers could approach 21 kW. There is a similar is-
`sue with respect to thermal density – the cooling infra-
`structure may be unable to handle the offered thermal
`load. The net result is that it may be impossible to load
the racks to their capacity. For some applications, a fully
loaded rack may not offer the required peak network or
storage bandwidth (BW) either, thereby requiring careful
management of resources to stay within the BW limits.
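
As a back-of-the-envelope illustration of why blade-dense racks outrun older power budgets, the following Python sketch combines the capacity figures above (16 blades per chassis, six chassis per rack) with an assumed per-blade draw of roughly 220 W; only the 7 kW budget and the resulting ~21 kW demand come from the discussion above.

CHASSIS_PER_RACK = 6           # six 13 in. chassis per standard rack
BLADES_PER_CHASSIS = 16        # blade slots per chassis
BLADE_POWER_W = 220            # assumed per-blade draw (illustrative)
LEGACY_RACK_BUDGET_W = 7_000   # older data centers: ~7 kW per rack

blades = CHASSIS_PER_RACK * BLADES_PER_CHASSIS   # 96 modular assets
demand_w = blades * BLADE_POWER_W                # ~21 kW
print(f"{blades} blades need ~{demand_w / 1000:.0f} kW "
      f"against a {LEGACY_RACK_BUDGET_W / 1000:.0f} kW budget")
# A fully loaded blade rack demands roughly 3x the legacy power rating,
# which is why racks often cannot be populated to capacity.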
`
`2.2. Storage and networking infrastructure
`
`Storage in data centers may be provided in multiple
ways. Often, high performance storage is housed in spe-
cial "storage towers" that allow transparent remote access
to the storage irrespective of the number and types of
physical storage devices used. Storage may also be pro-
vided in smaller "storage bricks" located in rack or chassis
slots or directly integrated with the servers. In all cases,
efficient network access to the storage is crucial.
`A data center typically requires four types of network
`accesses, and could potentially use four different types of
`physical networks. The client–server network provides
`external access into the data center, and necessarily uses
`a commodity technology such as the wired Ethernet or
`wireless LAN. Server-to-server network provides high-
`speed communication between servers and may use Ether-
`net, InfiniBand (IBA) or other technologies. The storage ac-
`cess has traditionally been provided by Fiber Channel but
`could also use Ethernet or InfiniBand. Finally, the network
`used for management is also typically Ethernet but may
either use separate cabling or exist as a "sideband" on
`the mainstream network.
Both mainstream and storage networks typically follow
an identical configuration. For blade servers mounted in a
`chassis, the chassis provides a switch through which all
`the servers in the chassis connect to outside servers. The
`switches are duplexed for reliability and may be arranged
`for load sharing when both switches are working. In order
`to keep the network manageable, the overall topology is
`basically a tree with full connectivity at the root level.
`For example, each chassis level (or level 1) switch has an
`uplink leading to the level 2 switch, so that communication
`between two servers in different chassis must go through
`at least three switches. Depending on the size of the data
`center, the multiple level 2 switches may be either con-
`nected into a full mesh, or go through one or more level
`3 switches. The biggest issue with such a structure is po-
`tential bandwidth inadequacy at higher levels. Generally,
`uplinks are designed for a specific oversubscription ratio
`since providing a full bisection bandwidth is usually not
feasible. For example, 20 servers, each with a 1 Gb/s Ether-
net link, may share a single 10 Gb/s Ethernet uplink for an
oversubscription ratio of 2.0. This may be troublesome if the
`workload mapping is such that there is substantial non-lo-
`cal communication. Since storage is traditionally provided
`in a separate storage tower, all storage traffic usually
`crosses the chassis uplink on the storage network. As data
`centers grow in size, a more scalable network architecture
`becomes necessary.
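
A minimal sketch of the oversubscription arithmetic, using the 20 x 1 Gb/s server links and the single 10 Gb/s uplink from the example above; the helper function is ours, purely for illustration.

def oversubscription_ratio(num_servers: int,
                           server_link_gbps: float,
                           uplink_gbps: float) -> float:
    """Aggregate server bandwidth divided by uplink bandwidth."""
    return (num_servers * server_link_gbps) / uplink_gbps

ratio = oversubscription_ratio(20, 1.0, 10.0)
print(f"oversubscription ratio: {ratio:.1f}")   # 2.0
# Ratios above 1.0 are acceptable when most traffic stays local to the
# chassis, but hurt workloads with substantial non-local communication.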
`
`2.3. Management infrastructure
`
`Each server usually carries a management controller
`called the BMC (baseboard management controller). The
`management network terminates at the BMC of each ser-
`ver. When the management network is implemented as a
`
`
`
`
"sideband" network, no additional switches are required
`for it; otherwise, a management switch is required in each
`chassis/rack to support external communication. The basic
`functions of the BMC include monitoring of various hard-
`ware sensors, managing various hardware and software
`alerts, booting up and shutting down the server, maintain-
`ing configuration data of various devices and drivers, and
`providing remote management capabilities. Each chassis
`or rack may itself sport its own higher level management
`controller which communicates with the lower level
`controller.
`Configuration management is a rather generic term and
`can refer to management of parameter settings of a variety
`of objects that are of interest in effectively utilizing the
`computer system infrastructure from individual devices
`up to complex services running on large networked clus-
`ters. Some of this management clearly belongs to the base-
`board management controller (BMC) or corresponding
`higher level management chain. This is often known as
out-of-band (OOB) management since it is done without
involvement of the main CPU or the OS. Other activities may
be more appropriate for in-band management and may be
done by the main CPU in hardware, in the OS, or in the middle-
ware. The higher level management may run on separate
`systems that have both in-band and OOB interfaces. On a
server, the most critical OOB functions belong to the pre-
boot phase and to the monitoring of server health while the
OS is running. On other assets such as switches, routers,
and storage bricks, the management is necessarily OOB.
`
`2.4. Electrical and cooling infrastructure
`
`Even medium-sized data centers can sport peak power
`consumption of several megawatts or more. For such
`power loads, it becomes necessary to supply power using
high voltage lines (e.g., 33 kV, 3 phase) and step it down
on premises to the 280–480 V (3 phase) range for routing
through the uninterruptible power supply (UPS). The UPS
`unit needs to convert AC to DC to charge its batteries and
`then convert DC to AC on the output end. Since the UPS
`unit sits directly in the power path, it can continue to sup-
`ply output power uninterrupted in case of input power
`
loss. The output of the UPS (usually 240/120 V, single phase)
is routed to the power distribution unit (PDU) which, in
turn, supplies power to individual rack-mounted servers
or blade chassis. Next, the power is stepped down, con-
`verted from AC to DC, and partially regulated in order to
`yield the typical 12 and 5 V outputs with the desired
`current ratings (20–100 A). These voltages are delivered
`to the motherboard where the voltage regulators (VRs)
`must convert them to as many voltage rails as the server
design demands. For example, in an IBM blade server, the
supported voltage rails include five to six rails (ranging from
3.3 V down to 1.1 V), in addition to the 12 V and 5 V rails.
`Each one of these power conversion/distribution stages
results in power loss, with some stages showing efficien-
cies in the 85–95% range or worse. It is thus not surprising that
`the cumulative power efficiency by the time we get down
`to voltage rails on the motherboard is only 50% or less
`(excluding cooling, lighting, and other auxiliary power
`uses). Thus there is a significant scope for gaining power
`efficiencies by a better design of power distribution and
`conversion infrastructure.
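
To make the compounding of per-stage losses concrete, the sketch below multiplies out an assumed chain of stage efficiencies; the individual values are illustrative picks in or below the 85–95% range mentioned above, chosen so that the cumulative figure lands near the cited ~50%.

stages = {
    "HV step-down transformer": 0.97,
    "UPS (AC-DC-AC double conversion)": 0.90,
    "PDU / step-down to server input": 0.93,
    "Server power supply (AC-DC)": 0.85,
    "On-board voltage regulators (VRs)": 0.75,
}

cumulative = 1.0
for stage, eff in stages.items():
    cumulative *= eff
    print(f"{stage:40s} {eff:5.0%}  cumulative {cumulative:5.0%}")
# The cumulative efficiency ends up close to 50%, i.e., roughly half of
# the utility power never reaches the motherboard voltage rails.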
`The cooling infrastructure in a data center can be quite
`elaborate and expensive involving building level air-condi-
`tioning units requiring large chiller plants, fans and air
`recirculation systems. Evolving cooling technologies tend
`to emphasize more localized cooling or try to simplify
cooling infrastructure. The server racks are generally
placed on a raised floor (which acts as a cold-air plenum)
and arranged in alternating back-facing and front-facing
aisles, as shown in Fig. 2. Cold air is forced up in the front-
facing aisles, and the server or chassis fans draw the cold
air through the servers to the back. The hot air at the back
then rises and is directed (sometimes by using deflectors)
towards the chiller plant for cooling and recirculation.
This basic setup is inexpensive but can create hot spots,
either due to uneven cooling or the mixing of hot and cold air.

Fig. 2. Cooling in a data center.
`
`2.5. Major data center issues
`
`Data center applications increasingly involve access to
`massive data sets, real-time data mining, and streaming
media delivery that place heavy demands on the storage
`infrastructure. Efficient access to large amounts of storage
`necessitates not only high performance file systems but
`also high performance storage technologies such as solid-
`state storage (SSD) media. These issues are discussed in
`Section 5. Streaming large amounts of data (from disks or
`SSDs) also requires high-speed, low-latency networks. In
`clustered applications, the inter-process communication
`(IPC) often involves rather small messages but with very
`low-latency requirements. These applications may also
use remote main memories as "network caches" of data
`and thus tax the networking capabilities. It is much cheap-
`er to carry all types of data – client–server, IPC, storage and
`perhaps management – on the same physical fabric such as
`Ethernet. However, doing so requires sophisticated QoS
`capabilities that are not necessarily available in existing
`protocols. These aspects are discussed in Section 4.
`Configuration management is a vital component for the
`smooth operation of data centers but has not received
`much attention in literature. Configuration management
`is required at multiple levels, ranging from servers to ser-
`ver enclosures to the entire data center. Virtualized envi-
`ronments introduce issues of configuration management
`at a logical – rather than physical – level as well. As the
`complexity of servers, operating environments, and appli-
`cations increases, effective real-time management of large
`heterogeneous data centers becomes quite complex. These
`challenges and some approaches are discussed in Section 6.
`The increasing size of data centers not only results in
`high utility costs [1] but also leads to significant challenges
`in power and thermal management [82]. It is estimated
`that the total data center energy consumption as a percent-
`age of total US energy consumption doubled between 2000
`and 2007 and is set to double yet again by 2012. The high
`utility costs and environmental impact of such an increase
`are reasons enough to address power consumption. Addi-
`tionally, high power consumption also results in unsus-
`tainable current, power, and thermal densities, and
`inefficient usage of data center space. Dealing with
`power/thermal issues effectively requires power, cooling
`and thermal control techniques at multiple levels (e.g., de-
`vice, system, enclosure, etc.) and across multiple domains
`(e.g., hardware, OS and systems management). In many
`cases, power/thermal management impacts performance
`and thus requires a combined treatment of power and per-
`formance. These issues are discussed in Section 7.
`As data centers increase in size and criticality, they be-
`come increasingly attractive targets of attack since an iso-
`lated vulnerability can be exploited to impact a large
`number of customers and/or large amounts of sensitive
`data [14]. Thus a fundamental security challenge for data
`centers is to find workable mechanisms that can reduce
`this growth of vulnerability with size. Basically, the secu-
`rity must be implemented so that no single compromise
`can provide access to a large number of machines or large
`amount of data. Another important issue is that in a virtu-
`alized outsourced environment, it is no longer possible to
`speak of ‘‘inside” and ‘‘outside” of data center – the intrud-
`ers could well be those sharing the same physical infra-
`structure for their business purposes. Finally, the basic
`virtualization techniques themselves enhance vulnerabili-
`ties since the flexibility provided by virtualization can be
`
`easily exploited for disruption and denial of service. For
`example, any vulnerability in mapping VM level attributes
`to the physical system can be exploited to sabotage the en-
`tire system. Due to limited space, we do not, however,
`delve into security issues in this paper.
`
`3. Future directions in data center evolution
`
`Traditional data centers have evolved as large computa-
`tional facilities solely owned and operated by a single en-
`tity – commercial or otherwise. However, the forces in
`play are resulting in data centers moving towards much
`more complex ownership scenarios. For example, just as
`virtualization allows consolidation and cost savings within
`a data center, virtualization across data centers could allow
`a much higher level of aggregation. This notion leads to the
possibility of "outsourced" data centers that allow an
organization to run a large data center without having to
`own the physical infrastructure. Cloud computing, in fact,
`provides exactly such a capability except that in cloud
`computing the resources are generally obtained dynami-
`cally for short periods and underlying management of
`these resources is entirely hidden from the user. Subscrib-
`ers of virtual data centers would typically want longer-
`term arrangements and much more control over the infra-
`structure given to them. There is a move afoot to provide
`Enterprise Cloud facilities whose goals are similar to those
`discussed here [2]. The distributed virtualized data center
`model discussed here is similar to the one introduced in
`[78].
`In the following we present a 4-layer conceptual model
`of future data centers shown in Fig. 3 that subsumes a wide
`range of emergent data center implementations. In this
`depiction, rectangles refer to software layers and ellipses
refer to the resulting abstractions.

Fig. 3. Logical organization of future data centers.
`The bottom layer in this conceptual model is the Physi-
`cal Infrastructure Layer (PIL) that manages the physical
infrastructure (often known as a "server farm") installed in
a given location. Because of the increasing cost of the
power consumed, space occupied, and management per-
sonnel required, server farms are already being located
closer to sources of cheap electricity, water, land, and
manpower. These locations are by their nature geographi-
`cally removed from areas of heavy service demand, and
`thus the developments in ultra high-speed networking
`over long distances are essential enablers of such remotely
`located server farms. In addition to the management of
`physical computing hardware, the PIL can allow for lar-
`ger-scale consolidation by providing capabilities to carve
out well-isolated sections of the server farm (or "server
patches") and assign them to different "customers". In this
case, the PIL will be responsible for management of bound-
aries around the server patch in terms of security, traffic
firewalling, and reserving access bandwidth. For example,
setup and management of virtual LANs will be done by the PIL.
`The next layer is the Virtual Infrastructure Layer (VIL)
`which exploits the virtualization capabilities available in
`individual servers, network and storage elements to sup-
`port the notion of a virtual cluster, i.e., a set of virtual or real
nodes along with QoS-controlled paths to satisfy their
communication needs. In many cases, the VIL will be inter-
nal to an organization that has leased an entire physical
server patch to run its business. However, it is also con-
ceivable that VIL services are actually under the control
of the infrastructure provider, which effectively presents a virtual
`server patch abstraction to its customers. This is similar to
`cloud computing, except that the subscriber to a virtual
`server patch would expect explicit SLAs in terms of compu-
`tational, storage and networking infrastructure allocated to
`it and would need enough visibility to provide its own next
`level management required for running multiple services
`or applications.
`The third layer in our model is the Virtual Infrastructure
Coordination Layer (VICL), whose purpose is to tie together virtual
server patches across multiple physical server farms in or-
`der to create a geographically distributed virtualized data
`center (DVDC). This layer must define and manage virtual
`pipes between various virtual data centers. This layer
`would also be responsible for cross-geographic location
`application deployment, replication and migration when-
`ever that makes sense. Depending on its capabilities, VICL
`could be exploited for other purposes as well, such as
`reducing energy costs by spreading load across time-zones
`and utility rates, providing disaster or large scale failure
`tolerance, and even enabling truly large-scale distributed
`computations.
`Finally, the Service Provider Layer (SPL) is responsible for
`managing and running applications on the DVDC con-
`structed by the VICL. The SPL would require substantial
`visibility into the physical configuration, performance, la-
`tency, availability and other aspects of the DVDC so that
`it can manage the applications effectively. It is expected
`that SPL will be owned by the customer directly.
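
The nesting of the four layers can also be summarized in a small data model. The sketch below is only one possible rendering, assuming nothing beyond the layer descriptions above; all class and field names are illustrative, not part of any defined interface.

from dataclasses import dataclass, field
from typing import List

@dataclass
class ServerPatch:            # carved out by the Physical Infrastructure Layer (PIL)
    farm_location: str
    servers: int
    isolated: bool = True     # PIL enforces security, firewalling, and bandwidth boundaries

@dataclass
class VirtualCluster:         # exposed by the Virtual Infrastructure Layer (VIL)
    patch: ServerPatch
    vm_count: int
    qos_gbps: float           # QoS-controlled paths among the (virtual or real) nodes

@dataclass
class DVDC:                   # stitched together by the VI Coordination Layer (VICL)
    clusters: List[VirtualCluster] = field(default_factory=list)

    def add(self, cluster: VirtualCluster) -> None:
        self.clusters.append(cluster)   # VICL also manages inter-site virtual pipes

@dataclass
class Service:                # deployed by the Service Provider Layer (SPL)
    name: str
    dvdc: DVDC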
`The model in Fig. 3 subsumes everything from a non-
`virtualized, single location data center entirely owned by
`a single organization all the way up to a geographically dis-
`
`tributed, fully virtualized data center where each layer
`possibly has a separate owner. The latter extreme provides
`a number of advantages in terms of consolidation, agility,
`and flexibility, but it also poses a number of difficult chal-
`lenges in terms of security, SLA definition and enforce-
`ment, efficiency and issues of layer separation. For this
`reason, real data centers are likely to be limited instances
`of this general model.
`In subsequent sections, we shall address the needs of
such DVDCs when relevant, although many of the issues
`apply to traditional data centers as well.
`
`4. Data center networking
`
`4.1. Networking infrastructure in data centers
`
`The increasing complexity and sophistication of data
`center applications demands new features in the data cen-
`ter network. For clustered applications, servers often need
`to exchange inter-process communication (IPC) messages
`for synchronization and data exchange, and such messages
may require very low latency in order to reduce process
`stalls. Direct data exchange between servers may also be
`motivated by low access latency to data residing in the
`memory of another server as opposed to retrieving it from
`the local secondary storage [18]. Furthermore, mixing of
`different types of data on the same networking fabric
`may necessitate QoS mechanisms for performance isola-
`tion. These requirements have led to considerable activity
`in the design and use of low-latency specialized data cen-
`ter fabrics such as PCI-Express based backplane intercon-
`nects, InfiniBand (IBA) [37], data center Ethernet [40,7],
`and lightweight transport protocols implemented directly
`over the Ethernet layer [5]. We shall survey some of these
`developments in subsequent sections before examining
`networking challenges in data centers.
`
`
`
`2944
`
`K. Kant / Computer Networks 53 (2009) 2939–2965
`
`4.2. Overview of data center fabrics
`
In this section, we provide a brief overview of two ma-
jor network fabrics in the data center, namely Ethernet and
InfiniBand (IBA). Although Fiber Channel can be used as a
general networking fabric as well, we do not discuss it here
because of its strong storage association. Also, although the
term "Ethernet" relates only to the MAC layer, in practice,
TCP/UDP over IP over Ethernet is the more appropriate net-
work stack to compare against InfiniBand or Fiber Channel,
which specify their own network and transport layers. For
this reason, we shall speak of the "Ethernet stack" instead of
just Ethernet.
`
`4.2.1. InfiniBand
The InfiniBand architecture (IBA) was defined in the late 1990s
specifically for inter-system IO in the data center and uses
communication link semantics rather than the tradi-
tional memory read/write semantics [37]. IBA provides a
`complete fabric starting with its unique cabling/connec-
`tors, physical layer, link, network and transport layers. This
`incompatibility with Ethernet at the very basic cabling/
`connector level makes IBA (and other data center fabrics
`such as Myrinet) difficult to introduce in an existing data
`center. However, the available bridging products make it
`possible to mix IBA and Ethernet based clusters in the
`same data center.
IBA links use bit-serial, differential signaling technology
which can scale up to much higher data rates than the tradi-
tional parallel link technologies. Traditional parallel
transmission technologies (with one wire per bit) suffer
from numerous problems such as severe cross-talk and
skew between the bit timings, especially as the link speeds
increase. Differential signaling, in contrast, uses two
physical wires for each bit direction, and the difference be-
tween the two signals is used by the receiver. Each such
pair of wires is called a lane, and higher data throughput
can be achieved by using multiple lanes, each of which
can independently deliver a data frame. As link speeds
move into the multi-Gb/s range, differential bit-serial technol-
ogy is being adopted almost universally for all types of
links. It has also led to the idea of a re-purposable PHY,
wherein the same PHY layer can be configured to provide
the desired link type (e.g., IBA, PCI-Express, Fiber Channel,
Ethernet, etc.).
The generation 1 (or GEN1) bit-serial links run at
2.5 GHz but use 8b/10b encoding for robustness, which
effectively delivers 2 Gb/s in each direction. GEN2
links double this rate to 4 Gb/s. The GEN3 technology does
not use 8b/10b encoding and can provide 10 Gb/s speed
over fiber. Implementing 10 Gb/s on copper cables is very
challenging, at least for distances of more than a few me-
ters. IBA currently defines three lane widths – 1X,
4X, and 12X. Thus, IBA can provide bandwidths up to
120 Gb/s over fiber using GEN3 technology.
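
The quoted link speeds follow directly from the signaling rate, the encoding overhead, and the lane width, as the short sketch below illustrates; the function is ours and the figures simply reproduce the GEN1–GEN3 numbers above.

def iba_bandwidth_gbps(signal_rate_gbps: float,
                       encoding_efficiency: float,
                       lanes: int) -> float:
    """Per-direction data bandwidth of a multi-lane IBA link."""
    return signal_rate_gbps * encoding_efficiency * lanes

print(iba_bandwidth_gbps(2.5, 8 / 10, 1))    # GEN1 1X:   2.0 Gb/s
print(iba_bandwidth_gbps(5.0, 8 / 10, 4))    # GEN2 4X:  16.0 Gb/s
print(iba_bandwidth_gbps(10.0, 1.0, 12))     # GEN3 12X: 120.0 Gb/s (no 8b/10b)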
`In IBA, a layer-2 network (called subnet) can be built
`using IBA switches which can route messages based on
`explicitly configured routing tables. A switch can have up
`to 256 ports; therefore, most subnets use a single switch.
`For transmission, application messages are broken up into
"packets" or message transfer units (MTUs) with maxi-
`
`mum size settable to 256 bytes, 1 KB, 2 KB, or 4 KB. In sys-
`tems with mixed MTUs, subnet management provides
`endpoints with the Path MTU appropriate to reach a given
`destination. IBA switches may also optionally support mul-
`ticast which allows for message replication at the switches.
There are two important attributes for a transport proto-
col: (a) connectionless (datagram) vs. connection-oriented
transfer, and (b) reliable vs. unreliable transmission. In
the Ethernet context, only the reliable connection-oriented
service (TCP) and the unreliable datagram service (UDP) are
supported. IBA supports all four combinations, known
respectively as Reliable Connection (RC), Reliable Datagram
(RD), Unreliable Connection (UC), and Unreliable Datagram
(UD). All of these are provided in HW (unlike TCP/UDP)
and operate entirely in user mode, which eliminates unnec-
essary context switches and data copies.
IBA implements the "virtual interface" (VI) – an abstract
communication architecture defined in terms of a per-con-
nection send–receive queue pair (QP) in each direction and
a completion queue that holds the operation completion
notifications. The completion queue may be exploited for
handling completions in a flexible manner (e.g., polling,
or interrupts with or without batching). The virtual inter-
face is intended to handle much of the IO operation in user
mode and thereby avoid the OS kernel bottleneck. In order
to facilitate orderly usage of the same physical device (e.g.,
NIC) by multiple processes, a VI-capable device needs to
support a "doorbell" mechanism, by which each process can
post the descriptor of a new send/receive operation to the
device. The device also needs to support virtual-to-physical
(VtP) address translation in HW so that OS intervention is
avoided. Some OS involvement is still required, such as
for registration of buffers, setup of VtP translation tables,
and interrupt handling. The Virtual Interface Architecture (VIA)
is a well-known networking standard that implements these
ideas [13].
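
The queue-pair machinery can be pictured with a toy model: a process posts a work descriptor, "rings the doorbell", and later polls the completion queue, all without entering the kernel. The sketch below is purely illustrative and does not follow any real verbs API.

from collections import deque

class QueuePair:
    def __init__(self):
        self.send_q: deque = deque()
        self.completion_q: deque = deque()

    def post_send(self, descriptor: dict) -> None:
        """Post a work descriptor and 'ring the doorbell' on the device."""
        self.send_q.append(descriptor)
        self._doorbell()

    def _doorbell(self) -> None:
        # In real HW the device fetches the descriptor, performs the transfer
        # using its VtP translation tables, and then posts a completion.
        work = self.send_q.popleft()
        self.completion_q.append({"wr_id": work["wr_id"], "status": "ok"})

    def poll_completion(self):
        """User-mode polling; interrupts (with or without batching) are also possible."""
        return self.completion_q.popleft() if self.completion_q else None

qp = QueuePair()
qp.post_send({"wr_id": 1, "buffer": b"payload"})
print(qp.poll_completion())   # {'wr_id': 1, 'status': 'ok'}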
`VI is a key technology in light-weight user-mode IO and
`forms the basis for implementing Remote Direct Memory Ac-
`cess (RDMA) capability that allows a chunk of data to be
`moved directly from the source application buffer to the
`destination application’s buffer. Such transfers can occur
`with very low-latency since they do not involve any OS
`intervention or copies. Enabling the direct transfer does re-
`quire a setup phase where the two ends communicate in or-
`der to register and pin the buffers on either side and do the
`appropriate access control checks. RDMA is appropriate for
`sustained communication between two applications and
`for transfers of large amounts of data since the setup over-
`head gets amortized over a large number of operations [6].
`RDMA has been exploited for high performance implemen-
`tations of a variety of functionalities such as virtual ma-
`chine migration [19], implementation of MPI (message
`passing interface) [28], and network attached storage [33].
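
A rough numerical sketch of the amortization argument: with an assumed one-time setup cost for buffer registration, pinning, and access checks, RDMA's per-operation cost falls below that of a conventional OS-mediated transfer only once many operations share the same setup. All latency numbers below are invented for illustration.

setup_cost_us = 200.0        # one-time registration/pinning handshake (assumed)
rdma_per_op_us = 2.0         # kernel-bypass transfer (assumed)
send_recv_per_op_us = 12.0   # conventional OS-mediated transfer (assumed)

for n_ops in (10, 1_000, 100_000):
    rdma = (setup_cost_us + n_ops * rdma_per_op_us) / n_ops
    print(f"{n_ops:>7} ops: RDMA {rdma:6.2f} us/op vs. send/recv "
          f"{send_recv_per_op_us:.2f} us/op")
# RDMA loses for a handful of operations but wins decisively for sustained
# communication or bulk transfers, as the setup cost is amortized away.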
One interesting feature of the IBA link layer is the notion of
"virtual lanes" (VLs).1 A maximum of 15 virtual lanes (VL0–
`VL14) can be specified for carrying normal traffic over an IBA
`link (VL15 is reserved for management traffic). A VL can be
`
1 A virtual lane has nothing to do with the physical "lane" concept
`discussed above in the context of bit-serial links.
`
`
`
`
designated as either high priority or low priority, and all VLs
belonging to one priority level can be further differentiated
by weighted round-robin scheduling among them. Mes-
sages are actually marked with a more abstract concept
called the Service Level (SL), and an SL-to-VL mapping table is used
at each switch to decide how to carry the message. This al-
lows messages to pass through multiple switches, each with
a different number of VLs supported.
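
A small sketch of the mechanism: each switch consults its own SL-to-VL table to pick an output VL, and arbitrates among the VLs of a priority class by weighted round-robin. The table contents and weights below are made-up examples; only the mechanism follows the description above.

from itertools import cycle

SL_TO_VL = {0: 0, 1: 0, 2: 1, 3: 2, 4: 3}   # this switch supports VL0-VL3
WRR_WEIGHTS = {0: 1, 1: 2, 2: 4}            # weights for the low-priority VLs

def output_vl(service_level: int) -> int:
    """Map a packet's Service Level to a VL supported by this switch."""
    return SL_TO_VL.get(service_level, 0)

# Expand the weights into a static weighted round-robin schedule for the
# low-priority class (high-priority VLs would simply be served first).
wrr_schedule = cycle([vl for vl, w in WRR_WEIGHTS.items() for _ in range(w)])

print(output_vl(3))                             # SL 3 -> VL 2 on this switch
print([next(wrr_schedule) for _ in range(7)])   # [0, 1, 1, 2, 2, 2, 2]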
IBA uses credit-based flow control for congestion man-
agement on a per-virtual-lane basis in order to avoid packet
losses [3,36]. A VL receiver periodically grants "credits" to
the sender for sending more data. As congestion devel-
ops, the sender will have fewer credits and will be forced to
slow down. However, such a mechanism suffers from two
problems: (a) it is basically a "back-pressure" mechanism
and could take a long time to squeeze flows that transit a
large number of hops, and (b) if multiple flows use the
same virtual lane, the flows that are not responsible for
the congestion could also get shut down unnecessarily.
To address these issues, IBA supports a Forward Explicit
Congestion Notification (FECN) mechanism coupled with
endpoint rate control.
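
The credit mechanism itself is simple enough to capture in a toy model: the receiver grants credits as its buffers drain, while the sender consumes one credit per packet and stalls at zero; exhausted credits are also how a node can later tell whether it is a source rather than a victim of congestion. The numbers below are illustrative assumptions.

class CreditedLink:
    def __init__(self, initial_credits: int = 8):
        self.credits = initial_credits

    def grant(self, n: int) -> None:
        """Receiver periodically returns credits as its buffers drain."""
        self.credits += n

    def try_send(self) -> bool:
        """Send one packet if a credit is available; otherwise back-pressure."""
        if self.credits == 0:
            return False          # sender must slow down
        self.credits -= 1
        return True

link = CreditedLink(initial_credits=2)
print([link.try_send() for _ in range(3)])   # [True, True, False]
link.grant(1)
print(link.try_send())                       # True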
FECN tags packets as they experience congestion on
the way to the destination, and this information is returned
in the acknowledgement packets to the source. Up
`to 16 congestion levels are supported by IBA. A node
`receiving congestion notification can tell if it is the source
`or victim of congestion by checking whether its credits
`have already been exhausted. A congestion source controls
`the traffic injection rate successively based on received
`congestion ind