Computer Networks 53 (2009) 2939–2965
Data center evolution
A tutorial on state of the art, issues, and challenges

Krishna Kant
Intel Corporation, Hillsboro, Oregon, USA
E-mail address: krishna.kant@intel.com

Keywords: Data center; Virtualization; InfiniBand; Ethernet; Solid state storage; Power management

Abstract
Data centers form a key part of the infrastructure upon which a variety of information technology services are built. As data centers continue to grow in size and complexity, it is desirable to understand aspects of their design that are worthy of carrying forward, as well as existing or upcoming shortcomings and challenges that would have to be addressed. We envision the data center evolving from owned physical entities to potentially outsourced, virtualized and geographically distributed infrastructures that still attempt to provide the same level of control and isolation that owned infrastructures do. We define a layered model for such data centers and provide a detailed treatment of the state of the art and emerging challenges in storage, networking, management and power/thermal aspects.

© 2009 Published by Elsevier B.V. doi:10.1016/j.comnet.2009.10.004

1. Introduction

Data centers form the backbone of a wide variety of services offered via the Internet, including Web hosting, e-commerce, social networking, and a variety of more general services such as software as a service (SaaS), platform as a service (PaaS), and grid/cloud computing. Some examples of these generic service platforms are Microsoft's Azure platform, Google App Engine, Amazon's EC2 platform and Sun's Grid Engine. Virtualization is the key to providing many of these services and is being increasingly used within data centers to achieve better server utilization and more flexible resource allocation. However, virtualization also makes many aspects of data center management more challenging.

As the complexity, variety, and penetration of such services grows, data centers will continue to grow and proliferate. Several forces are shaping the data center landscape, and we expect future data centers to be a lot more than simply bigger versions of those existing today. These emerging trends – more fully discussed in Section 3 – are expected to turn data centers into distributed, virtualized, multi-layered infrastructures that pose a variety of difficult challenges.
In this paper, we provide a tutorial coverage of a variety of emerging issues in designing and managing large virtualized data centers. In particular, we consider a layered model of virtualized data centers and discuss storage, networking, management, and power/thermal issues for such a model. Because of the vastness of the space, we shall avoid detailed treatment of certain well researched issues. In particular, we do not delve into the intricacies of virtualization techniques, virtual machine migration and scheduling in virtualized environments.

The organization of the paper is as follows. Section 2 discusses the organization of a data center and points out several challenging areas in data center management. Section 3 discusses emerging trends in data centers and new issues posed by them. Subsequent sections then discuss specific issues in detail, including storage, networking, management and power/thermal issues. Finally, Section 8 summarizes the discussion.
2. Data center organization and issues

2.1. Rack-level physical organization

A data center is generally organized in rows of "racks" where each rack contains modular assets such as servers, switches, storage "bricks", or specialized appliances, as shown in Fig. 1. A standard rack is 78 in. high, 23–25 in. wide and 26–30 in. deep. Typically, each rack takes a number of modular "rack mount" assets inserted horizontally into the rack. The asset thickness is measured in a unit called "U", which is 45 mm (or approximately 1.8 in.). An overwhelming majority of servers are single- or dual-socket machines and can fit in the 1U size, but larger ones (e.g., 4-socket multiprocessors) may require 2U or larger sizes. A standard rack can take a total of 42 1U assets when completely filled. The sophistication of the rack itself may vary greatly – in the simplest case, it is nothing more than a metal enclosure. Additional features may include rack power distribution, a built-in KVM (keyboard–video–mouse) switch, rack-level air or liquid cooling, and perhaps even a rack-level management unit.
For greater compactness and functionality, servers can be housed in a self-contained chassis which itself slides into the rack. With a 13 in. high chassis, six chassis can fit into a single rack. A chassis comes complete with its own power supply, fans, backplane interconnect, and management infrastructure. The chassis provides standard size slots where one could insert modular assets (usually known as blades). A single chassis can hold up to 16 1U servers, thereby providing a theoretical rack capacity of 96 modular assets.

The substantial increase in server density achievable by using the blade form factor results in a corresponding increase in per-rack power consumption which, in turn, can seriously tax the power delivery infrastructure. In particular, many older data centers are designed with about a 7 kW per-rack power rating, whereas racks loaded with blade servers could approach 21 kW. There is a similar issue with respect to thermal density – the cooling infrastructure may be unable to handle the offered thermal load. The net result is that it may be impossible to load the racks to their capacity. For some applications, a fully loaded rack may not offer the required peak network or storage bandwidth (BW) either, thereby requiring careful management of resources to stay within the BW limits.

Fig. 1. Physical organization of a data center.
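To make the rack-level arithmetic above concrete, the following back-of-the-envelope sketch (Python) checks how a rack's power rating constrains its loading. Only the 42-slot and 96-blade counts and the 7 kW rating come from the text; the per-server wattages are illustrative assumptions.

    # Back-of-the-envelope rack loading check; wattages are assumed, not from the text.
    RACK_POWER_BUDGET_KW = 7.0     # older facility rating cited above
    SERVERS_PER_RACK_1U = 42       # 42 x 1U rack-mount servers
    BLADES_PER_RACK = 96           # 6 chassis x 16 blades

    WATTS_PER_1U_SERVER = 300.0    # assumed draw of a 1U server
    WATTS_PER_BLADE = 220.0        # assumed draw of a blade

    def max_assets_within_budget(watts_per_asset, budget_kw=RACK_POWER_BUDGET_KW):
        """How many assets fit under the rack's power budget."""
        return int(budget_kw * 1000 // watts_per_asset)

    print(f"Fully loaded 1U rack:    {SERVERS_PER_RACK_1U * WATTS_PER_1U_SERVER / 1000:.1f} kW")
    print(f"Fully loaded blade rack: {BLADES_PER_RACK * WATTS_PER_BLADE / 1000:.1f} kW")  # ~21 kW
    print("1U servers within a 7 kW budget:", max_assets_within_budget(WATTS_PER_1U_SERVER))
    print("Blades within a 7 kW budget:    ", max_assets_within_budget(WATTS_PER_BLADE))

With these assumed wattages, a fully loaded blade rack lands near the 21 kW figure quoted above, while the 7 kW budget admits roughly half of the 1U slots and only about a third of the blade slots.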
2.2. Storage and networking infrastructure

Storage in data centers may be provided in multiple ways. Often the high performance storage is housed in special "storage towers" that allow transparent remote access to the storage irrespective of the number and types of physical storage devices used. Storage may also be provided in smaller "storage bricks" located in rack or chassis slots or directly integrated with the servers. In all cases, efficient network access to the storage is crucial.
A data center typically requires four types of network access, and could potentially use four different types of physical networks. The client–server network provides external access into the data center, and necessarily uses a commodity technology such as wired Ethernet or wireless LAN. The server-to-server network provides high-speed communication between servers and may use Ethernet, InfiniBand (IBA) or other technologies. Storage access has traditionally been provided by Fibre Channel but could also use Ethernet or InfiniBand. Finally, the network used for management is also typically Ethernet but may either use separate cabling or exist as a "sideband" on the mainstream network.
Both the mainstream and storage networks typically follow an identical configuration. For blade servers mounted in a chassis, the chassis provides a switch through which all the servers in the chassis connect to outside servers. The switches are duplexed for reliability and may be arranged for load sharing when both switches are working. In order to keep the network manageable, the overall topology is basically a tree with full connectivity at the root level. For example, each chassis-level (or level 1) switch has an uplink leading to a level 2 switch, so that communication between two servers in different chassis must go through at least three switches. Depending on the size of the data center, the multiple level 2 switches may either be connected into a full mesh, or go through one or more level 3 switches. The biggest issue with such a structure is potential bandwidth inadequacy at the higher levels. Generally, uplinks are designed for a specific oversubscription ratio since providing full bisection bandwidth is usually not feasible. For example, 20 servers, each with a 1 Gb/s Ethernet link, may share a single 10 Gb/s Ethernet uplink for an oversubscription ratio of 2.0. This may be troublesome if the workload mapping is such that there is substantial non-local communication. Since storage is traditionally provided in a separate storage tower, all storage traffic usually crosses the chassis uplink on the storage network. As data centers grow in size, a more scalable network architecture becomes necessary.
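The oversubscription arithmetic in the example above is just the ratio of aggregate edge bandwidth to uplink bandwidth; a one-line helper restates it with the figures from the text.

    def oversubscription_ratio(num_servers, server_link_gbps, uplink_gbps):
        """Aggregate edge bandwidth divided by uplink bandwidth."""
        return (num_servers * server_link_gbps) / uplink_gbps

    # 20 servers with 1 Gb/s links sharing a single 10 Gb/s uplink, as in the text.
    print(oversubscription_ratio(20, 1, 10))   # -> 2.0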
2.3. Management infrastructure

Each server usually carries a management controller called the BMC (baseboard management controller). The management network terminates at the BMC of each server. When the management network is implemented as a "sideband" network, no additional switches are required for it; otherwise, a management switch is required in each chassis/rack to support external communication. The basic functions of the BMC include monitoring of various hardware sensors, managing various hardware and software alerts, booting up and shutting down the server, maintaining configuration data of various devices and drivers, and providing remote management capabilities. Each chassis or rack may itself sport its own higher level management controller which communicates with the lower level controller.
Configuration management is a rather generic term and can refer to management of the parameter settings of a variety of objects that are of interest in effectively utilizing the computer system infrastructure, from individual devices up to complex services running on large networked clusters. Some of this management clearly belongs to the baseboard management controller (BMC) or the corresponding higher level management chain. This is often known as out-of-band (OOB) management since it is done without the involvement of the main CPU or the OS. Other activities may be more appropriate for in-band management and may be done by the main CPU in hardware, in the OS, or in the middleware. The higher level management may run on separate systems that have both in-band and OOB interfaces. On a server, the most critical OOB functions belong to the pre-boot phase and to monitoring of server health while the OS is running. On other assets such as switches, routers, and storage bricks, the management is necessarily OOB.
2.4. Electrical and cooling infrastructure

Even medium-sized data centers can sport a peak power consumption of several megawatts or more. For such power loads, it becomes necessary to supply power using high voltage lines (e.g., 33 kV, 3 phase) and step it down on premises to the 280–480 V (3 phase) range for routing through the uninterruptible power supply (UPS). The UPS unit needs to convert AC to DC to charge its batteries and then convert DC back to AC on the output end. Since the UPS unit sits directly in the power path, it can continue to supply output power uninterrupted in case of input power loss. The output of the UPS (usually 240/120 V, single phase) is routed to the power distribution unit (PDU) which, in turn, supplies power to individual rack-mounted servers or blade chassis. Next the power is stepped down, converted from AC to DC, and partially regulated in order to yield the typical 12 and 5 V outputs with the desired current ratings (20–100 A). These voltages are delivered to the motherboard, where the voltage regulators (VRs) must convert them to as many voltage rails as the server design demands. For example, in an IBM blade server, the supported voltage rails include five to six additional rails (ranging from 3.3 V down to 1.1 V), in addition to the 12 V and 5 V rails.
Each one of these power conversion/distribution stages results in power loss, with some stages showing efficiencies in the 85–95% range or worse. It is thus not surprising that the cumulative power efficiency by the time we get down to the voltage rails on the motherboard is only 50% or less (excluding cooling, lighting, and other auxiliary power uses). Thus there is significant scope for gaining power efficiency through a better design of the power distribution and conversion infrastructure.
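The cumulative figure quoted above is simply the product of the per-stage efficiencies. The per-stage numbers below are hypothetical (the text gives only the 85–95% range and the roughly 50% end result), but they show how quickly the product falls.

    from functools import reduce

    def end_to_end_efficiency(stage_efficiencies):
        """Delivered/input power ratio for a chain of conversion stages."""
        return reduce(lambda a, b: a * b, stage_efficiencies, 1.0)

    # Assumed efficiencies: UPS (AC-DC-AC), PDU/step-down, server power supply, voltage regulators.
    stages = [0.88, 0.93, 0.75, 0.80]
    eta = end_to_end_efficiency(stages)
    print(f"Cumulative efficiency: {eta:.0%}")                     # ~49%
    print(f"Power lost per 1 kW drawn: {1000 * (1 - eta):.0f} W")  # ~509 W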
The cooling infrastructure in a data center can be quite elaborate and expensive, involving building-level air-conditioning units requiring large chiller plants, fans and air recirculation systems. Evolving cooling technologies tend to emphasize more localized cooling or try to simplify the cooling infrastructure. The server racks are generally placed on a raised plenum and arranged in alternately back-facing and front-facing aisles as shown in Fig. 2. Cold air is forced up in the front-facing aisles and the server or chassis fans draw the cold air through the server to the back. The hot air at the back then rises and is directed (sometimes by using deflectors) towards the chiller plant for cooling and recirculation. This basic setup is not expensive, but it can create hot spots either due to uneven cooling or the mixing of hot and cold air.

Fig. 2. Cooling in a data center.
2.5. Major data center issues

Data center applications increasingly involve access to massive data sets, real-time data mining, and streaming media delivery that place heavy demands on the storage infrastructure. Efficient access to large amounts of storage necessitates not only high performance file systems but also high performance storage technologies such as solid-state storage (SSD) media. These issues are discussed in Section 5. Streaming large amounts of data (from disks or SSDs) also requires high-speed, low-latency networks. In clustered applications, the inter-process communication (IPC) often involves rather small messages but with very low-latency requirements. These applications may also use remote main memories as "network caches" of data and thus tax the networking capabilities. It is much cheaper to carry all types of data – client–server, IPC, storage and perhaps management – on the same physical fabric such as Ethernet. However, doing so requires sophisticated QoS capabilities that are not necessarily available in existing protocols. These aspects are discussed in Section 4.
Configuration management is a vital component for the smooth operation of data centers but has not received much attention in the literature. Configuration management is required at multiple levels, ranging from servers to server enclosures to the entire data center. Virtualized environments introduce issues of configuration management at a logical – rather than physical – level as well. As the complexity of servers, operating environments, and applications increases, effective real-time management of large heterogeneous data centers becomes quite complex. These challenges and some approaches are discussed in Section 6.
The increasing size of data centers not only results in high utility costs [1] but also leads to significant challenges in power and thermal management [82]. It is estimated that the total data center energy consumption as a percentage of total US energy consumption doubled between 2000 and 2007 and is set to double yet again by 2012. The high utility costs and environmental impact of such an increase are reasons enough to address power consumption. Additionally, high power consumption also results in unsustainable current, power, and thermal densities, and inefficient usage of data center space. Dealing with power/thermal issues effectively requires power, cooling and thermal control techniques at multiple levels (e.g., device, system, enclosure, etc.) and across multiple domains (e.g., hardware, OS and systems management). In many cases, power/thermal management impacts performance and thus requires a combined treatment of power and performance. These issues are discussed in Section 7.
As data centers increase in size and criticality, they become increasingly attractive targets of attack, since an isolated vulnerability can be exploited to impact a large number of customers and/or large amounts of sensitive data [14]. Thus a fundamental security challenge for data centers is to find workable mechanisms that can reduce this growth of vulnerability with size. Basically, security must be implemented so that no single compromise can provide access to a large number of machines or a large amount of data. Another important issue is that in a virtualized, outsourced environment, it is no longer possible to speak of the "inside" and "outside" of the data center – the intruders could well be those sharing the same physical infrastructure for their own business purposes. Finally, the basic virtualization techniques themselves enhance vulnerabilities, since the flexibility provided by virtualization can be easily exploited for disruption and denial of service. For example, any vulnerability in mapping VM-level attributes to the physical system can be exploited to sabotage the entire system. Due to limited space, we do not, however, delve into security issues in this paper.
3. Future directions in data center evolution

Traditional data centers have evolved as large computational facilities solely owned and operated by a single entity – commercial or otherwise. However, the forces in play are resulting in data centers moving towards much more complex ownership scenarios. For example, just as virtualization allows consolidation and cost savings within a data center, virtualization across data centers could allow a much higher level of aggregation. This notion leads to the possibility of "outsourced" data centers that allow an organization to run a large data center without having to own the physical infrastructure. Cloud computing, in fact, provides exactly such a capability, except that in cloud computing the resources are generally obtained dynamically for short periods and the underlying management of these resources is entirely hidden from the user. Subscribers of virtual data centers would typically want longer-term arrangements and much more control over the infrastructure given to them. There is a move afoot to provide Enterprise Cloud facilities whose goals are similar to those discussed here [2]. The distributed virtualized data center model discussed here is similar to the one introduced in [78].
In the following we present a 4-layer conceptual model of future data centers, shown in Fig. 3, that subsumes a wide range of emergent data center implementations. In this depiction, rectangles refer to software layers and ellipses refer to the resulting abstractions.

Fig. 3. Logical organization of future data centers.
The bottom layer in this conceptual model is the Physical Infrastructure Layer (PIL) that manages the physical infrastructure (often known as a "server farm") installed in a given location. Because of the increasing cost of the power consumed, space occupied, and management personnel required, server farms are already being located closer to sources of cheap electricity, water, land, and manpower. These locations are by their nature geographically removed from areas of heavy service demand, and thus the developments in ultra high-speed networking over long distances are essential enablers of such remotely located server farms. In addition to the management of physical computing hardware, the PIL can allow for larger-scale consolidation by providing capabilities to carve out well-isolated sections of the server farm (or "server patches") and assign them to different "customers." In this case, the PIL will be responsible for management of the boundaries around the server patch in terms of security, traffic firewalling, and reserving access bandwidth. For example, set up and management of virtual LANs will be done by the PIL.
The next layer is the Virtual Infrastructure Layer (VIL), which exploits the virtualization capabilities available in individual servers, network and storage elements to support the notion of a virtual cluster, i.e., a set of virtual or real nodes along with QoS-controlled paths to satisfy their communication needs. In many cases, the VIL will be internal to an organization that has leased an entire physical server patch to run its business. However, it is also conceivable that VIL services are actually under the control of the infrastructure provider, which effectively presents a virtual server patch abstraction to its customers. This is similar to cloud computing, except that the subscriber to a virtual server patch would expect explicit SLAs in terms of the computational, storage and networking infrastructure allocated to it and would need enough visibility to provide its own next level management required for running multiple services or applications.
The third layer in our model is the Virtual Infrastructure Coordination Layer (VICL), whose purpose is to tie together virtual server patches across multiple physical server farms in order to create a geographically distributed virtualized data center (DVDC). This layer must define and manage virtual pipes between the various virtual data centers. This layer would also be responsible for cross-geographic-location application deployment, replication and migration whenever that makes sense. Depending on its capabilities, the VICL could be exploited for other purposes as well, such as reducing energy costs by spreading load across time-zones and utility rates, providing disaster or large-scale failure tolerance, and even enabling truly large-scale distributed computations.
Finally, the Service Provider Layer (SPL) is responsible for managing and running applications on the DVDC constructed by the VICL. The SPL would require substantial visibility into the physical configuration, performance, latency, availability and other aspects of the DVDC so that it can manage the applications effectively. It is expected that the SPL will be owned by the customer directly.
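As a minimal sketch of this layering, the Python fragment below captures which abstraction each layer exports (server patch, virtual cluster, DVDC). The class and field names are invented for illustration; the paper defines no such API.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class ServerPatch:                   # exported by the Physical Infrastructure Layer
        farm_location: str
        num_servers: int
        access_bandwidth_gbps: float     # reserved at the patch boundary

    @dataclass
    class VirtualCluster:                # exported by the Virtual Infrastructure Layer
        patch: ServerPatch
        virtual_nodes: int
        qos_path_gbps: float             # QoS-controlled inter-node bandwidth

    @dataclass
    class DistributedVirtualDataCenter:  # assembled by the coordination layer (VICL)
        clusters: List[VirtualCluster] = field(default_factory=list)

        def total_nodes(self) -> int:
            return sum(c.virtual_nodes for c in self.clusters)

    # The Service Provider Layer would deploy applications against a DVDC object.
    dvdc = DistributedVirtualDataCenter([
        VirtualCluster(ServerPatch("farm-west", 512, 40.0), 128, 10.0),
        VirtualCluster(ServerPatch("farm-east", 256, 20.0), 64, 10.0),
    ])
    print(dvdc.total_nodes())   # 192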
The model in Fig. 3 subsumes everything from a non-virtualized, single-location data center entirely owned by a single organization all the way up to a geographically distributed, fully virtualized data center where each layer possibly has a separate owner. The latter extreme provides a number of advantages in terms of consolidation, agility, and flexibility, but it also poses a number of difficult challenges in terms of security, SLA definition and enforcement, efficiency, and issues of layer separation. For this reason, real data centers are likely to be limited instances of this general model.

In subsequent sections, we shall address the needs of such DVDCs where relevant, although many of the issues apply to traditional data centers as well.
4. Data center networking

4.1. Networking infrastructure in data centers

The increasing complexity and sophistication of data center applications demands new features in the data center network. For clustered applications, servers often need to exchange inter-process communication (IPC) messages for synchronization and data exchange, and such messages may require very low latency in order to reduce process stalls. Direct data exchange between servers may also be motivated by the lower access latency of data residing in the memory of another server as opposed to retrieving it from local secondary storage [18]. Furthermore, mixing of different types of data on the same networking fabric may necessitate QoS mechanisms for performance isolation. These requirements have led to considerable activity in the design and use of low-latency specialized data center fabrics such as PCI-Express based backplane interconnects, InfiniBand (IBA) [37], data center Ethernet [40,7], and lightweight transport protocols implemented directly over the Ethernet layer [5]. We shall survey some of these developments in subsequent sections before examining networking challenges in data centers.
4.2. Overview of data center fabrics

In this section, we provide a brief overview of the two major network fabrics in the data center, namely Ethernet and InfiniBand (IBA). Although Fibre Channel can be used as a general networking fabric as well, we do not discuss it here because of its strong storage association. Also, although the term "Ethernet" relates only to the MAC layer, in practice, TCP/UDP over IP over Ethernet is the more appropriate network stack to compare against InfiniBand or Fibre Channel, which specify their own network and transport layers. For this reason, we shall speak of the "Ethernet stack" instead of just Ethernet.
`
`4.2.1. InfiniBand
`InfiniBand architecture (IBA) was defined in late 1990s
`specifically for inter-system IO in the data center and uses
`the communication link semantics rather than the tradi-
`tional memory read/write semantics [37]. IBA provides a
`complete fabric starting with its unique cabling/connec-
`tors, physical layer, link, network and transport layers. This
`incompatibility with Ethernet at the very basic cabling/
`connector level makes IBA (and other data center fabrics
`such as Myrinet) difficult to introduce in an existing data
`center. However, the available bridging products make it
`possible to mix IBA and Ethernet based clusters in the
`same data center.
IBA links use bit-serial, differential signaling technology which can scale up to much higher data rates than traditional parallel link technologies. Traditional parallel transmission technologies (with one wire per bit) suffer from numerous problems such as severe cross-talk and skew between the bit timings, especially as link speeds increase. Differential signaling, in contrast, uses two physical wires for each bit direction, and the difference between the two signals is used by the receiver. Each such pair of wires is called a lane, and higher data throughput can be achieved by using multiple lanes, each of which can independently deliver a data frame. As link speeds move into the multi-Gb/s range, differential bit-serial technology is being adopted almost universally for all types of links. It has also led to the idea of a re-purposable PHY, wherein the same PHY layer can be configured to provide the desired link type (e.g., IBA, PCI-Express, Fibre Channel, Ethernet, etc.).
Generation 1 (or GEN1) bit-serial links run at 2.5 GHz but use an 8b/10b encoding for robustness, which effectively delivers a 2 Gb/s data rate in each direction. GEN2 links double this rate to 4 Gb/s. The GEN3 technology does not use 8b/10b encoding and can provide 10 Gb/s speed over fiber. Implementing 10 Gb/s on copper cables is very challenging, at least for distances of more than a few meters. IBA currently defines three lane widths – 1X, 4X, and 12X. Thus, IBA can provide bandwidths up to 120 Gb/s over fiber using GEN3 technology.
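These rates follow directly from the per-lane signaling rate, the 8b/10b coding overhead, and the lane count. The small helper below makes the arithmetic explicit; treating the GEN2 per-lane signaling rate as 5 GHz is an assumption consistent with the doubling described above.

    def iba_data_rate_gbps(lane_signaling_ghz, lanes, encoded_8b10b=True):
        """Per-direction data rate: lane rate x lane count, less 8b/10b overhead if used."""
        raw = lane_signaling_ghz * lanes
        return raw * 0.8 if encoded_8b10b else raw

    print(iba_data_rate_gbps(2.5, 1))            # GEN1 1X  -> 2.0 Gb/s
    print(iba_data_rate_gbps(5.0, 4))            # GEN2 4X  -> 16.0 Gb/s
    print(iba_data_rate_gbps(10.0, 12, False))   # GEN3 12X -> 120.0 Gb/s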
In IBA, a layer-2 network (called a subnet) can be built using IBA switches which can route messages based on explicitly configured routing tables. A switch can have up to 256 ports; therefore, most subnets use a single switch. For transmission, application messages are broken up into "packets" or message transfer units (MTUs), with the maximum size settable to 256 bytes, 1 KB, 2 KB, or 4 KB. In systems with mixed MTUs, subnet management provides endpoints with the path MTU appropriate to reach a given destination. IBA switches may also optionally support multicast, which allows for message replication at the switches.
There are two important attributes of a transport protocol: (a) connectionless (datagram) vs. connection-oriented transfer, and (b) reliable vs. unreliable transmission. In the Ethernet context, only the reliable connection-oriented service (TCP) and the unreliable datagram service (UDP) are supported. IBA supports all four combinations, known respectively as Reliable Connection (RC), Reliable Datagram (RD), Unreliable Connection (UC), and Unreliable Datagram (UD). All of these are provided in hardware (unlike TCP/UDP) and operate entirely in user mode, which eliminates unnecessary context switches and data copies.
IBA implements the "virtual interface" (VI) – an abstract communication architecture defined in terms of a per-connection send–receive queue pair (QP) in each direction and a completion queue that holds operation completion notifications. The completion queue may be exploited for handling completions in a flexible manner (e.g., polling, or interrupts with or without batching). The virtual interface is intended to handle much of the IO operation in user mode and thereby avoid the OS kernel bottleneck. In order to facilitate orderly usage of the same physical device (e.g., NIC) by multiple processes, a VI-capable device needs to support a "doorbell", by which each process can post the descriptor of a new send/receive operation to the device. The device also needs to support virtual-to-physical (VtP) address translation in hardware so that OS intervention is avoided. Some OS involvement is still required, such as for registration of buffers, setup of VtP translation tables, and interrupt handling. The Virtual Interface Architecture (VIA) is a well known networking standard that implements these ideas [13].
VI is a key technology in light-weight user-mode IO and forms the basis for implementing the Remote Direct Memory Access (RDMA) capability that allows a chunk of data to be moved directly from the source application buffer to the destination application's buffer. Such transfers can occur with very low latency since they do not involve any OS intervention or copies. Enabling the direct transfer does require a setup phase where the two ends communicate in order to register and pin the buffers on either side and do the appropriate access control checks. RDMA is appropriate for sustained communication between two applications and for transfers of large amounts of data, since the setup overhead gets amortized over a large number of operations [6]. RDMA has been exploited for high performance implementations of a variety of functionalities such as virtual machine migration [19], implementation of MPI (message passing interface) [28], and network attached storage [33].
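To make the queue-pair and completion-queue semantics concrete, here is a toy Python model of posting a send descriptor and polling for its completion. It is not the verbs API; the class and method names are invented, and the process() step merely stands in for what the NIC would do in hardware.

    from collections import deque

    class CompletionQueue:
        def __init__(self):
            self._events = deque()

        def push(self, wr_id, status="success"):
            self._events.append((wr_id, status))

        def poll(self):
            """Non-blocking poll, as an application might do instead of taking an interrupt."""
            return self._events.popleft() if self._events else None

    class QueuePair:
        def __init__(self, cq):
            self.send_queue = deque()
            self.recv_queue = deque()
            self.cq = cq

        def post_send(self, wr_id, buffer):
            # "Ringing the doorbell": hand a work descriptor to the device in user mode.
            self.send_queue.append((wr_id, buffer))

        def process(self):
            # Stand-in for the NIC draining the send queue and reporting completions.
            while self.send_queue:
                wr_id, _ = self.send_queue.popleft()
                self.cq.push(wr_id)

    cq = CompletionQueue()
    qp = QueuePair(cq)
    qp.post_send(wr_id=1, buffer=b"hello")
    qp.process()
    print(cq.poll())   # -> (1, 'success')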
One interesting feature of the IBA link layer is the notion of "virtual lanes" (VLs).¹ A maximum of 15 virtual lanes (VL0–VL14) can be specified for carrying normal traffic over an IBA link (VL15 is reserved for management traffic). A VL can be designated as either high priority or low priority, and all VLs belonging to one priority level can be further differentiated by weighted round-robin scheduling among them. Messages are actually marked with a more abstract concept called the Service Level (SL), and an SL-to-VL mapping table is used at each switch to decide how to carry the message. This allows messages to pass through multiple switches, each with a different number of VLs supported.

¹ A virtual lane has nothing to do with the physical "lane" concept discussed above in the context of bit-serial links.
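A simplified sketch of how a switch might apply an SL-to-VL table and a two-level priority scheme is shown below. The table contents and weights are invented, and the ordering shown is a crude stand-in for the actual weighted round-robin arbitration.

    # Toy SL-to-VL mapping and VL prioritization (illustrative values only).
    SL_TO_VL = {0: 0, 1: 0, 2: 1, 3: 2}          # service level -> virtual lane
    HIGH_PRIORITY_VLS = {2}
    WRR_WEIGHTS = {0: 3, 1: 1, 2: 1}             # scheduling weights per VL

    def output_vl(service_level):
        """Look up the egress VL for a packet marked with a given SL."""
        return SL_TO_VL[service_level]

    def arbitration_order(vls_with_traffic):
        """High-priority VLs drain first; weights order the rest (simplified, not true WRR)."""
        return sorted(vls_with_traffic,
                      key=lambda vl: (vl not in HIGH_PRIORITY_VLS, -WRR_WEIGHTS[vl]))

    print(output_vl(3))                   # packet with SL 3 -> VL 2
    print(arbitration_order([0, 1, 2]))   # -> [2, 0, 1]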
IBA uses credit-based flow control for congestion management on a per-virtual-lane basis in order to avoid packet losses [3,36]. A VL receiver periodically grants "credits" to the sender for sending more data. As congestion develops, the sender will have fewer credits and will be forced to slow down. However, such a mechanism suffers from two problems: (a) it is basically a "back-pressure" mechanism and could take a long time to squeeze flows that transit a large number of hops, and (b) if multiple flows use the same virtual lane, the flows that are not responsible for the congestion could also get shut down unnecessarily. To address these issues, IBA supports a Forward Explicit Congestion Notification (FECN) mechanism coupled with endpoint rate control.
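The back-pressure behaviour of per-VL credit-based flow control can be illustrated with a toy model; the credit counts and the granting policy below are invented for illustration.

    class CreditedLink:
        """Sender-side view of one virtual lane with credit-based flow control."""

        def __init__(self, initial_credits):
            self.credits = initial_credits

        def try_send(self, packets):
            """Send at most as many packets as there are credits; return the number sent."""
            sent = min(packets, self.credits)
            self.credits -= sent
            return sent

        def grant(self, credits):
            """Receiver periodically returns credits as it frees buffer space."""
            self.credits += credits

    link = CreditedLink(initial_credits=8)
    print(link.try_send(5))   # 5  (3 credits left)
    print(link.try_send(5))   # 3  (sender throttled: back-pressure)
    link.grant(4)             # receiver drained some buffers
    print(link.try_send(5))   # 4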
FECN tags packets as they experience congestion on the way to the destination, and this information is returned in the acknowledgement packets to the source. Up to 16 congestion levels are supported by IBA. A node receiving a congestion notification can tell whether it is the source or a victim of congestion by checking whether its credits have already been exhausted. A congestion source successively adjusts its traffic injection rate based on the received congestion indications.
