A Cloud-Scale Acceleration Architecture

Adrian M. Caulfield, Eric S. Chung, Andrew Putnam, Hari Angepat, Jeremy Fowers, Michael Haselman, Stephen Heil, Matt Humphrey, Puneet Kaur, Joo-Young Kim, Daniel Lo, Todd Massengill, Kalin Ovtcharov, Michael Papamichael, Lisa Woods, Sitaram Lanka, Derek Chiou, Doug Burger

Microsoft Corporation
Abstract—Hyperscale datacenter providers have struggled to balance the growing need for specialized hardware (efficiency) with the economic benefits of homogeneity (manageability). In this paper we propose a new cloud architecture that uses reconfigurable logic to accelerate both network plane functions and applications. This Configurable Cloud architecture places a layer of reconfigurable logic (FPGAs) between the network switches and the servers, enabling network flows to be programmably transformed at line rate, enabling acceleration of local applications running on the server, and enabling the FPGAs to communicate directly, at datacenter scale, to harvest remote FPGAs unused by their local servers. We deployed this design over a production server bed, and show how it can be used for both service acceleration (Web search ranking) and network acceleration (encryption of data in transit at high speeds). This architecture is much more scalable than prior work which used secondary rack-scale networks for inter-FPGA communication. By coupling to the network plane, direct FPGA-to-FPGA messages can be achieved at comparable latency to previous work, without the secondary network. Additionally, the scale of direct inter-FPGA messaging is much larger. The average round-trip latencies observed in our measurements among 24, 1,000, and 250,000 machines are under 3, 9, and 20 microseconds, respectively. The Configurable Cloud architecture has been deployed at hyperscale in Microsoft's production datacenters worldwide.
I. INTRODUCTION

Modern hyperscale datacenters have made huge strides with improvements in networking, virtualization, energy efficiency, and infrastructure management, but still have the same basic structure as they have for years: individual servers with multicore CPUs, DRAM, and local storage, connected by the NIC through Ethernet switches to other servers. At hyperscale (hundreds of thousands to millions of servers), there are significant benefits to maximizing homogeneity; workloads can be migrated fungibly across the infrastructure, and management is simplified, reducing costs and configuration errors.

Both the slowdown in CPU scaling and the ending of Moore's Law have resulted in a growing need for hardware specialization to increase performance and efficiency. However, placing specialized accelerators in a subset of a hyperscale infrastructure's servers reduces the highly desirable homogeneity. The question is mostly one of economics: whether it is cost-effective to deploy an accelerator in every new server, whether it is better to specialize a subset of an infrastructure's new servers and maintain an ever-growing number of configurations, or whether it is most cost-effective to do neither. Any specialized accelerator must be compatible with the target workloads through its deployment lifetime (e.g. six years: two years to design and deploy the accelerator and four years of server deployment lifetime). This requirement is a challenge given both the diversity of cloud workloads and the rapid rate at which they change (weekly or monthly). It is thus highly desirable that accelerators incorporated into hyperscale servers be programmable, the two most common examples being FPGAs and GPUs.

Both GPUs and FPGAs have been deployed in datacenter infrastructure at reasonable scale without direct connectivity between accelerators [1], [2], [3]. Our recent publication described a medium-scale FPGA deployment in a production datacenter to accelerate Bing web search ranking using multiple directly-connected accelerators [4]. That design consisted of a rack-scale fabric of 48 FPGAs connected by a secondary network. While effective at accelerating search ranking, our first architecture had several significant limitations:

• The secondary network (a 6x8 torus) required expensive and complex cabling, and required awareness of the physical location of machines.
• Failure handling of the torus required complex re-routing of traffic to neighboring nodes, causing both performance loss and isolation of nodes under certain failure patterns.
• The number of FPGAs that could communicate directly, without going through software, was limited to a single rack (i.e. 48 nodes).
• The fabric was a limited-scale "bolt on" accelerator, which could accelerate applications but offered little for enhancing the datacenter infrastructure, such as networking and storage flows.

In this paper, we describe a new cloud-scale, FPGA-based acceleration architecture, which we call the Configurable Cloud, which eliminates all of the limitations listed above with a single design. This architecture has been — and is being — deployed in the majority of new servers in Microsoft's production datacenters across more than 15 countries and 5 continents. A Configurable Cloud allows the datapath of cloud communication to be accelerated with programmable hardware. This datapath can include networking flows, storage flows, security operations, and distributed (multi-FPGA) applications.
Fig. 1. (a) Decoupled Programmable Hardware Plane (example roles shown: web search ranking, deep neural networks, expensive compression, bioinformatics), (b) Server + FPGA schematic.
The key difference over previous work is that the acceleration hardware is tightly coupled with the datacenter network—placing a layer of FPGAs between the servers' NICs and the Ethernet network switches. Figure 1b shows how the accelerator fits into a host server. All network traffic is routed through the FPGA, allowing it to accelerate high-bandwidth network flows. An independent PCIe connection to the host CPUs is also provided, allowing the FPGA to be used as a local compute accelerator. The standard network switch and topology remove the impact of failures on neighboring servers, remove the need for non-standard cabling, and eliminate the need to track the physical location of machines in each rack.

While placing FPGAs as a network-side "bump-in-the-wire" solves many of the shortcomings of the torus topology, much more is possible. By enabling the FPGAs to generate and consume their own networking packets independent of the hosts, each and every FPGA in the datacenter can reach every other one (at a scale of hundreds of thousands) in a small number of microseconds, without any intervening software. This capability allows hosts to use remote FPGAs for acceleration with low latency, improving the economics of the accelerator deployment, as hosts running services that do not use their local FPGAs can donate them to a global pool and extract value which would otherwise be stranded. Moreover, this design choice essentially turns the distributed FPGA resources into an independent computer in the datacenter, at the same scale as the servers, that physically shares the network wires with software. Figure 1a shows a logical view of this plane of computation.

This model offers significant flexibility. From the local perspective, the FPGA is used as a compute or a network accelerator. From the global perspective, the FPGAs can be managed as a large-scale pool of resources, with acceleration services mapped to remote FPGA resources. Ideally, servers not using all of their local FPGA resources can donate those resources to the global pool, while servers that need additional resources can request the available resources on remote servers. Failing nodes are removed from the pool with replacements quickly added. As demand for a service grows or shrinks, a global manager grows or shrinks the pools correspondingly. Services are thus freed from having a fixed ratio of CPU cores to FPGAs, and can instead allocate (or purchase, in the case of IaaS) only the resources of each type needed.
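To make the pooling model above concrete, the following is a minimal software sketch of the bookkeeping a global manager might perform. The paper does not describe the manager's interfaces, so every name, field, and function here is an illustrative assumption rather than the actual system.

#include <stdio.h>
#include <stdint.h>

#define POOL_MAX 16

/* One donated FPGA: which host it lives on and which service image it runs. */
typedef struct {
    uint32_t host_ip;
    uint32_t service_id;
    int      in_use;
} fpga_slot_t;

static fpga_slot_t pool[POOL_MAX];
static int pool_len;

/* A server not using its local FPGA offers it to the global pool. */
static void pool_donate(uint32_t host_ip, uint32_t service_id) {
    if (pool_len < POOL_MAX)
        pool[pool_len++] = (fpga_slot_t){ host_ip, service_id, 0 };
}

/* A server needing extra capacity requests one remote FPGA for service_id. */
static int pool_acquire(uint32_t service_id, uint32_t *host_ip_out) {
    for (int i = 0; i < pool_len; i++) {
        if (!pool[i].in_use && pool[i].service_id == service_id) {
            pool[i].in_use = 1;
            *host_ip_out = pool[i].host_ip;
            return 1;
        }
    }
    return 0; /* none free: a real manager would grow the pool as demand rises */
}

int main(void) {
    pool_donate(0x0A000001u, 42);   /* host 10.0.0.1 donates an FPGA running service 42 */
    uint32_t ip;
    if (pool_acquire(42, &ip))
        printf("granted remote FPGA on host 0x%08X\n", (unsigned)ip);
    return 0;
}

The placement, failure-handling, and resizing policies that would sit behind such an interface are exactly the management mechanisms the paper defers; the rest of this paper concerns the hardware that makes remote FPGAs reachable at all.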
Space limitations prevent a complete description of the management policies and mechanisms for the global resource manager. Instead, this paper focuses first on the hardware architecture necessary to treat remote FPGAs as available resources for global acceleration pools. We describe the communication protocols and mechanisms that allow nodes in a remote acceleration service to connect, including a protocol called LTL (Lightweight Transport Layer) that supports lightweight connections between pairs of FPGAs, with mostly lossless transport and extremely low latency (small numbers of microseconds). This protocol makes the datacenter-scale remote FPGA resources appear closer than either a single local SSD access or the time to get through the host's networking stack. Then, we describe an evaluation system of 5,760 servers which we built and deployed as a precursor to hyperscale production deployment. We measure the performance characteristics of the system, using web search and network flow encryption as examples. We show that significant gains in efficiency are possible, and that this new architecture enables a much broader and more robust architecture for the acceleration of hyperscale datacenter services.
Fig. 2. Block diagram of the major components of the accelerator board: Altera Stratix V D5 FPGA, 4 GB DDR3-1600 (72 bits with ECC), 256 Mb config flash, two 40 Gb QSFP network ports (to the NIC and to the TOR, each 4 lanes @ 10.3125 Gbps), two PCIe Gen3 x8 links through a mezzanine connector, and USB/JTAG, I2C, and temperature/power/LED management interfaces.

Fig. 3. Photograph of the manufactured board. The DDR channel is implemented using discrete components. PCIe connectivity goes through a mezzanine connector on the bottom side of the board (not shown).
II. HARDWARE ARCHITECTURE

There are many constraints on the design of hardware accelerators for the datacenter. Datacenter accelerators must be highly manageable, which means having few variations or versions. The environments must be largely homogeneous, which means that the accelerators must provide value across a plurality of the servers using them. Given services' rate of change and diversity in the datacenter, this requirement means that a single design must provide positive value across an extremely large, homogeneous deployment.

The solution to addressing the competing demands of homogeneity and specialization is to develop accelerator architectures which are programmable, such as FPGAs and GPUs. These programmable architectures allow for hardware homogeneity while allowing fungibility via software for different services. They must be highly flexible at the system level, in addition to being programmable, to justify deployment across a hyperscale infrastructure. The acceleration system we describe is sufficiently flexible to cover three scenarios: local compute acceleration (through PCIe), network acceleration, and global application acceleration, through configuration as pools of remotely accessible FPGAs. Local acceleration handles high-value scenarios such as search ranking acceleration where every server can benefit from having its own FPGA. Network acceleration can support services such as intrusion detection, deep packet inspection and network encryption which are critical to IaaS (e.g. "rental" of cloud servers), and which have such a huge diversity of customers that it is difficult to justify local compute acceleration alone economically. Global acceleration permits accelerators unused by their host servers to be made available for large-scale applications, such as machine learning. This decoupling of a 1:1 ratio of servers to FPGAs is essential for breaking the "chicken and egg" problem where accelerators cannot be added until enough applications need them, but applications will not rely upon the accelerators until they are present in the infrastructure.
By decoupling the servers and FPGAs, software services that demand more FPGA capacity can harness spare FPGAs from other services that are slower to adopt (or do not require) the accelerator fabric.

In addition to architectural requirements that provide sufficient flexibility to justify scale production deployment, there are also physical restrictions in current infrastructures that must be overcome. These restrictions include strict power limits, a small physical space in which to fit, resilience to hardware failures, and tolerance to high temperatures. For example, the accelerator architecture we describe is deployed in the widely-used OpenCompute server, which constrained power to 35W, the physical size to roughly a half-height, half-length PCIe expansion card (80 mm x 140 mm), and required tolerance to an inlet air temperature of 70°C at 160 lfm airflow. These constraints make deployment of current GPUs impractical except in special HPC SKUs, so we selected FPGAs as the accelerator.

We designed the accelerator board as a standalone FPGA board that is added to the PCIe expansion slot in a production server SKU. Figure 2 shows a schematic of the board, and Figure 3 shows a photograph of the board with major components labeled. The FPGA is an Altera Stratix V D5, with 172.6K ALMs of programmable logic. The FPGA has one 4 GB DDR3-1600 DRAM channel, two independent PCIe Gen3 x8 connections for an aggregate total of 16 GB/s in each direction between the CPU and FPGA, and two independent 40 Gb Ethernet interfaces with standard QSFP+ connectors. A 256 Mb Flash chip holds the known-good golden image for the FPGA that is loaded on power-on, as well as one application image.

To measure the power consumption limits of the entire FPGA card (including DRAM, I/O channels, and PCIe), we developed a power virus that exercises nearly all of the FPGA's interfaces, logic, and DSP blocks—while running the card in a thermal chamber operating in worst-case conditions (peak ambient temperature, high CPU load, and minimum airflow due to a failed fan). Under these conditions, the card consumes 29.2W of power, which is well within the 32W TDP limits for a card running in a single server in our datacenter, and below the max electrical power draw limit of 35W.
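As a rough consistency check on the 16 GB/s aggregate PCIe figure above (assuming standard PCIe Gen3 signaling of 8 GT/s per lane with 128b/130b encoding; the encoding detail is our assumption, not stated in the text), each x8 link delivers about 7.9 GB/s per direction, so the two links together provide

\[
2 \times \left( 8\,\mathrm{GT/s} \times \frac{128}{130} \times 8\ \mathrm{lanes} \times \frac{1\ \mathrm{B}}{8\ \mathrm{b}} \right) \approx 2 \times 7.9\ \mathrm{GB/s} \approx 16\ \mathrm{GB/s\ per\ direction.}
\]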
Fig. 4. The Shell Architecture in a Single FPGA.

Fig. 5. Area and frequency breakdown of production-deployed image with remote acceleration support.

Component                 ALMs            MHz
Role                      55340 (32%)     175
40G MAC/PHY (TOR)         9785 (6%)       313
40G MAC/PHY (NIC)         13122 (8%)      313
Network Bridge / Bypass   4685 (3%)       313
DDR3 Memory Controller    13225 (8%)      200
Elastic Router            3449 (2%)       156
LTL Protocol Engine       11839 (7%)      156
LTL Packet Switch         4815 (3%)       -
PCIe DMA Engine           6817 (4%)       250
Other                     8273 (5%)       -
Total Area Used           131350 (76%)    -
Total Area Available      172600          -
The dual 40 Gb Ethernet interfaces on the board could allow for a private FPGA network as was done in our previous work [4], but this configuration also allows the FPGA to be wired as a "bump-in-the-wire", sitting between the network interface card (NIC) and the top-of-rack switch (ToR). Rather than cabling the standard NIC directly to the ToR, the NIC is cabled to one port of the FPGA, and the other FPGA port is cabled to the ToR, as we previously showed in Figure 1b.

Maintaining the discrete NIC in the system enables us to leverage all of the existing network offload and packet transport functionality hardened into the NIC. This simplifies the minimum FPGA code required to deploy the FPGAs to simple bypass logic. In addition, both FPGA resources and PCIe bandwidth are preserved for acceleration functionality, rather than being spent on implementing the NIC in soft logic.

Unlike [5], both the FPGA and NIC have separate connections to the host CPU via PCIe. This allows each to operate independently at maximum bandwidth when the FPGA is being used strictly as a local compute accelerator (with the FPGA simply doing network bypass). This path also makes it possible to use custom networking protocols that bypass the NIC entirely when desired.

One potential drawback to the bump-in-the-wire architecture is that an FPGA failure, such as loading a buggy application, could cut off network traffic to the server, rendering the server unreachable. However, unlike a torus or mesh network, failures in the bump-in-the-wire architecture do not degrade any neighboring FPGAs, making the overall system more resilient to failures. In addition, most datacenter servers (including ours) have a side-channel management path that exists to power servers on and off. By policy, the known-good golden image that loads on power-up is rarely (if ever) overwritten, so power cycling the server through the management port will bring the FPGA back into a good configuration, making the server reachable via the network once again.
A. Shell architecture

Within each FPGA, we use the partitioning and terminology we defined in prior work [4] to separate the application logic (Role) from the common I/O and board-specific logic (Shell) used by accelerated services. Figure 4 gives an overview of this architecture's major shell components, focusing on the network. In addition to the Ethernet MACs and PHYs, there is an intra-FPGA message router called the Elastic Router (ER) with virtual channel support for allowing multiple Roles access to the network, and a Lightweight Transport Layer (LTL) engine used for enabling inter-FPGA communication. Both are described in detail in Section V.

The FPGA's location as a bump-in-the-wire between the network switch and host means that it must always pass packets between the two network interfaces that it controls. The shell implements a bridge to enable this functionality, shown at the top of Figure 4. The shell provides a tap for FPGA roles to inject, inspect, and alter the network traffic as needed, such as when encrypting network flows, which we describe in Section IV.

Full FPGA reconfiguration briefly brings down this network link, but in most cases applications are robust to brief network outages. When network traffic cannot be paused even briefly, partial reconfiguration permits packets to be passed through even during reconfiguration of the role.
Figure 5 shows the area and clock frequency of the shell IP components used in the production-deployed image. In total, the design uses 44% of the FPGA to support all shell functions and the necessary IP blocks to enable access to remote pools of FPGAs (i.e., LTL and the Elastic Router). While a significant fraction of the FPGA is consumed by a few major shell components (especially the 40G PHY/MACs at 14% and the DDR3 memory controller at 8%), enough space is left for the role(s) to provide large speedups for key services, as we show in Section III. Large shell components that are stable for the long term are excellent candidates for hardening in future generations of datacenter-optimized FPGAs.
B. Datacenter Deployment

To evaluate the system architecture and performance at scale, we manufactured and deployed 5,760 servers containing this accelerator architecture and placed them into a production datacenter. All machines were configured with the shell described above. The servers and FPGAs were stress tested using the power virus workload on the FPGA and a standard burn-in test for the server under real datacenter environmental conditions. The servers all passed, and were approved for production use in the datacenter.

We brought up a production Bing web search ranking service on the servers, with 3,081 of these machines using the FPGA for local compute acceleration, and the rest used for other functions associated with web search. We mirrored live traffic to the bed for one month, and monitored the health and stability of the systems as well as the correctness of the ranking service. After one month, two FPGAs had hard failures, one with a persistently high rate of single event upset (SEU) errors in the configuration logic, and the other with an unstable 40 Gb network link to the NIC. A third failure of the 40 Gb link to the TOR was found not to be an FPGA failure, and was resolved by replacing a network cable. Given aggregate datacenter failure rates, we deemed the FPGA-related hardware failures to be acceptably low for production.

We also measured a low number of soft errors, which were all correctable. Five machines failed to train to the full Gen3 x8 speeds on the secondary PCIe link. There were eight total DRAM calibration failures which were repaired by reconfiguring the FPGA. The errors have since been traced to a logical error in the DRAM interface rather than a hard failure. Our shell scrubs the configuration state for soft errors and reports any flipped bits. We measured an average rate of one bit-flip in the configuration logic every 1025 machine days. While the scrubbing logic often catches the flips before functional failures occur, at least in one case there was a role hang that was likely attributable to an SEU event. Since the scrubbing logic completes roughly every 30 seconds, our system recovers from hung roles automatically, and since we use ECC and other data integrity checks on critical interfaces, the exposure of the ranking service to SEU events is low. Overall, the hardware and interface stability of the system was deemed suitable for scale production.

In the next sections, we show how this board/shell combination can support local application acceleration while simultaneously routing all of the server's incoming and outgoing network traffic. Following that, we show network acceleration, and then acceleration of remote services.

III. LOCAL ACCELERATION

As we described earlier, it is important for an at-scale datacenter accelerator to enhance local applications and infrastructure functions for different domains (e.g. web search and IaaS). In this section we measure the performance of our system on a large datacenter workload.

Fig. 6. 99% Latency versus Throughput of ranking service queries running on a single server, with and without FPGAs enabled.

Fig. 7. Five day query throughput and latency of ranking service queries running in production, with and without FPGAs enabled.

Fig. 8. Query 99.9% Latency vs. Offered Load.

A. Bing Search Page Ranking Acceleration

We describe Bing web search ranking acceleration as an example of using the local FPGA to accelerate a large-scale service. This example is useful both because Bing search is a large datacenter workload, and because we described its acceleration in depth on the Catapult v1 platform [4]. At a high level, most web search ranking algorithms behave similarly; query-specific features are generated from documents, processed, and then passed to a machine-learned model to determine how relevant the document is to the query.

Unlike in [4], we implement only a subset of the feature calculations (typically the most expensive ones), and neither compute post-processed synthetic features nor run the machine-learning portion of search ranking on the FPGAs. We do implement two classes of features on the FPGA. The first is the traditional finite state machines used in many search engines (e.g. "count the number of occurrences of query term two"). The second is a proprietary set of features generated by a complex dynamic programming engine.

We implemented the selected features in a Feature Functional Unit (FFU), and the Dynamic Programming Features in a separate DPF unit. Both the FFU and DPF units were built into a shell that also had support for execution using remote accelerators, namely the ER and LTL blocks as described in Section V. This FPGA image also, of course, includes the network bridge for NIC-TOR communication, so all the server's network traffic is passing through the FPGA while it is simultaneously accelerating document ranking. The pass-through traffic and the search ranking acceleration have no performance interaction.

We present results in a format similar to the Catapult results to make direct comparisons simpler. We are running this image on a full production bed consisting of thousands of servers. In a production environment, it is infeasible to simulate many different points of query load as there is substantial infrastructure upstream that only produces requests at the rate of arrivals.
To produce a smooth distribution with repeatable results, we used a single-box test with a stream of 200,000 queries, and varied the arrival rate of requests to measure query latency versus throughput. Figure 6 shows the results; the curves shown are the measured latencies of the 99% slowest queries at each given level of throughput. Both axes show normalized results.

We have normalized both the target production latency and typical average throughput in software-only mode to 1.0. The software is well tuned; it can achieve production targets for throughput at the required 99th percentile latency. With the single local FPGA, at the target 99th percentile latency, the throughput can be safely increased by 2.25x, which means that fewer than half as many servers would be needed to sustain the target throughput at the required latency. Even at these higher loads, the FPGA remains underutilized, as the software portion of ranking saturates the host server before the FPGA is saturated. Having multiple servers drive fewer FPGAs addresses the underutilization of the FPGAs, which is the goal of our remote acceleration model.
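To make the "fewer than half" claim explicit: for a fixed aggregate query load served at the same tail-latency target, the required server count scales inversely with per-server throughput, so

\[
\frac{N_{\text{FPGA}}}{N_{\text{software}}} = \frac{1}{2.25} \approx 0.44 .
\]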
Production Measurements: We have deployed the FPGA-accelerated ranking service into production datacenters at scale and report end-to-end measurements below.

Figure 7 shows the performance of the ranking service running in two production datacenters over a five day period, one with FPGAs enabled and one without (both datacenters are of identical scale and configuration, with the exception of the FPGAs). These results are from live production traffic, not from a synthetic workload or mirrored traffic. The top two bars show the normalized tail query latencies at the 99.9th percentile (aggregated across all servers over a rolling time window), while the bottom two bars show the corresponding query loads received at each datacenter. As load varies throughout the day, the queries executed in the software-only datacenter experience a high rate of latency spikes, while the FPGA-accelerated queries have much lower, tighter-bound latencies, despite seeing much higher peak query loads.

Figure 8 plots the load versus latency over the same 5-day period for the two datacenters. Given that these measurements were performed on production datacenters and traffic, we were unable to observe higher loads on the software datacenter (vs. the FPGA-enabled datacenter) due to a dynamic load balancing mechanism that caps the incoming traffic when tail latencies begin exceeding acceptable thresholds. Because the FPGA is able to process requests while keeping latencies low, it is able to absorb more than twice the offered load, while executing queries at a latency that never exceeds the software datacenter at any load.
IV. NETWORK ACCELERATION

The bump-in-the-wire architecture was developed to enable datacenter networking applications, the first of which is host-to-host line-rate encryption/decryption on a per-flow basis. As each packet passes from the NIC through the FPGA to the ToR, its header is examined to determine if it is part of an encrypted flow that was previously set up by software. If it is, the software-provided encryption key is read from internal FPGA SRAM or the FPGA-attached DRAM and is used to encrypt or decrypt the packet. Thus, once the encrypted flows are set up, there is no load on the CPUs to encrypt or decrypt the packets; encryption occurs transparently from software's perspective, which sees all packets as unencrypted at the end points.
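The control flow just described can be summarized with a small software model. This is only an illustration of the per-packet decision; the real design is a hardware pipeline, the table organization is not described here, and the XOR stand-in below is emphatically not the AES cipher actually used.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef struct { uint32_t src_ip, dst_ip; uint16_t src_port, dst_port; } flow_t;
typedef struct { flow_t flow; uint8_t key[16]; int valid; } key_entry_t;

#define TABLE_SIZE 8
static key_entry_t key_table[TABLE_SIZE];   /* stands in for FPGA SRAM / attached DRAM */

/* Software installs a key for a flow once; the datapath then handles every packet. */
static void install_key(flow_t f, const uint8_t key[16]) {
    key_table[0] = (key_entry_t){ f, {0}, 1 };
    memcpy(key_table[0].key, key, 16);
}

static const key_entry_t *lookup(const flow_t *f) {
    for (int i = 0; i < TABLE_SIZE; i++)
        if (key_table[i].valid && memcmp(&key_table[i].flow, f, sizeof *f) == 0)
            return &key_table[i];
    return NULL;
}

/* Called for every packet on its way between the NIC and the ToR. */
static void process_packet(const flow_t *f, uint8_t *payload, size_t len) {
    const key_entry_t *k = lookup(f);
    if (k)                                    /* flow was set up for encryption */
        for (size_t i = 0; i < len; i++)
            payload[i] ^= k->key[i % 16];     /* placeholder transform, NOT real AES */
    /* the packet, transformed or not, continues to the other port at line rate */
}

int main(void) {
    flow_t f = { 0x0A000001, 0x0A000002, 443, 51000 };
    uint8_t key[16] = { 1, 2, 3 };
    uint8_t payload[4] = { 'd', 'a', 't', 'a' };
    install_key(f, key);
    process_packet(&f, payload, sizeof payload);
    printf("first payload byte after transform: 0x%02X\n", payload[0]);
    return 0;
}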
The ability to offload encryption/decryption at network line rates to the FPGA yields significant CPU savings. According to Intel [6], its AES GCM-128 performance on Haswell is 1.26 cycles per byte for encrypt and decrypt each. Thus, at a 2.4 GHz clock frequency, 40 Gb/s encryption/decryption consumes roughly five cores. Different standards, such as 256b or CBC, are, however, significantly slower. In addition, there is sometimes a need for crypto hash functions, further reducing performance. For example, AES-CBC-128-SHA1 is needed for backward compatibility for some software stacks and consumes at least fifteen cores to achieve 40 Gb/s full duplex. Of course, consuming fifteen cores just for crypto is impractical on a typical multi-core server, as there would be that many fewer cores generating traffic.
Even five cores of savings is significant when every core can otherwise generate revenue.
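The "roughly five cores" figure can be reconstructed from the cited numbers, assuming the full 40 Gb/s (5 GB/s) must be both encrypted and decrypted:

\[
\frac{2 \times 1.26\ \tfrac{\text{cycles}}{\text{byte}} \times 5 \times 10^{9}\ \tfrac{\text{bytes}}{\text{s}}}{2.4 \times 10^{9}\ \tfrac{\text{cycles}}{\text{s·core}}} \approx 5.25\ \text{cores}.
\]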
Our FPGA implementation supports full 40 Gb/s encryption and decryption. The worst-case half-duplex FPGA crypto latency for AES-CBC-128-SHA1 is 11 µs for a 1500B packet, from first flit to first flit. In software, based on the Intel numbers, it is approximately 4 µs. AES-CBC-SHA1 is, however, especially difficult for hardware due to tight dependencies. For example, AES-CBC requires processing 33 packets at a time in our implementation, taking only 128b from a single packet once every 33 cycles. GCM latency numbers are significantly better for the FPGA since a single packet can be processed with no dependencies and thus can be perfectly pipelined.
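The interleaving constraint can be illustrated with a toy schedule. In CBC mode each 128-bit block depends on the previous ciphertext block of the same packet, so a deeply pipelined cipher core can accept a new block from a given packet only once per pipeline latency; rotating over that many independent packets keeps the core fed every cycle. The 33-entry depth below matches the figure cited above, but the code is a scheduling illustration only, not the hardware design.

#include <stdio.h>

#define DEPTH 33   /* pipeline latency in cycles == number of interleaved packets */

int main(void) {
    /* Print which packet issues a 128-bit block on each cycle: packet p issues
       its next block every DEPTH cycles, exactly as described in the text. */
    for (int cycle = 0; cycle < 3 * DEPTH; cycle++) {
        int packet = cycle % DEPTH;   /* round-robin over 33 in-flight packets */
        int block  = cycle / DEPTH;   /* which block of that packet is issued */
        if (packet < 2)               /* show a small sample of the schedule */
            printf("cycle %3d: packet %2d issues block %d\n", cycle, packet, block);
    }
    return 0;
}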
The software performance numbers are Intel's very best numbers that do not account for the disturbance to the core, such as effects on the cache if encryption/decryption is blocked, which is often the case. Thus, the real-world benefits of offloading crypto to the FPGA are greater than the numbers presented above.
V. REMOTE ACCELERATION

In the previous sections, we showed that the Configurable Cloud acceleration architecture could support local functions, both for Bing web search ranking acceleration and for accelerating networking/infrastructure functions, such as encryption of network flows. However, to treat the acceleration hardware as a global resource and to deploy services that consume more than one FPGA (e.g. more aggressive web search ranking, large-scale machine learning, and bioinformatics), communication among FPGAs is crucial. Other systems provide explicit connectivity through secondary networks or PCIe switches to allow multi-FPGA solutions. Either solution, however, limits the scale of connectivity. By having FPGAs communicate directly through the datacenter Ethernet infrastructure, large scale and low latency are both achieved. However, these communication channels have several requirements:

• They cannot go through software protocol stacks on the CPU due to their long latencies.
• They must be resilient to failures and dropped packets, but should drop packets rarely.
• They should not consume significant FPGA resources.

The rest of this section describes the FPGA implementation that supports cross-datacenter, inter-FPGA communication and meets the above requirements. There are two major functions that must be implemented on the FPGA: the inter-FPGA communication engine and the intra-FPGA router that coordinates the various flows of traffic on the FPGA among the network, PCIe, DRAM, and the application roles. We describe each below.
A. Lightweight Transport Layer

We call the inter-FPGA network protocol LTL, for Lightweight Transport Layer. This protocol, like previous work [7], uses UDP for frame encapsulation and IP for routing packets across the datacenter network. Low-latency communication demands infrequent packet drops and infrequent packet reorders. By using "lossless" traffic classes provided in datacenter switches and provisioned for traffic like RDMA and FCoE, we avoid most packet drops and reorders. Separating such traffic into its own classes also protects the datacenter's baseline TCP traffic.
