`
Adrian M. Caulfield  Eric S. Chung  Andrew Putnam  Hari Angepat  Jeremy Fowers
Michael Haselman  Stephen Heil  Matt Humphrey  Puneet Kaur  Joo-Young Kim
Daniel Lo  Todd Massengill  Kalin Ovtcharov  Michael Papamichael  Lisa Woods
Sitaram Lanka  Derek Chiou  Doug Burger
`
`Microsoft Corporation
`
`Abstract—Hyperscale datacenter providers have struggled to
`balance the growing need for specialized hardware (efficiency)
`with the economic benefits of homogeneity (manageability). In
`this paper we propose a new cloud architecture that uses
`reconfigurable logic to accelerate both network plane func-
`tions and applications. This Configurable Cloud architecture
`places a layer of reconfigurable logic (FPGAs) between the
`network switches and the servers, enabling network flows to be
`programmably transformed at line rate, enabling acceleration
`of local applications running on the server, and enabling the
`FPGAs to communicate directly, at datacenter scale, to harvest
`remote FPGAs unused by their local servers. We deployed this
`design over a production server bed, and show how it can be
`used for both service acceleration (Web search ranking) and
network acceleration (encryption of data in transit at high
speed). This architecture is much more scalable than prior
work, which used secondary rack-scale networks for inter-FPGA
`communication. By coupling to the network plane, direct FPGA-
`to-FPGA messages can be achieved at comparable latency to
`previous work, without the secondary network. Additionally, the
`scale of direct inter-FPGA messaging is much larger. The average
round-trip latencies observed in our measurements among 24,
1,000, and 250,000 machines are under 3, 9, and 20 microseconds,
`respectively. The Configurable Cloud architecture has been
`deployed at hyperscale in Microsoft’s production datacenters
`worldwide.
`
`I. INTRODUCTION
`
`Modern hyperscale datacenters have made huge strides with
`improvements in networking, virtualization, energy efficiency,
and infrastructure management, but still have the same basic
structure that they have had for years: individual servers with
`multicore CPUs, DRAM, and local storage, connected by the
`NIC through Ethernet switches to other servers. At hyperscale
`(hundreds of thousands to millions of servers), there are signif-
`icant benefits to maximizing homogeneity; workloads can be
`migrated fungibly across the infrastructure, and management
`is simplified, reducing costs and configuration errors.
`Both the slowdown in CPU scaling and the ending of
`Moore’s Law have resulted in a growing need for hard-
`ware specialization to increase performance and efficiency.
`However, placing specialized accelerators in a subset of a
`hyperscale infrastructure’s servers reduces the highly desir-
`able homogeneity. The question is mostly one of economics:
`whether it is cost-effective to deploy an accelerator in every
`new server, whether it
`is better to specialize a subset of
an infrastructure’s new servers and maintain an ever-growing
number of configurations, or whether it is most cost-effective
`to do neither. Any specialized accelerator must be compatible
`with the target workloads through its deployment lifetime (e.g.
`six years: two years to design and deploy the accelerator and
`four years of server deployment lifetime). This requirement
`is a challenge given both the diversity of cloud workloads
`and the rapid rate at which they change (weekly or monthly).
`It is thus highly desirable that accelerators incorporated into
`hyperscale servers be programmable, the two most common
`examples being FPGAs and GPUs.
`Both GPUs and FPGAs have been deployed in datacenter
`infrastructure at reasonable scale without direct connectivity
`between accelerators [1], [2], [3]. Our recent publication
`described a medium-scale FPGA deployment in a production
`datacenter to accelerate Bing web search ranking using multi-
`ple directly-connected accelerators [4]. That design consisted
`of a rack-scale fabric of 48 FPGAs connected by a secondary
`network. While effective at accelerating search ranking, our
`first architecture had several significant limitations:
`• The secondary network (a 6x8 torus) required expensive
`and complex cabling, and required awareness of the physical
`location of machines.
`• Failure handling of the torus required complex re-routing
`of traffic to neighboring nodes, causing both performance loss
`and isolation of nodes under certain failure patterns.
`• The number of FPGAs that could communicate directly,
`without going through software, was limited to a single rack
`(i.e. 48 nodes).
`• The fabric was a limited-scale “bolt on” accelerator, which
`could accelerate applications but offered little for enhancing
`the datacenter infrastructure, such as networking and storage
`flows.
In this paper, we describe a new cloud-scale, FPGA-based
acceleration architecture, the Configurable Cloud, which
eliminates all of the limitations listed above with
`a single design. This architecture has been — and is being
`— deployed in the majority of new servers in Microsoft’s
`production datacenters across more than 15 countries and
`5 continents. A Configurable Cloud allows the datapath of
`cloud communication to be accelerated with programmable
`hardware. This datapath can include networking flows, stor-
`age flows, security operations, and distributed (multi-FPGA)
`applications.
Fig. 1. (a) Decoupled Programmable Hardware Plane, (b) Server + FPGA
schematic. [Figure: (a) the FPGAs, linked through TOR, L1, and L2 switches,
form a hardware acceleration plane decoupled from the servers, running
services such as web search ranking, deep neural networks, expensive
compression, and bioinformatics; (b) each server’s FPGA sits between its
NIC and the TOR switch.]

The key difference over previous work is that the accelera-
`tion hardware is tightly coupled with the datacenter network—
`placing a layer of FPGAs between the servers’ NICs and
`the Ethernet network switches. Figure 1b shows how the
`accelerator fits into a host server. All network traffic is routed
`through the FPGA, allowing it to accelerate high-bandwidth
`network flows. An independent PCIe connection to the host
`CPUs is also provided, allowing the FPGA to be used as a local
`compute accelerator. The standard network switch and topol-
`ogy removes the impact of failures on neighboring servers,
`removes the need for non-standard cabling, and eliminates the
`need to track the physical location of machines in each rack.
`While placing FPGAs as a network-side “bump-in-the-wire”
`solves many of the shortcomings of the torus topology, much
`more is possible. By enabling the FPGAs to generate and
`consume their own networking packets independent of the
`hosts, each and every FPGA in the datacenter can reach
`every other one (at a scale of hundreds of thousands) in
`a small number of microseconds, without any intervening
`software. This capability allows hosts to use remote FPGAs for
`acceleration with low latency, improving the economics of the
`accelerator deployment, as hosts running services that do not
`use their local FPGAs can donate them to a global pool and
`extract value which would otherwise be stranded. Moreover,
`this design choice essentially turns the distributed FPGA
`resources into an independent computer in the datacenter,
`at the same scale as the servers, that physically shares the
`network wires with software. Figure 1a shows a logical view
`of this plane of computation.
`This model offers significant flexibility. From the local
`perspective, the FPGA is used as a compute or a network
`accelerator. From the global perspective, the FPGAs can be
`managed as a large-scale pool of resources, with acceleration
`
`services mapped to remote FPGA resources. Ideally, servers
`not using all of their local FPGA resources can donate
`those resources to the global pool, while servers that need
`additional resources can request the available resources on
`remote servers. Failing nodes are removed from the pool
`with replacements quickly added. As demand for a service
`grows or shrinks, a global manager grows or shrinks the pools
`correspondingly. Services are thus freed from having a fixed
ratio of CPU cores per FPGA, and can instead allocate (or
`purchase, in the case of IaaS) only the resources of each type
`needed.
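The paper defers the global manager's policies and mechanisms, but the
allocation model just described can be made concrete with a small sketch.
This is purely illustrative: all names here (FpgaPool, pool_donate,
pool_request) are ours, not part of the deployed system.

```c
/* Illustrative sketch of the donate/request pool model described
 * above; all names are hypothetical. */
#include <stddef.h>

typedef struct {
    unsigned fpga_id;  /* global identifier of a donated FPGA */
    int      healthy;  /* failing nodes are removed from the pool */
} PoolEntry;

typedef struct {
    PoolEntry *entries;
    size_t     count;
} FpgaPool;

/* A server whose service does not use its local FPGA donates it. */
void pool_donate(FpgaPool *p, unsigned fpga_id) {
    p->entries[p->count++] = (PoolEntry){ fpga_id, 1 };
}

/* A service requests n remote FPGAs; CPUs and FPGAs are allocated
 * independently, with no fixed CPU:FPGA ratio. */
size_t pool_request(const FpgaPool *p, unsigned out[], size_t n) {
    size_t granted = 0;
    for (size_t i = 0; i < p->count && granted < n; i++)
        if (p->entries[i].healthy)
            out[granted++] = p->entries[i].fpga_id;
    return granted;  /* may be fewer than requested */
}
```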
`
`Space limitations prevent a complete description of the
`management policies and mechanisms for the global resource
`manager. Instead, this paper focuses first on the hardware
`architecture necessary to treat remote FPGAs as available
`resources for global acceleration pools. We describe the com-
`munication protocols and mechanisms that allow nodes in
`a remote acceleration service to connect, including a proto-
`col called LTL (Lightweight Transport Layer) that supports
`lightweight connections between pairs of FPGAs, with mostly
`lossless transport and extremely low latency (small numbers
`of microseconds). This protocol makes the datacenter-scale
`remote FPGA resources appear closer than either a single local
`SSD access or the time to get through the host’s networking
`stack. Then, we describe an evaluation system of 5,760 servers
`which we built and deployed as a precursor to hyperscale
`production deployment. We measure the performance charac-
`teristics of the system, using web search and network flow
`encryption as examples. We show that significant gains in
efficiency are possible, and that this new architecture enables a
much broader and more robust architecture for the acceleration
of hyperscale datacenter services.
`
`
`
Fig. 2. Block diagram of the major components of the accelerator board.
[Diagram: an Altera Stratix V D5 FPGA with two 40Gb QSFP network ports
(each 4 lanes @ 10.3125 Gbps), one to the NIC and one to the TOR; two
PCIe Gen3 x8 connections through a mezzanine connector; a 4 GB DDR3-1600
channel (72 bits with ECC); a 256 Mb config flash (SPI); and a USB-to-JTAG
microcontroller with I2C for temperature, power, and LEDs.]

Fig. 3. Photograph of the manufactured board. The DDR channel is
implemented using discrete components. PCIe connectivity goes through a
mezzanine connector on the bottom side of the board (not shown).
`
`
`II. HARDWARE ARCHITECTURE
`
`There are many constraints on the design of hardware
`accelerators for the datacenter. Datacenter accelerators must
`be highly manageable, which means having few variations
`or versions. The environments must be largely homogeneous,
`which means that the accelerators must provide value across a
plurality of the servers using them. Given services’ rate of change
`and diversity in the datacenter, this requirement means that a
`single design must provide positive value across an extremely
`large, homogeneous deployment.
The solution to the competing demands of ho-
`mogeneity and specialization is to develop accelerator archi-
`tectures which are programmable, such as FPGAs and GPUs.
`These programmable architectures allow for hardware homo-
`geneity while allowing fungibility via software for different
`services. They must be highly flexible at the system level, in
`addition to being programmable, to justify deployment across a
`hyperscale infrastructure. The acceleration system we describe
`is sufficiently flexible to cover three scenarios: local compute
`acceleration (through PCIe), network acceleration, and global
`application acceleration,
`through configuration as pools of
`remotely accessible FPGAs. Local acceleration handles high-
`value scenarios such as search ranking acceleration where
`every server can benefit from having its own FPGA. Network
`acceleration can support services such as intrusion detection,
`deep packet
`inspection and network encryption which are
`critical to IaaS (e.g. “rental” of cloud servers), and which have
such a huge diversity of customers that it is difficult to
`justify local compute acceleration alone economically. Global
`acceleration permits accelerators unused by their host servers
`to be made available for large-scale applications, such as
`machine learning. This decoupling of a 1:1 ratio of servers
`to FPGAs is essential for breaking the “chicken and egg”
`problem where accelerators cannot be added until enough
`applications need them, but applications will not rely upon
`the accelerators until they are present in the infrastructure.
By decoupling the servers and FPGAs, software services that
demand more FPGA capacity can harness spare FPGAs from
other services that are slower to adopt (or do not require) the
accelerator fabric.

In addition to architectural requirements that provide suffi-
cient flexibility to justify scale production deployment, there
are also physical restrictions in current infrastructures that
must be overcome. These restrictions include strict power
limits, a small physical space in which to fit, resilience
to hardware failures, and tolerance to high temperatures.
For example, the target platform for the accelerator architecture
we describe is the widely-used OpenCompute server, which
constrained power to 35W, the physical size to roughly a
half-height, half-length PCIe expansion card (80mm x 140mm),
and required tolerance to an inlet air temperature of 70°C
at 160 lfm of airflow. These constraints make deployment of
current GPUs impractical except in special HPC SKUs, so we
selected FPGAs as the accelerator.

We designed the accelerator board as a standalone FPGA
board that is added to the PCIe expansion slot in a production
server SKU. Figure 2 shows a schematic of the board, and
Figure 3 shows a photograph of the board with major com-
ponents labeled. The FPGA is an Altera Stratix V D5, with
172.6K ALMs of programmable logic. The FPGA has one
4 GB DDR3-1600 DRAM channel, two independent PCIe Gen
3 x8 connections for an aggregate total of 16 GB/s in each
direction between the CPU and FPGA, and two independent
40 Gb Ethernet interfaces with standard QSFP+ connectors. A
256 Mb Flash chip holds the known-good golden image for the
FPGA that is loaded on power on, as well as one application
image.

To measure the power consumption limits of the entire
FPGA card (including DRAM, I/O channels, and PCIe), we
developed a power virus that exercises nearly all of the FPGA’s
interfaces, logic, and DSP blocks—while running the card in
a thermal chamber operating in worst-case conditions (peak
ambient temperature, high CPU load, and minimum airflow
due to a failed fan). Under these conditions, the card consumes
29.2W of power, which is well within the 32W TDP limits for
a card running in a single server in our datacenter, and below
the max electrical power draw limit of 35W.
`
`
`
Fig. 4. The Shell Architecture in a Single FPGA. [Diagram: a 40G network
bridge/bypass connects the two 40G MACs (TOR side and NIC side); an
Elastic Router connects one or more Roles to the Lightweight Transport
Layer protocol engine, two PCIe Gen 3 DMA engines, and the DDR3
controller.]

Fig. 5. Area and frequency breakdown of production-deployed image with
remote acceleration support.

Component                  ALMs            MHz
Role                       55340 (32%)     175
40G MAC/PHY (TOR)           9785 (6%)      313
40G MAC/PHY (NIC)          13122 (8%)      313
Network Bridge / Bypass     4685 (3%)      313
DDR3 Memory Controller     13225 (8%)      200
Elastic Router              3449 (2%)      156
LTL Protocol Engine        11839 (7%)      156
LTL Packet Switch           4815 (3%)      -
PCIe DMA Engine             6817 (4%)      250
Other                       8273 (5%)      -
Total Area Used           131350 (76%)     -
Total Area Available      172600           -
`
The dual 40 Gb Ethernet interfaces on the board could allow
for a private FPGA network as was done in our previous
work [4], but this configuration also allows the FPGA to be
`wired as a “bump-in-the-wire”, sitting between the network
`interface card (NIC) and the top-of-rack switch (ToR). Rather
`than cabling the standard NIC directly to the ToR, the NIC is
`cabled to one port of the FPGA, and the other FPGA port is
`cabled to the ToR, as we previously showed in Figure 1b.
`Maintaining the discrete NIC in the system enables us
`to leverage all of the existing network offload and packet
`transport functionality hardened into the NIC. This simplifies
`the minimum FPGA code required to deploy the FPGAs to
`simple bypass logic. In addition, both FPGA resources and
`PCIe bandwidth are preserved for acceleration functionality,
`rather than being spent on implementing the NIC in soft logic.
`Unlike [5], both the FPGA and NIC have separate connec-
`tions to the host CPU via PCIe. This allows each to operate
`independently at maximum bandwidth when the FPGA is
`being used strictly as a local compute accelerator (with the
`FPGA simply doing network bypass). This path also makes it
`possible to use custom networking protocols that bypass the
`NIC entirely when desired.
`One potential drawback to the bump-in-the-wire architecture
`is that an FPGA failure, such as loading a buggy application,
`could cut off network traffic to the server, rendering the server
`unreachable. However, unlike a torus or mesh network, failures
`in the bump-in-the-wire architecture do not degrade any neigh-
`boring FPGAs, making the overall system more resilient to
`failures. In addition, most datacenter servers (including ours)
`have a side-channel management path that exists to power
`servers on and off. By policy, the known-good golden image
`that loads on power up is rarely (if ever) overwritten, so power
`cycling the server through the management port will bring
`the FPGA back into a good configuration, making the server
`reachable via the network once again.
`
A. Shell Architecture
`
`Within each FPGA, we use the partitioning and terminology
`we defined in prior work [4] to separate the application logic
`
`(Role) from the common I/O and board-specific logic (Shell)
`used by accelerated services. Figure 4 gives an overview of
`this architecture’s major shell components, focusing on the
`network. In addition to the Ethernet MACs and PHYs, there
`is an intra-FPGA message router called the Elastic Router
`(ER) with virtual channel support for allowing multiple Roles
`access to the network, and a Lightweight Transport Layer
`(LTL) engine used for enabling inter-FPGA communication.
`Both are described in detail in Section V.
`
`The FPGA’s location as a bump-in-the-wire between the
`network switch and host means that
`it must always pass
`packets between the two network interfaces that it controls.
`The shell implements a bridge to enable this functionality,
`shown at the top of Figure 4. The shell provides a tap for
`FPGA roles to inject, inspect, and alter the network traffic as
`needed, such as when encrypting network flows, which we
`describe in Section III.
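As a rough software analogy for this bridge-plus-tap datapath (the real
implementation is FPGA logic; the function and type names below are ours,
not the shell's):

```c
/* Conceptual model of the shell's NIC<->TOR bridge with a role tap.
 * The shell forwards every packet between its two ports; a role may
 * inspect, alter, or inject traffic in flight (e.g., the line-rate
 * encryption of Section IV). Names are illustrative only. */
struct packet;

enum port { NIC_PORT, TOR_PORT };

extern int            role_claims(const struct packet *pkt);
extern struct packet *role_process(struct packet *pkt);
extern void           emit(enum port out, struct packet *pkt);

void bridge(enum port in, struct packet *pkt) {
    if (role_claims(pkt))
        pkt = role_process(pkt);   /* tap: inspect, alter, or inject */
    /* bypass: all other traffic passes straight through, so the host
     * stays reachable even when no role is active */
    emit(in == NIC_PORT ? TOR_PORT : NIC_PORT, pkt);
}
```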
`
`Full FPGA reconfiguration briefly brings down this network
`link, but in most cases applications are robust to brief network
`outages. When network traffic cannot be paused even briefly,
`partial reconfiguration permits packets to be passed through
`even during reconfiguration of the role.
`
`Figure 5 shows the area and clock frequency of the shell IP
`components used in the production-deployed image. In total,
`the design uses 44% of the FPGA to support all shell functions
`and the necessary IP blocks to enable access to remote pools of
`FPGAs (i.e., LTL and the Elastic Router). While a significant
`fraction of the FPGA is consumed by a few major shell
`components (especially the 40G PHY/MACs at 14% and the
`DDR3 memory controller at 8%), enough space is left for
`the role(s) to provide large speedups for key services, as we
`show in Section III. Large shell components that are stable for
`the long term are excellent candidates for hardening in future
`generations of datacenter-optimized FPGAs.
`

B. Datacenter Deployment

To evaluate the system architecture and performance at
scale, we manufactured and deployed 5,760 servers containing
this accelerator architecture and placed them into a production
datacenter. All machines were configured with the shell
described above. The servers and FPGAs were stress tested
using the power virus workload on the FPGA and a standard
burn-in test for the server under real datacenter environmental
conditions. The servers all passed, and were approved for
production use in the datacenter.

We brought up a production Bing web search ranking
service on the servers, with 3,081 of these machines using the
FPGA for local compute acceleration, and the rest used for
other functions associated with web search. We mirrored live
traffic to the bed for one month, and monitored the health and
stability of the systems as well as the correctness of the ranking
service. After one month, two FPGAs had hard failures, one
with a persistently high rate of single event upset (SEU) errors
in the configuration logic, and the other with an unstable 40 Gb
network link to the NIC. A third failure of the 40 Gb link to the
TOR was found not to be an FPGA failure, and was resolved
by replacing a network cable. Given aggregate datacenter
failure rates, we deemed the FPGA-related hardware failures
to be acceptably low for production.

We also measured a low number of soft errors, which
were all correctable. Five machines failed to train to the full
Gen3 x8 speeds on the secondary PCIe link. There were
eight total DRAM calibration failures, which were repaired
by reconfiguring the FPGA. The errors have since been traced
to a logical error in the DRAM interface rather than a hard
failure. Our shell scrubs the configuration state for soft errors
and reports any flipped bits. We measured an average rate
of one bit-flip in the configuration logic every 1025 machine
days. While the scrubbing logic often catches the flips before
functional failures occur, at least in one case there was a
role hang that was likely attributable to an SEU event. Since
the scrubbing logic completes roughly every 30 seconds, our
system recovers from hung roles automatically, and we use
ECC and other data integrity checks on critical interfaces, the
exposure of the ranking service to SEU events is low. Overall,
the hardware and interface stability of the system was deemed
suitable for scale production.

In the next sections, we show how this board/shell combi-
nation can support local application acceleration while simul-
taneously routing all of the server’s incoming and outgoing
network traffic. Following that, we show network acceleration,
and then acceleration of remote services.

III. LOCAL ACCELERATION

As we described earlier, it is important for an at-scale
datacenter accelerator to enhance local applications and in-
frastructure functions for different domains (e.g. web search
and IaaS). In this section we measure the performance of our
system on a large datacenter workload.

Fig. 6. 99% Latency versus Throughput of ranking service queries running
on a single server, with and without FPGAs enabled. [Plot: latency
(normalized to the 99th percentile target) versus normalized throughput,
for Software and Local FPGA configurations.]
`
`A. Bing Search Page Ranking Acceleration
`
`We describe Bing web search ranking acceleration as an
`example of using the local FPGA to accelerate a large-scale
`service. This example is useful both because Bing search is
a large datacenter workload, and because we described its
`acceleration in depth on the Catapult v1 platform [4]. At a
`high level, most web search ranking algorithms behave sim-
`ilarly; query-specific features are generated from documents,
`processed, and then passed to a machine learned model to
`determine how relevant the document is to the query.
`Unlike in [4], we implement only a subset of the feature
`calculations (typically the most expensive ones), and nei-
`ther compute post-processed synthetic features nor run the
`machine-learning portion of search ranking on the FPGAs.
`We do implement two classes of features on the FPGA. The
`first is the traditional finite state machines used in many search
`engines (e.g. “count the number of occurrences of query term
`two”). The second is a proprietary set of features generated
`by a complex dynamic programming engine.
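To make the first feature class concrete, here is a minimal software
rendering of the quoted example ("count the number of occurrences of query
term two"). In the hardware it runs as a small state machine over the
streamed document; the integer token encoding below is our assumption,
for illustration only.

```c
#include <stddef.h>

/* Count occurrences of one query term in a tokenized document stream.
 * The FPGA version examines one token per cycle; this loop is the
 * software equivalent. The token encoding is hypothetical. */
unsigned count_term_occurrences(const unsigned *doc_tokens, size_t n_tokens,
                                unsigned query_term_id) {
    unsigned count = 0;
    for (size_t i = 0; i < n_tokens; i++)
        if (doc_tokens[i] == query_term_id)
            count++;
    return count;
}
```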
`We implemented the selected features in a Feature Func-
`tional Unit (FFU), and the Dynamic Programming Features in
`a separate DPF unit. Both the FFU and DPF units were built
`into a shell that also had support for execution using remote
`accelerators, namely the ER and LTL blocks as described
`in Section V. This FPGA image also, of course, includes
`the network bridge for NIC-TOR communication, so all the
`server’s network traffic is passing through the FPGA while
`it is simultaneously accelerating document ranking. The pass-
`through traffic and the search ranking acceleration have no
`performance interaction.
`We present results in a format similar to the Catapult
`results to make direct comparisons simpler. We are running
`this image on a full production bed consisting of thousands
`of servers. In a production environment, it is infeasible to
`simulate many different points of query load as there is
`substantial infrastructure upstream that only produces requests
`at the rate of arrivals. To produce a smooth distribution with
`repeatable results, we used a single-box test with a stream
`
of 200,000 queries, and varied the arrival rate of requests to
measure query latency versus throughput. Figure 6 shows the
results; the curves shown are the measured latencies of the
99% slowest queries at each given level of throughput. Both
axes show normalized results.

Fig. 7. Five day query throughput and latency of ranking service queries
running in production, with and without FPGAs enabled. [Plot: normalized
99.9% latency and average query load for the software and FPGA datacenters
over days 1 through 5.]

Fig. 8. Query 99.9% Latency vs. Offered Load. [Plot: normalized 99.9%
latency versus normalized query load, for Software and Local FPGA.]
`We have normalized both the target production latency and
`typical average throughput in software-only mode to 1.0. The
`software is well tuned; it can achieve production targets for
`throughput at the required 99th percentile latency. With the
single local FPGA, at the target 99th percentile latency, the
`throughput can be safely increased by 2.25x, which means
`that fewer than half as many servers would be needed to
`sustain the target throughput at the required latency. Even at
`these higher loads, the FPGA remains underutilized, as the
`software portion of ranking saturates the host server before
`the FPGA is saturated. Having multiple servers drive fewer
`FPGAs addresses the underutilization of the FPGAs, which is
`the goal of our remote acceleration model.
Production Measurements: We have deployed the FPGA-
`accelerated ranking service into production datacenters at scale
`and report end-to-end measurements below.
`Figure 7 shows the performance of ranking service running
`in two production datacenters over a five day period, one
`with FPGAs enabled and one without (both datacenters are
`of identical scale and configuration, with the exception of the
FPGAs). These results are from live production traffic, not from a
`synthetic workload or mirrored traffic. The top two bars show
`the normalized tail query latencies at the 99.9th percentile
`(aggregated across all servers over a rolling time window),
`while the bottom two bars show the corresponding query
`loads received at each datacenter. As load varies throughout
`the day, the queries executed in the software-only datacenter
`experience a high rate of latency spikes, while the FPGA-
`accelerated queries have much lower, tighter-bound latencies,
`despite seeing much higher peak query loads.
`Figure 8 plots the load versus latency over the same 5-day
`
`period for the two datacenters. Given that these measurements
`were performed on production datacenters and traffic, we were
`unable to observe higher loads on the software datacenter
`(vs. the FPGA-enabled datacenter) due to a dynamic load
`balancing mechanism that caps the incoming traffic when tail
`latencies begin exceeding acceptable thresholds. Because the
`FPGA is able to process requests while keeping latencies low,
`it is able to absorb more than twice the offered load, while
`executing queries at a latency that never exceeds the software
`datacenter at any load.
`
`IV. NETWORK ACCELERATION
`
`The bump-in-the-wire architecture was developed to enable
`datacenter networking applications, the first of which is host-
`to-host line rate encryption/decryption on a per flow basis. As
`each packet passes from the NIC through the FPGA to the
`ToR, its header is examined to determine if it is part of an
`encrypted flow that was previously set up by software. If it
`is, the software-provided encryption key is read from internal
`FPGA SRAM or the FPGA-attached DRAM and is used to
`encrypt or decrypt the packet. Thus, once the encrypted flows
`are set up, there is no load on the CPUs to encrypt or decrypt
`the packets; encryption occurs transparently from software’s
`perspective, which sees all packets as unencrypted at the end
`points.
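Conceptually, the datapath just described reduces to a flow-table lookup
followed by an optional transform. The sketch below uses our own names
and a 128-bit key for concreteness; the flow-setup API and table format
are not specified in the paper.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of the per-flow bump-in-the-wire crypto path. Software
 * installs (five-tuple -> key) entries; the FPGA matches every packet
 * passing between NIC and TOR. All names here are illustrative. */
struct five_tuple {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
};

extern bool flow_key_lookup(const struct five_tuple *ft, uint8_t key[16]);
extern void aes_apply(uint8_t *payload, size_t len,
                      const uint8_t key[16], bool encrypt);

void on_packet(uint8_t *payload, size_t len,
               const struct five_tuple *ft, bool toward_tor) {
    uint8_t key[16];
    if (flow_key_lookup(ft, key))               /* key in SRAM or DRAM */
        aes_apply(payload, len, key, toward_tor); /* encrypt outbound,
                                                     decrypt inbound */
    /* packets on non-enrolled flows pass through untouched, so the
     * host always sees plaintext at the endpoints */
}
```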
`The ability to offload encryption/decryption at network line
`rates to the FPGA yields significant CPU savings. According
`to Intel [6], its AES GCM-128 performance on Haswell is
`1.26 cycles per byte for encrypt and decrypt each. Thus, at
`a 2.4 GHz clock frequency, 40 Gb/s encryption/decryption
`consumes roughly five cores. Different standards, such as
256-bit keys or CBC mode, are, however, significantly slower. In addition,
`there is sometimes a need for crypto hash functions, further
`reducing performance. For example, AES-CBC-128-SHA1 is
`needed for backward compatibility for some software stacks
`and consumes at least fifteen cores to achieve 40 Gb/s full
`duplex. Of course, consuming fifteen cores just for crypto is
`impractical on a typical multi-core server, as there would be
`that many fewer cores generating traffic. Even five cores of
`
`savings is significant when every core can otherwise generate
`revenue.
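As a check on the arithmetic, the five-core figure follows directly from
the cited rate (1.26 cycles per byte per direction at 2.4 GHz, with
40 Gb/s = 5 GB/s per direction):

```latex
% Cores consumed by 40 Gb/s full-duplex AES-GCM-128 in software:
\[
\frac{5 \times 10^{9}\,\mathrm{B/s} \times 1.26\,\mathrm{cycles/B}}
     {2.4 \times 10^{9}\,\mathrm{cycles/s\ per\ core}}
\approx 2.6\ \text{cores per direction},
\qquad
2 \times 2.6 \approx 5.25\ \text{cores at full duplex}.
\]
```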
`Our FPGA implementation supports full 40 Gb/s encryption
`and decryption. The worst case half-duplex FPGA crypto
`latency for AES-CBC-128-SHA1 is 11 µs for a 1500B packet,
`from first flit to first flit. In software, based on the Intel num-
`bers, it is approximately 4 µs. AES-CBC-SHA1 is, however,
`especially difficult for hardware due to tight dependencies. For
`example, AES-CBC requires processing 33 packets at a time
`in our implementation, taking only 128b from a single packet
`once every 33 cycles. GCM latency numbers are significantly
`better for FPGA since a single packet can be processed with
`no dependencies and thus can be perfectly pipelined.
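The 33-packet interleaving is the standard way to hide CBC's serial
dependency in a deep pipeline: block i+1 of a packet needs block i's
ciphertext, so a single packet can occupy only one pipeline slot at a
time. A software model of the scheduling follows; the depth matches the
text, and everything else is our illustration.

```c
#define PIPE_DEPTH 33  /* AES pipeline depth implied by "once every 33 cycles" */

/* One in-flight CBC stream per pipeline slot: the next 128-bit block of
 * a packet cannot be issued until the previous block's ciphertext (the
 * chaining value) emerges from the pipe, PIPE_DEPTH cycles later. */
struct cbc_stream {
    const unsigned char *next_block;  /* next 16-byte block of this packet */
    unsigned char        chain[16];   /* last ciphertext block (CBC chain) */
};

extern void issue_block(struct cbc_stream *s);  /* feed one block in */

/* Round-robin across PIPE_DEPTH independent packets keeps every stage
 * busy even though each packet issues only once per 33 cycles. */
void cbc_schedule(struct cbc_stream streams[PIPE_DEPTH], unsigned long cycles) {
    for (unsigned long c = 0; c < cycles; c++)
        issue_block(&streams[c % PIPE_DEPTH]);
}
```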
`The software performance numbers are Intel’s very best
`numbers that do not account for the disturbance to the
`core, such as effects on the cache if encryption/decryption is
`blocked, which is often the case. Thus, the real-world benefits
`of offloading crypto to the FPGA are greater than the numbers
`presented above.
`
`V. REMOTE ACCELERATION
`
`In the previous section, we showed that the Configurable
`Cloud acceleration architecture could support local functions,
`both for Bing web search ranking acceleration and accelerating
`networking/infrastructure functions, such as encryption of net-
`work flows. However, to treat the acceleration hardware as a
`global resource and to deploy services that consume more than
`one FPGA (e.g. more aggressive web search ranking, large-
`scale machine learning, and bioinformatics), communication
`among FPGAs is crucial. Other systems provide explicit
`connectivity through secondary networks or PCIe switches to
`allow multi-FPGA solutions. Either solution, however, limits
`the scale of connectivity. By having FPGAs communicate
`directly through the datacenter Ethernet infrastructure, large
`scale and low latency are both achieved. However,
`these
`communication channels have several requirements:
`
`• They cannot go through software protocol stacks on the
`CPU due to their long latencies.
`• They must be resilient to failures and dropped packets, but
`should drop packets rarely.
`• They should not consume significant FPGA resources.
`
`The rest of this section describes the FPGA implementation
`that supports cross-datacenter, inter-FPGA communication and
`meets the above requirements. There are two major functions
`that must be implemented on the FPGA: the inter-FPGA com-
`munication engine and the intra-FPGA router that coordinates
`the various flows of traffic on the FPGA among the network,
`PCIe, DRAM, and the application roles. We describe each
`below.
`
`A. Lightweight Transport Layer
`
We call the inter-FPGA network protocol LTL, for
Lightweight Transport Layer. This protocol, like previous
`work [7], uses UDP for frame encapsulation and IP for
`routing packets across the datacenter network. Low-latency
`
`communication demands infrequent packet drops and infre-
`quent packet reorders. By using “lossless” traffic classes
`provided in datacenter switches and provisioned for traffic like
`RDMA and FCoE, we avoid most packet drops and reorders.
`Separating out such traffic to their own classes also protects
the datacenter’s baseline TCP traffic.
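The encapsulation as described (LTL payloads carried in UDP datagrams,
routed by IP across the datacenter) can be sketched as a header layout.
The LTL field names and widths below are our guesses for illustration;
this excerpt does not give the production wire format.

```c
#include <stdint.h>

/* Hypothetical LTL-over-UDP frame layout. Only the use of IP routing
 * and UDP framing is stated in the text; the LTL connection fields
 * below are assumptions for illustration. */
#pragma pack(push, 1)
struct ltl_frame {
    uint8_t  ipv4[20];          /* IPv4 header: datacenter routing */
    uint16_t udp_src, udp_dst;  /* UDP header: frame encapsulation, as in [7] */
    uint16_t udp_len, udp_csum;
    uint16_t conn_id;           /* assumed: connection between two FPGAs */
    uint16_t seq;               /* assumed: sequence number for loss/reorder */
    uint16_t ack;               /* assumed: cumulative acknowledgment */
    uint8_t  payload[];         /* role-to-role message data */
};
#pragma pack(pop)
```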