`
Adrian M. Caulfield  Eric S. Chung  Andrew Putnam  Hari Angepat  Jeremy Fowers
Michael Haselman  Stephen Heil  Matt Humphrey  Puneet Kaur  Joo-Young Kim
Daniel Lo  Todd Massengill  Kalin Ovtcharov  Michael Papamichael  Lisa Woods
Sitaram Lanka  Derek Chiou  Doug Burger
`
`Microsoft Corporation
`
`Abstract—Hyperscale datacenter providers have struggled to
`balance the growing need for specialized hardware (efficiency)
`with the economic benefits of homogeneity (manageability). In
`this paper we propose a new cloud architecture that uses
`reconfigurable logic to accelerate both network plane func-
`tions and applications. This Configurable Cloud architecture
`places a layer of reconfigurable logic (FPGAs) between the
`network switches and the servers, enabling network flows to be
`programmably transformed at line rate, enabling acceleration
`of local applications running on the server, and enabling the
`FPGAs to communicate directly, at datacenter scale, to harvest
`remote FPGAs unused by their local servers. We deployed this
`design over a production server bed, and show how it can be
`used for both service acceleration (Web search ranking) and
network acceleration (encryption of data in transit at high
speed). This architecture is much more scalable than prior
work, which used secondary rack-scale networks for inter-FPGA
`communication. By coupling to the network plane, direct FPGA-
`to-FPGA messages can be achieved at comparable latency to
`previous work, without the secondary network. Additionally, the
`scale of direct inter-FPGA messaging is much larger. The average
round-trip latencies observed in our measurements among 24,
1,000, and 250,000 machines are under 3, 9, and 20 microseconds,
`respectively. The Configurable Cloud architecture has been
`deployed at hyperscale in Microsoft’s production datacenters
`worldwide.
`
`I. INTRODUCTION
`
`Modern hyperscale datacenters have made huge strides with
`improvements in networking, virtualization, energy efficiency,
and infrastructure management, but still have the same basic
structure that they have had for years: individual servers with
`multicore CPUs, DRAM, and local storage, connected by the
`NIC through Ethernet switches to other servers. At hyperscale
`(hundreds of thousands to millions of servers), there are signif-
`icant benefits to maximizing homogeneity; workloads can be
`migrated fungibly across the infrastructure, and management
`is simplified, reducing costs and configuration errors.
`Both the slowdown in CPU scaling and the ending of
`Moore’s Law have resulted in a growing need for hard-
`ware specialization to increase performance and efficiency.
`However, placing specialized accelerators in a subset of a
`hyperscale infrastructure’s servers reduces the highly desir-
`able homogeneity. The question is mostly one of economics:
`whether it is cost-effective to deploy an accelerator in every
`new server, whether it
`is better to specialize a subset of
an infrastructure’s new servers and maintain an ever-growing
number of configurations, or whether it is most cost-effective
`to do neither. Any specialized accelerator must be compatible
`with the target workloads through its deployment lifetime (e.g.
`six years: two years to design and deploy the accelerator and
`four years of server deployment lifetime). This requirement
`is a challenge given both the diversity of cloud workloads
`and the rapid rate at which they change (weekly or monthly).
`It is thus highly desirable that accelerators incorporated into
`hyperscale servers be programmable, the two most common
`examples being FPGAs and GPUs.
`Both GPUs and FPGAs have been deployed in datacenter
`infrastructure at reasonable scale without direct connectivity
`between accelerators [1], [2], [3]. Our recent publication
`described a medium-scale FPGA deployment in a production
`datacenter to accelerate Bing web search ranking using multi-
`ple directly-connected accelerators [4]. That design consisted
`of a rack-scale fabric of 48 FPGAs connected by a secondary
`network. While effective at accelerating search ranking, our
`first architecture had several significant limitations:
`• The secondary network (a 6x8 torus) required expensive
`and complex cabling, and required awareness of the physical
`location of machines.
`• Failure handling of the torus required complex re-routing
`of traffic to neighboring nodes, causing both performance loss
`and isolation of nodes under certain failure patterns.
`• The number of FPGAs that could communicate directly,
`without going through software, was limited to a single rack
`(i.e. 48 nodes).
`• The fabric was a limited-scale “bolt on” accelerator, which
`could accelerate applications but offered little for enhancing
`the datacenter infrastructure, such as networking and storage
`flows.
In this paper, we describe a new cloud-scale, FPGA-based
acceleration architecture, the Configurable Cloud, which
eliminates all of the limitations listed above with
`a single design. This architecture has been — and is being
`— deployed in the majority of new servers in Microsoft’s
`production datacenters across more than 15 countries and
`5 continents. A Configurable Cloud allows the datapath of
`cloud communication to be accelerated with programmable
`hardware. This datapath can include networking flows, stor-
`age flows, security operations, and distributed (multi-FPGA)
`applications.
Fig. 1. (a) Decoupled Programmable Hardware Plane, (b) Server + FPGA
schematic. [Figure: (a) the FPGAs, linked through TOR, L1, and L2 switches,
form a hardware acceleration plane decoupled from the servers, running
services such as web search ranking, deep neural networks, expensive
compression, and bioinformatics; (b) each server’s FPGA sits between its
NIC and the TOR switch.]

The key difference over previous work is that the accelera-
`tion hardware is tightly coupled with the datacenter network—
`placing a layer of FPGAs between the servers’ NICs and
`the Ethernet network switches. Figure 1b shows how the
`accelerator fits into a host server. All network traffic is routed
`through the FPGA, allowing it to accelerate high-bandwidth
`network flows. An independent PCIe connection to the host
`CPUs is also provided, allowing the FPGA to be used as a local
`compute accelerator. The standard network switch and topol-
`ogy removes the impact of failures on neighboring servers,
`removes the need for non-standard cabling, and eliminates the
`need to track the physical location of machines in each rack.
`While placing FPGAs as a network-side “bump-in-the-wire”
`solves many of the shortcomings of the torus topology, much
`more is possible. By enabling the FPGAs to generate and
`consume their own networking packets independent of the
`hosts, each and every FPGA in the datacenter can reach
`every other one (at a scale of hundreds of thousands) in
`a small number of microseconds, without any intervening
`software. This capability allows hosts to use remote FPGAs for
`acceleration with low latency, improving the economics of the
`accelerator deployment, as hosts running services that do not
`use their local FPGAs can donate them to a global pool and
`extract value which would otherwise be stranded. Moreover,
`this design choice essentially turns the distributed FPGA
`resources into an independent computer in the datacenter,
`at the same scale as the servers, that physically shares the
`network wires with software. Figure 1a shows a logical view
`of this plane of computation.
`This model offers significant flexibility. From the local
`perspective, the FPGA is used as a compute or a network
`accelerator. From the global perspective, the FPGAs can be
`managed as a large-scale pool of resources, with acceleration
`
`services mapped to remote FPGA resources. Ideally, servers
`not using all of their local FPGA resources can donate
`those resources to the global pool, while servers that need
`additional resources can request the available resources on
`remote servers. Failing nodes are removed from the pool
`with replacements quickly added. As demand for a service
`grows or shrinks, a global manager grows or shrinks the pools
`correspondingly. Services are thus freed from having a fixed
ratio of CPU cores per FPGA, and can instead allocate (or
`purchase, in the case of IaaS) only the resources of each type
`needed.
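The paper defers the global manager's policies and mechanisms, but the
allocation model just described can be made concrete with a small sketch.
This is purely illustrative: all names here (FpgaPool, pool_donate,
pool_request) are ours, not part of the deployed system.

```c
/* Illustrative sketch of the donate/request pool model described
 * above; all names are hypothetical. */
#include <stddef.h>

typedef struct {
    unsigned fpga_id;  /* global identifier of a donated FPGA */
    int      healthy;  /* failing nodes are removed from the pool */
} PoolEntry;

typedef struct {
    PoolEntry *entries;
    size_t     count;
} FpgaPool;

/* A server whose service does not use its local FPGA donates it. */
void pool_donate(FpgaPool *p, unsigned fpga_id) {
    p->entries[p->count++] = (PoolEntry){ fpga_id, 1 };
}

/* A service requests n remote FPGAs; CPUs and FPGAs are allocated
 * independently, with no fixed CPU:FPGA ratio. */
size_t pool_request(const FpgaPool *p, unsigned out[], size_t n) {
    size_t granted = 0;
    for (size_t i = 0; i < p->count && granted < n; i++)
        if (p->entries[i].healthy)
            out[granted++] = p->entries[i].fpga_id;
    return granted;  /* may be fewer than requested */
}
```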
`
`Space limitations prevent a complete description of the
`management policies and mechanisms for the global resource
`manager. Instead, this paper focuses first on the hardware
`architecture necessary to treat remote FPGAs as available
`resources for global acceleration pools. We describe the com-
`munication protocols and mechanisms that allow nodes in
`a remote acceleration service to connect, including a proto-
`col called LTL (Lightweight Transport Layer) that supports
`lightweight connections between pairs of FPGAs, with mostly
`lossless transport and extremely low latency (small numbers
`of microseconds). This protocol makes the datacenter-scale
`remote FPGA resources appear closer than either a single local
`SSD access or the time to get through the host’s networking
`stack. Then, we describe an evaluation system of 5,760 servers
`which we built and deployed as a precursor to hyperscale
`production deployment. We measure the performance charac-
`teristics of the system, using web search and network flow
`encryption as examples. We show that significant gains in
efficiency are possible, and that this new architecture enables a
much broader and more robust architecture for the acceleration
of hyperscale datacenter services.
`
`
`
Fig. 2. Block diagram of the major components of the accelerator board.
[Diagram: an Altera Stratix V D5 FPGA with two 40Gb QSFP network ports
(each 4 lanes @ 10.3125 Gbps), one to the NIC and one to the TOR; two
PCIe Gen3 x8 connections through a mezzanine connector; a 4 GB DDR3-1600
channel (72 bits with ECC); a 256 Mb config flash (SPI); and a USB-to-JTAG
microcontroller with I2C for temperature, power, and LEDs.]

Fig. 3. Photograph of the manufactured board. The DDR channel is
implemented using discrete components. PCIe connectivity goes through a
mezzanine connector on the bottom side of the board (not shown).
`
`
`II. HARDWARE ARCHITECTURE
`
`There are many constraints on the design of hardware
`accelerators for the datacenter. Datacenter accelerators must
`be highly manageable, which means having few variations
`or versions. The environments must be largely homogeneous,
`which means that the accelerators must provide value across a
plurality of the servers using them. Given services’ rate of change
`and diversity in the datacenter, this requirement means that a
`single design must provide positive value across an extremely
`large, homogeneous deployment.
The solution to the competing demands of ho-
`mogeneity and specialization is to develop accelerator archi-
`tectures which are programmable, such as FPGAs and GPUs.
`These programmable architectures allow for hardware homo-
`geneity while allowing fungibility via software for different
`services. They must be highly flexible at the system level, in
`addition to being programmable, to justify deployment across a
`hyperscale infrastructure. The acceleration system we describe
`is sufficiently flexible to cover three scenarios: local compute
`acceleration (through PCIe), network acceleration, and global
`application acceleration,
`through configuration as pools of
`remotely accessible FPGAs. Local acceleration handles high-
`value scenarios such as search ranking acceleration where
`every server can benefit from having its own FPGA. Network
`acceleration can support services such as intrusion detection,
`deep packet
`inspection and network encryption which are
`critical to IaaS (e.g. “rental” of cloud servers), and which have
such a huge diversity of customers that it is difficult to
`justify local compute acceleration alone economically. Global
`acceleration permits accelerators unused by their host servers
`to be made available for large-scale applications, such as
`machine learning. This decoupling of a 1:1 ratio of servers
`to FPGAs is essential for breaking the “chicken and egg”
`problem where accelerators cannot be added until enough
`applications need them, but applications will not rely upon
`the accelerators until they are present in the infrastructure.
By decoupling the servers and FPGAs, software services that
demand more FPGA capacity can harness spare FPGAs from
other services that are slower to adopt (or do not require) the
accelerator fabric.

In addition to architectural requirements that provide suffi-
cient flexibility to justify scale production deployment, there
are also physical restrictions in current infrastructures that
must be overcome. These restrictions include strict power
limits, a small physical space in which to fit, resilience
to hardware failures, and tolerance to high temperatures.
For example, the target platform for the accelerator architecture
we describe is the widely-used OpenCompute server, which
constrained power to 35W, the physical size to roughly a
half-height, half-length PCIe expansion card (80mm x 140mm),
and required tolerance to an inlet air temperature of 70°C
at 160 lfm of airflow. These constraints make deployment of
current GPUs impractical except in special HPC SKUs, so we
selected FPGAs as the accelerator.

We designed the accelerator board as a standalone FPGA
board that is added to the PCIe expansion slot in a production
server SKU. Figure 2 shows a schematic of the board, and
Figure 3 shows a photograph of the board with major com-
ponents labeled. The FPGA is an Altera Stratix V D5, with
172.6K ALMs of programmable logic. The FPGA has one
4 GB DDR3-1600 DRAM channel, two independent PCIe Gen
3 x8 connections for an aggregate total of 16 GB/s in each
direction between the CPU and FPGA, and two independent
40 Gb Ethernet interfaces with standard QSFP+ connectors. A
256 Mb Flash chip holds the known-good golden image for the
FPGA that is loaded on power on, as well as one application
image.

To measure the power consumption limits of the entire
FPGA card (including DRAM, I/O channels, and PCIe), we
developed a power virus that exercises nearly all of the FPGA’s
interfaces, logic, and DSP blocks—while running the card in
a thermal chamber operating in worst-case conditions (peak
ambient temperature, high CPU load, and minimum airflow
due to a failed fan). Under these conditions, the card consumes
29.2W of power, which is well within the 32W TDP limits for
a card running in a single server in our datacenter, and below
the max electrical power draw limit of 35W.
`
`
`
Fig. 4. The Shell Architecture in a Single FPGA. [Diagram: a 40G network
bridge/bypass connects the two 40G MACs (TOR side and NIC side); an
Elastic Router connects one or more Roles to the Lightweight Transport
Layer protocol engine, two PCIe Gen 3 DMA engines, and the DDR3
controller.]

Fig. 5. Area and frequency breakdown of production-deployed image with
remote acceleration support.

Component                  ALMs            MHz
Role                       55340 (32%)     175
40G MAC/PHY (TOR)           9785 (6%)      313
40G MAC/PHY (NIC)          13122 (8%)      313
Network Bridge / Bypass     4685 (3%)      313
DDR3 Memory Controller     13225 (8%)      200
Elastic Router              3449 (2%)      156
LTL Protocol Engine        11839 (7%)      156
LTL Packet Switch           4815 (3%)      -
PCIe DMA Engine             6817 (4%)      250
Other                       8273 (5%)      -
Total Area Used           131350 (76%)     -
Total Area Available      172600           -
`
The dual 40 Gb Ethernet interfaces on the board could allow
for a private FPGA network as was done in our previous
work [4], but this configuration also allows the FPGA to be
`wired as a “bump-in-the-wire”, sitting between the network
`interface card (NIC) and the top-of-rack switch (ToR). Rather
`than cabling the standard NIC directly to the ToR, the NIC is
`cabled to one port of the FPGA, and the other FPGA port is
`cabled to the ToR, as we previously showed in Figure 1b.
`Maintaining the discrete NIC in the system enables us
`to leverage all of the existing network offload and packet
`transport functionality hardened into the NIC. This simplifies
`the minimum FPGA code required to deploy the FPGAs to
`simple bypass logic. In addition, both FPGA resources and
`PCIe bandwidth are preserved for acceleration functionality,
`rather than being spent on implementing the NIC in soft logic.
`Unlike [5], both the FPGA and NIC have separate connec-
`tions to the host CPU via PCIe. This allows each to operate
`independently at maximum bandwidth when the FPGA is
`being used strictly as a local compute accelerator (with the
`FPGA simply doing network bypass). This path also makes it
`possible to use custom networking protocols that bypass the
`NIC entirely when desired.
`One potential drawback to the bump-in-the-wire architecture
`is that an FPGA failure, such as loading a buggy application,
`could cut off network traffic to the server, rendering the server
`unreachable. However, unlike a torus or mesh network, failures
`in the bump-in-the-wire architecture do not degrade any neigh-
`boring FPGAs, making the overall system more resilient to
`failures. In addition, most datacenter servers (including ours)
`have a side-channel management path that exists to power
`servers on and off. By policy, the known-good golden image
`that loads on power up is rarely (if ever) overwritten, so power
`cycling the server through the management port will bring
`the FPGA back into a good configuration, making the server
`reachable via the network once again.
`
A. Shell Architecture
`
`Within each FPGA, we use the partitioning and terminology
`we defined in prior work [4] to separate the application logic
`
`(Role) from the common I/O and board-specific logic (Shell)
`used by accelerated services. Figure 4 gives an overview of
`this architecture’s major shell components, focusing on the
`network. In addition to the Ethernet MACs and PHYs, there
`is an intra-FPGA message router called the Elastic Router
`(ER) with virtual channel support for allowing multiple Roles
`access to the network, and a Lightweight Transport Layer
`(LTL) engine used for enabling inter-FPGA communication.
`Both are described in detail in Section V.
`
`The FPGA’s location as a bump-in-the-wire between the
`network switch and host means that
`it must always pass
`packets between the two network interfaces that it controls.
`The shell implements a bridge to enable this functionality,
`shown at the top of Figure 4. The shell provides a tap for
`FPGA roles to inject, inspect, and alter the network traffic as
`needed, such as when encrypting network flows, which we
`describe in Section III.
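As a rough software analogy for this bridge-plus-tap datapath (the real
implementation is FPGA logic; the function and type names below are ours,
not the shell's):

```c
/* Conceptual model of the shell's NIC<->TOR bridge with a role tap.
 * The shell forwards every packet between its two ports; a role may
 * inspect, alter, or inject traffic in flight (e.g., the line-rate
 * encryption of Section IV). Names are illustrative only. */
struct packet;

enum port { NIC_PORT, TOR_PORT };

extern int            role_claims(const struct packet *pkt);
extern struct packet *role_process(struct packet *pkt);
extern void           emit(enum port out, struct packet *pkt);

void bridge(enum port in, struct packet *pkt) {
    if (role_claims(pkt))
        pkt = role_process(pkt);   /* tap: inspect, alter, or inject */
    /* bypass: all other traffic passes straight through, so the host
     * stays reachable even when no role is active */
    emit(in == NIC_PORT ? TOR_PORT : NIC_PORT, pkt);
}
```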
`
`Full FPGA reconfiguration briefly brings down this network
`link, but in most cases applications are robust to brief network
`outages. When network traffic cannot be paused even briefly,
`partial reconfiguration permits packets to be passed through
`even during reconfiguration of the role.
`
`Figure 5 shows the area and clock frequency of the shell IP
`components used in the production-deployed image. In total,
`the design uses 44% of the FPGA to support all shell functions
`and the necessary IP blocks to enable access to remote pools of
`FPGAs (i.e., LTL and the Elastic Router). While a significant
`fraction of the FPGA is consumed by a few major shell
`components (especially the 40G PHY/MACs at 14% and the
`DDR3 memory controller at 8%), enough space is left for
`the role(s) to provide large speedups for key services, as we
`show in Section III. Large shell components that are stable for
`the long term are excellent candidates for hardening in future
`generations of datacenter-optimized FPGAs.
`

B. Datacenter Deployment

To evaluate the system architecture and performance at
scale, we manufactured and deployed 5,760 servers containing
this accelerator architecture and placed them into a production
datacenter. All machines were configured with the shell
described above. The servers and FPGAs were stress tested
using the power virus workload on the FPGA and a standard
burn-in test for the server under real datacenter environmental
conditions. The servers all passed, and were approved for
production use in the datacenter.

We brought up a production Bing web search ranking
service on the servers, with 3,081 of these machines using the
FPGA for local compute acceleration, and the rest used for
other functions associated with web search. We mirrored live
traffic to the bed for one month, and monitored the health and
stability of the systems as well as the correctness of the ranking
service. After one month, two FPGAs had hard failures, one
with a persistently high rate of single event upset (SEU) errors
in the configuration logic, and the other with an unstable 40 Gb
network link to the NIC. A third failure of the 40 Gb link to the
TOR was found not to be an FPGA failure, and was resolved
by replacing a network cable. Given aggregate datacenter
failure rates, we deemed the FPGA-related hardware failures
to be acceptably low for production.

We also measured a low number of soft errors, which
were all correctable. Five machines failed to train to the full
Gen3 x8 speeds on the secondary PCIe link. There were
eight total DRAM calibration failures, which were repaired
by reconfiguring the FPGA. The errors have since been traced
to a logical error in the DRAM interface rather than a hard
failure. Our shell scrubs the configuration state for soft errors
and reports any flipped bits. We measured an average rate
of one bit-flip in the configuration logic every 1025 machine
days. While the scrubbing logic often catches the flips before
functional failures occur, at least in one case there was a
role hang that was likely attributable to an SEU event. Since
the scrubbing logic completes roughly every 30 seconds, our
system recovers from hung roles automatically, and we use
ECC and other data integrity checks on critical interfaces, the
exposure of the ranking service to SEU events is low. Overall,
the hardware and interface stability of the system was deemed
suitable for scale production.

In the next sections, we show how this board/shell combi-
nation can support local application acceleration while simul-
taneously routing all of the server’s incoming and outgoing
network traffic. Following that, we show network acceleration,
and then acceleration of remote services.

III. LOCAL ACCELERATION

As we described earlier, it is important for an at-scale
datacenter accelerator to enhance local applications and in-
frastructure functions for different domains (e.g. web search
and IaaS). In this section we measure the performance of our
system on a large datacenter workload.

Fig. 6. 99% Latency versus Throughput of ranking service queries running
on a single server, with and without FPGAs enabled. [Plot: latency
(normalized to the 99th percentile target) versus normalized throughput,
for Software and Local FPGA configurations.]
`
`A. Bing Search Page Ranking Acceleration
`
`We describe Bing web search ranking acceleration as an
`example of using the local FPGA to accelerate a large-scale
`service. This example is useful both because Bing search is
a large datacenter workload, and because we described its
`acceleration in depth on the Catapult v1 platform [4]. At a
`high level, most web search ranking algorithms behave sim-
`ilarly; query-specific features are generated from documents,
`processed, and then passed to a machine learned model to
`determine how relevant the document is to the query.
`Unlike in [4], we implement only a subset of the feature
`calculations (typically the most expensive ones), and nei-
`ther compute post-processed synthetic features nor run the
`machine-learning portion of search ranking on the FPGAs.
`We do implement two classes of features on the FPGA. The
`first is the traditional finite state machines used in many search
`engines (e.g. “count the number of occurrences of query term
`two”). The second is a proprietary set of features generated
`by a complex dynamic programming engine.
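To make the first feature class concrete, here is a minimal software
rendering of the quoted example ("count the number of occurrences of query
term two"). In the hardware it runs as a small state machine over the
streamed document; the integer token encoding below is our assumption,
for illustration only.

```c
#include <stddef.h>

/* Count occurrences of one query term in a tokenized document stream.
 * The FPGA version examines one token per cycle; this loop is the
 * software equivalent. The token encoding is hypothetical. */
unsigned count_term_occurrences(const unsigned *doc_tokens, size_t n_tokens,
                                unsigned query_term_id) {
    unsigned count = 0;
    for (size_t i = 0; i < n_tokens; i++)
        if (doc_tokens[i] == query_term_id)
            count++;
    return count;
}
```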
`We implemented the selected features in a Feature Func-
`tional Unit (FFU), and the Dynamic Programming Features in
`a separate DPF unit. Both the FFU and DPF units were built
`into a shell that also had support for execution using remote
`accelerators, namely the ER and LTL blocks as described
`in Section V. This FPGA image also, of course, includes
`the network bridge for NIC-TOR communication, so all the
`server’s network traffic is passing through the FPGA while
`it is simultaneously accelerating document ranking. The pass-
`through traffic and the search ranking acceleration have no
`performance interaction.
`We present results in a format similar to the Catapult
`results to make direct comparisons simpler. We are running
`this image on a full production bed consisting of thousands
`of servers. In a production environment, it is infeasible to
`simulate many different points of query load as there is
`substantial infrastructure upstream that only produces requests
`at the rate of arrivals. To produce a smooth distribution with
`repeatable results, we used a single-box test with a stream
`
of 200,000 queries, and varied the arrival rate of requests to
measure query latency versus throughput. Figure 6 shows the
results; the curves shown are the measured latencies of the
99% slowest queries at each given level of throughput. Both
axes show normalized results.

Fig. 7. Five day query throughput and latency of ranking service queries
running in production, with and without FPGAs enabled. [Plot: normalized
99.9% latency and average query load for the software and FPGA datacenters
over days 1 through 5.]

Fig. 8. Query 99.9% Latency vs. Offered Load. [Plot: normalized 99.9%
latency versus normalized query load, for Software and Local FPGA.]
`We have normalized both the target production latency and
`typical average throughput in software-only mode to 1.0. The
`software is well tuned; it can achieve production targets for
`throughput at the required 99th percentile latency. With the
single local FPGA, at the target 99th percentile latency, the
`throughput can be safely increased by 2.25x, which means
`that fewer than half as many servers would be needed to
`sustain the target throughput at the required latency. Even at
`these higher loads, the FPGA remains underutilized, as the
`software portion of ranking saturates the host server before
`the FPGA is saturated. Having multiple servers drive fewer
`FPGAs addresses the underutilization of the FPGAs, which is
`the goal of our remote acceleration model.
Production Measurements: We have deployed the FPGA-
`accelerated ranking service into production datacenters at scale
`and report end-to-end measurements below.
`Figure 7 shows the performance of ranking service running
`in two production datacenters over a five day period, one
`with FPGAs enabled and one without (both datacenters are
`of identical scale and configuration, with the exception of the
FPGAs). These results are from live production traffic, not from a
`synthetic workload or mirrored traffic. The top two bars show
`the normalized tail query latencies at the 99.9th percentile
`(aggregated across all servers over a rolling time window),
`while the bottom two bars show the corresponding query
`loads received at each datacenter. As load varies throughout
`the day, the queries executed in the software-only datacenter
`experience a high rate of latency spikes, while the FPGA-
`accelerated queries have much lower, tighter-bound latencies,
`despite seeing much higher peak query loads.
`Figure 8 plots the load versus latency over the same 5-day
`
`period for the two datacenters. Given that these measurements
`were performed on production datacenters and traffic, we were
`unable to observe higher loads on the software datacenter
`(vs. the FPGA-enabled datacenter) due to a dynamic load
`balancing mechanism that caps the incoming traffic when tail
`latencies begin exceeding acceptable thresholds. Because the
`FPGA is able to process requests while keeping latencies low,
`it is able to absorb more than twice the offered load, while
`executing queries at a latency that never exceeds the software
`datacenter at any load.
`
`IV. NETWORK ACCELERATION
`
`The bump-in-the-wire architecture was developed to enable
`datacenter networking applications, the first of which is host-
`to-host line rate encryption/decryption on a per flow basis. As
`each packet passes from the NIC through the FPGA to the
`ToR, its header is examined to determine if it is part of an
`encrypted flow that was previously set up by software. If it
`is, the software-provided encryption key is read from internal
`FPGA SRAM or the FPGA-attached DRAM and is used to
`encrypt or decrypt the packet. Thus, once the encrypted flows
`are set up, there is no load on the CPUs to encrypt or decrypt
`the packets; encryption occurs transparently from software’s
`perspective, which sees all packets as unencrypted at the end
`points.
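Conceptually, the datapath just described reduces to a flow-table lookup
followed by an optional transform. The sketch below uses our own names
and a 128-bit key for concreteness; the flow-setup API and table format
are not specified in the paper.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of the per-flow bump-in-the-wire crypto path. Software
 * installs (five-tuple -> key) entries; the FPGA matches every packet
 * passing between NIC and TOR. All names here are illustrative. */
struct five_tuple {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
};

extern bool flow_key_lookup(const struct five_tuple *ft, uint8_t key[16]);
extern void aes_apply(uint8_t *payload, size_t len,
                      const uint8_t key[16], bool encrypt);

void on_packet(uint8_t *payload, size_t len,
               const struct five_tuple *ft, bool toward_tor) {
    uint8_t key[16];
    if (flow_key_lookup(ft, key))               /* key in SRAM or DRAM */
        aes_apply(payload, len, key, toward_tor); /* encrypt outbound,
                                                     decrypt inbound */
    /* packets on non-enrolled flows pass through untouched, so the
     * host always sees plaintext at the endpoints */
}
```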
`The ability to offload encryption/decryption at network line
`rates to the FPGA yields significant CPU savings. According
`to Intel [6], its AES GCM-128 performance on Haswell is
`1.26 cycles per byte for encrypt and decrypt each. Thus, at
`a 2.4 GHz clock frequency, 40 Gb/s encryption/decryption
`consumes roughly five cores. Different standards, such as
256-bit keys or CBC mode, are, however, significantly slower. In addition,
`there is sometimes a need for crypto hash functions, further
`reducing performance. For example, AES-CBC-128-SHA1 is
`needed for backward compatibility for some software stacks
`and consumes at least fifteen cores to achieve 40 Gb/s full
`duplex. Of course, consuming fifteen cores just for crypto is
`impractical on a typical multi-core server, as there would be
`that many fewer cores generating traffic. Even five cores of
`
`savings is significant when every core can otherwise generate
`revenue.
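As a check on the arithmetic, the five-core figure follows directly from
the cited rate (1.26 cycles per byte per direction at 2.4 GHz, with
40 Gb/s = 5 GB/s per direction):

```latex
% Cores consumed by 40 Gb/s full-duplex AES-GCM-128 in software:
\[
\frac{5 \times 10^{9}\,\mathrm{B/s} \times 1.26\,\mathrm{cycles/B}}
     {2.4 \times 10^{9}\,\mathrm{cycles/s\ per\ core}}
\approx 2.6\ \text{cores per direction},
\qquad
2 \times 2.6 \approx 5.25\ \text{cores at full duplex}.
\]
```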
`Our FPGA implementation supports full 40 Gb/s encryption
`and decryption. The worst case half-duplex FPGA crypto
`latency for AES-CBC-128-SHA1 is 11 µs for a 1500B packet,
`from first flit to first flit. In software, based on the Intel num-
`bers, it is approximately 4 µs. AES-CBC-SHA1 is, however,
`especially difficult for hardware due to tight dependencies. For
`example, AES-CBC requires processing 33 packets at a time
`in our implementation, taking only 128b from a single packet
`once every 33 cycles. GCM latency numbers are significantly
`better for FPGA since a single packet can be processed with
`no dependencies and thus can be perfectly pipelined.
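The 33-packet interleaving is the standard way to hide CBC's serial
dependency in a deep pipeline: block i+1 of a packet needs block i's
ciphertext, so a single packet can occupy only one pipeline slot at a
time. A software model of the scheduling follows; the depth matches the
text, and everything else is our illustration.

```c
#define PIPE_DEPTH 33  /* AES pipeline depth implied by "once every 33 cycles" */

/* One in-flight CBC stream per pipeline slot: the next 128-bit block of
 * a packet cannot be issued until the previous block's ciphertext (the
 * chaining value) emerges from the pipe, PIPE_DEPTH cycles later. */
struct cbc_stream {
    const unsigned char *next_block;  /* next 16-byte block of this packet */
    unsigned char        chain[16];   /* last ciphertext block (CBC chain) */
};

extern void issue_block(struct cbc_stream *s);  /* feed one block in */

/* Round-robin across PIPE_DEPTH independent packets keeps every stage
 * busy even though each packet issues only once per 33 cycles. */
void cbc_schedule(struct cbc_stream streams[PIPE_DEPTH], unsigned long cycles) {
    for (unsigned long c = 0; c < cycles; c++)
        issue_block(&streams[c % PIPE_DEPTH]);
}
```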
`The software performance numbers are Intel’s very best
`numbers that do not account for the disturbance to the
`core, such as effects on the cache if encryption/decryption is
`blocked, which is often the case. Thus, the real-world benefits
`of offloading crypto to the FPGA are greater than the numbers
`presented above.
`
`V. REMOTE ACCELERATION
`
`In the previous section, we showed that the Configurable
`Cloud acceleration architecture could support local functions,
`both for Bing web search ranking acceleration and accelerating
`networking/infrastructure functions, such as encryption of net-
`work flows. However, to treat the acceleration hardware as a
`global resource and to deploy services that consume more than
`one FPGA (e.g. more aggressive web search ranking, large-
`scale machine learning, and bioinformatics), communication
`among FPGAs is crucial. Other systems provide explicit
`connectivity through secondary networks or PCIe switches to
`allow multi-FPGA solutions. Either solution, however, limits
`the scale of connectivity. By having FPGAs communicate
`directly through the datacenter Ethernet infrastructure, large
`scale and low latency are both achieved. However,
`these
`communication channels have several requirements:
`
`• They cannot go through software protocol stacks on the
`CPU due to their long latencies.
`• They must be resilient to failures and dropped packets, but
`should drop packets rarely.
`• They should not consume significant FPGA resources.
`
`The rest of this section describes the FPGA implementation
`that supports cross-datacenter, inter-FPGA communication and
`meets the above requirements. There are two major functions
`that must be implemented on the FPGA: the inter-FPGA com-
`munication engine and the intra-FPGA router that coordinates
`the various flows of traffic on the FPGA among the network,
`PCIe, DRAM, and the application roles. We describe each
`below.
`
`A. Lightweight Transport Layer
`
We call the inter-FPGA network protocol LTL, for
Lightweight Transport Layer. This protocol, like previous
`work [7], uses UDP for frame encapsulation and IP for
`routing packets across the datacenter network. Low-latency
`
`communication demands infrequent packet drops and infre-
`quent packet reorders. By using “lossless” traffic classes
`provided in datacenter switches and provisioned for traffic like
`RDMA and FCoE, we avoid most packet drops and reorders.
`Separating out such traffic to their own classes also protects
the datacenter’s baseline TCP traffic.
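The encapsulation as described (LTL payloads carried in UDP datagrams,
routed by IP across the datacenter) can be sketched as a header layout.
The LTL field names and widths below are our guesses for illustration;
this excerpt does not give the production wire format.

```c
#include <stdint.h>

/* Hypothetical LTL-over-UDP frame layout. Only the use of IP routing
 * and UDP framing is stated in the text; the LTL connection fields
 * below are assumptions for illustration. */
#pragma pack(push, 1)
struct ltl_frame {
    uint8_t  ipv4[20];          /* IPv4 header: datacenter routing */
    uint16_t udp_src, udp_dst;  /* UDP header: frame encapsulation, as in [7] */
    uint16_t udp_len, udp_csum;
    uint16_t conn_id;           /* assumed: connection between two FPGAs */
    uint16_t seq;               /* assumed: sequence number for loss/reorder */
    uint16_t ack;               /* assumed: cumulative acknowledgment */
    uint8_t  payload[];         /* role-to-role message data */
};
#pragma pack(pop)
```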