Orca: A Distributed Serving System for Transformer-Based Generative Models

Gyeong-In Yu and Joo Seong Jeong, Seoul National University; Geon-Woo Kim, FriendliAI and Seoul National University; Soojeong Kim, FriendliAI; Byung-Gon Chun, FriendliAI and Seoul National University

https://www.usenix.org/conference/osdi22/presentation/yu

This paper is included in the Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation.
July 11–13, 2022 • Carlsbad, CA, USA
ISBN 978-1-939133-28-1

ORCA: A Distributed Serving System for Transformer-Based Generative Models

Gyeong-In Yu (Seoul National University), Joo Seong Jeong (Seoul National University), Soojeong Kim (FriendliAI), Geon-Woo Kim (FriendliAI and Seoul National University), Byung-Gon Chun∗ (FriendliAI and Seoul National University)

∗Corresponding author.

Abstract

Large-scale Transformer-based models trained for generation tasks (e.g., GPT-3) have recently attracted huge interest, emphasizing the need for system support for serving models in this family. Since these models generate a next token in an autoregressive manner, one has to run the model multiple times to process an inference request, where each iteration of the model generates a single output token for the request. However, existing systems for inference serving do not perform well on this type of workload that has a multi-iteration characteristic, due to their inflexible scheduling mechanism that cannot change the current batch of requests being processed; requests that have finished earlier than other requests in a batch cannot return to the client, while newly arrived requests have to wait until the current batch completely finishes.

In this paper, we propose iteration-level scheduling, a new scheduling mechanism that schedules execution at the granularity of iteration (instead of request), where the scheduler invokes the execution engine to run only a single iteration of the model on the batch. In addition, to apply batching and iteration-level scheduling to a Transformer model at the same time, we suggest selective batching, which applies batching only to a selected set of operations. Based on these two techniques, we have implemented a distributed serving system called ORCA, with additional designs for scalability to models with hundreds of billions of parameters. Our evaluation on a GPT-3 175B model shows that ORCA can significantly outperform NVIDIA FasterTransformer in terms of both latency and throughput: 36.9× throughput improvement at the same level of latency.

1 Introduction

Language generation tasks are becoming increasingly important to many types of applications, such as chatbots [9, 52], summarization [41, 45, 54], code generation [13], and caption generation [65, 66]. Moreover, recent works published by AI21 Labs [37], DeepMind [26, 48], Google [15, 21, 63], Meta Platforms [10, 67], Microsoft [50], Microsoft & NVIDIA [59], and OpenAI [12] have reported that every language processing task, including translation [11, 17], classification [20, 53], question-answering [32, 33, 40], and more, can be cast as a language generation problem, and have shown great improvements along this direction. The rise of generative models is not limited to the language domain; the AI community has also shown growing interest in generation problems in other domains such as image, video, speech, or a mixture of multiple domains [19, 38, 51, 62]. At the heart of generative models lies the Transformer architecture [60] and its variants [15, 47–49]. By relying on the attention mechanism [60], Transformer models can learn better representations where each element of the sequence may have a direct connection with every other element, which was not possible in recurrent models [25].

To use generative models in real-world applications, we often delegate the inference procedure to a separate service responsible for ML inference serving. The growing demands for this service, which should provide inference results for client requests at low latency and high throughput, have facilitated the development of inference serving systems such as Triton Inference Server [7] and TensorFlow Serving [42]. These systems can use a separately developed DNN execution engine to perform the actual tensor operations. For example, we can deploy a service for language generation tasks by using a combination of Triton and FasterTransformer [4], an execution engine optimized for the inference of Transformer-based models. In this case, Triton is mainly responsible for grouping multiple client requests into a batch, while FasterTransformer receives the batch from Triton and conducts the inference procedure in the batched manner.

Unfortunately, we notice that existing inference systems, including both the serving system layer and the execution engine layer, have limitations in handling requests for Transformer-based generative models. Since these models are trained to generate a next token in an autoregressive manner, one should run the model as many times as the number of tokens to generate, whereas for other models like ResNet [24] and BERT [18] a request can be processed by running the model once. That is, in order to process a request to a generative model, we have to run multiple iterations of the model; each iteration generates a single output token, which is used as an input in the following iteration. Such a multi-iteration characteristic calls into question the current design of inference systems, where the serving system schedules the execution of the engine at the granularity of request. Under this design, when the serving system dispatches a batch of requests to the engine, the engine returns inference results for the entire batch at once after processing all requests within the batch. As different client requests may require different numbers of iterations for processing, requests that have finished earlier than others in the batch cannot return to the client, resulting in increased latency. Requests that arrive after the batch has been dispatched must also wait until the batch is fully processed, which can significantly increase the requests' queueing time.

In this paper, we propose to schedule the execution of the engine at the granularity of iteration instead of request. In particular, the serving system invokes the engine to run only a single iteration of the model on the batch. As a result, a newly arrived request can be considered for processing after waiting for only a single iteration of the model. The serving system checks whether a request has finished processing after every return from the engine; hence, finished requests can also be returned to the clients immediately.

Nevertheless, a noticeable challenge arises when we attempt to apply batching and iteration-level scheduling at the same time. Unlike canonical request-level scheduling, the proposed scheduling can issue a batch of requests in which each request has so far processed a different number of tokens. In such a case, the requests to the Transformer model cannot be processed in the batched manner because the attention mechanism calls for non-batchable tensor operations whose input tensors have variable shapes depending on the number of processed tokens.

To address this challenge, we suggest applying batching only to a selected set of operations, which we call selective batching. By taking different characteristics of operations into account, selective batching splits the batch and processes each request individually for the Attention¹ operation, while applying batching to the other operations of the Transformer model. We observe that the decision not to batch the executions of the Attention operation has only a small impact on efficiency. Since the Attention operation is not associated with any model parameters, applying batching to Attention has no benefit of reducing the amount of GPU memory reads by reusing the loaded parameters across multiple requests.

¹In some literature the Attention operation has an extended definition that includes linear layers (QKV Linear and Attn Out Linear; Figure 1b). Here we use the narrow definition described in Figure 1b.

Based on these techniques, we design and implement ORCA, a distributed serving system for Transformer-based generative models. In order to handle large-scale models, ORCA adopts parallelization strategies including intra-layer and inter-layer model parallelism, which were originally developed by training systems [55, 58] for Transformer models. We also devise a new scheduling algorithm for the proposed iteration-level scheduling, with additional considerations for memory management and pipelined execution across workers.

We evaluate ORCA using OpenAI GPT-3 [12] models with various configurations, scaling up to 341B parameters. The results show that ORCA significantly outperforms FasterTransformer [4], showing a 36.9× throughput improvement at the same level of latency. While we use a language model as a driving example throughout the paper and conduct experiments only on language models, generative models in other domains can benefit from our approach as long as the models are based on the Transformer architecture and use the autoregressive generation procedure [19, 38, 51, 62].

2 Background

We provide background on the inference procedure of GPT [12, 47], a representative example of Transformer-based generative models that we use throughout this paper, and on ML inference serving systems.

Inference procedure of GPT. GPT is an autoregressive language model based on one of the architectural variants of Transformer [60]. It takes text as input and produces new text as output. In particular, the model receives a sequence of input tokens and then completes the sequence by generating subsequent output tokens. Figure 1a illustrates a simplified computation graph that represents this procedure with a three-layer GPT model, where nodes and edges indicate Transformer layers and dependencies between the layers, respectively. The Transformer layers are executed in the order denoted by the numbers on the nodes, and the nodes that use the same set of model parameters (i.e., nodes representing the same layer) are filled with the same color.

The generated output token is fed back into the model to generate the next output token, imposing a sequential, one-by-one inference procedure. This autoregressive procedure of generating a single token is done by running all the layers of the model with the input, which is either a sequence of input tokens that came from the client or a previously generated output token. We define the run of all layers as an iteration of the model. In the example shown in Figure 1a, the inference procedure comprises three iterations. The first iteration (“iter 1”) takes all the input tokens (“I think this”) at once and generates the next token (“is”). This iteration constitutes the initiation phase, a procedure responsible for processing the input tokens and generating the first output token. The next two iterations (“iter 2” and “iter 3”), which constitute the increment phase, take the output token of the preceding iteration and generate the next token. In this case, “iter 3” is the last iteration because it produces “<EOS>”, a special end-of-sequence token that terminates output generation. Note that while the increment phase comprises multiple iterations because each iteration is only able to process a single token, the initiation phase is typically implemented as a single iteration by processing all the input tokens in parallel.

Figure 1: Illustrations for GPT's inference procedure, Transformer layer, and internal state usage. (a) A computation graph representing an inference procedure using a GPT model; the graph does not depict layers other than Transformer layers (e.g., embedding) for simplicity. (b) A Transformer layer used in GPT. (c) Internal state usage of Transformer; h, k, v, and c refer to layer input/output, Attention key, Attention value, and LSTM internal memory, respectively. l denotes the layer index and t denotes the token index.

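To make the two phases concrete, the following minimal Python sketch mirrors the example above; run_iteration is a toy stand-in for one run of all model layers, not part of any real engine.

EOS = "<EOS>"

def run_iteration(tokens):
    # Placeholder: a real engine would run every layer of the model and
    # return the next token for this request.
    canned = {("I", "think", "this"): "is", ("is",): "great", ("great",): EOS}
    return canned[tuple(tokens)]

def generate(input_tokens, max_tokens=16):
    # Initiation phase: process all client-provided input tokens in a single
    # iteration and produce the first output token.
    outputs = [run_iteration(input_tokens)]
    # Increment phase: each iteration consumes the previously generated token
    # and produces exactly one new token, until <EOS> (or a length limit).
    while outputs[-1] != EOS and len(outputs) < max_tokens:
        outputs.append(run_iteration([outputs[-1]]))
    return outputs

print(generate(["I", "think", "this"]))  # ['is', 'great', '<EOS>']
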
The original Transformer [60] employs two stacks of Transformer layers, while GPT's architecture consists of a single layer stack, namely the decoder. Figure 1b shows a Transformer layer used in GPT. Among the operations that compose the Transformer layer, Attention is the essence that distinguishes Transformer from other architectures. At a high level, the Attention operation computes a weighted average of the tokens of interest so that each token in the sequence is aware of the others. It takes three inputs, query, key, and value, computes dot products of the query (for the current token) with all keys (for the tokens of interest), applies Softmax on the dot products to get weights, and computes a weighted average of all values associated with the weights.

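Written out in the standard scaled dot-product form of [60] (the scaling factor and the symbol d_k, the key dimension, are not named in the paragraph above), the operation for a single query q against the keys K and values V of the tokens of interest is:

    \mathrm{Attention}(q, K, V) = \mathrm{softmax}\left( \frac{q K^{\top}}{\sqrt{d_k}} \right) V

where the softmax output supplies the weights of the weighted average.
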
Since the Attention operation requires the keys and values of all preceding tokens,² we consider the keys and values as internal states that should be maintained across multiple iterations. A naïve, state-less inference procedure would take all tokens in the sequence (including both the client-provided input tokens and the output tokens generated so far) to recompute all the keys and values at every iteration. To avoid such recomputation, fairseq [43] suggests incremental decoding, which saves the keys and values for reuse in successive iterations. Other systems for Transformer such as FasterTransformer [4] and Megatron-LM [3] do the same.

²Language models like GPT use causal masking, which means all preceding tokens are of interest and participate in the Attention operation.

Figure 1c illustrates the state usage pattern of Transformer, along with that of LSTM [25], which also maintains internal states. The main difference is that the size of the states (k for Attention key and v for value) in Transformer increases with iteration, whereas the size of the states (c for LSTM internal memory and h for the LSTM layer's input/output) in LSTM remains constant. When processing the token at index t, the Attention operation takes all previous Attention keys k_{l,1:t−1} and values v_{l,1:t−1} along with the current key k_{l,t} and value v_{l,t}.³ Therefore, the Attention operation must perform computation on tensors of different shapes depending on the number of tokens already processed.

³k_{l,1:t−1} represents the Attention keys of the l-th layer for tokens at indices 1 to t−1, while k_{l,t} is the Attention key of the l-th layer for the token at index t. The same holds for v_{l,1:t−1} and v_{l,t}.

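The following minimal numpy sketch (illustrative only; the KVCache class and attend function are hypothetical names, not fairseq or FasterTransformer APIs) shows incremental decoding for a single Attention head of one layer, with the cached keys and values growing by one row per iteration:

import numpy as np

class KVCache:
    def __init__(self, hidden):
        self.keys = np.empty((0, hidden))    # k_{l,1:t-1}, one more row per iteration
        self.values = np.empty((0, hidden))  # v_{l,1:t-1}

    def append(self, k_t, v_t):
        self.keys = np.vstack([self.keys, k_t[None, :]])
        self.values = np.vstack([self.values, v_t[None, :]])

def attend(q_t, cache, k_t, v_t):
    # Attention for the token at index t: dot products of the current query with
    # all keys seen so far (cached + current), softmax, then a weighted average
    # of the corresponding values.
    cache.append(k_t, v_t)
    scores = cache.keys @ q_t / np.sqrt(q_t.shape[0])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ cache.values

hidden = 4
cache = KVCache(hidden)
for t in range(3):  # three iterations: the cache (and tensor shapes) grow each time
    q_t, k_t, v_t = (np.random.rand(hidden) for _ in range(3))
    out = attend(q_t, cache, k_t, v_t)
    print(t, cache.keys.shape, out.shape)   # keys: (t+1, 4) -- shapes differ per iteration
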
Prior to the Attention operation, there are the layer normalization operation (LayerNorm) and the QKV Linear (linear and split operations to get the query, key, and value). The operations performed after Attention are, in order, a linear operation (Attn Out Linear), an add operation for the residual connection (Add), a layer normalization operation (LayerNorm), the multi-layer perceptron (MLP) operations, and the other residual connection operation (Add).

ML inference serving systems. Growing demands for ML-driven applications have made ML inference serving a critical workload in modern datacenters. Users (either end-users or internal microservices of the application) submit requests to an inference service, and the service replies to the requests based on a pre-defined ML model using its provisioned resources, typically equipped with specialized accelerators such as GPUs and TPUs. In particular, the service runs a DNN model with input data to generate output for the request. Just like other services operating in datacenters, a well-managed inference service should provide low latency and high throughput at a reasonable cost.

To meet such constraints, service operators often use ML inference serving systems such as Triton Inference Server [7] and TensorFlow Serving [42]. These systems can be seen as an abstraction sitting atop underlying model execution engines such as TensorRT [6], TVM [14], TensorFlow [8], and many others [44, 46], remaining agnostic to various kinds of ML models, execution engines, and computing hardware. While delegating the role of driving the main mathematical operations to the engines, serving systems are in charge of exposing endpoints that receive inference requests, scheduling executions of the engine, and sending responses to the requests. Accordingly, these systems focus on aspects such as batching the executions [7, 16, 35, 42, 56], selecting an appropriate model from multiple model variants [16, 27, 30, 57], deploying multiple models (each for different inference services) on the same device [7, 29, 35, 56], and so on.

Among the features and optimizations provided by serving systems, batching is key to achieving high accelerator utilization when using accelerators like GPUs. When we run the execution engine with batching enabled, the input tensors from multiple requests coalesce into a single, large input tensor before being fed to the first operation of the model. Since accelerators prefer large input tensors over small ones to better exploit their vast number of parallel computation units, the engine's throughput is highly dependent on the batch size, i.e., the number of inference requests the engine processes together. Reusing the model parameters loaded from off-chip memory is another merit of batched execution, especially when the model involves memory-intensive operations.

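As a toy illustration of this point (not code from any of the systems above), the sketch below coalesces the inputs of four requests into one tensor so that the weights of a single Linear operation are applied, and thus read from memory, once for the whole batch rather than once per request:

import numpy as np

H = 8                                                  # hidden size
W = np.random.rand(H, H)                               # weights of one Linear operation
requests = [np.random.rand(1, H) for _ in range(4)]    # 4 requests, 1 token each

# Unbatched: one small matmul per request (weights touched 4 times).
unbatched = [x @ W for x in requests]

# Batched: coalesce the inputs into a [B, H] tensor, then run one large matmul.
batch = np.concatenate(requests, axis=0)               # shape [4, H]
batched = batch @ W                                    # weights touched once

assert np.allclose(np.concatenate(unbatched, axis=0), batched)
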
Figure 2: Overall workflow of serving a generative language model with existing systems.

Figure 2 shows the overall workflow of serving a generative language model with existing serving systems and execution engines. The main component of the serving system (e.g., Triton [7]) is the scheduler, which is responsible for ① creating a batch of requests by retrieving requests from a queue and ② scheduling the execution engine (e.g., FasterTransformer [4]) to process the batch. The execution engine ③ processes the received batch by running multiple iterations of the model being served and ④ returns the generated text back to the serving system. In Figure 2, the serving system schedules the engine to process two requests (x1: “I think”, x2: “I love”) in a batch, and the engine generates “this is great” and “you” for requests x1 and x2, respectively.

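A minimal sketch of this request-level interaction is shown below (illustrative only; the function and field names are hypothetical and do not correspond to Triton or FasterTransformer APIs). The engine iterates internally until every request in the batch has finished, and only then returns to the serving system:

EOS = "<EOS>"

def run_one_iteration(batch):
    # Placeholder for one run of the model over the whole batch: a token is
    # computed for every request in the batch, finished or not.
    return [r["next"].pop(0) if r["next"] else EOS for r in batch]

def serve_request_level(batch):
    # The engine returns only after ALL requests in the batch have finished,
    # so early-finished requests wait and late-arriving requests cannot join.
    while not all(r["done"] for r in batch):
        for r, tok in zip(batch, run_one_iteration(batch)):
            if not r["done"]:
                r["output"].append(tok)
                r["done"] = tok == EOS
    return [r["output"] for r in batch]

batch = [
    {"next": ["this", "is", "great", EOS], "output": [], "done": False},  # x1
    {"next": ["you", EOS], "output": [], "done": False},                  # x2
]
print(serve_request_level(batch))  # x2 finishes after 2 iterations but is held until x1 ends
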
3 Challenges and Proposed Solutions

In this section, we describe challenges in serving Transformer-based generative models and propose two techniques: iteration-level scheduling and selective batching.

C1: Early-finished and late-joining requests. One major limitation of existing systems is that the serving system and the execution engine interact with each other only when (1) the serving system schedules the next batch on an idle engine; or (2) the engine finishes processing the current batch. In other words, these systems are designed to schedule executions at request granularity; the engine keeps the batch of requests fixed until all requests in the batch finish. This can be problematic when serving generative models, since each request in a batch may require a different number of iterations, resulting in certain requests finishing earlier than others. In the example shown in Figure 3, although request x2 finishes earlier than request x1, the engine performs computation for both “active” and “inactive” requests throughout all iterations. Such extra computation for inactive requests (x2 at iter 3 and 4) limits the efficiency of batched execution.

What makes it even worse is that this behavior prevents an early return of the finished request to the client, imposing a substantial amount of extra latency. This is because the engine only returns the execution results to the serving system when it finishes processing all requests in the batch. Similarly, when a new request arrives in the middle of the current batch's execution, the aforementioned scheduling mechanism makes the newly arrived request wait until all requests in the current batch have finished. We argue that the current request-level scheduling mechanism cannot efficiently handle workloads with the multi-iteration characteristic. Note that this problem of early-finished and late-joining requests does not occur in the training of language models; the training procedure finishes processing the whole batch in a single iteration by using the teacher forcing technique [64].

Figure 3: An illustration of a case where the requests have the same input length but some requests finish earlier than others. Shaded tokens represent input tokens. “-” denotes inputs and outputs of extra computation imposed by the scheduling.

S1: Iteration-level scheduling. To address the above limitations, we propose to schedule executions at the granularity of iteration. At a high level, the scheduler repeats the following procedure: (1) selects requests to run next; (2) invokes the engine to execute one iteration for the selected requests; and (3) receives execution results for the scheduled iteration. Since the scheduler receives a return on every iteration, it can detect the completion of a request and immediately return its generated tokens to the client. A newly arrived request gets a chance to start processing (i.e., the scheduler may select the new request to run next) after execution of the currently scheduled iteration, significantly reducing the queueing delay. With iteration-level scheduling, the scheduler has full control over how many and which requests are processed in each iteration.

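The following minimal Python sketch illustrates this loop; the names (Request, Engine) and the simple FIFO selection are hypothetical, and ORCA's actual selection algorithm is the one described in Section 4.2.

from collections import deque

EOS = "<EOS>"

class Request:
    def __init__(self, rid, canned_outputs):
        self.rid, self.tokens, self.canned = rid, [], deque(canned_outputs)

class Engine:
    def run_iteration(self, batch):
        # Stand-in for one run of all model layers on the selected requests:
        # exactly one output token per scheduled request.
        return [r.canned.popleft() for r in batch]

def scheduler_loop(pool, engine, max_batch_size=4):
    finished = []
    while pool:
        # (1) Select which requests to run for the NEXT iteration only.
        batch = pool[:max_batch_size]
        # (2) Run a single iteration of the model on the selected requests.
        for request, token in zip(batch, engine.run_iteration(batch)):
            # (3) Append the output token; finished requests leave the pool
            # immediately instead of waiting for the rest of the batch.
            request.tokens.append(token)
            if token == EOS:
                pool.remove(request)
                finished.append(request)
        # Requests arriving in the pool become eligible at the next loop turn.
    return finished

pool = [Request("x1", ["this", "is", "great", EOS]), Request("x2", ["you", EOS])]
for r in scheduler_loop(pool, Engine()):
    print(r.rid, r.tokens)   # x2 leaves the pool after 2 iterations, x1 after 4
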
Figure 4 depicts the system architecture and the overall workflow of ORCA using iteration-level scheduling. ORCA exposes an endpoint (e.g., HTTPS or gRPC) where inference requests arrive at the system and responses to the requests are sent out. The endpoint puts newly arrived requests in the request pool, a component that manages all requests in the system during their lifetime. The pool is monitored by the scheduler, which is responsible for: selecting a set of requests from the pool, scheduling the execution engine to run an iteration of the model on the set, receiving execution results (i.e., output tokens) from the engine, and updating the pool by appending each output token to the corresponding request. The engine is an abstraction for executing the actual tensor operations, which can be parallelized across multiple GPUs spread across multiple machines. In the example shown in Figure 4, the scheduler ① interacts with the request pool to decide which requests to run next and ② invokes the engine to run the four selected requests: (x1, x2, x3, x4). The scheduler provides the engine with the input tokens of requests scheduled for the first time. In this case, x3 and x4 have not run any iterations yet, so the scheduler hands over (x31, x32) for x3 and (x41, x42, x43) for x4. The engine ③ runs an iteration of the model on the four requests and ④ returns the generated output tokens (x15, x23, x33, x44), one for each scheduled request. Once a request has finished processing, the request pool removes the finished request and notifies the endpoint to send a response. Unlike the method shown in Figure 2, which must run multiple iterations on a scheduled batch until all requests within the batch finish, ORCA's scheduler can change which requests are going to be processed at every iteration. We describe the detailed algorithm for how to select the requests at every iteration in Section 4.2.

Figure 4: System overview of ORCA. Interactions between components represented as dotted lines indicate that the interaction takes place at every iteration of the execution engine. xij is the j-th token of the i-th request. Shaded tokens represent input tokens received from the clients, while unshaded tokens are generated by ORCA. For example, request x1 initially arrived with two input tokens (x11, x12) and has run two iterations so far, where the first and second iterations generated x13 and x14, respectively. On the other hand, request x3 contains only input tokens (x31, x32) because it has not run any iterations yet.

C2: Batching an arbitrary set of requests. When we try to use iteration-level scheduling in practice, one major challenge is batching. To achieve high efficiency, the execution engine should be able to process any selected set of requests in the batched manner. Without batching, one would have to process each selected request one by one, losing out on the massively parallel computation capabilities of GPUs.

Unfortunately, there is no guarantee that even for a pair of requests (xi, xj), their executions for the next iteration can be merged and replaced with a batched version. There are three cases where the next iteration of a pair of requests cannot be batched together: (1) both requests are in the initiation phase and each has a different number of input tokens (e.g., x3 and x4 in Figure 4); (2) both are in the increment phase and each is processing a token at a different index (x1 and x2); or (3) the requests are in different phases: initiation or increment (x1 and x3). Recall that in order to batch the execution of multiple requests, the execution of each request must consist of identical operations, each consuming identically-shaped input tensors. In the first case, the two requests cannot be processed in a batch because the “length” dimension of their input tensors, i.e., the number of input tokens, is not equal. The requests in the second case differ in the tensor shapes of their Attention keys and values because each is processing a token at a different index, as shown in Figure 1c. For the third case, we cannot batch iterations of different phases because they take different numbers of tokens as input; an iteration of the initiation phase processes all input tokens in parallel for efficiency, while in the increment phase each iteration takes a single token as its input (we assume the use of fairseq-style incremental decoding [43]).

Batching is only applicable when the two selected requests are in the same phase, with the same number of input tokens (in the case of the initiation phase) or with the same token index (in the case of the increment phase). This restriction significantly reduces the likelihood of batching in real-world workloads, because the scheduler must hope that two requests eligible for batching happen to be present at the same time. The likelihood further decreases exponentially as the batch size increases, making it impractical to use a large batch size that could deliver higher throughput without compromising latency.

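The three cases can be summarized as a small compatibility check; the sketch below is illustrative only, with hypothetical names and a simplified notion of request state, and mirrors the requests of Figure 4:

from dataclasses import dataclass

@dataclass
class RequestState:
    num_input_tokens: int       # tokens provided by the client
    num_processed_tokens: int   # tokens whose keys/values are already cached

    @property
    def in_initiation_phase(self):
        return self.num_processed_tokens == 0

def can_batch(a, b):
    if a.in_initiation_phase != b.in_initiation_phase:
        return False                                         # case (3): different phases
    if a.in_initiation_phase:
        return a.num_input_tokens == b.num_input_tokens      # case (1): same input length
    return a.num_processed_tokens == b.num_processed_tokens  # case (2): same token index

x1 = RequestState(num_input_tokens=2, num_processed_tokens=3)  # increment phase
x2 = RequestState(num_input_tokens=1, num_processed_tokens=1)  # increment phase
x3 = RequestState(num_input_tokens=2, num_processed_tokens=0)  # initiation phase
x4 = RequestState(num_input_tokens=3, num_processed_tokens=0)  # initiation phase
print(can_batch(x3, x4), can_batch(x1, x2), can_batch(x1, x3))  # False False False
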
S2: Selective batching. We propose selective batching, a technique for batched execution that allows high flexibility in composing requests as a batch. Instead of processing a batch of requests by “batchifying” all tensor operations composing the model, this technique selectively applies batching only to a handful of operations.

The main problem with batching described above is that the three aforementioned cases⁴ correspond to irregularly shaped input (or state) tensors, which cannot be coalesced into a single large tensor and fed into a batch operation. In the canonical batching mechanism, at each iteration, a Transformer layer takes a 3-dimensional input tensor of shape [B, L, H] generated by concatenating multiple [L, H] input tensors of the requests in a batch, where B is the batch size, L is the number of tokens processed together, and H is the hidden size of the model. For example, in Figure 3, “iter 1” (initiation phase) takes an input tensor of shape [2, 2, H] and “iter 2” (increment phase) takes a tensor of shape [2, 1, H]. However, when the scheduler decides to run an iteration on the batch (x1, x2, x3, x4) in Figure 4, the inputs for

For instance, the aforementioned input tensors for x3 and x4 can compose a 2-dimensional tensor of shape [ΣL, H] = [5, H] without an explicit batch dimension. This tensor can be fed into all non-Attention operations, including the Linear, LayerNorm, Add, and GeLU operations, because they do not need to distinguish tensor elements of different requests. On the other hand, the Attention operation requires a notion of requests (i.e., requires the batch dimension) to compute attention only between the tokens of the same request, typically done by applying cuBLAS routines for batch matrix multiplication.

Selective batching is aware of the different characteristics of each operation; it splits the batch and processes each request individually for the Attention operation, while applying token-wise (instead of request-wise) batching to the other operations, which do not need the notion of requests. Figure 5 presents the selective batching mechanism processing the batch of requests (x1, x2, x3, x4) described in Figure 4. This batch has 7 input tokens to process, so we make the input tensor have a shape of [7, H] and apply the non-Attention operations. Before the Attention operation, we insert a Split operation and run the Attention operation separately on the split tensor for each request. The outputs of the Attention operations are merged back into a tensor of shape [7, H] by a Merge operation, bringing back the batching functionality for the rest of the operations.

To let requests in the increment phase use the Attention keys and values for tokens processed in previous iterations, ORCA maintains the generated keys and values in the Attention K/V manager. The manager keeps these keys and values separately for each request until the scheduler explicitly asks to remove a certain request's keys and values, i.e., when the request has finished processing. The Attention operation for a request in the increment phase (x1 and x2) takes the keys and values of previous tokens (x11, x12, x13 for x1; x21 for x2) from the manager, along with the current token's query, key, and value from the Split operation, to compute attention between the current token and the previous ones.

Figure 5: An illustration of the ORCA execution engine running a Transformer layer on a batch of requests with selective batching. We only depict the QKV Linear, Attention, and Attn Out Linear operations for simplicity.

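The following numpy sketch illustrates the Split/Attention/Merge structure described above on the batch of Figure 4. It is illustrative only: a single head, identical toy Q/K/V projections, no masking or scaling subtleties, and hypothetical names (attention, kv_cache) that do not correspond to ORCA's implementation.

import numpy as np

H = 4
W_linear = np.random.rand(H, H)                 # stands in for any non-Attention op

def attention(q, k, v):
    # q: [L, H]; k, v: [T, H] (all keys/values of this request, cached + current)
    w = np.exp(q @ k.T / np.sqrt(H))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

# Batch from Figure 4: x1, x2 in the increment phase (1 token each),
# x3, x4 in the initiation phase (2 and 3 input tokens) -> 7 tokens in total.
lengths = {"x1": 1, "x2": 1, "x3": 2, "x4": 3}
kv_cache = {"x1": [np.random.rand(3, H)] * 2,   # cached keys/values of x11, x12, x13
            "x2": [np.random.rand(1, H)] * 2,   # cached keys/values of x21
            "x3": [np.empty((0, H))] * 2, "x4": [np.empty((0, H))] * 2}

x = np.random.rand(sum(lengths.values()), H)    # [7, H]: no batch dimension needed
h = x @ W_linear                                # non-Attention ops run token-wise batched

outputs, offset = [], 0
for rid, L in lengths.items():                  # Split: one slice per request
    q = k = v = h[offset:offset + L]            # toy QKV (identical) just for the shapes
    keys = np.vstack([kv_cache[rid][0], k])     # cached keys + current key(s)
    values = np.vstack([kv_cache[rid][1], v])
    outputs.append(attention(q, keys, values))  # Attention runs per request
    offset += L
merged = np.vstack(outputs)                     # Merge: back to [7, H] for later ops
print(merged.shape)                             # (7, 4)
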
4 ORCA Design

Based on the above techniques, we design and implement ORCA: a distributed serving system for Transformer-based generative models. We have already discussed the system components and the overall execution model of ORCA while describing Figure 4. In this section, we address the remaining issues of how to build an efficient system that can scale to large-scale models with hundreds of billions of parameters. We also describe the scheduling algorithm for iteration-level scheduling, i.e., how to select a batch of requests from the request pool at every iteration.

4.1 Distributed Architecture

Recent works [12, 31] have shown that scaling language models can dramatically improve the quality of models. Hence,
