ORCA: A Distributed Serving System for
Transformer-Based Generative Models
`Gyeong-In Yu and Joo Seong Jeong, Seoul National University;
`Geon-Woo Kim, FriendliAI and Seoul National University; Soojeong Kim, FriendliAI;
`Byung-Gon Chun, FriendliAI and Seoul National University
`https://www.usenix.org/conference/osdi22/presentation/yu
`
`This paper is included in the Proceedings of the
`16th USENIX Symposium on Operating Systems
`Design and Implementation.
`July 11–13, 2022 • Carlsbad, CA, USA
`978-1-939133-28-1
`
`Open access to the Proceedings of the
`16th USENIX Symposium on Operating
`Systems Design and Implementation
`is sponsored by
`
`
`
`
`ORCA: A Distributed Serving System for
`Transformer-Based Generative Models
`
`Gyeong-In Yu
`Seoul National University
`
`Joo Seong Jeong
`Seoul National University
`
Soojeong Kim
FriendliAI

Geon-Woo Kim
FriendliAI
Seoul National University

Byung-Gon Chun∗
FriendliAI
Seoul National University

∗Corresponding author.

`Abstract
`Large-scale Transformer-based models trained for generation
`tasks (e.g., GPT-3) have recently attracted huge interest, em-
`phasizing the need for system support for serving models in
this family. Since these models generate the next token in an au-
`toregressive manner, one has to run the model multiple times
`to process an inference request where each iteration of the
`model generates a single output token for the request. How-
`ever, existing systems for inference serving do not perform
`well on this type of workload that has a multi-iteration char-
`acteristic, due to their inflexible scheduling mechanism that
`cannot change the current batch of requests being processed;
`requests that have finished earlier than other requests in a
`batch cannot return to the client, while newly arrived requests
`have to wait until the current batch completely finishes.
`In this paper, we propose iteration-level scheduling, a new
`scheduling mechanism that schedules execution at the gran-
`ularity of iteration (instead of request) where the scheduler
`invokes the execution engine to run only a single iteration of
`the model on the batch. In addition, to apply batching and
`iteration-level scheduling to a Transformer model at the same
`time, we suggest selective batching, which applies batching
`only to a selected set of operations. Based on these two tech-
`niques, we have implemented a distributed serving system
`called ORCA, with additional designs for scalability to models
`with hundreds of billions of parameters. Our evaluation on a
`GPT-3 175B model shows that ORCA can significantly out-
`perform NVIDIA FasterTransformer in terms of both latency
`and throughput: 36.9× throughput improvement at the same
`level of latency.
`
`1 Introduction
`
`Language generation tasks are becoming increasingly
paramount to many types of applications, such as chatbots [9,
`52], summarization [41,45,54], code generation [13], and cap-
tion generation [65,66]. Moreover, recent works published by
`AI21 Labs [37], DeepMind [26,48], Google [15,21,63], Meta
`Platforms [10,67], Microsoft [50], Microsoft & NVIDIA [59],
`and OpenAI [12] have reported that every language process-
`ing task, including translation [11, 17], classification [20, 53],
`question-answering [32, 33, 40] and more, can be cast as a
`language generation problem and have shown great improve-
`ments along this direction. The rise of generative models is
`not limited to the language domain; the AI community has
also shown growing interest in generation problems in other do-
`mains such as image, video, speech, or a mixture of multiple
`domains [19,38,51,62]. At the heart of generative models lies
`the Transformer architecture [60] and its variants [15, 47–49].
`By relying on the attention mechanism [60], Transformer
`models can learn better representations where each element
`of the sequence may have a direct connection with every other
`element, which was not possible in recurrent models [25].
`To use generative models in real-world applications, we
`often delegate the inference procedure to a separate service
`responsible for ML inference serving. The growing demands
`for this service, which should provide inference results for
`client requests at low latency and high throughput, have fa-
`cilitated the development of inference serving systems such
`as Triton Inference Server [7] and TensorFlow Serving [42].
`These systems can use a separately-developed DNN execution
`engine to perform the actual tensor operations. For example,
`we can deploy a service for language generation tasks by
`using a combination of Triton and FasterTransformer [4], an
`execution engine optimized for the inference of Transformer-
`based models. In this case, Triton is mainly responsible for
`grouping multiple client requests into a batch, while Faster-
`Transformer receives the batch from Triton and conducts the
`inference procedure in the batched manner.
`Unfortunately, we notice that the existing inference sys-
`tems, including both the serving system layer and the execu-
`tion engine layer, have limitations in handling requests for
`Transformer-based generative models. Since these models are
trained to generate the next token in an autoregressive manner,
`one should run the model as many times as the number of to-
kens to generate, while for other models like ResNet [24] and
`BERT [18] a request can be processed by running the model
`once. That is, in order to process a request to the generative
`model, we have to run multiple iterations of the model; each
`iteration generates a single output token, which is used as
an input in the following iteration. Such a multi-iteration char-
`acteristic calls into question the current design of inference
`systems, where the serving system schedules the execution
`of the engine at the granularity of request. Under this design,
`when the serving system dispatches a batch of requests to
`the engine, the engine returns inference results for the entire
`batch at once after processing all requests within the batch.
`As different client requests may require different numbers of
`iterations for processing, requests that have finished earlier
`than others in the batch cannot return to the client, resulting
in an increased latency. Requests that arrive after the batch
is dispatched must also wait until the whole batch finishes, which can
significantly increase the requests’ queueing time.
`In this paper, we propose to schedule the execution of the
`engine at the granularity of iteration instead of request. In
`particular, the serving system invokes the engine to run only a
`single iteration of the model on the batch. As a result, a newly
`arrived request can be considered for processing after waiting
`for only a single iteration of the model. The serving system
`checks whether a request has finished processing after every
`return from the engine – hence the finished requests can also
`be returned to the clients immediately.
`Nevertheless, a noticeable challenge arises when we at-
`tempt to apply batching and the iteration-level scheduling at
`the same time. Unlike the canonical request-level scheduling,
`the proposed scheduling can issue a batch of requests where
`each request has so far processed a different number of tokens.
`In such a case, the requests to the Transformer model cannot
`be processed in the batched manner because the attention
`mechanism calls for non-batchable tensor operations whose
`input tensors have variable shapes depending on the number
`of processed tokens.
To address this challenge, we suggest applying batching
`only to a selected set of operations, which we call selective
`batching. By taking different characteristics of operations into
`account, selective batching splits the batch and processes each
`request individually for the Attention1 operation while apply-
`ing batching to other operations of the Transformer model.
`We observe that the decision not to batch the executions of
`the Attention operation has only a small impact on efficiency.
`Since the Attention operation is not associated with any model
`parameters, applying batching to Attention has no benefit of
`reducing the amount of GPU memory reads by reusing the
`loaded parameters across multiple requests.
`Based on these techniques, we design and implement
`ORCA, a distributed serving system for Transformer-based
`generative models. In order to handle large-scale models,
`
`1In some literature the Attention operation has an extended definition that
`includes linear layers (QKV Linear and Attn Out Linear; Figure 1b). On the
`other hand, we use a narrow definition as described in Figure 1b.
`
`ORCA adopts parallelization strategies including intra-layer
`and inter-layer model parallelism, which were originally de-
`veloped by training systems [55, 58] for Transformer models.
`We also devise a new scheduling algorithm for the proposed
`iteration-level scheduling, with additional considerations for
`memory management and pipelined execution across work-
`ers.
`We evaluate ORCA using OpenAI GPT-3 [12] models with
various configurations, scaling up to 341B parameters. The
`results show that ORCA significantly outperforms FasterTrans-
`former [4], showing 36.9× throughput improvement at the
`same level of latency. While we use a language model as
`a driving example throughout the paper and conduct experi-
`ments only on language models, generative models in other
`domains can benefit from our approach as long as the mod-
`els are based on the Transformer architecture and use the
`autoregressive generation procedure [19, 38, 51, 62].
`
`2 Background
`
`We provide background on the inference procedure of
`GPT [12, 47], a representative example of Transformer-based
`generative models that we use throughout this paper, and ML
`inference serving systems.
`
`Inference procedure of GPT. GPT is an autoregressive
language model based on one of the architectural variants of
`Transformer [60]. It takes text as input and produces new text
`as output. In particular, the model receives a sequence of input
`tokens and then completes the sequence by generating subse-
`quent output tokens. Figure 1a illustrates a simplified compu-
`tation graph that represents this procedure with a three-layer
`GPT model, where nodes and edges indicate Transformer
`layers and dependencies between the layers, respectively. The
`Transformer layers are executed in the order denoted by the
`numbers on the nodes, and the nodes that use the same set
`of model parameters (i.e., nodes representing the same layer)
`are filled with the same color.
`The generated output token is fed back into the model to
`generate the next output token, imposing a sequential, one-
`by-one inference procedure. This autoregressive procedure of
`generating a single token is done by running all the layers of
`the model with the input, which is either a sequence of input
`tokens that came from the client or a previously generated out-
`put token. We define the run of all layers as an iteration of the
`model. In the example shown in Figure 1a, the inference pro-
`cedure comprises three iterations. The first iteration (“iter 1”)
`takes all the input tokens (“I think this”) at once and generates
`the next token (“is”). This iteration composes an initiation
`phase, a procedure responsible for processing the input tokens
`and generating the first output token. The next two iterations
`(“iter 2” and “iter 3”), which compose an increment phase,
`take the output token of the preceding iteration and generate
`
`
`
`
`(a) A computation graph representing
`an inference procedure using a GPT
`model. The graph does not depict lay-
`ers other than Transformer layers (e.g.,
`embedding) for simplicity.
`
`(b) A Transformer layer used in GPT.
`
`(c) Internal state usage of Transformer. h, k, v, and c refer
`to layer input/output, Attention key, Attention value, and
`LSTM internal memory, respectively. l denotes layer
`index and t denotes token index.
`
`Figure 1: Illustrations for GPT’s inference procedure, Transformer layer, and internal state usage.
`
`the next token. In this case, “iter 3” is the last iteration because
`it produces “<EOS>”, a special end-of-sequence token that
`terminates output generation. Note that while the increment
`phase comprises multiple iterations because each iteration
`is only able to process a single token, the initiation phase is
`typically implemented as a single iteration by processing all
`the input tokens in parallel.
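
As a concrete illustration, the following Python sketch mimics this
two-phase procedure. The run_iteration function and the token values
are hypothetical placeholders for a real Transformer engine; only the
control flow (one initiation iteration followed by per-token increment
iterations) reflects the procedure described above.

# Minimal sketch of the autoregressive inference procedure described above.
# `run_iteration` is a hypothetical stand-in for one run of all model layers;
# it returns the next token id given the tokens provided so far.
EOS = 0  # hypothetical end-of-sequence token id

def run_iteration(tokens):
    # Placeholder "model": a real engine would run every Transformer layer here.
    return EOS if len(tokens) >= 6 else len(tokens) + 1

def generate(input_tokens, max_new_tokens=16):
    tokens = list(input_tokens)
    # Initiation phase: one iteration processes all input tokens in parallel
    # and produces the first output token.
    next_token = run_iteration(tokens)
    outputs = [next_token]
    tokens.append(next_token)
    # Increment phase: each iteration consumes the previously generated token
    # and produces exactly one new token, until <EOS> or a length limit.
    while outputs[-1] != EOS and len(outputs) < max_new_tokens:
        next_token = run_iteration(tokens)
        outputs.append(next_token)
        tokens.append(next_token)
    return outputs

print(generate([4, 2, 3]))  # e.g. -> [4, 5, 6, 0]; 0 denotes <EOS>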
`The original Transformer [60] employs two stacks of Trans-
`former layers, while GPT’s architecture consists of a single
layer stack, namely the decoder. Figure 1b shows a Transformer
`layer used in GPT. Among the operations that compose the
`Transformer layer, Attention is the essence that distinguishes
`Transformer from other architectures. At a high level, the At-
`tention operation computes a weighted average of the tokens
`of interest so that each token in the sequence is aware of the
`other. It takes three inputs, query, key, and value, computes dot
`products of the query (for the current token) with all keys (for
`the tokens of interest), applies Softmax on the dot products
`to get weights, and conducts weighted average of all values
`associated with the weights.
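
The computation just described can be written down in a few lines. The
NumPy sketch below is a simplified single-query Attention; the scaling
and multi-head details used in practice are omitted since they are not
essential to the discussion.

import numpy as np

def single_query_attention(query, keys, values):
    # query: [H], keys: [t, H], values: [t, H]
    # Dot products of the current token's query with all keys ...
    scores = keys @ query                      # [t]
    # ... Softmax over the scores to obtain weights ...
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # [t]
    # ... and a weighted average of all values.
    return weights @ values                    # [H]

H, t = 4, 3
rng = np.random.default_rng(0)
out = single_query_attention(rng.normal(size=H),
                             rng.normal(size=(t, H)),
                             rng.normal(size=(t, H)))
print(out.shape)  # (4,)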
`Since the Attention requires keys and values of all pre-
`ceding tokens,2 we consider the keys and values as internal
`states that should be maintained across multiple iterations. A
`naïve, state-less inference procedure would take all tokens in
`the sequence (including both the client-provided input tokens
`and the output tokens generated so far) to recompute all the
`keys and values at every iteration. To avoid such recomputa-
`tion, fairseq [43] suggests incremental decoding, which saves
`the keys and values for reuse in successive iterations. Other
`systems for Transformer such as FasterTransformer [4] and
`Megatron-LM [3] also do the same.
`
`2Language models like GPT use causal masking, which means all pre-
`ceding tokens are of interest and participate in the Attention operation.
`
`Figure 1c illustrates the state usage pattern of Transformer,
`along with LSTM [25] that also maintains internal states. The
`main difference is that the size of the states (k for Attention
`key and v for value) in Transformer increases with iteration,
`whereas the size of the states (c for LSTM internal memory
`and h for LSTM layer’s input/output) in LSTM remains con-
`stant. When processing the token at index t, the Attention
operation takes all previous Attention keys k_{l,1:t−1} and values
v_{l,1:t−1} along with the current key k_{l,t} and value v_{l,t}.3 There-
`fore, the Attention operation should perform computation on
`tensors of different shapes depending on the number of tokens
`already processed.
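
A minimal sketch of this incremental decoding pattern is shown below,
with a per-layer key/value cache that grows by one entry per iteration.
The projection weights and shapes are illustrative and do not correspond
to fairseq's or ORCA's actual implementation.

import numpy as np

H = 8  # hidden size (illustrative)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.normal(size=(H, H)) for _ in range(3))

# Incremental decoding: keep k_{l,1:t} and v_{l,1:t} across iterations instead
# of recomputing them from the whole token sequence at every iteration.
kv_cache = {"k": np.zeros((0, H)), "v": np.zeros((0, H))}

def attention_step(h_t, cache):
    q, k, v = h_t @ Wq, h_t @ Wk, h_t @ Wv
    cache["k"] = np.vstack([cache["k"], k])   # now holds k_{l,1:t}
    cache["v"] = np.vstack([cache["v"], v])   # now holds v_{l,1:t}
    scores = cache["k"] @ q / np.sqrt(H)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ cache["v"]

for t in range(3):                    # three increment-phase iterations
    out = attention_step(rng.normal(size=H), kv_cache)
    print(t + 1, kv_cache["k"].shape)  # cached keys grow: (1,H), (2,H), (3,H)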
`Prior to the Attention operation, there are the layer normal-
`ization operation (LayerNorm) and the QKV Linear (linear
`and split operations to get the query, key and value). Opera-
`tions performed after Attention are, in order, a linear operation
`(Attn Out Linear), an add operation for residual connection
`(Add), layer normalization operation (LayerNorm), the multi-
`layer perceptron (MLP) operations, and the other residual
`connection operation (Add).
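
Putting the pieces together, the following sketch walks through one such
layer at the level of tensor shapes. It assumes a simplified single-head
Attention and randomly initialized weights; it is meant only to make the
order of operations concrete, not to reproduce any engine's kernels.

import numpy as np

H = 8
rng = np.random.default_rng(0)

def linear(x, d_out):          # illustrative random weights
    return x @ rng.normal(size=(x.shape[-1], d_out))

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def attention(q, k, v):        # causal, single head, for brevity
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores += np.triu(np.full(scores.shape, -1e9), k=1)   # mask future tokens
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ v

def transformer_layer(x):      # x: [L, H]
    h = layer_norm(x)                          # LayerNorm
    qkv = linear(h, 3 * H)                     # QKV Linear (then split)
    q, k, v = np.split(qkv, 3, axis=-1)
    h = linear(attention(q, k, v), H)          # Attention, Attn Out Linear
    x = x + h                                  # Add (residual)
    h = layer_norm(x)                          # LayerNorm
    h = linear(gelu(linear(h, 4 * H)), H)      # MLP: Linear, GeLU, Linear
    return x + h                               # Add (residual)

print(transformer_layer(rng.normal(size=(5, H))).shape)  # (5, 8)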
`
`ML inference serving systems. Growing demands for ML-
`driven applications have made ML inference serving service
`a critical workload in modern datacenters. Users (either the
`end-user or internal microservices of the application) submit
requests to an inference service, and the service replies to
the requests based on a pre-defined ML model using its
`provisioned resource, typically equipped with specialized ac-
`celerators such as GPUs and TPUs. In particular, the service
`runs a DNN model with input data to generate output for the
`
3k_{l,1:t−1} represents Attention keys of the l-th layer for tokens at indices
1 to t − 1 while k_{l,t} is for the Attention key of the l-th layer for the token at
index t. Same for v_{l,1:t−1} and v_{l,t}.
`
`
`Figure 2: Overall workflow of serving a generative language
`model with existing systems.
`
`request. Just like other services operating on datacenters, a
`well-managed inference service should provide low latency
and high throughput at a reasonable cost.
`To meet such constraints, service operators often use ML
`inference serving systems such as Triton Inference Server [7]
`and TensorFlow Serving [42]. These systems can be seen as
`an abstraction sitting atop underlying model execution en-
`gines such as TensorRT [6], TVM [14], TensorFlow [8], and
`many others [44, 46], being agnostic to various kinds of ML
`models, execution engines, and computing hardware. While
`delegating the role of driving the main mathematical opera-
`tions to the engines, serving systems are in charge of exposing
`endpoints that receive inference requests, scheduling execu-
`tions of the engine, and sending responses to the requests.
`Accordingly, these systems focus on aspects such as batch-
`ing the executions [7, 16, 35, 42, 56], selecting an appropriate
`model from multiple model variants [16,27,30,57], deploying
`multiple models (each for different inference services) on the
`same device [7, 29, 35, 56], and so on.
`Among the features and optimizations provided by serv-
ing systems, batching is key to achieving high accelerator
`utilization when using accelerators like GPUs. When we run
`the execution engine with batching enabled, the input tensors
`from multiple requests coalesce into a single, large input ten-
`sor before being fed to the first operation of the model. Since
`the accelerators prefer large input tensors over small ones to
`better exploit the vast amount of parallel computation units,
`the engine’s throughput is highly dependent on the batch size,
`i.e., the number of inference requests the engine processes
`together. Reusing the model parameters loaded from off-chip
memory is another merit of batched execution, especially
`when the model involves memory-intensive operations.
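
As a toy illustration of these two effects, the snippet below coalesces
the inputs of several requests into one tensor so that a single
multiplication with the loaded weight matrix serves the whole batch; the
shapes and weights are arbitrary.

import numpy as np

H = 8
rng = np.random.default_rng(0)
W = rng.normal(size=(H, H))                    # model parameters loaded once

requests = [rng.normal(size=(1, H)) for _ in range(4)]   # one token per request

# Unbatched: four small matrix multiplications, re-reading W for each request.
unbatched = [x @ W for x in requests]

# Batched: coalesce the inputs into a single [B, H] tensor and multiply once,
# which both feeds the GPU a larger problem and reuses the loaded parameters.
batch = np.concatenate(requests, axis=0)       # [4, H]
batched = batch @ W

print(np.allclose(np.concatenate(unbatched, axis=0), batched))  # True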
`Figure 2 shows an overall workflow of serving a generative
`language model with existing serving systems and execution
`engines. The main component of the serving system (e.g., Tri-
`ton [7]) is the scheduler, which is responsible for creating
`a batch of requests by retrieving requests from a queue and
`scheduling the execution engine (e.g., FasterTransformer [4])
`to process the batch. The execution engine processes the
`received batch by running multiple iterations of the model
`being served and returns the generated text back to the
`serving system. In Figure 2, the serving system schedules the
`engine to process two requests (x1: “I think”, x2: “I love”) in
`
a batch and the engine generates “this is great” and “you” for
requests x1 and x2, respectively.

Figure 3: An illustration for a case where the requests have the
same input length but some requests finish earlier than others.
Shaded tokens represent input tokens. “-” denotes inputs and
outputs of extra computation imposed by the scheduling.
`
`3 Challenges and Proposed Solutions
`
`In this section, we describe challenges in serving Transformer-
`based generative models and propose two techniques:
`iteration-level scheduling and selective batching.
`
`C1: Early-finished and late-joining requests. One major
`limitation of existing systems is that the serving system and
`the execution engine interact with each other only when (1)
`the serving system schedules the next batch on an idle engine;
`or (2) the engine finishes processing the current batch. In
`other words, these systems are designed to schedule execu-
`tions at request granularity; the engine maintains a batch of
`requests fixed until all requests in the batch finish. This can be
`problematic in the serving of generative models, since each
request in a batch may require a different number of iterations,
`resulting in certain requests finishing earlier than the others.
`In the example shown in Figure 3, although request x2 finishes
`earlier than request x1, the engine performs computation for
`both “active” and “inactive” requests throughout all iterations.
`Such extra computation for inactive requests (x2 at iter 3 and
`4) limits the efficiency of batched execution.
`What makes it even worse is that this behavior prevents an
`early return of the finished request to the client, imposing a
`substantial amount of extra latency. This is because the engine
`only returns the execution results to the serving system when
`it finishes processing all requests in the batch. Similarly, when
`a new request arrives in the middle of the current batch’s
`execution, the aforementioned scheduling mechanism makes
`the newly arrived request wait until all requests in the current
`batch have finished. We argue that the current request-level
`scheduling mechanism cannot efficiently handle workloads
`with multi-iteration characteristic. Note that this problem of
`early-finished and late-joining requests does not occur in the
`training of language models; the training procedure finishes
`processing the whole batch in a single iteration by using the
`teacher forcing technique [64].
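
A small back-of-the-envelope calculation makes the cost of this behavior
explicit. The iteration counts below mirror the Figure 3 example; the
helper function is purely illustrative.

# Request-level scheduling keeps the batch fixed until every request finishes,
# so the engine runs max(iterations) steps for all requests in the batch.
def request_level_cost(iters_per_request):
    steps = max(iters_per_request)                 # iterations run for the batch
    useful = sum(iters_per_request)                # per-request slots doing real work
    total = steps * len(iters_per_request)         # per-request slots actually executed
    return steps, total - useful                   # (batch latency, wasted slots)

# Figure 3: x1 needs 4 iterations, x2 finishes after 2.
print(request_level_cost([4, 2]))  # -> (4, 2): two wasted per-request iterations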
`
Figure 4: System overview of ORCA. Interactions between
components represented as dotted lines indicate that the inter-
action takes place at every iteration of the execution engine.
xij is the j-th token of the i-th request. Shaded tokens repre-
sent input tokens received from the clients, while unshaded
tokens are generated by ORCA. For example, request x1 ini-
tially arrived with two input tokens (x11,x12) and has run
two iterations so far, where the first and second iterations gen-
erated x13 and x14, respectively. On the other hand, request
x3 only contains input tokens (x31,x32) because it has not run
any iterations yet.

S1: Iteration-level scheduling. To address the above limi-
tations, we propose to schedule executions at the granularity
of iteration. At a high level, the scheduler repeats the follow-
ing procedure: (1) selects requests to run next; (2) invokes
the engine to execute one iteration for the selected requests;
and (3) receives execution results for the scheduled iteration.
Since the scheduler receives a return on every iteration, it can
detect the completion of a request and immediately return its
generated tokens to the client. For a newly arrived request, the
request gets a chance to start processing (i.e., the scheduler
may select the new request to run next) after execution of
the currently scheduled iteration, significantly reducing the
queueing delay. With iteration-level scheduling, the sched-
uler has full control over how many and which requests are
processed in each iteration.
Figure 4 depicts the system architecture and the overall
workflow of ORCA using the iteration-level scheduling. ORCA
exposes an endpoint (e.g., HTTPS or gRPC) where inference
requests arrive at the system and responses to the requests
are sent out. The endpoint puts newly arrived requests in the
request pool, a component that manages all requests in the
system during their lifetime. The pool is monitored by the
scheduler, which is responsible for: selecting a set of requests
from the pool, scheduling the execution engine to run an it-
eration of the model on the set, receiving execution results
(i.e., output tokens) from the engine, and updating the pool
by appending each output token to the corresponding request.
The engine is an abstraction for executing the actual tensor
operations, which can be parallelized across multiple GPUs
spread across multiple machines. In the example shown in
Figure 4, the scheduler interacts with the request pool to
`decide which requests to run next and invokes the engine
`to run four selected requests: (x1,x2,x3,x4). The scheduler
`provides the engine with input tokens of the requests sched-
`uled for the first time. In this case, x3 and x4 have not run
`any iterations yet, so the scheduler hands over (x31,x32) for
`x3 and (x41,x42,x43) for x4. The engine runs an iteration
`of the model on the four requests and returns generated
`output tokens (x15,x23,x33,x44), one for each scheduled re-
`quest. Once a request has finished processing, the request pool
`removes the finished request and notifies the endpoint to send
a response. Unlike the method shown in Figure 2, which must
run multiple iterations on a scheduled batch until all requests
within the batch finish, ORCA’s scheduler can change
`which requests are going to be processed at every iteration.
We describe the detailed algorithm for selecting the
requests at every iteration in Section 4.2.
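
A minimal sketch of this scheduling loop is given below. The request
pool, the selection policy, and the engine call are simplified stand-ins;
in particular, run_one_iteration fakes the engine's output, and the actual
selection algorithm used by ORCA is the one described in Section 4.2.

from collections import deque

EOS = 0
MAX_BATCH = 4

def run_one_iteration(batch):
    # Hypothetical engine call: runs a single iteration of the model on the
    # batch and returns one new token per request (faked from token counts).
    return {r["id"]: (EOS if len(r["tokens"]) >= r["target_len"]
                      else len(r["tokens"]))
            for r in batch}

def serve(request_pool):
    pool = deque(request_pool)              # stand-in for ORCA's request pool
    finished = {}
    while pool:
        batch = list(pool)[:MAX_BATCH]      # (1) select requests to run next
        outputs = run_one_iteration(batch)  # (2) run a single iteration
        for req in batch:                   # (3) receive results, update the pool
            tok = outputs[req["id"]]
            req["tokens"].append(tok)
            if tok == EOS:                  # finished requests return immediately,
                pool.remove(req)            # without waiting for the rest of the batch
                finished[req["id"]] = req["tokens"]
        # Newly arrived requests could be appended to `pool` here and would be
        # considered after only one iteration of delay.
    return finished

reqs = [{"id": i, "tokens": [1, 2], "target_len": n} for i, n in enumerate([4, 3, 5])]
print(serve(reqs))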
`
`C2: Batching an arbitrary set of requests. When we try
`to use the iteration-level scheduling in practice, one major
challenge we face is batching. To achieve
`high efficiency, the execution engine should be able to process
`any selected set of requests in the batched manner. Without
`batching, one would have to process each selected request
`one by one, losing out on the massively parallel computation
`capabilities of GPUs.
Unfortunately, even for a pair of requests (xi, xj), there is
no guarantee that their executions for the next iteration can be
merged and replaced with a batched version. There are three
cases for a pair of requests where the next iteration cannot
be batched together: (1) both requests are in the initiation
phase and each has a different number of input tokens (e.g.,
x3 and x4 in Figure 4); (2) both are in the increment phase
and each is processing a token at a different index (x1 and
x2); or (3) the requests are in different phases:
`initiation or increment (x1 and x3). Recall that in order to
`batch the execution of multiple requests, the execution of each
`request must consist of identical operations, each consuming
`identically-shaped input tensors. In the first case, the two
requests cannot be processed in a batch because the “length”
dimension of their input tensors, i.e., the number of input
tokens, differs. The requests in the second case differ
in the tensor shapes of their Attention keys and values
because each processes a token at a different index, as shown in
Figure 1c. For the third case, we cannot batch the iterations of
different phases because they take different numbers of tokens
as input; an iteration of the initiation phase processes all input
`tokens in parallel for efficiency, while in the increment phase
`each iteration takes a single token as its input (we assume the
`use of fairseq-style incremental decoding [43]).
`Batching is only applicable when the two selected requests
`are in the same phase, with the same number of input tokens
`(in case of the initiation phase) or with the same token index
`(in case of the increment phase). This restriction significantly
reduces the likelihood of batching in real-world workloads,
because the scheduler can only hope that two requests eligible
for batching happen to be present at the same time. The
likelihood further decreases exponentially as the batch size
increases, making it impractical to use a large batch size that
can pull out better throughput without compromising latency.

S2: Selective batching. We propose selective batching, a
technique for batched execution that allows high flexibility in
composing requests as a batch. Instead of processing a batch
of requests by “batchifying” all tensor operations composing
the model, this technique selectively applies batching only to a
handful of operations.
The main problem regarding batching described above is
that the three aforementioned cases4 correspond to irregu-
larly shaped input (or state) tensors, which cannot be coa-
lesced into a single large tensor and fed into a batch opera-
tion. In the canonical batching mechanism, at each iteration,
a Transformer layer takes a 3-dimensional input tensor of
shape [B,L,H] generated by concatenating multiple [L,H] in-
put tensors of requests in a batch, where B is the batch size,
L is the number of tokens processed together, and H is the
hidden size of the model. For example, in Figure 3, “iter 1”
(initiation phase) takes an input tensor of shape [2,2,H] and
“iter 2” (increment phase) takes a tensor of shape [2,1,H].
However, when the scheduler decides to run an iteration on
batch (x1,x2,x3,x4) in Figure 4, the inputs for the four requests
cannot coalesce into a single 3-dimensional tensor because the
requests have different numbers of tokens to process: x1 and
x2 are each processing a single token in the increment phase,
while x3 and x4 have two and three input tokens, respectively,
in the initiation phase. Such irregularly shaped inputs can still
coalesce into a 2-dimensional tensor along the token dimension,
as long as an operation does not need to distinguish tensor
elements of different requests.
`For instance, the aforementioned input tensors for x3 and x4
`can compose a 2-dimensional tensor of shape [∑L,H] = [5,H]
`without an explicit batch dimension. This tensor can be fed
`into all non-Attention operations including Linear, Layer-
`Norm, Add, and GeLU operations because they do not need to
`distinguish tensor elements of different requests. On the other
`hand, the Attention operation requires a notion of requests
`(i.e., requires the batch dimension) to compute attention only
`between the tokens of the same request, typically done by
`applying cuBLAS routines for batch matrix multiplication.
`Selective batching is aware of the different characteristics
`of each operation; it splits the batch and processes each re-
`quest individually for the Attention operation while applying
`token-wise (instead of request-wise) batching to other oper-
`ations without the notion of requests. Figure 5 presents the
`selective batching mechanism processing a batch of requests
`(x1,x2,x3,x4) described in Figure 4. This batch has 7 input
`tokens to process, so we make the input tensor have a shape
`of [7,H] and apply the non-Attention operations. Before the
`Attention operation, we insert a Split operation and run the
`Attention operation separately on the split tensor for each
`request. The outputs of Attention operations are merged back
`into a tensor of shape [7,H] by a Merge operation, bringing
`back the batching functionality to the rest of operations.
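
The sketch below mirrors this Split/Attention/Merge structure on the
(x1,x2,x3,x4) example: the non-Attention operations see a single [7,H]
tensor, while Attention runs per request on its own token count. The
weights are random and, for brevity, the cached keys and values of
previously processed tokens (handled by the K/V manager described next)
are omitted.

import numpy as np

H = 8
rng = np.random.default_rng(0)
W_qkv, W_out = rng.normal(size=(H, 3 * H)), rng.normal(size=(H, H))

def attention(q, k, v):                       # per-request, single head, simplified
    w = np.exp(q @ k.T / np.sqrt(H))
    w /= w.sum(-1, keepdims=True)
    return w @ v

# Tokens to process in this iteration: 1 (x1), 1 (x2), 2 (x3), 3 (x4) -> 7 total.
lengths = [1, 1, 2, 3]
x = rng.normal(size=(sum(lengths), H))        # [7, H], no batch dimension

qkv = x @ W_qkv                               # batched QKV Linear on all 7 tokens
q, k, v = np.split(qkv, 3, axis=-1)

# Split: run Attention separately for each request ...
outs, start = [], 0
for L in lengths:
    sl = slice(start, start + L)
    outs.append(attention(q[sl], k[sl], v[sl]))
    start += L
# ... Merge: go back to a [7, H] tensor for the remaining batched operations.
merged = np.concatenate(outs, axis=0)
print((merged @ W_out).shape)                 # (7, 8) after Attn Out Linear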
So that requests in the increment phase can use the
Attention keys and values for the tokens processed in previous
iterations, ORCA maintains the generated keys and values in
the Attention K/V manager. The manager maintains these
keys and values separately for each request until the scheduler
explicitly asks to remove a certain request’s keys and values,
i.e., when the request has finished processing. The Attention
operation for a request in the increment phase (x1 and x2) takes
`keys and values of previous tokens (x11,x12,x13 for x1; x21 for
`x2) from the manager, along with the current token’s query,
`key, and value from the Split operation to compute attention
`between the current token and the previous ones.
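
One way to picture the manager's bookkeeping is the small sketch below;
the dictionary-based class and its method names are hypothetical
illustrations, not ORCA's actual interface.

import numpy as np

class AttentionKVManager:
    """Keeps Attention keys/values per request until told to remove them."""
    def __init__(self):
        self.cache = {}                      # request id -> {"k": [t, H], "v": [t, H]}

    def append(self, req_id, k_t, v_t):      # store the current token's key/value
        entry = self.cache.setdefault(req_id, {"k": np.zeros((0, len(k_t))),
                                               "v": np.zeros((0, len(v_t)))})
        entry["k"] = np.vstack([entry["k"], k_t])
        entry["v"] = np.vstack([entry["v"], v_t])
        return entry["k"], entry["v"]        # keys/values for tokens 1..t

    def remove(self, req_id):                # called when the request finishes
        self.cache.pop(req_id, None)

H = 8
rng = np.random.default_rng(0)
mgr = AttentionKVManager()
for t in range(3):                           # three increment-phase iterations of x1
    keys, values = mgr.append("x1", rng.normal(size=H), rng.normal(size=H))
print(keys.shape, values.shape)              # (3, 8) (3, 8): grows with each token
mgr.remove("x1")                             # scheduler asks to free a finished request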
`
`4 ORCA Design
`
`Based on the above techniques, we design and implement
`ORCA: a distributed serving system for Transformer-based
`generative models. We have already discussed the system
`components and the overall execution model of ORCA while
describing Figure 4. In this section, we address the remaining
questions of how to build an efficient system that can scale to
`large-scale models with hundreds of billions of parameters.
`We also describe the scheduling algorithm for iteration-level
`scheduling, i.e., how to select a batch of requests from the
`request pool at every iteration.
`
`4.1 Distributed Architecture
`Recent works [12, 31] have shown that scaling language mod-
`els can dramatically improve the quality of models. Hence,
`
`Figure 5: An illustration of ORCA execution engine running
`a Transformer layer on a batch of requests with selective
`batching. We only depict the QKV Linear, Attention, and
`Attention Out Linear operations for simplicity.
`