US011442775B1

Yu et al.

(10) Patent No.: US 11,442,775 B1
(45) Date of Patent: Sep. 13, 2022

(54) DYNAMIC BATCHING FOR INFERENCE SYSTEM FOR TRANSFORMER-BASED GENERATION TASKS

(71) Applicant: FriendliAI Inc., Seoul (KR)

(72) Inventors: Gyeongin Yu, Seoul (KR); Geon-Woo Kim, Seoul (KR); Joo Seong Jeong, Seoul (KR); Soojeong Kim, Seoul (KR); Byung-Gon Chun, Seoul (KR)

(73) Assignee: FriendliAI Inc., Seoul (KR)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 0 days.

(21) Appl. No.: 17/542,193

(22) Filed: Dec. 3, 2021
`
`( 2006.01 )
`( 2006.01 )
`( 2006.01 )
`( 2019.01 )
`( 2006.01 )
`( 2006.01 )
`( 2006.01 )
`
`( 51 ) Int . Ci .
`G06F 9/46
`G06F 9/48
`GOON 5/04
`GOON 20/00
`GOOF 9/50
`GOON 3/04
`GOON 3/08
`( 52 ) U.S. CI .
`G06F 9/4881 ( 2013.01 ) ; G06F 9/5016
`CPC
`( 2013.01 ) ; G06N 5/04 ( 2013.01 ) ; G06N 20/00
`( 2019.01 ) ; GOON 3/0454 ( 2013.01 ) ; GOON
`3/08 ( 2013.01 )
`
`( 58 ) Field of Classification Search
`CPC ..... GO6F 9/4881 ; GOOF 9/5016 ; GO6N 20/00 ;
`GOON 5/04 ; G06N 3/0454 ; GOON 3/08
`USPC
`718/1
`See application file for complete search history .
`
`( 56 )
`
`References Cited
`U.S. PATENT DOCUMENTS
`10,846,096 B1 * 11/2020 Chung
`2020/0226453 A1 *
`7/2020 Luk
`2020/0311341 A1 * 10/2020 Chaturvedi
`2021/0034335 A1
`2/2021 Svyalkovskly et al .
`2021/0192314 A1 *
`6/2021 Aarts
`GO6F 8/433
`2021/0263779 A1 *
`8/2021 Haghighat
`G06F 11/3409
`2021/0279576 A1 *
`9/2021 Shazeer
`GOON 3/08
`2021/0357210 Al
`11/2021 Clement et al .
`2021/0406673 A1 * 12/2021 Pardeshi
`( Continued )
`
`GOON 20/00
`GOON 3/08
`GOON 20/00
`
`GOON 3/08
`
`OTHER PUBLICATIONS
`" PREMA : A Predictive Multi - task Scheduling Algo
`Choi et al . ,
`rithm For Preemptible Neural Processing Units ” , 2020 , IEEE Inter
`national Symposium on High Performance Computer Architecture
`( HPCA ) , pp . 220-233 . ( Year : 2020 ) . *
`( Continued )
Primary Examiner — Kenneth Tang
(74) Attorney, Agent, or Firm — Fenwick & West LLP

(57) ABSTRACT

An inference system applies a machine-learning transformer model to a batch of requests with variable input length or variable target length or variable internal state length by selectively batching a subset of operations in the transformer model but processing requests in the batch individually for a subset of operations in the transformer model. In one embodiment, the operation to be processed individually is an attention operation of an encoder or a decoder of the transformer model. By selective batching, the inference system can allow batching operations to be performed for a batch of requests with variable input or target length or internal state length to utilize the parallel computation capabilities of hardware accelerators while preventing unnecessary computations that occur for workarounds that restrain the data of a batch of requests to a same length.

18 Claims, 12 Drawing Sheets
`
[Representative front-page drawing, corresponding to FIG. 3A: selective batching in transformer model 300, in which Layer Normalization 310, QKV Operation 315, Split 320, Attention Linear 330, Add 335, Layer Normalization 340, MLP 345, GeLU 350, MLP 355, Add 360, and LM Head 370 of decoders D1-DN are applied as batch operations, while Self-Attention 325 is applied per request (no batch) using each request's Qb, Kcache,b, and Vcache,b.]
`
`
`
`
`( 56 )
`
`References Cited
`U.S. PATENT DOCUMENTS
`2022/0066747 A1
`2022/0067513 Al
`
`3/2022 Drain et al .
`3/2022 Stevens et al .
`
`OTHER PUBLICATIONS
`Dai et al , “ Transformer - XL : Attentive Language Models Beyond a
`Fixed - Length Context ” , Jun . 2 , 2019 , Carnegie Mellon , University ,
`Google Brain , pp . 1-20 ( Year : 2019 ) . *
`Fang , J. et al . , “ Turbo Transformers : An Efficient GPU Serving
`System For Transformer Models , ” arXiv : 2010.05680v4 , Feb. 20 ,
`2021 , pp . 1-14 .
`Gao , P. et al . , “ Low Latency RNN Inference with Cellular Batch
`ing , ” EuroSys ’18 , Apr. 2018 , pp . 1-15 .
`Github , “ microsoft / DeepSpeed , ” Jan. 19 , 2021 , pp . 1-9 ,
`[ Online ]
`[ Retrieved on Jan. 31 , 2022 ] Retrieved from the Internet < URL :
`https://github.com/microsoft/DeepSpeed >
`Github , “ NVIDIA / Faster Transformer , ” Apr. 2 , 2021 , pp . 1-28 , [ Online ]
`[ Retrieved on Jan. 31 , 2022 ] Retrieved from the Internet < URL :
`https://github.com/NVIDIA/FasterTransformer » .
`Github , “ NVIDIA Megatron - LM , ” Aug. 11 , 2021 , pp . 1-18 , [ Online ]
`[ Retrieved on Jan. 31 , 2022 ] Retrieved from the Internet < URL :
`https://github.com/NVIDIA/Megatron-LM > .
`Li , G. et al . , “ Easy and Efficient Transformer : Scalable Inference
`Solution for Large NLP Model , ” arXiv : 2104.12470v4 , Nov. 23 ,
`2021 , pp . 1-9 .
`NVIDIA , “ NVIDIA TensorRT , ” Jan. 27 , 2021 , pp . 1-11 , [ Online ]
`[ Retrieved on Jan. 31 , 2022 ] Retrieved from the Wayback Machine
`
`> >
`
`< URL http://web.archive.org/web/20210127111124/https://developer .
`nvidia.com/tensorrt » .
`NVIDIA , “ NVIDIA Triton Inference Server , ” Jan. 25 , 2021 , pp . 1-6 ,
`[ Online ] [ Retrieved on Jan. 31 , 2022 ] Retrieved from the Wayback
`Machine < URL http://web.archive.org/web/20210125141031/https://
`developer.nvidia.com/nvidia-triton-inference-server > .
`Olston , C. et al . , “ TensorFlow - Serving : Flexible , High - Performance
`ML Serving , " arXiv : 1712.06139v2 , Dec. 27 , 2017 , pp . 1-8 .
`Shazeer , N. et al . , “ Mesh - TensorFlow : Deep Learning for Super
`computers , ” arXiv : 1811.02084v1 , Nov. 5 , 2018 , pp . 1-16 .
`Shoeybi , M. et al . , “ Megatron - LM : Training Multi - Billion Param
`eter Language Models Using Model Parallelism , ” arXiv : 1909 .
`08053v4 , Mar. 13 , 2020 , pp . 1-15 .
`Wang , X. et al . , “ LightSeq : A High Performance Inference Library
`for Transformers , ” arXiv : 2010.13887v4 , Apr. 22 , 2021 , pp . 1-8 .
`Doshi , Ketan , “ Transformers Explained Visually ( Part 1 ) : Overview
`of Functionality ” , Dec. 13 , 2020 , < towardsdatascience.com > ( Year :
`2020 ) , 16 pages .
`Doshi , Ketan , “ Transformers Explained Visually ( Part 2 ) : How it
`works , step - by - step ” , Jan. 2 , 2021 , < towardsdatascience.com > ( Year :
`2021 ) , 23 pages .
`Doshi , Ketan , “ Transformers Explained Visually ( Part 3 ) : Multi
`head Attention deep dive ” , Jan. 16 , 2021 , < towardsdatascience .
`com > ( Year : 2021 ) , 20 pages .
`Doshi , Ketan , “ Transformers Explained Visually ( Part 4 ) : Not Just
`How , but Why They Work So Well ” , Jun . 2 , 2021 , < towardsdatascience .
`com > ( Year : 2021 ) , 17 pages .
`Vaswani et al . , " Attention Is All You Need ” , Dec. 6 , 2017 , arXiv ,
`< https://arxiv.org/abs/1706.03762 > ( Year : 2017 ) , pp . 1-15 .
`* cited by examiner
`
`
`
[FIG. 1 (Sheet 1 of 12): system environment 100 in which client devices 110A and 110B communicate with the inference system 130 over a network 120.]
`
`
`
[FIG. 2A (Sheet 2 of 12): encoding phase of transformer model 200 for batched input token sequences X1, X2, X3. Each decoder D1-DN applies Layer Normalization 210, QKV Operation 215 (producing Q, K, V and the caches Kcache, Vcache), Split 220, Self-Attention 225, Attention Linear 230, Add 235, Layer Normalization 240, MLP 245, GeLU 250, MLP 255, and Add 260; the LM Head 270 produces the output tokens ŷ1.]
`
`
`
[FIG. 2B (Sheet 3 of 12): decoding phase of transformer model 200, in which the previously generated output tokens ŷ1 are fed back through decoders D1-DN, Self-Attention 225 uses the cached keys Kcache and values Vcache, and the LM Head 270 produces the next output tokens ŷ2.]
`
`
`
[FIG. 3A (Sheet 4 of 12): encoding phase with selective batching for transformer model 300. Layer Normalization 310, QKV Operation 315, Split 320, Attention Linear 330, Add 335, Layer Normalization 340, MLP 345, GeLU 350, MLP 355, Add 360, and LM Head 370 are performed as batch operations on requests X1, X2, X3 of different lengths, while Self-Attention 325 is performed per request (no batch) using each request's Qb, Kcache,b, and Vcache,b.]
`
`
`
[FIG. 3B (Sheet 5 of 12): decoding phase with selective batching for transformer model 300. The QKV Operation 315, Attention Output 330, MLP and GeLU layers, and LM Head 370 are applied as batch operations, while Self-Attention 325 is applied per request (no batch) using each request's cached keys and values.]
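FIG. 3B turns on each request attending over its own cached keys and values, whose lengths differ from request to request. The sketch below is a minimal, single-head illustration of one such per-request decoding step, assuming PyTorch tensors; the function and variable names (decode_step_with_cache, Wqkv, Wout) are illustrative and are not taken from the patent.

```python
import torch

def decode_step_with_cache(x_new, Wqkv, Wout, k_cache, v_cache):
    """One decoding iteration for a single request (illustrative sketch).

    x_new:   [1, H] representation of the token generated at the last iteration
    Wqkv:    [H, 3H] combined QKV weight tensor
    Wout:    [H, H] attention linear (output) weight tensor
    k_cache: [L, H] keys cached from this request's previous iterations
    v_cache: [L, H] values cached from this request's previous iterations
    """
    h = x_new.shape[-1]
    q, k, v = torch.matmul(x_new, Wqkv).split(h, dim=-1)  # QKV operation

    # Append the new key and value to this request's Kcache / Vcache.
    k_cache = torch.cat([k_cache, k], dim=0)
    v_cache = torch.cat([v_cache, v], dim=0)

    # Self-attention over the cached context. Cache lengths differ from
    # request to request, which is why this step is not batched.
    scores = torch.matmul(q, k_cache.T) / (h ** 0.5)
    attn = torch.matmul(torch.softmax(scores, dim=-1), v_cache)

    return torch.matmul(attn, Wout), k_cache, v_cache      # attention linear
```

Because each request's cache grows by one entry per iteration and started from a different input length, the attention shown here cannot share one fixed tensor shape across the batch.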
`
`
`
[FIG. 4 (Sheet 6 of 12): block diagram of the inference system 130, including a data management module 420, an execution engine module 425, a training module 430, a serving system 435, and a training corpus 460.]
`
`
`
[FIGS. 5A-5B (Sheet 7 of 12): serving system 435 with a request processor 580 and a scheduler 585 coupled to execution engines 590A and 590B, each maintaining a KV cache. The scheduler tracks incoming and completed requests and assigns requests (e.g., R1, R2) to the execution engines while others (R3, R4, R5) wait.]
`
`
`
[FIGS. 5C-5D (Sheet 8 of 12): continuation of the dynamic batching example, in which completed requests (e.g., R2, R4) are removed from the execution engines and returned as responses, and a new incoming request R7 is scheduled onto an execution engine between iterations.]
`
`
`
[FIG. 6A (Sheet 9 of 12): Receive a batch of requests including one or more input token sequences (602). Access a machine-learned transformer model including at least a set of decoders (604). Generate one or more queries, keys, and values for the requests by applying a QKV weight tensor to input representations by a batch operation (606). Split a first query, a first key, a first value for the first request and a second query, a second key, a second value for the second request (608).]
`
`
`
[FIG. 6B (Sheet 10 of 12): Generate first attention output for the first request by combining the first query, first key, first value (610). Separately generate second attention output for the second request by combining the second query, second key, second value (612). Concatenate at least the first attention output and the second attention output into a concatenated tensor (614). Generate output representations by applying a weight tensor to the concatenated tensor by a batch operation (616). Set the one or more output tokens as the one or more inputs to the set of decoders for the next iteration (618). Provide output tokens generated for at least one request to a client device as a response (620).]
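The flow of FIGS. 6A-6B can be summarized with a short sketch. This is a minimal, single-head illustration assuming PyTorch tensors, with multi-head reshaping, masking, and the KV cache omitted; the names selective_batching_layer, Wqkv, and Wproj are illustrative and are not taken from the patent.

```python
import torch

def selective_batching_layer(inputs, Wqkv, Wproj):
    """Selective batching for one attention block (illustrative sketch).

    inputs: list of per-request tensors, where request b has shape [L_b, H]
            and the lengths L_b may differ across requests.
    Wqkv:   [H, 3H] QKV weight tensor; Wproj: [H, H] output weight tensor.
    """
    h = inputs[0].shape[-1]
    lengths = [x.shape[0] for x in inputs]

    # Batch operation (606): apply the QKV weight tensor to all requests at
    # once by concatenating their tokens along one axis, with no padding.
    stacked = torch.cat(inputs, dim=0)              # [sum(L_b), H]
    qkv = torch.matmul(stacked, Wqkv)               # [sum(L_b), 3H]

    # Split (608): recover the queries, keys, and values of each request.
    per_request = qkv.split(lengths, dim=0)

    # Non-batched attention (610, 612): each request is processed on its own
    # because its sequence length differs from the others.
    attention_outputs = []
    for chunk in per_request:
        q, k, v = chunk.split(h, dim=-1)
        scores = torch.matmul(q, k.T) / (h ** 0.5)
        attention_outputs.append(torch.matmul(torch.softmax(scores, dim=-1), v))

    # Concatenate (614) and return to a batch operation (616) for the
    # output projection.
    concatenated = torch.cat(attention_outputs, dim=0)   # [sum(L_b), H]
    return torch.matmul(concatenated, Wproj).split(lengths, dim=0)
```

The batched operations see a single flattened token dimension of length sum(L_b), so no request is padded to another request's length, while the per-request loop is confined to the attention itself.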
`
`
`
[FIG. 7 (Sheet 11 of 12): Receive, by a serving system, one or more requests for execution (710). Schedule a batch of requests for execution on an execution engine (712). Generate a first set of output tokens by iteratively applying the transformer model to a first set of inputs for the batch (714). Receive, by the serving system, a new request including a sequence of input tokens (716). Schedule a second batch of requests including the new request responsive to determining that the execution engine has memory available (718). Generate a second set of output tokens by iteratively applying the transformer model to a second set of inputs for the second batch (720).]
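The scheduling loop of FIG. 7 can be outlined as follows. This is an illustrative sketch only: the execution engine and request objects, and their methods (step, has_memory_for, free, append_token, is_done, respond), are hypothetical names standing in for the components described above.

```python
from collections import deque

def serve(execution_engine, incoming_requests, max_batch_size):
    """Iteration-level dynamic batching loop (illustrative sketch)."""
    waiting = deque(incoming_requests)   # received but not yet scheduled
    running = []                         # batch currently on the engine

    while waiting or running:
        # Between iterations, admit new requests while the engine has
        # memory available and the batch is not full (step 718).
        while waiting and len(running) < max_batch_size:
            if running and not execution_engine.has_memory_for(waiting[0]):
                break
            running.append(waiting.popleft())

        # One iteration: generate one output token per running request
        # (steps 714 and 720).
        new_tokens = execution_engine.step(running)

        # Remove completed requests right away and return their responses,
        # instead of holding them until the whole batch finishes.
        still_running = []
        for request, token in zip(running, new_tokens):
            request.append_token(token)
            if request.is_done():
                execution_engine.free(request)
                request.respond()
            else:
                still_running.append(request)
        running = still_running
```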
`
`
`
[FIG. 8 (Sheet 12 of 12): computer system 800 including a processor 801, main memory 803, ROM 805, storage device 807, communication interface 809, hardware accelerators 810, display device 811, and input mechanisms 813.]
`
`
`
DYNAMIC BATCHING FOR INFERENCE SYSTEM FOR TRANSFORMER-BASED GENERATION TASKS

BACKGROUND

This invention relates generally to machine-learning transformer neural network models, and more particularly to selective batching for transformer models.

Transformer neural network models are machine-learning models used for a variety of applications, for example, natural language processing (NLP), image processing, or audio processing applications that include sequential data. For example, a transformer model may receive a sequence of input tokens that represent a query and generate a sequence of output tokens that represent a response to the query. As another example, the transformer model may receive a sequence of input tokens that represent a paragraph in German and generate a sequence of output tokens that represents a translation of the paragraph in English. As yet another example, the transformer model may receive a sequence of input tokens that represent a paragraph of text and generate a sequence of output tokens that represent a summarized version of the text.

Typically, users of client devices submit requests to an inference system. The inference system executes a machine-learning transformer model to inputs (e.g., a sequence of input tokens) of requests to generate outputs (e.g., a sequence of output tokens) for the requests. The inference system may return the outputs to client devices of the requests as a response. In one instance, the inference system executes the requests on specialized hardware accelerators such as graphics processing units (GPU's) or tensor processing units (TPU's) to improve latency and throughput, especially when the number of parameters of the transformer model is significantly large.

In one instance, the inference system processes requests in batches to achieve high processor utilization on the accelerators. Specifically, the inference system may process multiple requests in a batch together to exploit the amount of parallel computation units in the hardware accelerators. In many situations, the inputs for requests in a batch are variable in length. For example, the number of input tokens for each request in a batch may be variable in length. However, methods of batching for transformer models often require that the length of data for multiple requests in a batch be the same to be processed. Thus, it may not be feasible to process a batch of requests with variable lengths, or workarounds addressing this problem may result in using more resources compared to processing each request individually.

SUMMARY

An inference system applies a machine-learning transformer model to a batch of requests with variable input length or variable target length or variable internal state length by selectively batching a subset of operations in the transformer model but processing requests in the batch individually for a subset of operations in the transformer model. In one embodiment, the operation to be processed individually is an attention operation of an encoder or a decoder of the transformer model. By selective batching, the inference system can allow batching operations to be performed for a batch of requests with variable input or target or internal state length to utilize the parallel computation capabilities of hardware accelerators while preventing unnecessary computations that occur for workarounds that restrain the data of a batch of requests to a same length.

Specifically, in one embodiment, the inference system receives a batch of requests including one or more input token sequences. A length of a first input token sequence for a first request in the batch may be different from a length of a second input token sequence for a second request. The inference system accesses a transformer model including at least a set of decoders coupled to one another. For one or more iterations, the inference system repeatedly performs the steps of generating one or more output tokens for the requests by applying the set of decoders to one or more inputs for the requests.

For at least one decoder in the set, the inference system generates one or more queries, one or more keys, and one or more values for the requests by applying a QKV weight tensor to one or more input representations. In one instance, the queries, keys, and values are generated by a batch operation. The inference system splits at least a first query for the first request from the one or more queries, a first key for the first request from the one or more keys, and a first value for the first request from the one or more values. The inference system also splits at least a second query for the second request from the one or more queries, a second key for the second request from the one or more keys, and a second value for the second request from the one or more values.

The inference system generates a first attention output for the first request by at least combining the first query, the first key, and the first value for the first request. The inference system also separately generates a second attention output for the second request by at least combining the second query, the second key, and the second value for the second request. The inference system concatenates at least the first attention output and the second attention output into a concatenated tensor and generates one or more output representations by applying a weight tensor to the concatenated tensor. In one instance, the one or more output representations are generated by a batch operation. The inference system sets the one or more output tokens as the one or more inputs to the set of decoders for the next iteration and provides output tokens generated for at least one request to a client device as a response to the at least one request.

In one embodiment, the inference system performs iteration-level dynamic batching for a transformer model that allows the inference system to dynamically modify a batch of requests being executed on an execution engine. Specifically, in existing batching methods for transformer models, it is difficult to modify a batch of requests once it has started to process on an execution engine. This is because certain methods of batching require the length of the inputs or the length of the internal states to be the same across all requests in the batch. Therefore, unless new incoming requests have the same length of inputs as the batch of requests being executed on the execution engine, it may be difficult for the inference system to modify the batch to, for example, add or remove new requests to the batch.

By performing selective batching, the inference system can monitor and modify a batch being processed on the execution engine on an iteration level and update the batch between iterations as requests get completed and new requests are received. Specifically, at one or more iterations, the inference system can modify the batch being executed on the execution engine by adding new incoming requests to the batch or removing completed requests from the batch. This is because selective batching allows requests with variable lengths to be processed without restraining the one or more inputs or internal states to the transformer model to same lengths. This allows the inference system to remove requests in the batch that are completed earlier than others so that the response can be provided to the user faster and allows the inference system to add new requests to a batch of requests if the execution engine is being under-utilized.
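As a rough, purely illustrative calculation of the padding workaround discussed above (the lengths below are example values, not taken from the patent): padding every request in a batch to the longest request makes the batched operations compute over far more token positions than the requests actually contain.

```python
# Illustrative only: cost of restraining a variable-length batch to one length.
lengths = [4, 9, 2, 7]                           # example token counts per request

padded_positions = len(lengths) * max(lengths)   # 4 * 9 = 36 positions computed
useful_positions = sum(lengths)                  # 22 positions actually needed

print(padded_positions, useful_positions)        # 36 22 -> roughly 39% is padding
```

Selective batching instead flattens the batch to the 22 useful positions for the batched operations and handles the attention per request, so no position is computed solely to equalize lengths.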
`
`
`
In one embodiment, a serving system of the inference system receives one or more requests for execution. The serving system may include a request processor and a scheduler each coupled to one or more execution engines for executing a machine-learning transformer model including at least a set of decoders. The scheduler schedules a batch of requests including the one or more requests for execution on an execution engine. The execution engine generates a first set of output tokens by iteratively applying the transformer model to a first set of inputs for the batch of requests. In one instance, applying the transformer model includes applying at least one batch operation to one or more input tensors associated with the batch of requests.

The serving system may receive a new request from a client device that includes a sequence of input tokens. The scheduler schedules a second batch of requests including the one or more requests and the new request for execution on the execution engine responsive to determining that the execution engine has memory available to execute the second batch of requests. The execution engine generates a second set of output tokens by iteratively applying the transformer model to a second set of inputs for the second batch of requests. The second set of inputs may include the sequence of input tokens for the new request.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a high-level block diagram of a system environment for an inference system, in accordance with an embodiment.
FIGS. 2A-2B illustrate a method of batching using a machine-learning transformer model, in accordance with an embodiment.
FIGS. 3A-3B illustrate a method of selective batching using a machine-learning transformer model, in accordance with an embodiment.
FIG. 4 is a block diagram of an inference system, in accordance with an embodiment.
FIGS. 5A-5D illustrate a method of dynamic batching for processing requests using a machine-learning transformer model, in accordance with an embodiment.
FIGS. 6A-6B is a flowchart illustrating a method of selective batching using the transformer model, in accordance with an embodiment.
FIG. 7 is a flowchart illustrating a method of dynamic batching for processing requests using the transformer model, in accordance with an embodiment.
FIG. 8 is a diagram illustrating a computer system upon which embodiments described herein may be implemented within the inference system, in accordance with an embodiment.

The figures depict various embodiments of the present invention for purposes of illustration only. One skilled in the art will readily recognize from the following discussion that alternative embodiments of the structures and methods illustrated herein may be employed without departing from the principles of the invention described herein.

DETAILED DESCRIPTION

Overview

FIG. 1 is a high-level block diagram of a system environment 100 for an inference system 130, in accordance with an embodiment. The system environment 100 shown by FIG. 1 includes one or more client devices 110A, 110B, a network 120, and an inference system 130. In alternative configurations, different or additional components may be included in the system environment 100.

The inference system 130 receives requests from client devices 110A, 110B to perform tasks using machine-learning models. In one embodiment, the machine-learning models are transformer neural network models. The tasks may include, but are not limited to, natural language processing (NLP), image processing, and audio processing applications. Specifically, the transformer model may be appropriate for processing sequential data that can be tokenized into a sequence of input tokens for the request and a sequence of output tokens for the desired response. The inference system 130 receives a request including input data (e.g., text data, image or video data, audio data) and encodes the input data to a set of input tokens. The inference system 130 repeatedly applies the machine-learning transformer model for one or more iterations to generate a set of output tokens. The inference system 130 decodes the set of output tokens to output data and returns the output data as the response to the request. While for applications such as NLP applications, a sequence of input tokens or output tokens is arranged along one dimension (1-D) to represent, for example, a sequence of words, it is appreciated that in other embodiments, a sequence of input tokens or output tokens may be a multi-dimensional sequence. For example, for two-dimensional image data, the sequence of tokens may be a two-dimensional (2-D) sequence arranged along both a first direction (e.g., X-axis) and a second direction (e.g., Y-axis), where each token corresponds to a block of one or more pixels in the image.

In particular, NLP tasks involve using artificial intelligence and machine learning techniques to analyze language and may include a variety of tasks including translation, sentiment analysis, text summarization, auto-correction, and the like. When processing NLP tasks, the inference system 130 receives a request including input text of a sequence of words (e.g., a query) and encodes the input text to a sequence of input tokens that each represent a respective word in a latent space. The inference system 130 repeatedly applies a transformer model for one or more iterations to generate a sequence of output tokens (e.g., a response to the query). The output tokens are converted to output text as a response to the request.

For example, a transformer model may receive a sequence of input tokens that represent a query and generate a sequence of output tokens that represent a response to the query. As another example, the transformer model may receive a sequence of input tokens that represent a paragraph in French and generate a sequence of output tokens that represents a translation of the paragraph or sentence in English. As yet another example, the transformer model may receive a sequence of input tokens that represent a paragraph of text and generate a sequence of output tokens that represents a summarized version of the text.

In one embodiment, the inference system 130 includes one or more execution engines that are built on specialized hardware accelerators such as graphics processing units (GPU's) or tensor processing units (TPU's). The requests are executed on the execution engines. Specifically, execution of machine-learning neural network models, such as transformer models, involves a significant number of operations, such as tensor multiplication between input data and high-dimensional weight tensors, that can be computationally intensive. The hardware accelerators of the execution engines may be optimized to perform these operations efficiently by parallel processing, leading to significant improvement in latency or throughput when the number of parameters in the transformer model is large.
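The kind of operation an execution engine offloads can be made concrete with a small sketch, assuming PyTorch and arbitrary example sizes; it simply multiplies a batch of input representations by one weight tensor on whatever accelerator is available.

```python
import torch

B, L, H = 8, 128, 1024                 # example batch, sequence, hidden sizes
inputs = torch.randn(B, L, H)          # input representations for a batch
weight = torch.randn(H, 4 * H)         # high-dimensional weight tensor

device = "cuda" if torch.cuda.is_available() else "cpu"

# The same multiplication is applied to every token of every request in
# parallel, which is what the hardware accelerator is optimized for.
outputs = torch.matmul(inputs.to(device), weight.to(device))   # [B, L, 4H]
```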
`
`
`
`5
`6
`of conditions that are specified by , for example , the inference
`high - dimensional weight tensors that can be computation-
`system 130 or users of the client devices 110 .
`ally intensive . The hardware accelerators of the execution
`FIG . 2A illustrates an encoding phase for the transformer
`engines may be optimized to perform these operations
`model 200 , in which the set of input token sequences are
`efficiently by parallel processing , leading to significant
`improvement in latency or throughput when the number of 5 processed to generate one or more output tokens . In the
`example shown in FIG . 2A , the inference system 130
`parameters in the transformer model are large .
`processing requests for a chatbot receives a first request as
`The hardware of the inference system 130 may include
`the question " what is your name ?, " a second request as the
`one or more central processing unit ( CPU ) cores , CPU
`question " what is the time ?, ” and a third request as the
`memory ( e.g. , DRAM ) , data storage , one or more execution
`engines ( e.g. , GPU devices ) . Each execution engine may 10 question “ how do I pay ? ” The inference system 130 encodes
`each of the requests as a respective set of input token
`include a set of cores ( e.g. , GPU cores ) coupled to local
`sequences . The first request is encoded to an input token
`memory ( e.g. , GPU memory ) , and may be composed of one
`sequence X1 , the second request is encoded to an input token
`or more hardware accelerators . In addition , the inference
`sequence X2 , and the third request is encoded to an input
`system 130 may be composed of multiple hardware com 15 token sequence X3 , each request being illustrated with a
`ponents and components for configuring a network to con
`different fill pattern in the figures . Each input token sequence
`nect the various components across the multiple hardware
`in FIG . 2A is a one - dimensional sequence in which a
`components together such that the components can coordi
`sequence of tokens are arranged along a single dimension
`nate with each other to process requests . For example , one
`( e.g. , X - direction ) . However , as described above with
`execution engine may communicate with multiple hardware 20 respect to FIG . 1 , it is appreciated that in other embodi
`accelerators on multiple machines . An execution engine may
`ments , a sequence of tokens may be arranged as a multi
`process data that is stored on its local memory . Specifically ,
`dimensional sequence .
`during training or inference of the transformer model , data
`As shown in FIG . 2A , since each request includes four
`required for inference or training is read from an input file
`words , each input token sequence includes four tokens each
`in the data storage by the CPU or across the network 120 25 token representing a respective word . For example , input
`from , for example , a client device 110 , moved to local
`token sequence X , for the first request is represented by four
`memory of an execution engine , and processed by the
`squares that represent words “ what , ” “ is , ” “ your , ” “ name . ”
`execution engine . The results of the processing are retrieved
`Specifically , while each word is mapped to a single square ,
`in practice , the inference system 130 represents a token for
`by the CPU .
`In one embodiment , the inference system 130 processes 30 a word as an embedding that represents the word in a
`requests by batches to achieve higher processor utilization
`multi - dimensional latent space . Thus , while each input token
`on the hardware accelerators . Specifically , the inference
`sequence is visually illustrated as a two - dimensional 1x4
`system 130 processes multiple requests in a batch together
`tensor in FIG . 2A , in practice , each input token sequence
`to exploit the amount of parallel computation units in the
`may be represented as a three - dimensional tensor 1x4xH
`execution engines . In such an embodiment , the inference 35 where H is the dimension of an embedding ( e.g. , direction
`system 130 receives multiple requests each associated with
`going in or out of the page ) . Moreover , while each token
`an input token sequence . The inference system 130 itera-
`( input token or output token ) is mapped to one word for the
`tively applies the transformer model to the batch of requests
`remainder of the specification , this is merely an example ,
`to generate the output tokens for the requests together . In one
`and it is appreciated that in other embodiments , each token
`instance , batching for a transformer model is made possible 40 may be mapped to different text units , combination of text
`by grouping requests that have the same length of input
`units , and the like . For example , in other embodiments , each
`token sequences together or at each iteration , treating
`token may be mapped to a text unit of multiple words ,
`requests in the batch as if they all had the same input token
`paragraphs , sentences , n - grams or may be mapped to a
`sequence lengths as the request with the shortest length .
`punctuation mark ( e.g. , “ ?, " « ! , " « . " ) in addition to text units .
`Transformer Model with Batching
`In one embodiment , the transformer model 200 includes
`FIGS . 2A - 2B illustrate a method of batching using a
`a set of N decoders D1 , D2 , ... , DN . A decoder is coupled
`machine - learning transformer model 200 , in accordance
`to receive a set of input representations and generate a set of
`with an embodiment . In particular , the transformer model
`output representations . For example , the first decoder D1 is
`200 is associated with a set of parameters