(12) United States Patent
Yu et al.

(10) Patent No.: US 11,442,775 B1
(45) Date of Patent: Sep. 13, 2022
(54) DYNAMIC BATCHING FOR INFERENCE SYSTEM FOR TRANSFORMER-BASED GENERATION TASKS

(71) Applicant: FriendliAI Inc., Seoul (KR)

(72) Inventors: Gyeongin Yu, Seoul (KR); Geon-Woo Kim, Seoul (KR); Joo Seong Jeong, Seoul (KR); Soojeong Kim, Seoul (KR); Byung-Gon Chun, Seoul (KR)

(73) Assignee: FriendliAI Inc., Seoul (KR)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 0 days.

(21) Appl. No.: 17/542,193

(22) Filed: Dec. 3, 2021
(51) Int. Cl.
     G06F 9/46    (2006.01)
     G06F 9/48    (2006.01)
     G06N 5/04    (2006.01)
     G06N 20/00   (2019.01)
     G06F 9/50    (2006.01)
     G06N 3/04    (2006.01)
     G06N 3/08    (2006.01)

(52) U.S. Cl.
     CPC .......... G06F 9/4881 (2013.01); G06F 9/5016 (2013.01); G06N 5/04 (2013.01); G06N 20/00 (2019.01); G06N 3/0454 (2013.01); G06N 3/08 (2013.01)
(58) Field of Classification Search
     CPC .......... G06F 9/4881; G06F 9/5016; G06N 20/00; G06N 5/04; G06N 3/0454; G06N 3/08
     USPC ......... 718/1
     See application file for complete search history.
(56)                 References Cited

              U.S. PATENT DOCUMENTS

   10,846,096 B1 *  11/2020  Chung ............ G06N 20/00
 2020/0226453 A1 *   7/2020  Luk .............. G06N 3/08
 2020/0311341 A1 *  10/2020  Chaturvedi ...... G06N 20/00
 2021/0034335 A1     2/2021  Svyatkovskiy et al.
 2021/0192314 A1 *   6/2021  Aarts ............ G06F 8/433
 2021/0263779 A1 *   8/2021  Haghighat ....... G06F 11/3409
 2021/0279576 A1 *   9/2021  Shazeer ......... G06N 3/08
 2021/0357210 A1    11/2021  Clement et al.
 2021/0406673 A1 *  12/2021  Pardeshi ........ G06N 3/08
                      (Continued)

                OTHER PUBLICATIONS

Choi et al., "PREMA: A Predictive Multi-task Scheduling Algorithm For Preemptible Neural Processing Units," 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 220-233. (Year: 2020).*
                      (Continued)
Primary Examiner — Kenneth Tang
(74) Attorney, Agent, or Firm — Fenwick & West LLP

(57)                    ABSTRACT
An inference system applies a machine-learning transformer model to a batch of requests with variable input length or variable target length or variable internal state length by selectively batching a subset of operations in the transformer model but processing requests in the batch individually for a subset of operations in the transformer model. In one embodiment, the operation to be processed individually is an attention operation of an encoder or a decoder of the transformer model. By selective batching, the inference system can allow batching operations to be performed for a batch of requests with variable input or target length or internal state length to utilize the parallel computation capabilities of hardware accelerators while preventing unnecessary computations that occur for workarounds that restrain the data of a batch of requests to a same length.

18 Claims, 12 Drawing Sheets
[Front-page representative drawing (corresponding to FIGS. 3A-3B): decoder architecture 300 with Layer Normalization 310, QKV Operation 315, Split 320, Self-Attention 325, Attention Linear 330, Add 335, Layer Normalization 340, MLP 345, GeLU 350, MLP 355, Add 360, and LM Head 370; the QKV and linear operations are marked "Batch" while the per-request self-attention operation is marked "No Batch".]
(56)                 References Cited

              U.S. PATENT DOCUMENTS

 2022/0066747 A1     3/2022  Drain et al.
 2022/0067513 A1     3/2022  Stevens et al.

                OTHER PUBLICATIONS

Dai et al., "Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context," Jun. 2, 2019, Carnegie Mellon University, Google Brain, pp. 1-20. (Year: 2019).*
Fang, J. et al., "TurboTransformers: An Efficient GPU Serving System For Transformer Models," arXiv:2010.05680v4, Feb. 20, 2021, pp. 1-14.
Gao, P. et al., "Low Latency RNN Inference with Cellular Batching," EuroSys '18, Apr. 2018, pp. 1-15.
Github, "microsoft/DeepSpeed," Jan. 19, 2021, pp. 1-9, [Online] [Retrieved on Jan. 31, 2022] Retrieved from the Internet <URL: https://github.com/microsoft/DeepSpeed>.
Github, "NVIDIA/FasterTransformer," Apr. 2, 2021, pp. 1-28, [Online] [Retrieved on Jan. 31, 2022] Retrieved from the Internet <URL: https://github.com/NVIDIA/FasterTransformer>.
Github, "NVIDIA/Megatron-LM," Aug. 11, 2021, pp. 1-18, [Online] [Retrieved on Jan. 31, 2022] Retrieved from the Internet <URL: https://github.com/NVIDIA/Megatron-LM>.
Li, G. et al., "Easy and Efficient Transformer: Scalable Inference Solution for Large NLP Model," arXiv:2104.12470v4, Nov. 23, 2021, pp. 1-9.
NVIDIA, "NVIDIA TensorRT," Jan. 27, 2021, pp. 1-11, [Online] [Retrieved on Jan. 31, 2022] Retrieved from the Wayback Machine <URL: http://web.archive.org/web/20210127111124/https://developer.nvidia.com/tensorrt>.
NVIDIA, "NVIDIA Triton Inference Server," Jan. 25, 2021, pp. 1-6, [Online] [Retrieved on Jan. 31, 2022] Retrieved from the Wayback Machine <URL: http://web.archive.org/web/20210125141031/https://developer.nvidia.com/nvidia-triton-inference-server>.
Olston, C. et al., "TensorFlow-Serving: Flexible, High-Performance ML Serving," arXiv:1712.06139v2, Dec. 27, 2017, pp. 1-8.
Shazeer, N. et al., "Mesh-TensorFlow: Deep Learning for Supercomputers," arXiv:1811.02084v1, Nov. 5, 2018, pp. 1-16.
Shoeybi, M. et al., "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism," arXiv:1909.08053v4, Mar. 13, 2020, pp. 1-15.
Wang, X. et al., "LightSeq: A High Performance Inference Library for Transformers," arXiv:2010.13887v4, Apr. 22, 2021, pp. 1-8.
Doshi, Ketan, "Transformers Explained Visually (Part 1): Overview of Functionality," Dec. 13, 2020, <towardsdatascience.com> (Year: 2020), 16 pages.
Doshi, Ketan, "Transformers Explained Visually (Part 2): How it works, step-by-step," Jan. 2, 2021, <towardsdatascience.com> (Year: 2021), 23 pages.
Doshi, Ketan, "Transformers Explained Visually (Part 3): Multi-head Attention, deep dive," Jan. 16, 2021, <towardsdatascience.com> (Year: 2021), 20 pages.
Doshi, Ketan, "Transformers Explained Visually (Part 4): Not Just How, but Why They Work So Well," Jun. 2, 2021, <towardsdatascience.com> (Year: 2021), 17 pages.
Vaswani et al., "Attention Is All You Need," Dec. 6, 2017, arXiv, <https://arxiv.org/abs/1706.03762> (Year: 2017), pp. 1-15.

* cited by examiner
[Drawing sheets 1-12 (figures not reproduced; recoverable labels only):
FIG. 1 (Sheet 1): system environment 100 with client devices 110A, 110B, network 120, and inference system 130.
FIG. 2A (Sheet 2): transformer model 200, encoding phase, with Layer Normalization 210, QKV Operation 215, Split 220, Self-Attention 225, Attention Linear 230, Add 235, Layer Normalization 240, MLP 245, GELU 250, MLP 255, Add 260, decoders D1-DN, and LM Head 270.
FIG. 2B (Sheet 3): transformer model 200, decoding phase, with the same components and key/value caches Kcache, Vcache.
FIG. 3A (Sheet 4): selective batching model 300, encoding phase, with Layer Normalization 310, QKV Operation 315, Split 320, Self-Attention 325, Attention Linear 330, Add 335, Layer Normalization 340, MLP 345, GeLU 350, MLP 355, Add 360, and LM Head 370; the QKV and linear operations are marked "Batch" and the self-attention operation is marked "No Batch".
FIG. 3B (Sheet 5): selective batching model 300, decoding phase, with per-request caches Kcache, Vcache.
FIG. 4 (Sheet 6): inference system 130 with serving system 435, training module 430, execution engine module 425, data management module 420, and training corpus 460.
FIGS. 5A-5D (Sheets 7-8): serving system 435 with request processor 580, scheduler 585, and execution engines 590A, 590B (each with a KV cache) processing incoming and completing requests R1-R7.
FIGS. 6A-6B (Sheets 9-10): flowchart of selective batching, steps 602-620: receive a batch of requests including one or more input token sequences; access a machine-learned transformer model including at least a set of decoders; generate queries, keys, and values by applying a QKV weight tensor in a batch operation; split the per-request queries, keys, and values; generate the first and second attention outputs separately; concatenate the attention outputs into a concatenated tensor; generate output representations by applying a weight tensor in a batch operation; set the output tokens as inputs to the decoders for the next iteration; provide output tokens to a client device as a response.
FIG. 7 (Sheet 11): flowchart of dynamic batching, steps 710-720: receive requests; schedule a batch on an execution engine; generate a first set of output tokens; receive a new request; schedule a second batch including the new request responsive to determining that the execution engine has memory available; generate a second set of output tokens.
FIG. 8 (Sheet 12): computer system 800 with processor 801, main memory 803, ROM 805, storage device 807, communication interface 809, hardware accelerators 810, display device 811, and input mechanisms 813.]
DYNAMIC BATCHING FOR INFERENCE SYSTEM FOR TRANSFORMER-BASED GENERATION TASKS

BACKGROUND

This invention relates generally to machine-learning transformer neural network models, and more particularly to selective batching for transformer models.

Transformer neural network models are machine-learning models used for a variety of applications, for example, natural language processing (NLP), image processing, or audio processing applications that include sequential data. For example, a transformer model may receive a sequence of input tokens that represent a query and generate a sequence of output tokens that represent a response to the query. As another example, the transformer model may receive a sequence of input tokens that represent a paragraph in German and generate a sequence of output tokens that represents a translation of the paragraph in English. As yet another example, the transformer model may receive a sequence of input tokens that represent a paragraph of text and generate a sequence of output tokens that represent a summarized version of the text.

Typically, users of client devices submit requests to an inference system. The inference system executes a machine-learning transformer model to inputs (e.g., a sequence of input tokens) of requests to generate outputs (e.g., a sequence of output tokens) for the requests. The inference system may return the outputs to client devices of the requests as a response. In one instance, the inference system executes the requests on specialized hardware accelerators such as graphics processing units (GPU's) or tensor processing units (TPU's) to improve latency and throughput, especially when the number of parameters of the transformer model is significantly large.

In one instance, the inference system processes requests in batches to achieve high processor utilization on the accelerators. Specifically, the inference system may process multiple requests in a batch together to exploit the amount of parallel computation units in the hardware accelerators. In many situations, the inputs for requests in a batch are variable in length. For example, the number of input tokens for each request in a batch may be variable in length. However, methods of batching for transformer models often require that the length of data for multiple requests in a batch be the same to be processed. Thus, it may not be feasible to process a batch of requests with variable lengths, or workarounds addressing this problem may result in using more resources compared to processing each request individually.

SUMMARY

An inference system applies a machine-learning transformer model to a batch of requests with variable input length or variable target length or variable internal state length by selectively batching a subset of operations in the transformer model but processing requests in the batch individually for a subset of operations in the transformer model. In one embodiment, the operation to be processed individually is an attention operation of an encoder or a decoder of the transformer model. By selective batching, the inference system can allow batching operations to be performed for a batch of requests with variable input or target or internal state length to utilize the parallel computation capabilities of hardware accelerators while preventing unnecessary computations that occur for workarounds that restrain the data of a batch of requests to a same length.

Specifically, in one embodiment, the inference system receives a batch of requests including one or more input token sequences. A length of a first input token sequence for a first request in the batch may be different from a length of a second input token sequence for a second request. The inference system accesses a transformer model including at least a set of decoders coupled to one another. For one or more iterations, the inference system repeatedly performs the steps of generating one or more output tokens for the requests by applying the set of decoders to one or more inputs for the requests.

For at least one decoder in the set, the inference system generates one or more queries, one or more keys, and one or more values for the requests by applying a QKV weight tensor to one or more input representations. In one instance, the queries, keys, and values are generated by a batch operation. The inference system splits at least a first query for the first request from the one or more queries, a first key for the first request from the one or more keys, and a first value for the first request from the one or more values. The inference system also splits at least a second query for the second request from the one or more queries, a second key for the second request from the one or more keys, and a second value for the second request from the one or more values.

The inference system generates a first attention output for the first request by at least combining the first query, the first key, and the first value for the first request. The inference system also separately generates a second attention output for the second request by at least combining the second query, the second key, and the second value for the second request. The inference system concatenates at least the first attention output and the second attention output into a concatenated tensor and generates one or more output representations by applying a weight tensor to the concatenated tensor. In one instance, the one or more output representations are generated by a batch operation. The inference system sets the one or more output tokens as the one or more inputs to the set of decoders for the next iteration and provides output tokens generated for at least one request to a client device as a response to the at least one request.
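For illustration only, the following is a minimal NumPy sketch of the selective batching flow described in the preceding paragraphs. The shapes, variable names, and use of a single attention head are assumptions made for brevity rather than the claimed implementation: the QKV projection and the final linear projection run as batch operations over the concatenated tokens of all requests, while the attention combination is computed request by request.

```python
import numpy as np

H = 8                                       # embedding dimension (illustrative)
rng = np.random.default_rng(0)

# Two requests with different input lengths; no padding is required.
x1 = rng.standard_normal((3, H))            # first request: 3 tokens
x2 = rng.standard_normal((5, H))            # second request: 5 tokens

W_qkv = rng.standard_normal((H, 3 * H))     # QKV weight tensor
W_out = rng.standard_normal((H, H))         # attention linear weight tensor

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Per-request scaled dot-product attention (not batched across requests).
    return softmax(q @ k.T / np.sqrt(H)) @ v

# Batch operation: one QKV projection over the concatenated tokens of all requests.
x_cat = np.concatenate([x1, x2], axis=0)    # (3 + 5, H)
qkv = x_cat @ W_qkv                         # (8, 3H)

# Split the batched result into per-request queries, keys, and values.
lengths = [len(x1), len(x2)]
parts = np.split(qkv, np.cumsum(lengths)[:-1], axis=0)

# Attention is processed individually for each request.
attn_outputs = []
for part in parts:
    q, k, v = np.split(part, 3, axis=1)
    attn_outputs.append(attention(q, k, v))

# Concatenate the attention outputs and apply the output weight
# tensor as a single batch operation again.
z = np.concatenate(attn_outputs, axis=0) @ W_out
print(z.shape)                              # (8, 8): one output representation per token
```

Because the attention step never mixes tokens from different requests, the first and second requests can have different lengths without being padded to a common length.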
In one embodiment, the inference system performs iteration-level dynamic batching for a transformer model that allows the inference system to dynamically modify a batch of requests being executed on an execution engine. Specifically, in existing batching methods for transformer models, it is difficult to modify a batch of requests once it has started to process on an execution engine. This is because certain methods of batching require the length of the inputs or the length of the internal states to be the same across all requests in the batch. Therefore, unless new incoming requests have the same length of inputs as the batch of requests being executed on the execution engine, it may be difficult for the inference system to modify the batch to, for example, add or remove new requests to the batch.

By performing selective batching, the inference system can monitor and modify a batch being processed on the execution engine on an iteration level and update the batch between iterations as requests get completed and new requests are received. Specifically, at one or more iterations, the inference system can modify the batch being executed on the execution engine by adding new incoming requests to the batch or removing completed requests from the batch. This is because selective batching allows requests with variable lengths to be processed without restraining the one or more inputs or internal states to the transformer model to same lengths. This allows the inference system to remove requests in the batch that are completed earlier than others so that the response can be provided to the user faster and allows the inference system to add new requests to a batch of requests if the execution engine is being under-utilized.
In one embodiment, a serving system of the inference system receives one or more requests for execution. The serving system may include a request processor and a scheduler each coupled to one or more execution engines for executing a machine-learning transformer model including at least a set of decoders. The scheduler schedules a batch of requests including the one or more requests for execution on an execution engine. The execution engine generates a first set of output tokens by iteratively applying the transformer model to a first set of inputs for the batch of requests. In one instance, applying the transformer model includes applying at least one batch operation to one or more input tensors associated with the batch of requests.

The serving system may receive a new request from a client device that includes a sequence of input tokens. The scheduler schedules a second batch of requests including the one or more requests and the new request for execution on the execution engine responsive to determining that the execution engine has memory available to execute the second batch of requests. The execution engine generates a second set of output tokens by iteratively applying the transformer model to a second set of inputs for the second batch of requests. The second set of inputs may include the sequence of input tokens for the new request.
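As a rough sketch of the serving behavior summarized above, the loop below re-examines the batch between decoder iterations, removing completed requests and admitting new ones while capacity remains. The class and method names and the token-budget stand-in for the execution engine's memory check are hypothetical choices for this example, not the serving system's actual interfaces.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt_tokens: list                     # encoded input token sequence
    max_new_tokens: int
    output_tokens: list = field(default_factory=list)

    def done(self) -> bool:
        return len(self.output_tokens) >= self.max_new_tokens

class Scheduler:
    """Iteration-level scheduling: the batch is updated between iterations."""

    def __init__(self, token_budget: int):
        self.token_budget = token_budget    # stand-in for engine memory
        self.incoming = deque()
        self.batch = []

    def submit(self, request: Request):
        self.incoming.append(request)

    def _used(self) -> int:
        return sum(len(r.prompt_tokens) + len(r.output_tokens) for r in self.batch)

    def _has_memory_for(self, r: Request) -> bool:
        return self._used() + len(r.prompt_tokens) + r.max_new_tokens <= self.token_budget

    def step(self, run_iteration):
        # Remove completed requests so their responses can be returned at once.
        finished = [r for r in self.batch if r.done()]
        self.batch = [r for r in self.batch if not r.done()]
        # Admit new incoming requests while the engine has memory available.
        while self.incoming and self._has_memory_for(self.incoming[0]):
            self.batch.append(self.incoming.popleft())
        if self.batch:
            run_iteration(self.batch)       # one decoder iteration over the batch
        return finished

# Toy "execution engine": appends one placeholder output token per request.
def run_iteration(batch):
    for r in batch:
        r.output_tokens.append(0)

sched = Scheduler(token_budget=64)
sched.submit(Request(prompt_tokens=[1, 2, 3], max_new_tokens=2))
sched.submit(Request(prompt_tokens=[4, 5], max_new_tokens=4))
for _ in range(5):
    for r in sched.step(run_iteration):
        print("request completed after", len(r.output_tokens), "output tokens")
```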
BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a high-level block diagram of a system environment for an inference system, in accordance with an embodiment.

FIGS. 2A-2B illustrate a method of batching using a machine-learning transformer model, in accordance with an embodiment.

FIGS. 3A-3B illustrate a method of selective batching using a machine-learning transformer model, in accordance with an embodiment.

FIG. 4 is a block diagram of an inference system, in accordance with an embodiment.

FIGS. 5A-5D illustrate a method of dynamic batching for processing requests using a machine-learning transformer model, in accordance with an embodiment.

FIGS. 6A-6B are a flowchart illustrating a method of selective batching using the transformer model, in accordance with an embodiment.

FIG. 7 is a flowchart illustrating a method of dynamic batching for processing requests using the transformer model, in accordance with an embodiment.

FIG. 8 is a diagram illustrating a computer system upon which embodiments described herein may be implemented within the inference system, in accordance with an embodiment.

The figures depict various embodiments of the present invention for purposes of illustration only. One skilled in the art will readily recognize from the following discussion that alternative embodiments of the structures and methods illustrated herein may be employed without departing from the principles of the invention described herein.

DETAILED DESCRIPTION

Overview

FIG. 1 is a high-level block diagram of a system environment 100 for an inference system 130, in accordance with an embodiment. The system environment 100 shown by FIG. 1 includes one or more client devices 110A, 110B, a network 120, and an inference system 130. In alternative configurations, different or additional components may be included in the system environment 100.

The inference system 130 receives requests from client devices 110A, 110B to perform tasks using machine-learning models. In one embodiment, the machine-learning models are transformer neural network models. The tasks may include, but are not limited to, natural language processing (NLP), image processing, or audio processing applications. Specifically, the transformer model may be appropriate for processing sequential data that can be tokenized into a sequence of input tokens for the request and a sequence of output tokens for the desired response. The inference system 130 receives a request including input data (e.g., text data, image or video data, audio data) and encodes the input data to a set of input tokens. The inference system 130 repeatedly applies the machine-learning transformer model for one or more iterations to generate a set of output tokens. The inference system 130 decodes the set of output tokens to output data and returns the output data as the response to the request. While for applications such as NLP applications, a sequence of input tokens or output tokens is arranged along one dimension (1-D) to represent, for example, a sequence of words, it is appreciated that in other embodiments, a sequence of input tokens or output tokens may be a multi-dimensional sequence. For example, for two-dimensional image data, the sequence of tokens may be a two-dimensional (2-D) sequence arranged along both a first direction (e.g., X-axis) and a second direction (e.g., Y-axis), where each token corresponds to a block of one or more pixels in the image.

In particular, NLP tasks involve using artificial intelligence and machine learning techniques to analyze language and may include a variety of tasks including translation, sentiment analysis, text summarization, auto-correction, and the like. When processing NLP tasks, the inference system 130 receives a request including input text of a sequence of words (e.g., a query) and encodes the input text to a sequence of input tokens that each represent a respective word in a latent space. The inference system 130 repeatedly applies a transformer model for one or more iterations to generate a sequence of output tokens (e.g., a response to the query). The output tokens are converted to output text as a response to the request.
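To illustrate that request lifecycle, the toy loop below encodes input text into tokens, generates one output token per iteration, and decodes the result back to text. The whitespace tokenizer, the stop token, and the canned stand-in model are assumptions made purely for the example.

```python
# Hypothetical end-to-end request lifecycle: encode, iterate, decode.
def encode(text: str) -> list:
    return text.split()                     # toy tokenizer: one token per word

def decode(tokens: list) -> str:
    return " ".join(tokens)

def generate(model, text: str, max_iterations: int = 16, stop_token: str = "<eos>") -> str:
    tokens = encode(text)
    outputs = []
    for _ in range(max_iterations):         # one output token per iteration
        next_token = model(tokens + outputs)
        if next_token == stop_token:
            break
        outputs.append(next_token)
    return decode(outputs)

# Stand-in "model" that returns a canned answer one token at a time.
canned = iter(["my", "name", "is", "bot", "<eos>"])
print(generate(lambda context: next(canned), "what is your name"))   # -> my name is bot
```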
For example, a transformer model may receive a sequence of input tokens that represent a query and generate a sequence of output tokens that represent a response to the query. As another example, the transformer model may receive a sequence of input tokens that represent a paragraph in French and generate a sequence of output tokens that represents a translation of the paragraph or sentence in English. As yet another example, the transformer model may receive a sequence of input tokens that represent a paragraph of text and generate a sequence of output tokens that represents a summarized version of the text.
In one embodiment, the inference system 130 includes one or more execution engines that are built on specialized hardware accelerators such as graphics processing units (GPU's) or tensor processing units (TPU's). The requests are executed on the execution engines. Specifically, execution of machine-learning neural network models, such as transformer models, involves a significant number of operations, such as tensor multiplication between input data and high-dimensional weight tensors, that can be computationally intensive. The hardware accelerators of the execution engines may be optimized to perform these operations efficiently by parallel processing, leading to significant improvement in latency or throughput when the number of parameters in the transformer model is large.

The hardware of the inference system 130 may include one or more central processing unit (CPU) cores, CPU memory (e.g., DRAM), data storage, and one or more execution engines (e.g., GPU devices). Each execution engine may include a set of cores (e.g., GPU cores) coupled to local memory (e.g., GPU memory), and may be composed of one or more hardware accelerators. In addition, the inference system 130 may be composed of multiple hardware components and components for configuring a network to connect the various components across the multiple hardware components together such that the components can coordinate with each other to process requests. For example, one execution engine may communicate with multiple hardware accelerators on multiple machines. An execution engine may process data that is stored on its local memory. Specifically, during training or inference of the transformer model, data required for inference or training is read from an input file in the data storage by the CPU or across the network 120 from, for example, a client device 110, moved to local memory of an execution engine, and processed by the execution engine. The results of the processing are retrieved by the CPU.

In one embodiment, the inference system 130 processes requests by batches to achieve higher processor utilization on the hardware accelerators. Specifically, the inference system 130 processes multiple requests in a batch together to exploit the amount of parallel computation units in the execution engines. In such an embodiment, the inference system 130 receives multiple requests each associated with an input token sequence. The inference system 130 iteratively applies the transformer model to the batch of requests to generate the output tokens for the requests together. In one instance, batching for a transformer model is made possible by grouping requests that have the same length of input token sequences together or, at each iteration, treating requests in the batch as if they all had the same input token sequence lengths as the request with the shortest length.
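To make the cost of that workaround concrete, the sketch below pads a batch of variable-length token sequences to the length of the longest sequence, which is one conventional way to force a uniform shape before batching; every padded position is computation spent on data that belongs to no request. The token ids and lengths are illustrative only.

```python
import numpy as np

# Variable-length input token sequences for one batch (ids are illustrative).
requests = [[7, 3, 9], [4, 1], [5, 8, 2, 6, 11, 12]]

max_len = max(len(r) for r in requests)
padded = np.zeros((len(requests), max_len), dtype=np.int64)   # 0 used as a pad id
mask = np.zeros((len(requests), max_len), dtype=bool)
for i, r in enumerate(requests):
    padded[i, : len(r)] = r
    mask[i, : len(r)] = True

real = int(mask.sum())
total = padded.size
print(f"padded shape {padded.shape}: {total - real} of {total} positions are padding "
      f"({(total - real) / total:.0%} wasted work)")
```

Selective batching, described above, avoids this by batching only the operations that do not mix tokens across requests and running the attention operation per request.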
Transformer Model with Batching

FIGS. 2A-2B illustrate a method of batching using a machine-learning transformer model 200, in accordance with an embodiment. In particular, the transformer model 200 is associated with a set of parameters [...] of conditions that are specified by, for example, the inference system 130 or users of the client devices 110.

FIG. 2A illustrates an encoding phase for the transformer model 200, in which the set of input token sequences are processed to generate one or more output tokens. In the example shown in FIG. 2A, the inference system 130 processing requests for a chatbot receives a first request as the question "what is your name?," a second request as the question "what is the time?," and a third request as the question "how do I pay?" The inference system 130 encodes each of the requests as a respective set of input token sequences. The first request is encoded to an input token sequence X1, the second request is encoded to an input token sequence X2, and the third request is encoded to an input token sequence X3, each request being illustrated with a different fill pattern in the figures. Each input token sequence in FIG. 2A is a one-dimensional sequence in which a sequence of tokens is arranged along a single dimension (e.g., X-direction). However, as described above with respect to FIG. 1, it is appreciated that in other embodiments, a sequence of tokens may be arranged as a multi-dimensional sequence.

As shown in FIG. 2A, since each request includes four words, each input token sequence includes four tokens, each token representing a respective word. For example, input token sequence X1 for the first request is represented by four squares that represent the words "what," "is," "your," "name." Specifically, while each word is mapped to a single square, in practice, the inference system 130 represents a token for a word as an embedding that represents the word in a multi-dimensional latent space. Thus, while each input token sequence is visually illustrated as a two-dimensional 1x4 tensor in FIG. 2A, in practice, each input token sequence may be represented as a three-dimensional tensor 1x4xH, where H is the dimension of an embedding (e.g., the direction going in or out of the page). Moreover, while each token (input token or output token) is mapped to one word for the remainder of the specification, this is merely an example, and it is appreciated that in other embodiments, each token may be mapped to different text units, combinations of text units, and the like. For example, in other embodiments, each token may be mapped to a text unit of multiple words, paragraphs, sentences, n-grams, or may be mapped to a punctuation mark (e.g., "?," "!," ".") in addition to text units.
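A small numeric illustration of those shapes, using a made-up vocabulary and an arbitrary embedding dimension H: each four-word question becomes four token ids, and an embedding lookup turns the sequence into a 1x4xH tensor.

```python
import numpy as np

vocab = {w: i for i, w in enumerate("what is your name the time how do i pay".split())}
H = 6                                                 # embedding dimension (illustrative)
rng = np.random.default_rng(0)
embedding = rng.standard_normal((len(vocab), H))      # one H-dimensional vector per token

def encode(question: str) -> np.ndarray:
    ids = [vocab[w] for w in question.lower().split()]
    return embedding[ids][np.newaxis, ...]            # shape (1, number of tokens, H)

X1 = encode("what is your name")
X2 = encode("what is the time")
X3 = encode("how do I pay")
print(X1.shape, X2.shape, X3.shape)                   # (1, 4, 6) each
```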
In one embodiment, the transformer model 200 includes a set of N decoders D1, D2, ..., DN. A decoder is coupled to receive a set of input representations and generate a set of output representations. For example, the first decoder D1 is
