Claim Appendix
The tables below provide side-by-side comparisons between corresponding claims of the ’775 patent.

Differences for the compared claim language on the right side of each chart are indicated with cross-through and underline.
`
Petitioner, EX1022
IPR2024-01234
Hugging Face, Inc., v. FriendliAI Inc.

I. ’775 PATENT | INDEPENDENT CLAIMS 1 AND 10

775 PAT | CLAIM 1

775_1[PRE]
1. A method of dynamically executing batches of requests on one or more execution engines running a machine-learning transformer model, comprising:

775_1[A]
receiving, by a serving system, one or more requests for execution, the serving system including a scheduler and one or more execution engines each coupled to access a machine-learning transformer model including at least a set of decoders;

775_1[B]
scheduling, by the scheduler, a batch of requests including the one or more requests for execution on an execution engine;

775_1[C]
generating, by the execution engine, a first set of output tokens by applying the transformer model to a first set of inputs for the batch of requests, wherein applying the transformer model comprises applying at least one batch operation to one or more input tensors associated with the batch of requests;

775_1[D]
receiving, by a request processor, a new request from a client device, the new request including a sequence of input tokens;

775_1[E]
scheduling, by the scheduler, a second batch of requests additionally including the new request for execution on the execution engine, the second batch of requests scheduled responsive to determining that the execution engine has memory available to execute the second batch of requests, wherein in a second set of inputs for the second batch of requests, a length of the sequence of input tokens for the new request is different from a length of an input for at least one request other than the new request; and

775_1[F]
generating, by the execution engine, a second set of output tokens by applying the transformer model to the second set of inputs for the second batch.

775 PAT | CLAIM 10

775_10[PRE]
A method of non-transitory computer-readable storage medium storing computer program instructions executable to perform operations for dynamically executing batches of requests on one or more execution engines running a machine-learning transformer model, the operations comprising:

775_10[A]
receiving, by a serving system, one or more requests for execution, the serving system including a scheduler and one or more execution engines each coupled to access a machine-learning transformer model including at least a set of decoders;

775_10[B]
scheduling, by the scheduler, a batch of requests including the one or more requests for execution on an execution engine;

775_10[C]
generating, by the execution engine, a first set of output tokens by applying the transformer model to a first set of inputs for the batch of requests, wherein applying the transformer model comprises applying at least one batch operation to one or more input tensors associated with the batch of requests;

775_10[D]
receiving, by a request processor, a new request from a client device, the new request including a sequence of input tokens;

775_10[E]
scheduling, by the scheduler, a second batch of requests additionally including the new request for execution on the execution engine, the second batch of requests scheduled responsive to determining that the execution engine has memory available to execute the second batch of requests, wherein in a second set of inputs for the second batch of requests, a length of the sequence of input tokens for the new request is different from a length of an input for at least one request other than the new request; and

775_10[F]
generating, by the execution engine, a second set of output tokens by applying the transformer model to the second set of inputs for the second batch.

II. ’775 PATENT | DEPENDENT CLAIMS 2 AND 11

775 PAT | CLAIM 2

775_2[A]
2. The method of claim 1, further comprising: responsive to determining that a request in the first batch of requests has been completed, providing output tokens generated for the completed request to a client device as a response to the request, and

775_2[B]
wherein the second batch of requests includes at least one of the remaining requests from the one or more requests and the new request.

775 PAT | CLAIM 11

775_11[A]
11. The method of claim 1, non-transitory computer-readable storage medium of claim 10, the operations further comprising: responsive to determining that a request in the first batch of requests has been completed, providing output tokens generated for the completed request to a client device as a response to the request, and

775_11[B]
wherein the second batch of requests includes at least one of the remaining requests from the one or more requests and the new request.
`
`
`
III. ’775 PATENT | DEPENDENT CLAIMS 3 AND 12

775 PAT | CLAIM 3

775_3
3. The method of claim 2, wherein the request is associated with a cache memory in the execution engine dedicated for storing an internal state for the request, and responsive to determining that the request has been completed, freeing the dedicated cache memory for the request in the execution engine.

775 PAT | CLAIM 12

775_12
12. The method of claim 1, non-transitory computer-readable storage medium of claim 10, wherein the request is associated with a cache memory in the execution engine dedicated for storing an internal state for the request, and responsive to determining that the request has been completed, freeing the dedicated cache memory for the request in the execution engine.

IV. ’775 PATENT | DEPENDENT CLAIMS 4 AND 13

775 PAT | CLAIM 4

775_4
4. The method of claim 1, wherein the input for the at least one request is an output token from the first set of output tokens for the at least one request, and wherein a length of the sequence of input tokens for the new request is different from a length of the output token for the at least one request.

775 PAT | CLAIM 13

775_13
13. The method non-transitory computer-readable storage medium of claim 10, wherein the input for the at least one request is an at least one output token from the first set of output tokens for the at least one request, and wherein a length of the sequence of input tokens for the new request is different from a length of the at least one output token for the at least one request.
`
`
`
V. ’775 PATENT | DEPENDENT CLAIMS 5 AND 14

775 PAT | CLAIM 5

775_5[A]
5. The method of claim 4, wherein the execution engine includes a cache memory for maintaining a key cache tensor for storing keys and a value cache tensor for storing values for the at least one request, and

775_5[B]
wherein after scheduling the second batch of requests, allocating, by the execution engine, a new cache memory dedicated to maintaining a key cache tensor and a value cache tensor for the new request.

775 PAT | CLAIM 14

775_14[A]
14. The method of claim 4, non-transitory computer-readable storage medium of claim 13, wherein the execution engine includes a cache memory for maintaining a key cache tensor for storing keys and a value cache tensor for storing values for the at least one request, and

775_14[B]
wherein after scheduling the second batch of requests, allocating, by the execution engine, a new cache memory dedicated to maintaining a key cache tensor and a value cache tensor for the new request.
`
`
`
VI. ’775 PATENT | DEPENDENT CLAIMS 6 AND 15

775 PAT | CLAIM 6

775_6
6. The method of claim 5, wherein after generating the second set of output tokens, a length of the key cache tensor for the at least one request is different from a length of a key cache tensor for the new request, and a length of the value cache tensor for the at least one request is different from a length of a value cache tensor for the new request.

775 PAT | CLAIM 15

775_15
15. The method of claim 4, non-transitory computer-readable storage medium of claim 14, wherein after generating the second set of output tokens, a length of the key cache tensor for the at least one request is different from a length of a key cache tensor for the new request, and a length of the value cache tensor for the at least one request is different from a length of a value cache tensor for the new request.

VII. ’775 PATENT | DEPENDENT CLAIMS 7 AND 16

775 PAT | CLAIM 7

775_7
7. The method of claim 1, after receiving the new request from the client device, determining, by the scheduler, that there is insufficient memory available to execute the second batch of requests on a second execution engine different from the execution engine, and responsive to the determination for the second execution engine, determining whether the execution engine has the memory available to execute the second batch of requests.

775 PAT | CLAIM 16

775_16
16. The method of claim 1, non-transitory computer-readable storage medium of claim 10, after receiving the new request from the client device, determining, by the scheduler, that there is insufficient memory available to execute the second batch of requests on a second execution engine different from the execution engine, and responsive to the determination for the second execution engine, determining whether the execution engine has the memory available to execute the second batch of requests.

VIII. ’775 PATENT | DEPENDENT CLAIMS 8 AND 17

775 PAT | CLAIM 8

775_8
8. The method of claim 1, wherein the execution engine is configured as a graphics processing unit (GPU) or a tensor processing unit (TPU).

775 PAT | CLAIM 17

775_17
17. The method of claim 1, non-transitory computer-readable storage medium of claim 10, wherein the execution engine is configured as a graphics processing unit (GPU) or a tensor processing unit (TPU).

IX. ’775 PATENT | DEPENDENT CLAIMS 9 AND 18

775 PAT | CLAIM 9

775_9
9. The method of claim 1, wherein each token in the sequence of input tokens represents a text unit.

775 PAT | CLAIM 18

775_18
18. The method of claim 1, non-transitory computer-readable storage medium of claim 10, wherein each token in the sequence of input tokens represents a text unit.
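
Purely as a reading aid, and not as a construction of any claim term or a description of any party's system, the sequence of steps recited in claim 1 together with dependent claims 2, 3, and 5 can be sketched in Python. Everything here is an assumption made for illustration: the names `Request`, `ExecutionEngine`, `cache_slots`, `schedule`, and `step` are hypothetical, the "model" is a stand-in that sums cached tokens, the per-request cache is a plain list instead of key and value tensors, and "memory available" is reduced to a count of free per-request cache slots.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Request:
    """A request: its prompt tokens plus the output tokens generated so far."""
    request_id: str
    input_tokens: list[int]
    max_new_tokens: int
    output_tokens: list[int] = field(default_factory=list)

    def done(self) -> bool:
        return len(self.output_tokens) >= self.max_new_tokens


class ExecutionEngine:
    """Toy engine; memory is modeled as a fixed number of per-request cache slots."""

    def __init__(self, cache_slots: int) -> None:
        self.cache_slots = cache_slots
        self.kv_cache: dict[str, list[int]] = {}  # request_id -> cached tokens

    def has_memory_for(self, batch: list[Request]) -> bool:
        # Each request that has no cache entry yet would need one new slot.
        new = [r for r in batch if r.request_id not in self.kv_cache]
        return len(self.kv_cache) + len(new) <= self.cache_slots

    def run_iteration(self, batch: list[Request]) -> None:
        for req in batch:
            # A dedicated cache is allocated on a request's first iteration.
            cache = self.kv_cache.setdefault(req.request_id, [])
            # First iteration: the whole prompt. Later iterations: only the
            # most recent output token, so input lengths differ across a batch.
            inputs = req.input_tokens if not cache else req.output_tokens[-1:]
            cache.extend(inputs)
            # Stand-in for applying a transformer model (not a real model).
            req.output_tokens.append(sum(cache) % 1000)

    def free(self, req: Request) -> None:
        # Release the cache dedicated to a completed request.
        self.kv_cache.pop(req.request_id, None)


def schedule(engine: ExecutionEngine, running: list[Request],
             waiting: list[Request]) -> list[Request]:
    """Build the next batch: running requests plus waiting ones that fit."""
    batch = list(running)
    while waiting and engine.has_memory_for(batch + [waiting[0]]):
        batch.append(waiting.pop(0))
    return batch


def step(engine: ExecutionEngine, running: list[Request],
         waiting: list[Request]) -> tuple[list[Request], list[Request]]:
    """Run one iteration; return (still-running requests, finished requests)."""
    batch = schedule(engine, running, waiting)
    engine.run_iteration(batch)
    finished = [r for r in batch if r.done()]
    for req in finished:
        engine.free(req)
    remaining = [r for r in batch if not r.done()]
    return remaining, finished
```

In this sketch, calling `step` once with a single request and then again with a newly arrived request in `waiting` produces a second batch in which the earlier request contributes one input token while the new request contributes its full prompt. If the engine has no free slot, the new request simply stays in `waiting`.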
`
`
`