`________________________________
`
`BEFORE THE PATENT TRIAL AND APPEAL BOARD
`________________________________
`HUGGING FACE, INC.,
`PETITIONER,
`v.
`FRIENDLIAI INC.,
`PATENT OWNER.
`________________________________
`
`Case IPR2024-01234
`Patent 11,442,775
`________________________________
`
`PETITION FOR INTER PARTES REVIEW OF USPN 11,442,775
`UNDER 35 U.S.C. §§ 311 ET SEQ. AND
`37 C.F.R. § 42.100 ET SEQ.
`
`
`
`
`
TABLE OF CONTENTS

I.    Introduction ............................................................................................................ 1

II.   Summary of the Argument .................................................................................... 1

III.  MANDATORY NOTICES UNDER 37 C.F.R. § 42.8(a)(1) ............................... 4

      A.   Real Party In Interest Under 37 C.F.R. § 42.8(b)(1) .................................. 4

      B.   Related Matters Under 37 C.F.R. § 42.8(b)(2) ........................................... 4

      C.   Lead and Back-Up Counsel Under 37 C.F.R. § 42.8(b)(3) ........................ 4

      D.   Service Information Under 37 C.F.R. § 42.8(b)(4) .................................... 5

      E.   Payment of Fees Under 37 C.F.R. § 42.15 .................................................. 5

      F.   Certification of Word Count Under 37 C.F.R. § 42.24(d) .......................... 6

IV.   GROUNDS FOR STANDING UNDER 37 C.F.R. § 42.104(a) .......................... 6

V.    IDENTIFICATION OF GROUNDS FOR WHICH REVIEW IS
      REQUESTED UNDER 37 C.F.R. § 42.104(b)(1) ................................................ 6

VI.   HOW THE CHALLENGED CLAIMS ARE TO BE CONSTRUED
      UNDER 37 C.F.R. § 42.104(b)(3) ......................................................................... 7

VII.  OVERVIEW OF THE ’775 PATENT .................................................................. 8

      A.   Summary of the ’775 Specification’s Description of Claimed Matter ........ 8

      B.   Summary of the ’775 Patent Prosecution History ...................................... 11

VIII. STATE OF TECHNOLOGY ............................................................................... 13

      A.   Machine Learning and RNNs ..................................................................... 13

      B.   Transformers ............................................................................................... 17

      C.   Batching in Machine Learning ................................................................... 19

IX.   OVERVIEW OF THE PRIOR ART .................................................................... 21

      A.   Summary of Gao ......................................................................................... 21

      B.   Summary of Katharopoulos ........................................................................ 25

X.    HOW THE CLAIMS ARE UNPATENTABLE UNDER 37 C.F.R.
      § 42.104(b)(4) ...................................................................................................... 27

      A.   Level of Skill in the Art .............................................................................. 27

      B.   Ground 1 – Gao in view of Katharopoulos renders obvious
           claims 1-18 under 35 U.S.C. § 103 ............................................................ 27

XI.   DISCRETIONARY DENIAL IS NOT SUPPORTED ........................................ 78

      A.   35 U.S.C. § 325(d) ...................................................................................... 78

      B.   35 U.S.C. § 314(a) ...................................................................................... 79

XII.  CONCLUSION ..................................................................................................... 80
`
`
`
`
`
`
`
`
LIST OF EXHIBITS

EX       DESCRIPTION

1001     U.S. Patent No. 11,442,775

1002     Declaration of Ghaith Hammouri

1003     C.V. of Ghaith Hammouri

1004     Pin Gao, et al., Low Latency RNN Inference with Cellular Batching, Thirteenth
         EuroSys Conference 2018, April 23–26, 2018, Porto, Portugal, published by the
         Association for Computing Machinery (2018) (“Gao”)

1005     Katharopoulos, et al., Transformers are RNNs: Fast Autoregressive Transformers
         with Linear Attention, arXiv:2006.16236v3, Aug. 31, 2020, Proceedings of the
         37th International Conference on Machine Learning, Online, PMLR 119, 2020
         (“Katharopoulos”)

1006     Original Complaint, FriendliAI Inc. v. Hugging Face, Inc., Case No. 23-816-MN,
         D. Del. (filed July 28, 2023)

1007     First Amended Complaint, FriendliAI Inc. v. Hugging Face, Inc., Case No.
         23-816-MN, D. Del. (filed January 8, 2024)

1008     Gyeong-In Yu, et al., Orca: A Distributed Serving System for Transformer-Based
         Generative Models, Proceedings of the 16th USENIX Symposium on Operating
         Systems Design and Implementation, July 11–13, 2022, Carlsbad, CA, USA
         (“Orca paper”)

1009     A. Vaswani, et al., Attention Is All You Need, Advances in Neural Information
         Processing Systems, 2017 (“Vaswani”)

1010     Prosecution history of Application No. 17/542,193

1011     Plaintiff’s Disclosure of Proposed Claim Constructions, FriendliAI Inc. v.
         Hugging Face, Inc., Case No. 23-816-MN, D. Del. (served May 28, 2024)

1012     Defendant’s Disclosure of Proposed Claim Construction, FriendliAI Inc. v.
         Hugging Face, Inc., Case No. 23-816-MN, D. Del. (served May 28, 2024)

1013     Plaintiff’s Final Infringement Contentions of ’775 Patent, FriendliAI Inc. v.
         Hugging Face, Inc., Case No. 23-816-MN, D. Del. (served July 1, 2024)

1014     Ilya Sutskever, et al., Sequence to Sequence Learning with Neural Networks,
         arXiv:1409.3215v3, Dec. 14, 2014, Proceedings of the 28th Conference on Neural
         Information Processing Systems 2014, December 8–13, 2014, Palais des congrès
         de Montréal, Canada (“Sutskever”)

1015     Dzmitry Bahdanau, et al., Neural Machine Translation by Jointly Learning to
         Align and Translate, arXiv:1409.0473v7, May 19, 2016, Proceedings of the
         International Conference on Learning Representations 2015, May 7–9, 2015,
         San Diego, California (“Bahdanau”)

1016     Romain Paulus, et al., A Deep Reinforced Model for Abstractive Summarization,
         arXiv:1705.04304v3, November 13, 2017, published online by Salesforce
         Research, Palo Alto, California (“Paulus”)

1017     Peter J. Liu, et al., Generating Wikipedia by Summarizing Long Sequences,
         arXiv:1801.10198v1, January 30, 2018, Proceedings of the International
         Conference on Learning Representations 2018, April 30–May 3, 2018,
         Vancouver, Canada (“Liu”)

1018     Alec Radford, et al., Improving Language Understanding by Generative
         Pre-Training, https://api.semanticscholar.org/CorpusID:49313245, June 11, 2018,
         published online (“Radford”)

1019     Yao-Hung Hubert Tsai, et al., Transformer Dissection: A Unified Understanding
         of Transformer’s Attention via the Lens of Kernel, arXiv:1908.11775v4,
         November 11, 2019, Proceedings of the 2019 Conference on Empirical Methods
         in Natural Language Processing and the 9th International Joint Conference on
         Natural Language Processing (EMNLP-IJCNLP) (pp. 4335–4344), November
         3–7, 2019, Hong Kong, China (“Tsai”)

1020     Ankit Singh Rawat, et al., Sampled Softmax with Random Fourier Features,
         arXiv:1907.10747v2, December 31, 2019, Proceedings of the Conference on
         Neural Information Processing Systems 2019, December 8–14, 2019, Vancouver,
         Canada (“Rawat”)

1021     Guy Blanc, et al., Adaptive Sampled Softmax with Kernel Based Sampling,
         arXiv:1712.00527v2, August 1, 2018, Proceedings of the 35th International
         Conference on Machine Learning, PMLR Vol. 80 pp. 590–599, July 10–15, 2018,
         Stockholm, Sweden (“Blanc”)

1022     Claim appendix for claims 1-18 of the ’775 Patent
`
`
`
`
`
I. Introduction
`Hugging Face, Inc. (“Petitioner”), hereby respectfully requests Inter Partes
`
`Review pursuant to 35 U.S.C. §§ 311 et seq. and 37 C.F.R. §§ 42.100 et seq., of
`
`claims 1-18 (the “Challenged Claims”) of U.S. Patent No. 11,442,775 (“the ’775
`
Patent”) issued on Sept. 13, 2022 to Gyeong-In Yu et al. See Ex-1001.
`
`As explained in detail below, the Challenged Claims are rendered obvious in
`
`view of the prior art cited herein. Therefore, the Board should institute an inter partes
`
`review of claims 1–18 of the ’775 Patent.
`
II. Summary of the Argument
`The ’775 Patent generally relates to an alleged improvement for machine
`
`learning transformer neural network models. Ex-1001, 1:7–9. In its complaint in the
`
`underlying litigation, Patent Owner states the ’775 Patent claims are directed to
`
`“methods used to process batches of transformer-based requests, which involves
`
`scheduling batches at an iteration-level.” Ex-1007, ¶¶ 23-27. According to Patent
`
`Owner, “iteration-level scheduling…allows for a finished request to be sent to a
`
`client, and for new requests to be sent to the execution engine, before all requests in
`
`a batch are completed.” Id., ¶ 14.
`
`However, after filing the ’775 Patent, the inventors acknowledged in their
`
`published “Orca” paper that “iteration-level scheduling” was a previously known
`
`method to optimize a recurrent neural network (RNN). Ex-1008, 532–533. Indeed,
`
`
`
`
`1
`
`
`
`
`
`when discussing the Gao reference (Ex-1004) cited in this IPR Petition, the inventors
`
`of the challenged ’775 Patent stated:
`
`We would like to highlight BatchMaker [Gao] as one of the most
`relevant previous works. BatchMaker is a serving system for RNNs that
`performs scheduling and batching at the granularity of RNN
`cells,…BatchMaker allows a newly arrived request for RNN to join (or
`a finished request to leave) the current executing batch without waiting
`for the batch to completely finish.
`Ex-1008, 532-533 (emphasis added).1 This description of Gao is the same as Patent
`
Owner’s description of the ’775 claims. Ex-1007, ¶ 14 (’775 claims “allows for a finished
`
`request to be sent to a client, and for new requests to be sent to the execution engine,
`
`before all requests in a batch are completed.”).
`
`Notably, Patent Owner alleges in its Complaint against Petitioner that the
`
`Orca paper embodies the ’775 Patent. Ex-1007, ¶ 36 (Patent Owner stating that “The
`
`claimed advancements were described in a paper, titled “Orca:…”) and ¶ 38
`
`(“PeriFlow (aka Orca) practices the ’775 patent”). Thus, the admissions about Gao
`
in the Orca paper are strong evidence that Gao teaches scheduling batches at an
`
`iteration-level, the feature that is allegedly claimed in the ’775 Patent.
`
The only difference between the claims of the ’775 Patent and Gao is that the claims

of the ’775 Patent recite a “transformer model” while Gao discloses an RNN model.
`
`Yet, the application of Gao’s scheduling technique for an RNN model to a
`
`
`1 Emphasis always added by Petitioner unless otherwise noted.
`
`
`
`
`2
`
`
`
`
`
`transformer model was not inventive, but rather a predictable and obvious
`
`implementation of a known improvement for one machine learning model applied
`
`to a different machine learning model. KSR International Co. v. Teleflex Inc., 550
`
`U.S. 398, 401 (2007).
`
`Indeed, by the time of the ’775 Patent, it was known that certain transformer
`
`models act as an RNN. For example, Katharopoulos is titled and teaches
`
`“Transformers are RNNs:…” thereby explicitly linking these two machine learning
`
`models. See Ex-1005, 5 (“any transformer layer with causal masking can be written
`
`as a model that, given an input, modifies an internal state and then predicts an output,
`
`namely a Recurrent Neural Network (RNN).”). As discussed in this IPR petition, the
`
`combination of Gao’s scheduling and batching with Katharopoulos’s transformer
`
`model was obvious and teaches all the claims of the ’775 Patent.
`
`Lastly, the Orca paper attempts to distinguish Gao by arguing that
`
`“BatchMaker [Gao] cannot make batches of cells for Transformer models” because
`
of certain performance characteristics. See Ex-1008, 533. Yet, the Orca paper

provides nothing to support these conclusions and, further, these conclusions are

irrelevant to the broadly claimed transformer model in the ’775 Patent. The Orca

paper also does not address applying Gao to Katharopoulos’s transformer model
`
`which uses linear attention and overcomes the Orca paper’s criticism of applying
`
`Gao to a transformer model. See e.g., Ex-1005, 1 (“In this paper, we introduce the
`
`
`
`
`3
`
`
`
`
`
`linear transformer model that significantly reduces the memory footprint and scales
`
`linearly with respect to the context length.”).
`
`III. MANDATORY NOTICES UNDER 37 C.F.R. § 42.8(a)(1)
Pursuant to 37 C.F.R. § 42.8(a)(1), Petitioner provides the following mandatory

notices for this Petition for Inter Partes Review of the ’775 patent.
`
`A. Real Party In Interest Under 37 C.F.R. § 42.8(b)(1)
`
`The Petitioner and real party in interest is Hugging Face, Inc. Petitioner is
`
`incorporated in Delaware with a principal business address of 20 Jay St, Suite 620,
`
`Brooklyn, NY 11201.
`
`B. Related Matters Under 37 C.F.R. § 42.8(b)(2)
`
`The ’775 Patent is presently asserted by the Patent Owner against Petitioner
`
`in FriendliAI Inc. v. Hugging Face, Inc., Case No. 1:23-cv-00816-MN, D. Del. (filed
`
`on July 28, 2023). See Exs. 1006 and 1007.
`
To the best of Petitioner’s knowledge, the ’775 Patent has not been at issue in
`
`any other litigation or PTAB proceeding before the instant petition was filed.
`
`C. Lead and Back-Up Counsel Under 37 C.F.R. § 42.8(b)(3)
`
`Petitioner is represented by the following counsel:
`
`
`
`
`
`
`4
`
`
`
`
`
Lead Counsel

James P. Murphy
Reg. No. 55,474
Polsinelli PC
1000 Louisiana Street
Suite 6400
Houston, Texas 77002
Tel: (713) 374-1631
jpmurphy@polsinelli.com

Backup Counsel

Adam P. Daniels
Reg. No. 66,681
Polsinelli LLP
2049 Century Park E.
Suite 2900
Los Angeles, CA 90067
Tel: (310) 556-6754
adaniels@polsinelli.com
`
`
`Pursuant to 37 C.F.R. § 42.10(b), Powers of Attorney have been filed with
`
`this Petition.
`
D. Service Information Under 37 C.F.R. § 42.8(b)(4)
`
`Physical mailing service information for lead and back-up counsel is as
`
`follows:
`
`James Murphy
`Polsinelli PC
`1000 Louisiana Street
`Suite 6400
`Houston, Texas 77002
Petitioner also consents to service by e-mail at the e-mail addresses provided above

for lead and back-up counsel.
`
E. Payment of Fees Under 37 C.F.R. § 42.15
`
`All required fees have been paid with the filing of this Petition. Petitioner
`
`further authorizes the U.S. Patent & Trademark Office to charge Deposit Account
`
`
`
`
`5
`
`
`
`
`
`No. 50-1662 for any fees, including the fee set forth in 37 C.F.R. § 42.15(a) for this
`
`Petition.
`
`F. Certification of Word Count Under 37 C.F.R. § 42.24(d)
`
`Petitioner certifies that the word count in this Petition, including all footnotes,
`
`is 13,982 words as counted by the word-processing program (Microsoft Word for
`
`Office 365) used to generate this Petition, where such word count excludes the table
`
`of contents, mandatory notices, certificate of service, list of exhibits, and this
`
`certificate of word count. This Petition is in compliance with the 14,000 word limit
`
`set forth in 37 C.F.R. § 42.24(a)(1)(i).
`
`IV. GROUNDS FOR STANDING UNDER 37 C.F.R. § 42.104(a)
`Petitioner certifies that the ’775 patent is available for inter partes review.
`
`Petitioner is not barred or estopped from requesting an inter partes review of the
`
`’775 patent claims on the grounds identified in this Petition. 37 C.F.R. § 42.104(a).
`
V. IDENTIFICATION OF GROUNDS FOR WHICH REVIEW IS
REQUESTED UNDER 37 C.F.R. § 42.104(b)(1)
`Petitioner asserts that claims 1-18 (the “Challenged Claims”) of the ’775
`
`patent are unpatentable based on the following ground:
`
`Ground 1: Claims 1-18 are rendered obvious under 35 U.S.C. § 103 by Gao
`
`in view of Katharopoulos. See Ex-1002, ¶¶12-13.
`
`
`
`
`6
`
`
`
`
`
`VI. HOW THE CHALLENGED CLAIMS ARE TO BE CONSTRUED
`UNDER 37 C.F.R. § 42.104(b)(3)
`In an IPR, claim terms are to be construed in accordance with the standard set
`
`forth in Phillips. Phillips v. AWH Corp., 415 F.3d 1303, 1312 (Fed. Cir. 2005) (en
`
`banc). Further, claim terms need only be construed “to the extent necessary to
`
resolve the controversy.” Vivid Techs., Inc. v. Am. Sci. & Eng’g, Inc., 200 F.3d 795,
`
`803 (Fed. Cir. 1999). In the underlying litigation, the Patent Owner has asserted that
`
`all claim terms are to be given their plain and ordinary meaning without providing
`
`any further constructions. Ex-1011. Petitioner has proposed constructions for certain
`
`claim terms. Ex-1012, 2–3. Here, Petitioner does not believe that any term requires
`
`express construction to resolve the invalidity grounds presented in this Petition since
`
the prior art renders claims 1–18 obvious under any reasonable interpretation of the claims.
`
`Claim construction negotiations are ongoing in the underlying litigation, and
`
`substantive briefing has not been filed. At present, Petitioner has proposed that
`
certain limitations of the claims of the ’775 Patent are indefinite because Patent Owner’s
`
`infringement positions require a claim scope that is unsupported and cannot be
`
`determined with reasonable certainty by a POSITA when reading the claims in light
`
`of the specification and prosecution history. Ex-1012, 4. To be clear, the uncertainty
`
of the scope is an issue of infringement due to the unreasonable breadth with which
`
`Patent Owner is interpreting the claims to allege infringement. See Ex-1013. Yet,
`
`the arguments presented in this Petition do not rely on Patent Owner’s unreasonable
`
`
`
`
`7
`
`
`
`
`
`interpretations of the boundaries of the claim scope (rather the prior art reads on the
`
claims under a narrower scope than Patent Owner is alleging). Thus, the Board does

not need to determine whether the outer boundaries of the claims as asserted by Patent

Owner are indefinite in order to determine that the prior art renders claims 1-18

unpatentable as obvious. See Ex-1002, ¶25.
`
`VII. OVERVIEW OF THE ’775 PATENT
A. Summary of the ’775 Specification’s Description of Claimed Matter
`
`The ’775 Patent relates to an inference system that applies a machine-learning
`
`transformer model to batches of input requests with variable input lengths. ’775
`
`Patent, Abstract. The ’775 Patent, in Figures 5A–5D, illustrates the claimed method
`
`for dynamic batching and processing of requests using a machine-learning
`
`transformer model. ’775 Patent, Figures 5A–5D; 22:22–24:38; see Ex-1002, ¶33.
`
`
`
`
`8
`
`
`
`
`
`
`
`
`iServingSystem=
`:435
`
` Request Processor
`
`Completion
`
`Incoming
`
`:
`
`|589 Execution Engine
`
`R2
`
`290A
`
`R14
`
`Execution Engine
`22068
`
`KV Cache
`Eo} jRsEt
`
`FIG. 5A
`
`
`
`9
`
`
`
`
`
`
`
As new request R2 arrives at request processor 580, it is forwarded to
`
`scheduler 585 which monitors the cache memory for execution engines 590A and
`
`590B to determine if memory is available for processing request R2. ’775 Patent,
`
22:58–23:29. Moving to Figure 5B, as a first output token is generated for requests

R1, R3, R4, and R5, execution engine 590A is now scheduled to execute an updated

batch of R1 and R2 at a second iteration. ’775 Patent, 23:30–50; see Ex-1002, ¶¶34-38.
`
In Figure 5C, a second output token is generated for requests R1, R3, R4, and

R5, and a first output token is generated for request R2 with an end token that moves
`
`the outputs for request R2 to the completion queue of the request processor 580.
`
`’775 Patent, 23:51–58; see Ex-1002, ¶39.
`
`
`
`
`10
`
`
`
`
`
`
`
`Accordingly, by having dynamic batches for each iteration, “completed
`
`requests can be provided to the client device 110 as soon as processing is complete,
`
and the scheduler 585 can schedule new requests.” ’775 Patent, 24:10–16; see Ex-
`
`1002, ¶¶40-41.
`
B. Summary of the ’775 Patent Prosecution History
`
`On December 7, 2021, applicants filed Application No. 17/542,193 (“the ’193
`
`Application”) which issued as the ’775 Patent. See Ex-1010. During prosecution of
`
`the ’193 Application, the Examiner issued a single office action rejecting certain
`
`pending claims, including all independent claims, as being rendered obvious by U.S.
`
`
`
`
`11
`
`
`
`
`
Patent Publication 2021/0192314 to Aarts et al. (“Aarts”) in view of U.S. Patent

No. 10,846,096 to Chung et al. (“Chung”). Ex-1010, 120–124.
`
`In response to the office action, the applicant amended both pending
`
`independent claims to further recite “wherein in a second set of inputs for the second
`
`batch of requests, a length of the sequence of input tokens for the new request is
`
`different from a length of an input for at least one request other than the new
`
`request.” Id., 181 and 184. Applicant argued that this amendment in conjunction
`
`with the scheduler limitation reciting “a second batch of requests additionally
`
`including the new request” distinguishes the prior art as emphasized in applicant’s
`
`response below:
`
`
`
`Id., 189 (emphasis in original).
`
Applicant argued that the independent claims as amended require “the second

batch of requests is modified to include a new request in addition to the one or more

requests of the [first] batch of requests.” Id., 189 (bracket in original); see also id.,

191 (Applicant stating that “Claim 10 recites similar features as claim 1” and is

distinguishable over the prior art for the same reasons as claim 1).
`
`
`12
`
`
`
`
`
`distinguishable over the prior art for the same reasons as claim 1). According to the
`
`applicant, this distinguishes the claims from the prior art because:
`
`In existing batching methods for transformer models, it is difficult to
`modify a batch of requests once it has started to process on an execution
`engine, since the length of the inputs or internal states are the same
`across the requests in the batch.
`Id., 189.
`
Applicant then argued that neither Aarts nor Chung teaches batching in the
`
`context of a “machine-learned transformer model” and also that neither teaches
`
`“subsequently scheduling a second batch of requests that is modified to include a
`
`new request in addition to the first batch of requests, in which a length of the
`
`sequence of input tokens for a new request is different from a length of an input for
`
`at least one request other than the new request in the batch.” Id., 190–191.
`
`The Examiner then allowed all claims without any further comments. Id., 199.
`
`VIII. STATE OF TECHNOLOGY2
`A. Machine Learning and RNNs
`
`Machine Learning in general is a field of study focused on using statistical
`
`techniques and algorithms in order to generalize data observations and predict the
`
`
`2 Cited references not named in a ground of rejection are cited for the purpose of
`showing the state of the art and the background knowledge of a POSITA. Randall
`Mfg. v. Rea, 733 F.3d 1355, 1362-63 (Fed. Cir. 2013).
`
`
`
`
`13
`
`
`
`
`
`behavior over unseen data. One of the oldest techniques employed in machine
`
learning is that of a Neural Network (NN) model, which was developed and

introduced in the 1950s and 1960s. Early NNs were built using a single building
`
cell, namely the perceptron. The perceptron is a simple circuit that takes in a number

of binary inputs, each modulated by a corresponding weight (a real number), and

produces a single decision output of True or False (1 or 0) depending on whether the

weighted sum of its inputs reaches an activation value (also a real number). The idea behind the
`
`perceptron was to capture the functionality that takes place in a human neuron. See
`
`Ex-1002, ¶43.
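
By way of illustration only, the thresholding behavior described above can be sketched in Python as follows; the weights and threshold shown are hypothetical values chosen solely for this example:

    def perceptron(inputs, weights, threshold):
        # Weighted sum of the binary inputs, compared against an activation threshold.
        weighted_sum = sum(x * w for x, w in zip(inputs, weights))
        return 1 if weighted_sum >= threshold else 0

    # Two binary inputs with hypothetical weights 0.6 and 0.4 and threshold 0.5.
    perceptron([1, 0], [0.6, 0.4], threshold=0.5)  # returns 1 (0.6 >= 0.5)
    perceptron([0, 1], [0.6, 0.4], threshold=0.5)  # returns 0 (0.4 < 0.5)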
`
In order to achieve higher levels of generalization (perceived intelligence), an

NN utilizes many perceptrons arranged in the form of a layer where multiple inputs

are mapped into multiple outputs. Moreover, while a simple NN is made of a single

layer, a more complex NN can be made of many interconnected layers (sometimes

referred to as a Multi-Layer Perceptron (MLP)). Typically, each layer is made of a

number of perceptrons whose outputs are connected to the next layer, and so forth. The

first layer is connected to the input, the last layer produces the output of the

circuit, and the layers in between are hidden layers. Once the network is created, the goal
`
`becomes to figure out the correct weights used in each perceptron in order to
`
`generate the correct output. This is the process of learning. See Ex-1002, ¶44.
`
`
`
`
`14
`
`
`
`
`
`Learning in general is carried out over a set of desired inputs and outputs
`
`(training samples). The goal is to feed the input representation into the NN model
`
`and to find a way to modify the weights of the NN until the correct output is
`
`produced. This process is carried out for all the training samples over many iterations
`
until the NN model is finally capable of producing the correct output for

the training inputs. See Ex-1002, ¶45.
`
The types of algorithms used to train an NN are outside of our scope here.
`
`However, it suffices to say that these training algorithms depend on the architecture
`
`and components of the underlying NN and typically require considerable
`
`computational resources to complete. A successful NN architecture would lend itself
`
`to training in a way that obtains higher levels of accuracy when tested to predict the
`
`output of the training samples. More importantly, the NN is expected to obtain high
`
`levels of accuracy when exposed to new data that the NN never saw during the
`
`training phase. See Ex-1002, ¶46.
`
Over the years, many different types of NNs were introduced. The most well-known

today are Recurrent Neural Networks (RNNs) and Convolutional Neural

Networks (CNNs), both of which are considered to be Deep Neural Networks

(DNNs). Here we are interested in RNNs as they relate to the subject at hand. See Ex-
`
`1002, ¶47.
`
`
`
`
`15
`
`
`
`
`
`RNNs were first introduced in the 1980s and only became practical to use
`
during the late 1990s with the introduction of Long Short-Term Memory (LSTM).
`
`Since then, RNNs have evolved to encompass many types and variations of their
`
`underlying cell and neural networks. In general, RNNs were designed to process
`
`sequential data. That is, data that changes over time. Whereas NNs expect an input
`
`of fixed size in order to produce an output, an RNN is designed to be able to process
`
`data in a sequence of varying length such as text or speech. See Ex-1002, ¶48.
`
`In its simplest form, an RNN can be viewed as a functional cell that takes in
`
`an external input along with a value of an internal hidden state, and correspondingly
`
`updates the values of the internally stored hidden state. This RNN function depends
`
`on a number of parameters that are learned during training. When processing an
`
`input sequence, say a list of words, the hidden state is initialized to some value before
`
the RNN moves to process the first word in the sequence as input.
`
`Correspondingly, the RNN updates the value of the internal hidden state in a way
`
`that depends on the first word processed. Once the RNN moves to the next word in
`
`the list, it repeats the same computation only this time the hidden state used in the
`
`function has been updated in a way that depends on the first word. Continuing in this
`
`fashion, the RNN can process any number of input words until there are no more
`
`words to process. At such point, the internal state of the RNN holds a value that has
`
`been continuously updated corresponding to every word in the input. As a result, the
`
`
`
`
`16
`
`
`
`
`
`internal state value can now be thought of as a representation of the input words and
`
`as such can be used to carry out a classification or prediction task relating to the
`
`processed input. See Ex-1002, ¶49.
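
By way of illustration only, the recurrence described above can be sketched in Python as follows; the cell function and initial state stand in for whatever learned update a particular RNN architecture uses:

    def run_rnn(rnn_cell, words, initial_state):
        # The hidden state is initialized once and then updated for each input
        # element; the final state reflects every word processed in the sequence.
        state = initial_state
        for word in words:
            state = rnn_cell(state, word)
        return state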
`
`Due to the success of RNNs in many applications (e.g. Natural Language
`
`Processing (NLP), speech recognition, image classification) a large number of RNN
`
`variations can be found in the literature. These various types of RNNs differ in their
`
functionality, components, number of inputs or outputs, and so on. Examples of

RNN models in use include LSTM, Gated Recurrent Units (GRU), Sequence to Sequence
`
`(Seq2Seq), and further include many other types of architectures and building
`
`blocks. Regardless of this variation, all RNNs share the same underlying property of
`
`retaining a hidden state that is updated according to the changing input in order to
`
`affect the final output of the circuit. See Ex-1002, ¶50.
`
`B.
`
`Transformers
`
`In their groundbreaking work, Vaswani introduced Transformers, a new Deep
`
`Neural Network (DNN) model that completely relies on the attention mechanism to
`
`draw global dependencies between input and output. See Ex-1009. Simply put, the
`
`attention mechanism takes in an input encoded as a vector and maps it to an output
`
`that is also a vector. See Ex-1002, ¶51.
`
`By first mapping a sequence of encoded inputs to a corresponding set of
`
`Queries, Keys, and Values, (QKV) the output of the attention mechanism is formed
`
`
`
`
`17
`
`
`
`
`
`as a weighted sum of input Values, where the weight assigned to each Value is
`
`computed as a compatibility function between a Query and a Key. See Ex-1002, ¶52.
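
By way of illustration only, the scaled dot-product form of this weighted sum described in Vaswani (Ex-1009) can be sketched in Python (using NumPy) as follows:

    import numpy as np

    def attention(Q, K, V):
        # Each output row is a weighted sum of the rows of V, where the weights are
        # a softmax over the compatibility (dot product) between a Query and each Key.
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights = weights / weights.sum(axis=-1, keepdims=True)
        return weights @ V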
`
`Although the Attention mechanism already existed in the literature (Ex-1015),
`
`the main contribution of Vaswani was to rely only on the attention mechanism to
`
`extract any interdependencies between the elements making up the input sequence.
`
`See Ex-1002, ¶53.
`
`Accordingly, one can easily see that all the building blocks making up a
`
`transformer model (except for the attention block) are element-wise operations that
`
`do not observe interdependency between input elements. Typically, transformers
`
`operate in an auto-regressive fashion where the predicted next element in a sequence
`
`of input elements is concatenated to the input and fed again to the system until a
`
`final special output element (e.g. <eos>) is generated. This mode of operation is very
`
`similar to that used in Seq2Seq RNN-models. Ex-1014. See Ex-1002, ¶54.
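
For illustration only, the auto-regressive mode of operation described above can be sketched as the following loop; the model function, token values, and end-of-sequence marker are hypothetical placeholders:

    def generate(model, prompt_tokens, eos_token, max_steps=100):
        # The predicted next element is appended to the sequence and the whole
        # sequence is fed back to the model until an end-of-sequence element appears.
        tokens = list(prompt_tokens)
        for _ in range(max_steps):
            next_token = model(tokens)
            tokens.append(next_token)
            if next_token == eos_token:
                break
        return tokens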
`
`The original Transformer proposed by Vaswani was made up of two main
`
`component-stacks, an encoder-stack followed by a decoder-stack. While both
`
`components have slightly different building blocks and connections, their attention
`
`mechanism differs in one main way. See Ex-1002, ¶55.
`
`In an encoder, attention is computed between all input elements regardless of
`
`their position within the input sequence. On the other hand, the decoder uses masked
`
`attention (or causal attention), which only allows the attention to be computed
`
`
`
`
`18
`
`
`
`
`
`between an input element and previous elements up to itself within the same input
`
`sequence. This difference ensures that the decoder, mainly tasked with predicting
`
`the next element in the input sequence, does not break the chain of causality. That
`
`is, an element can only be influenced by itself and previous elements but not future
`
`elements. See Ex-1002, ¶56.
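
By way of illustration only, and assuming the same illustrative scaled dot-product form sketched above, causal masking can be implemented by setting the score between a position and any later position to negative infinity before the softmax, so that future elements receive zero weight:

    import numpy as np

    def causal_attention(Q, K, V):
        # Masked (causal) self-attention: position i may attend only to positions <= i.
        n, d_k = Q.shape[0], K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)
        future = np.triu(np.ones((n, n), dtype=bool), k=1)  # strictly later positions
        scores = np.where(future, -np.inf, scores)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights = weights / weights.sum(axis=-1, keepdims=True)
        return weights @ V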
`
`Many different Transformer architectures have been proposed, but perhaps
`
one of the most popular was the Decoder-Only Transformer (DOT) (Ex-1017), which is

best known for its use in GPT (Generative Pre-Trained Transformer). See
`
`Ex-1018. See Ex-1002, ¶57.
`
`In the DOT architecture, the Transformer is made of a number of decoders
`
`stacked in multiple consecutive layers, with their final output used to predict the next
`
element of an input sequence. This architecture was preferred for its simplicity along

with efficient-implementation features, such as masked attention, which allows the

Key and Value elements computed in previous iterations over the inputted and

generated elements to be reused, without requiring dependency
`
`on future element keys and values. See Ex-1002, ¶58.
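
By way of illustration only (and not as a description of any particular system), the reuse of previously computed Key and Value elements during auto-regressive decoding can be sketched as a simple cache; the projection matrices and cache layout are hypothetical:

    import numpy as np

    def decode_step(new_embedding, cache, wq, wk, wv):
        # Only the newest element's Query, Key, and Value are computed; Keys and
        # Values from earlier iterations are reused directly from the cache.
        q = new_embedding @ wq
        cache["K"].append(new_embedding @ wk)
        cache["V"].append(new_embedding @ wv)
        K, V = np.stack(cache["K"]), np.stack(cache["V"])
        scores = K @ q / np.sqrt(K.shape[-1])   # causal by construction: only past Keys exist
        weights = np.exp(scores - scores.max())
        weights = weights / weights.sum()
        return weights @ V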
`
`C. Batching in Machine Learning
`
Batching (sometimes called batch processing) is a well-known computing technique

that refers to the idea of combining multiple inputs to be processed in parallel as
`
`a batch. In the context of ML, batching has been widely used for many years in order
`
`
`
`
`19
`
`
`
`
`
`to expedite the time required to process inputs. In the training phase of ML, batching
`
`is intimately related to the training algorithms used (e.g. minibatch-based Stochastic
`
`Gradient Descent (SGD)). Without going into the details, in this phase the training
`
dataset is typically broken into smaller batches in order to reduce the amount of
`
`memory used (relative to what is required for the entire dataset) while expediting the
`
`processing time (relative to what is required for the entire dataset) by computing the
`
`training function in parallel. See Ex-1002, ¶59.
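
Purely as an illustration of this idea (the dataset and batch size are hypothetical), splitting a training set into smaller batches can be sketched as:

    def make_batches(dataset, batch_size):
        # Split the training samples into smaller batches so that each batch fits in
        # memory and its samples can be processed in parallel.
        return [dataset[i:i + batch_size] for i in range(0, len(dataset), batch_size)]

    make_batches(list(range(10)), batch_size=4)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]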
`
`In the inference phase of ML, the model is treated as a function that takes in
`
`an input and produces an output in real-time using the underlying trained network.
`
`To expedite this process, batching is used to process multiple inputs at the same time,
`
`thus maximizing utility of the computational resources such as memory and
`
`processors and increasin