UNITED STATES PATENT AND TRADEMARK OFFICE
________________________________

BEFORE THE PATENT TRIAL AND APPEAL BOARD
________________________________

HUGGING FACE, INC.,
PETITIONER,

v.

FRIENDLIAI INC.,
PATENT OWNER.
________________________________

Case IPR2024-01234
Patent 11,442,775
________________________________

PETITION FOR INTER PARTES REVIEW OF USPN 11,442,775
UNDER 35 U.S.C. §§ 311 ET SEQ. AND
37 C.F.R. § 42.100 ET SEQ.
TABLE OF CONTENTS

I.    Introduction .......................................................................................... 1
II.   Summary of the Argument ................................................................... 1
III.  MANDATORY NOTICES UNDER 37 C.F.R. § 42.8(a)(1) ................ 4
      A.  Real Party In Interest Under 37 C.F.R. § 42.8(b)(1) ...................... 4
      B.  Related Matters Under 37 C.F.R. § 42.8(b)(2) ............................... 4
      C.  Lead and Back-Up Counsel Under 37 C.F.R. § 42.8(b)(3) ............ 4
      D.  Service Information Under 37 C.F.R. § 42.8(b)(4) ........................ 5
      E.  Payment of Fees Under 37 C.F.R. § 42.15 ...................................... 5
      F.  Certification of Word Count Under 37 C.F.R. § 42.24(d) .............. 6
IV.   GROUNDS FOR STANDING UNDER 37 C.F.R. § 42.104(a) ........... 6
V.    IDENTIFICATION OF GROUNDS FOR WHICH REVIEW IS
      REQUESTED UNDER 37 C.F.R. § 42.104(b)(1) ................................ 6
VI.   HOW THE CHALLENGED CLAIMS ARE TO BE CONSTRUED
      UNDER 37 C.F.R. § 42.104(b)(3) ........................................................ 7
VII.  OVERVIEW OF THE ’775 PATENT .................................................. 8
      A.  Summary of the ’775 Specification’s Description of Claimed
          Matter .............................................................................................. 8
      B.  Summary of the ’775 Patent Prosecution History ........................ 11
VIII. STATE OF TECHNOLOGY .............................................................. 13
      A.  Machine Learning and RNNs ....................................................... 13
      B.  Transformers ................................................................................. 17
      C.  Batching in Machine Learning ..................................................... 19
IX.   OVERVIEW OF THE PRIOR ART ................................................... 21
      A.  Summary of Gao ........................................................................... 21
      B.  Summary of Katharopoulos .......................................................... 25
X.    HOW THE CLAIMS ARE UNPATENTABLE UNDER 37 C.F.R.
      § 42.104(b)(4) ..................................................................................... 27
      A.  Level of Skill in the Art ................................................................ 27
      B.  Ground 1 – Gao in view of Katharopoulos renders obvious
          claims 1-18 under 35 U.S.C. § 103 ............................................... 27
XI.   DISCRETIONARY DENIAL IS NOT SUPPORTED ........................ 78
      A.  35 U.S.C. § 325(d) ........................................................................ 78
      B.  35 U.S.C. § 314(a) ........................................................................ 79
XII.  CONCLUSION .................................................................................... 80
LIST OF EXHIBITS

EX      DESCRIPTION

1001    U.S. Patent No. 11,442,775

1002    Declaration of Ghaith Hammouri

1003    C.V. of Ghaith Hammouri

1004    Pin Gao, et al., Low Latency RNN Inference with Cellular Batching, Thirteenth
        EuroSys Conference 2018, April 23–26, 2018, Porto, Portugal, published by the
        Association for Computing Machinery (2018) (“Gao”)

1005    Katharopoulos, et al., Transformers are RNNs: Fast Autoregressive Transformers
        with Linear Attention, arXiv:2006.16236v3, Aug. 31, 2020, presented at
        Proceedings of the 37th International Conference on Machine Learning, Online,
        PMLR 119, 2020 (“Katharopoulos”)

1006    Original Complaint, FriendliAI Inc. v. Hugging Face, Inc., Case No. 23-816-MN,
        D. Del. (filed July 28, 2023)

1007    First Amended Complaint, FriendliAI Inc. v. Hugging Face, Inc., Case No.
        23-816-MN, D. Del. (filed January 8, 2024)

1008    Gyeong-In Yu, et al., Orca: A Distributed Serving System for Transformer-Based
        Generative Models, Proceedings of the 16th USENIX Symposium on Operating
        Systems Design and Implementation, July 11–13, 2022, Carlsbad, CA, USA
        (“Orca paper”)

1009    A. Vaswani, et al., Attention is All You Need, Advances in Neural Information
        Processing Systems, 2017 (“Vaswani”)

1010    Prosecution history of Application No. 17/542,193

1011    Plaintiff’s Disclosure of Proposed Claim Constructions, FriendliAI Inc. v.
        Hugging Face, Inc., Case No. 23-816-MN, D. Del. (served May 28, 2024)

1012    Defendant’s Disclosure of Proposed Claim Constructions, FriendliAI Inc. v.
        Hugging Face, Inc., Case No. 23-816-MN, D. Del. (served May 28, 2024)

1013    Plaintiff’s Final Infringement Contentions for the ’775 Patent, FriendliAI Inc. v.
        Hugging Face, Inc., Case No. 23-816-MN, D. Del. (served July 1, 2024)

1014    Ilya Sutskever, et al., Sequence to Sequence Learning with Neural Networks,
        arXiv:1409.3215v3, Dec. 14, 2014, Proceedings of the 28th Conference on Neural
        Information Processing Systems 2014, December 8–13, 2014, Palais des congrès
        de Montréal, Canada (“Sutskever”)

1015    Dzmitry Bahdanau, et al., Neural Machine Translation by Jointly Learning to
        Align and Translate, arXiv:1409.0473v7, May 19, 2016, Proceedings of the
        International Conference on Learning Representations 2015, May 7–9, 2015,
        San Diego, California (“Bahdanau”)

1016    Romain Paulus, et al., A Deep Reinforced Model for Abstractive Summarization,
        arXiv:1705.04304v3, November 13, 2017, published online by Salesforce
        Research, Palo Alto, California (“Paulus”)

1017    Peter J. Liu, et al., Generating Wikipedia by Summarizing Long Sequences,
        arXiv:1801.10198v1, January 30, 2018, Proceedings of the International
        Conference on Learning Representations 2018, April 30–May 3, 2018,
        Vancouver, Canada (“Liu”)

1018    Alec Radford, et al., Improving Language Understanding by Generative
        Pre-Training, https://api.semanticscholar.org/CorpusID:49313245, June 11, 2018,
        published online (“Radford”)

1019    Yao-Hung Hubert Tsai, et al., Transformer Dissection: A Unified Understanding
        of Transformer’s Attention via the Lens of Kernel, arXiv:1908.11775v4,
        November 11, 2019, Proceedings of the 2019 Conference on Empirical Methods
        in Natural Language Processing and the 9th International Joint Conference on
        Natural Language Processing (EMNLP-IJCNLP) (pp. 4335–4344), November
        3–7, 2019, Hong Kong, China (“Tsai”)

1020    Ankit Singh Rawat, et al., Sampled Softmax with Random Fourier Features,
        arXiv:1907.10747v2, December 31, 2019, Proceedings of the Conference on
        Neural Information Processing Systems 2019, December 8–14, 2019, Vancouver,
        Canada (“Rawat”)

1021    Guy Blanc, et al., Adaptive Sampled Softmax with Kernel Based Sampling,
        arXiv:1712.00527v2, August 1, 2018, Proceedings of the 35th International
        Conference on Machine Learning, PMLR Vol. 80 pp. 590–599, July 10–15, 2018,
        Stockholm, Sweden (“Blanc”)

1022    Claim appendix for claims 1-18 of the ’775 Patent
I.   Introduction

Hugging Face, Inc. (“Petitioner”) hereby respectfully requests Inter Partes Review pursuant to 35 U.S.C. §§ 311 et seq. and 37 C.F.R. §§ 42.100 et seq. of claims 1-18 (the “Challenged Claims”) of U.S. Patent No. 11,442,775 (“the ’775 Patent”), issued on Sept. 13, 2022 to Gyeong-In Yu et al. See Ex-1001.

As explained in detail below, the Challenged Claims are rendered obvious in view of the prior art cited herein. Therefore, the Board should institute an inter partes review of claims 1–18 of the ’775 Patent.
II.  Summary of the Argument

The ’775 Patent generally relates to an alleged improvement for machine learning transformer neural network models. Ex-1001, 1:7–9. In its complaint in the underlying litigation, Patent Owner states that the ’775 Patent claims are directed to “methods used to process batches of transformer-based requests, which involves scheduling batches at an iteration-level.” Ex-1007, ¶¶ 23-27. According to Patent Owner, “iteration-level scheduling…allows for a finished request to be sent to a client, and for new requests to be sent to the execution engine, before all requests in a batch are completed.” Id., ¶ 14.

However, after filing the ’775 Patent, the inventors acknowledged in their published “Orca” paper that “iteration-level scheduling” was a previously known method to optimize a recurrent neural network (RNN). Ex-1008, 532–533. Indeed, when discussing the Gao reference (Ex-1004) cited in this IPR Petition, the inventors of the challenged ’775 Patent stated:

    We would like to highlight BatchMaker [Gao] as one of the most relevant previous works. BatchMaker is a serving system for RNNs that performs scheduling and batching at the granularity of RNN cells,…BatchMaker allows a newly arrived request for RNN to join (or a finished request to leave) the current executing batch without waiting for the batch to completely finish.

Ex-1008, 532-533 (emphasis added).¹ This description of Gao is the same as Patent Owner’s description of the ’775 claims. Ex-1007, ¶ 14 (’775 claims “allows for a finished request to be sent to a client, and for new requests to be sent to the execution engine, before all requests in a batch are completed.”).

¹ Emphasis always added by Petitioner unless otherwise noted.

Notably, Patent Owner alleges in its Complaint against Petitioner that the Orca paper embodies the ’775 Patent. Ex-1007, ¶ 36 (Patent Owner stating that “The claimed advancements were described in a paper, titled ‘Orca:…’”) and ¶ 38 (“PeriFlow (aka Orca) practices the ’775 patent”). Thus, the admissions about Gao in the Orca paper are strong evidence that Gao teaches scheduling batches at an iteration-level, the feature that is allegedly claimed in the ’775 Patent.

The only difference between the claims of the ’775 Patent and Gao is that the claims of the ’775 Patent recite a “transformer model” while Gao discloses an RNN model. Yet, the application of Gao’s scheduling technique for an RNN model to a transformer model was not inventive, but rather a predictable and obvious implementation of a known improvement for one machine learning model applied to a different machine learning model. KSR International Co. v. Teleflex Inc., 550 U.S. 398, 401 (2007).

Indeed, by the time of the ’775 Patent, it was known that certain transformer models act as an RNN. For example, Katharopoulos is titled and teaches “Transformers are RNNs:…,” thereby explicitly linking these two machine learning models. See Ex-1005, 5 (“any transformer layer with causal masking can be written as a model that, given an input, modifies an internal state and then predicts an output, namely a Recurrent Neural Network (RNN).”). As discussed in this IPR petition, the combination of Gao’s scheduling and batching with Katharopoulos’s transformer model was obvious and teaches all the claims of the ’775 Patent.

Lastly, the Orca paper attempts to distinguish Gao by arguing that “BatchMaker [Gao] cannot make batches of cells for Transformer models” because of certain performance characteristics. See Ex-1008, 533. Yet, the Orca paper provides nothing to support these conclusions and, further, these conclusions are irrelevant to the broadly claimed transformer model in the ’775 Patent. The Orca paper also does not address applying Gao to Katharopoulos’s transformer model, which uses linear attention and overcomes the Orca paper’s criticism of applying Gao to a transformer model. See, e.g., Ex-1005, 1 (“In this paper, we introduce the linear transformer model that significantly reduces the memory footprint and scales linearly with respect to the context length.”).
III. MANDATORY NOTICES UNDER 37 C.F.R. § 42.8(a)(1)

Petitioner satisfies each requirement for Inter Partes Review of the ’775 patent pursuant to 37 C.F.R. § 42.8(a)(1).

A.   Real Party In Interest Under 37 C.F.R. § 42.8(b)(1)

The Petitioner and real party in interest is Hugging Face, Inc. Petitioner is incorporated in Delaware with a principal business address of 20 Jay St, Suite 620, Brooklyn, NY 11201.

B.   Related Matters Under 37 C.F.R. § 42.8(b)(2)

The ’775 Patent is presently asserted by the Patent Owner against Petitioner in FriendliAI Inc. v. Hugging Face, Inc., Case No. 1:23-cv-00816-MN, D. Del. (filed on July 28, 2023). See Exs. 1006 and 1007.

To the best of Petitioner’s knowledge, the ’775 Patent has not been at issue in any other litigation or PTAB proceeding before the instant petition was filed.

C.   Lead and Back-Up Counsel Under 37 C.F.R. § 42.8(b)(3)

Petitioner is represented by the following counsel:
Lead Counsel:
James P. Murphy (Reg. No. 55,474)
Polsinelli PC
1000 Louisiana Street, Suite 6400
Houston, Texas 77002
Tel: (713) 374-1631
jpmurphy@polsinelli.com

Backup Counsel:
Adam P. Daniels (Reg. No. 66,681)
Polsinelli LLP
2049 Century Park E., Suite 2900
Los Angeles, CA 90067
Tel: (310) 556-6754
adaniels@polsinelli.com

Pursuant to 37 C.F.R. § 42.10(b), Powers of Attorney have been filed with this Petition.
D.   Service Information Under 37 C.F.R. § 42.8(b)(4)

Physical mailing service information for lead and back-up counsel is as follows:

James Murphy
Polsinelli PC
1000 Louisiana Street
Suite 6400
Houston, Texas 77002

Petitioner also consents to service by e-mail at the e-mail addresses provided above for lead and backup counsel.

E.   Payment of Fees Under 37 C.F.R. § 42.15

All required fees have been paid with the filing of this Petition. Petitioner further authorizes the U.S. Patent & Trademark Office to charge Deposit Account No. 50-1662 for any fees, including the fee set forth in 37 C.F.R. § 42.15(a) for this Petition.

F.   Certification of Word Count Under 37 C.F.R. § 42.24(d)

Petitioner certifies that the word count in this Petition, including all footnotes, is 13,982 words as counted by the word-processing program (Microsoft Word for Office 365) used to generate this Petition, where such word count excludes the table of contents, mandatory notices, certificate of service, list of exhibits, and this certificate of word count. This Petition complies with the 14,000-word limit set forth in 37 C.F.R. § 42.24(a)(1)(i).
IV.  GROUNDS FOR STANDING UNDER 37 C.F.R. § 42.104(a)

Petitioner certifies that the ’775 patent is available for inter partes review. Petitioner is not barred or estopped from requesting an inter partes review of the ’775 patent claims on the grounds identified in this Petition. 37 C.F.R. § 42.104(a).

V.   IDENTIFICATION OF GROUNDS FOR WHICH REVIEW IS REQUESTED UNDER 37 C.F.R. § 42.104(b)(1)

Petitioner asserts that claims 1-18 (the “Challenged Claims”) of the ’775 patent are unpatentable based on the following ground:

Ground 1: Claims 1-18 are rendered obvious under 35 U.S.C. § 103 by Gao in view of Katharopoulos. See Ex-1002, ¶¶ 12-13.
VI.  HOW THE CHALLENGED CLAIMS ARE TO BE CONSTRUED UNDER 37 C.F.R. § 42.104(b)(3)

In an IPR, claim terms are to be construed in accordance with the standard set forth in Phillips. Phillips v. AWH Corp., 415 F.3d 1303, 1312 (Fed. Cir. 2005) (en banc). Further, claim terms need only be construed “to the extent necessary to resolve the controversy.” Vivid Techs., Inc. v. Am. Sci. & Eng’g, Inc., 200 F.3d 795, 803 (Fed. Cir. 1999). In the underlying litigation, the Patent Owner has asserted that all claim terms are to be given their plain and ordinary meaning without providing any further constructions. Ex-1011. Petitioner has proposed constructions for certain claim terms. Ex-1012, 2–3. Here, Petitioner does not believe that any term requires express construction to resolve the invalidity grounds presented in this Petition because the prior art renders claims 1–18 obvious under any reasonable interpretation of the claims.
Claim construction negotiations are ongoing in the underlying litigation, and substantive briefing has not been filed. At present, Petitioner has proposed that certain limitations of the claims of the ’775 Patent are indefinite because Patent Owner’s infringement positions require a claim scope that is unsupported and cannot be determined with reasonable certainty by a POSITA reading the claims in light of the specification and prosecution history. Ex-1012, 4. To be clear, the uncertainty of the scope is an issue of infringement due to the unreasonable breadth with which Patent Owner is interpreting the claims to allege infringement. See Ex-1013. Yet, the arguments presented in this Petition do not rely on Patent Owner’s unreasonable interpretations of the boundaries of the claim scope (rather, the prior art reads on the claims under a narrower scope than Patent Owner is alleging). Thus, the Board need not determine whether the outer boundaries of the claims as asserted by Patent Owner are indefinite in order to determine that the prior art renders claims 1-18 unpatentable as obvious. See Ex-1002, ¶25.
VII. OVERVIEW OF THE ’775 PATENT

A.   Summary of the ’775 Specification’s Description of Claimed Matter

The ’775 Patent relates to an inference system that applies a machine-learning transformer model to batches of input requests with variable input lengths. ’775 Patent, Abstract. The ’775 Patent, in Figures 5A–5D, illustrates the claimed method for dynamic batching and processing of requests using a machine-learning transformer model. ’775 Patent, Figures 5A–5D; 22:22–24:38; see Ex-1002, ¶33.
[’775 Patent, FIG. 5A: serving system with a request processor (incoming and completion queues), scheduler, execution engines, and KV cache.]
As new request R2 arrives at request processor 580, it is forwarded to scheduler 585, which monitors the cache memory for execution engines 590A and 590B to determine if memory is available for processing request R2. ’775 Patent, 22:58–23:29. Moving to Figure 5B, as the first output token is generated for requests R1, R3, R4, and R5, execution engine 590A is now scheduled to execute updated batch R1 and R2 at a second iteration. ’775 Patent, 23:30–50; see Ex-1002, ¶¶34-38.

In Figure 5C, a second output token is generated for requests R1, R3, R4, and R5, and a first output token is generated for request R2 with an end token, which moves the outputs for request R2 to the completion queue of the request processor 580. ’775 Patent, 23:51–58; see Ex-1002, ¶39.
Accordingly, by having dynamic batches for each iteration, “completed requests can be provided to the client device 110 as soon as processing is complete, and the scheduler 585 can schedule new requests.” ’775 Patent, 24:10–16; see Ex-1002, ¶¶40-41.
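For illustration only, the following is a hypothetical sketch of iteration-level batch scheduling in the abstract: after every iteration the batch is rebuilt, so finished requests can be returned immediately and waiting requests can join without waiting for the whole batch to finish. The names, data structures, and admission policy are illustrative assumptions and do not reproduce the ’775 Patent’s, Orca’s, or any exhibit’s implementation.

    from collections import deque

    # Hypothetical sketch of iteration-level scheduling; not the ’775
    # Patent's or Orca's code. After each iteration, finished requests
    # leave the batch and waiting requests are admitted if capacity
    # (standing in for cache memory) allows.
    def run_engine(requests, step, is_finished, max_batch=4):
        waiting = deque(requests)     # incoming queue
        running = []                  # current batch
        completed = []                # completion queue
        while waiting or running:
            # Schedule a new batch for this iteration.
            while waiting and len(running) < max_batch:
                running.append(waiting.popleft())
            # One iteration: generate one output token per request.
            for r in running:
                step(r)
            # Finished requests leave the batch immediately.
            completed.extend(r for r in running if is_finished(r))
            running = [r for r in running if not is_finished(r)]
        return completed

    # Toy usage: each request needs a different number of iterations.
    reqs = [{"id": i, "left": n, "tokens": 0} for i, n in enumerate([3, 1, 4, 2, 5])]
    done = run_engine(
        reqs,
        step=lambda r: r.update(left=r["left"] - 1, tokens=r["tokens"] + 1),
        is_finished=lambda r: r["left"] == 0,
    )
    print([r["id"] for r in done])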
B.   Summary of the ’775 Patent Prosecution History

On December 7, 2021, applicants filed Application No. 17/542,193 (“the ’193 Application”), which issued as the ’775 Patent. See Ex-1010. During prosecution of the ’193 Application, the Examiner issued a single office action rejecting certain pending claims, including all independent claims, as being rendered obvious by U.S. Patent Publication 2021/0192314 to Aarts et al. (“Aarts”) in view of U.S. Patent No. 10,846,096 to Chung et al. (“Chung”). Ex-1010, 120–124.

In response to the office action, the applicant amended both pending independent claims to further recite “wherein in a second set of inputs for the second batch of requests, a length of the sequence of input tokens for the new request is different from a length of an input for at least one request other than the new request.” Id., 181 and 184. Applicant argued that this amendment, in conjunction with the scheduler limitation reciting “a second batch of requests additionally including the new request,” distinguishes the prior art as emphasized in applicant’s response below:

    [Excerpt of amended claim language reproduced from applicant’s response.]

Id., 189 (emphasis in original).

Applicant argued that the independent claims as amended require “the second batch of requests is modified to include a new request in addition to the one or more requests of the [first] batch of requests.” Id., 189 (bracket in original); see also id., 191 (Applicant stated that “Claim 10 recites similar features as claim 1” and is distinguishable over the prior art for the same reasons as claim 1). According to the applicant, this distinguishes the claims from the prior art because:

    In existing batching methods for transformer models, it is difficult to modify a batch of requests once it has started to process on an execution engine, since the length of the inputs or internal states are the same across the requests in the batch.

Id., 189.

Applicant then argued that neither Aarts nor Chung teaches batching in the context of a “machine-learned transformer model,” and also that neither teaches “subsequently scheduling a second batch of requests that is modified to include a new request in addition to the first batch of requests, in which a length of the sequence of input tokens for a new request is different from a length of an input for at least one request other than the new request in the batch.” Id., 190–191.

The Examiner then allowed all claims without any further comments. Id., 199.
VIII. STATE OF TECHNOLOGY²

² Cited references not named in a ground of rejection are cited for the purpose of showing the state of the art and the background knowledge of a POSITA. Randall Mfg. v. Rea, 733 F.3d 1355, 1362-63 (Fed. Cir. 2013).

A.   Machine Learning and RNNs

Machine learning in general is a field of study focused on using statistical techniques and algorithms to generalize from data observations and predict behavior over unseen data. One of the oldest techniques employed in machine learning is the Neural Network (NN) model, which was developed and introduced in the 1950s and 1960s. Early NNs were built using a single building cell, namely the perceptron. The perceptron is a simple circuit taking in a number of binary inputs, each corresponding to some weight (a real number) that modulates the input, and finally producing a single decision output of True or False (1 or 0) based on some activation value (also a real number). The idea behind the perceptron was to capture the functionality that takes place in a human neuron. See Ex-1002, ¶43.
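For illustration only, the following simplified sketch (with hypothetical weights and threshold, not drawn from any exhibit) expresses the perceptron decision described above:

    # Minimal perceptron sketch: weighted sum of binary inputs compared
    # against an activation threshold. Weights and threshold are
    # hypothetical values chosen only for illustration.
    def perceptron(inputs, weights, threshold):
        # Weighted sum of the binary inputs (each 0 or 1).
        activation = sum(x * w for x, w in zip(inputs, weights))
        # Single True/False (1/0) decision based on the activation value.
        return 1 if activation >= threshold else 0

    # Example: three binary inputs with illustrative weights.
    print(perceptron([1, 0, 1], weights=[0.6, -0.4, 0.9], threshold=1.0))  # -> 1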
In order to achieve higher levels of generalization (perceived intelligence), an NN utilizes many perceptrons arranged in the form of a layer, where multiple inputs are mapped into multiple outputs. Moreover, while a simple NN is made of a single layer, a more complex NN can be made of many interconnected layers (sometimes referred to as a Multi-Layer Perceptron (MLP)). Typically, each layer is made of a number of perceptrons whose outputs are connected to the next layer, and so forth. The first layer is connected to the input while the last layer produces the output of the circuit, with hidden layers in the middle. Once the network is created, the goal becomes to figure out the correct weights used in each perceptron in order to generate the correct output. This is the process of learning. See Ex-1002, ¶44.
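For illustration only, the following simplified sketch (hypothetical layer sizes and random weights, not drawn from any exhibit) shows a multi-layer forward pass in which each layer maps its inputs to outputs that feed the next layer:

    import numpy as np

    # Minimal multi-layer (MLP) forward-pass sketch. The layer sizes and
    # random weights are hypothetical and for illustration only; a trained
    # network would use learned weights instead.
    rng = np.random.default_rng(0)

    def layer(x, n_out):
        # One fully connected layer of perceptron-like units: a weighted
        # combination of the inputs followed by a simple nonlinearity.
        w = rng.normal(size=(x.shape[0], n_out))
        b = rng.normal(size=n_out)
        return np.maximum(0.0, x @ w + b)

    x = rng.normal(size=4)   # input layer (4 features)
    h = layer(x, 8)          # hidden layer
    y = layer(h, 2)          # output layer (2 outputs)
    print(y)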
Learning in general is carried out over a set of desired inputs and outputs (training samples). The goal is to feed the input representation into the NN model and to find a way to modify the weights of the NN until the correct output is produced. This process is carried out for all the training samples over many iterations until the NN model is finally capable of producing the correct output for the training input. See Ex-1002, ¶45.
The types of algorithms used to train an NN are outside of our scope here. However, it suffices to say that these training algorithms depend on the architecture and components of the underlying NN and typically require considerable computational resources to complete. A successful NN architecture lends itself to training in a way that obtains higher levels of accuracy when tested to predict the output of the training samples. More importantly, the NN is expected to obtain high levels of accuracy when exposed to new data that the NN never saw during the training phase. See Ex-1002, ¶46.

Over the years, many different types of NNs were introduced. The most widely known today are Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs), both of which are considered Deep Neural Networks (DNNs). Here we are interested in RNNs as they relate to the subject at hand. See Ex-1002, ¶47.
RNNs were first introduced in the 1980s and only became practical to use during the late 1990s with the introduction of Long Short-Term Memory (LSTM). Since then, RNNs have evolved to encompass many types and variations of their underlying cell and neural networks. In general, RNNs were designed to process sequential data, that is, data that changes over time. Whereas NNs expect an input of fixed size in order to produce an output, an RNN is designed to be able to process data in a sequence of varying length, such as text or speech. See Ex-1002, ¶48.
In its simplest form, an RNN can be viewed as a functional cell that takes in an external input along with the value of an internal hidden state, and correspondingly updates the value of the internally stored hidden state. This RNN function depends on a number of parameters that are learned during training. When processing an input sequence, say a list of words, the hidden state is initialized to some value before the RNN moves to process the first word in the sequence as input. Correspondingly, the RNN updates the value of the internal hidden state in a way that depends on the first word processed. Once the RNN moves to the next word in the list, it repeats the same computation, only this time the hidden state used in the function has been updated in a way that depends on the first word. Continuing in this fashion, the RNN can process any number of input words until there are no more words to process. At that point, the internal state of the RNN holds a value that has been continuously updated corresponding to every word in the input. As a result, the internal state value can now be thought of as a representation of the input words and as such can be used to carry out a classification or prediction task relating to the processed input. See Ex-1002, ¶49.
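For illustration only, the following simplified sketch (hypothetical update function, random parameters, and toy dimensions, not drawn from any exhibit) expresses the hidden-state update described above:

    import numpy as np

    # Minimal RNN sketch: a single cell repeatedly consumes one input at a
    # time and updates an internal hidden state. The weight matrices stand
    # in for learned parameters and are random here for illustration.
    rng = np.random.default_rng(0)
    d_in, d_hidden = 3, 4
    W_x = rng.normal(size=(d_hidden, d_in))      # input-to-hidden weights
    W_h = rng.normal(size=(d_hidden, d_hidden))  # hidden-to-hidden weights

    def rnn_step(h, x):
        # New hidden state depends on the current input and the prior state.
        return np.tanh(W_x @ x + W_h @ h)

    # Process a variable-length sequence (e.g., embedded words) one step
    # at a time; the final hidden state summarizes the whole sequence.
    sequence = [rng.normal(size=d_in) for _ in range(5)]
    h = np.zeros(d_hidden)                       # initialized hidden state
    for x in sequence:
        h = rnn_step(h, x)
    print(h)  # representation usable for classification or prediction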
Due to the success of RNNs in many applications (e.g., Natural Language Processing (NLP), speech recognition, image classification), a large number of RNN variations can be found in the literature. These various types of RNNs differ in their functionality, components, number of inputs or outputs, and so on. Examples of RNN models in use include LSTM, Gated Recurrent Units (GRU), and Sequence to Sequence (Seq2Seq) models, along with many other architectures and building blocks. Regardless of this variation, all RNNs share the same underlying property of retaining a hidden state that is updated according to the changing input in order to affect the final output of the circuit. See Ex-1002, ¶50.
B.   Transformers

In their groundbreaking work, Vaswani introduced Transformers, a new Deep Neural Network (DNN) model that completely relies on the attention mechanism to draw global dependencies between input and output. See Ex-1009. Simply put, the attention mechanism takes in an input encoded as a vector and maps it to an output that is also a vector. See Ex-1002, ¶51.

By first mapping a sequence of encoded inputs to a corresponding set of Queries, Keys, and Values (QKV), the output of the attention mechanism is formed as a weighted sum of input Values, where the weight assigned to each Value is computed as a compatibility function between a Query and a Key. See Ex-1002, ¶52.
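For reference, Vaswani’s scaled dot-product attention (Ex-1009) expresses this weighted sum as follows, where the softmax of the Query–Key compatibility scores supplies the weights applied to the Values and d_k denotes the Key dimension:

    \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V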
Although the attention mechanism already existed in the literature (Ex-1015), the main contribution of Vaswani was to rely only on the attention mechanism to extract any interdependencies between the elements making up the input sequence. See Ex-1002, ¶53.

Accordingly, one can easily see that all the building blocks making up a transformer model (except for the attention block) are element-wise operations that do not observe interdependency between input elements. Typically, transformers operate in an auto-regressive fashion where the predicted next element in a sequence of input elements is concatenated to the input and fed again to the system until a final special output element (e.g., <eos>) is generated. This mode of operation is very similar to that used in Seq2Seq RNN models. Ex-1014. See Ex-1002, ¶54.

The original Transformer proposed by Vaswani was made up of two main component-stacks, an encoder-stack followed by a decoder-stack. While both components have slightly different building blocks and connections, their attention mechanisms differ in one main way. See Ex-1002, ¶55.
In an encoder, attention is computed between all input elements regardless of their position within the input sequence. On the other hand, the decoder uses masked attention (or causal attention), which only allows the attention to be computed between an input element and previous elements up to itself within the same input sequence. This difference ensures that the decoder, mainly tasked with predicting the next element in the input sequence, does not break the chain of causality. That is, an element can only be influenced by itself and previous elements but not future elements. See Ex-1002, ¶56.
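For illustration only, the following simplified sketch (toy dimensions and random values, not drawn from any exhibit) shows how a causal mask restricts each element to attend only to itself and earlier elements:

    import numpy as np

    # Illustrative causal (masked) attention over a toy sequence.
    # Dimensions and random values are hypothetical.
    rng = np.random.default_rng(0)
    n, d = 4, 8                                   # sequence length, model width
    Q = rng.normal(size=(n, d))
    K = rng.normal(size=(n, d))
    V = rng.normal(size=(n, d))

    scores = Q @ K.T / np.sqrt(d)                 # query-key compatibility
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores[mask] = -np.inf                        # hide future positions
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    output = weights @ V                          # weighted sum of Values
    print(output.shape)                           # (4, 8)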
Many different Transformer architectures have been proposed, but perhaps one of the most popular was the Decoder-Only Transformer (DOT) (Ex-1017), which is most known to have been used in GPT (Generative Pre-trained Transformer). See Ex-1018. See Ex-1002, ¶57.

In the DOT architecture, the Transformer is made of a number of decoders stacked in multiple consecutive layers, with their final output used to predict the next element of an input sequence. This architecture was preferred for its simplicity along with efficient-implementation features, such as the masked attention, which allows the reuse of the Key and Value elements from previous iterations over the inputted and generated elements without requiring dependency on future element keys and values. See Ex-1002, ¶58.
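The reuse of Key and Value elements across decoding iterations described above is commonly implemented as a cache of past Keys and Values. For illustration only, the following simplified sketch (hypothetical projection matrices and toy dimensions, not drawn from any exhibit) shows the idea:

    import numpy as np

    # Simplified sketch of reusing Keys/Values across autoregressive
    # decoding iterations (a "KV cache"). Past K and V rows are computed
    # once, appended to the cache, and never recomputed.
    rng = np.random.default_rng(0)
    d = 8
    W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

    cached_K, cached_V = [], []                  # grows by one row per step

    def decode_step(x):
        q = x @ W_q
        cached_K.append(x @ W_k)                 # reuse prior keys/values
        cached_V.append(x @ W_v)
        K = np.stack(cached_K)
        V = np.stack(cached_V)
        scores = K @ q / np.sqrt(d)              # attend only to past + self
        w = np.exp(scores - scores.max())
        w /= w.sum()
        return w @ V                             # context for next-element prediction

    for _ in range(5):                           # five decoding iterations
        x = rng.normal(size=d)                   # current (input or generated) element
        context = decode_step(x)
    print(len(cached_K), context.shape)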
C.   Batching in Machine Learning

Batching (sometimes called batch processing) is a known computing method which refers to the idea of combining multiple inputs to be processed in parallel as a batch. In the context of ML, batching has been widely used for many years in order to expedite the time required to process inputs. In the training phase of ML, batching is intimately related to the training algorithms used (e.g., minibatch-based Stochastic Gradient Descent (SGD)). Without going into the details, in this phase the training dataset is typically broken into smaller batches in order to reduce the amount of memory used (relative to what is required for the entire dataset) while expediting the processing time (relative to what is required for the entire dataset) by computing the training function in parallel. See Ex-1002, ¶59.
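For illustration only, the following simplified sketch (hypothetical dataset and batch size, not drawn from any exhibit) shows the basic idea of splitting a training dataset into minibatches that are each processed in a single parallel (vectorized) step:

    import numpy as np

    # Illustrative sketch of breaking a training dataset into minibatches
    # (sizes are hypothetical). Each minibatch is processed in parallel,
    # which bounds memory use relative to the full dataset.
    rng = np.random.default_rng(0)
    dataset = rng.normal(size=(1000, 16))        # 1000 samples, 16 features
    batch_size = 64

    for start in range(0, len(dataset), batch_size):
        batch = dataset[start:start + batch_size]
        # A real training step would compute the loss/gradients on `batch`;
        # here we just compute a stand-in statistic in one vectorized call.
        _ = batch.mean(axis=0)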
In the inference phase of ML, the model is treated as a function that takes in an input and produces an output in real-time using the underlying trained network. To expedite this process, batching is used to process multiple inputs at the same time, thus maximizing utility of the computational resources such as memory and processors and increasing […]
