UNITED STATES PATENT AND TRADEMARK OFFICE

__________________

BEFORE THE PATENT TRIAL AND APPEAL BOARD

___________________

HUGGING FACE, INC.,

Petitioner,

v.

FRIENDLIAI INC.,

Patent Owner.

__________________

IPR2024-01234

U.S. Patent No. 11,442,775

_________________

DECLARATION OF GHAITH HAMMOURI, Ph.D.

Petitioner, EX1002
IPR2024-01234
Hugging Face, Inc., v. FriendliAI Inc.
TABLE OF CONTENTS

I. INTRODUCTION
   A. Qualifications
II. MATERIALS REVIEWED
III. SUMMARY OF MY OPINIONS
IV. LEGAL PRINCIPLES
   A. Understanding of Patent Law
   B. Claim Construction
   C. Level of Ordinary Skill in the Art
V. THE ‘775 PATENT
   A. Priority of ‘775 Patent
   B. Summary of the ‘775 Patent
VI. OVERVIEW OF THE PRIOR ART
   A. Summary of Gao
   B. Summary of Katharopoulos
VII. REASONS THE CHALLENGED CLAIMS OF THE ‘775 PATENT ARE UNPATENTABLE
   A. GROUND 1: Claims 1-18 are rendered obvious under 35 U.S.C. § 103 by Gao in view of Katharopoulos
VIII. OATH
I, Dr. Ghaith Hammouri, declare as follows:

I. INTRODUCTION

1. Hugging Face, Inc. (“Hugging Face”) has retained my services in connection with the above-captioned Inter Partes Review (IPR) of U.S. Patent No. 11,442,775 (the ‘775 Patent). I have been asked to study and provide my opinions as an independent expert witness regarding the technology described in the ‘775 Patent. I am being compensated at my usual and customary rate for my time. Such compensation, however, does not influence my opinions, nor does the outcome of this proceeding impact my compensation.

A. Qualifications

2. My qualifications and professional experience are described in my curriculum vitae. I have been informed that a copy of my curriculum vitae will be submitted with my declaration. The following is a summary of my relevant qualifications and professional experience.

3. I received my PhD in Electrical and Computer Engineering from Worcester Polytechnic Institute (WPI) in 2009. Prior to that, I received my M.S. in Electrical Engineering from the University of Hartford in 2004. I also received my B.S. in Electrical Engineering, with a second major in Physics and a minor in Math, from the University of Hartford in 2003.

4. My PhD work and thesis focused on cryptography and the use of learning models for extracting unique and secure identifiers from physical hardware. After my PhD, I worked as a post-doctoral researcher at WPI, where I continued pursuing my research. I invented a technique for fingerprinting CDs and other forms of optical media using statistical learning. During this time, I mentored junior graduate students who joined the CRIS lab and acted as their intern advisor, coordinating with the head of the lab. My post-doc and research were both funded through a National Science Foundation (NSF) academic grant.

5. In 2010, I co-founded Intryca (renamed Claveo) in order to commercialize my research. I served as the VP of Technology, where I led the research and development efforts of the company. My research focused on developing statistical learning methods for extracting hardware fingerprints from smart phones. Based on my research, I was awarded a Small Business Innovation Research (SBIR) grant from the NSF. Further, this research and the corresponding development resulted in three U.S. patents.

6. In 2012, I became the CEO (and later a co-owner) of Simtix, a physical security company with customers across the Middle East. The company’s products mainly revolved around automating physical security using RFID and Automatic Number Plate Recognition (ANPR) technology. In that role, I spent a significant portion of my time overseeing the technology while supervising the technical team and introducing them to new technologies.
7. In 2016, I left my full-time role at Simtix to focus my attention completely on Machine Learning. I co-founded Xr.AI, a New York-based startup focused on utilizing Machine Learning and Natural Language Processing (NLP) in order to automate the analysis of legal contracts. I served as the Chief Scientific Officer of the company, where my work focused on creating a new technology for clause-level vector embeddings using Recurrent Neural Networks (RNNs). For this work, I was awarded a Small Business Innovation Research (SBIR) grant from the National Science Foundation (NSF).

8. In 2020, I left corporate work to spend more time on my long-term research project of exploring deep connections between learning theory and cryptography. Over the past four years, I have served in a consulting capacity as the CEO of Simtix, where I led the development of Secure-Brain technology, a vision AI engine built on Multi-Modal Transformer technology to be used in security and defense applications. In this capacity, I mentored the development team and introduced them to modern machine learning concepts and tools.

9. I currently also serve as a consultant for several private companies, advising them on the use and applications of AI technology, including Large Language Models (LLMs) and Multi-Modal Transformer models.

10. Since 2023, I have also served as an affiliate research scientist at WPI, pursuing research in areas relating to machine learning and security. Specifically, I have been actively researching adversarial vulnerabilities in LLMs, which are in general based on the Transformer machine learning model.
II. MATERIALS REVIEWED

11. In forming the opinions provided in this declaration, I have reviewed and considered at least the following materials. I have also considered the other background references I have cited in this declaration, as well as my education, knowledge, and experience working in machine learning, software development, and consumer products.
U.S. Patent No. 11,442,775 (“the ’775 patent”)

Declaration of Ghaith Hammouri

C.V. of Ghaith Hammouri

Pin Gao, et al., Low Latency RNN Inference with Cellular Batching, Thirteenth EuroSys Conference 2018, April 23–26, 2018, Porto, Portugal, published by the Association for Computing Machinery (2018) (“Gao”)

Katharopoulos, et al., Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention, arXiv:2006.16236v3, Aug. 31, 2020, presented at Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020 (“Katharopoulos”)

Original Complaint, FriendliAI Inc. v. Hugging Face, Inc., Case No. 23-816-MN, D. Del. (filed July 28, 2023)

First Amended Complaint, FriendliAI Inc. v. Hugging Face, Inc., Case No. 23-816-MN, D. Del. (filed January 8, 2024)

Gyeong-In Yu, et al., Orca: A Distributed Serving System for Transformer-Based Generative Models, Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation, July 11–13, 2022, Carlsbad, CA, USA (“Orca paper”)

A. Vaswani, et al., Attention Is All You Need, Advances in Neural Information Processing Systems, 2017 (“Vaswani”)

Prosecution history of Application No. 17/542,193

Plaintiff’s Disclosure of Proposed Claim Constructions, FriendliAI Inc. v. Hugging Face, Inc., Case No. 23-816-MN, D. Del. (served May 28, 2024)

Defendant’s Disclosure of Proposed Claim Construction, FriendliAI Inc. v. Hugging Face, Inc., Case No. 23-816-MN, D. Del. (served May 28, 2024)

Plaintiff’s Final Infringement Contentions for the ’775 Patent, FriendliAI Inc. v. Hugging Face, Inc., Case No. 23-816-MN, D. Del. (served July 1, 2024)

Ilya Sutskever, et al., Sequence to Sequence Learning with Neural Networks, arXiv:1409.3215v3, Dec. 14, 2014, Proceedings of the 28th Conference on Neural Information Processing Systems 2014, December 8–13, 2014, Palais des congrès de Montréal, Canada (“Sutskever”)

Dzmitry Bahdanau, et al., Neural Machine Translation by Jointly Learning to Align and Translate, arXiv:1409.0473v7, May 19, 2016, Proceedings of the International Conference on Learning Representations 2015, May 7–9, 2015, San Diego, California (“Bahdanau”)

Romain Paulus, et al., A Deep Reinforced Model for Abstractive Summarization, arXiv:1705.04304v3, November 13, 2017, published online by Salesforce Research, Palo Alto, California (“Paulus”)

Peter J. Liu, et al., Generating Wikipedia by Summarizing Long Sequences, arXiv:1801.10198v1, January 30, 2018, Proceedings of the International Conference on Learning Representations 2018, April 30–May 3, 2018, Vancouver, Canada (“Liu”)

Alec Radford, et al., Improving Language Understanding by Generative Pre-Training, https://api.semanticscholar.org/CorpusID:49313245, June 11, 2018, published online (“Radford”)

Yao-Hung Hubert Tsai, et al., Transformer Dissection: A Unified Understanding of Transformer’s Attention via the Lens of Kernel, arXiv:1908.11775v4, November 11, 2019, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (pp. 4335–4344), November 3–7, 2019, Hong Kong, China (“Tsai”)

Ankit Singh Rawat, et al., Sampled Softmax with Random Fourier Features, arXiv:1907.10747v2, December 31, 2019, Proceedings of the Conference on Neural Information Processing Systems 2019, December 8–14, 2019, Vancouver, Canada (“Rawat”)

Guy Blanc, et al., Adaptive Sampled Softmax with Kernel Based Sampling, arXiv:1712.00527v2, August 1, 2018, Proceedings of the 35th International Conference on Machine Learning, PMLR Vol. 80, pp. 590–599, July 10–15, 2018, Stockholm, Sweden (“Blanc”)

Claim appendix for claims 1-18 of the ’775 Patent
III. SUMMARY OF MY OPINIONS

12. I understand the following table lists the ground of rejection I have considered in this declaration:

Prior Art                       Basis         Claims
Gao in view of Katharopoulos    Obviousness   1-18

13. After a review of the ‘775 Patent and the prior art, it is my opinion that the Challenged Claims are invalid under the proposed ground. My opinions, and the bases therefor, are detailed throughout this Declaration.
IV. LEGAL PRINCIPLES

A. Understanding of Patent Law

14. I am not an attorney and will not be offering legal conclusions. However, I have been informed of several principles concerning legal issues relevant to my analysis of the challenges to the claims of the ‘775 Patent, and I relied on these principles to arrive at my conclusions.

15. I understand a claim is anticipated under 35 U.S.C. § 102 if all limitations are found in a single prior art reference, arranged as in the claim. The identical invention must be shown in complete detail as is contained in the patent claim.
16. I understand a prior art reference can disclose an element not expressly identified in the reference if the element is “inherently present” in the reference. To be “inherent,” I understand the missing element must necessarily be present in the reference. An element is not “inherent” if the missing element is only probably present or if there is merely a possibility it is present.

17. I understand a claim is invalid as obvious under 35 U.S.C. § 103 if the differences between the subject matter sought to be patented and the prior art are such that the subject matter of the claim as a whole would have been obvious at the time of the patent’s filing date to a Person of Ordinary Skill in the Art (POSITA). It is my understanding that the following factors are used to determine whether or not the claimed subject matter would have been obvious: (i) the scope and content of the prior art; (ii) the differences between the prior art and the claimed invention; (iii) the level of ordinary skill in the field of the invention; and (iv) any relevant objective considerations of non-obviousness.

18. I understand a party asserting obviousness based on a combination of prior art references must demonstrate that one of ordinary skill in the art would have been motivated to combine the teachings of those references to achieve the claimed invention with a reasonable expectation of success. It is my understanding that it is not sufficient to show that one of ordinary skill in the art could combine elements of multiple references. Instead, there must be a rational reason that would have prompted a POSITA to combine the elements in the way the claimed invention does, and that reason should be explained or articulated.

19. I understand a combination of references would not have been obvious if the alleged modification(s) to be made to the reference(s) are inconsistent with the stated goals of the reference(s). I understand a combination of references would not have been obvious if the modification of the reference(s) to derive what is claimed would render the reference(s) unsatisfactory or inoperable for their intended purpose. I further understand the party asserting obviousness must explain why a POSITA would have selected components for combination in the manner claimed.

20. It is my further understanding that an invention would not have necessarily been obvious simply because all the elements of the invention may have been known separately in the prior art; there must be a reason to combine the separately known elements. Obviousness cannot be based on a hindsight combination of components selectively picked from the art using the claims as a guide.
B. Claim Construction

21. I understand that claim construction in an IPR proceeding is a legal question for the Patent Trial and Appeal Board (PTAB or Board) to decide. In general, I understand that claim terms are to be given their ordinary and customary meaning to a POSITA in the context of the patent at the time the patent was filed.

22. I also understand that in construing claim terms, the Board asks what the claim terms would mean to a person of ordinary skill in the relevant art in view of the plain claim language and the disclosures of the patent and prosecution history. I understand that the Board may also consider other external evidence, such as dictionaries; however, the disclosures in the patent and prosecution history carry more weight than external evidence.

23. As such, any claim term not construed should be given its ordinary and customary meaning as would be understood by one of ordinary skill in the art.

24. I understand that the best source for determining the meaning of a claim is intrinsic evidence: the claims themselves, the written description, and the prosecution history. I also understand that extrinsic evidence, which consists of all evidence external to the patent and prosecution history, may be considered to determine the meaning of a claim term.

25. In view of the principles described above and the materials I have reviewed, I do not believe any limitations in the claims addressed herein require a specific construction to support the opinions I provide in this declaration.
C. Level of Ordinary Skill in the Art

26. I understand that certain issues in an IPR, such as claim construction and whether a claim is invalid as obvious, are assessed from the view of a hypothetical person of ordinary skill in the relevant art at the time of the invention. I understand there are multiple factors relevant to determining the level of ordinary skill in the art, including: (1) the level of education and experience of persons working in the field at the time of the invention; (2) the sophistication of the technology; (3) the types of problems encountered in the field; and (4) the prior art solutions to those problems.

27. In order to determine the characteristics of a hypothetical POSITA at the time of the claimed invention, I have considered a variety of factors. I have considered the prior art (referred to in the “Materials Reviewed” section of this declaration) and the various approaches to the batching of machine-learning tasks disclosed in those prior art documents, the types of problems encountered in the art and the solutions to those problems, the alleged problems encountered by the inventor as described in the ‘775 patent, the sophistication of the technology involved, and the educational background and experience of those actively working in the relevant field at the time of the invention.

28. Additionally, I considered the technology available in 2021, immediately before the filing of the patent application in December of 2021, and the professionals with whom I worked during that time, including their levels of education, sophistication, and activities in professional associations. I am informed that such considerations are in accordance with factors identified in case law and typically considered to determine the level of skill in the art.

29. The field of “art” for the ‘775 patent is machine learning.

30. In view of the above and based on my experience and knowledge, I believe a hypothetical Person of Ordinary Skill in the Art (POSITA) with regard to the ‘775 patent would have either: (1) a bachelor’s degree in electrical engineering, computer engineering, or computer science, with two to three years of work experience in machine learning; or (2) a master’s degree in electrical engineering, computer engineering, or computer science, with one year of work experience in machine learning.

31. Although I describe the POSITA as of December 2021, it is my further opinion that the fundamental qualifications, attributes, and skills of the person of ordinary skill in the art would have been the same for many years prior to December 2021 and presently remain the same.
V. THE ‘775 PATENT

A. Priority of ‘775 Patent

32. The ‘775 patent was filed on December 3, 2021. See ‘775 Patent.
B. Summary of the ‘775 Patent

33. The ‘775 Patent relates to an inference system that applies a machine-learning transformer model to batches of input requests with variable input lengths. ‘775 Patent at Abstract. This dynamic batching allows the system to utilize the parallel computation capabilities of hardware accelerators while avoiding the unnecessary computations that result from forcing requests into uniform lengths. ‘775 Patent at Abstract.

34. The ‘775 Patent, in Figures 5A-5D, illustrates the claimed method for dynamic batching and processing of requests using a machine-learning transformer model. ‘775 Patent at Figures 5A-5D; 22:22-24:38.

35. As shown in Figures 5A-5D, the system includes a serving system 435, a request processor 580, and a scheduler 585 coupled to multiple execution engines 590A and 590B. ‘775 Patent at 22:22-57.

36. Figure 5A illustrates that a single request R1 (with a single input token) is scheduled to execute in execution engine 590A, while execution engine 590B is scheduled to execute a batch of requests R3 (having two input tokens), R4 (having three input tokens), and R5 (having two input tokens). ‘775 Patent at 22:47-51.
37. As a new request R2 arrives at the request processor 580, it is forwarded to the scheduler 585, which monitors the cache memory of execution engines 590A and 590B to determine if memory is available for processing request R2. ‘775 Patent at 22:58-23:29.

38. Moving to Figure 5B, as a first output token is generated for requests R1, R3, R4, and R5, execution engine 590A is now scheduled to execute an updated batch of R1 and R2 at a second iteration. ‘775 Patent at 23:30-50. Execution engine 590A is capable of performing both the encoding phase and the decoding phase for the same batch of requests in conjunction with using a machine-learning transformer model 300. ‘775 Patent at 23:39-44; see also ‘775 Patent at 11:26-17:32 (describing the processing methodology for a machine-learning transformer model).

39. In Figure 5C, a second output token is generated for requests R1, R3, R4, and R5, and a first output token is generated for request R2 with an end token, which moves the outputs for request R2 to the completion queue of the request processor 580. ‘775 Patent at 23:51-58. The execution engine 590A then frees the cache memory allocated to request R2. ‘775 Patent at 23:59-60. Similarly, the second output token for request R4 is sent to the completion queue with its end token, and execution engine 590B frees the cache memory allocated to request R4. ‘775 Patent at 23:60-67.
40. As another new request R7 arrives, it is forwarded to the scheduler 585, where it is stored in the incoming queue. ‘775 Patent at 24:1-4. Since requests R2 and R4 are complete and cache is available in execution engine 590A, the scheduler 585 updates the batch for execution engine 590A to R1, R7 and the batch for execution engine 590B to R3, R5. ‘775 Patent at 24:4-11. Accordingly, by having dynamic batches for each iteration, “completed requests can be provided to the client device 110 as soon as processing is complete, and the scheduler 585 can schedule new requests.” ‘775 Patent at 24:10-16.

41. Moving to Figure 5D, a third output token is generated for R1, R3, and R5, execution engine 590A is ready to execute an updated batch of requests R1, R7 (with two tokens), and execution engine 590B is ready to execute an updated batch of requests R3 and R5. ‘775 Patent at 24:13-27.
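For illustration only, the following is a minimal sketch of my own showing the general iteration-level scheduling idea described above with respect to Figures 5A-5D: after each iteration, completed requests leave the batch and waiting requests join it. It is not code from the ‘775 Patent, and all names in it are hypothetical; a single integer stands in for the per-request cache memory discussed above.

```python
# A simplified sketch of my own (not code from the '775 Patent); all names are hypothetical.
from collections import deque

def serve(execute_one_iteration, incoming_requests, cache_slots):
    """Run requests in dynamically changing batches, one output token per iteration."""
    queue = deque(incoming_requests)   # incoming queue monitored by the scheduler
    batch, completed = [], []
    while batch or queue:
        # Admit waiting requests while cache memory (modeled here as slots) is available.
        while queue and len(batch) < cache_slots:
            req = queue.popleft()
            req.setdefault("output", [])
            batch.append(req)
        # One iteration: every request in the current batch produces its next token.
        for req in list(batch):
            token = execute_one_iteration(req)
            req["output"].append(token)
            if token == "<eos>":       # end token: request is done, free its slot
                batch.remove(req)
                completed.append(req)
    return completed
```

The only point of this sketch is that the batch composition may change from one iteration to the next, so finished requests can be returned immediately and new requests can be scheduled without waiting for the whole batch to complete.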
VI. OVERVIEW OF THE PRIOR ART

42. Before providing a detailed analysis of how the prior art discloses or teaches the limitations of the challenged claims, I provide a brief summary of the state of the art and of the individual prior art references.

Machine Learning and RNNs

43. Machine Learning in general is a field of study focused on using statistical techniques and algorithms in order to generalize from data observations and predict behavior over unseen data. One of the oldest techniques employed in machine learning is the Neural Network (NN) model, which was developed and introduced in the 1950s and 1960s. Early NNs were built using a single building cell, namely the perceptron. The perceptron is a simple circuit that takes in a number of binary inputs, each corresponding to some weight (a real number) which modulates the input, and finally produces a single decision output of True or False (1 or 0) based on some activation value (also a real number). The idea behind the perceptron was to capture the functionality that takes place in a human neuron.
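For illustration only, the following is a minimal sketch of my own of the perceptron just described: weighted binary inputs are summed and compared against an activation value to produce a 1-or-0 decision. The example at the end, a unit that computes a logical AND, is mine and is not drawn from any cited reference.

```python
# A minimal illustrative perceptron; the weights and threshold below are arbitrary example values.
def perceptron(inputs, weights, threshold):
    # inputs: 0/1 values; weights: one real number per input, modulating that input.
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0

# Example: a two-input unit that outputs True (1) only when both inputs are 1.
assert perceptron([1, 1], [0.6, 0.6], threshold=1.0) == 1
assert perceptron([1, 0], [0.6, 0.6], threshold=1.0) == 0
```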
44. In order to achieve higher levels of generalization (perceived intelligence), a NN utilizes many perceptrons arranged in the form of a layer, where multiple inputs are mapped into multiple outputs. Moreover, while a simple NN is made of a single layer, a more complex NN can be made of many interconnected layers (sometimes referred to as a Multi-Layer Perceptron (MLP)). Typically, each layer is made of a number of perceptrons whose outputs are connected to the next layer, and so forth. The first layer is connected to the input, while the last layer produces the output of the circuit, with the remaining hidden layers in between. Once the network is created, the goal becomes to figure out the correct weights used in each perceptron in order to generate the correct output. This is the process of learning.
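The layered arrangement described above can likewise be sketched in a few lines. The following is my own simplified illustration of a forward pass through a small multi-layer network; the layer sizes and the use of a sigmoid activation are assumptions chosen only for concreteness.

```python
import numpy as np

def mlp_forward(x, layers):
    # layers: a list of (weights, bias) pairs, one pair per layer of units.
    for W, b in layers:
        x = 1.0 / (1.0 + np.exp(-(x @ W + b)))   # weighted sum followed by an activation
    return x

# Example: 3 inputs -> a hidden layer of 4 units -> 2 outputs.
rng = np.random.default_rng(0)
layers = [(rng.normal(size=(3, 4)), np.zeros(4)),
          (rng.normal(size=(4, 2)), np.zeros(2))]
output = mlp_forward(np.array([1.0, 0.0, 1.0]), layers)
```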
45. Learning in general is carried out over a set of desired inputs and outputs (training samples). The goal is to feed the input representation into the NN model and to find a way to modify the weights of the NN until the correct output is produced. This process is carried out for all the training samples over many iterations until the NN model is finally capable of producing the correct output for the training inputs.

46. The types of algorithms used to train a NN are outside of our scope here. However, it suffices to say that these training algorithms depend on the architecture and components of the underlying NN and typically require considerable computational resources to complete. A successful NN architecture lends itself to training in a way that obtains high levels of accuracy when tested on predicting the outputs of the training samples. More importantly, the NN is expected to obtain high levels of accuracy when exposed to new data that the NN never saw during the training phase.
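For illustration only, the following is my own minimal sketch of the iterative weight-adjustment idea described in the preceding two paragraphs, using the classic perceptron learning rule on a tiny set of training samples (the AND function). The learning rate, threshold, and number of passes are arbitrary choices made for the example and are not drawn from any cited reference.

```python
# Training samples: (inputs, desired output) pairs for a logical AND.
samples = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
weights, threshold, lr = [0.0, 0.0], 0.5, 0.1

def predict(x):
    return 1 if sum(w * xi for w, xi in zip(weights, x)) >= threshold else 0

for _ in range(20):                      # many passes over the training samples
    for x, target in samples:
        error = target - predict(x)
        # Nudge each weight in the direction that reduces the error on this sample.
        weights = [w + lr * error * xi for w, xi in zip(weights, x)]

assert all(predict(x) == target for x, target in samples)   # all training samples now correct
```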
47. Over the years, many different types of NNs were introduced. The best known today are Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs), both of which are considered to be Deep Neural Networks (DNNs). Here we are interested in RNNs as they relate to the subject at hand.

48. RNNs were first introduced in the 1980s and only became practical to use during the late 1990s with the introduction of Long Short-Term Memory (LSTM). Since then, RNNs have evolved to encompass many types and variations of their underlying cell and neural networks. In general, RNNs were designed to process sequential data, that is, data that changes over time. Whereas NNs expect an input of fixed size in order to produce an output, an RNN is designed to be able to process data in a sequence of varying length, such as text or speech.
49. In its simplest form, an RNN can be viewed as a functional cell that takes in an external input along with the value of an internal hidden state, and correspondingly updates the value of the internally stored hidden state. This RNN function depends on a number of parameters that are learned during training. When processing an input sequence, say a list of words, the hidden state is initialized to some value before the RNN moves to process the first word in the sequence as input. Correspondingly, the RNN updates the value of the internal hidden state in a way that depends on the first word processed. Once the RNN moves to the next word in the list, it repeats the same computation, only this time the hidden state used in the function has been updated in a way that depends on the first word. Continuing in this fashion, the RNN can process any number of input words until there are no more words to process. At that point, the internal state of the RNN holds a value that has been continuously updated in correspondence with every word in the input. As a result, the internal state value can now be thought of as a representation of the input words and as such can be used to carry out a classification or prediction task relating to the processed input.
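For illustration only, the following is my own minimal sketch of the hidden-state update just described: a simple recurrent cell is applied to a variable-length list of input vectors, and the hidden state carried forward from word to word ends up summarizing the whole sequence. The particular update rule (a tanh of weighted sums) and the parameter names are assumptions made for the example.

```python
import numpy as np

def rnn_encode(inputs, W_xh, W_hh, b_h):
    # inputs: a list of input vectors, one per word; the hidden state starts at zero.
    h = np.zeros(W_hh.shape[0])
    for x in inputs:
        # The new hidden state depends on the current word and on the prior hidden state.
        h = np.tanh(x @ W_xh + h @ W_hh + b_h)
    return h   # final state: a fixed-size representation of the variable-length input

# Example: encode a 5-word sequence of 16-dimensional word vectors into a 32-dimensional state.
rng = np.random.default_rng(0)
words = [rng.normal(size=16) for _ in range(5)]
state = rnn_encode(words, rng.normal(size=(16, 32)), 0.1 * rng.normal(size=(32, 32)), np.zeros(32))
```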
50. Due to the success of RNNs in many applications (e.g., Natural Language Processing (NLP), speech recognition, image classification), a large number of RNN variations can be found in the literature. These various types of RNNs differ in their functionality, components, number of inputs or outputs, and so on. Examples of RNN models in use include LSTM, Gated Recurrent Units (GRU), and Sequence to Sequence (Seq2Seq) models, among many other types of architectures and building blocks. Regardless of this variation, all RNNs share the same underlying property of retaining a hidden state that is updated according to the changing input in order to affect the final output of the circuit.
Transformers

51. In their groundbreaking work, Vaswani introduced Transformers, a new Deep Neural Network (DNN) model that relies completely on the attention mechanism to draw global dependencies between input and output. Simply put, the attention mechanism takes in an input encoded as a vector and maps it to an output that is also a vector.

52. By first mapping a sequence of encoded inputs to a corresponding set of Queries, Keys, and Values (QKV), the output of the attention mechanism is formed as a weighted sum of the input Values, where the weight assigned to each Value is computed by a compatibility function between a Query and a Key.
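For illustration only, the following is my own minimal sketch of the attention computation described above: the encoded inputs are projected into Queries, Keys, and Values, and each output position is a weighted sum of the Values, with the weights produced by a softmax over scaled Query-Key dot products. The projection matrices are hypothetical stand-ins for learned parameters, and this sketch is not code from Vaswani or the ‘775 Patent.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, W_q, W_k, W_v):
    # X: (sequence_length, d_model) encoded inputs; W_q, W_k, W_v: learned projection matrices.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # compatibility of each Query with each Key
    weights = softmax(scores, axis=-1)        # one weight per (query, key) pair
    return weights @ V                        # each output is a weighted sum of the Values
```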
53. Although the attention mechanism already existed in the literature (e.g., Bahdanau et al.), the main contribution of Vaswani was to rely only on the attention mechanism to extract any interdependencies between the elements making up the input sequence.
54. Accordingly, one can easily see that all the building blocks making up a transformer model (except for the attention block) are element-wise operations that do not observe interdependency between input elements. Typically, transformers operate in an auto-regressive fashion, where the predicted next element in a sequence of input elements is concatenated to the input and fed again to the system until a final special output element (e.g., <eos>) is generated. This mode of operation is very similar to that used in Seq2Seq RNN models. (Sutskever et al.)
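For illustration only, the auto-regressive mode of operation described above can be sketched as a simple loop of my own; predict_next below is a hypothetical stand-in for a trained model and is not any particular system's API.

```python
def generate(predict_next, prompt, eos="<eos>", max_length=50):
    sequence = list(prompt)
    while len(sequence) < max_length:
        next_element = predict_next(sequence)   # the model sees the prompt plus everything generated so far
        sequence.append(next_element)           # the prediction is concatenated to the input
        if next_element == eos:                 # stop once the special end element appears
            break
    return sequence
```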
55. The original Transformer proposed by Vaswani was made up of two main component stacks, an encoder stack followed by a decoder stack. While the two components have slightly different building blocks and connections, their attention mechanisms differ in one main way.
56. In an encoder, attention is computed between all input elements regardless of their position within the input sequence. On the other hand, the decoder uses masked attention (or causal attention), which only allows attention to be computed between an input element and the previous elements up to and including itself within the same input sequence. This difference ensures that the decoder, mainly tasked with predicting the next element in the input sequence, does not break the chain of causality. That is, an element can only be influenced by itself and previous elements, but not by future elements.
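For illustration only, the causal constraint described above can be written as a lower-triangular mask over the attention scores; the sketch below is my own and simply shows which Query/Key pairs are permitted.

```python
import numpy as np

def causal_mask(sequence_length):
    # True where attention is allowed: position i may attend only to positions j <= i.
    return np.tril(np.ones((sequence_length, sequence_length), dtype=bool))

def apply_causal_mask(scores):
    # Disallowed pairs are set to -inf so they receive zero weight after the softmax.
    return np.where(causal_mask(scores.shape[0]), scores, -np.inf)

print(causal_mask(4).astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```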
57. Many different Transformer architectures have been proposed, but perhaps one of the most popular was the Decoder-Only Transformer (DOT) (Liu et al.), which is best known for having been used in GPT (Generative Pre-trained Transformer). See Radford et al.
58. In the DOT architecture, the Transformer is made of a number of decoders stacked in multiple consecutive layers, with their final output used to predict the next element of an input sequence. This architecture was preferred for being simpler and for having efficient-implementation features, such as the masked attention, which allows the reuse of the Key and Value elements from previous iterations over the inputted and generated elements, without requiring dependency on future element keys and values.
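For illustration only, the reuse of Keys and Values that the masked attention makes possible can be sketched as follows. This is my own simplified illustration of a single attention layer during generation, not code from any cited reference; the projection matrices and cache lists are hypothetical.

```python
import numpy as np

def decode_step(x_t, W_q, W_k, W_v, k_cache, v_cache):
    # x_t: embedding of the newest element; k_cache/v_cache hold Keys/Values from prior iterations.
    q_t = x_t @ W_q
    k_cache.append(x_t @ W_k)                 # only the new Key is computed; older ones are reused
    v_cache.append(x_t @ W_v)                 # only the new Value is computed; older ones are reused
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = K @ q_t / np.sqrt(q_t.shape[0])  # the new position attends to itself and earlier positions only
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                        # attention output for the newest position
```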
Batching in Machine Learning

59. Batching (sometimes called batch processing) is a known computing method which refers to the idea of combining multiple inputs to be processed in parallel as a batch. In the context of ML, batching has been widely used for many years in order to reduce the time required to process inputs. In the training phase of ML, batching is intimately related to the training algorithms used (e.g., minibatch-based Stochastic Gradient Descent (SGD)). W
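For illustration only, the basic benefit of batching described above can be sketched in a few lines of my own: several independent inputs are stacked and pushed through a layer's weights in one matrix operation instead of one at a time, producing the same results. The sizes and values below are arbitrary example choices.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))                       # one layer's weights, shared by all inputs

inputs = [rng.normal(size=8) for _ in range(5)]   # five independent inputs
one_at_a_time = [x @ W for x in inputs]           # sequential processing
batched = np.stack(inputs) @ W                    # a single (5 x 8)(8 x 4) multiplication

assert np.allclose(np.stack(one_at_a_time), batched)   # identical results, computed in parallel
```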
