`UNITED STATES PATENT AND TRADEMARK OFFICE
`
`__________________
`
`BEFORE THE PATENT TRIAL AND APPEAL BOARD
`
`___________________
`
`HUGGING FACE, INC.,
`
`Petitioner,
`
`v.
`
`FRIENDLIAI INC.,
`
`Patent Owner.
`
`__________________
`
IPR2024-01234
`
`U.S. Patent No. 11,442,775
`
`_________________
`
`DECLARATION OF GHAITH HAMMOURI, Ph.D.
`
`Petitioner, EX1002
`IPR2024-01234
`Hugging Face, Inc., v. FriendliAI Inc.
`
`
`
`
`
`TABLE OF CONTENTS
`
`Page No.
`
I.    INTRODUCTION ............................................................ 3
      A. Qualifications ....................................................... 3
II.   MATERIALS REVIEWED ...................................................... 6
III.  SUMMARY OF MY OPINIONS .................................................. 8
IV.   LEGAL PRINCIPLES ........................................................ 8
      A. Understanding of Patent Law .......................................... 8
      B. Claim Construction ................................................... 10
      C. Level of Ordinary Skill in the Art ................................... 12
V.    THE ‘775 PATENT ......................................................... 13
      A. Priority of ‘775 Patent .............................................. 13
      B. Summary of the ‘775 Patent ........................................... 14
VI.   OVERVIEW OF THE PRIOR ART ............................................... 18
      A. Summary of Gao ....................................................... 26
      B. Summary of Katharopoulos ............................................. 32
VII.  REASONS THE CHALLENGED CLAIMS OF THE ‘775 PATENT
      ARE UNPATENTABLE ........................................................ 34
      A. GROUND 1: Claims 1-18 are rendered obvious under 35
         U.S.C. § 103 by Gao in view of Katharopoulos ......................... 34
VIII. OATH .................................................................... 110
`
`I, Dr. Ghaith Hammouri, declare as follows:
I. INTRODUCTION
1. Hugging Face, Inc. (“Hugging Face”) has retained my
`
`services in connection with the above captioned Inter Partes Review (IPR) of U.S.
`
`Patent No. 11,442,775 (‘775 Patent). I have been asked to study and provide my
`
`opinions as an independent expert witness regarding technology described in the
`
`‘775 Patent. I am being compensated at my usual and customary rate for my time.
`
`Such compensation, however, does not influence my opinion nor does the outcome
`
`of this proceeding impact my compensation.
`
`A. Qualifications
`2. My qualifications and professional experience are described in my
`
`curriculum vitae. I have been informed that a copy of my curriculum vitae will be
`
`submitted with my declaration. The following is a summary of my relevant
`
`qualifications and professional experience.
`
3. I received my PhD in Electrical and Computer Engineering from

Worcester Polytechnic Institute (WPI) in 2009. Prior to that, I received my M.S.
`
`in Electrical Engineering from the University of Hartford in 2004. I also received
`
`my B.S. in Electrical Engineering with a second major in Physics and a minor in
`
`Math from the University of Hartford in 2003.
`
`4. My PhD work and thesis were focused on cryptography and the use of
`
`learning models for extracting unique and secure identifiers from physical hardware.
`
After my PhD, I worked as a postdoctoral researcher at WPI, where I continued

pursuing my research. I invented a technique for fingerprinting CDs and other forms

of optical media using statistical learning. During this time, I mentored junior

graduate students who joined the CRIS lab and acted as their advisor, coordinating

with the head of the lab. My postdoctoral position and research were both funded

through a National Science Foundation (NSF) academic grant.
`
5. In 2010, I co-founded Intryca (renamed Claveo) in order to

commercialize my research. I served as the VP of Technology, where I led the

research and development efforts of the company. My research focused on

developing statistical learning methods for extracting hardware fingerprints from

smartphones. Based on my research, I was awarded a Small Business Innovation

Research (SBIR) grant from the NSF. Further, this research and corresponding

development resulted in three U.S. patents.
`
6. In 2012, I became the CEO (and later a co-owner) of Simtix, a physical

security company with customers across the Middle East. The company’s products

mainly revolved around automating physical security using RFID and Automatic

Number Plate Recognition (ANPR) technology. In that role, I spent a significant

portion of my time overseeing the technology while supervising the technical team

and introducing them to new technologies.
`
7. In 2016, I left my full-time role at Simtix to focus my

attention entirely on Machine Learning. I co-founded Xr.AI, a New York-based startup

focused on utilizing Machine Learning and Natural Language Processing (NLP) in

order to automate the analysis of legal contracts. I served as the Chief Scientific

Officer of the company, where my work focused on creating a new technology for

clause-level vector embeddings using Recurrent Neural Networks (RNNs). For this

work, I was awarded a Small Business Innovation Research (SBIR) grant from the

National Science Foundation (NSF).
`
8. In 2020, I left corporate work to spend more time on my long-term

research project of exploring deep connections between learning theory and

cryptography. Over the past four years, I have served in a consulting capacity as the

CEO of Simtix, where I led the development of Secure-Brain technology, a vision
`
`AI engine built on Multi-Modal Transformer technology to be used in security and
`
`defense applications. In this capacity, I mentored the development team and
`
`introduced them to modern machine learning concepts and tools.
`
9. I currently also serve as a consultant for several private companies
`
`advising them on the use and applications of AI technology including Large
`
`Language Models (LLMs) and Multi-Modal Transformer models.
`
`10. Since 2023, I have also served as an affiliate research scientist at WPI
`
`pursuing research in areas relating to machine learning and security. Specifically, I
`
have been actively researching adversarial vulnerabilities in LLMs, which are

generally based on the Transformer machine-learning model.
`
`II. MATERIALS REVIEWED
11. In forming my opinions provided in this declaration, I have reviewed

and considered at least the following. I have also considered the other background
`
`references I have cited in this declaration as well as my education, knowledge, and
`
`experience working in machine learning, software development, and consumer
`
`products.
`
`U.S. Patent No. 11,442,775 (“the ’775 patent”)
`
`Declaration of Ghaith Hammouri
`
`C.V. of Ghaith Hammouri
Pin Gao, et al., Low Latency RNN Inference with Cellular Batching, Thirteenth
EuroSys Conference 2018, April 23–26, 2018, Porto, Portugal, published by the
Association for Computing Machinery (2018) (“Gao”)
Katharopoulos, et al., Transformers are RNNs: Fast Autoregressive
Transformers with Linear Attention, arXiv:2006.16236v3, Aug. 31, 2020, presented
at Proceedings of the 37th International Conference on Machine Learning, Online,
PMLR 119, 2020 (“Katharopoulos”).
`Original Complaint, FriendliAI Inc. v. Hugging Face, Inc., Case No. 23-816-MN,
`D. Del. (filed July 28, 2023)
`First Amended Complaint, FriendliAI Inc. v. Hugging Face, Inc., Case No. 23-
`816-MN, D. Del. (filed January 8, 2024)
Gyeong-In Yu, et al., Orca: A Distributed Serving System for Transformer-Based
`Generative Models, Proceedings of the 16th USENIX Symposium on Operating
`Systems Design and Implementation, July 11–13, 2022, Carlsbad, CA, USA
`(“Orca paper”)
`
A. Vaswani, et al., Attention Is All You Need, Advances in Neural Information
`Processing Systems, 2017 (“Vaswani”)
`Prosecution history of application 17/542,193.
`
`Plaintiff’s Disclosure of Proposed Claim Constructions, FriendliAI Inc. v.
`Hugging Face, Inc., Case No. 23-816-MN, D. Del. (served on May 28, 2024)
`Defendant’s Disclosure of Proposed Claim Construction, FriendliAI Inc. v.
`Hugging Face, Inc., Case No. 23-816-MN, D. Del. (served on May 28, 2024)
`Plaintiff’s Final Infringement Contentions of ’775 Patent, FriendliAI Inc. v.
`Hugging Face, Inc., Case No. 23-816-MN, D. Del. (served on July 1, 2024)
`Ilya Sutskever, et al., Sequence to Sequence Learning with Neural Networks,
arXiv:1409.3215v3, Dec. 14, 2014, Proceedings of the 28th Conference on
`Neural Information Processing Systems 2014, December 8–13, 2014, Palais des
`congrès de Montréal, Canada, (“Sutskever”)
`Dzmitry Bahdanau, et al., Neural Machine Translation by Jointly Learning to
Align and Translate, arXiv:1409.0473v7, May 19, 2016, Proceedings of the
`International Conference on Learning Representations 2015, May 7–9, 2015, San
`Diego, California (“Bahdanau”)
`Romain Paulus, et al., A Deep Reinforced Model for Abstractive Summarization,
arXiv:1705.04304v3, November 13, 2017, Published online by Salesforce
`Research, Palo Alto, California (“Paulus”)
Peter J. Liu, et al., Generating Wikipedia by Summarizing Long Sequences,
arXiv:1801.10198v1, January 30, 2018, Proceedings of the International Conference
`on Learning Representations 2018, April 30–May 3, 2018, Vancouver, Canada
`(“Liu”)
`Alec Radford, et al., Improving Language Understanding by Generative Pre-
`Training, https://api.semanticscholar.org/CorpusID:49313245, June 11, 2018,
`Published online (“Radford”)
`Yao-Hung Hubert Tsai, et al., Transformer Dissection: A Unified Understanding
of Transformer’s Attention via the Lens of Kernel, arXiv:1908.11775v4,
`November 11, 2019, Proceedings of the 2019 Conference on Empirical Methods
`in Natural Language Processing and the 9th International Joint Conference on
`Natural Language Processing (EMNLP-IJCNLP) (pp. 4335–4344), November 3–
`7, 2019, Hong Kong, China (“Tsai”)
`Ankit Singh Rawat, et al., Sampled Softmax with Random Fourier Features,
arXiv:1907.10747v2, December 31, 2019, Proceedings of the Conference on
`Neural Information Processing Systems 2019, December 8–14, 2019, Vancouver,
`Canada (“Rawat”)
`
Guy Blanc, et al., Adaptive Sampled Softmax with Kernel Based Sampling,
arXiv:1712.00527v2, August 1, 2018, Proceedings of the 35th International
`Conference on Machine Learning, PMLR Vol. 80 pp. 590–599, July 10–15, 2018,
`Stockholm, Sweden (“Blanc”)
`Claim appendix for claims 1-18 of the ’775 Patent
`
`
`III. SUMMARY OF MY OPINIONS
12. I understand the following table lists the ground of rejection I have

considered in this declaration:
`
Claims          Basis             Prior Art
1-18            Obviousness       Gao in view of Katharopoulos
`
`13. After a review of the ‘775 Patent and the prior art, it is my opinion that
`
`
`
`the Challenged Claims are invalid under the proposed ground. My opinions, and the
`
bases therefor, are detailed throughout this Declaration.
`
`IV. LEGAL PRINCIPLES
`A. Understanding of Patent Law
14. I am not an attorney and will not be offering legal conclusions.
`
`However, I have been informed of several principles concerning legal issues relevant
`
`to my analysis of the Challenges to the claims of the ‘775 Patent, and I relied on
`
`these principles to arrive at my conclusions.
`
15. I understand a claim is anticipated under 35 U.S.C. § 102 if all
`
`limitations are found in a single prior art reference, arranged as in the claim. The
`
identical invention must be shown in as complete detail as is contained in the patent
`
`claim.
`
16. I understand a prior art reference can disclose an element not expressly
`
`identified in a reference if the element is “inherently present” in the reference. To be
`
`“inherent,” I understand the missing element must necessarily be present in the
`
`reference. An element is not “inherent” if the missing element is only probably
`
`present or if there is merely a possibility it is present.
`
17. I understand a claim is invalid as obvious under 35 U.S.C. § 103 if the
`
`differences between the subject matter sought to be patented and the prior art are
`
`such that the subject matter of the claim as a whole would have been obvious at the
`
time of the patent’s filing date to a Person of Ordinary Skill in the Art (POSITA).
`
`It is my understanding that the following factors are used to determine whether or
`
`not the claimed subject matter would have been obvious: (i) the scope and content
`
`of the prior art; (ii) the differences between the prior art and the claimed invention;
`
`(iii) the level of ordinary skill in the field of the invention; and (iv) any relevant
`
`objective considerations of non-obviousness.
`
18. I understand a party asserting obviousness based on a combination of
`
`prior art references must demonstrate that one of ordinary skill in the art would have
`
`been motivated to combine the teachings of those references to achieve the claimed
`
`invention with a reasonable expectation of success. It is my understanding that it is
`
`
`not sufficient to show that one of ordinary skill in the art could combine elements of
`
`multiple references. Instead, there must be a rational reason that would have
`
`prompted a POSITA to combine the elements in the way the claimed invention does;
`
`and the reason should be explained or articulated.
`
19. I understand a combination of references would not have been obvious
`
`if the alleged modification(s) to be made to the reference(s) are inconsistent with the
`
`stated goals of the reference(s). I understand a combination of references would not
`
`have been obvious if the modification of the reference(s) to derive what is claimed
`
`would render the reference(s) unsatisfactory or inoperable for their intended
`
`purpose. I further understand the party asserting obviousness must explain why a
`
`POSITA would have selected components for combination in the manner claimed.
`
20. It is my further understanding that an invention would not have
`
`necessarily been obvious simply because all the elements of the invention may have
`
`been known separately in the prior art; there must be a reason to combine the
`
`separately known elements. Obviousness cannot be based on a hindsight
`
combination of components selectively picked from the art using the claims as a guide.
`
`B. Claim Construction
21. I understand that claim construction in an IPR proceeding is a legal
`
`question for the Patent Trial and Appeal Board (PTAB or Board) to decide. In
`
`general, I understand that claim terms are to be given their ordinary and customary
`
`meaning to a POSITA in the context of the patent at the time the patent was filed.
`
22. I also understand that in construing claim terms, the Board asks what
`
`the claim terms would mean to a person of ordinary skill in the relevant art in view
`
`of the plain claim language and the disclosures of the patent and prosecution history.
`
`I understand that the Board may also consider other external evidence, such as
`
`dictionaries, however the disclosures in the patent and prosecution history carry
`
`more weight than external evidence.
`
`23. As such, any claim term not construed should be given its ordinary and
`
`customary meaning as would be understood by one of ordinary skill in the art.
`
24. I understand that the best source for determining the meaning of a claim
`
`is intrinsic evidence—the claims themselves, the written description, and the
`
`prosecution history. I also understand that extrinsic evidence, which consists of all
`
`evidence external to the patent and prosecution history, may be considered to
`
`determine the meaning of a claim term.
`
25. In view of the principles described above and the materials I have
`
`reviewed, I do not believe any limitations in the claims addressed herein require a
`
`specific construction to support the opinions I provide in this declaration.
`
`C. Level of Ordinary Skill in the Art
26. I understand that certain issues in an IPR, such as claim construction
`
`and whether a claim is invalid as obvious, are assessed from the view of a
`
`hypothetical person of ordinary skill in the relevant art at the time of the invention.
`
`I understand there are multiple factors relevant to determining the level of ordinary
`
`skill in the art, including: (1) the level of education and experience of persons
`
`working in the field at the time of the invention; (2) the sophistication of the
`
`technology; (3) the types of problems encountered in the field; and (4) the prior art
`
`solutions to those problems.
`
27. In order to determine the characteristics of a hypothetical POSITA at
`
`the time of the claimed invention, I have considered a variety of factors. I have
`
considered the prior art (referred to in the “Materials Reviewed” section of this
`
`declaration) and the various approaches to address the batching of machine-learning
`
`tasks disclosed in those prior art documents, the types of problems encountered in
`
`the art and the solutions to those problems, the alleged problems encountered by the
`
`inventor as described in the ‘775 patent, the sophistication of the technology
`
`involved, and the educational background and experience of those actively working
`
`in the relevant field at the time of the invention.
`
28. Additionally, I considered the technology available in 2021,

immediately before the filing of the patent application in December of 2021, and the
`
`professionals with whom I worked during that time, including their levels of
`
`education, sophistication, and activities in professional associations. I am informed
`
`that such considerations are in accordance with factors identified in case law and
`
`typically considered to determine the level of skill in the art.
`
29. The field of “art” for the ‘775 patent is machine learning.
`
30. In view of the above and based on my experience and knowledge, I

believe a hypothetical Person of Ordinary Skill in the Art (POSITA) with regard to
`
`the ‘775 patent would have either: (1) a bachelor’s degree in electrical engineering,
`
`computer engineering, or computer science, with two to three years of work
`
`experience in machine learning; or (2) a master’s degree in electrical engineering,
`
`computer engineering, or computer science, with one year of work experience in
`
`machine learning.
`
`31. Although I describe the POSITA as of December 2021, it is my further
`
`opinion that the fundamental qualifications, attributes, and skills of the person of
`
`ordinary skill in the art would have been the same for many years prior to December
`
`of 2021 and presently remain the same.
`
`V. THE ‘775 PATENT
`A. Priority of ‘775 Patent
`32. The ‘775 patent was filed on December 3, 2021. See ‘775 Patent.
`
B. Summary of the ‘775 Patent
`33. The ‘775 Patent relates to an inference system that applies a machine-
`
`learning transformer model to batches of input requests with variable input lengths.
`
`‘775 Patent at Abstract. This dynamic batching allows the utilization of hardware
`
accelerators’ parallel computation capabilities while avoiding the unnecessary

computations that result from forcing requests into uniform lengths. ‘775 Patent at Abstract.
`
`34. The ‘775 Patent, in Figures 5A-5D, illustrates the claimed method for
`
`dynamic batching and processing of requests using a machine-learning transformer
`
`model. ‘775 Patent at Figures 5A-5D; 22:22-24:38.
`
`35. As shown in Figures 5A-5D, the system includes a serving system 435,
`
`request processor 580, and scheduler 585 coupled to multiple execution engines
`
`590A and 590B. ‘775 Patent at 22:22-57.
`
`36. Figure 5A illustrates that a single request R1 (with a single input token)
`
`is scheduled to execute in execution engine 590A while execution engine 590B is
`
`scheduled to execute a batch of requests, R3 (having two input tokens), R4 (having
`
`three input tokens) and R5 (having two input tokens). ‘775 Patent at 22:47-51.
`
37. As new request R2 arrives at request processor 580, it is forwarded to
`
`scheduler 585 which monitors the cache memory for execution engines 590A and
`
`590B to determine if memory is available for processing request R2. ‘775 Patent at
`
`22:58-23:29.
`
38. Moving to Figure 5B, as the first output token is generated for requests R1,

R3, R4, and R5, execution engine 590A is now scheduled to execute an updated batch of
`
`R1 and R2 at a second iteration. ‘775 Patent at 23:30-50. Execution engine 590A is
`
`capable of performing both the encoding phase and decoding phase for the same
`
`batch of requests in conjunction with using a machine-learning transformer model
`
`300. ‘775 Patent at 23:39-44; see also ‘775 Patent at 11:26-17:32 (describing
`
`processing methodology for a machine-learning transformer model).
`
`
39. In Figure 5C, a second output token is generated for requests R1, R3,

R4, and R5, and a first output token is generated for request R2 with an end token that
`
`moves the outputs for request R2 to the completion queue of the request processor
`
`580. ‘775 Patent at 23:51-58. The execution engine 590A then frees the cache
`
`memory allocated to request R2. ‘775 Patent at 23:59-60. Similarly, the second
`
`output token for request R4 is sent to the completion queue with its end token and
`
`execution engine 590B frees its cache memory allocated to request R4. ‘775 Patent
`
`at 23:60-67.
`
40. As another new request, R7, arrives, it is forwarded to the scheduler 585
`
`where it is stored in the incoming queue. ‘775 Patent at 24:1-4. Since requests R2
`
`and R4 are complete and cache is available in execution engine 590A, the scheduler
`
`585 updates the batch for execution engine 590A to R1, R7 and execution engine
`
`590B to R3, R5. ‘775 Patent at 24:4-11. Accordingly, by having dynamic batches
`
`for each iteration, “completed requests can be provided to the client device 110 as
`
soon as processing is complete, and the scheduler 585 can schedule new requests.”
`
`‘775 Patent at 24:10-16.
`
`41. Moving to Figure 5D, a third output token is generated for R1, R3, and
`
R5 and execution engine 590A is ready to execute an updated batch of requests R1, R7
`
`(with two tokens), and execution engine 590B is ready to execute an updated batch
`
`of requests R3 and R5. ‘775 Patent at 24:13-27.
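For illustration only, the following is a simplified, hypothetical sketch of the general iteration-level scheduling idea described above; the names and structure are my own and are not drawn from the ‘775 patent. At each iteration, requests that have produced an end token are retired from the batch and waiting requests take their place, so completed outputs can be returned without waiting for the rest of the batch.

```python
# Simplified, hypothetical sketch only; not the '775 patent's implementation.
from collections import deque

def serve(requests, step, max_batch=2, eos="<eos>"):
    incoming = deque(requests)          # waiting requests
    batch, completed = [], []
    while incoming or batch:
        # Fill free slots in the batch with newly arrived requests.
        while incoming and len(batch) < max_batch:
            batch.append(incoming.popleft())
        # One iteration: generate one output token for every request in the batch.
        for req in batch:
            req["tokens"].append(step(req))
        # Retire requests that produced an end token, freeing their slots.
        done = [r for r in batch if r["tokens"][-1] == eos]
        completed.extend(done)
        batch = [r for r in batch if r not in done]
    return completed

# Toy example: every request finishes after emitting two tokens.
reqs = [{"id": i, "tokens": []} for i in range(4)]
finished = serve(reqs, step=lambda r: "x" if len(r["tokens"]) < 1 else "<eos>")
print([r["id"] for r in finished])
```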
`
`
`
`VI. OVERVIEW OF THE PRIOR ART
`42. Before providing a detailed analysis of how the prior art discloses or
`
`teaches the limitations of the challenged claims, I provide a brief summary of state
`
`of the art and the individual prior art references.
`
`Machine Learning and RNNs
`43. Machine Learning in general is a field of study focused on using
`
`statistical techniques and algorithms in order to generalize data observations and
`
predict behavior on unseen data. One of the oldest techniques employed in
`
`machine learning is that of a Neural Network (NN) model which was developed and
`
introduced in the 1950s and 1960s. Early NNs were built using a single building
`
cell, namely the perceptron. The perceptron is a simple circuit that takes in a number

of binary inputs, each corresponding to some weight (a real number) that

modulates the input, and finally produces a single decision output of True or False
`
`(1 or 0) based on some activation value (also a real number). The idea behind the
`
`perceptron was to capture the functionality that takes place in a human neuron.
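As a simple illustration of the perceptron described above, the following is my own sketch; the weights and threshold (activation value) shown are hypothetical.

```python
# Illustrative sketch only; the weights and threshold are hypothetical.

def perceptron(inputs, weights, threshold):
    # Each binary input is modulated by its corresponding weight.
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    # Output True (1) if the weighted sum reaches the activation value.
    return 1 if weighted_sum >= threshold else 0

# Example: this perceptron behaves like a logical AND of the first two inputs.
print(perceptron([1, 1, 0], weights=[0.6, 0.6, 0.3], threshold=1.0))  # -> 1
print(perceptron([1, 0, 1], weights=[0.6, 0.6, 0.3], threshold=1.0))  # -> 0
```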
`
44. In order to achieve higher levels of generalization (perceived

intelligence), an NN utilizes many perceptrons arranged in the form of a layer where

multiple inputs are mapped into multiple outputs. Moreover, while a simple NN is

made of a single layer, a more complex NN can be made of many interconnected

layers (sometimes referred to as a Multi-Layer Perceptron (MLP)). Typically, each

layer is made of a number of perceptrons whose outputs are connected to the next layer,

and so forth. The first layer is connected to the input while the last layer produces

the output of the circuit, leaving hidden layers in the middle. Once the network is
`
`created, the goal becomes to figure out the correct weights used in each perceptron
`
`in order to generate the correct output. This is the process of learning.
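The layered structure described above can be illustrated with a minimal sketch of my own; the layer sizes and weights are hypothetical, and each layer simply feeds its outputs to the next one.

```python
# Illustrative sketch only; layer sizes and weights are hypothetical.
import numpy as np

def mlp_forward(x, layers):
    # `layers` is a list of (weights, biases); each layer feeds the next one.
    for W, b in layers:
        x = np.tanh(W @ x + b)    # weighted sum followed by an activation
    return x

rng = np.random.default_rng(0)
layers = [
    (rng.normal(size=(4, 3)), np.zeros(4)),   # first layer: 3 inputs -> 4 hidden units
    (rng.normal(size=(2, 4)), np.zeros(2)),   # last layer: 4 hidden units -> 2 outputs
]
print(mlp_forward(np.array([1.0, 0.0, 1.0]), layers))
```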
`
`45. Learning in general is carried out over a set of desired inputs and
`
`outputs (training samples). The goal is to feed the input representation into the NN
`
`model and to find a way to modify the weights of the NN until the correct output is
`
`produced. This process is carried out for all the training samples over many iterations
`
until the NN model is capable of producing the correct output for

the training inputs.
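A minimal sketch of this iterative weight-adjustment process, assuming the classic perceptron update rule (my own simplification, not taken from any cited reference), is shown below.

```python
# Illustrative sketch only: the classic perceptron learning rule.

def train_perceptron(samples, n_inputs, lr=0.1, epochs=100):
    weights, bias = [0.0] * n_inputs, 0.0
    for _ in range(epochs):                       # many iterations over the samples
        errors = 0
        for inputs, target in samples:
            output = 1 if sum(x * w for x, w in zip(inputs, weights)) + bias >= 0 else 0
            error = target - output               # -1, 0, or +1
            if error != 0:
                errors += 1
                # Nudge the weights toward producing the desired output.
                weights = [w + lr * error * x for w, x in zip(weights, inputs)]
                bias += lr * error
        if errors == 0:                           # every training sample is now correct
            break
    return weights, bias

# Example: learn the logical AND function from its truth table.
samples = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
print(train_perceptron(samples, n_inputs=2))
```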
`
46. The types of algorithms used to train an NN are outside of our scope here.
`
`However, it suffices to say that these training algorithms depend on the architecture
`
`and components of the underlying NN and typically require considerable
`
`computational resources to complete. A successful NN architecture would lend itself
`
`to training in a way that obtains higher levels of accuracy when tested to predict the
`
`output of the training samples. More importantly, the NN is expected to obtain high
`
`levels of accuracy when exposed to new data that the NN never saw during the
`
`training phase.
`
47. Over the years, many different types of NNs were introduced. The most

widely known today are Recurrent Neural Networks (RNNs) and Convolutional

Neural Networks (CNNs), both of which are considered Deep Neural Networks

(DNNs). Here we are interested in RNNs as they relate to the subject at hand.
`
`48. RNNs were first introduced in the 1980s and only became practical to
`
use during the late 1990s with the introduction of Long Short-Term Memory (LSTM).
`
`Since then, RNNs have evolved to encompass many types and variations of their
`
`underlying cell and neural networks. In general, RNNs were designed to process
`
`sequential data. That is, data that changes over time. Whereas NNs expect an input
`
`of fixed size in order to produce an output, an RNN is designed to be able to process
`
`data in a sequence of varying length such as text or speech.
`
49. In its simplest form, an RNN can be viewed as a functional cell that
`
`takes in an external input along with a value of an internal hidden state, and
`
`correspondingly updates the values of the internally stored hidden state. This RNN
`
`function depends on a number of parameters that are learned during training. When
`
`processing an input sequence, say a list of words, the hidden state is initialized to
`
some value before the RNN moves to process the first word in the sequence as
`
`input. Correspondingly, the RNN updates the value of the internal hidden state in a
`
`way that depends on the first word processed. Once the RNN moves to the next word
`
`in the list, it repeats the same computation only this time the hidden state used in the
`
`function has been updated in a way that depends on the first word. Continuing in this
`
`fashion, the RNN can process any number of input words until there are no more
`
`words to process. At such point, the internal state of the RNN holds a value that has
`
`been continuously updated corresponding to every word in the input. As a result, the
`
`internal state value can now be thought of as a representation of the input words and
`
`as such can be used to carry out a classification or prediction task relating to the
`
`processed input.
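A minimal sketch of the recurrence described above follows; it is my own illustration, the cell uses a simple tanh update, and the embeddings and weights are hypothetical.

```python
# Illustrative sketch only: the core recurrence of a simple RNN cell.
import numpy as np

def rnn_process(words, embed, W_h, W_x, b):
    hidden = np.zeros(W_h.shape[0])          # initialize the hidden state
    for word in words:
        x = embed[word]                       # vector representation of the word
        # Update the hidden state based on the previous state and the current input.
        hidden = np.tanh(W_h @ hidden + W_x @ x + b)
    return hidden                             # representation of the full input sequence

# Toy example with a 3-dimensional hidden state and 2-dimensional embeddings.
rng = np.random.default_rng(0)
embed = {"hello": rng.normal(size=2), "world": rng.normal(size=2)}
W_h, W_x, b = rng.normal(size=(3, 3)), rng.normal(size=(3, 2)), np.zeros(3)
print(rnn_process(["hello", "world"], embed, W_h, W_x, b))
```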
`
`50. Due to the success of RNNs in many applications (e.g. Natural
`
`Language Processing (NLP), speech recognition, image classification) a large
`
`number of RNN variations can be found in the literature. These various types of
`
`RNNs differ in their functionality, components, number of inputs or outputs, and so
`
on. Examples of commonly used RNN models include LSTM, Gated Recurrent Units (GRU),
`
`Sequence to Sequence (Seq2Seq), and further include many other types of
`
`architectures and building blocks. Regardless of this variation, all RNNs share the
`
`same underlying property of retaining a hidden state that is updated according to the
`
`changing input in order to affect the final output of the circuit.
`
`Transformers
51. In their groundbreaking work, Vaswani introduced Transformers, a

new Deep Neural Network (DNN) model that completely relies on the attention
`
`mechanism to draw global dependencies between input and output. Simply put, the
`
`attention mechanism takes in an input encoded as a vector and maps it to an output
`
`that is also a vector.
`
`52. By first mapping a sequence of encoded inputs to a corresponding set
`
`of Queries, Keys, and Values, (QKV) the output of the attention mechanism is
`
`formed as a weighted sum of input Values, where the weight assigned to each Value
`
`is computed as a compatibility function between a Query and a Key.
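For illustration, the following is a minimal sketch of this weighted-sum computation, assuming the well-known scaled dot-product form of attention from Vaswani; the code and dimensions are my own.

```python
# Illustrative sketch only: scaled dot-product attention.
import numpy as np

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # compatibility of each Query with each Key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the Keys
    return weights @ V                               # weighted sum of the Values

# Toy example: 3 input positions, dimension 4.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
print(attention(Q, K, V).shape)  # (3, 4)
```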
`
`53. Although the Attention mechanism already existed in the literature (e.g.
`
`Bahdanau et al.), the main contribution of Vaswani was to rely only on the attention
`
`mechanism to extract any interdependencies between the elements making up the
`
`input sequence.
`
`54. Accordingly, one can easily see that all the building blocks making up
`
`a transformer model (except for the attention block) are element-wise operations that
`
`do not observe interdependency between input elements. Typically, transformers
`
`operate in an auto-regressive fashion where the predicted next element in a sequence
`
`of input elements is concatenated to the input and fed again to the system until a
`
`final special output element (e.g. <eos>) is generated. This mode of operation is very
`
`similar to that used in Seq2Seq RNN-models. (Sutskever et al.)
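A minimal sketch of this auto-regressive loop follows; it is my own illustration, and `predict_next` is a hypothetical stand-in for a full model forward pass.

```python
# Illustrative sketch only: auto-regressive generation until an end token appears.

EOS = "<eos>"

def generate(prompt_tokens, predict_next, max_len=50):
    sequence = list(prompt_tokens)
    while len(sequence) < max_len:
        next_token = predict_next(sequence)   # model predicts the next element
        sequence.append(next_token)           # concatenate it to the input
        if next_token == EOS:                 # stop at the special end-of-sequence token
            break
    return sequence

# Toy "model" that always ends the sequence after one extra token.
print(generate(["hello"], predict_next=lambda seq: EOS))
```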
`
`55. The original Transformer proposed by Vaswani was made up of two
`
`main component-stacks, an encoder-stack followed by a decoder-stack. While both
`
`components have slightly different building blocks and connections, their attention
`
`mechanism differs in one main way.
`
56. In an encoder, attention is computed between all input elements
`
`regardless of their position within the input sequence. On the other hand, the decoder
`
`uses masked attention (or causal attention), which only allows the attention to be
`
`computed between an input element and previous elements up to itself within the
`
`same input sequence. This difference ensures that the decoder, mainly tasked with
`
`predicting the next element in the input sequence, does not break the chain of
`
`causality. That is, an element can only be influenced by itself and previous elements
`
`but not future elements.
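For illustration, the following is a minimal sketch of masked (causal) attention; it is my own and simply sets the scores of future positions to negative infinity before the softmax so that each element attends only to itself and earlier elements.

```python
# Illustrative sketch only: causal (masked) attention.
import numpy as np

def causal_attention(Q, K, V):
    n, d_k = Q.shape[0], K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)   # True above the diagonal (future positions)
    scores = np.where(mask, -np.inf, scores)            # future positions get -inf -> zero weight
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
# Row i of the attention weights depends only on positions 0..i, preserving causality.
print(causal_attention(Q, K, V).shape)  # (3, 4)
```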
`
`57. Many different Transformer architectures have been proposed, but
`
perhaps one of the most popular is the Decoder-Only Transformer (DOT) (Liu et al.),

which is best known for its use in GPT (the Generative Pre-trained

Transformer). See Radford et al.
`
58. In the DOT architecture, the Transformer is made of a number of
`
`decoders stacked in multiple consecutive layers, with their final output used to
`
predict the next element of an input sequence. This architecture was preferred for

its simplicity along with efficient-implementation features, such as

masked attention, which allows the Key and Value elements computed in

previous iterations over the inputted and generated elements to be reused without

requiring any dependency on future elements’ keys and values.
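A minimal sketch of this reuse of Keys and Values across iterations, often referred to as a KV cache, follows; it is my own illustration with hypothetical names.

```python
# Illustrative sketch only: incremental decoding with a Key/Value cache.
import numpy as np

def decode_step(x_new, W_q, W_k, W_v, kv_cache):
    q = x_new @ W_q                                  # Query for the new token only
    kv_cache["K"].append(x_new @ W_k)                # cache the new Key ...
    kv_cache["V"].append(x_new @ W_v)                # ... and the new Value
    K, V = np.stack(kv_cache["K"]), np.stack(kv_cache["V"])
    scores = K @ q / np.sqrt(K.shape[-1])            # attend over all cached positions
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                               # attention output for the new token

rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.normal(size=(4, 4)) for _ in range(3))
cache = {"K": [], "V": []}
for token_embedding in rng.normal(size=(3, 4)):      # three decoding iterations
    out = decode_step(token_embedding, W_q, W_k, W_v, cache)
print(out.shape, len(cache["K"]))  # (4,) 3
```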
`
`Batching in Machine Learning
59. Batching (sometimes called batch processing) is a well-known computing

technique that refers to the idea of combining multiple inputs to be processed in
`
`parallel as a batch. In the context of ML, batching has been widely used for many
`
`years in order to expedite the time required to process inputs. In the training phase
`
`of ML, batching is intimately related to the training algorithms used (e.g. minibatch-
`
`based Stochastic Gradient Descent (SGD)). W