`
`
`
`
`
`
`
`UNITED STATES PATENT AND TRADEMARK OFFICE
`____________
`
`BEFORE THE PATENT TRIAL AND APPEAL BOARD
` ____________
`
`APPLE INC.,
`
`Petitioner
`
`v.
`
`PARUS HOLDINGS, INC.,
`
`Patent Owner
`____________
`
`IPR2020-00686
`
`Patent No. 7,076,431
`
`AND
`
`IPR2020-00687
`
`Patent No. 9,451,084
`
` ____________
`
`
`
`SUPPLEMENTAL DECLARATION OF DR. LOREN TERVEEN
`
`
`
`
`
`IPR2020-00686 and IPR2020-00687
`
`EX1040 Page 1
`
`
`
`
`
`
`
`Supplemental Declaration of Dr. Loren Terveen
`U.S. Patent No. 9,451,084
`
`I, Dr. Loren Terveen, hereby declare the following:
`
I. INTRODUCTION

1. I have been asked to respond to certain issues raised by Patent Owner in Patent Owner’s Response dated December 23, 2020 (“POR”). All of my opinions expressed in my original declaration (Ex. 1003) remain the same. I have reviewed the relevant portions of the POR (Paper 15) and the relevant portions of Mr. Occhiogrosso’s declaration (Ex. 2025) and deposition transcript (Ex. 1039) in connection with preparing this supplemental declaration. References to opinions of the ’431 Patent below are intended as equally applicable to the ’084 Patent.
`
II. OPINIONS

A. A Two-Step Speech Recognition Process Is Described in Both the ’431 and ’084 Patents and Ladd

2. As I discussed in my original declaration (Ex. 1003) at ¶¶ 81-83, Ladd teaches a system for retrieving information by uttering speech commands into a voice enabled device and for providing information retrieved from an information source, such as “web pages” or “web sites.” Specifically, Ladd’s system is an IVR (Interactive Voice Response) system that may answer a question, such as “what is the weather,” from a web site in response to a spoken user request. Ex. 1003, ¶¶ 78, 81-82, citing Ladd, 2:19-64, 3:7-53, 9:1-21. In an IVR system, including specifically Ladd’s, the computing system must determine the content of at least some of the speech uttered by the user in order to identify desired information for retrieval from an appropriate information source. For example, when a user inquires about the current weather in Chicago, the system must determine that the key words “weather” and “Chicago” were spoken and, by comparison to the grammar, determine the command corresponding to the spoken words, i.e., that the user is commanding the system to retrieve Chicago’s weather. Ladd, 2:48-54, 4:64-5:11, 8:23-25, 10:3-11, 11:50-64, 38:4-16. This is in contrast to Mr. Occhiogrosso’s description of mere transcribing of free speech that may occur in some systems, where spoken utterances are transformed from audio messages into text and stored in memory, but no content is determined for any transcribed words. Ex. 1039, Occhiogrosso Dep. Tr., 39:10-40:22.
`
`3.
`
`In order for an IVR system to act upon user speech, it must perform two
`
`distinct steps. In the first step, the speech recognition device simply transforms the
`
`sound wave into text. Ex. 1039, 33:11-16, 49:5-19. At this juncture, the speech
`
`recognition device has not yet determined any content of what was said, i.e., what
`
`instruction is being commanded; it has merely generated a textual data message. Id.
`
`For example, a speech recognition device that has performed only this first step may
`
`generate the character string “weather” after the word “weather” was spoken, but the
`
`device does not yet know what to do in response to the character string “weather.”
`
`There are a number of methods by which a system may perform this first step of
`
`converting the spoken words into text, but Ladd is not specific on how it requires
`
`step one to occur. I note that Mr. Occhiogrosso also agrees there are various speech
`
`
`
`IPR2020-00686 and IPR2020-00687
`
` EX1040 Page 3
`
`
`
`Supplemental Declaration of Dr. Loren Terveen
`U.S. Patent No. 9,451,084
`recognition algorithms to recognize the user’s speech and convert into text. Ex.
`
`1039, 54:6-16.
`
`4.
`
`It is not until the second step of content recognition of the spoken
`
`speech that a speech recognition device determines the content of the spoken words
`
`(e.g., determining that the user uttered “weather” and is therefore instructing the IVR
`
`system to retrieve and respond with the current weather). Mr. Occhiogrosso agreed
`
`with this during his deposition in differentiating between the first step of converting
`
`speech into text and the second step of using a recognition grammar to “address[]
`
`what words are.” Id. at 50:17–51:8. Speech recognition devices that do not determine
`
`the content of transcribed words cannot act in response to the spoken words. Id. at
`
`40:13-22. (Mr. Occhiogrosso opining that when the user is “simply speaking and
`
`there is no higher order context of a recognition grammar that meters or governs the
`
`speech, then the speech recognition engine will dutifully translate what the user is
`
`speaking into text” and that “free speech” or “free text” is “effectively a dictation
`
`application with no imposed recognition grammar”). Systems such as the ’431 Patent
`
`and Ladd must perform both steps to act upon a spoken command to retrieve desired
`
`information, namely the steps of (1) converting speech utterances into text words,
`
`and (2) comparing the textual words to grammar to determine the content of the
`
`spoken command. As I explain further below, the statement in the ’431 Patent that
`
`it “recognizes spoken words without using predefined voice patterns” is
`
`
`
`IPR2020-00686 and IPR2020-00687
`
` EX1040 Page 4
`
`
`
`Supplemental Declaration of Dr. Loren Terveen
`U.S. Patent No. 9,451,084
`characterizing a method of performing the first step of speech recognition
`
`(transforming speech to text). In contrast, Ladd’s description of determining a
`
`“speech pattern” is characterizing a method of performing the second step of speech
`
`recognition (determining the content of the text). ’431 Patent, 4:38-43; Ladd, 9:27-
`
`44. I further note this second step is recited in the claims of the ’431 Patent at
`
`Limitations 1(f)-1(h), which recite the recognition grammar, that the speech
`
`command comprises an information request selectable by the user, and selecting the
`
`recognition grammar upon receiving the speech command.
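The two-step process described above can be summarized in a short illustrative sketch. This is my own simplified example, not code from Ladd or the ’431 Patent; the grammar entries, command names, and function names are hypothetical:

```python
# Illustrative sketch of the two distinct steps of IVR speech recognition.
# Step 1 (speech-to-text) is represented abstractly, since Ladd does not
# specify how the conversion is performed; step 2 compares the resulting
# text against a recognition grammar to determine the commanded content.

# Hypothetical recognition grammar: key words mapped to system commands.
GRAMMAR = {
    "weather": "RETRIEVE_WEATHER",
    "news": "RETRIEVE_NEWS",
}

def step1_speech_to_text(audio_utterance):
    # Stand-in for a speech recognition engine transforming sound to text.
    # After this step alone, no content has been determined.
    return audio_utterance.lower()

def step2_match_grammar(text):
    # Content recognition: detect a key word by comparison to the grammar
    # and return the command associated with it.
    for key_word, command in GRAMMAR.items():
        if key_word in text.split():
            return command
    return None  # free-speech transcription alone yields no command

text = step1_speech_to_text("What is the weather in Chicago")
command = step2_match_grammar(text)
# Only after both steps is the system able to act on the spoken request.
```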
`
`5.
`
`The ’431 Patent confirms the two-step process. Specifically, the speech
`
`recognition engine 300 “converts voice commands received from the user’s voice
`
`enabled device 112…into data messages.” ’431 Patent, 6:4-8. “The media server
`
`106 uses the speech recognition engine 300 to interpret the speech commands
`
`received from the user. Based upon these commands, the media server 106 retrieves
`
`the appropriate web site record 200 from the database 100.” Id. at 16:3-7. Therefore,
`
`the ’431 Patent describes a system where the speech commands are converted into
`
`data messages, i.e., text, and then the converted speech commands are interpreted to
`
`determine what web site record to retrieve.
`
`6.
`
`Ladd also confirms its system performs a two-step speech recognition
`
`process, stating: “The STT unit 256 of the VRU server 234 receives speech inputs
`
`or communications from the user and converts the speech inputs to textual
`
`
`
`IPR2020-00686 and IPR2020-00687
`
` EX1040 Page 5
`
`
`
`Supplemental Declaration of Dr. Loren Terveen
`U.S. Patent No. 9,451,084
`information (i.e., a text message). The textual information can be sent or routed to
`
`the
`
`communication
`
`devices 201, 202, 203 and 204,
`
`the
`
`content
`
`providers 208 and 209, the markup language servers, the voice browser, and the
`
`application server 242.” Ladd, 9:11-54, 10:3-20, 38:4-16. Ladd teaches the VRU
`
`server 234, which includes the ASR unit 254. As I discussed in ¶¶ 90-91 and 110 of
`
`my original Declaration, the ASR unit 254 is a speaker independent speech
`
`recognition device, as recited in the claims of the ’431 and ’084 Patents. I further
`
`note that Ladd teaches a VRU client 232 that is connected to the VRU server 234.
`
`Ladd, 8:3-5. The VRU client is part of the communication node 212, e.g., a mobile
`
`phone. See Ex. 1003, ¶ 91, citing Ladd, 7:28-33, Fig. 3; “The VRU client 232
`
`processes speech communications…from the user.” Ladd, 8:5-7. Ladd further
`
`teaches the VRU client 232 “routes the speech communications to the VRU server
`
`234.” Ladd, 8:7-9. Ladd teaches “It will be recognized that the VRU client 232 can
`
`be integrated with the VRU server.” Ladd, 8:10-11. The VRU client 232 includes
`
`voice communications boards that include a voice recognition unit having a
`
`vocabulary “for detecting a speech pattern (i.e., a key word or phrase).” Ladd, 8:19-
`
`28. Ladd further explains:
`
`The VRU server 234 receives speech communications from the user via
`the VRU client 232. The VRU server 234 processes the speech
`communications and compares the speech communications against a
`vocabulary or grammar stored in the database server unit 244 or a
`IPR2020-00686 and IPR2020-00687
`
` EX1040 Page 6
`
`
`
`
`
`Supplemental Declaration of Dr. Loren Terveen
`U.S. Patent No. 9,451,084
`memory device. The VRU server 234 provides output signals,
`representing the result of the speech processing, to the LAN 240. The
`LAN 240 routes the output signal to the call control unit 236, the
`application server 242, and/or
`the voice browser 250. The
`communication node 212 then performs a specific function associated
`with the output signals.
`
`Ladd, 8:55-67. Ladd then goes on to discuss the VRU server 234 including various
`
`components, including the ASR unit 254. Ladd, 9:1-3. Ladd specifically discloses
`
`the ASR unit determines whether a speech pattern matches any stored grammar or
`
`vocabulary. Ladd, 9:32-36. As I also discuss in ¶ 8 below, Ladd teaches using
`
`various grammars to “interpret the user’s response,” substantially similar to the ’431
`
`Patent. Ladd, 19:24-26.
`
`7.
`
`Reading the above-discussed disclosures collectively, it is my
`
`understanding that the user’s speech communications are received by the VRU client
`
`232 and transmitted to the VRU server 234, which then processes the user’s speech
`
`communications. Specifically, the speech communications are compared against a
`
`vocabulary or grammar. Based on the comparison, the communication node
`
`performs specific functions. Therefore, in my opinion, because the VRU client 232
`
`may be integrated in the VRU server 234, and the VRU server 234 includes the ASR
`
`unit 254 (which I discussed at ¶¶ 91-93, 110 of my previous declaration), the
`
`
`
`IPR2020-00686 and IPR2020-00687
`
` EX1040 Page 7
`
`
`
`Supplemental Declaration of Dr. Loren Terveen
`U.S. Patent No. 9,451,084
`comparison of the user’s speech communications to the vocabulary or grammar is
`
`performed by the ASR unit.
`
`8.
`
`Ladd also teaches a GRAMMAR input “used to specify an input
`
`grammar when interpreting the user’s responses.” Ladd, 20:48-58. Prior to
`
`interpreting the user’s responses using the grammar input, Ladd teaches collecting
`
`input from the user and “convert[ing] the input to text using the speech to text unit”
`
`and then sending the text to the markup language server (which proceeds to perform
`
`the functions instructed by the user’s spoken words). Ladd, 20:5-10; see also id. at
`
`20:20-21 (“The FORM input makes use of the speech to text unit to convert user
`
`input to text.”), 20:23-27 (discussing that if the user said “John Smith,” then the text
`
`string “john smith” would be sent to the server). Thus, user’s responses are
`
`interpreted using input grammars for various categories described throughout Ladd,
`
`including, for example, a DATE input grammar for interpreting dates (Ladd, 19:22-
`
`26), and a MONEY input grammar for interpreting a user’s response related to the
`
`input of money (Ladd, 21:61-64). In each example, the user’s speech is transformed
`
`to text, and then the system to determines how to interpret the input (i.e., selecting
`
`a recognition grammar). Ladd, 19:22-26, 21:61-64. Therefore, Ladd teaches, in my
`
`opinion, performing a first step of converting the speech input into text. After the
`
`speech input is converted into text, the Ladd system interprets the user commands
`
`by identifying key words. These key words are, per Ladd, speech patterns.
`
`
`
`IPR2020-00686 and IPR2020-00687
`
` EX1040 Page 8
`
`
`
`Supplemental Declaration of Dr. Loren Terveen
`U.S. Patent No. 9,451,084
`I also note Ladd describes using a commercially-available product from
`
`9.
`
`a company called Nuance to transform speech into text. This is the same
`
`commercially-available product from Nuance that the ’431 and ’084 Patents
`
`describe, further indicating to me that Ladd performs teaches a speaker independent
`
`speech recognition device substantially similar to the speaker independent speech
`
`recognition device described and claimed in the ’431 Patent. Ex. 1001, 6:4-24; Ladd,
`
`8:23-28.
`
B. Ladd Equates a “Grammar” with a “Vocabulary”

10. Ladd repeatedly equates a grammar and a vocabulary, using the terms interchangeably. Ladd, 4:22-25, 6:25-29, 9:32-35, 10:12-14. I also note Mr. Occhiogrosso agreed on the equivalence relationship between a grammar and a vocabulary. Ex. 1039, 17:21-18:3, 19:8-12. Thus, a PHOSITA would have recognized that Ladd’s meaning of “grammar” generally equates with its meaning of “vocabulary.” Notably, Mr. Occhiogrosso also stated that grammars do not “have anything to do” with [predefined] voice patterns, expressly stating that they are not correlated. Id. at 31:9-11, 32:24–33:1.
`
C. Ladd Defines a Speech/Voice Pattern as a Key Word or Key Phrase

11. As I discussed in my previous declaration, Ladd expressly identifies its system as providing “speaker independent automatic speech recognition of speech inputs,” and processing the speech inputs “to determine whether a word or speech pattern matches any of the [stored] grammars or vocabulary.” Ex. 1003, ¶ 90, citing Ladd, 9:28-44, 8:19-28. Specifically, Ladd states that its system “may include a voice recognition system engine having a vocabulary for detecting a speech pattern (i.e. a key word or phrase).” Ladd, 8:23-25. Thus, when the user utters a spoken command, the two steps of speech recognition discussed above are performed. First, the sound of the spoken command is transformed into text. Ladd refers to words spoken by a user as “speech communications” or “speech inputs.” Ladd, 8:58-61, 9:28-30. The textual version of the words is then compared to a “vocabulary” or “grammar” to identify key words or key phrases invoking functions in the system. Ladd, 9:28-38, 10:3-20. Specifically, Ladd states that “the ASR unit 254 sends an output signal to implement the specific function associated with the recognized voice pattern.” Ladd, 9:35-38. The speech/voice pattern, i.e., the key word or phrase corresponding to the spoken command, is only recognized in the second step of the process, after the speech inputs have been transformed into text.
`
`12.
`
`In my opinion, Ladd provides an express definition of a speech or voice
`
`pattern as a key word or phrase. As discussed in my previous declaration, Ladd states
`
`“…a speech pattern (i.e. a key word or phrase).” Ladd, 8:23-25; Ex. 1003, ¶¶ 106-
`
`107, 111-112. Here the “i.e.” means “in other words” or “that is,” which I understand
`
`to mean the key word and key phrase are being presented as other words for or a
`
`definition of a “speech pattern.”
`
`
`
13. It is also my opinion a PHOSITA would have understood Ladd’s teaching of “…a speech pattern (i.e. a key word or phrase)” (Ladd, 8:23-25) means the “key” is modifying both “word” and “phrase,” meaning Ladd is searching for a key word or key phrase.
`
14. My opinion that Ladd provides an express definition of speech/voice pattern as a key word or phrase is consistent with multiple other teachings in Ladd, including in the context of the overall sentence at 8:23-25: “The voice communication boards may include a voice recognition engine having a vocabulary for detecting a speech pattern (i.e., a key word or phrase).” Here, Ladd is stating that the voice recognition engine (1) has a vocabulary, (2) the vocabulary is used to detect a speech pattern, and (3) the speech pattern is a key word or phrase. Shortly after this teaching in Ladd, Ladd further explains that the speech inputs are compared against a “vocabulary” or “grammar” to detect key words or phrases. Ladd, 8:58-61, 10:3-20. Matching of the speech patterns to the grammar or vocabulary is also discussed at 9:28-44.
`
15. The understanding that Ladd’s speech/voice patterns are key words or phrases is further confirmed in the discussion at 9:28-44 (which I discussed in my original declaration, Ex. 1003, ¶¶ 105-108):

The ASR unit 254 of the VRU server 234 provides speaker independent automatic speech recognition of speech inputs or communications from the user. It is contemplated that the ASR unit 254 can include speaker dependent speech recognition. The ASR unit 254 processes the speech inputs from the user to determine whether a word or a speech pattern matches any of the grammars or vocabulary stored in the database server unit 244 or downloaded from the voice browser. When the ASR unit 254 identifies a selected speech pattern of the speech inputs, the ASR unit 254 sends an output signal to implement the specific function associated with the recognized voice pattern. The ASR unit 254 is preferably a speaker independent speech recognition software package, Model No. RecServer, available from Nuance Communications. It is contemplated that the ASR unit 254 can be any suitable speech recognition unit to detect voice communications from a user.

Here, Ladd is explaining the automatic speech recognition unit (ASR unit) processes the speech inputs to determine whether a speech pattern matches any stored grammar or vocabulary. If there is a match, i.e., the user spoke a speech pattern (that is, a key word or phrase) matching a grammar/vocabulary, then the ASR unit outputs a signal associated with the word selected by the user, i.e., the key word or phrase spoken by the user. Ladd provides several examples of the user speaking a key word or phrase to thereby make a selection. For instance, Ladd describes a process by which its IVR system may conduct a dialog with a user concerning selecting a desired soda. Ladd, 17:1-27, 23:40-44, Claim 8. The user’s spoken words are matched against a set of key words for sodas including Coke, Pepsi, 7Up, and root beer. Ladd, 17:1-35. Depending on the key word detected, a next step in the browser is selected by the system. Ladd, 23:40-44, Claim 8. Thus, the key word or phrase matching the grammar or vocabulary (i.e., step 2) causes the ASR unit to “send an output signal to implement the specific function associated with the recognized voice pattern,” determining the system’s response. Ladd, 9:36-39. Consequently, by speaking a key word or phrase, e.g., by saying “Coke,” the user selects a speech pattern (the key word “Coke”), and the ASR identifies the selected speech pattern (the key word “Coke”) by matching the speech inputs to the grammar/vocabulary. The “selected speech pattern” discussed at Ladd, 9:36-38 is the speech pattern, i.e., the key word or phrase, selected to be spoken by the user.
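The soda-selection dialog discussed above can be sketched in simplified form. This is my own illustrative example of key-word (speech pattern) matching, not code from Ladd; the vocabulary contents and output-signal names are hypothetical:

```python
# Hypothetical sketch of detecting a speech pattern (key word or phrase)
# in transcribed speech and implementing the specific function associated
# with the recognized pattern, in the manner of Ladd's soda dialog.

SODA_VOCABULARY = {"coke", "pepsi", "7up", "root beer"}

def detect_speech_pattern(transcribed_text):
    # Step 2: compare the transcribed text against the vocabulary to
    # find a key word or phrase (the "recognized voice pattern").
    text = transcribed_text.lower()
    for key_phrase in SODA_VOCABULARY:
        if key_phrase in text:
            return key_phrase
    return None

def next_dialog_step(transcribed_text):
    pattern = detect_speech_pattern(transcribed_text)
    if pattern is None:
        return "REPROMPT_USER"
    # Output signal implementing the function for the recognized pattern.
    return "DISPENSE_" + pattern.upper().replace(" ", "_")
```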
`
16. Reading at least these collective disclosures (discussed in the above two paragraphs) together, a PHOSITA would reasonably understand the user’s speech inputs that are converted to text are then compared to a vocabulary/grammar to detect a speech pattern, where the speech pattern is a key word or phrase.
`
`17.
`
`I also note Ladd uses the phrase “speech pattern” and “voice pattern”
`
`elsewhere. Ladd, 4:15-18, 6:50-57. In my opinion, each of these discussions is
`
`consistent with my opinion that Ladd uses the phrase speech/voice pattern to mean
`
`a key word or phrase. In the discussion at 4:15-18, Ladd is discussing speaker-
`
`dependent speech recognition where the speaker is identified by “detecting a unique
`
`speech pattern”: “The system may also identify the user by detecting a unique speech
`
`
`
`IPR2020-00686 and IPR2020-00687
`
` EX1040 Page 13
`
`
`
`Supplemental Declaration of Dr. Loren Terveen
`U.S. Patent No. 9,451,084
`pattern from the user (i.e., speaker verification) or a PIN entered using voice
`
`commands or DTMF tones.” Ladd includes a similar teaching at 6:50-57:
`
`the electronic network 206 from a
`the user accesses
`When
`communication device not registered with the system (i.e., a payphone,
`a phone of a non-subscriber, etc.), the node answers the call and
`prompts the user to enter his or her name and/or a personal
`identification number (PIN) using speech commands or DTMF tones.
`The node can also utilize speaker verification to identify a particular
`speech pattern of the user.
`
`Ladd, 6:50-57. In my opinion, these disclosures (4:15-18, 6:50-57) discussing a
`
`speech pattern are referring to the user uttering a unique key word or phrase to
`
`identify the user to the device. For example, the sentence at 4:15-18 is discussing
`
`that the system may identify the user and provides examples for identification as the
`
`unique speech pattern from the user or a PIN. A PIN is commonly understood as a
`
`unique identifier. Similarly, a unique key word or phrase, i.e., the disclosed “unique
`
`speech pattern,” would also have been understood to identify the user, akin to the
`
`user saying a password. Therefore, in my opinion, Ladd’s disclosure at 4:15-18 is
`
`describing a circumstance where the user can identify himself or herself to the
`
`system by saying a unique key word or phrase or by saying a PIN.
`
`18.
`
`I note Ladd, 4:15-18 is an example of speaker-dependent speech
`
`recognition, but only insofar as the user is speaking a unique key word or phrase,
`
`
`
`IPR2020-00686 and IPR2020-00687
`
` EX1040 Page 14
`
`
`
`Supplemental Declaration of Dr. Loren Terveen
`U.S. Patent No. 9,451,084
`i.e., the unique speech pattern, to identify the user to the device. That is, there is
`
`nothing in Ladd that indicates the user’s unique voice attributes, akin to voice
`
`printing, is being identified in Ladd. The user’s spoken words are recognized and
`
`converted into text, but it is the content recognition of determining the user spoke a
`
`unique speech pattern that actually identifies the user to the device. This method of
`
`user identification makes sense within the context of Ladd, which is intended for use
`
`by users from any network-enabled device. Ladd, 2:40-47.
`
19. Similarly, the discussion at 6:50-57 is stating the user can identify himself or herself to the system by entering a PIN. The following sentence is “[t]he node can also use speaker verification” to identify a “particular speech pattern” of the user. Similar to the disclosure at 4:15-18, I understand this section of Ladd to be explaining that the user can verify his/her identity to the system based on a particular speech pattern, i.e., a particular key word or phrase the user speaks to the system.
`
20. Ladd also uses the term speech/voice pattern at 6:29-34, which states:

The node 212 can provide various dialog voice personalities (i.e., a female voice, a male voice, etc.) and can implement various grammars (i.e., vocabulary) to detect and respond to the audio inputs from the user. In addition, the communication node can automatically select various speech recognition models (i.e., an English model, a Spanish model, an English accent model, etc.) based upon a user profile, the user’s communication device, and/or the user’s speech patterns. The communication node 212 can also allow the user to select a particular speech recognition model.

Ladd, 6:29-34. Here, Ladd is discussing speech recognition models that recognize the English language (the “English model”), the Spanish language (the “Spanish model”), or users who speak with an English accent (the “English accent model”). Ladd discloses the speech recognition model could be selected based on the user profile. In my opinion, selection of the speech recognition model based on the user profile may occur in instances where the user previously indicated their native language is English or Spanish or where they indicated they were born in or lived in a country that speaks with an English accent (e.g., the UK). Ladd also discloses the speech recognition model could be selected based on the user’s communication device, which indicates to me a geographic region in which the user is located. Finally, Ladd discloses the speech recognition model could be based on the user’s speech patterns. Because we know from Ladd, 8:23-25 that the speech patterns are key words or phrases, I understand this disclosure at Ladd, 6:29-34 to be stating that, based on key words or phrases the user speaks, a certain speech recognition model is selected. Although not detailed in Ladd, an example that would reasonably be contemplated in view of Ladd’s disclosure is the user speaking a particular word, e.g., “pesos” instead of “dollars,” in response to a prompt of “How much would you like to deposit?” See Ladd, 22:4-19 (describing a MONEY input where the user is depositing money via the IVR). In this instance, if the user says “pesos,” which is recognized as the currency of Mexico, then the system would identify the speech pattern of “pesos” as a key word indicating the Spanish model with a Spanish speech recognition engine should be employed with the user.
`
21. My opinion is further confirmed by Ladd’s disclosure at 4:20-36, which describes selecting a grammar and a personality based on various factors, including the accent of the caller. Selection of a grammar based on the accent of the caller indicates to me the Ladd IVR system is advanced enough to recognize an accent and select a particular grammar based on the accent. Thus, one example in Ladd is an English accent. British-English speakers often use different terms than American-English speakers to identify a thing, such as “pram” in Britain versus “stroller” in America. Therefore, should the user be recognized as having an English accent, I understand Ladd’s disclosure at 4:20-36 as selecting a grammar based on the English accent. For example, the selected grammar may then recognize “pram” as the equivalent of a stroller if spoken by the user.
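The grammar-selection example above can be illustrated with a brief sketch. This is my own hypothetical illustration of how a grammar selected for an accent might map regional terms to their equivalents; the grammar names and term mappings are assumptions, not disclosures of Ladd:

```python
# Hypothetical sketch of selecting a grammar based on a detected accent
# and recognizing regional synonyms as their canonical equivalents.

GRAMMARS = {
    # British-English grammar: maps regional terms to American equivalents.
    "british_english": {"pram": "stroller", "lorry": "truck"},
    # American-English grammar: terms pass through unchanged.
    "american_english": {},
}

def select_grammar(detected_accent):
    # Choose the grammar associated with the caller's accent.
    return GRAMMARS.get(detected_accent, GRAMMARS["american_english"])

def interpret(word, grammar):
    # Recognize a regional term as its equivalent, if the grammar maps it.
    return grammar.get(word, word)
```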
`
`22.
`
`In sum, each of the disclosures in Ladd that reference a speech/voice
`
`pattern inform me that Ladd consistently uses the term to describe a key word or
`
`phrase, where recognition of the spoken word as a key word or phrase is determined
`
`by matching the voice pattern against the grammar or vocabulary. Additionally, in
`
`each instance Ladd uses the phrase “speech pattern” or voice pattern,” a PHOSITA
`
`
`
`IPR2020-00686 and IPR2020-00687
`
` EX1040 Page 17
`
`
`
`Supplemental Declaration of Dr. Loren Terveen
`U.S. Patent No. 9,451,084
`would have understood Ladd’s use of a “speech pattern” or “voice pattern” is not
`
`the same as the “predefined voice pattern” that the ’431 and ’084 Patents purportedly
`
`exclude. Ex. 1001, 4:30-43. Instead, Ladd’s “speech patterns” are key words or
`
`phrases determined using grammar (or vocabulary), which is not only allowed by
`
`the ’431 and ’084 Patents but is explicitly performed in at least one example
`
`described by the patents. See, e.g., Ex. 1001, 6:44-56. This conclusion regarding the
`
`distinction between the “predefined voice pattern” excluded by the ’431 and ’084
`
`Patents and the “speech pattern” of Ladd would have been recognized by a
`
`PHOSITA by any of the reasons I discuss here, each supported by evidence in the
`
`’431 and ’084 Patents, Ladd, and opinions from myself and/or Mr. Occhiogrosso.
`
D. Ladd Does Not Include Any Disclosure Indicating the Disclosed Speech/Voice Patterns Are Spectral Energy as a Function of Time

23. Parus’s construction of a “speaker-independent speech recognition device” requires a “speech recognition device that recognizes spoken words without using predefined voice patterns.” Paper 15, 21-24. Mr. Occhiogrosso explained in his deposition that it is Parus’s position that the excluded “predefined voice patterns,” as the ’431 Patent uses the term, are “a word or utterance, and its spectral energy—typically—spectral energy as a function of time.” Ex. 1039, Dep. Tr., 25:12-17, 25:22–26:13, 30:10-16.
`
`24.
`
`In my opinion, no description of spectral energy (as a function of time
`
`or otherwise) appears in Ladd explicitly nor would have been understood by a
`IPR2020-00686 and IPR2020-00687
`
` EX1040 Page 18
`
`
`
`
`
`Supplemental Declaration of Dr. Loren Terveen
`U.S. Patent No. 9,451,084
`PHOSITA to have been performed by Ladd implicitly or inherently. None of the
`
`citations I discussed above where Ladd uses the phrase “speech pattern” or “voice
`
`pattern” indicate to me that Ladd is using the phrase to mean spectral energy over
`
`time. Moreover, I went through each use of the phrase in Ladd above and provided
`
`an explanation for why the phrase is consistently used in Ladd to mean a key word
`
`or phrase. Additionally, there is no disclosure in Ladd that would, in my opinion,
`
`teach or suggest to a PHOSITA that Ladd even converts speech to text using the
`
`spectral energy of speech input as a function over time. Ladd does not detail how the
`
`speech is converted into text and instead just states that the user’s speech inputs are
`
`converted into text. See, e.g., Ladd, 9:45-54. This is understandable within the
`
`context of Ladd, where the discussion focuses on Ladd’s advanced IVR system and
`
`not mere speech recognition of converting speech into text, which was well-known
`
`at the time of Ladd and discussed in the Background of my original Declaration.
`
25. Returning to the previous discussion of the two steps of speech and command recognition, the difference between the ’431 and ’084 Patents’ “predefined voice patterns” and Ladd’s “speech patterns” is, in my opinion, evident: spectral energy as a function of time is one method by which a system might transform audible sound into text (and is thus one method of performing the first step of speech recognition), while Ladd’s “speech pattern” merely refers to detection of key words to determine content (a method of performing the second step of recognizing a command).
`
E. Sequential Access of Websites

26. Mr. Occhiogrosso opines that “sequential access of websites is very different from sequential access of a database.” Ex. 2025, 94-95. This is incorrect, as I discussed in my previous declaration. Ex. 1003, ¶¶ 103-104, 122. Both a website and a database electronically store information for access via a network using network addresses. See, e.g., Ex. 1004, 11:50-63; Ex. 1006, 9:33-44. Ladd expressly states that one of its “content sources” may include “a database, scripts, and/or markup language documents or pages,” illustrating to a PHOSITA that databases and web pages are treated similarly in Ladd as sources of content, and that methods applied to databases would have been appropriate for application to website searches as well. Ex. 1004, 11:50-63. The frequency with which a content source, be it a website or a database, is updated is irrelevant to the process taught by Ladd in view of Kurosawa and Goedken, in which the algorithm simply continues searching until the information to be retrieved is found. If the information is not found on any particular website, or if that website is unavailable, the algorithm will continue until the information to be retrieved is found.
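The continue-until-found behavior described above can be sketched briefly. This is my own illustrative example of sequential access across content sources, not code from Ladd, Kurosawa, or Goedken; the source representation and function names are hypothetical:

```python
# Hypothetical sketch of sequentially accessing content sources (websites
# or databases, treated uniformly) until the requested information is found.

def fetch(source, query):
    # Stand-in for retrieving information from one content source by its
    # network address; returns None if the source lacks the information
    # (an unavailable source would behave the same way here).
    return source.get(query)

def sequential_search(sources, query):
    # Continue through the sources in order until information is found.
    for source in sources:
        result = fetch(source, query)
        if result is not None:
            return result
    return None

sources = [
    {},                          # first source: information not found
    {"weather": "sunny, 72F"},   # second source: contains the answer
]
# sequential_search(sources, "weather") skips the first source and
# returns the information from the second.
```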
`
`