DESIGN OF A COMMAND INTERFACE WITH A DYNAMIC GRAMMAR SPEECH RECOGNITION ENGINE

S. Kruger, S. Chennoukh, J. Flanagan, L. Rothkrantz
Delft University of Technology, ITS
Zuidplantsoen, BZ Delft, The Netherlands
email: L.J.M.Rothkrantz@cs.tudelft.nl
ABSTRACT

This paper presents the design of an open-grammar speech recognition engine. The engine is built to operate in real time and is kept simple. The objective of this paper is to show how task-dependent knowledge can improve speech recognition accuracy by cancelling out parts of the command set depending on the state of the interface.
The system is a phone-based recognizer, and therefore the vocabulary and grammar can easily be extended without retraining the system.
The hypothesis of this paper is that speech recognition accuracy will improve if higher-level knowledge about the state of a task is used by the recognition engine.
1 Introduction
Recently the first commercial automatic speech recognition (ASR) systems became available. Speech recognition technology is successfully being applied in telephone inquiry systems, and the keyboard of computers can be replaced by the human voice to deliver short commands.
In all these applications the grammar and the dictionary are limited. To design such a system from scratch takes a lot of time and effort in training acoustic models for a specific application. In another approach, a general ASR system that theoretically can recognize any vocabulary is restricted to a strictly limited number of words and a grammar. The general system contains pretrained models for speech units, which can be combined to form words that are specific for a certain task.
To realize a real-time ASR system, a reduction of resources is necessary. The advantage of a pretrained general ASR system is that less effort is needed to apply it to a specific task. However, to realize a high recognition rate, additional training for the specific task is necessary.
Usually a task can be split into a number of subtasks, and accordingly a subset of all possible commands. Depending on the state of the system, not all tasks can be performed and not all commands are available at all times. For each subtask a separate ASR system could be built, focused on its commands, but this is a waste of resources. It is better to have one ASR system for the complete task and have it activate parts of the command sets depending on the state of the system.
In this paper we use the general approach to realize an ASR application. We train the acoustic models for general purpose, then perform additional training for a specific application. Then the impact of using knowledge about the state of the system is examined.
2 Open Vocabulary
The idea of using one speech recognition system for several applications is not new. Ideally, a system that recognizes a very large vocabulary can be used for many tasks. The concept to keep these large systems in proportion was to chop speech up into speech units, train the speech units, and during recognition patch the units back together to form sensible language.
Many speech units have been proposed, from compound syllables, syllables, and phonemes to subphone units. Although there is still a lot of research into which unit best matches actual speech, the phone is most widely used for its size and the associated number of different units, its intuitive use, and for historical reasons, having been used in dictionaries for ages.
For a large-vocabulary speech recognition system, a dictionary is created with a large number of the most used words in natural language. Usually this is sufficient to support most tasks.
The next step towards application-specific speech recognition, while retaining the all-purpose engine, is the open-vocabulary method. Any given task does not need all these words, but only a subset. But while a dictionary is created fairly easily, the acoustic models take a lot of effort to create. All the speech units are trained for general use, and a task-specific vocabulary is constructed for each application.
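The open-vocabulary scheme described above can be sketched as follows: word models are assembled by concatenating pretrained phone models, so a new task vocabulary needs only a pronunciation dictionary, not acoustic retraining. The phone inventory, the pronunciations, and the model placeholders here are purely illustrative, not the paper's actual data.

```python
# Pretrained acoustic models, one per phone; the strings stand in for HMMs.
PHONE_MODELS = {p: f"hmm<{p}>" for p in ["k", "r", "iy", "ey", "t", "m", "uw", "v"]}

# Task-specific pronunciation dictionary (illustrative transcriptions).
PRONUNCIATIONS = {
    "CREATE": ["k", "r", "iy", "ey", "t"],
    "MOVE":   ["m", "uw", "v"],
}

def build_word_model(word):
    """Concatenate pretrained phone models to form a word model."""
    return [PHONE_MODELS[p] for p in PRONUNCIATIONS[word]]

print(build_word_model("MOVE"))   # ['hmm<m>', 'hmm<uw>', 'hmm<v>']
```

Extending the vocabulary then amounts to adding an entry to the dictionary; no model is retrained.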
3 Dynamic grammar speech recognition
Dialog systems already use a dialog structure to interpret the output of a speech recognizer in terms of the expected input from the user. Most dialog systems take
Authorized licensed use limited to: MIT Libraries. Downloaded on October 22, 2022 at 01:45:23 UTC from IEEE Xplore. Restrictions apply.

Petitioner's Ex. 1017, Page 1
Figure 1: Word category. (The SIMPLE_COMMAND category contains the words CREATE, MOVE, DELETE, HIGHLIGHT, POP UP, and POP DOWN.)

Figure 2: Command sets corresponding with subtasks. (A Venn diagram in which three overlapping subtasks, Subtask 1 to Subtask 3, partition the commands into sets C-set 1 to C-set 5.)
input from a speech recognizer that recognizes natural language and then filter out the relevant words that the recognizer produces.
To do this, the dialog system uses a slot-filling algorithm. Depending on the state of the dialog, it expects the user to utter words from a small set that would be probable at the time. For example, the system may expect the user to utter a number and a product. It then creates a number slot and a product slot and tries to fill these slots with the input from the speech recognizer.
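A minimal sketch of such a slot-filling step; the slot names and value sets below are invented for illustration, not taken from any particular dialog system.

```python
# Toy inventory of slot values (assumed for illustration).
SLOT_VALUES = {
    "number":  {"one", "two", "three"},
    "product": {"coffee", "tea"},
}

def fill_slots(expected_slots, recognized_words):
    """Match recognizer output against the value sets of the open slots."""
    slots = {name: None for name in expected_slots}
    for word in recognized_words:
        for name in expected_slots:
            if slots[name] is None and word in SLOT_VALUES[name]:
                slots[name] = word
    return slots

print(fill_slots(["number", "product"], ["please", "two", "coffee"]))
# {'number': 'two', 'product': 'coffee'}
```

Words that match no open slot ("please" above) are simply ignored, which is the filtering behavior the text describes.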
This knowledge about the state of the dialog can be used to the advantage of the speech recognizer, by telling it what to expect in the current situation. Not only accuracy but also speed may improve because of the smaller set of options to consider.
In figure 2 the commands of three subtasks are denoted symbolically as Venn diagrams. From these diagrams, five command sets are found. A speech recognition system can enable different sets when different subtasks are active. When a particular subtask is active, the command sets that contain its commands must be enabled.
A natural language speech recognizer has a language model based on statistical properties of natural language, such as the occurrence probabilities of word pairs or triples. Incorporating knowledge about the state of the dialog into such a language model requires calculating conditional occurrence probabilities. Calculating the general statistics of natural language is a huge task indeed, and adding this knowledge would require even more effort.
This paper concentrates on speech recognition in command interface environments. Dialogs in such an environment may be more constrained in syntax. This allows the use of a speech recognizer with a grammar language model rather than a natural language recognizer. The language model is not based on statistical properties of language but is predefined for a task. The grammar specifies which commands are valid and what their syntax is, and each command has equal occurrence probability.
Based on the state of the dialog, the command set can
Figure 3: Grammar graph. (Paths run from a Start node to an End node through word-category nodes such as DIRECTED COMMAND, MOVEABLE OBJECT, LOCATION, DIRECTION, and SINGLE COMMAND.)
vary. This knowledge can easily be integrated in the speech recognizer by enabling or disabling parts of the grammar. This is called dynamic grammar speech recognition (DGSR). If the dialog system and the speech recognizer are integrated closely, they can share higher-level knowledge to optimize accuracy.
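The state-dependent enabling of command sets can be sketched as follows. The subtask names and the subtask-to-set mapping are assumptions for illustration; they are not taken from the paper's grammar.

```python
# Command sets as in figure 2, populated with words from figure 1 (assumed split).
COMMAND_SETS = {
    1: {"CREATE", "DELETE"},
    2: {"MOVE", "HIGHLIGHT"},
    3: {"POP UP", "POP DOWN"},
}

# Which command sets each subtask needs (hypothetical mapping).
SUBTASK_SETS = {"window": [1, 2], "finder": [2, 3]}

def active_commands(subtask):
    """Union of the command sets enabled for the active subtask."""
    enabled = set()
    for set_id in SUBTASK_SETS[subtask]:
        enabled |= COMMAND_SETS[set_id]
    return enabled

print(sorted(active_commands("window")))
# ['CREATE', 'DELETE', 'HIGHLIGHT', 'MOVE']
```

Commands outside the enabled sets are never hypothesized, which is what shrinks the recognizer's search space.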
4 Grammar
The grammar is structured in such a way that the speech recognition engine can easily find an optimal path through it when presented with input. A grammar compiler builds a graph from a grammar definition in which the basic components are instances of the acoustic models.

A grammar is built up from word categories or word classes (see figure 1). One category contains an arbitrary number of words or concatenated words and works like an or-port: the grammar compiler constructs a parallel path for each word in the category. A category can be viewed as a node, because it has a number of arcs fanning in and a number of arcs fanning out, but the inside of the node is shielded.

A graph is constructed by defining the directed arcs between word categories (figure 3). The grammar compiler replaces each word by its phone sequence, and each phone by an instance of its acoustic model plus some variables for keeping track of the recognition process.

The recognition process is really a scheme of updating scores of state instances in the graph and propagating these values from the start node to the end node. At the end of a sentence, the optimal word sequence is traced back through the graph to present the resulting sentence.
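A toy sketch of the compilation step described above, assuming a small lexicon and a grammar stored as nodes carrying a word list and a predecessor list (all names here are invented):

```python
# Phonetic transcriptions (illustrative).
LEXICON = {"MOVE": ["m", "uw", "v"], "DELETE": ["d", "ih", "l", "iy", "t"]}

# Grammar graph: node -> (words in the category, predecessor nodes).
GRAMMAR = {
    "start":   ([], []),
    "command": (["MOVE", "DELETE"], ["start"]),
    "end":     ([], ["command"]),
}

def compile_node(node):
    """Expand a category node into parallel phone paths (the or-port)."""
    words, _predecessors = GRAMMAR[node]
    return {w: LEXICON[w] for w in words}

print(compile_node("command"))
# {'MOVE': ['m', 'uw', 'v'], 'DELETE': ['d', 'ih', 'l', 'iy', 't']}
```

In a full compiler each phone would further expand into an acoustic-model instance with per-state score variables; here the phone symbols stand in for those instances.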
5 Implementation
To test the goals of this paper, a speech recognition engine is used that can accept any grammar and recognize only sentences that are valid according to that grammar. The engine is an open-vocabulary, phone-based speech recognizer based on the hidden Markov model method.
The grammar is represented by a directed graph that connects word categories. Each word category may contain any number of words or word sequences. The word category works as an or-construct. A word category is instantiated as a node in the grammar graph. The arcs of the graph are specified by identifying the predecessors of each node. Every graph contains at least a start node and an end node.
Each path in the graph represents the syntax of a command, where each node may be substituted by one of its words. The speech recognition engine compiles a network from this graph by substituting each word by its phonetic transcription and each phone by an instance of the acoustic model.
A Viterbi search finds the best path in this network given a stream of features extracted from the input utterance. An evaluation algorithm determines the number of word errors. The speech recognizer can only produce sentences that are valid according to the grammar. Therefore an erroneous output is still a sentence that is in the grammar, and substitution errors are the most common. For a deletion or an insertion error to occur, the wrong sentence must be in the grammar.
It is easy to merge two grammars into one. All nodes of both grammars are copied into one file, except that the start and end nodes are copied once, and the end node's predecessor list is a concatenation of the two original ones. It is also easy to disable part of a grammar: if one arc in a path from start to end node is removed, the path becomes invalid and the recognizer will no longer recognize the corresponding syntax. It is of course wise to cut the path at the beginning of the network to reduce computational load.
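The merge and disable operations just described can be sketched on a grammar stored as a mapping from node name to predecessor list; the node names below are invented examples.

```python
def merge(g1, g2):
    """Merge two grammars: copy all nodes once; the end node's
    predecessor list is the concatenation of the two originals."""
    merged = {**g1, **{n: p for n, p in g2.items() if n not in ("start", "end")}}
    merged["end"] = g1["end"] + g2["end"]
    return merged

def disable(grammar, node):
    """Cut the arcs into `node` near the start of the network, so every
    path through it becomes invalid at low computational cost."""
    g = dict(grammar)
    g[node] = []            # remove the node's incoming arcs
    return g

g_finder = {"start": [], "open": ["start"], "end": ["open"]}
g_panels = {"start": [], "show": ["start"], "end": ["show"]}
g = merge(g_finder, g_panels)
print(g["end"])             # ['open', 'show']
```

Disabling a subgrammar is then just emptying the predecessor lists of its entry nodes, which matches the dynamic-grammar mechanism of section 3.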
6 Experiment
The objective of the experiment is to find out how the recognition accuracy improves if a grammar is reduced to subtasks. To this end, a grammar is designed that consists of subgrammars belonging to subtasks.
The global task is the interface of a computer. The subtasks are Window Management (the Finder), accessing Extensions, and activating Control Panels.
At some point in time perhaps not all tasks are active. This means that the system need not expect the user to input any commands that belong to other tasks, and that subgrammar may then be disabled.
The test set from one subtask is first tested with the recognizer with the full grammar. It is interesting to see how the performance increases when subgrammars are disabled.
The speech recognizer is provided with simple context-independent acoustic models (monophones). The phones are represented by continuous-density HMMs, with a mixture of Gaussians representing the pdf of each state. The acoustic models are first trained on the TIMIT database. Initial models are estimated with a segmentation of the time-aligned TIMIT data. The embedded Baum-Welch algorithm is used to re-estimate the models. The TIMIT database consists of clean speech of very good quality and low noise levels. To make the speech recognizer perform well in the testing environment, the acoustic models are retrained with speech recorded in the testing environment.
Because the system has context-independent acoustic models, not much training data is needed to estimate the models. To make the system perform well for this experiment, the models are retrained on speech data from the grammar of the global task. This makes the context-independent models tend to model the phones in the contexts in which they appear in the task.
For the training and test data, sentences were generated from the grammar. This set was divided into subsets, and each speaker recorded one of these subsets; part of the speakers are in the test set and the rest are in the training set.
The sentences are allocated among the tasks FINDER, CONTROL PANELS, and EXTENSIONS. For the purpose of the experiment four grammars are defined: one for each task and one general grammar containing all three tasks.
In the experiment, first the complete test set is run through the general grammar. Then the test sentences belonging to each task are run separately, to see if any task performs better than another. This is not interesting in itself, but it provides a baseline for comparing the results. Next, the test sentences belonging to each task are run through the speech recognizer given only the task-specific grammar. The results of this experiment are combined to see how the dynamic grammar works on average for the complete test set.
To determine the accuracy of the system and to compare the different results, a metric is used. For the purpose of this paper the word error rate is computed as follows:

    word error rate = word errors / total number of words    (1)

    word errors = substitutions + deletions + insertions

Here a substitution, a deletion, and an insertion each score one penalty. A match can be interpreted in different ways; for example, a substitution is also a deletion plus an insertion. The metric algorithm computes the optimum (minimum number of errors) error rate.
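The optimum (minimum-error) count described above is a standard edit-distance computation between the reference and hypothesis word strings; a sketch, not the paper's own scoring code:

```python
def word_errors(reference, hypothesis):
    """Minimum substitutions + deletions + insertions (Levenshtein distance
    over words), so a substitution is never double-counted as a
    deletion plus an insertion."""
    n, m = len(reference), len(hypothesis)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                       # i deletions
    for j in range(m + 1):
        d[0][j] = j                       # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[n][m]

def word_error_rate(reference, hypothesis):
    """Equation (1): word errors divided by total reference words."""
    return word_errors(reference, hypothesis) / len(reference)

print(word_error_rate("move the window".split(), "move a window".split()))
```

For the example input there is one substitution ("the" versus "a") in three reference words, giving a word error rate of 1/3.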
7 Results

The results of the experiment are summarized in table 1. In the left column the grammar is written. General is
the combined grammar of the three subtasks Finder, CPanels, and Extensions. The Dynamic grammar is the combined grammar of which only the grammar of one subtask is enabled when this subtask is active. The next column contains the name of the test set: FIND is the set with commands for the FINDER, CPAN for the CONTROL PANELS, and EXTS for the Extensions. In the last two columns the word error rates are printed for two different sets of acoustic models. The first is trained only on the TIMIT database. The second is re-estimated with the training set recorded in the same environment as the test set.

Grammar      Test   TIMIT   Retrained
General      FIND   …       …
Finder       FIND   …       …
General      CPAN   …       …
CPanels      CPAN   …       …
General      EXTS   …       …
Extensions   EXTS   …       …
General      ALL    …       …
Dynamic      ALL    …       …

Table 1: Word error rates

Each test set is tested both on the general grammar and on the grammar specific to the subtask. As can be seen in the table, each time the speech recognizer does a bit better when it is given a smaller grammar.
The retrained models are significantly better. For the complete test set, the TIMIT-trained models gain more from the dynamic grammar than the retrained models do: models that are better at the acoustic match have less to gain from the dynamic grammar technique.

8 Conclusion

The results show that a dynamic grammar can improve the recognition rate significantly, although it works better for less perfect acoustic models. Still, the effort is worthwhile, especially since robust systems tend to have more variable input and therefore less accurate acoustic models.
An issue that has not been investigated in this paper, but seems rather evident, is that a dynamic grammar, by disabling parts of the grammar, reduces the amount of time needed to recognize speech. In a real-time application this may free up resources to improve accuracy or another quality measure.

References

[1] J. Allen, Natural Language Understanding, Benjamin-Cummings.
[2] L. Ban, P. Tatai, "Automatic Speech Segmentation for an Open Vocabulary Recognition System", Proc. ECSAP.
[3] D. Geller et al., "Speech Recognition on Spheric, an IC for Command & Control Applications", EuroSpeech.
[4] J.L. Flanagan, I. Marsic, "Issues in Measuring the Benefits of Multimodal Interfaces", Proc. ICASSP.
[5] Q. Lin et al., "Robust Distant-Talking Speech Recognition", Proc. ICASSP.
[6] L.J.M. Rothkrantz et al., "The Simulation of a Human Operator by an ASP System", Proc. ASS.
[7] V. Zue, "Conversational Interfaces: Advances and Challenges", Proc. EuroSpeech.