DYNAMIC GRAMMAR SPEECH RECOGNITION ENGINE

S. Kruger, S. Chennoukh, J. Flanagan, L. Rothkrantz
Delft University of Technology, ITS
Zuidplantsoen, BZ Delft, The Netherlands
L.J.M.Rothkrantz@cs.tudelft.nl

ABSTRACT

This paper presents the design of an open-grammar speech recognition engine. The engine is built to operate in real time and is kept simple. The objective of this paper is to show how task-dependent knowledge can improve speech recognition accuracy by cancelling out parts of the command set depending on the state of the interface.
The system is a phone-based recognizer, and therefore the vocabulary and grammar can easily be extended without retraining the system.
The hypothesis of this paper is that speech recognition accuracy will improve if higher-level knowledge about the state of a task is used by the recognition engine.
`
1 Introduction
`
Recently the first commercial automatic speech recognition (ASR) systems became available. Speech recognition technology is successfully being applied in telephone inquiry systems, and the keyboard of computers can be replaced by the human voice to deliver short commands.
In all these applications the grammar and the dictionary are limited. To design such a system from scratch takes a lot of time and effort in training acoustic models for a specific application. In another approach, a general ASR system that theoretically can recognize any vocabulary is restricted to a strictly limited number of words and a grammar. The general system contains pretrained models for speech units, which can be combined to form words that are specific for a certain task.
To realize a real-time ASR system, a reduction of resources is necessary. The advantage of a pretrained general ASR system is that less effort is needed to apply it to a specific task. However, to realize a high recognition rate, additional training for the specific task is necessary.
Usually a task can be split into a number of subtasks and, accordingly, a subset of all possible commands. Depending on the state of the system, not all tasks can be performed and not all commands are available at all times. For each subtask a separate ASR system could be built, focused on its commands, but this is a waste of resources. It is better to have one ASR system for the complete task and have it activate parts of the command sets depending on the state of the system.
In this paper we use the general approach to realize an ASR application. We train the acoustic models for general purpose, then perform additional training for a specific application. Then the impact of using knowledge about the state of the system is examined.
`
2 Open Vocabulary
`
The idea of using one speech recognition system for several applications is not new. Ideally, a system that recognizes a very large vocabulary can be used for many tasks. The concept to keep these large systems in proportion was to chop speech up into speech units, train the speech units, and during recognition patch the units back together to form sensible language.
Many speech units have been proposed, ranging from compound syllables and syllables through phonemes to sub-phone units. Although there is still a lot of research into which unit best matches actual speech, the phone is most widely used: for its size and the associated number of different units, for its intuitive use, and for historical reasons, having been used in dictionaries for ages.
For the large vocabulary speech recognition system a dictionary is created with the most frequently used words in natural language. Usually this is sufficient to support most tasks.
The next step towards application-specific speech recognition, while retaining the all-purpose engine, is the open vocabulary method. Any one task does not need all these words, but only a subset. But while a dictionary is created fairly easily, the acoustic models take a lot of effort to create. All the speech units are trained for general use, and a task-specific vocabulary is constructed for each application.
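The division of labor described above, pretrained phone models plus a per-task dictionary, can be sketched as follows. The phone inventory, the model placeholders, and the phone transcriptions are illustrative assumptions, not taken from the paper:

```python
# Pretrained phone inventory: each phone name maps to an acoustic model.
# A model is a placeholder string here; in a real system it would be an HMM.
phone_models = {p: f"hmm<{p}>" for p in
                ["k", "r", "iy", "ey", "t", "m", "uw", "v", "d", "l", "ih"]}

# A task-specific lexicon: words spelled as phone sequences
# (illustrative transcriptions, not a real pronunciation dictionary).
task_lexicon = {
    "CREATE": ["k", "r", "iy", "ey", "t"],
    "MOVE":   ["m", "uw", "v"],
    "DELETE": ["d", "ih", "l", "iy", "t"],
}

def word_network(word):
    """Expand a word into its sequence of pretrained acoustic model
    instances, without retraining the models themselves."""
    return [phone_models[p] for p in task_lexicon[word]]

print(word_network("MOVE"))  # one model instance per phone
```

Adding a new word to the task then only requires adding a dictionary entry, which is the property the open vocabulary method relies on.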
`
3 Dynamic grammar speech recognition
`
Dialog systems already use a dialog structure to interpret the output of a speech recognizer in terms of the expected input from the user. Most dialog systems take
`
`Authorized licensed use limited to: MIT Libraries. Downloaded on October 22,2022 at 01:45:23 UTC from IEEE Xplore. Restrictions apply.
`
`Petitioner’s Ex. 1017, Page 1
`
`
`
Figure: Word category (the SIMPLE_COMMAND category containing CREATE, MOVE, DELETE, HIGHLIGHT, POP UP, POP DOWN)

Figure: Command sets corresponding with subtasks (overlapping Subtasks 1, 2, and 3 defining command sets C-set 1 to C-set 5)
`
input from a speech recognizer that recognizes natural language and then filter out the relevant words that the recognizer produces.
To do this the dialog system uses a slot-filling algorithm. Depending on the state of the dialog, it expects the user to utter words from a small set that is probable at the time. For example, the system may expect the user to utter a number and a product. It then creates a number slot and a product slot and tries to fill these slots with the input from the speech recognizer.
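As an illustration, a minimal slot-filling step might look like the following sketch; the slot names and value sets are hypothetical, not taken from the paper:

```python
# The dialog state declares which slots it expects; recognized words are
# matched against the value sets of the still-open slots.
expected_slots = {
    "number":  {"one", "two", "three"},
    "product": {"printer", "monitor"},
}

def fill_slots(recognized_words, slots):
    """Assign each recognized word to the first open slot whose value set
    contains it; words matching no slot are simply ignored."""
    filled = {}
    for word in recognized_words:
        for name, values in slots.items():
            if name not in filled and word in values:
                filled[name] = word
                break
    return filled

print(fill_slots(["please", "two", "monitor"], expected_slots))
# irrelevant words ("please") fill no slot
```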
This knowledge about the state of the dialog can be used to the advantage of the speech recognizer by telling it what to expect in the current situation. Not only accuracy but also speed may improve because of the smaller set of options to consider.
In the figure, the commands of three subtasks are denoted symbolically as Venn diagrams. From these diagrams command sets are found. A speech recognition system can enable different sets when different subtasks are active. When a subtask is active, the command sets covering its commands must be enabled.
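This subtask-to-command-set mapping can be kept in a simple table. The concrete assignments below are hypothetical stand-ins for the sets in the figure:

```python
# Hypothetical mapping from subtasks to the command sets they need.
subtask_csets = {
    "subtask1": {"cset1", "cset2"},
    "subtask2": {"cset2", "cset3", "cset4"},
    "subtask3": {"cset4", "cset5"},
}

def enabled_command_sets(active_subtasks):
    """Union of the command sets needed by the currently active subtasks;
    every other set can be disabled in the recognizer."""
    enabled = set()
    for task in active_subtasks:
        enabled |= subtask_csets[task]
    return enabled

print(enabled_command_sets({"subtask1"}))
print(enabled_command_sets({"subtask1", "subtask3"}))
```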
A natural language speech recognizer has a language model based on statistical properties of natural language, such as the occurrence probabilities of word pairs or triples. Incorporating knowledge about the state of the dialog into such a language model requires calculating conditional occurrence probabilities. Calculating the general statistics of natural language is a huge task indeed, and adding this knowledge would require even more effort.
This paper concentrates on speech recognition in command interface environments. Dialogs in such an environment may be more constrained in syntax. This allows the use of a speech recognizer with a grammar language model rather than a natural language recognizer. The language model is not based on statistical properties of language but is predefined for a task. The grammar specifies which commands are valid and what their syntax is, and each command has equal occurrence probability.
Based on the state of the dialog, the command set can
`
Figure: Grammar graph (paths from Start through DIRECTED COMMAND, MOVEABLE OBJECT, LOCATION, and DIRECTION, with a parallel SINGLE COMMAND path, to End)
`
vary. This knowledge can easily be integrated in the speech recognizer by enabling or disabling parts of the grammar. This is called dynamic grammar speech recognition (DGSR). If the dialog system and the speech recognizer are integrated closely, they can share higher-level knowledge to optimize accuracy.
`
4 Grammar
`
The grammar is structured in such a way that the speech recognition engine can easily find an optimal path through it when presented with input. A grammar compiler builds a graph from a grammar definition, in which the basic components are instances of the acoustic models.

A grammar is built up from word categories or word classes (see the figure). One category contains an arbitrary number of words or concatenated words and works like an or-port. The grammar compiler constructs a parallel path for each word in the category. A category can be viewed as a node, because it has a number of arcs fanning in and a number of arcs fanning out, but the inside of the node is shielded.
`
A graph is constructed by defining the directed arcs between word categories (see the figure). The grammar compiler replaces each word by its phone sequence and each phone by an instance of its acoustic model, plus some variables for keeping track of the recognition process.
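A minimal sketch of this compilation step, with an assumed lexicon and category (the transcriptions and names are illustrative, not the paper's actual grammar):

```python
# Illustrative lexicon: word -> phone sequence.
lexicon = {
    "CREATE": ["k", "r", "iy", "ey", "t"],
    "MOVE":   ["m", "uw", "v"],
    "DELETE": ["d", "ih", "l", "iy", "t"],
}

# A word category works like an or-port over its words.
categories = {"SIMPLE_COMMAND": ["CREATE", "MOVE", "DELETE"]}

def compile_category(name):
    """Expand a word category into parallel phone paths, one per word.
    In the real engine each phone would become an acoustic model instance."""
    return [[(word, phone) for phone in lexicon[word]]
            for word in categories[name]]

paths = compile_category("SIMPLE_COMMAND")
print(len(paths))   # one parallel path per word in the category
print(paths[1])     # the phone path for MOVE
```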
`
The recognition process is really a scheme of updating scores of state instances in the graph and propagating these values from the start to the end node. At the end of a sentence, the optimal word sequence is traced back through the graph to present the resulting sentence.
`
`
`
`
5 Implementation
`
To test the goals of this paper, a speech recognition engine is used that can accept any grammar and recognize only sentences that are valid according to that grammar. The engine is an open-vocabulary, phone-based speech recognizer based on the hidden Markov model method.
The grammar is represented by a directed graph that connects word categories. Each word category may contain any number of words or word sequences. The word category works as an or-construct. A word category is instantiated as a node in the grammar graph. The arcs of the graph are specified by identifying the predecessors of each node. Every graph contains at least a start and an end node.
Each path in the graph represents the syntax of a command, where each node may be substituted by one of its words. The speech recognition engine compiles a network from this graph by substituting each word by its phonetic transcription and each phone by an instance of the acoustic model.
A Viterbi search finds the best path in this network given a stream of features extracted from the input utterance. An evaluation algorithm determines the number of word errors. The speech recognizer can only produce sentences that are valid according to the grammar. Therefore an erroneous output is itself a sentence in the grammar, and substitution errors are the most common. For a deletion or an insertion error to occur, the wrong sentence must be in the grammar.
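The search can be sketched as a textbook Viterbi pass over a toy word-level network. The nodes and arc scores below are illustrative stand-ins for the phone-level network and the acoustic log-likelihoods:

```python
import math

# Toy network: each node lists its predecessors, as in the grammar graph.
# Arc scores stand in for acoustic log-likelihoods (illustrative values).
predecessors = {
    "move":   ["start"],
    "delete": ["start"],
    "window": ["move", "delete"],
    "end":    ["window"],
}
arc_score = {
    ("start", "move"): -1.0, ("start", "delete"): -2.5,
    ("move", "window"): -1.0, ("delete", "window"): -1.0,
    ("window", "end"): 0.0,
}

def viterbi(order):
    """Propagate best scores from start to end over nodes in topological
    order, then trace back the best-scoring path."""
    best = {"start": 0.0}
    back = {}
    for node in order:
        for pred in predecessors[node]:
            score = best[pred] + arc_score[(pred, node)]
            if score > best.get(node, -math.inf):
                best[node] = score
                back[node] = pred
    path, node = [], "end"
    while node != "start":
        path.append(node)
        node = back[node]
    return list(reversed(path))

print(viterbi(["move", "delete", "window", "end"]))
# ['move', 'window', 'end'] -- the best-scoring sentence in the grammar
```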
It is easy to merge two grammars into one new one. All nodes of both grammars are copied into one file, except that the start and end nodes are copied once, and the end node's predecessor list is the concatenation of the two original ones. It is also easy to disable part of a grammar. If one arc in a path from the start to the end node is removed, the path becomes invalid and the recognizer will no longer recognize the corresponding syntax. It is of course wise to cut the path at the beginning of the network, to reduce computational load.
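The merge-and-disable scheme above can be sketched on a predecessor-list representation of the grammar graph; the node names ("open", "volume") are hypothetical commands:

```python
# A grammar is a dict: node -> list of predecessor nodes.
# The start node is implicit; "end" is shared between grammars.
def merge(g1, g2):
    """Merge two grammars: copy all nodes, concatenating the end node's
    predecessor lists, as described in the text."""
    merged = {}
    for g in (g1, g2):
        for node, preds in g.items():
            if node == "end":
                merged.setdefault("end", []).extend(preds)
            else:
                merged[node] = list(preds)
    return merged

def disable(grammar, first_node):
    """Disable a subgrammar by cutting its path at the beginning of the
    network: remove the arc from start, so the branch costs nothing."""
    grammar[first_node] = [p for p in grammar[first_node] if p != "start"]

finder = {"open": ["start"], "end": ["open"]}
panels = {"volume": ["start"], "end": ["volume"]}
g = merge(finder, panels)
print(g["end"])     # ['open', 'volume']
disable(g, "volume")
print(g["volume"])  # [] -- the second subgrammar is now unreachable
```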
`
6 Experiment
`
The objective of the experiment is to find out how the recognition accuracy improves if a grammar is reduced to subtasks. To this end, a grammar is designed that consists of subgrammars belonging to subtasks.
The global task is the interface of a computer. The subtasks are Window Management (Finder), accessing Extensions, and activating Control Panels.
At some point in time, perhaps not all tasks are active. This means that the system need not expect the user to input any commands that belong to other tasks, and those subgrammars may be disabled.
The test set from one subtask is first tested with the recognizer with a full grammar. It is interesting to see how the performance increases when subgrammars are disabled.
`
The speech recognizer is provided with simple context-independent acoustic models (monophones). The phones are represented by continuous-density HMMs with a mixture of Gaussians to represent the pdf of each state. The acoustic models are first trained on the TIMIT database. Initial models are estimated with a segmentation of the time-aligned TIMIT data. The embedded Baum-Welch algorithm is used to re-estimate the models. The TIMIT database consists of clean speech of very good quality and low noise levels. To make the speech recognizer perform well in the testing environment, the acoustic models are retrained with speech recorded in the testing environment.
Because the system has context-independent acoustic models, not much training data is needed to estimate the models. To make the system perform well for this experiment, the models are retrained on speech data from the grammar of the global task. This makes the context-independent models tend to model the phones in the context in which they appear in the task.
For the training and test data, sentences were generated from the grammar. This set was divided into subsets of sentences, and each speaker recorded one of these subsets; part of the speakers are in the test set and the rest are in the training set. Of these sentences, part are allocated to the task FINDER, part to CONTROL PANELS, and part to EXTENSIONS. For the purpose of the experiment, four grammars are defined: one for each task and one general grammar containing all three tasks.
In the experiment, first the complete test set is run through the general grammar. Then the test sentences belonging to each task are run separately, to see if any task performs better than another. This is not interesting in itself, but serves to compare the results. Next, the test sentences belonging to each task are run through the speech recognizer given only the task-specific grammar. The results of this experiment are combined to see how the dynamic grammar works on average for the complete test set.
To determine the accuracy of the system and compare the different results, a metric is used. For the purpose of this paper the word error rate is computed as follows:

    word error rate = word errors / total number of words

    word errors = substitutions + deletions + insertions

Here a substitution, a deletion, and an insertion each score one penalty. A match can be interpreted in different ways; for example, a substitution is also a deletion plus an insertion. The metric algorithm computes the optimum (minimum number of errors) error rate.
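The optimum alignment described above is the standard minimum edit distance, computable by dynamic programming; a sketch with illustrative sentences:

```python
# Word error rate via minimum edit distance: substitutions, deletions and
# insertions each cost one penalty, and dynamic programming finds the
# alignment with the minimum total number of errors.
def word_errors(reference, hypothesis):
    n, m = len(reference), len(hypothesis)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                       # i deletions
    for j in range(m + 1):
        d[0][j] = j                       # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[n][m]

def word_error_rate(reference, hypothesis):
    return word_errors(reference, hypothesis) / len(reference)

ref = "move the window up".split()
hyp = "move window down".split()
print(word_errors(ref, hyp))      # 2: one deletion ("the"), one substitution
print(word_error_rate(ref, hyp))  # 0.5
```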
`
7 Results
`
The results of the experiment are summarized in the table below. In the left column the grammar is listed. General is the combined grammar of the three subtasks: Finder, CPanels, and Extensions. The Dynamic grammar is the combined grammar of which only the grammar of one subtask is enabled when that subtask is active. The next column contains the name of the test set. FIND is the set with commands for the FINDER, CPAN for the CONTROL PANELS, and EXTS for the EXTENSIONS.
In the last two columns the word error rates are printed for two different sets of acoustic models. The first is trained only on the TIMIT database. The second is re-estimated with the training set recorded in the same environment as the test set.

    Grammar      Test    TIMIT   Retrained
    General      FIND      .        .
    Finder       FIND      .        .
    General      CPAN      .        .
    CPanels      CPAN      .        .
    General      EXTS      .        .
    Extensions   EXTS      .        .
    General      ALL       .        .
    Dynamic      ALL       .        .

    Table: Word error rates

Each test set is tested both on the general grammar and on the grammar specific to the subtask. As can be seen in the table, each time the speech recognizer does a bit better when it is given a smaller grammar.
The retrained models are significantly better. For the complete test set, the error decrease obtained with the dynamic grammar is larger for the TIMIT-trained models than for the retrained models; models that are better at the acoustic match have less to gain from the dynamic grammar technique.

8 Conclusion

The results show that a dynamic grammar can improve the recognition rate significantly, although it works better for less perfect acoustic models. Still, the effort is worthwhile, especially since robust systems tend to have more variable input and therefore less accurate acoustic models.
An issue that has not been investigated in this paper, but seems rather evident, is that a dynamic grammar, by disabling parts of the grammar, reduces the amount of time needed to recognize speech. In a real-time application this may free up resources to improve accuracy or another quality measure.

References

[1] J. Allan, Natural Language Understanding, Benjamin/Cummings.
[2] L. Ban, P. Tatai, "Automatic Speech Segmentation for an Open Vocabulary Recognition System", Proc. ECSAP.
[3] D. Geller et al., "Speech Recognition on Spheric, an IC for Command Control Applications", EuroSpeech.
[4] J.L. Flanagan, I. Marsic, "Issues in Measuring the Benefits of Multimodal Interfaces", Proc. ICASSP.
[5] Q. Lin et al., "Robust Distant-Talking Speech Recognition", Proc. ICASSP.
[6] L.J.M. Rothkrantz et al., "The Simulation of a Human Operator by an ASP System", Proc. ASS.
[7] V. Zue, "Conversational Interfaces: Advances and Challenges", Proc. EuroSpeech.
`