`HUMAN-MACHINE DIALOG SYSTEMS
`
`Richard C. Rose¹ and Hong Kook Kim²
`
`¹AT&T Labs - Research, Florham Park, NJ 07932, rose@research.att.com
`²Kwangju Institute of Science and Technology, Kwangju, Korea, hongkook@kjist.ac.kr
`
`ABSTRACT
`This paper investigates techniques designed to allow
`the users of human-machine dialog systems to interrupt
`or barge-in over machine generated speech messages.
`An experimental study was performed on utterances col-
`lected from a telephone based dialog system to analyze
`the effect of barge-in performance on users’ speech. One
`result of this study was that excessive barge-in latencies
`resulted in disfluencies appearing in over half of users’
`utterances. A hybrid procedure for barge-in detection is
`proposed and evaluated on the utterances collected from
`the same domain. The procedure combines a feature-based
`voice activity detection (VAD) algorithm with a
`model-based approach for verifying hypothesized speech
`segments. The procedure is shown in the paper to obtain
`better detection performance than procedures that rely on
`the speech recognition decoder to detect speech. It is also
`found to have latencies that are comparable to those obtained
`by low delay feature-based speech detection algorithms.
`
`1. INTRODUCTION
`
`The ability for a user to “barge-in” over prompts allows
`for turn-taking protocols that differ substantially from
`systems that require that the user listen and speak in alternation
`[1]. It enables longer turns to be taken by the machine,
`arbitrary interruptions by the user, and can enhance
`the over-all naturalness of the human-machine interac-
`tion. Barge-in detection strategies detect onset of speech
`for the purpose of disabling machine generated prompts
`with a minimum latency from the start of the user’s ut-
`terance. Existing strategies differ in their approach to
`addressing the trade-off between latency and errors that
`result from false detection or missed detection of speech
`events.
`A hybrid barge-in detection technique is presented
`in this paper. The technique combines the low latency
`of feature-based barge-in detection with the more ro-
`bust detection characteristics previously obtained only by
`“speech-based” or “decoder-based” techniques that rely
`on decisions made in the automatic speech recognition
`(ASR) decoder [7,4]. To motivate this hybrid approach,
`a study of the impact of barge-in on user utterances was
`performed.
`
`It is important to note that there are many additional
`issues involved in creating a natural turn-taking protocol.
`One issue is the proper choice of prompts to convey the
`proper level of confirmation, uncertainty, or agreement to
`the user as part of a coherent discourse strategy [3]. Another
`issue is the design of prompts that encourage natural
`turn-taking [1]. Finally, it is also important to design
`recovery strategies that can be invoked to maintain a con-
`sistent discourse history when barge-in detection fails [8].
`These additional issues are not directly addressed in this
`paper. However, it is safe to say that a faster, more reli-
`able barge-in detection mechanism will result in less im-
`pact on speech utterances and will reduce the dependence
`on confirmation and error recovery strategies.
`A barge-in detector must be able to detect user speech
`input with minimum latency but also be able to reliably
`reject non-relevant input. There are several sources of
`non-relevant input. First, there is background speech and
`background noise that arise from the wide variety of pos-
`sible acoustic environments. Second, there is the effect of
`echo cancelers located either in the voice platform or in
`the public switched telephone network. These can intro-
`duce echoed versions of system prompts as well as non-
`linear effects such as “digital zeroes” being inserted dur-
`ing periods of inactivity. These non-linear effects make
`it very difficult for speech detection algorithms to update
`adaptive estimates of background levels.
`A third source of non-relevant input is human gener-
`ated non-speech noise. It is clear that events like breath
`noise, coughing, and lip smacks should not trigger barge-
`in detection. However, a final issue concerning non-
`relevant input involves the definition of what types of
`speech utterances should be considered relevant. In this
`work, the goal is for the system to consider as barge-in
`events only those utterances where the user intended to
`interrupt the prompt. This means that the system would
`be expected to ignore instances like word fragments and
`filled pauses but would be expected to respond to all other
`utterances.
`The paper is organized in two major parts. The first
`part is an informal study of barge-in behavior for human-
`machine dialog systems. After reviewing some of the
`existing techniques used for barge-in detection in Section 2,
`performance measures that can be used to characterize
`barge-in outcomes over a range of user behaviors
`
`0-7803-7980-2/03/$20.00 © 2003 IEEE
`
`
`ASRU 2002
`
`
`
`Exhibit 1017
`Page 01 of 06
`
`
`
`and system responses are discussed in Section 3. Follow-
`ing that, Section 3 also describes the results of a study
`of barge-in performance for a single segment of a user-machine
`dialog. The second part of the paper proposes
`and evaluates a simple hybrid procedure for barge-in de-
`tection. The procedure, described in Section 4, combines
`a feature-based voice activity detection (VAD) algorithm
`with a model based approach for verifying speech seg-
`ments that were hypothesized by the VAD.
`
`2. EXISTING BARGE-IN DETECTION
`METHODS
`
`Most of the existing methods for barge-in detection that
`are in common use fit into two general categories [1]. The
`first set of approaches are based largely on signal features
`such as energy, periodicity, and voicing of speech. They
`rely on adaptive background estimation algorithms along
`with some form of state machine to ensure that low energy
`speech events are classified as speech [S, 2].
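
The feature-based family described above can be illustrated with a minimal sketch. It assumes per-frame log energies as the only feature and uses hypothetical parameter names (`margin_db`, `adapt`, `onset_frames`, `hangover_frames`) that do not come from the paper or from any cited algorithm. The background level is tracked adaptively on inactive frames, and a small state machine with a hangover period keeps brief low energy stretches inside an utterance labeled as speech.

```python
def energy_vad(frame_energies_db, margin_db=6.0, adapt=0.05,
               onset_frames=3, hangover_frames=8):
    """Label each frame 1 (speech) or 0 (non-speech).

    All parameters are illustrative assumptions:
      margin_db       - how far above the background estimate a frame
                        must be to count as active
      adapt           - smoothing constant of the background tracker
      onset_frames    - consecutive active frames needed to enter speech
      hangover_frames - inactive frames tolerated before leaving speech
    """
    background = frame_energies_db[0]  # assume the first frame is background
    labels = []
    active_run = 0     # consecutive frames above the threshold
    hangover = 0       # inactive frames still allowed inside speech
    in_speech = False
    for energy in frame_energies_db:
        active = energy > background + margin_db
        if active:
            active_run += 1
        else:
            active_run = 0
            # Adapt the background estimate only on inactive frames.
            background = (1.0 - adapt) * background + adapt * energy
        if not in_speech and active_run >= onset_frames:
            in_speech = True
        if in_speech:
            if active:
                hangover = hangover_frames
            else:
                hangover -= 1
                if hangover < 0:
                    in_speech = False
        labels.append(1 if in_speech else 0)
    return labels
```

The onset counter trades latency for robustness to short noise bursts, which is the same latency versus error trade-off discussed in Section 1. Note that a tracker of this kind can be misled when an echo canceler inserts "digital zeroes", since the background estimate then drifts toward an unrealistically low level.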
`A second class of approaches relies on the ASR decoder
`to detect when a barge-in event has occurred [7, 4].
`One approach involves tagging specific arcs in the lan-
`guage model as barge-in arcs. When these arcs are
`reached during decoding, a confidence score is computed
`and compared to a threshold [4]. These arcs might, for
`example, be placed at locations following any sequence
`of the first five non-silence phones in the ASR network or
`after any sequence of the first one or more words in the
`network.
`This implies the capability of rejecting all speech
`events that are “out-of-grammar”, but it also implies that
`the barge-in latency is subject to the delays that exist in
`the ASR network. These decoder based methods will
`typically provide better rejection characteristics for non-
`speech events than energy-based approaches at the ex-
`pense of longer latencies.
`
`3. EXPERIMENTAL STUDY
`
`In order to develop barge-in detection strategies, it is im-
`portant to have a rigorous means for evaluating barge-in
`detection performance. This section describes an experi-
`mental study to evaluate the performance of the existing
`decoder-based barge-in approach described in Section 2.
`Performance is evaluated using the performance mea-
`sures described in Section 3.1 using human end-pointed
`utterances taken from the task domain described in Sec-
`tion 3.2. The results of the study are presented in Sec-
`tion 3.3.
`
`3.1. Measuring Barge-in Detection Performance
`The performance of the barge-in procedures considered
`here is measured in terms of two criteria. The
`first is their ability to arrive at a speech/non-speech decision
`with minimum latency with respect to the start of the
`customer’s speech. This is important so the system can
`
`respond by turning off system generated speech messages
`in a way that seems natural to the user and will result
`in a minimum of disfluent speech events or other erratic
`user behavior. This requirement is documented in many
`user interface references in the form of a 300 msec maximum
`latency requirement for any barge-in implementation
`[1]. Excessive latencies can result in utterances containing
`pauses, repetitions, and interrupted speech. The
`effect of barge-in latency on the tendency of users’ utter-
`ances to be ill-formed as a result of these latencies will
`be discussed in more detail in Section 3.3.
`The second criterion is the barge-in procedure’s abil-
`ity to respond only to speech events and to ignore non-
`speech events. This requirement can be extended to in-
`clude the ability to reject “out-of-vocabulary” or “out-of-
`grammar” events as well. However, rejecting utterances
`based on a higher level notion of whether the utterance is
`within the task domain can conflict with the requirement
`of very fast response time. System developers would like
`to reduce latencies while maintaining the system’s abil-
`ity to reliably detect speech events and to reliably reject
`non-speech events.
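
The two criteria can be made concrete with a small scoring sketch. The record layout (`speech_start`, `detect_time`) and the function name are assumptions made for illustration and are not taken from the system used in the study. False events are counted over all utterances and missed events over utterances where the user attempted to barge-in, which matches how the rates are reported in Section 3.3.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Utterance:
    # Annotated start of user speech in seconds from prompt start;
    # None when the user did not barge-in.
    speech_start: Optional[float]
    # Time at which the system declared a barge-in event; None if it never did.
    detect_time: Optional[float]


def barge_in_metrics(utts: List[Utterance]) -> dict:
    """Compute false-event rate, miss rate, and latency statistics."""
    false_events = 0
    missed = 0
    barge_ins = 0
    latencies = []
    for u in utts:
        user_barged = u.speech_start is not None
        fired = u.detect_time is not None
        if user_barged:
            barge_ins += 1
        if fired and (not user_barged or u.detect_time < u.speech_start):
            # Assumption: a detection before the annotated speech onset
            # is scored as a false event only.
            false_events += 1
        elif user_barged and not fired:
            missed += 1
        elif user_barged and fired:
            latencies.append(u.detect_time - u.speech_start)
    return {
        "false_rate": false_events / len(utts),
        "miss_rate": missed / barge_ins if barge_ins else 0.0,
        "mean_latency": sum(latencies) / len(latencies) if latencies else None,
        # Number of correct detections exceeding the 300 msec guideline.
        "over_300ms": sum(1 for lat in latencies if lat > 0.3),
    }
```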
`
`3.2. Target Task Domain
`It is well known that the tendency for users to interrupt
`system prompts and the tendency for their utterances to
`be affected by the act of interrupting the prompt is highly
`dependent on the state of the dialog. When responding
`to directed machine initiated queries, there are many fac-
`tors that might influence user behavior. These include the
`characteristics of the prompt, the level of user experience
`with the task, and the type of expected response.
`The utterances used in this study are taken from a
`general task domain that includes a variety of system
`queries including requests for general area of interest,
`telephone and account numbers, names, and various other
`information. A set of utterances were chosen for this
`study containing user responses to a request for a thirteen
`digit account number. This class of utterances was inter-
`esting for two reasons. First, the recorded prompt used
`to elicit responses consisted of two segments, a request
`for information followed by further explanation, and was
`eleven seconds long. It is not surprising that this resulted
`in a relatively large number of barge-in attempts. Sec-
`ond, it was easy for human annotators to determine when
`disfluent speech events were induced in digit string utter-
`ances as a result of system behavior.
`A total of 9437 of these utterances were obtained,
`where approximately 25% of these utterances corresponded
`to users attempting to interrupt the prompt. A
`2000 utterance subset was chosen from this set to pro-
`vide utterances where either barge-in occurred or where
`background noise or human generated non-speech noise
`was very likely to generate a barge-in event. Each ut-
`terance in this subset was labeled with word level text
`transcriptions, time end-points, and annotations indicat-
`ing the presence of disfluencies (pauses, repetitions, and
`
`Table 1: Outcome table describing barge-in detection behavior for a dataset taken from the task domain described in Section 3.2.
`
`User Behavior          User Barge-in (70.5%)    No User Barge-in (29.5%)
`  Disfluent Utt.       54.5%                    5.4%
`  Well-Formed Utt.     45.5%                    94.6%
`
`System Decision
`  BI Event             Correct BI Event Detection: 79.8%
`                       False BI Event Detections In All Utterances: 5.8%
`  No BI Event          Missed BI Event: 21.2%
`
`
`interrupted speech). It was found that 1590, or 79.5%, of
`these files contain speech and 1122, or 70.5%, of these
`speech utterances correspond to attempts by the user to
`barge-in. This is far higher a percentage of barge-in at-
`tempts than would normally be expected and this percent-
`age is a result of these utterances having been deliberately
`selected to analyze barge-in behavior.
`
`3.3. Performance for Target Task Domain
`
`The overall effect of barge-in on system behavior and
`user behavior can be described using the outcome table in
`Table 1. User behavior is described in terms of whether
`or not the user attempted to interrupt the system prompt
`and whether or not the resulting user utterance was well-
`formed or contained disfluencies. The system behavior is
`described in the table simply in terms of whether or not
`a barge-in event occurred. While Table 1 describes the
`overall level of disfluencies for utterances where barge-in
`did and did not occur, the plots in Figure 1 more
`directly characterize the effect of barge-in latencies on
`users’ speech.
`The entries in Table 1 are derived from the 2000 ut-
`terance subset described in Section 3.2. These utterances
`were chosen to have a much higher incidence of barge-in
`than actually occurred in the full 9437 utterance dataset.
`There are several observations that can be made from Table 1.
`First, the top row of Table 1 indicates that 70.5%
`of the 1590 speech-containing responses to the request for account
`number in the 2000 utterance subset were spoken over the prompt. The second
`observation is that the barge-in detection characteristics for this
`very difficult subset of utterances are reasonably good.
`It is clear from Table 1 that false barge-in events, result-
`ing in the prompt being incorrectly turned off, are gen-
`erated in 5.8% of all files. Missed barge-in events, re-
`sulting in the prompt being left active when user speech
`is present, occur in 21.3% of the files where the user at-
`tempted to barge-in over the prompt. The third observa-
`tion is that there are a large number of disfluencies gen-
`erated when users barge-in compared to when users wait
`until the prompt has ended. Table 1 shows that 54.5%
`of barge-in utterances contained some disfluent
`speech while only 5.4% of utterances spoken after
`the end of the prompt contained disfluencies.
`The plots in Figure 1 illustrate how various aspects
`of the dialog system design can contribute to the level of
`disfluencies in users’ utterances. The plot in Figure 1a
`attempts to describe the effect of barge-in latency on the
`tendency of users’ utterances to be ill-formed. Figure 1a
`displays the percentage of utterances that contain these
`disfluencies plotted with respect to average barge-in latency.
`The plot in Figure 1b describes the effect of the
`prompt design on disfluencies in users’ utterances. It displays
`the percentage of disfluent utterances plotted with
`respect to the instant in time when barge-in occurred. The
`vertical lines in Figure 1b indicate the times when the two
`segments of a two part prompt ended.
`There are several observations that can be made from
`the plots in Figure 1. First, it is clear from Figure 1a that
`the percentage of utterances containing disfluencies increases
`with the latency between the time at which speech
`begins and the time at which the prompt is shut off. Second,
`Figure 1a shows that the overall percentage of utterances
`containing disfluent speech is very high for utterances
`where barge-in latency is significantly greater than 300
`msec. This suggests that it is very important to maintain
`very low latencies in any system for detecting the start
`of speech. The last observation, illustrated by the plot in
`Figure 1b, is that the design of the prompt can also
`have a significant effect on users’ utterances. The high
`rate of disfluencies occurring in these utterances suggests
`that many users started speaking immediately after the
`first portion of the prompt (Please enter your account
`number) and became confused when the second segment
`of the prompt began (The account number is located. . .).
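
A curve of the kind shown in Figure 1a can be produced by grouping annotated utterances into latency bins and computing the fraction of disfluent utterances in each bin. The sketch below assumes latencies are stored as integer milliseconds and uses a hypothetical 100 msec bin width; neither detail is stated in the paper, and any records passed to it would have to come from annotations like those described in Section 3.2.

```python
def disfluency_rate_by_latency(records, bin_width_ms=100):
    """Return [(bin_start_ms, fraction_disfluent), ...] sorted by latency.

    records: iterable of (latency_ms, is_disfluent) pairs, where
    latency_ms is an integer number of milliseconds between the start
    of user speech and the prompt being disabled.
    """
    counts = {}  # bin index -> (total utterances, disfluent utterances)
    for latency_ms, is_disfluent in records:
        bin_index = latency_ms // bin_width_ms
        total, disfluent = counts.get(bin_index, (0, 0))
        counts[bin_index] = (total + 1, disfluent + (1 if is_disfluent else 0))
    return [
        (bin_index * bin_width_ms, disfluent / total)
        for bin_index, (total, disfluent) in sorted(counts.items())
    ]
```

Integer milliseconds are used so that bin boundaries are not affected by floating point rounding.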
`
`4. HYBRID FEATURE/MODEL BASED
`DETECTION
`
`A simple hybrid approach for barge-in detection is pro-
`posed in this section. The approach is a two-step pro-
`cedure where barge-in events are hypothesized using a
`feature-based barge-in algorithm and verified over fixed
`length segments using a model based procedure. It is similar
`in motivation to the decoder-based barge-in detectors
`discussed in Section 2 in that a model based score is computed
`to verify a barge-in event. However, this hybrid approach
`can confirm a barge-in event within a guaranteed
`latency with respect to the time of the original hypothesized
`segment. The approach is described in Section 4.1
`and the barge-in performance evaluation of the approach
`is presented in Section 4.2.
`
`[Figure 1 plots: a) Disfluency Dependence on Barge-in Latency, horizontal axis: barge-in latency (sec); b) Disfluency Dependence on Barge-in Time, horizontal axis: barge-in event time (sec)]
`Figure 1: Analysis of how disfluencies in user utterances are
`affected by a) latency from the start of speech to the prompt
`being disabled and b) the absolute time from the start of
`the prompt when the user attempts to barge-in.
`
`4.1. Hybrid Approach
`Section 2 described decoder-based barge-in approaches
`as being able to provide better detection characteristics
`than energy-based approaches but suffering from longer
`latencies. It was also suggested in Section 1 that the variety
`of background conditions along with the wide range
`of linear and non-linear distortions make it extremely
`difficult for the faster adaptive energy-based algorithms
`to work reliably under all conditions. The compromise
`that was reached here was to use a two-step approach.
`A feature-based algorithm is used to hypothesize short
`speech segments, on the order of several hundred msec in
`length, but is parameterized so that it over-generates segments
`to the point where only a negligible number of actual
`start-of-speech events are missed. In the second step,
`a model based procedure for verifying that these short,
`fixed length segments correspond to a barge-in event is
`used in place of the decoder-based approach. This second
`step verification process is much simpler than the
`decoder-based approach and has a guaranteed maximum
`latency equal to the length of the hypothesized speech
`segments. It is also shown in Section 4.2 to have detection
`characteristics that are as reliable as the decoder-based
`approach.
`The hybrid procedure is outlined by the block diagram
`in Figure 2. The first step is based on a feature-based
`voice activity detection (VAD) algorithm which
`in our case was borrowed from a VAD that was originally
`developed to be used with the adaptive multirate
`(AMR) speech coder in the ETSI 3GPP [2]. The VAD
`produces a sequence of speech activity flags for each 10
`msec frame. A post-processing step produces a sequence
`of fixed length segments which will either be verified as
`barge-in speech or rejected by the model-based verification
`stage.
`
`[Figure 2 diagram, "Hybrid Feature/Model Based Barge-in Detection": input audio, feature-based VAD, frame labels (e.g. 00011101110011111110011000111011100010000), post-processor, hypothesized speech segments]
`Figure 2: Block diagram of hybrid procedure for barge-in
`detection.
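
The paper does not specify how the post-processor turns frame labels into segments, so the following is only one plausible sketch under stated assumptions: a segment of fixed length `seg_len` frames is opened whenever `min_run` consecutive speech flags are seen outside an existing segment. Both parameter names and the triggering rule are hypothetical. A small `min_run` makes the stage over-generate, which is the behavior the hybrid approach asks of its first step.

```python
def hypothesize_segments(flags, seg_len=30, min_run=2):
    """Turn per-frame VAD flags into fixed length hypothesized segments.

    flags:   sequence of 0/1 speech activity flags, one per 10 msec frame
    seg_len: segment length in frames (30 frames = 300 msec)
    min_run: consecutive speech flags required to open a segment
             (illustrative assumption, not taken from the paper)
    Returns a list of (start_frame, end_frame) pairs, end exclusive.
    A segment may extend past the last available flag; a streaming
    implementation would wait for those frames before verification.
    """
    segments = []
    run = 0
    t = 0
    while t < len(flags):
        run = run + 1 if flags[t] else 0
        if run >= min_run:
            start = t - min_run + 1
            segments.append((start, start + seg_len))
            t = start + seg_len  # resume scanning after this segment
            run = 0
        else:
            t += 1
    return segments


# Frame labels as they appear in Figure 2, with a short segment length
# chosen only so that the 41 frame example yields several segments.
example_flags = [int(c) for c in "00011101110011111110011000111011100010000"]
example_segments = hypothesize_segments(example_flags, seg_len=10, min_run=2)
```

Each returned segment would then be passed to the model-based verification stage, so the worst-case delay added by verification is bounded by `seg_len` frames.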
`
`The verification step shown in Figure 2 relies on a
`likelihood ratio test to verify the hypothesis that speech
`is present in the fixed length hypothesized speech seg-
`ment. The likelihood ratio test is based on Gaussian
`mixture representations of speech events and non-speech
`events. The parameters of these models are estimated
`during hidden Markov model (HMM) training of context
`dependent phonetic units from speech utterances repre-
`senting a large number of telephony-based task domains.
`These domains include large vocabulary customer service,
`proper names from directory service applications,
`connected digits from operator service domains, and several
`others. The Gaussian mixture model (GMM) representing
`speech, $\lambda_s$, contains 64 mixture densities and
`was actually trained as a single state HMM from all non-silence
`frames in the training data. The GMM representing
`non-speech events, $\lambda_b$, contains 24 densities and was
`trained as a single state HMM from a subset of the background
`frames in the training data. The HMMs are defined
`over sixty component observation frames. These
`are computed by concatenating 11 successive 22 component
`cepstrum vectors, performing a dimensionality reduction
`to a 60 component vector using a transformation
`estimated from heteroscedastic discriminant analysis
`(HDA), and diagonalizing using a maximum likelihood
`linear transformation (MLLT) [5, 6]. The observation
`frames are updated every ten msec. A log likelihood
`ratio, $L = \sum_{i=1}^{N} \log \frac{p(x_i \mid \lambda_s)}{p(x_i \mid \lambda_b)}$, is computed over an N
`frame speech segment, $x_1, \ldots, x_N$, where N is typically
`30 frames, or 300 msec at the ten msec frame rate. A decision to classify the hypothesized segment
`as speech is made by comparing L to a decision
`threshold.
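
The likelihood ratio test can be sketched in a few lines. The code below assumes diagonal covariance Gaussian mixtures, which is a common choice after an MLLT but is not stated explicitly in the paper, and it represents each model as a hypothetical `(weights, means, variances)` tuple. The toy two-dimensional models are invented purely to exercise the code and stand in for the 64 and 24 density models over 60 component frames.

```python
import math


def gmm_log_likelihood(x, weights, means, variances):
    """log p(x | GMM) for a diagonal covariance Gaussian mixture."""
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        ll = math.log(w)
        for xd, md, vd in zip(x, mu, var):
            ll += -0.5 * (math.log(2.0 * math.pi * vd) + (xd - md) ** 2 / vd)
        log_terms.append(ll)
    # Log-sum-exp over mixture components for numerical stability.
    m = max(log_terms)
    return m + math.log(sum(math.exp(t - m) for t in log_terms))


def segment_llr(frames, speech_gmm, background_gmm):
    """Sum of per-frame log likelihood ratios over an N frame segment."""
    return sum(
        gmm_log_likelihood(x, *speech_gmm) - gmm_log_likelihood(x, *background_gmm)
        for x in frames
    )


def verify_segment(frames, speech_gmm, background_gmm, threshold=0.0):
    """Accept the segment as barge-in speech if the LLR exceeds the threshold."""
    return segment_llr(frames, speech_gmm, background_gmm) > threshold


# Toy models (illustrative only): (weights, means, variances).
toy_speech = ([0.5, 0.5], [[2.0, 2.0], [3.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]])
toy_background = ([1.0], [[0.0, 0.0]], [[1.0, 1.0]])
```

Because the score is a plain sum over a fixed number of frames, the decision is available as soon as the last frame of the segment arrives, which is what gives the verification stage its bounded latency.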
`
`4.2. Performance
`
`The performance of the procedure outlined in Figure 2 is
`described in terms of its ability to minimize the number of
`utterances with false barge-in events and missed barge-in
`events over the same 2000 utterance test set described in
`Section 3.2. The performance is described by the plot in
`Figure 3. The plot displays the utterance based probability
`of false detection along the horizontal axis and probability
`of missed detection along the vertical axis. In order
`for there to be a correct barge-in detection using the hybrid
`procedure, all hypothesized segments corresponding
`to non-speech in an utterance must be rejected by the verification
`step and a hypothesized segment overlapping the
`start of speech must be accepted. The curve in the figure
`represents the operating characteristic as the threshold on
`the likelihood ratio scores is varied from zero to infinity.
`The point labeled “decoder-based BI” in Figure 3 shows
`the detection characteristics of the decoder-based barge-in
`procedure taken from the outcome table in Table 1.
`The operating point shown corresponds to an empirically
`chosen threshold setting that was used in the system for
`collecting the utterances described in Section 3.2.
`
`[Figure 3 plot, "Barge-in Utterance Detection Characteristics": horizontal axis: probability of false speech detection; legend includes "Hybrid Feature/Model BI" and "VAD (100 msec PP)"]
`Figure 3: Operating curve describing hybrid barge-in performance
`plotted with operating points of decoder-based
`barge-in technique and the voice activity detector.
`
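
The utterance-level scoring rule and the threshold sweep can be sketched as follows. The data layout is an assumption made for illustration: each utterance is a pair of a segment list, with entries `(start_frame, end_frame, llr)`, and an annotated speech onset frame that is `None` when the user did not barge-in. Segments lying entirely after the speech onset are treated as speech and never counted as false events, which is an interpretation of the rule rather than something the paper states.

```python
def utterance_outcome(segments, speech_start, threshold):
    """Classify one utterance as 'correct', 'false', or 'missed'.

    segments:     list of (start_frame, end_frame, llr), end exclusive
    speech_start: annotated onset frame of user speech, or None
    """
    false_accept = False
    onset_accepted = False
    for start, end, llr in segments:
        accepted = llr > threshold
        overlaps_onset = speech_start is not None and start <= speech_start < end
        if overlaps_onset:
            onset_accepted = onset_accepted or accepted
        elif accepted and (speech_start is None or end <= speech_start):
            # An accepted segment that contains no user speech.
            false_accept = True
    if false_accept:
        return "false"
    if speech_start is not None and not onset_accepted:
        return "missed"
    return "correct"


def operating_curve(utterances, thresholds):
    """Return [(threshold, p_false, p_miss), ...] for a threshold sweep.

    p_false is measured over all utterances and p_miss over utterances
    in which the user attempted to barge-in.
    """
    n_barge = sum(1 for _, onset in utterances if onset is not None)
    points = []
    for th in thresholds:
        outcomes = [utterance_outcome(segs, onset, th) for segs, onset in utterances]
        p_false = outcomes.count("false") / len(utterances)
        p_miss = outcomes.count("missed") / n_barge if n_barge else 0.0
        points.append((th, p_false, p_miss))
    return points
```

Raising the threshold moves the operating point toward fewer false events and more missed events, which traces out a curve of the kind plotted in Figure 3.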
`There are two points labeled as “VAD” correspond-
`ing to the detection characteristics of the first stage of the
`hybrid procedure when the post-processor is adjusted to
`hypothesize segments of length 250 msec and 100 msec
`
`respectively. The first point corresponds to a trade-off
`between false alarms and missed detections that might
`be considered reasonable if the VAD were to be used in
`isolation for barge-in detection. The second point corresponds
`to the operating point that was actually used
`in the hybrid system where the number of hypothesized
`events was over-generated to the point where the number
`of missed barge-in events is close to zero.
`There are two observations that can be made from this
`plot. It is clear from the curve in Figure 3 that the model-
`based verification step performs well enough to improve
`the overall detection characteristics. The second observation
`is that the performance of the hybrid procedure is
`actually better in this case than the decoder-based procedure.
`A major issue associated with the decoder-based
`procedure is the excessive latency which was characterized
`in Section 3.3. The latencies measured for the 2000
`utterance subset described in Section 3.2 were shown to
`range from 300 msec to well over one sec. Hence, the
`good performance obtained by the hybrid method is especially
`important when one considers its fixed, short latencies.
`This result, along with the simplicity of the hybrid
`method, recommends its use even in the most resource
`limited ASR implementations.
`
`5. CONCLUSIONS
`
`
`
`There were two major contributions in this paper relating
`to procedures for allowing users of human-machine dia-
`log systems to interrupt system prompts. The first was
`the result of an experimental study that demonstrated the
`effect that excessive barge-in latencies can have on the
`presence of disfluencies in users’ utterances. This study
`was done in the context of a decoder-based barge-in procedure
`that relies on confidence scores computed in the
`ASR decoder to detect when user barge-in has occurred.
`It was found for a particular dialog state that barge-in de-
`lays that were approximately 0.5 sec or longer resulted in
`over fifty percent of users’ utterances having some dis-
`fluency. The second contribution was a hybrid barge-
`in procedure that uses a model based likelihood ratio
`test for verifying whether hypothesized segments contain
`speech. The procedure operates with a maximum latency of
`300 msec and was shown to have detection performance
`that was significantly better than the decoder-based
`barge-in detection procedure.
`
`
`6. ACKNOWLEDGMENTS
`
`The authors would like to express their appreciation to
`Susan Boyce for providing the utterance annotations necessary
`for analyzing the effects of barge-in on users’
`speech. The authors would also like to thank Harry Blanchard,
`S. Parthasarthy, and Vincent Goffin for their many
`helpful suggestions and advice on barge-in issues.
`
`7. REFERENCES
`
`[1] Bruce Balentine and David P. Morgan. How to build
`speech recognition applications - A style guide for
`telephony dialogs. Enterprise Integration Group, San
`Ramon, CA, 1999.
`
`[2] ETSI TS 126 094 (2001-03). Universal Mobile
`Telecommunications System (UMTS); Mandatory
`speech codec speech processing functions AMR
`speech codec; Voice Activity Detector (VAD) (3GPP
`TS 26.094 version 4.0.0 Release 4).
`
`[3] A. Johnstone, U. Berry, and T. Nguyen. There
`was a long pause: influencing turn-taking behavior
`in human-human and human-computer spoken dialogues.
`International Journal of Human-Computer
`Studies, 41:383-411, 1994.
`
`[4] M. Rahim, R. Pieraccini, W. Eckert, E. Levin, G. Di
`Fabbrizio, C. Kamm, and S. Narayanan. A spoken dialog
`system for conference/workshop services. Proc.
`Int. Conf. on Spoken Language Processing, October
`2000.
`
`[5] G. Saon, M. Padmanabhan, R. Gopinath, and
`S. Chen. Maximum likelihood discriminant feature
`spaces. Proceedings of the International Conference
`on Acoustics, Speech, and Signal Processing, May
`2000.
`
`[6] M. Saraclar, M. Riley, E. Bocchieri, and V. Goffin.
`Towards automatic closed captioning: low latency
`real time broadcast news transcription. Proc. Int.
`Conf. on Spoken Language Processing, September
`2002.
`
`[7] A. R. Setlur and R. A. Sukkar. Recognition-based
`word counting for reliable barge-in and early endpoint
`detection in continuous speech recognition.
`Proc. Int. Conf. on Spoken Language Processing,
`pages 2135-2138, Nov. 1998.
`
`[8] N. Strom and S. Seneff. Intelligent barge-in conversational
`systems. Proc. Int. Conf. on Spoken Language
`Processing, October 2000.
`