`and
Neural Network Models
Tina-Louise Burrows
Cambridge University Engineering Department
Trumpington Street
Cambridge CB PZ
England
This dissertation is submitted for consideration for the degree
of Doctor of Philosophy at the University of Cambridge
`
`Ex. 1030 / Page 1 of 202
`Apple v. Saint Lawrence
`
`
`
Summary
This dissertation investigates some aspects of speech processing using linear models and
single hidden layer neural networks. The study is divided into two parts which focus on
speech modelling and speech classification respectively.
The first part of the dissertation examines linear and nonlinear vocal tract models
for synthesising high quality speech with adjustable pitch. A source-filter framework for
analysis and synthesis is used, in which the source is a representation of the glottal volume
velocity waveform. Two families of linear model are considered, ARX (autoregressive
with external input) and OE (output error). Their performance in estimating vocal tract
transfer functions is compared on synthetic speech data, and the difference is explained
in terms of the parameter estimation procedure, the frequency distribution of bias in
the estimate and the assumptions about the spectrum of the noise in the vocal tract
system. The noise spectrum for ARX models is shown to be perceptually significant for
speech synthesis applications because it exploits auditory masking. Methods for improving
poor quality syntheses from OE models are proposed. Nonlinear vocal tract models,
implemented as feed-forward or recurrent neural networks, are investigated. Methods for
initialising networks from linear models are developed. A modified recurrent architecture
is introduced which permits initialisation from ARX models. The use of regularization,
for imposing continuity between models of adjacent speech segments, and learning rate
adaptation, for improving back-propagation training, are discussed. For synthesising real
speech utterances, an audio tape demonstrates that ARX models produce the highest
quality synthetic speech and that the quality is maintained when pitch modifications are
applied.
The second part of the dissertation studies the operation of recurrent neural networks
in classifying patterns of correlated feature vectors. Such patterns are typical of speech
classification tasks. The operation of a hidden node with a recurrent connection is
explained in terms of a decision boundary which changes position in feature space. The
feedback is shown to delay switching from one class to another and to smooth output
decisions for sequences of feature vectors from the same class. For networks trained with
constant class targets, a sequence of feature vectors from the same class tends to drive
the operation of hidden nodes into saturation. It is demonstrated that saturation
defines limits on the position of the decision boundary resulting in context-sensitive and
context-insensitive regions of the feature space. While saturation persists, it is shown that
networks have reduced sensitivity to the order of presentation of feature vectors because
movement of the decision boundary is inhibited. To improve this within-class sensitivity,
training with ramp-like class targets is investigated. The operation of small recurrent
networks is demonstrated for two tasks; classification of speech utterances into voiced and
unvoiced segments, and classification of clockwise and anti-clockwise trajectories of vectors
produced by two autoregressive processes.
`
`
`
`
Acknowledgements
I would like to thank everyone in the Fallside Lab for making my time in Cambridge
an experience. In particular, I would like to mention Julian for his practical advice, Rob
for all his help with the fiddly tape-recording, and Xtof for his patience with my faltering
Spanish. Special thanks to my supervisor, Dr. Mahesan Niranjan, for his guidance, and to
Dr. Ljung and Dr. Maciejowski for helpful discussions on system identification theory. The
biggest thank-you of all goes to my sister, Tanya, for all her love and support, especially
while writing up.
This work has been funded by the Science and Engineering Research Council with
some useful top-ups from the Engineering Department and Queens' College.

Dedication
To Mum and Dad. Thank you for supporting me in all the mad things I do.

Declaration
This , word dissertation is entirely the result of my own work and includes nothing
which is the outcome of work done in collaboration.
Tina-Louise Burrows
Queens' College
March ,
`
`
`
`
Contents

1 Introduction
  1.1 The Speech Production Mechanism
  1.2 Speech Processing
    1.2.1 Review of Research in Modelling Speech Signals
    1.2.2 Review of Research in Classification with Neural Networks
  1.3 Outline of Thesis
    1.3.1 Part I - Vocal Tract Modelling
    1.3.2 Part II - Classification of Speech Patterns
  1.4 Publications

I Vocal Tract Modelling

2 Modelling the Speech Signal
  2.1 Introduction
  2.2 Acoustic Modelling
    2.2.1 Frequency Domain Acoustic Modelling
    2.2.2 Time Domain Acoustic Modelling
  2.3 Linear Prediction Analysis
    2.3.1 Linear Prediction for Speech Analysis and Synthesis
    2.3.2 Spectral Matching
    2.3.3 Predictor Order
    2.3.4 Pre-emphasis of Speech
    2.3.5 Limitations of Linear Prediction for Analysis and Synthesis
  2.4 Improvements to Linear Prediction Analysis and Synthesis
    2.4.1 Analysis-by-Synthesis Techniques
    2.4.2 Perceptual Weighting Filters
    2.4.3 Decoupling the Source and Vocal Tract Filter
  2.5 System Identification Approach to Vocal Tract Modelling
3 Linear Models of the Vocal Tract
  3.1 Linear Black-Box Models
    3.1.1 ARX Models
    3.1.2 OE Models
  3.2 Prediction versus Synthesis
  3.3 Parameter Estimation
    3.3.1 Frequency Domain Interpretation of Prediction-Error Method
    3.3.2 Perceptual Significance of the Model Noise and Transfer Function Bias
    3.3.3 Changing the Noise Model and Transfer Function Bias
  3.4 Model Order Selection
    3.4.1 A(q) and F(q)
    3.4.2 B(q)
  3.5 Generating an Excitation Waveform for Black-Box Models
    3.5.1 Inverse Filtering Techniques
    3.5.2 Volume Velocity Pulse Models
    3.5.3 The Glottal Excitation Model Used in This Work
  3.6 Comparison of Different Analysis Methods Using Synthetic Data
    3.6.1 Noise Model and Transfer Function Estimate
    3.6.2 Effect of Pre-emphasis on Estimation of Transfer Function
    3.6.3 Effect of Noise on Estimation of Transfer Function
    3.6.4 Effect of Misalignment of Excitation on Estimation of Transfer Function
  3.7 The Vocal Tract Modelling Framework
  3.8 Preprocessing of Speech and Laryngograph Data
    3.8.1 Sources of Speech and Laryngograph Data
    3.8.2 Initial Preprocessing
    3.8.3 Pitch and Voicing Analysis
  3.9 The Vocal Tract Filter
    3.9.1 Model Order
    3.9.2 Parameter Estimation
    3.9.3 Filter Implementation
  3.10 Performance on Real Speech Data at Normal Pitch
    3.10.1 Linear Prediction Performance
    3.10.2 ARX Performance
    3.10.3 Comparison of OE and ARX Models
    3.10.4 Improving the Performance of the OE model
    3.10.5 Effect of Misalignment Errors on Estimation of Vocal Tract Transfer Function
    3.10.6 Parallel Implementation
  3.11 Pitch Manipulation of Synthesised Speech
    3.11.1 Pitch Manipulation by PSOLA
    3.11.2 Pitch Manipulation Performance
  3.12 Concluding Remarks
    3.12.1 Evaluation of Performance
    3.12.2 Choosing a Suitable Pre-emphasis Filter
    3.12.3 Speech Coding
    3.12.4 Limitations of the Linear Vocal Tract System
4 Neural Network Models of the Vocal Tract
  4.1 Difficulties in Modelling Long Utterances with Neural Networks
  4.2 The Neural Network Model
    4.2.1 Operation of Hidden Nodes of Small Networks on Continuous Data
  4.3 Initialisation of Neural Network Weights
    4.3.1 Review of Work on Initialisation of Neural Network Weights
    4.3.2 Motivation for Initialisation from Linear Models
    4.3.3 Weight Initialisations for Feed-forward Networks
    4.3.4 Weight Initialisations for Single Delay Recurrent Networks
  4.4 Modified Recurrent Network Architecture
    4.4.1 Weight Initialisation from ARX model
    4.4.2 Training by back-propagation
  4.5 Nonlinear Vocal Tract Modelling Using Neural Networks
    4.5.1 Selecting a Network Architecture
    4.5.2 Size of Networks and Initial ARX models
    4.5.3 Issues for Back-propagation
  4.6 Performance of Network Models
  4.7 Improving The Perceptual Quality of Network Synthesis
    4.7.1 Back-propagation with Regularization
    4.7.2 Improved Performance Results
  4.8 Concluding Remarks
    4.8.1 Summary
    4.8.2 Stability of Neural Network Models
    4.8.3 Usefulness of Linear Initialisation
    4.8.4 Drawbacks of a Nonlinear Model

II Classification of Speech Patterns

5 Recurrent Networks for Context-Dependent Speech Classification
  5.1 Introduction
    5.1.1 Context-dependent Pattern Classification Tasks
    5.1.2 Hidden Markov Models vs Neural Networks
    5.1.3 Recurrent Networks for Context-Dependent Pattern Classification
  5.2 The Recurrent Network Decision Boundary
  5.3 The Effect of Saturation - Context-Sensitivity
  5.4 The Effect of Decision Boundary Movement - Trajectory-Sensitivity
  5.5 Implications for Classifier Performance
    5.5.1 Misclassifications
    5.5.2 Between-Class Context - Switching Delay
    5.5.3 Within-Class Context - Output Smoothing
  5.6 Larger Networks
  5.7 Classification of D Vector AR processes
  5.8 Voiced-Unvoiced Classification of Speech Utterances
  5.9 Concluding Remarks
6 Conclusions and Further Work
  6.1 Conclusions
    6.1.1 Vocal Tract Modelling
    6.1.2 Classification of Speech Patterns
  6.2 Further Work
    6.2.1 Linear Models of the Vocal Tract
    6.2.2 Neural Network Models of the Vocal Tract
    6.2.3 Classification of Speech Patterns
A Back-propagation Training For Multi-Delay Recurrent Neural Networks
B Tape Demonstration
  B.1 Introduction
  B.2 Comparison of Different Linear Models
  B.3 Pitch Manipulation using ARX and LP Models
  B.4 Neural Network Models of the Vocal Tract
`
`
`
`
List of Figures

1.1 Speech Production Mechanism.
1.2 Typical speech waveform and spectrograms.
2.1 The acoustic theory of speech production.
2.2 Cascade and Parallel Formant Synthesisers.
2.3 Source-filter arrangement for LP synthesis.
2.4 Examples of linear prediction spectra.
2.5 Amplitude spectra for typical weighting filters.
3.1 System identification approach to vocal tract modelling.
3.2 Operation of black-box models in prediction and synthesis.
3.3 Source-filter configurations for black-box models.
3.4 The Rosenberg glottal volume velocity wave pulse.
3.5 Typical speech, laryngograph, residual and glottal volume velocity waveforms.
3.6 Estimates of the vocal tract transfer function for the synthetic vowel in 'hod'.
3.7 Frequency bias functions for the synthetic vowel in 'hod'.
3.8 Noise models and spectra of synthesis errors for the synthetic vowel in 'hod'.
3.9 Estimates of the vocal tract transfer function for the synthetic nasalised vowel /ε̃/.
3.10 Frequency bias functions for /ε̃/.
3.11 Absolute (%) error in estimation of formants and bandwidths of synthetic vowel data.
3.12 Effect of noise on estimation of transfer function of the synthetic vowel in 'hawed'.
3.13 Varying stages in the estimation of OE transfer function.
3.14 Effect of misalignment of excitation on transfer function estimates.
3.15 Effect of alignment errors on prediction error, synthesis error and synthesis for the synthetic vowel in 'hud'.
3.16 The vocal tract modelling framework for speech synthesis.
3.17 Pitch-synchronous parameter update scheme.
3.18 Comparison of pitch contours at different frame rates.
3.19 Typical voicing transitions where errors in voicing decision occur.
3.20 Comparison of spectra from autocorrelation and covariance methods of linear prediction.
3.21 Comparison of ARX and LP models for the speech fragment 'in lang'.
3.22 Comparison of ARX and LP models for a segment of the phone 'ng'.
3.23 Spectrograms of synthesis of the utterance 'Germany's decision followed eight years later' by different vocal tract models.
3.24 Comparison of performance of OE and ARX models in synthesis.
3.25 Transfer function estimates for the synthetic vowel in 'hawed'.
3.26 Comparison of performance of ARX and regularized OE models.
3.27 Spectrogram of synthesis by regularized OE model.
3.28 Modifying the pitch of voiced speech using the PSOLA method.
3.29 Spectrograms of pitch manipulated synthesis from different models.
4.1 Structure of a neural network model of the vocal tract.
4.2 Regions of operation of tanh nonlinearity.
4.3 Illustration of the operation of hidden nodes of a two node network.
4.4 Comparison of initialisation techniques for feed-forward networks.
4.5 Comparison of initialisation techniques for RNN .
4.6 Comparison of initialisation techniques for RNN.
4.7 Structure of modified recurrent network.
4.8 Spectra of ARX model and Hi(z) for RNN .
4.9 Effect of learning rate adaptation on back-propagation training of feed-forward and recurrent networks.
4.10 Spectrograms of syntheses by network models.
5.1 Phone recognition - a context-dependent classification task.
5.2 HMM and neural net approaches to phone recognition.
5.3 Single hidden node with recurrent connection.
5.4 Effect of a recurrent connection on the position of the decision boundary in feature space.
5.5 Effect of saturation on position of the decision boundary.
5.6 Feature space projections (v^T x(t)) for which classification causes movement of the decision boundary.
5.7 An example of trajectory-sensitivity.
5.8 Two-state HMM equivalent to a single node recurrent net with step nonlinearity.
5.9 Effect of gradient of nonlinear function on switching delay.
5.10 Effect of gradient of nonlinear function on switching speed.
5.11 Operation of classifier on a class - trajectory.
5.12 Effect of boundary movement on output smoothing.
5.13 Feature space for vector AR processes, showing limiting positions of the decision boundary when hidden units saturate.
5.14 Operation of the recurrent network (nh= ), trained with fixed class targets, in classifying the vector AR trajectories.
5.15 Operation of the recurrent network (nh=), trained with fixed class targets, in classifying the vector AR trajectories.
5.16 Operation of recurrent network (nh=), trained with exponential class targets, in classifying the vector AR trajectories.
5.17 Operation of recurrent network (nh=), trained with fixed class targets, in voiced-unvoiced classification of the sentence "John cleans shellfish for a living".
5.18 Operation of recurrent network (nh=), trained with fixed class targets, in voiced-unvoiced classification of the sentence "John cleans shellfish for a living".
B.1 Pitch contours applied to the utterance "France became the first decimal country in Europe, in ".
B.2 Pitch contours applied to the utterance "... joined by Belgium, Italy and Switzerland, in ".
B.3 Pitch contours applied to the utterance "Germany's decision followed eight years later".
`
`
`
`
List of Tables

3.1 Summary of source-filter parameters for generation of synthetic data.
3.2 Summary of model orders used in analysis of synthetic speech data.
3.3 Effect of misalignment of excitation on prediction and synthesis SNR for ARX and OE models.
3.4 Effect of variation of model order on mean prediction SNR for linear prediction models.
3.5 Effect of variation of model order on mean SNR (dB) for ARX models.
3.6 Summary of model orders used for vocal tract modelling using black-box and linear prediction models.
3.7 Improvement in prediction SNR of ARX models over LP models.
3.8 Effect of variation of input and use of pre-emphasis on OE and ARX models.
4.1 Summary of parameter values for vocal tract modelling with neural networks.
4.2 Performance of networks trained with learning rate adaptation.
4.3 Performance of regularized networks.
5.1 Performance results for networks trained to classify vector AR processes.
5.2 Mapping between TIMIT phone labels and voiced-unvoiced classes.
5.3 Comparison of performance of recurrent networks (RNN) and extracted feed-forward network (FNN) with equivalent weights U and V.
5.4 Comparison of performance of recurrent networks trained with fixed and exponential targets for voiced-unvoiced classification of speech.
`
`
`
List of Notation

Abbreviations
AR      autoregressive model
ARX     autoregressive (AR) model with external input (X)
OE      output error model
LP      linear prediction model
CELP    code excited linear prediction
FNN     feedforward neural network
RNN     recurrent neural network
HMM     hidden Markov model
ETFE    empirical transfer function estimate
SNR     signal-to-noise ratio
MSE     mean squared error

Symbol Definitions
$R(z)$, $R(q)$                      lip radiation characteristic
$P(z)$, $P(q)$                      pre-emphasis filter
$\hat{H}(z)$, $H(z)$, $H(q)$        vocal tract transfer function
$H(e^{j\omega}, \theta)$            vocal tract frequency response
$\hat{\hat{H}}(q)$                  Empirical Transfer Function Estimate
$Q(\omega, \theta)$                 frequency bias function for transfer function estimate
$N(q)$, $N(e^{j\omega}, \theta)$    model noise and corresponding spectrum
$\Phi_{ER}(\omega, \theta)$         spectrum of synthesis error
$L(t)$, $dL(t)$                     laryngograph signal and first difference
$x(t)$, $dx(t)$                     glottal volume velocity wave model and first difference
$X(e^{j\omega})$                    spectrum of input waveform ($x(t)$ or $dx(t)$)
$y(t)$                              speech waveform
$Y(e^{j\omega})$                    speech spectrum
$\hat{y}_s(t)$                      model synthesis
$\hat{y}_p(t)$                      model prediction
$x(t)$                              network input vector
$h(t)$                              hidden node output
$\hat{y}(t)$                        network output (prediction or synthesis)
$U$, $V$, $W$                       network weights (output, input and feedback)
$q^{-1}$                            backward shift operator, $q^{-1} x(t) = x(t-1)$
$z$, $z^{-1}$                       z-transforms
$(\cdot)^T$                         denotes matrix transpose
$\|\cdot\|$                         denotes Euclidean norm
`
`
`
`
Chapter 1
Introduction
"In the beginning was the Word, and the Word was with God, and the Word was God."
St. John 1:1.
Speech is the acoustic realisation of a language. Our knowledge of how we speak,
hear, recognise and understand a language can be increased by studying the speech signal
and attempting to model these functions. This thesis investigates some issues for speech
processing with linear and neural network models. In this chapter, the speech production
mechanism is described and some of the terminology applicable to speech processing is
introduced. Previous relevant research in speech processing with linear and neural network
models is reviewed and the research presented in this thesis is outlined.
1.1 The Speech Production Mechanism
The mechanism for speech production, shown in Fig. 1.1, consists of the trachea, vocal
cords, tongue, vocal tract (oral and nasal cavities), lips, teeth and nostrils, in addition to
the diaphragm and lungs. A speech utterance begins as an airstream or volume velocity
wave from the lungs, which travels along the trachea and vocal tract to be radiated as an
acoustic pressure waveform from the lips or the lips and nostrils.

Figure 1.1: Speech Production Mechanism.

Speech is classified as voiced or unvoiced, depending on the nature of the excitation of
the vocal tract. For voiced phones, the excitation of the vocal tract originates at the glottis
and is produced by the periodic vibration of the vocal cords. The frequency of vibration, or
pitch, is controlled by the tension in the vocal cords and the air pressure from the lungs. Typical
pitch values lie in the range - Hz for adults, and can rise to  Hz in children. Due
to its periodic nature, the spectrum of voiced excitation contains discrete components at
harmonics of the pitch frequency. For unvoiced sounds, the excitation is due to turbulence
generated by airflow past a narrow constriction and tends to be random in nature, with
a flat, continuous spectrum. The noise is known as aspiration if the constriction is at the
glottis and frication if it occurs at some point along the vocal tract. Mixed excitation
is also possible for the class of sounds known as voiced fricatives, in which turbulent
excitation is amplitude modulated periodically by the vibration of the vocal cords.
The acoustic signal can be represented by a transcription of phonemes, which are
the smallest units which convey linguistic meaning of a language. The actual sounds
which are produced in speaking a string of target phonemes are called phones. Each
phone of an utterance corresponds to a segment of the acoustic waveform which has
a characteristic time-varying vibratory pattern. Vibratory patterns are superimposed
on the airstream by the vibration of the vocal cords and resonance of the vocal tract.
The resonant properties of the vocal tract are modified by changing the position of the
articulators (the lips, tongue, jaw and velum, shown in Fig. 1.1). Due to the physical
constraints of the vocal tract, the positions of the articulators can only change slowly with
time and individual realisations of a phone are strongly influenced by previous and future
phones in an utterance. This phenomenon is known as co-articulation and is important
for both accurate speech recognition and natural sounding speech synthesis.
Due to the slowly time-varying nature of the acoustic waveform for each phone, the
resulting spectrum of the speech varies with time. The time variability of the spectrum is
captured by calculating the spectrum of overlapping short-time segments of the acoustic
waveform and is displayed using a spectrogram. A spectrogram plots the frequency of
successive short-time spectra using the intensity of the plot to indicate the energy of the
frequency components at a particular instant. Most of the energy in the speech spectrum
is between - Hz. Intelligibility tests on band-pass filtered speech show that
intelligibility is not impaired when speech is low-pass filtered to remove all frequencies
above kHz (French & Steinberg , Klatt ). This permits a lower sampling rate
of  kHz. Within the frequency range -kHz, the vocal tract for voiced phones typically
has - resonant frequencies (Klatt ) which are called formants. Formants are visible
as dark horizontal bands on a spectrogram. Examples of wideband and narrowband
spectrograms for the utterance 'Belgium' are shown in Fig. 1.2. Wideband and narrowband
spectrograms represent a tradeoff between time and frequency resolution of the spectrum.
Narrowband spectrograms use short-time speech segments of a couple of pitch periods in
duration. The resulting spectrogram has high frequency resolution (y axis) and individual
pitch harmonics appear as closely spaced horizontal bands, as illustrated in Fig. 1.2(b).
However, time resolution is poor and rapid formant transitions are averaged over time.
For the class of sounds called stops, the 'b' in 'Belgium' for example, the vocal tract
becomes completely occluded by the tongue or lips for part of the utterance. Rapid
movement of the articulators to release the occlusion, which may be accompanied by a burst
of noise, gives rise to sounds that are short in duration and highly transient in nature.
The time-variability in the spectrum of such phones may not be accurately represented by
a narrowband spectrogram. Wideband spectrograms use short-time segments of roughly
one pitch period in duration and give much better time resolution at the expense of
frequency resolution. For voiced speech, vertical striations at the pitch period are visible, as
illustrated in Fig. 1.2(c).
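The trade-off between time and frequency resolution can be checked numerically. The following is a minimal sketch, not the analysis used in this thesis: the 8 kHz sampling rate, the 100 Hz pitch, the Hamming window, the naive DFT and all function names are assumptions chosen only for illustration. The long frame is taken as four pitch periods so that the assumed window separates neighbouring harmonics.

```python
import cmath
import math

FS = 8000      # assumed sampling rate in Hz (illustrative only)
PITCH = 100    # assumed pitch in Hz (illustrative only)

def voiced_like(n_samples):
    """Sum of the first three pitch harmonics, a crude stand-in for voiced speech."""
    return [sum(math.cos(2 * math.pi * PITCH * h * t / FS) for h in (1, 2, 3))
            for t in range(n_samples)]

def magnitude_spectrum(frame):
    """Magnitude of a naive DFT of a Hamming-windowed frame, bins 0..N/2."""
    n = len(frame)
    win = [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]
    x = [s * w for s, w in zip(frame, win)]
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
            for k in range(n // 2 + 1)]

def bin_spacing_hz(n):
    """Spacing between DFT bins for an n-sample frame."""
    return FS / n

# Long frame: 40 ms (four assumed pitch periods), bins every 25 Hz.
narrow = magnitude_spectrum(voiced_like(320))
# Short frame: 10 ms (one assumed pitch period), bins every 100 Hz.
wide = magnitude_spectrum(voiced_like(80))
```

With the long frame, the bins at 100 Hz and 200 Hz (`narrow[4]`, `narrow[8]`) stand well above the bin between them (`narrow[6]`), so individual harmonics are resolved. With the short frame the bin spacing equals the pitch, so no bin falls between harmonics and the harmonic structure cannot be seen, although each frame now covers only 10 ms of signal.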
During nasals, such as 'm' or 'n', the airflow is diverted into the nasal cavity by
the lowering of the velum. With the lips closed, the nasal cavity forms the principal
resonant path which determines the formants and the vocal tract acts as a closed
side-branch which introduces an anti-resonance (spectral valley) into the spectrum. In nasalised
vowels, both the nasal and oral cavities are open and sound is radiated from the lips and
nostrils simultaneously. The main resonances are due to the oral cavity, which determines
the location of the formants, and the nasal branch is considered as the side-branch.
On reaching the lips and nostrils, the effect of directional sound propagation from
these apertures is to convert the volume velocity wave into an acoustic pressure waveform
which radiates away from the head. The pressure wave measured directly in front of the
head is proportional to the time derivative of the resultant volume velocity wave from the
lips and nostrils, and is inversely proportional to the distance from the lips (Fant ).
The radiation effect can be approximated as that of radiation from a circular aperture in
a sphere or infinite plane (Flanagan ) and the amplitude spectrum of the resultant
acoustic waveform is approximately modified by +dB/octave when compared to that of
the volume velocity wave at the end of the vocal tract.
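The link between a time derivative and a rising spectral tilt can be illustrated with a discrete first difference. This is a sketch under stated assumptions rather than the radiation model of this thesis: the first difference y(t) = x(t) - x(t-1) is used as a stand-in for the time derivative, and the function names and the 8 kHz sampling rate in the usage note are hypothetical.

```python
import math

def first_difference_gain_db(freq_hz, fs_hz):
    """Gain in dB of y(t) = x(t) - x(t-1) at freq_hz, for sampling rate fs_hz.

    The magnitude response of 1 - exp(-j*omega) is 2*|sin(omega/2)|.
    """
    omega = 2 * math.pi * freq_hz / fs_hz
    return 20 * math.log10(2 * abs(math.sin(omega / 2)))

def slope_db_per_octave(freq_hz, fs_hz):
    """Change in gain when the frequency is doubled from freq_hz."""
    return (first_difference_gain_db(2 * freq_hz, fs_hz)
            - first_difference_gain_db(freq_hz, fs_hz))
```

For example, `slope_db_per_octave(100, 8000)` is close to 20 log10(2), about 6 dB, which is the tilt of an ideal differentiator. The slope falls below this value as the frequency approaches half the sampling rate, where the first difference stops being a good approximation to the derivative.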
Additional features which add intelligibility, meaning and naturalness to speech are
stress and, over longer phrasal durations, prosody. In addition to pitch, duration and
intensity (loudness) constitute the parameters of stress and prosody which are used to
emphasise important acoustic events and break speech up into meaningful units. At a
higher phrasal level, specific prosodic patterns can also convey emotion and attitude.
Figure 1.2: Typical speech waveform and spectrograms: (a) speech utterance 'Belgium',
(b) narrowband spectrogram, (c) wideband spectrogram. For spectrograms, horizontal axis
shows time in seconds, vertical axis shows frequency in Hz.

1.2 Speech Processing
The two areas of speech processing considered in this thesis are signal modelling (for
speech synthesis) and signal classification (for speech recognition). In modelling the speech
signal, the aim is to parametrize speech waveforms in such a way that they can be stored
efficiently and reproduced (synthesised) at a later date. Parameters for models can be
found by performing a time or frequency domain match between the original speech signal
and that generated by the model.
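As one concrete instance of a time domain match, linear prediction chooses coefficients that minimise the squared error between each sample and a weighted sum of the preceding samples. The sketch below uses the autocorrelation method solved by the Levinson-Durbin recursion. It is an illustration only: the function names are hypothetical and nothing here is claimed to be the estimation procedure used in this thesis.

```python
def autocorrelation(x, max_lag):
    """Unnormalised autocorrelation r[0..max_lag] of a finite frame."""
    n = len(x)
    return [sum(x[t] * x[t + k] for t in range(n - k)) for k in range(max_lag + 1)]

def linear_prediction(x, order):
    """Levinson-Durbin recursion on the frame autocorrelation.

    Returns (a, err) where x(t) is predicted as sum_k a[k-1] * x(t-k), k = 1..order,
    and err is the residual prediction error energy.
    """
    r = autocorrelation(x, order)
    a = [0.0] * (order + 1)          # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for stage i.
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        prev = a[:]
        a[i] = k
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        err *= 1.0 - k * k
    return a[1:], err
```

Applied to the decaying exponential `[0.9 ** t for t in range(200)]`, which is the impulse response of a single-pole filter, a second order analysis returns a first coefficient close to 0.9 and a second coefficient close to zero, as expected for a signal that one pole describes fully.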
In classification, models are developed to assign class labels to segments of the acoustic
signal based on the distinguishing features of a parametric representation of each segment.
In speech recognition, for example, the class labels are linguistic units of the language
such as phones, diphones or triphones. The linguistic units can form the input for higher
level natural language processing, in which syntactic and semantic constraints on possible
linguistic sequences are applied and the meaning of the intended utterance extracted.
Lower level classes are also possible, such as classifying the speech signal into voiced and
unvoiced segments.
Speech processing typically involves the use or calculation of a parametric
representation of acoustic waveforms. Speech signals are non-stationary and when processing long
utterances, a time-varying parametric representation is needed. A quasi-stationary
approach is usually adopted, in which an utterance is divided into a sequence of overlapping
segments and assumed to be stationary for the duration of each segment. Since the speech
production mechanism can change only slowly with time, parametric representations of
adjacent segments of speech show a high degree of correlation. For modelling acoustic
waveforms, this implies continuity in the values of model parameters for adjacent
segments. For speech classification, it implies that the class label assigned to a particular
feature vector is dependent on the context in which that feature vector occurs in an input
sequence (context-dependent classification). Exploiting the correlation between segments
of speech is highly beneficial for speech processing applications and is a way of representing
co-articulation effects.
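The division of an utterance into overlapping segments can be sketched as follows. The function name and the frame length and hop used in the usage note are hypothetical, and dropping the incomplete final segment is a choice made for this illustration.

```python
def split_into_frames(signal, frame_len, hop):
    """Return overlapping frames of frame_len samples, advancing by hop samples.

    Trailing samples that do not fill a whole frame are dropped.
    """
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    return [signal[start:start + frame_len]
            for start in range(0, len(signal) - frame_len + 1, hop)]
```

For example, `split_into_frames(list(range(10)), 4, 2)` gives four frames starting at samples 0, 2, 4 and 6, each sharing half of its samples with the next. A separate parametric representation would then be computed for each frame, with the overlap contributing to the correlation between adjacent representations.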
1.2.1 Review of Research in Modelling Speech Signals
The most widely used technique for speech analysis is linear prediction analysis (Makhoul
 , Markel & Gray ), which forms the basis of most speech coding systems, such as
vocoders (Markel & Gray ), CELP (code-excited linear prediction) coders (Schroeder
& Atal ), multi-pulse coders (Atal & Remde ) and a host of variants which differ in
the nature of the excitation of the linear prediction model at the decoder. The popularity
of linear prediction is due to ease of analysis and implementation and low computational
requirements. An alternative approach is to model the transfer function of the vocal
tract system (vocal tract modelling). ARX (autoregressive with external input) (Lobo &
Ainsworth , Fujisaki & Ljungqvist ), OE (output error) (Wang, Guan & Fujisaki
 ) and state-space (Morikawa & Fujisaki ) parametrizations for the vocal tract
filter have been used and differ in the underlying structure of the model and the nature
of the error which is minimised in the parameter estimation procedure.
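The difference between the ARX and OE structures can be written out in the usual system identification form. The equations below are the standard textbook definitions and are given as an assumption about the general form only; the orders and exact parametrization used in this thesis are not stated in this chapter. Here y(t) is the speech waveform, x(t) the excitation, e(t) a white noise term, and A(q), B(q) and F(q) are polynomials in the backward shift operator.

```latex
\begin{align*}
\text{ARX:}\quad & A(q)\,y(t) = B(q)\,x(t) + e(t)
  \;\Longleftrightarrow\;
  y(t) = \frac{B(q)}{A(q)}\,x(t) + \frac{1}{A(q)}\,e(t),\\[4pt]
\text{OE:}\quad & y(t) = \frac{B(q)}{F(q)}\,x(t) + e(t).
\end{align*}
```

In these forms the ARX noise term is shaped by 1/A(q) and shares its poles with the transfer function, whereas the OE noise term is added directly at the output. Estimating an ARX model therefore minimises the equation error A(q)y(t) - B(q)x(t), while estimating an OE model minimises the difference between y(t) and the output simulated from x(t) alone.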
Modelling the vocal tract transfer function directly allows inclusion of zeros in the
model and has been shown to improve prediction gain even for a simple impulse
(Fujisaki & Ljungqvist ), or multi-pulse excitation (Singhal & Atal ). The use of a
more realistic representation of the vocal tract excitation, based on glottal volume velocity
wave pulse models, has been shown to improve the prediction gain by - dB when
compared with linear prediction analysis (Fujisaki & Ljungqvist , Thomson , Hedelin
 ), and gives improved naturalness of synthetic speech generated from both formant
synthesisers and vocal tract models (Holmes , Rosenberg , Fujisaki & Ljungqvist
 ). Alternative approaches to modelling the excitation signal include linear and
nonlinear inverse filtering techniques (Alku , Milenkovic , Denzler, Kompe, Kießling,
Niemann & Nöth ), incorporating an all-zero model directly in the vocal tract
transfer function (Mathews, Miller & David , Funaki & Mitome ) or incorporating
a more general function-based model of the excitation within the vocal tract transfer
function (Thomson , Cheng & O'Shaughnessy ). Speech coding systems using
ARX models and a pulse-based excitation have been shown to give improved naturalness
and prediction performance over linear prediction based coders (Hedelin , Cheng &
O'Shaughnessy ).
Speech coders require a parameter estimation procedure that is robust to the effects
of noise. In the speech enhancement work by Lim & Oppenheim ( ) and Hansen &
Clements ( ), MAP estimation was used to improve the estimation of linear
prediction parameters in noisy environments. The correlation between the models of adjacent
segments was exploited by using the model parameters from previous segments as initial
estimates of the parameters for the current segment. Using a Bayesian framework to
calculate model parameters, prior assumptions about the expected values of the parameters
can be incorporated into the estimation procedure. Saleh, Niranjan & Fitzgerald ( )
have used this approach for linear prediction analysis, to obtain smoothed estimates of
the formant tracks of noisy speech utterances.
There is experimental and theoretical evidence that the speech production mechanism
is nonlinear (Teager & Teager ). Nonlinearities in the speech data are caused by
rapid transitions between and during phones, especially plosives where there is occlusion
of the vocal tract, and by turbulent excitation durin