CHINESE SEGMENTATION DISAMBIGUATION

Wanying Jin
Computing Research Laboratory
New Mexico State University
wanying@crl.nmsu.edu

Abstract

A technique of reasoning under uncertainty is studied in an attempt to solve disambiguation problems of Chinese segmentation. A knowledge-based inexact reasoning theory incorporating knowledge in morphology, syntax, semantics and pragmatics is presented.

1 Introduction

Processing Chinese texts is specifically difficult in its computation because normally sentences in Chinese texts are represented as strings of Chinese characters without spaces to indicate word boundaries. This causes a problem for Chinese machine translation, statistical analysis of Chinese corpora, Chinese information retrieval, etc., as usually these projects are based on the assumption that all lexicon distinctions have been recognized in advance.

Several approaches aimed at transferring a Chinese character string into a word string have been studied in recent decades. Two competing approaches commonly used for Chinese text segmentation are the statistical approach (Chang, et al, 1991; Sproat and Shih, 1991; Chiang, et al, 1992) and the heuristic approach (Chen and Liu, 1992; He, et al, 1991; Jin and Nie, 1993; Jin, 1992; Liang and Zhen, 1991; Wang, et al, 1991). Although a high degree of precision has been reported for both methods, each has its limitations, particularly in identifying unknown words and disambiguating multiple segmentations. Recently, a hybrid approach incorporating heuristics with statistics has been studied in an attempt to solve unknown word recognition problems (Chen and Liu, 1992; Nie and Jin, 1994). However, ambiguous segmentation is still a difficult problem.
In this paper a method of reasoning under uncertainty intended to disambiguate Chinese segmentation is presented. A model of evidential strength in inexact reasoning has been studied by (Buchanan and Shortliffe, 1984). In the process of Chinese segmentation, knowledge in morphology, syntax, semantics and pragmatics is used as evidence to support the disambiguation hypotheses. The similarity of uncertain knowledge and inexact reasoning between medical diagnosis and natural language interpretation makes it possible to apply the MYCIN technique to Chinese text segmentation.

2 Difficulties in Chinese segmentation

As claimed in (Liu, 1987), the main causes of segmentation ambiguity are vagueness in word definition and the phenomenon of word chains. The vagueness of the word definitions causes segmentation ambiguities, as in the string 现代化工厂. It can stand either for 现代化 工厂 (modern factory) or for 现代 化工厂 (modern chemical factory). A word chain is a sequence of Chinese characters from which several words can be produced with or without overlap. Two types of word chains have been recognized in Chinese literature, i.e. multi-sense combinations and intersection combinations (Huang and Liu, 1988). The string
AOL Ex. 1026, Page 1 of 5
冰箱 is an example of multi-sense combination; 冰 (ice), 箱 (box) and 冰箱 (refrigerator) are all words. The character string 球拍卖 is an example of intersection combination; 球拍 (paddle) is a word and 拍卖 (sell-at-sale-price) is also a word, whereas 拍 is the intersection character. The example of the string 乒乓球拍卖完了 illustrates the typical segmentation ambiguity caused by word chains. The segmentation of this string can be either

    乒乓球 拍卖 完 了 (The ping-pong balls were sold out at sale price.)

or

    乒乓 球拍 卖 完 了 (The paddles for table tennis were sold out.)

Some ambiguities can be solved by word structure knowledge. Others can be disambiguated by syntactic and/or semantic knowledge. The most difficult disambiguation is that requiring contextual or pragmatic knowledge to arrive at an appropriate interpretation, as in the string [...] which can be segmented into:

    [...] (Students will write a paper.)

or

    [...] (The student association writes a paper.)

Both are syntactically and semantically correct. In this case, contextual information would allow the reader to trace the information claimed in the previous statements to solve ambiguity problems.

3 Reasoning theory for Chinese segmentation disambiguation

A model of evidential strength in inexact reasoning studied by (Buchanan and Shortliffe, 1984) has been successfully implemented in the MYCIN system. The theory is that, if a hypothesis can be derived from various types of mutually exclusive evidence, then the strength of truth of the hypothesis can be increased to reach a plausible conclusion.

Two concepts MB[h,e] and MD[h,e] have been introduced as the measures of belief and disbelief. MB[h,e] means the measure of increased belief in the hypothesis h, based on the evidence e. MD[h,e] means the measure of increased disbelief in the hypothesis h, based on the evidence e.
To facilitate comparison of the evidential strength of competing hypotheses, the certainty factor CF is introduced to combine degrees of belief and disbelief as follows:

$$CF[h, e] = MB[h, e] - MD[h, e]$$

In the case that a hypothesis is derived from a number of mutually exclusive observations, the combining functions are defined as:

$$MB[h, e_1 \& e_2] = \begin{cases} 0 & \text{if } MD[h, e_1 \& e_2] = 1 \\ MB[h, e_1] + MB[h, e_2] \cdot (1 - MB[h, e_1]) & \text{otherwise} \end{cases}$$

$$MD[h, e_1 \& e_2] = \begin{cases} 0 & \text{if } MB[h, e_1 \& e_2] = 1 \\ MD[h, e_1] + MD[h, e_2] \cdot (1 - MD[h, e_1]) & \text{otherwise} \end{cases}$$

In the case that two hypotheses are established with positive evidence from syntactic and semantic knowledge with the same degree, no discrimination of the strength of truth of the hypotheses can be drawn. If world knowledge provides positive evidence for the first hypothesis and negative evidence for the second, then the strength of the first hypothesis is stronger than that of the second. Therefore, the first hypothesis would be the most likely correct segmentation.

A weighted certainty factor is proposed here to represent the importance of various linguistic aspects. The weight is a vector of four elements representing the importance of morphology, syntax, semantics and pragmatics, respectively, which total 1, i.e.

$$CF_i[h, e] = W_i \cdot CF[h, e]$$

where $W_i$ is the weight of the certainty factor $CF_i$ in hypothesis h supported by the evidence e with respect to one of the linguistic
aspects. Suppose the weight vector (0.1, 0.2, 0.3, 0.4) is assigned for morphology, syntax, semantics and pragmatics, respectively; then the following example illustrates the function of the weighted certainty factor $CF_i[h, e]$.

[...] (The third leader in our company does not have much power.)

The word [...] produces two segmentations:

    [...] (The third leader in our company does not have much power.)

or:

    [...] (The third piece-of hand in our company does not have much power.)

To estimate the strength of truth of the first hypothesis, suppose:

• the word structure rule gives the evidential strength (0.5) for the hypothesis because the word chain 把手 can be either 把 手 (piece-of hand) or 把手 (leader). Therefore, $CF_1[h, e_1] = W_1 \cdot CF[h, e_1] = 0.05$ and $CF[h, e_1] = MB[h, e_1] - MD[h, e_1] = 0.05$.

• the syntactic rule gives the evidential strength (1) because it definitely is a grammatical sentence. Therefore, $CF_2[h, e_2] = W_2 \cdot CF[h, e_2] = 0.2$ and $CF[h, e_1 \& e_2] = MB[h, e_1 \& e_2] - MD[h, e_1 \& e_2] = 0.24$.

• the semantic rule gives the evidential strength (1) since 把手 (the leader) can have power. Therefore, $CF_3[h, e_3] = W_3 \cdot CF[h, e_3] = 0.3$ and $CF[h, e_1 \& e_2 \& e_3] = MB[h, e_1 \& e_2 \& e_3] - MD[h, e_1 \& e_2 \& e_3] = 0.46$.

• the world knowledge rule gives the evidential strength (0.8) because it is quite true that the leader has less power than that of the first or second leader. Therefore, $CF_4[h, e_4] = W_4 \cdot CF[h, e_4] = 0.32$ and $CF[h, e_1 \& e_2 \& e_3 \& e_4] = MB[h, e_1 \& e_2 \& e_3 \& e_4] - MD[h, e_1 \& e_2 \& e_3 \& e_4] = 0.63$.

The certainty factor CF of the hypothesis [...] is 0.63. Therefore, this segmentation is likely to be a coherent string.
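The accumulation for the first hypothesis can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the author's implementation: the function name `certainty_factor`, the constant `WEIGHTS` and the convention that positive weighted strengths feed MB while negative ones feed MD are choices made here to match the combining functions and the worked numbers. Without rounding between steps the sketch gives about 0.638, whereas the text reports 0.63, apparently because the intermediate value 0.468 is truncated to 0.46.

```python
# Illustrative sketch only; names and structure are assumptions, not the paper's code.
WEIGHTS = (0.1, 0.2, 0.3, 0.4)  # morphology, syntax, semantics, pragmatics


def certainty_factor(strengths, weights=WEIGHTS):
    """Combine weighted evidential strengths into CF = MB - MD."""
    mb = 0.0  # accumulated measure of belief
    md = 0.0  # accumulated measure of disbelief
    for w, s in zip(weights, strengths):
        cf_i = w * s  # weighted certainty factor CF_i = W_i * CF
        if cf_i >= 0:
            mb = mb + cf_i * (1.0 - mb)     # MB[h, e1&e2] update
        else:
            md = md + (-cf_i) * (1.0 - md)  # MD[h, e1&e2] update
    # Boundary cases of the combining functions.
    if md == 1.0:
        mb = 0.0
    elif mb == 1.0:
        md = 0.0
    return mb - md


# First hypothesis: strengths 0.5, 1, 1 and 0.8 from the four rules.
first = certainty_factor((0.5, 1.0, 1.0, 0.8))
print(round(first, 3))
```

Negative strengths accumulate into MD instead of MB, so a hypothesis contradicted by semantic or pragmatic evidence ends up with a negative CF.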
To estimate the evidential strength of the second hypothesis, suppose:

• the word structure rule gives the evidential strength (0.5) for this hypothesis since 把手 can be either 把 手 (piece-of hand) or 把手 (leader). Therefore, $CF_1[h, e_1] = W_1 \cdot CF[h, e_1] = 0.05$ and $CF[h, e_1] = MB[h, e_1] - MD[h, e_1] = 0.05$.

• the syntactic rule gives the evidential strength (1) because it is a grammatical sentence. Therefore, $CF_2[h, e_2] = W_2 \cdot CF[h, e_2] = 0.2$ and $CF[h, e_1 \& e_2] = MB[h, e_1 \& e_2] - MD[h, e_1 \& e_2] = 0.24$.

• the semantic rule gives the negative evidential strength (-1) because the phrase "the hand of a company" violates the semantic constraint. Therefore, $CF_3[h, e_3] = W_3 \cdot CF[h, e_3] = -0.3$ and $CF[h, e_1 \& e_2 \& e_3] = MB[h, e_1 \& e_2 \& e_3] - MD[h, e_1 \& e_2 \& e_3] = -0.06$.

• the world knowledge rule gives a negative evidential strength (-1) because a company does not have a hand as one of its components. Therefore, $CF_4[h, e_4] = -0.4$ and $CF[h, e_1 \& e_2 \& e_3 \& e_4] = -0.34$.

The certainty factor CF of the hypothesis [...] is -0.34.
Therefore, this segmentation is unlikely to be a coherent string.

4 Discussion

The assignment for the weight vector is empirical. It is based on the following analysis, in which 1's represent the truth of each piece of evidence or hypothesis and 0's represent falsehood. Since the segmentation algorithm always produces a segmented string, it is assumed that the evidence from morphology is true in varying degrees depending on the complexity of the word chain. The justification of a hypothesis is based on the evidence presented by the pragmatic, semantic and syntactic aspects shown in the following table.

    case   pragmatic   semantic   syntactic   hypothesis
    (1)    0           0          0           0
    (2)    0           0          1           0
    (3)    0           1          0           0
    (4)    0           1          1           0
    (5)    1           0          0           1
    (6)    1           0          1           1
    (7)    1           1          0           1
    (8)    1           1          1           1

• Case(1) indicates that if no evidence can prove the truth of the hypothesis, then the hypothesis is false.

• Case(2) indicates that if the evidence supports an incoherent grammatical sentence inconsistent with the context/circumstance, then the hypothesis is false, as in the case of [...] (a banana ate a monkey).

• Case(3) indicates that if the evidence supports a meaningful but ungrammatical string inconsistent with the context/circumstance, then the hypothesis is false, i.e. [...] (he wretch) against the real fact that he is a nice guy.

• Case(4) indicates that even if the evidence supports a grammatical meaningful sentence, if it is inconsistent with the context/circumstance then the hypothesis is false, i.e., [...] (the president's forced resignation makes people angry) violates the circumstance that people hate the president.

• Case(5) indicates the case of an idiomatic expression where the string is literally ungrammatical and incoherent, but as a whole it can be interpreted figuratively to make perfect sense. Therefore, we assume that the hypothesis is true, as in the case of 车水马龙, which literally means "car-water-horse-dragon", but figuratively means "very crowded".
• Case(6) indicates the case of a metaphor or metonymy which superficially is an incoherent grammatical string, but which by reasoning with the support of world knowledge can be interpreted as a meaningful string. Then, it is assumed that the hypothesis is true, i.e., 我喝西北风 (I drink North-West wind) means "I have nothing to eat".

• Case(7) indicates that if the evidence supports a meaningful but ungrammatical string consistent with the context/circumstance, then the hypothesis is true, as in [...] (he wretch), which is consistent with the real fact that he is a bad guy.

• Case(8) indicates that if all evidence gives positive support to the hypothesis, then the hypothesis is true.

From the analysis, it seems that pragmatic knowledge provides the strongest evidence for the hypothesis. Therefore, the highest weight is assigned to the pragmatic aspect of the certainty factor. In the absence of pragmatic information a default assumption, that semantic evidence is more important than syntactic evidence, is made. This can be observed in daily life: people communicate through many ungrammatical expressions without having a problem of transferring the message, such as a brief e-mail message (from Dr. Yorick Wilks to the researchers in the Computing Research Laboratory at New Mexico State University): "DRAFT-comments hard copy best-asap to yw pls." It means "To
write the comment for the DRAFT on the hard copy would be the best. Please return it to Yorick Wilks as soon as possible."

The certainty factor CF is used under the premise that all of the evidence is rendered by mutually exclusive observations. Since language is an expression integrating syntactic, semantic and pragmatic information, is the syntactic, semantic and pragmatic evidence mutually exclusive? This is not so clear. All knowledge is culturally dependent, i.e. one particular instance may be acceptable in one culture but not in another. In this research a default assumption is made that the observations from various language aspects are independent. The question is left open for further discussion.

5 References

Buchanan, B. and E. Shortliffe. (1984). Uncertainty and Evidential Support. In B. G. Buchanan and E. H. Shortliffe Ed., Rule-Based Expert Systems: The MYCIN Experiments of the Stanford Heuristic Programming Project, Addison-Wesley Publishing Company, pp. 209-232.

Chang, J. S., et al. (1991). Chinese word segmentation through constraint satisfaction and statistical optimization. Proc. of the 4th R.O.C. Computational Linguistics Conference, pp. 147-165.

Chen, K. J. and S. H. Liu. (1992). Word Identification for Mandarin Chinese Sentences. Proc. of the 5th International Conference on Computational Linguistics, Vol. 1, pp. 101-107.

Chiang, T. H., et al. (1992). Statistical models for segmentation and unknown word resolution. Proc. of the 5th R.O.C. Computational Linguistics Conference, pp. 123-146.

He, K. K., et al. (1991). The Design Principle for a Written Chinese Automatic Segmentation Expert System. Journal of Chinese Information Processing, Vol. 5, No. 2, pp. 1-14.

Huang, X. X. and D. Y. Liu. (1988). The Phenomenon of Word Chain and the Automatic Segmentation in Written Chinese.
Journal of the Development of Knowledge Engineering, pp. 287-291.

Jin, W. and J. Y. Nie. (1993). Segmentation du Chinois -- une Etape Cruciale vers la Traduction Automatique du Chinois. In P. Bouillon and A. Clas Ed., La Traductique, Les presses de l'Universite de Montreal, pp. 349-363.

Jin, W. (1992). A Case Study: Chinese Segmentation and its Disambiguation. MCCS-92-227, Computing Research Laboratory, New Mexico State University.

Liang, N. Y. and Y. B. Zhen. (1991). A Chinese Word Segmentation Model and a Chinese Word Segmentation System PC-CWSS. Proc. of COLIPS, Vol. 1, No. 1, pp. 51-55.

Liu, Y. Q. (1987). Difficulties in Chinese Language Processing and Method to their Solution. Proc. of 1987 International Conference on Chinese Information Processing, Vol. 2, pp. 125-126.

Nie, J. Y. and W. Jin. (1994). A Hybrid Approach to Unknown Word Detection and Segmentation of Chinese. To appear in Proc. of International Conference on Chinese Computing '94 (ICCC94).

Sproat, R. and C. Shih. (1991). A statistical method for finding word boundaries in Chinese text. Computer Processing of Chinese and Oriental Languages, Vol. 4, No. 4, pp. 336-351.

Wang, L. J., et al. (1991). A Parsing Method for Identifying Words in Mandarin Chinese Sentences. Proc. of the 12th International Joint Conference on Artificial Intelligence, Vol. 2, pp. 1018-1023.