Knappe et al.

US006850496B1
(10) Patent No.: US 6,850,496 B1
(45) Date of Patent: Feb. 1, 2005

(54) VIRTUAL CONFERENCE ROOM FOR VOICE CONFERENCING

(75) Inventors: Michael E. Knappe, San Jose, CA (US); Shmuel Shafer, Palo Alto, CA (US)

(73) Assignee: Cisco Technology, Inc., San Jose, CA (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 759 days.

(21) Appl. No.: 09/591,891
(22) Filed: Jun. 9, 2000

(51) Int. Cl.: H04M 3/56; H04L 12/18
(52) U.S. Cl.: 370/260; 370/266; 379/202.01; 709/204
(58) Field of Search: 370/260, 266; 379/202.01, 203.01, 204.01, 205.01, 206.01; 709/204, 205

(56) References Cited

U.S. PATENT DOCUMENTS

4,506,358 A     3/1985  Montgomery .............. 370/60
4,734,934 A     3/1988  Boggs et al. ............ 379/202
5,027,689 A     7/1991  Fujimori ................ 84/622
5,212,733 A     5/1993  DeVitt et al. ........... 381/119
5,440,639 A     8/1995  Suzuki et al. ........... 381/17
5,521,981 A     5/1996  Gehring ................. 381/26
5,734,724 A     3/1998  Kinoshita et al. ........ 381/17
6,011,851 A     1/2000  Connor et al. ........... 381/17
6,125,115 A     9/2000  Smits ................... 370/389
6,327,567 B1 * 12/2001  Willehadson et al. ...... 704/270
6,408,327 B1 *  6/2002  McClennon et al. ........ 709/204
6,559,863 B1 *  5/2003  Megiddo ................. 345/753

* cited by examiner

Primary Examiner: Min Jung
(74) Attorney, Agent, or Firm: Marger Johnson & McCollom, PC

(57) ABSTRACT

A system and method are disclosed for packet voice conferencing. The system and method divide a conferencing presentation sound field into sectors, and allocate one or more sectors to each conferencing endpoint. At some point between capture and playout, the voice data from each endpoint is mapped into its designated sector or sectors. Thereafter, when the voice data from a plurality of participants from multiple endpoints is combined, a listener can identify a unique apparent location within the presentation sound field for each participant. The system allows a conference participant to increase their comprehension when multiple participants speak simultaneously, and alleviates confusion as to who is speaking at any given time.

44 Claims, 10 Drawing Sheets

[Sheet 1 of 10: FIG. 1 and FIG. 2]
[Sheet 2 of 10: FIG. 3 and FIG. 4]
[Sheet 3 of 10: FIG. 5 and FIG. 6]
[Sheet 4 of 10: FIG. 7A and FIG. 7B]
[Sheet 5 of 10: FIG. 8 and FIG. 9]
[Sheet 6 of 10: FIG. 10A and FIG. 10B]
[Sheet 7 of 10: FIG. 11 and FIG. 12]
[Sheet 8 of 10: FIG. 13A]
[Sheet 9 of 10: FIG. 13B]
[Sheet 10 of 10: FIG. 14 and FIG. 15]

VIRTUAL CONFERENCE ROOM FOR VOICE CONFERENCING

FIELD OF THE INVENTION

The present invention relates generally to voice conferencing, and more particularly to systems and methods for use with packet voice conferencing to create a perception of spatial separation between conference callers.

BACKGROUND OF THE INVENTION

A conference call is a call between three or more callers/called parties, where each party can hear each of the other parties (the number of conferencing parties is often limited, and in some systems, the number of simultaneous talkers may also be limited). Conferencing capabilities exist in the PSTN (Public Switched Telephone Network), where remote callers' voices are mixed, e.g., at the central office, and then sent to a conference participant over their single line. Similar capabilities can be found as well in many PBX (Private Branch Exchange) systems.

Packet-switched networks can also carry real-time voice data, and therefore, with proper configuration, conference calls. Voice over IP (VoIP) is the common term used to refer to voice calls that, over at least part of a connection between two endpoints, use a packet-switched network for transport of voice data. VoIP can be used as merely an intermediate transport medium for a conventional phone, where the phone is connected through the PSTN or a PBX to a packet voice gateway. But other types of phones can communicate directly with a packet network. IP (Internet Protocol) phones are phones that may look and act like conventional phones, but connect directly to a packet network. Soft phones are similar to IP phones in function, but are software-implemented phones, e.g., on a desktop computer.

Since VoIP does not use a dedicated circuit for each caller, and therefore does not require mixing at a common circuit-switch point, conferencing implementations are somewhat different than with circuit-switched conferencing. In one implementation, each participant broadcasts their voice packet stream to each other participant; at the receiving end, the VoIP client must be able to add the separate broadcast streams together to create a single audio output. In another implementation, each participant addresses their voice packets to a central MCU (Multipoint Conferencing Unit). The MCU combines the streams and sends a single combined stream to each conference participant.

SUMMARY OF THE INVENTION

Human hearing relies on a number of cues to increase voice comprehension. In any purely audio conference, several important cues are lost, including lip movement and facial and hand gestures. But where several conference participants' voice streams are lumped into a common output channel, further degradation in intelligibility may result because the sound localization capability of the human binaural hearing system is underutilized. In contrast, when two people's voices can be perceived as arriving from distinctly different directions, binaural hearing allows a listener to more easily recognize who is talking, and in many instances, focus on what one person is saying even though two or more people are talking simultaneously.

The present invention takes advantage of the signal processing capability present at either a central MCU or at a conferencing endpoint to add directional cues to the voices present in a conference call. In several of the disclosed embodiments, a conferencing endpoint is equipped with a stereo (or other multichannel) audio presentation capability. The packet data streams arriving from other participants' locations are decoded if necessary. Each stream is then mapped into different, preferably non-overlapping, arrival directions or sound field sectors by manipulating, e.g., the separation, phase, delay, and/or audio level of the stream for each presentation channel. The mapped streams are then mixed to form the stereo (or multichannel) presentation channels.

A further aspect of the invention is the capability to control the perceived arrival direction of each participant's voice. This may be done automatically, i.e., a controller can partition the available presentation sound field to provide a sector of the sound field for each participant, including changing the partitioning as participants enter or leave the conference. In an alternate embodiment, a Graphical User Interface (GUI) is presented to the user, who can position participants according to their particular taste, assign names to each, etc. The GUI can even be combined with Voice Activity Detection (VAD) to provide a visual cue as to who is speaking at any given time.

In accordance with the preceding concepts and one aspect of the invention, methods for manipulating multiple packet voice streams to create a perception of spatial separation between conference participants are disclosed. Each packet voice stream may represent monaural audio data, stereo audio data, or a larger number of capture channels. Each packet voice stream is mapped onto the presentation channels in a manner that allocates a particular sound field sector to that stream, and then the mapped streams are combined for presentation to the conferencer as a combined sound field.

In one embodiment, the methods described above are implemented in software. In other words, one intended embodiment of the invention is an apparatus comprising a computer-readable medium containing computer instructions that, when executed, cause a processor or multiple communicating processors to perform a method for manipulating multiple packet voice streams to create a perception of spatial separation between conference participants.

In a second aspect of the invention, a conferencing sound localization system is disclosed. The system includes means for manipulating a capture or transmit sound field into a sector of a presentation sound field, and means for specifying different presentation sound field sectors for different capture or transmit sound fields. The sound localization system can be located at a conferencing endpoint, or embodied in a central MCU.

BRIEF DESCRIPTION OF THE DRAWING

The invention may be best understood by reading the disclosure with reference to the drawing, wherein:

FIG. 1 illustrates a packet-switched stereo telephony system;

FIG. 2 illustrates a packet-switched stereo telephony system in use for a conference call;

FIG. 3 illustrates a packet-switched stereo telephony system in use for a conference call according to an embodiment of the invention;

FIG. 4 correlates different parts of a packet-switched stereo telephony transmission path with channel terminology;

FIG. 5 shows packet data virtual channels that exist in a three-way conference with mixing provided at the endpoints;

FIG. 6 contains a high-level block diagram for endpoint signal processing according to an embodiment of the invention;

FIGS. 7A and 7B illustrate a GUI display useful with the invention;

FIG. 8 illustrates a packet-switched stereo telephony system in use for a conference call according to an embodiment of the invention;

FIG. 9 contains a high-level block diagram for endpoint signal processing for an endpoint utilizing a direction finder to map speakers from a common endpoint to different locations in a presentation sound field;

FIGS. 10A and 10B illustrate further aspects of a GUI display useful with the invention;

FIG. 11 correlates different parts of a central-MCU packet-switched stereo telephony transmission path with channel terminology;

FIG. 12 shows packet data virtual channels that exist in a three-way conference with mixing provided at a central MCU according to an embodiment of the invention;

FIGS. 13A and 13B contain a high-level block diagram for central MCU signal processing according to an embodiment of the invention;

FIG. 14 shows packet data channels existing in a three-way conference with mixing provided at a central MCU, but with each endpoint able to specify source locations in its presentation sound field, according to an embodiment of the invention; and

FIG. 15 contains a high-level block diagram for a conferencing endpoint that provides presentation sound field mapping for voice data at its point of origination.

DETAILED DESCRIPTION

As an introduction to the embodiments, a brief introduction to some underlying technology and related terminology is useful. Referring to FIG. 1, one-half of a two-way stereo conference between two endpoints (the half allowing A to hear B1, B2, and B3) is depicted. A similar reverse path (not shown) allows A's voice to be heard by B1, B2, and B3.

The elements shown in FIG. 1 include: two microphones 20L, 20R connected to an encoder 24 via capture channels 22L, 22R; two speakers 26L, 26R connected to a decoder 30 via presentation channels 28L, 28R; and a packet data network 32 over which encoder 24 and decoder 30 communicate.

Microphones 20L and 20R simultaneously capture the sound field produced at two spatially separated locations when B1, B2, or B3 talk, translate the captured sound field to electrical signals, and transmit those signals over left and right capture channels 22L and 22R. Capture channels 22L and 22R carry the signals to encoder 24.

Encoder 24 and decoder 30 work as a pair. Usually at call setup, the endpoints establish how they will communicate with each other using control packets. As part of this setup, encoder 24 and decoder 30 negotiate a codec (compressor/decompressor) algorithm that will be used to transmit capture channel data from encoder 24 to decoder 30. The codec may use a technique as simple as Pulse-Code Modulation (PCM), or a very complex technique, e.g., one that uses subband coding and/or predictive coding to decrease bandwidth requirements. Voice Activity Detection (VAD) may be used to further reduce bandwidth. Many codecs have been standardized and are well known to those skilled in the art, and the particular codec selected is not critical to the operation of the invention. For stereo or other multichannel data, various techniques may be used to exploit channel correlation as well.

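The codec negotiation described above amounts to each endpoint offering the codecs it supports and the pair settling on one that both sides can use. Below is a minimal sketch of that selection step; the codec names in the example and the preference-list format are illustrative assumptions, and a real system would carry this exchange in its call-setup control packets rather than a simple function call.

    def negotiate_codec(local_prefs, remote_offer):
        """Return the first codec in our preference order that the remote side also offers."""
        remote = set(remote_offer)
        for codec in local_prefs:
            if codec in remote:
                return codec
        raise ValueError("no common codec")

    # Example: the endpoints fall back to plain PCM when nothing better is shared.
    print(negotiate_codec(["wideband-stereo", "G.711 PCM"], ["G.711 PCM", "G.729"]))  # -> G.711 PCM
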
Encoder 24 gathers capture channel samples for a selected time block (e.g., 10 ms), compresses the samples using the negotiated codec, and places them in a packet along with header information. The header information typically includes fields identifying source and destination, timestamps, and may include other fields. A protocol such as RTP (Real-time Transport Protocol) is appropriate for transport of the packet. The packet is encapsulated with lower-layer headers, such as an IP (Internet Protocol) header and a link-layer header appropriate for the encoder's link to packet data network 32. The packet is then submitted to the packet data network. This encoding process is then repeated for the next time block, and so on.

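As a rough sketch of the gather, compress, and packetize loop just described, the following frames 10 ms blocks of samples and prepends a simplified header carrying a sequence number, timestamp, and source identifier. The header layout is deliberately abbreviated and assumed for illustration; a real implementation would emit a full RTP header and let the IP stack add the lower-layer headers.

    import struct

    SAMPLE_RATE = 8000          # assumed narrowband capture rate
    FRAME_MS = 10               # time block mentioned in the text
    SAMPLES_PER_FRAME = SAMPLE_RATE * FRAME_MS // 1000

    def compress(samples):
        # Placeholder for the negotiated codec (PCM pass-through here).
        return struct.pack(f"<{len(samples)}h", *samples)

    def packetize(stream, ssrc, seq, timestamp):
        """Yield one packet (simplified header + payload) per 10 ms block of samples."""
        for start in range(0, len(stream), SAMPLES_PER_FRAME):
            block = stream[start:start + SAMPLES_PER_FRAME]
            payload = compress(block)
            # Simplified header: sequence number, media timestamp, source identifier.
            header = struct.pack("!HII", seq & 0xFFFF, timestamp, ssrc)
            yield header + payload
            seq += 1
            timestamp += len(block)

    # Example: 30 ms of silence becomes three packets.
    packets = list(packetize([0] * (3 * SAMPLES_PER_FRAME), ssrc=0x1234, seq=0, timestamp=0))
    print(len(packets))  # -> 3
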
Packet data network 32 uses the destination addressing in each packet's headers to route that packet to decoder 30. Depending on a variety of network factors, some packets may be dropped before reaching decoder 30, and each packet can experience a somewhat random network transit delay, which in some cases can cause packets to arrive at their destination in a different order than that in which they were sent.

Decoder 30 receives the packets, strips the packet headers, and re-orders the packets according to timestamp. If a packet arrives too late for its designated playout time, however, the packet will simply be dropped by the decoder. Otherwise, the re-ordered packets are decompressed and amplified to create two presentation channels 28L and 28R. Channels 28L and 28R drive acoustic speakers 26L and 26R.

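The timestamp-based reordering and late-packet discard described for decoder 30 can be sketched as follows; the packet fields and the playout-deadline function are assumptions made only for illustration.

    def playout_order(received, playout_deadline):
        """Return packets ordered by media timestamp, dropping any that arrived too late.

        `received` is a list of dicts like {"ts": media_timestamp, "arrival": seconds};
        `playout_deadline` maps a media timestamp to its scheduled playout time in seconds.
        """
        on_time = [p for p in received if p["arrival"] <= playout_deadline(p["ts"])]
        return sorted(on_time, key=lambda p: p["ts"])

    # Example: the packet with ts=160 arrived after its deadline and is dropped.
    pkts = [{"ts": 320, "arrival": 0.05}, {"ts": 0, "arrival": 0.01}, {"ts": 160, "arrival": 0.30}]
    print([p["ts"] for p in playout_order(pkts, lambda ts: 0.1 + ts / 8000)])  # -> [0, 320]
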
Ideally, the whole process described above occurs in a relatively short period of time, e.g., 250 ms or less from the time B1 speaks until the time A hears B1's voice. Longer delays cause noticeable voice quality degradation, but can be tolerated to a point.

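As a back-of-the-envelope check on that figure, the stage delays sketched so far can simply be added up; the numbers below are illustrative assumptions, not values taken from the patent.

    # Illustrative one-way delay budget, in milliseconds (all values assumed).
    budget = {
        "capture framing (one 10 ms block)": 10,
        "encode and packetize": 10,
        "network transit": 80,
        "receive jitter buffering": 60,
        "decode and playout": 20,
    }
    print(sum(budget.values()), "ms one way")  # -> 180 ms, within a 250 ms target
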
A's binaural hearing capability allows A to localize each speaker's voice in a distinct location within their listening environment. If the delay and amplitude differences between the sound field at microphone 20L and at microphone 20R can be faithfully transmitted and then reproduced by speakers 26L and 26R, B1's voice will appear to A to originate at roughly the dashed location shown for B1. Likewise, B2's voice and B3's voice will appear to A to originate, respectively, at the dashed locations shown for B2 and B3.

Now consider the three-way conference of FIG. 2. A third endpoint, endpoint C, with two additional conference participants C1 and C2, has been added. Endpoint C uses an encoder 32, capture channels 34L and 34R, and microphones 36L and 36R in much the same way as described for the corresponding components of endpoint B.

Decoder/mixer 38 differs from decoder 30 of FIG. 1 in several significant respects. First, decoder/mixer 38 must be capable of receiving, processing, and decoding two packet voice data streams simultaneously. Second, decoder/mixer 38 must add the left decoded signals from endpoints B and C together in order to create presentation channel 28L, and must do likewise with the right decoded signals to create presentation channel 28R.

FIG. 2 illustrates the perception problem that A now faces in the three-way conference. The perceived locations of B1 and C1 overlap, as do the perceived locations of B2 and C2. A can no longer identify from directional cues alone who is speaking, and cannot use binaural hearing to sort out two simultaneous speakers' voices that appear to be originating at the same general location. Of course, with a monaural three-way conference, a similar problem exists, as all speakers from all endpoints would appear to be speaking from the same central location.

FIG. 3 illustrates the operation of one embodiment of the invention for the conferencing configuration of FIG. 2. To illustrate a further aspect of the invention, a fourth endpoint D, with a corresponding encoder 40, capture channel 42, and microphone 44, has been added. Endpoint D has only monaural capture capability, as opposed to the stereo capture capability of endpoints B and C. Decoder/mixer 38 of FIG. 2 has been replaced with a packet voice conferencing system 46 according to an embodiment of the invention. All other conferencing components of FIG. 2 have been carried over into FIG. 3.

Whereas, in the preceding illustrations, the decoder or decoder/mixer attempted to recreate at endpoint A the capture sound field(s), that is no longer the case in FIG. 3. The presentation sound field has been divided into three sectors 48, 50, 52. Voice data from endpoint B has been mapped to sector 48, voice data from endpoint C has been mapped to sector 50, and voice data from endpoint D has been mapped to sector 52 by system 46. Thus endpoint B's capture sound field has been recreated "compressed" and shifted over to A's left, endpoint C's capture sound field has been compressed and appears roughly right of center, and endpoint D's monaural channel has been converted to stereo and shifted to the far right of A's perceived sound field. Although the conference participants' voices are not recreated according to their respective capture sound fields, the result is a perceived separation between each speaker. As stated earlier, such a mapping can have beneficial effects in terms of A's recognition of who is speaking and in focusing on one voice if several persons speak simultaneously.

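One simple way to realize the sector assignment illustrated in FIG. 3 is to divide the presentation sound field's azimuth range evenly among the endpoints, as in the sketch below. The 180-degree field span, the endpoint names, and the equal-width sectors are assumptions for illustration; the disclosure also describes automatic repartitioning as participants come and go, and user-chosen placement through a GUI.

    def assign_sectors(endpoints, field_span=(-90.0, 90.0)):
        """Divide the presentation sound field into equal azimuth sectors, one per endpoint.

        Returns {endpoint: (left_edge, right_edge, center)} in degrees.
        """
        left, right = field_span
        width = (right - left) / len(endpoints)
        sectors = {}
        for i, name in enumerate(endpoints):
            lo = left + i * width
            hi = lo + width
            sectors[name] = (lo, hi, (lo + hi) / 2.0)
        return sectors

    # Example: B occupies the listener's left third, C the center third, D the right third.
    print(assign_sectors(["B", "C", "D"]))
    # -> {'B': (-90.0, -30.0, -60.0), 'C': (-30.0, 30.0, 0.0), 'D': (30.0, 90.0, 60.0)}
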
Turning briefly to FIG. 4, the meaning of several terms as they apply in this description is explained. A capture sound field is the sound field presented to a microphone. A presentation sound field is the sound field presented to a listener. A capture channel is a signal channel that delivers a representation of a capture sound field to an encoding device; this may be anything from a simple wire pair, to a wireless link, to a telephone and PBX or PSTN facilities used to deliver a telephone signal to a remote voice network gateway. A transmit channel is a packet-switched virtual channel, or possibly a Time-Division-Multiplexed (TDM) channel, between an encoder and a mixer; sections of such a channel may be fixed, e.g., a modem connection, but in general each packet will share a physical link with other packet traffic. And although separate transmit channels may be used for each capture channel originating at a given endpoint, in general a common transmit channel for all capture channels is preferred. A presentation channel is a signal channel that exists between a mixer and a device (e.g., an acoustic speaker) used to create a presentation sound field; this may include wiring, wireless links, amplifiers, filters, D/A or other format converters, etc. As will be explained later, part of the presentation channel may also exist on the packet data network when the mixer and acoustic speakers are not co-located.

In the following description, most examples make reference to a three-way conference between three endpoints. Each endpoint can have more than one speaking participant. Furthermore, those skilled in the art recognize that the concepts discussed can be readily extended to larger conferences with many more than three endpoints, and the scope of the invention extends to cover larger conferences. On the other end of the endpoint spectrum, some embodiments of the invention are useful with as few as two conferencing endpoints, with one endpoint having two or more speakers with different capture channel arrival angles.

FIG. 5 illustrates, for a three-endpoint conference, one channel configuration that can be used with the invention. Endpoint A multicasts a packet voice data stream over virtual channel 60. Somewhere within packet data network 32, a switch or router (not shown) splits the stream, sending the same packet data over virtual channel 62, to endpoint C, and over virtual channel 64, to endpoint B. If this multicast capability is unsupported, endpoint A can broadcast two unicast packet voice data streams, one to each other endpoint.

Endpoint A also receives two packet voice data streams, one over virtual channel 68 from endpoint B, and one over virtual channel 74 from endpoint C. In general, each endpoint receives N-1 packet voice data streams, and transmits either one voice data stream, if multicast is supported, or N-1 unicast data streams otherwise. Accordingly, this channel configuration is better suited to smaller conferences (e.g., three or four endpoints) than it is to larger conferences, particularly where bandwidth at one or more endpoints is an issue.

FIG. 6 illustrates a high-level block diagram for one embodiment of a packet voice conferencing system 46. Network interface 80 provides connectivity between a packet-switched network and the remainder of system 46. Controller 88 sends and receives control packets to/from remote endpoints using network interface 80. Incoming voice data packets are forwarded by network interface 80 to packet switch 82. Although not illustrated in this embodiment, the system will typically also contain an encoder for outgoing conference voice traffic. The encoder will submit outgoing voice data packets to network interface 80 for transmission. Network interface 80 can comprise the entire protocol stack and physical layer hardware, an application driver that receives RTP and control packets, or something in between.

Packet switch 82 distributes voice data packets to the appropriate decoder. In FIG. 6, it is assumed that two remote endpoints are broadcasting voice data streams to system 46, and so two decoders 84 and 86 are employed, one per stream. Packet switch 82 distributes voice packets from one remote endpoint to decoder 84, and distributes voice packets from the other remote endpoint to decoder 86 (when more endpoints are joined in the conference, the number of decoders, jitter buffers, and channel mappers is increased accordingly). Packet switch 82 identifies packets belonging to a given voice data stream by examining header fields that uniquely identify the voice stream; for an RTP/UDP (User Datagram Protocol)/IP packet, these fields can be, e.g., one or more of the source IP address, source UDP port, and RTP SSRC (synchronization source) identifier. Controller 88 is responsible for providing packet switch 82 with the field values for a given voice stream, and with an association of those field values with a decoder.

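A sketch of the stream classification packet switch 82 performs is shown below, keyed on a (source IP address, source UDP port, RTP SSRC) tuple. The class and field names are assumptions; the point is simply that the controller registers the identifying field values together with their decoder association, and the switch dispatches on them.

    class PacketSwitch:
        """Route incoming voice packets to the decoder registered for their stream."""

        def __init__(self):
            self.routes = {}   # (src_ip, src_port, ssrc) -> decoder

        def register(self, src_ip, src_port, ssrc, decoder):
            # The controller supplies these field values and the decoder association.
            self.routes[(src_ip, src_port, ssrc)] = decoder

        def dispatch(self, packet):
            key = (packet["src_ip"], packet["src_port"], packet["ssrc"])
            decoder = self.routes.get(key)
            if decoder is not None:
                decoder.append(packet["payload"])   # hand off to the per-stream decoder

    # Example: two remote endpoints, one decoder queue each.
    dec_b, dec_c = [], []
    sw = PacketSwitch()
    sw.register("10.0.0.2", 5004, 0xB0B, dec_b)
    sw.register("10.0.0.3", 5004, 0xC0C, dec_c)
    sw.dispatch({"src_ip": "10.0.0.3", "src_port": 5004, "ssrc": 0xC0C, "payload": b"..."})
    print(len(dec_b), len(dec_c))   # -> 0 1
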
Decoders 84 and 86 can use any suitable codec upon which the system and the respective encoding endpoint successfully agree. Each codec may be renegotiated during a conference, e.g., if more participants place a bandwidth or processing strain on conference resources. And the same codec need not be run by each decoder; indeed, in FIG. 6, decoder 84 is shown decoding a stereo voice data stream, while decoder 86 is shown decoding a monaural voice data stream. Controller 88 performs the actual codec negotiation with remote endpoints. In response to this negotiation, controller 88 activates, initializes, and reinitializes (when and if necessary) each decoder as needed for the conference. In most implementations, each decoder will be a process or thread running on a digital signal processor or general-purpose processor, but many codecs can also be implemented in hardware. The maximum number of streams that can be concurrently decoded in such an implementation will generally be limited by real-time processing power and available memory.

Jitter buffers 90, 92, and 94 receive the voice data streams output by decoders 84 and 86. The purpose of the jitter buffers is to provide for smooth audio playout, i.e., to account for the normal fluctuations in voice data sample arrival rate from the decoders (both due to network delays and to the fact that many samples arrive in each packet). Each jitter buffer ideally attempts to insert as little delay in the transmission path as possible, while ensuring that audio playout is rarely, if ever, starved for samples. Those skilled in the art recognize that various methods of jitter buffer management are well known, and the selection of a particular method is left as a design choice. In the embodiment shown in FIG. 6, controller 88 controls jitter buffer synchronization by manipulating the relative delays of the buffers.

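For illustration, a minimal fixed-depth jitter buffer in the spirit of the above might look like the following. The target depth and silence-fill behavior are assumptions; as the text notes, real buffer management (including adapting the depth to observed jitter) is a design choice.

    from collections import deque

    class JitterBuffer:
        """Buffer decoded sample blocks and release them at a steady playout rate."""

        def __init__(self, target_depth=3):
            self.target_depth = target_depth   # blocks held before playout starts
            self.queue = deque()
            self.started = False

        def push(self, block):
            self.queue.append(block)

        def pull(self, block_size):
            # Delay playout until the buffer has filled to its target depth once.
            if not self.started:
                if len(self.queue) < self.target_depth:
                    return [0] * block_size          # play silence while filling
                self.started = True
            if self.queue:
                return self.queue.popleft()
            return [0] * block_size                  # starved: conceal with silence

    # Example: playout holds off until the target depth is reached.
    jb = JitterBuffer()
    jb.push([1] * 80)
    print(jb.pull(80)[0])    # only 1 block queued, below target depth -> silence (0)
    jb.push([2] * 80)
    jb.push([3] * 80)
    print(jb.pull(80)[0])    # depth reached, playout starts with the oldest block -> 1
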
Channel mappers 96 and 98 each manipulate their respective input voice data channels to form a set of presentation mixing channels. Controller 88 manages each channel mapper by providing mapping instructions, e.g., the number of input voice data channels, the number of output presentation mixing channels, and the presentation sound field sector that should be occupied by the presentation mixing channels. This last instruction can be replaced by more specific instructions, e.g., delay the left channel 2 ms, mix 50% of the left channel into the right channel, etc., to accomplish the mapping. In the former case, the channel mapper itself contains the ability to calculate a mapping to a desired sound field; in the latter case, these computations reside in the controller, and the channel mapper performs basic signal processing functions such as channel delaying, mixing, phase shifting, etc., as instructed.

A number of techniques are available for sound field mapping. From studies of human hearing capabilities, it is known that directional cues are obtained via several different mechanisms. The pinna, or outer projecting portion of the ear, reflects sound into the ear in a manner that provides some directional cues, and serves as a primary mechanism for locating the inclination angle of a sound source. The primary left-right directional cue is ITD (interaural time delay) for mid-low to mid frequencies (generally several hundred Hz up to about 1.5 to 2 kHz). For higher frequencies, the primary left-right directional cue is ILD (interaural level differences). For extremely low frequencies, sound localization is generally poor.

ITD sound localization relies on the difference in time that it takes for an off-center sound to propagate to the far ear as opposed to the nearer ear; the brain uses the phase difference between left and right arrival times to infer the location of the sound source. For a sound source located along the symmetrical plane of the head, no inter-ear phase difference exists; the phase difference increases as the sound source moves left or right of center, reaching a maximum when the sound source reaches the extreme left or right of the head. Once the ITD that causes the sound to appear at the extreme left or right is reached, further delay may be perceived as an echo.

In contrast, ILD is based on inter-ear differences in the perceived sound level; e.g., the brain assumes that a sound that seems louder in the left ear originated on the left side of the head. For higher frequencies (where ITD sound localization becomes difficult), humans rely chiefly on ILD to infer source location.

Channel mappers 96 and 98 can position the apparent location of their assigned conferencing endpoints within the presentation sound field by manipulating ITD and/or ILD for their assigned voice data channels. If a conferencing endpoint is broadcasting monaurally and the presentation system uses stereo, a preliminary step can be to split the single channel by forming identical left and right channels. Or, the single channel can be directly mapped to two channels with appropriate ITD/ILD effects introduced in each channel. Likewise, an ITD/ILD mapping matrix can be used to translate a monophonic or stereophonic voice data channel to, e.g., a traditional two-speaker, 3-speaker (left, right, center), or 5.1 (left-rear, left, center, right, right-rear, subwoofer) format.

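A crude realization of that mono-to-stereo mapping is sketched below: the single capture channel is copied to both presentation channels, with the far-side copy delayed (an ITD cue) and attenuated (an ILD cue). The specific delay and gain values are assumptions for illustration only.

    def pan_mono(samples, itd_samples, near_gain=1.0, far_gain=0.6, pan_right=True):
        """Map a monaural block to stereo with crude ITD and ILD cues.

        itd_samples: interaural delay applied to the far-side channel, in samples
                     (a delay on the order of 0.6 ms, about 5 samples at 8 kHz,
                     approaches a hard left/right image).
        """
        near = [near_gain * s for s in samples]
        far = [0.0] * itd_samples + [far_gain * s for s in samples[:len(samples) - itd_samples]]
        return (far, near) if pan_right else (near, far)   # (left, right)

    # Example: push endpoint D's monaural voice toward the listener's right.
    left, right = pan_mono([1.0, 1.0, 1.0, 1.0, 1.0, 1.0], itd_samples=3)
    print(left)    # delayed and quieter: [0.0, 0.0, 0.0, 0.6, 0.6, 0.6]
    print(right)   # on time, full level:  [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
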
Depending on the processing power available for use by the channel mappers, as well as the desired fidelity, various effects ranging from computationally simple to computationally intensive can be used. For instance, one simple ITD app