Knappe et al.

US006850496B1

(10) Patent No.: US 6,850,496 B1
(45) Date of Patent: Feb. 1, 2005

(54) VIRTUAL CONFERENCE ROOM FOR VOICE CONFERENCING

(75) Inventors: Michael E. Knappe, San Jose, CA; S. Shmuel Shafer, Palo Alto, CA

(73) Assignee: Cisco Technology, Inc., San Jose, CA (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 759 days.

(21) Appl. No.: 09/591,891
(22) Filed: Jun. 9, 2000
(51) Int. Cl.: H04M 3/56; H04L 12/18
(52) U.S. Cl.: 370/260; 370/266; 379/202.01; 709/204
(58) Field of Search: 370/260, 266; 379/202.01, 203.01, 204.01, 205.01, 206.01; 709/204, 205
`
(56) References Cited

U.S. PATENT DOCUMENTS

4,506,358 A      3/1985   Montgomery ............ 370/60
4,734,934 A      3/1988   Boggs et al. .......... 379/202
5,027,689 A      7/1991   Fujimori .............. 84/622
5,212,733 A      5/1993   DeVitt et al. ......... 381/119
[entry illegible in source]
5,440,639 A      8/1995   Suzuki et al. ......... 381/17
5,521,981 A      5/1996   Gehring ............... 381/26
5,734,724 A      3/1998   Kinoshita et al. ...... 381/17
[entry illegible in source]
6,011,851 A      1/2000   Connor et al. ......... 381/17
6,125,115 A      9/2000   Smits ................. 370/389
6,327,567 B1 *   12/2001  Willehadson et al. .... 704/270
6,408,327 B1 *   6/2002   McClennon et al. ...... 709/204
6,559,863 B1 *   5/2003   Megiddo ............... 345/753

* cited by examiner

Primary Examiner: Min Jung
(74) Attorney, Agent, or Firm: Marger Johnson & McCollom, PC
(57) ABSTRACT

A system and method are disclosed for packet voice conferencing. The system and method divide a conferencing presentation sound field into sectors, and allocate one or more sectors to each conferencing endpoint. At some point between capture and playout, the voice data from each endpoint is mapped into its designated sector or sectors. Thereafter, when the voice data from a plurality of participants from multiple endpoints is combined, a listener can identify a unique apparent location within the presentation sound field for each participant. The system allows a conference participant to increase their comprehension when multiple participants speak simultaneously, as well as alleviate confusion as to who is speaking at any given time.

44 Claims, 10 Drawing Sheets
`
[Representative drawing: Endpoint A and Endpoint B connected through Data Network 32. Legible reference numeral: 44.]
`
`CSCO-1027
`CISCO SYSTEMS, INC. / Page 1 of 22
`
`
`
U.S. Patent    Feb. 1, 2005    Sheet 1 of 10    US 6,850,496 B1

[Fig. 1: Endpoint B and Endpoint A connected through Packet Data Network 32. Legible reference numerals: 20L, 20R, 22, 26R.]

[Fig. 2: Endpoint B and Endpoint A connected through Packet Data Network 32. Legible reference numerals: 28, 36R.]
`
`
`
Sheet 2 of 10

[Fig. 3: Endpoints A, B, C, and D connected through Data Network 32. Legible labels: Encoder, 34R, 36L, 36R, 42, 44.]

[Fig. 4: Capture Channels, Transmit Channels, and Presentation Channels along a path through Packet Data Network 32.]
`
`
`
Sheet 3 of 10

[Fig. 5: Endpoint (other labels not recovered).]

[Fig. 6: From Packet Switched Network; Network Interface 80; Packet Switch 82; Controller 88; Decoder 84; Jitter Buffer 92; Channel (label incomplete). Other legible reference numerals: 46, 94, 104L.]
`
`
`
Sheet 4 of 10

[Figs. 7A and 7B: GUI display with menu items File, Sound Field, Endpoints, Participants, and Help. Fig. 7A lists Endpoint B, Endpoint C, and Endpoint D. Fig. 7B shows reference numerals 110 and 120.]
`
`
`
Sheet 5 of 10

[Fig. 8: Endpoints A, C, and D connected through Data Network 32. Legible labels: Encoder, 34R, 36L, 36R, 42, 44.]

[Second drawing on this sheet, figure label not recovered: Direction Finder 106.]
`
`
`
Sheet 6 of 10

[Figs. 10A and 10B: GUI display with menu items File, Sound Field, Endpoints, Participants, and Help. Fig. 10B shows reference numerals 110 and 120.]
`
`
`
Sheet 7 of 10

[Drawings on this sheet, figure labels not recovered. Legible labels: Capture Channels, Transmit Channels, Packet Encoded Presentation Channel, Presentation Channels, Endpoint B.]
`
`
`
Sheet 8 of 10

[Fig. 13A: Packet-Switched Network; Network Interface 80; Channel Mapper (96, 98); Mixer. Other legible reference numerals: 104L, 160R, 164R.]
`
`
`
Sheet 9 of 10

[Fig. 13B: Encoders 166, 168, and 170; Packet Multiplexer 178. Other legible reference numerals: 142, 162L, 162R, 164L, 164R, 104L, 104R. A key shows Fig. 13A placed to the left of Fig. 13B.]
`
`
`
Sheet 10 of 10

[Fig. 15: Controller 182; Network Interface 190; To Packet-Switched Network. Also legible on this sheet: Endpoint B.]
`
`
`
VIRTUAL CONFERENCE ROOM FOR VOICE CONFERENCING

FIELD OF THE INVENTION

The present invention relates generally to voice conferencing, and more particularly to systems and methods for use with packet voice conferencing to create a perception of spatial separation between conference callers.

BACKGROUND OF THE INVENTION

A conference call is a call between three or more callers/called parties, where each party can hear each of the other parties (the number of conferencing parties is often limited, and in some systems, the number of simultaneous talkers may also be limited). Conferencing capabilities exist in the PSTN (Public Switched Telephone Network), where remote callers' voices are mixed, e.g., at the central office, and then sent to a conference participant over their single line. Similar capabilities can be found as well in many PBX (Private Branch Exchange) systems.

Packet-switched networks can also carry real-time voice data, and therefore, with proper configuration, conference calls. Voice over IP (VoIP) is the common term used to refer to voice calls that, over at least part of a connection between two endpoints, use a packet-switched network for transport of voice data. VoIP can be used as merely an intermediate transport medium for a conventional phone, where the phone is connected through the PSTN or a PBX to a packet voice gateway. But other types of phones can communicate directly with a packet network. IP (Internet Protocol) phones are phones that may look and act like conventional phones, but connect directly to a packet network. Soft phones are similar to IP phones in function, but are software-implemented phones, e.g., on a desktop computer.

Since VoIP does not use a dedicated circuit for each caller, and therefore does not require mixing at a common circuit switch point, conferencing implementations are somewhat different than with circuit-switched conferencing. In one implementation, each participant broadcasts their voice packet stream to each other participant; at the receiving end, the VoIP client must be able to add the separate broadcast streams together to create a single audio output. In another implementation, each participant addresses their voice packets to a central MCU (Multipoint Conferencing Unit). The MCU combines the streams and sends a single combined stream to each conference participant.

SUMMARY OF THE INVENTION

Human hearing relies on a number of cues to increase voice comprehension. In any purely audio conference, several important cues are lost, including lip movement and facial and hand gestures. But where several conference participants' voice streams are lumped into a common output channel, further degradation in intelligibility may result because the sound localization capability of the human binaural hearing system is underutilized. In contrast, when two people's voices can be perceived as arriving from distinctly different directions, binaural hearing allows a listener to more easily recognize who is talking, and in many instances, focus on what one person is saying even though two or more people are talking simultaneously.

The present invention takes advantage of the signal processing capability present at either a central MCU or at a conferencing endpoint to add directional cues to the voices present in a conference call. In several of the disclosed embodiments, a conferencing endpoint is equipped with a stereo (or other multichannel) audio presentation capability. The packet data streams arriving from other participants' locations are decoded if necessary. Each stream is then mapped into different, preferably non-overlapping arrival directions or sound field sectors by manipulating, e.g., the separation, phase, delay, and/or audio level of the stream for each presentation channel. The mapped streams are then mixed to form the stereo (or multichannel) presentation channels.

A further aspect of the invention is the capability to control the perceived arrival direction of each participant's voice. This may be done automatically, i.e., a controller can partition the available presentation sound field to provide a sector of the sound field for each participant, including changing the partitioning as participants enter or leave the conference. In an alternate embodiment, a Graphical User Interface (GUI) is presented to the user, who can position participants according to their particular taste, assign names to each, etc. The GUI can even be combined with Voice Activity Detection (VAD) to provide a visual cue as to who is speaking at any given time.

In accordance with the preceding concepts and one aspect of the invention, methods for manipulating multiple packet voice streams to create a perception of spatial separation between conference participants are disclosed. Each packet voice stream may represent monaural audio data, stereo audio data, or a larger number of capture channels. Each packet voice stream is mapped onto the presentation channels in a manner that allocates a particular sound field sector to that stream, and then the mapped streams are combined for presentation to the conferencer as a combined sound field.

In one embodiment, the methods described above are implemented in software. In other words, one intended embodiment of the invention is an apparatus comprising a computer-readable medium containing computer instructions that, when executed, cause a processor or multiple communicating processors to perform a method for manipulating multiple packet voice streams to create a perception of spatial separation between conference participants.

In a second aspect of the invention, a conferencing sound localization system is disclosed. The system includes means for manipulating a capture or transmit sound field into a sector of a presentation sound field, and means for specifying different presentation sound field sectors for different capture or transmit sound fields. The sound localization system can be located at a conferencing endpoint, or embodied in a central MCU.

BRIEF DESCRIPTION OF THE DRAWING

The invention may be best understood by reading the disclosure with reference to the drawing, wherein:

FIG. 1 illustrates a packet-switched stereo telephony system;

FIG. 2 illustrates a packet-switched stereo telephony system in use for a conference call;

FIG. 3 illustrates a packet-switched stereo telephony system in use for a conference call according to an embodiment of the invention;

FIG. 4 correlates different parts of a packet-switched stereo telephony transmission path with channel terminology;

FIG. 5 shows packet data virtual channels that exist in a three-way conference with mixing provided at the endpoints;
`
`
`
FIG. 6 contains a high-level block diagram for endpoint signal processing according to an embodiment of the invention;

FIGS. 7A and 7B illustrate a GUI display useful with the invention;

FIG. 8 illustrates a packet-switched stereo telephony system in use for a conference call according to an embodiment of the invention;

FIG. 9 contains a high-level block diagram for endpoint signal processing for an endpoint utilizing a direction finder to map speakers from a common endpoint to different locations in a presentation sound field;

FIGS. 10A and 10B illustrate further aspects of a GUI display useful with the invention;

FIG. 11 correlates different parts of a central-MCU packet-switched stereo telephony transmission path with channel terminology;

FIG. 12 shows packet data virtual channels that exist in a three-way conference with mixing provided at a central MCU according to an embodiment of the invention;

FIGS. 13A and 13B contain a high-level block diagram for central MCU signal processing according to an embodiment of the invention;

FIG. 14 shows packet data channels existing in a three-way conference with mixing provided at a central MCU, but with each endpoint able to specify source locations in its presentation sound field, according to an embodiment of the invention; and

FIG. 15 contains a high-level block diagram for a conferencing endpoint that provides presentation sound field mapping for voice data at its point of origination.
DETAILED DESCRIPTION

As an introduction to the embodiments, a brief introduction to some underlying technology and related terminology is useful. Referring to FIG. 1, one-half of a two-way stereo conference between two endpoints (the half allowing A to hear B1, B2, and B3) is depicted. A similar reverse path (not shown) allows A's voice to be heard by B1, B2, and B3.

The elements shown in FIG. 1 include: two microphones 20L, 20R connected to an encoder 24 via capture channels 22L, 22R; two speakers 26L, 26R connected to a decoder 30 via presentation channels 28L, 28R; and a packet data network 32 over which encoder 24 and decoder 30 communicate.

Microphones 20L and 20R simultaneously capture the sound field produced at two spatially separated locations when B1, B2, or B3 talk, translate the captured sound field to electrical signals, and transmit those signals over left and right capture channels 22L and 22R. Capture channels 22L and 22R carry the signals to encoder 24.

Encoder 24 and decoder 30 work as a pair. Usually at call setup, the endpoints establish how they will communicate with each other using control packets. As part of this setup, encoder 24 and decoder 30 negotiate a codec (compressor/decompressor) algorithm that will be used to transmit capture channel data from encoder 24 to decoder 30. The codec may use a technique as simple as Pulse-Code Modulation (PCM), or a very complex technique, e.g., one that uses subband coding and/or predictive coding to decrease bandwidth requirements. Voice Activity Detection (VAD) may be used to further reduce bandwidth. Many codecs have been standardized and are well known to those skilled in the art, and the particular codec selected is not critical to the operation of the invention. For stereo or other multichannel data, various techniques may be used to exploit channel correlation as well.

Encoder 24 gathers capture channel samples for a selected time block (e.g., 10 ms), compresses the samples using the negotiated codec, and places them in a packet along with header information. The header information typically includes fields identifying source and destination, timestamps, and may include other fields. A protocol such as RTP (Real-time Transport Protocol) is appropriate for transport of the packet. The packet is encapsulated with lower layer headers, such as an IP (Internet Protocol) header and a link-layer header appropriate for the encoder's link to packet data network 32. The packet is then submitted to the packet data network. This encoding process is then repeated for the next time block, and so on.
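
The per-block encoding loop described above can be sketched as follows. This is only an illustration under stated assumptions, not the patent's implementation: the 8 kHz sample rate, the use of uncompressed 16-bit PCM as the "codec", the dynamic payload type 96, and the helper name `packetize` are all hypothetical choices. The 12-byte header follows the fixed RTP header layout; lower-layer (UDP/IP/link) encapsulation is left to the network stack.

```python
import struct

SAMPLE_RATE_HZ = 8000                                   # assumed voice sample rate
BLOCK_MS = 10                                           # time block named in the text
SAMPLES_PER_BLOCK = SAMPLE_RATE_HZ * BLOCK_MS // 1000   # 80 samples per packet


def packetize(samples, seq, ssrc, payload_type=96):
    """Yield RTP-style packets, one per complete 10 ms block of 16-bit samples."""
    timestamp = 0
    last_start = len(samples) - SAMPLES_PER_BLOCK
    for start in range(0, last_start + 1, SAMPLES_PER_BLOCK):
        block = samples[start:start + SAMPLES_PER_BLOCK]
        # "Compression" here is plain 16-bit PCM (an assumption for brevity).
        payload = struct.pack(f"!{len(block)}h", *block)
        # Fixed RTP header: version 2, payload type, sequence, timestamp, SSRC.
        header = struct.pack("!BBHII", 0x80, payload_type & 0x7F,
                             seq & 0xFFFF, timestamp & 0xFFFFFFFF, ssrc)
        yield header + payload
        seq += 1
        timestamp += SAMPLES_PER_BLOCK
```

Calling `list(packetize(samples, seq=0, ssrc=0x1234))` on 20 ms of audio would produce two packets whose timestamps differ by one block of samples.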

Packet data network 32 uses the destination addressing in each packet's headers to route that packet to decoder 30. Depending on a variety of network factors, some packets may be dropped before reaching decoder 30, and each packet can experience a somewhat random network transit delay, which in some cases can cause packets to arrive at their destination in a different order than that in which they were sent.

Decoder 30 receives the packets, strips the packet headers, and re-orders the packets according to timestamp. If a packet arrives too late for its designated playout time, however, the packet will simply be dropped by the decoder. Otherwise, the re-ordered packets are decompressed and amplified to create two presentation channels 28L and 28R. Channels 28L and 28R drive acoustic speakers 26L and 26R.
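
The re-ordering and late-drop behavior of the decoder can be illustrated with a small sketch. The tuple layout, the function name `playout_order`, and the way the playout deadline is supplied are assumptions made for the example; decompression and amplification are omitted.

```python
def playout_order(packets, playout_deadline):
    """Return payloads in timestamp order, dropping packets that arrived too late.

    packets: iterable of (timestamp, arrival_time, payload) tuples.
    playout_deadline: function mapping a timestamp to the latest arrival
        time at which that packet can still be played out.
    """
    on_time = [p for p in packets if p[1] <= playout_deadline(p[0])]
    on_time.sort(key=lambda p: p[0])          # re-order by timestamp
    return [payload for _, _, payload in on_time]
```

For example, with timestamps in milliseconds and a deadline of 50 ms after each timestamp, a packet stamped 10 that arrives at 90 ms is discarded while earlier and later packets that arrived in time are played in timestamp order.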

Ideally, the whole process described above occurs in a relatively short period of time, e.g., 250 ms or less from the time B1 speaks until the time A hears B1's voice. Longer delays cause noticeable voice quality degradation, but can be tolerated to a point.

A's binaural hearing capability allows A to localize each speaker's voice in a distinct location within their listening environment. If the delay and amplitude differences between the sound field at microphone 20L and at microphone 20R can be faithfully transmitted and then reproduced by speakers 26L and 26R, B1's voice will appear to A to originate at roughly the dashed location shown for B1. Likewise, B2's voice and B3's voice will appear to A to originate, respectively, at the dashed locations shown for B2 and B3.

Now consider the three-way conference of FIG. 2. A third endpoint, endpoint C, with two additional conference participants C1 and C2 has been added. Endpoint C uses an encoder 32, capture channels 34L and 34R, and microphones 36L and 36R in much the same way as described for the corresponding components of endpoint B.

Decoder/mixer 38 differs from decoder 30 of FIG. 1 in several significant respects. First, decoder/mixer 38 must be capable of receiving, processing, and decoding two packet voice data streams simultaneously. Second, decoder/mixer 38 must add the left decoded signals from endpoints B and C together in order to create presentation channel 28L, and must do likewise with the right decoded signals to create presentation channel 28R.
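
The channel-wise addition performed by a decoder/mixer can be sketched as below. The function name `mix_stereo` and the clipping of sums to the 16-bit range are assumptions for the example; the text itself only requires that left signals be added to left and right to right.

```python
def mix_stereo(streams):
    """Sum decoded (left, right) sample lists from several endpoints per channel."""
    n = min(len(left) for left, _ in streams)

    def clip(x):
        # Keep sums inside the 16-bit sample range (an assumed sample format).
        return max(-32768, min(32767, x))

    left = [clip(sum(s[0][i] for s in streams)) for i in range(n)]
    right = [clip(sum(s[1][i] for s in streams)) for i in range(n)]
    return left, right
```

Note that plain addition places both endpoints' sound fields on top of one another, which is the overlap that the sector mapping described later is meant to avoid.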

FIG. 2 illustrates the perception problem that A now faces in the three-way conference. The perceived locations of B1 and C1 overlap, as do the perceived locations of B2 and C2. A can no longer identify from directional cues alone who is speaking, and cannot use binaural hearing to sort out two simultaneous speakers' voices that appear to be originating at the same general location. Of course, with a monaural three-way conference, a similar problem exists, as all speakers from all endpoints would appear to be speaking from the same central location.

FIG. 3 illustrates the operation of one embodiment of the invention for the conferencing configuration of FIG. 2. To illustrate a further aspect of the invention, a fourth endpoint D, with a corresponding encoder 40, capture channel 42, and microphone 44 has been added. Endpoint D has only monaural capture capability, as opposed to the stereo capture capability of endpoints B and C. Decoder/mixer 38 of FIG. 2 has been replaced with a packet voice conferencing system 46 according to an embodiment of the invention. All other conferencing components of FIG. 2 have been carried over into FIG. 3.

Whereas, in the preceding illustrations, the decoder or decoder/mixer attempted to recreate at endpoint A the capture sound field(s), that is no longer the case in FIG. 3. The presentation sound field has been divided into three sectors 48, 50, 52. Voice data from endpoint B has been mapped to sector 48, voice data from endpoint C has been mapped to sector 50, and voice data from endpoint D has been mapped to sector 52 by system 46. Thus endpoint B's capture sound field has been recreated "compressed" and shifted over to A's left, endpoint C's capture sound field has been compressed and appears roughly right of center, and endpoint D's monaural channel has been converted to stereo and shifted to the far right of A's perceived sound field. Although the conference participants' voices are not recreated according to their respective capture sound fields, the result is a perceived separation between each speaker. As stated earlier, such a mapping can have beneficial effects in terms of A's recognition of who is speaking and in focusing on one voice if several persons speak simultaneously.
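
One simple way a controller might divide a presentation sound field among endpoints is sketched below. Equal-width sectors, the angular range of -90 to +90 degrees, and the function name `partition_sound_field` are assumptions for illustration; the sectors shown in FIG. 3 need not be equal, and the text leaves the partitioning policy open.

```python
def partition_sound_field(endpoints, left_deg=-90.0, right_deg=90.0):
    """Split [left_deg, right_deg] into equal, non-overlapping sectors.

    Returns a dict mapping each endpoint to its (start, end) angles in
    degrees, assigned left to right in the order the endpoints are given.
    """
    width = (right_deg - left_deg) / len(endpoints)
    return {ep: (left_deg + i * width, left_deg + (i + 1) * width)
            for i, ep in enumerate(endpoints)}
```

Re-running the function with a longer or shorter endpoint list gives a new partition, which corresponds to changing the partitioning as participants enter or leave the conference.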

Turning briefly to FIG. 4, the meaning of several terms as they apply in this description is explained. A capture sound field is the sound field presented to a microphone. A presentation sound field is the sound field presented to a listener. A capture channel is a signal channel that delivers a representation of a capture sound field to an encoding device; this may be anything from a simple wire pair, to a wireless link, to a telephone and PBX or PSTN facilities used to deliver a telephone signal to a remote voice network gateway. A transmit channel is a packet-switched virtual channel, or possibly a Time-Division-Multiplexed (TDM) channel, between an encoder and a mixer; sections of such a channel may be fixed, e.g., a modem connection, but in general each packet will share a physical link with other packet traffic. And although separate transmit channels may be used for each capture channel originating at a given endpoint, in general a common transmit channel for all capture channels is preferred. A presentation channel is a signal channel that exists between a mixer and a device (e.g., an acoustic speaker) used to create a presentation sound field; this may include wiring, wireless links, amplifiers, filters, D/A or other format converters, etc. As will be explained later, part of the presentation channel may also exist on the packet data network when the mixer and acoustic speakers are not co-located.

In the following description, most examples make reference to a three-way conference between three endpoints. Each endpoint can have more than one speaking participant. Furthermore, those skilled in the art recognize that the concepts discussed can be readily extended to larger conferences with many more than three endpoints, and the scope of the invention extends to cover larger conferences. On the other end of the endpoint spectrum, some embodiments of the invention are useful with as few as two conferencing endpoints, with one endpoint having two or more speakers with different capture channel arrival angles.

FIG. 5 illustrates, for a three-endpoint conference, one channel configuration that can be used with the invention. Endpoint A multicasts a packet voice data stream over virtual channel 60. Somewhere within packet data network 32, a switch or router (not shown) splits the stream, sending the same packet data over virtual channel 62, to endpoint C, and over virtual channel 64, to endpoint B. If this multicast capability is unsupported, endpoint A can broadcast two unicast packet voice data streams, one to each other endpoint.

Endpoint A also receives two packet voice data streams, one over virtual channel 68 from endpoint B, and one over virtual channel 74 from endpoint C. In general, each endpoint receives N-1 packet voice data streams, and transmits either one voice data stream, if multicast is supported, or N-1 unicast data streams otherwise. Accordingly, this channel configuration is better suited to smaller conferences (e.g., three or four endpoints) than it is to larger conferences, particularly where bandwidth at one or more endpoints is an issue.
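
The stream counts for this endpoint-mixed configuration follow directly from the N-1 rule above. The helper below, with the hypothetical name `streams_per_endpoint`, simply restates that arithmetic.

```python
def streams_per_endpoint(n_endpoints, multicast):
    """Return (received, transmitted) voice stream counts for one endpoint."""
    received = n_endpoints - 1
    transmitted = 1 if multicast else n_endpoints - 1
    return received, transmitted
```

For the three-endpoint conference of FIG. 5, each endpoint receives two streams and sends one (multicast) or two (unicast); with ten endpoints and no multicast, each endpoint would carry nine streams in each direction, which is why per-endpoint bandwidth limits the conference size.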

FIG. 6 illustrates a high-level block diagram for one embodiment of a packet voice conferencing system 46. Network interface 80 provides connectivity between a packet-switched network and the remainder of system 46. Controller 88 sends and receives control packets to/from remote endpoints using network interface 80. Incoming voice data packets are forwarded by network interface 80 to packet switch 82. Although not illustrated in this embodiment, the system will typically also contain an encoder for outgoing conference voice traffic. The encoder will submit outgoing voice data packets to network interface 80 for transmission. Network interface 80 can comprise the entire protocol stack and physical layer hardware, an application driver that receives RTP and control packets, or something in between.

Packet switch 82 distributes voice data packets to the appropriate decoder. In FIG. 6, it is assumed that two remote endpoints are broadcasting voice data streams to system 46, and so two decoders 84 and 86 are employed, one per stream. Packet switch 82 distributes voice packets from one remote endpoint to decoder 84, and distributes voice packets from the other remote endpoint to decoder 86 (when more endpoints are joined in the conference, the number of decoders, jitter buffers, and channel mappers is increased accordingly). Packet switch 82 identifies packets belonging to a given voice data stream by examining header fields that uniquely identify the voice stream; for an RTP/UDP (User Datagram Protocol)/IP packet, these fields can be, e.g., one or more of the source IP address, source UDP port, and RTP SSRC (synchronization source) identifier. Controller 88 is responsible for providing packet switch 82 with the field values for a given voice stream, and with an association of those field values with a decoder.
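
The lookup performed by the packet switch can be pictured as a table keyed on the identifying header fields. In this sketch the class name `PacketSwitch`, its method names, and the choice to key on all three fields at once are assumptions; the text allows any one or more of the fields to be used, and a decoder is represented here by any callable that accepts a payload.

```python
class PacketSwitch:
    """Route voice packets to decoders by (source IP, source UDP port, SSRC)."""

    def __init__(self):
        self._routes = {}

    def associate(self, src_ip, src_port, ssrc, decoder):
        # Called by the controller to bind a stream's field values to a decoder.
        self._routes[(src_ip, src_port, ssrc)] = decoder

    def dispatch(self, src_ip, src_port, ssrc, payload):
        # Look up the decoder for this stream; unknown streams are not forwarded.
        decoder = self._routes.get((src_ip, src_port, ssrc))
        if decoder is None:
            return False
        decoder(payload)
        return True
```

Adding an endpoint to the conference then amounts to one more `associate` call from the controller, alongside a new decoder, jitter buffer, and channel mapper.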

Decoders 84 and 86 can use any suitable codec upon which the system and the respective encoding endpoint successfully agree. Each codec may be renegotiated during a conference, e.g., if more participants place a bandwidth or processing strain on conference resources. And the same codec need not be run by each decoder; indeed, in FIG. 6, decoder 84 is shown decoding a stereo voice data stream, while decoder 86 is shown decoding a monaural voice data stream. Controller 88 performs the actual codec negotiation with remote endpoints. In response to this negotiation, controller 88 activates, initializes, and reinitializes (when and if necessary) each decoder as needed for the conference. In most implementations, each decoder will be a process or thread running on a digital signal processor or general purpose processor, but many codecs can also be implemented in hardware. The maximum number of streams that can be concurrently decoded in such an implementation will generally be limited by real-time processing power and available memory.

Jitter buffers 90, 92, and 94 receive the voice data streams output by decoders 84 and 86. The purpose of the jitter buffers is to provide for smooth audio playout, i.e., to account for the normal fluctuations in voice data sample arrival rate from the decoders (both due to network delays and to the fact that many samples arrive in each packet). Each jitter buffer ideally attempts to insert as little delay in the transmission path as possible, while ensuring that audio playout is rarely, if ever, starved for samples. Those skilled in the art recognize that various methods of jitter buffer management are well known, and the selection of a particular method is left as a design choice. In the embodiment shown in FIG. 6, controller 88 controls jitter buffer synchronization by manipulating the relative delays of the buffers.
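
As one illustration of the design space, a very simple jitter buffer can hold back playout until a target number of samples has accumulated and substitute silence when it runs dry. The class name `JitterBuffer`, the fixed target depth, and the re-priming policy are assumptions chosen for brevity; the text deliberately leaves the management method as a design choice.

```python
from collections import deque


class JitterBuffer:
    """Fixed-target jitter buffer: prime to a depth, then play out in order."""

    def __init__(self, target_depth):
        self.target_depth = target_depth   # samples to accumulate before playout
        self._queue = deque()
        self._primed = False

    def push(self, samples):
        # Decoded samples arrive in bursts, one burst per packet.
        self._queue.extend(samples)
        if len(self._queue) >= self.target_depth:
            self._primed = True

    def pull(self, n):
        # Playout asks for n samples at a steady rate.
        if not self._primed:
            return [0] * n                  # still filling: output silence
        out = [self._queue.popleft() for _ in range(min(n, len(self._queue)))]
        if len(out) < n:
            self._primed = False            # underrun: pad and re-prime
            out += [0] * (n - len(out))
        return out
```

A larger `target_depth` makes starvation less likely at the cost of added delay, which is the trade-off the paragraph above describes; a controller could adjust the depth per buffer to keep several buffers aligned.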

Channel mappers 96 and 98 each manipulate their respective input voice data channels to form a set of presentation mixing channels. Controller 88 manages each channel mapper by providing mapping instructions, e.g., the number of input voice data channels, the number of output presentation mixing channels, and the presentation sound field sector that should be occupied by the presentation mixing channels. This last instruction can be replaced by more specific instructions, e.g., delay the left channel 2 ms, mix 50% of the left channel into the right channel, etc., to accomplish the mapping. In the former case, the channel mapper itself contains the ability to calculate a mapping to a desired sound field; in the latter case, these computations reside in the controller, and the channel mapper performs basic signal processing functions such as channel delaying, mixing, phase shifting, etc., as instructed.
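
The "more specific instructions" case can be sketched with the two example operations named above: delaying the left channel and mixing a fraction of the left channel into the right. The function names, the 8 kHz default sample rate, and the choice to mix the undelayed left signal into the right channel are assumptions; the text does not say whether the delay or the mix is applied first.

```python
def delay(channel, n_samples):
    """Delay a channel by n_samples, keeping its length (zeros in, tail dropped)."""
    if n_samples <= 0:
        return list(channel)
    return ([0.0] * n_samples + list(channel))[:len(channel)]


def map_channels(left, right, left_delay_ms, left_into_right, sample_rate_hz=8000):
    """Apply 'delay left by X ms, mix a fraction of left into right'."""
    d = round(left_delay_ms * sample_rate_hz / 1000)   # delay in whole samples
    out_left = delay(left, d)
    # Assumption: the undelayed left signal is what gets mixed into the right.
    out_right = [r + left_into_right * l for l, r in zip(left, right)]
    return out_left, out_right
```

With the instruction from the text, the call would be `map_channels(left, right, left_delay_ms=2, left_into_right=0.5)`, which at 8 kHz delays the left channel by 16 samples.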

A number of techniques are available for sound field mapping. From studies of human hearing capabilities, it is known that directional cues are obtained via several different mechanisms. The pinna, or outer projecting portion of the ear, reflects sound into the ear in a manner that provides some directional cues, and serves as a primary mechanism for locating the inclination angle of a sound source. The primary left-right directional cue is ITD (interaural time delay) for mid-low to mid frequencies (generally several hundred Hz up to about 1.5 to 2 kHz). For higher frequencies, the primary left-right directional cue is ILD (interaural level differences). For extremely low frequencies, sound localization is generally poor.

ITD sound localization relies on the difference in time that it takes for an off-center sound to propagate to the far ear as opposed to the nearer ear; the brain uses the phase difference between left and right arrival times to infer the location of the sound source. For a sound source located along the symmetrical plane of the head, no inter-ear phase difference exists; phase difference increases as the sound source moves left or right of center, the difference reaching a maximum when the sound source reaches the extreme left or right of the head. Once the ITD that causes the sound to appear at the extreme left or right is reached, further delay may be perceived as an echo.
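
The qualitative behavior described above (zero delay at center, growing to a maximum at the extreme left or right) is matched by a common textbook approximation in which the delay is proportional to the sine of the source azimuth. That model, the 0.18 m ear spacing, the 343 m/s speed of sound, and the function name `itd_seconds` are all assumptions introduced here for illustration; none of them come from the patent.

```python
import math


def itd_seconds(azimuth_deg, ear_spacing_m=0.18, speed_of_sound=343.0):
    """Approximate interaural time delay for a source at azimuth_deg.

    0 degrees is straight ahead; positive angles are to the listener's right.
    Uses the simple model ITD = (d / c) * sin(azimuth), an assumed approximation.
    """
    return ear_spacing_m / speed_of_sound * math.sin(math.radians(azimuth_deg))
```

Under these assumed values the maximum delay, at 90 degrees, is roughly half a millisecond, so a channel mapper working in this model would stay within that range to avoid the echo effect mentioned above.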

In contrast, ILD is based on inter-ear differences in the perceived sound level; e.g., the brain assumes that a sound that seems louder in the left ear originated on the left side of the head. For higher frequencies (where ITD sound localization becomes difficult), humans rely chiefly on ILD to infer source location.

Channel mappers 96 and 98 can position the apparent location of their assigned conferencing endpoints within the presentation sound field by manipulating ITD and/or ILD for their assigned voice data channels. If a conferencing endpoint is broadcasting monaurally and the presentation system uses stereo, a preliminary step can be to split the single channel by forming identical left and right channels. Or, the single channel can be directly mapped to two channels with appropriate ITD/ILD effects introduced in each channel. Likewise, an ITD/ILD mapping matrix can be used to translate a monophonic or stereophonic voice data channel to, e.g., a traditional two-speaker, 3-speaker (left, right, center) or 5.1 (left-rear, left, center, right, right-rear, subwoofer) format.
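
A direct mono-to-stereo mapping with both ILD and ITD effects might look like the following sketch. The constant-power gain law, the linear relation between pan position and delay, the 0.6 ms maximum delay, and the function name `place_mono` are assumptions made for the example rather than the mapping used in the patent.

```python
import math


def place_mono(samples, azimuth_deg, sample_rate_hz=8000, max_itd_s=0.0006):
    """Map a mono stream to (left, right) using a level difference and a delay."""
    pan = max(-1.0, min(1.0, azimuth_deg / 90.0))       # -1 far left, +1 far right
    angle = (pan + 1.0) * math.pi / 4.0                 # 0 .. pi/2
    gain_l, gain_r = math.cos(angle), math.sin(angle)   # ILD: constant-power gains
    lag = round(abs(pan) * max_itd_s * sample_rate_hz)  # ITD in whole samples

    def shifted(gain, n):
        # Scale the signal and delay it by n samples, keeping the original length.
        return ([0.0] * n + [gain * s for s in samples])[:len(samples)]

    # The channel on the side away from the source is the delayed one.
    left = shifted(gain_l, lag if pan > 0 else 0)
    right = shifted(gain_r, lag if pan < 0 else 0)
    return left, right
```

A source at 0 degrees yields identical left and right channels, which matches the "identical left and right channels" split mentioned above; moving the azimuth toward either side makes that side louder and earlier.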

Depending on the processing power available for use by the channel mappers, as well as the desired fidelity, various effects ranging from computationally simple to computationally intensive can be used. For instance, one simple ITD approach is to delay one voice data channel from a given endpoint with respect to a companion voice data channel. For stereo, this can be accomplished