This is a preliminary document and may be changed substantially prior to final commercial release of the software described herein.
The information contained in this document represents the current view of Microsoft Corporation on the issues discussed as of the date of publication. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information presented after the date of publication.
This White Paper is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, AS TO THE INFORMATION IN THIS DOCUMENT.
Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under copyright, no part of this document may be reproduced, stored in or introduced into a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise), or for any purpose, without the express written permission of Microsoft Corporation.
Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this document. Except as expressly provided in any written license agreement from Microsoft, the furnishing of this document does not give you any license to these patents, trademarks, copyrights, or other intellectual property.
© 2006 Microsoft Corporation. All rights reserved.
Microsoft, MS-DOS, Windows, Windows NT, Windows Server, Windows Vista, Active Directory, ActiveSync, ActiveX, Direct3D, DirectDraw, DirectInput, DirectMusic, DirectPlay, DirectShow, DirectSound, DirectX, Expression, FrontPage, HighMAT, Internet Explorer, JScript, Microsoft Press, MSN, OneNote, Outlook, PlaysForSure logo, PowerPoint, SideShow, Visual Basic, Visual C++, Visual InterDev, Visual J++, Visual Studio, WebTV, Windows Live, Windows Media, Win32, and Win32s are either registered trademarks or trademarks of Microsoft Corporation in the U.S.A. and/or other countries.
All other trademarks are property of their respective owners.
Some of the links in this document might let you leave Microsoft's site. The linked sites are not under the control of Microsoft and Microsoft is not responsible for the contents of any linked site or any link contained in a linked site, or any changes or updates to such sites. Microsoft is not responsible for webcasting or any other form of transmission received from any linked site. Microsoft is providing these links to you only as a convenience, and the inclusion of any link does not imply endorsement by Microsoft of the site.
Windows Vista™ introduces new advanced audio and communication functionality that enhances the high-fidelity music and movie audio experience and provides great hands-free voice support. For entertainment audio, it supports the same kind of functionality and performance usually found only in expensive, feature-laden audio/video receivers (AVRs), including previously exclusive features like Room Correction and Bass Management. For communications, it supports echo cancellation as well as array microphone voice acquisition. This new functionality is well ahead of the competition for both in-box entertainment and communications services.
The new audio features can be broadly classified into three categories:
· Enhanced music, television, and movie audio playback
· Surround headphones and bass boost for laptop computers
· Advanced communication
This white paper provides an overview of all the audio system effects available in Windows Vista. In addition to information on the actual Digital Signal Processing (DSP) algorithms, the user interface (UI) to access and choose the different effects is also described.
Today, consumers use their PCs to access and enjoy a wide variety of entertainment. It is not uncommon for music, television, and movie playback to be integrated into one machine. With Microsoft® Windows® Media Center, the PC is increasingly found in the living room, and is being used as the entertainment hub for the entire household. New audio features enable the use of media from one place: the computer. Having the source and rendering information present in one place instead of spread out over several sources (such as a CD or DVD player, an AVR, a TV, and so on) provides a much more engrossing playback experience for both the casual and the avid listener.
The new audio effects introduced in Windows Vista are explained in the sections that follow.
One of the frustrating things about current integrated media experiences is inconsistency in volume levels between different sources. You may have noticed that sometimes, when watching TV, even though the program you are watching is at just the right volume level, the commercial breaks can vary widely in volume, causing you to adjust the volume setting accordingly. Some of today's expensive HD-capable televisions feature the ability to equalize volume so that the sound is kept at a somewhat constant level. That works fine if you rely on your TV for sound, but most home theater and home audio enthusiasts do not. They connect their TVs directly to their sound systems. In addition, equalization solutions today are less effective across different audio content and sources. Windows Vista has the ability to maintain a more constant perceived loudness across different digital audio files or sources.
This means that loudness will always be within a constant range—even across digital signals—when performing tasks such as the following:
· Switching between watching an NTSC/ATSC TV broadcast and listening to a locally stored Windows Media Audio (WMA) or MP3 file.
· Switching between different formats in a playlist, which might contain WMA, MP3 or WAV files authored at different volume levels.
This feature is ideal, for example, for watching a movie at night. It makes it easier to hear the quieter parts of the movie while keeping the volume within a range that is considerate of others. You can also use this feature to improve the listening experience in noisy playback settings, by again keeping the quiet parts of the content audible without having disturbingly loud crescendos.
Loudness, in its technical sense, refers to the perceived (internal to the listener) sensation of how loud an auditory stimulus is. Intensity (volume and level) is the external, measured power of an auditory stimulus. Two signals of the same intensity may have substantially different loudness levels if they differ in time structure or frequency content. This leads to the common experience where some content sounds much louder than other content simply because of the source material and the way in which it was recorded. Furthermore, different standards for content (for example, digital TV versus analog TV) may specify different intensity levels. As a result, the perceived level of content may vary widely, from nearly inaudible in a moderately quiet listening environment to loud enough to be uncomfortable.
Loudness equalization (EQ) uses a simulation of human hearing to obtain an accurate loudness measurement—as opposed to intensity—of an audio source and then provides a dynamic gain adjustment to keep the loudness of the sources more constant. Therefore, loudness EQ might affect both dynamic range and peak loudness.
Single-pass loudness EQ calculates loudness on a block-by-block basis and adjusts the gain just as many wideband compressors do—with a fast "attack" and slow "decay"—to tightly control the peak loudness of a signal while keeping the local dynamics. The fast "attack" means that signals that are much louder result in rapid gain reductions in order to control the loudest signal presented to the listener. The slower "decay" means that the gain after a peak increases slowly when the signal level after a peak is not maintained. In this fashion, long-term level changes are somewhat equalized, but short-term dynamics of the signal are preserved. In single-pass loudness EQ the loudness equalization is not full, and some sense of "louder" versus "softer" is deliberately preserved across different material.
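The fast-attack, slow-decay behavior described above can be illustrated with a minimal Python sketch. It is not the Windows Vista algorithm: the block RMS level stands in for the hearing-model loudness measurement, and the function names, the target level, and the `attack` and `decay` constants are all assumptions chosen for illustration.

```python
import math

def block_level_db(block):
    """RMS level of one block in dBFS (a crude stand-in for a loudness model)."""
    rms = math.sqrt(sum(x * x for x in block) / len(block))
    return 20.0 * math.log10(max(rms, 1e-9))

def smooth_gain_db(levels_db, target_db=-20.0, attack=0.5, decay=0.05):
    """Track a gain (in dB) that pulls each block toward target_db.

    Gain reductions are applied quickly (attack); gain increases are
    applied slowly (decay), so short-term dynamics survive.
    """
    gain = 0.0
    gains = []
    for level in levels_db:
        wanted = target_db - level           # gain that would hit the target exactly
        rate = attack if wanted < gain else decay
        gain += rate * (wanted - gain)       # one-pole smoothing toward the wanted gain
        gains.append(gain)
    return gains

# Three quiet blocks, three loud blocks, three quiet blocks (levels in dBFS).
levels = [-20.0] * 3 + [0.0] * 3 + [-20.0] * 3
gains = smooth_gain_db(levels)
```

In this example the gain drops by 10 dB on the first loud block but recovers by less than 1 dB on the first quiet block that follows, which is the asymmetry the text describes.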
This service is very rare in AVRs, and is not generally included in competing products.
Figure 1: Screenshot of the Loudness Equalization option
No home theater would be complete without bass that you can feel. With Windows Vista you can adjust the movie or music playback experience to suit the loudspeakers in your home theater system for maximum bass effects.
In many audio systems, some or all of the loudspeakers are not full-range loudspeakers. In such systems, a single subwoofer is often used to present frequencies below the capabilities of the main loudspeakers. While this system, with one subwoofer, may not maintain all of the auditory cues in the original source material, such systems are very common, and often require pre-filtering of the signals to all channels. Such systems can use a form of bass management called "Forward Bass Management", as shown in the following illustration.
Figure 2: Forward Bass Management
In forward bass management, some or all of the loudspeakers are small, and there can be two large (left and right) loudspeakers, a subwoofer, or both.
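As a rough illustration of forward bass management, the sketch below splits each small-loudspeaker channel at a cutoff frequency and redirects the low band to the subwoofer feed. The first-order filter, the 80 Hz default cutoff, and all names are assumptions; a real crossover would typically use steeper filters.

```python
import math

def one_pole_lowpass(samples, cutoff_hz, sample_rate):
    """First-order low-pass filter (much gentler than a real crossover)."""
    a = math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    out, state = [], 0.0
    for x in samples:
        state = (1.0 - a) * x + a * state
        out.append(state)
    return out

def forward_bass_management(small_channels, lfe, cutoff_hz=80.0, sample_rate=48000):
    """Return (high-passed small channels, subwoofer feed)."""
    sub = list(lfe)
    mains = {}
    for name, samples in small_channels.items():
        low = one_pole_lowpass(samples, cutoff_hz, sample_rate)
        # Complementary high-pass: whatever the low-pass removed stays in the main channel.
        mains[name] = [x - lo for x, lo in zip(samples, low)]
        # The bass the small loudspeaker cannot reproduce is sent to the subwoofer.
        sub = [s + lo for s, lo in zip(sub, low)]
    return mains, sub

# A constant (0 Hz) input is pure "bass": it should end up entirely in the subwoofer.
mains, sub = forward_bass_management(
    {"L": [1.0] * 5000, "R": [1.0] * 5000}, lfe=[0.0] * 5000
)
```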
In other systems, people who prefer full-range loudspeakers in some or all positions may not have—or want—a subwoofer. In such cases, material intended for the subwoofer in home theater systems may need to be mapped back into the main channels so that it is not lost. Such a scheme of bass management is referred to as "Reverse Bass Management", as shown in the following illustration.
Figure 3: Reverse Bass Management
In reverse bass management, there are at least two large loudspeakers, and no subwoofer is present. It is possible to perform reverse bass management with only two or all large loudspeakers, and the LFE material will be distributed appropriately in either case.
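Reverse bass management can be sketched in the same spirit: the LFE signal is folded back into whichever loudspeakers are marked as large. The equal-amplitude split and the function and channel names below are assumptions for illustration; the paper does not state how the LFE material is weighted.

```python
def reverse_bass_management(channels, lfe, large_names):
    """Mix the LFE signal into the large loudspeakers; no subwoofer feed is produced."""
    if not large_names:
        raise ValueError("reverse bass management needs at least one large loudspeaker")
    share = 1.0 / len(large_names)            # assumed equal-amplitude split
    out = {name: list(samples) for name, samples in channels.items()}
    for name in large_names:
        out[name] = [x + share * b for x, b in zip(out[name], lfe)]
    return out

result = reverse_bass_management(
    {"L": [0.0, 1.0], "R": [0.0, 0.0], "C": [0.5, 0.5]},
    lfe=[1.0, 1.0],
    large_names=["L", "R"],
)
```

Here the center channel is treated as a small loudspeaker and is left untouched, while the left and right channels each receive half of the LFE signal.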
In any of these systems, knowledge of the configuration of the loudspeakers and knowledge of the cutoff frequency of the limited-bandwidth loudspeakers can be entered at setup in the Bass Management Settings window, as shown in the following screen shot, and the system will then take the appropriate actions.
Figure 4: Screen shot of the Bass Management Settings window
Most music is produced with only two channels and thus is not optimized for the typical audio/video enthusiast's multi-channel audio equipment. Having your music emanate only from your front left and right loudspeakers is a less than ideal audio experience for those who have invested in a multi-channel setup. With Speaker Fill, Windows Vista can simulate a virtual multi-channel loudspeaker setup, enabling all of the loudspeakers in your room to play sound that would otherwise only originate from two loudspeakers, with enhanced spatial sensations.
Figure 5: Screen shot of the Speaker Fill option
Speaker fill is used when there are more playback channels (or loudspeakers) than there are source channels (n → M, where M ≥ n). The fill is generated by a combination of channel manipulations and inserted delays. This effect can be turned on and off via the Audio control panel. Speaker fill accepts stereo or multi-channel input. Speaker fill for situations where n = M occurs when content is authored for a channel mask corresponding to "n" but the actual physical configuration uses a different channel mask. (For example, quad ↔ surround.)
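To make "channel manipulations and inserted delays" concrete, here is a minimal sketch of a stereo-to-quad fill (n = 2, M = 4). The choice of feeding each rear loudspeaker a delayed, attenuated copy of the matching front channel, the default delay and gain, and the channel labels are all assumptions; the actual Speaker Fill processing is not specified in this paper.

```python
def speaker_fill_stereo_to_quad(left, right, delay_samples=480, rear_gain=0.5):
    """Derive rear channels from a stereo source using a delay and a gain."""
    def delayed(samples):
        # Prepend silence, then trim so the output has the input's length.
        padded = [0.0] * delay_samples + list(samples)
        return [rear_gain * x for x in padded[:len(samples)]]

    return {
        "FL": list(left),
        "FR": list(right),
        "RL": delayed(left),
        "RR": delayed(right),
    }

quad = speaker_fill_stereo_to_quad(
    [1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0], delay_samples=2, rear_gain=0.5
)
```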
Configuring audio in a home theater can take a good deal of time to get just right. Dialog and sound effects should sound as impressive as possible. For this purpose, Windows Vista now features Room Correction processing, which will find a "sweet spot" for the listener. The settings optimize the listening experience for a particular location in the room—for example, the center cushion of your couch—by automatically calculating a combination of delay, frequency response, and gain adjustments. This technology works differently than similar features in high-end receivers since it better accounts for the way the human ear processes sound. With this advance, sound is better matched to the on-screen image. This feature is also useful for users who place their desktop speakers in nonstandard locations.
Room correction requires the use of a microphone to calibrate its settings. The microphone is placed at the location where the user intends to sit, and then the user activates a wizard that measures the room response. Using a set of specifically designed tones from each loudspeaker in turn, the computer calculates the distance, frequency response, and overall gain of each loudspeaker from the listening location at the microphone. This is applicable to both stereo and multi-channel systems. Once these measurements have been made, they are stored as a profile. This profile is used by the room correction DSP to correct the delay, overall gain, and frequency balance between loudspeaker locations so that the listening area will have a good stereo and multi-channel soundstage with improved timbre, envelopment, and front and back sensation compared to the uncorrected system.
If the user has a good microphone, the room and loudspeaker correction will automatically attempt to flatten the frequency response of each channel to compensate for relative differences in each channel, as well as any deficiencies in the frequency response from each channel.
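The delay and gain part of such a profile can be sketched as follows. This is only an illustration under stated assumptions: it aligns every loudspeaker to the farthest one by delaying the nearer ones, trims every loudspeaker down to the quietest measured level, and ignores frequency-response correction entirely. The function name, the dictionary layout, and the speed-of-sound constant are not from the paper.

```python
SPEED_OF_SOUND_M_S = 343.0  # approximate value for air at room temperature

def alignment_profile(distances_m, levels_db):
    """Per-loudspeaker delay (ms) and gain (dB) for one listening position."""
    farthest = max(distances_m.values())
    quietest = min(levels_db.values())
    profile = {}
    for name, distance in distances_m.items():
        profile[name] = {
            # Nearer loudspeakers are delayed so all arrivals coincide.
            "delay_ms": 1000.0 * (farthest - distance) / SPEED_OF_SOUND_M_S,
            # Louder loudspeakers are attenuated to match the quietest one.
            "gain_db": quietest - levels_db[name],
        }
    return profile

profile = alignment_profile(
    distances_m={"L": 3.43, "R": 1.715},
    levels_db={"L": -3.0, "R": 0.0},
)
```

With these hypothetical measurements the right loudspeaker, which is half as far away, is delayed by about 5 ms and attenuated by 3 dB.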
Figure 6: Screen shot of the Room Calibration window
Virtual Surround uses simple digital methods to combine multi-channel signals into a two-channel signal that can be decoded back into a multi-channel signal using the Pro Logic decoders available in most of today's audio receivers. This setting is ideal for a scenario that includes a two-channel sound card and a receiver with a surround sound enhancement mechanism.
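A simplified matrix downmix of this kind might look like the sketch below. It is an assumption-laden illustration, not the encoder used by Windows Vista: the center is mixed equally into both outputs, a single mono surround signal is mixed in with opposite polarity, and the 90-degree phase shifts that real matrix encoders apply to the surround signal are omitted. The sign convention and all names are hypothetical.

```python
import math

G = math.sqrt(0.5)  # about -3 dB

def matrix_encode(left, right, center, surround):
    """Fold L, R, C and a mono surround signal into a two-channel (Lt, Rt) pair."""
    lt = [l + G * c - G * s for l, c, s in zip(left, center, surround)]
    rt = [r + G * c + G * s for r, c, s in zip(right, center, surround)]
    return lt, rt

# Center-only content appears identically in both outputs (it survives in Lt + Rt).
lt_c, rt_c = matrix_encode([0.0], [0.0], [1.0], [0.0])
# Surround-only content appears with opposite polarity (it survives in Rt - Lt).
lt_s, rt_s = matrix_encode([0.0], [0.0], [0.0], [1.0])
```

A decoder can then recover an estimate of the center from the sum of the two channels and an estimate of the surround from their difference.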
In a multi-channel setup, it is generally expected that all the loudspeakers (center and satellite loudspeakers) are always present. However, this is not always the case. Users may not have all the loudspeakers or they may choose to selectively turn off one or more loudspeakers in a multi-channel setup. Reproduction of the sound from the missing channel by splitting it between the adjacent loudspeakers is referred to as Speaker Phantoming. A common example is absence of the center loudspeaker in a multi-channel setup. However, phantoming of other combinations such as rear left and right pairs or side left and right pairs is also possible.
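For the common case of a missing center loudspeaker, phantoming can be sketched as an equal split of the center signal between the left and right loudspeakers. The gain of 1/√2 per side (which keeps the total power of the center signal unchanged) and the function name are assumptions; the paper does not specify the weights.

```python
import math

def phantom_center(left, right, center):
    """Split the center channel equally between the left and right channels."""
    g = math.sqrt(0.5)  # assumed equal-power split, about -3 dB per side
    new_left = [l + g * c for l, c in zip(left, center)]
    new_right = [r + g * c for r, c in zip(right, center)]
    return new_left, new_right

new_left, new_right = phantom_center([0.0, 0.2], [0.0, -0.2], [1.0, 0.0])
```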
More people are watching movies and television shows on their laptop computers. With Windows Vista™, you can take your home theater experience with you on the road.
Movie playback experience on a laptop is taken to the next level with virtualized surround sound over headphones. Headphone Virtualization uses advanced technology called Head Related Transfer Functions (HRTF) to account for the shape of the human head, thereby enabling a virtualized surround sound experience through your stereo headphones. The effect is one in which the sound feels like it is transcending the headphones, providing an "outside-the-head" listening experience. With this experience, users are able to distinguish sound from side to side as well as from front to back.
With conventional headphone playback, spatial cues that a listener would normally experience with playback over loudspeakers are not possible. The result is an unnaturally wide stereo image forming a straight line between your ears. Left and right sounds appear to occur directly beside the listener, while center sounds appear to be within the listener's head. Headphone virtualization transmits spatial cues that help the brain to localize the sounds and integrate them into a sound field, thus giving the listener the feeling that the sounds are "out of the head".
HRTFs describe the acoustic cues that help listeners identify not only the direction and source of a sound but also the type of acoustic environment surrounding them. HRTFs are measurable characteristics that depict the near-ear response, far-ear response, and inter-aural delay (the delay between the two ears). These characteristics can be synthesized using digital signal processing, and the effect can be delivered to headphones. The brain then uses these three-dimensional spatial cues to recreate a convincing spatial listening experience.
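Of the three characteristics, the inter-aural delay is the easiest to illustrate numerically. The sketch below uses the Woodworth spherical-head approximation, which is a textbook model and not necessarily what Windows Vista implements; the head radius, the speed of sound, and the function name are assumptions.

```python
import math

SPEED_OF_SOUND_M_S = 343.0
HEAD_RADIUS_M = 0.0875  # assumed average head radius

def interaural_delay_s(azimuth_deg):
    """Estimated delay between the ears for a distant source.

    azimuth_deg: 0 is straight ahead, 90 is directly to one side.
    """
    if not 0.0 <= azimuth_deg <= 90.0:
        raise ValueError("azimuth_deg must be between 0 and 90")
    theta = math.radians(azimuth_deg)
    return HEAD_RADIUS_M / SPEED_OF_SOUND_M_S * (theta + math.sin(theta))

itd_front = interaural_delay_s(0.0)
itd_side = interaural_delay_s(90.0)
```

Under these assumptions the delay grows from zero for a source straight ahead to roughly two thirds of a millisecond for a source directly to one side. A headphone virtualizer would combine such a delay with the near-ear and far-ear frequency responses.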
This function can only be enabled by indicating that the user is listening with headphones, as shown in the following screen shot.
Figure 7: Screen shot of the Headphone Virtualization option
In systems that have speakers with limited bass capability—such as laptops—it is sometimes beneficial to boost the bass in the frequency range that the speaker can support, in order to increase the perceived quality of the audio. Bass boost essentially provides this functionality by providing a gain in the mid-bass range, thereby making the audio sound better on mobile devices with very small speakers.
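A gain in the mid-bass range is commonly implemented as a peaking filter. The sketch below computes biquad coefficients using the widely published Audio EQ Cookbook peaking formulas and evaluates the resulting gain. The center frequency, boost amount, and Q are hypothetical values; the paper does not say which filter Windows Vista uses.

```python
import cmath
import math

def peaking_biquad(center_hz, gain_db, q, sample_rate):
    """Normalized (b, a) coefficients of a peaking-EQ biquad."""
    a_lin = 10.0 ** (gain_db / 40.0)
    w0 = 2.0 * math.pi * center_hz / sample_rate
    alpha = math.sin(w0) / (2.0 * q)
    b = [1.0 + alpha * a_lin, -2.0 * math.cos(w0), 1.0 - alpha * a_lin]
    a = [1.0 + alpha / a_lin, -2.0 * math.cos(w0), 1.0 - alpha / a_lin]
    return [x / a[0] for x in b], [x / a[0] for x in a]

def magnitude_db(b, a, freq_hz, sample_rate):
    """Filter gain in dB at freq_hz."""
    z_inv = cmath.exp(-1j * 2.0 * math.pi * freq_hz / sample_rate)
    num = b[0] + b[1] * z_inv + b[2] * z_inv * z_inv
    den = a[0] + a[1] * z_inv + a[2] * z_inv * z_inv
    return 20.0 * math.log10(abs(num / den))

# Hypothetical setting: +6 dB centered at 120 Hz for a 48 kHz stream.
b, a = peaking_biquad(center_hz=120.0, gain_db=6.0, q=0.7, sample_rate=48000)
boost_at_center = magnitude_db(b, a, 120.0, 48000)
boost_at_10k = magnitude_db(b, a, 10000.0, 48000)
```

The filter applies the full boost at the center frequency and leaves frequencies far above it essentially unchanged.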
The Audio System Effects are accessible through the Audio control panel (Audio CPL). In the Audio CPL, users can set up the physical loudspeaker configuration. Users can also specify if the audio end points are loudspeakers, headphones, or a Sony/Philips Digital Interface (S/PDIF) receiver, as shown in the following screen shot.
Figure 8: Screenshot of the Audio Control Panel
Once a user chooses the end point, the user must go through a Speaker Configuration Wizard. In the wizard, users choose the loudspeaker setup for the system (stereo or multi-channel) and verify that the setup is accurate by playing test tones through the loudspeakers, as shown in the following screen shot.
Figure 9: Screen shot of the Speaker Configuration Wizard
In addition, users can specify the physical characteristics of the loudspeakers, such as which loudspeakers are full range, whether any loudspeakers are absent, and so on.
The loudspeaker configuration settings determine which audio effects are enabled. The loudspeaker characteristics also set some of the parameters required for audio effects such as Bass Management and Speaker Phantoming.
Upon completion of the Speaker Configuration Wizard, users can select audio system effects pertinent to the loudspeaker configuration. The Enhancements tab in the Audio control panel takes you to the Audio System Effects window.
Figure 10: Screenshot of the Speaker Properties in the Audio Control Panel
Depending on which audio end point, loudspeaker configuration, and loudspeaker characteristics are chosen, certain audio system effects will be available for users to choose from.
The audio system effects available to corresponding loudspeaker configurations are listed in the following table.
Many people are using computers for voice over IP (VoIP), video conferencing, and other real-time uses. As a result, many computers and related devices have an embedded or external microphone that is used for purposes such as dictation, speech recognition, and VoIP telephony. However, under normal conditions, ambient noise and reverberation can make it difficult for a single microphone to capture a useful speech signal. Expensive tele-conferencing equipment solves those problems by using array-based microphones and specialized processing.
Personal computers and other computing devices can usually play sounds well, but they do a poor job of capturing sound due to the typical computer's difficult acoustic environment. With the processing power, storage capacities, broadband connections, and speech-recognition engines available today, computing devices can use better sound capture to deliver more value to customers.
With current computer-based audio technology, it is often possible to provide better live communication than telephones, much better record and playback or note-taking devices than tape recorders, and better command of the user interface than remote controls. All applications that use sound could benefit from better sound capture capabilities. Consider, for example, the following real-time communication applications:
· Windows® Messenger, Microsoft® Windows Live™ Messenger, and all other applications built on top of the Microsoft Real-Time Communication stack, such as AOL Instant Messenger, other applications for VoIP, and enhanced telephony.
· Enterprise solutions for collaboration and groupware applications, such as Microsoft Office Live Meeting, the meeting recording capabilities in Microsoft Office OneNote®, and voice-messaging applications.
Robust speech recognition technologies are still under development, but many Windows-based applications already have voice command integration that works satisfactorily when the user wears a headset with a close-talk microphone that has good sound-capturing quality. Such technologies are especially convenient for Tablet PCs and handheld devices, where otherwise the user would have to enter data with a stylus.
Numerous studies have shown that users don't like to wear headsets or to be tethered to the computer. Furthermore, in many scenarios, headsets are not an option. For example, walking with a headset and a Tablet PC in your hand can feel awkward. Using an array of microphones with a personal computer and other computing devices can alleviate the problems caused by using only one microphone. The goal, "Wear no close-proximity microphone gear; just talk to your computer", implies mobility and freedom of movement.
A microphone array is a set of closely positioned microphones. Microphone arrays achieve better directionality than a single microphone by taking advantage of the fact that an incoming acoustic wave arrives at each of the microphones at a slightly different time. The main concepts that are used in the design of microphone arrays are beam forming, array directivity, and beam width.
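The arrival-time differences mentioned above are what a basic delay-and-sum beamformer exploits. The following sketch assumes a linear array, a far-field source, and whole-sample delays; it is not the Windows Vista microphone array processor, and the names and sign convention are illustrative only.

```python
import math

SPEED_OF_SOUND_M_S = 343.0

def arrival_delays(mic_positions_m, angle_deg):
    """Arrival time at each microphone relative to the array origin.

    angle_deg is measured from broadside (0 = straight in front of the array),
    positive toward increasing position, so microphones on that side hear the
    wave first and get a negative delay.
    """
    sin_a = math.sin(math.radians(angle_deg))
    return [-x * sin_a / SPEED_OF_SOUND_M_S for x in mic_positions_m]

def delay_and_sum(channels, delays_s, sample_rate):
    """Align the channels on the latest arrival, then average them."""
    latest = max(delays_s)
    shifts = [round((latest - d) * sample_rate) for d in delays_s]
    n = len(channels[0])
    out = [0.0] * n
    for samples, shift in zip(channels, shifts):
        for i in range(shift, n):
            out[i] += samples[i - shift] / len(channels)
    return out

# Two microphones 10 cm apart; a source at 90 degrees reaches them at different times.
delays = arrival_delays([-0.05, 0.05], 90.0)

# Toy check: an impulse that reaches channel 0 at sample 5 and channel 1 at sample 7.
ch0 = [0.0] * 10
ch1 = [0.0] * 10
ch0[5] = 1.0
ch1[7] = 1.0
beam = delay_and_sum([ch0, ch1], [0.005, 0.007], sample_rate=1000)
```

Signals from the steered direction add coherently after alignment, while sounds from other directions are misaligned and partially cancel, which is the source of the array's directionality.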
Windows Vista™ includes microphone array support as part of a complete audio subsystem that provides the following advances:
· Improved acoustic echo cancellation
· Microphone array support
· Stationary noise suppressor
· Automatic gain control
· Wideband quality of sound capturing and processing
Microphone array processing is linear and doesn't introduce distortions to the signal, so the microphone array output is good for a human listener and friendly for the speech recognition engine. The Windows Vista audio stack can be used both for real-time communication applications such as Windows Messenger and for speech recognition-enabled applications that include features such as voice commands and dictation.
Some of the potential and preferred design choices for an array microphone placement with laptop and computer monitors are shown in the following illustrations:
a) Laptop: microphones on the top bezel, speakers in front, away from them
b) Monitor: microphones close to the user, loudspeakers low on the monitor
c) Tablet: microphone array on the top, the speaker on the opposite side
d) Laptop/tablet convertible: L-shaped array in good position in laptop mode and not covered by hand in tablet mode
Windows Vista has state-of-the-art audio capture-enhancing algorithms and functionality that go beyond what is offered in competing operating systems and software in existence today. This set of innovations will dramatically improve the quality of real-time communication and speech recognition scenarios on a computer running Windows Vista and has the potential to significantly increase their relevance as well.
Windows Vista provides a new and unique set of audio processing and services for both entertainment and communications scenarios that improves the listener's ability to understand and communicate, as well as enhances the listener's audio playback experience. These new features, provided as in-box services, are on par with expensive AVRs and communication devices in terms of quality and functionality, and far exceed the functionality available in competing operating systems.
The Audio DSP Effects are a set of new components in Windows Vista™ and are provided as System Effect (SysFX) Audio Processing Objects (sAPO). A SysFX component can either reside in the local effects (LFX) space or in the global effects (GFX) space. LFX corresponds to the point just before all the audio streams are mixed in the Audio Engine. GFX corresponds to the point after the audio streams are mixed (post-mix) in the Audio Engine, just before the audio data is passed to the audio adapter.
The following diagram depicts how audio streams that are rendered by different applications (pipeline) interface with the Audio engine.
Figure 11: Audio engine and LFX/GFX
In the preceding diagram, SysFX is part of the audio engine infrastructure. Applications communicate with the SysFX components through the Windows Audio Session API (WASAPI) interface.
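The ordering of the two effect stages can be modeled with a small sketch. This is a toy model of the signal flow only, with effects represented as plain Python functions; it does not use or imitate the APO or WASAPI interfaces, and every name in it is hypothetical.

```python
def render(streams, lfx_chains, gfx_chain):
    """Apply pre-mix effects to each stream, mix, then apply post-mix effects."""
    processed = []
    for samples, chain in zip(streams, lfx_chains):
        for effect in chain:                  # LFX stage: before the mix
            samples = effect(samples)
        processed.append(samples)
    mixed = [sum(frame) for frame in zip(*processed)]
    for effect in gfx_chain:                  # GFX stage: after the mix
        mixed = effect(mixed)
    return mixed

def halve(samples):
    return [0.5 * x for x in samples]

def invert(samples):
    return [-x for x in samples]

# Two application streams: the first has one pre-mix effect, the second has none.
output = render(
    streams=[[1.0, 2.0], [10.0, 20.0]],
    lfx_chains=[[halve], []],
    gfx_chain=[invert],
)
```

The pre-mix effect changes only the first stream, whereas the post-mix effect changes the combined signal that would be handed to the audio adapter.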
The following effects are available in the SysFX APO associated with the in-box class driver.
Local Effects (LFX) include:
· Loudness Equalization
· Bass Management
· Bass Boost
· Speaker Filling
· Speaker Phantoming
· Headphone Virtualization
· Virtual Surround
Global Effects (GFX) include:
· Room Correction
Note: Audio system effects (also referred to as Home Theater DSPs) or SysFX are available on any HD-Audio- and USB Audio-equipped computer that uses in-box class drivers. Third-party audio drivers that do not use the in-box class drivers will either have similar effects of their own or will reuse the in-box SysFX audio DSPs and expose them through the Control Panel.
The preceding diagram also depicts where some of the advanced voice capture DSPs, such as Microphone Array, AEC (Acoustic Echo Canceller), NS (Noise Suppression), and AGC (Automatic Gain Control), reside for use by any application.