Press "Enter" to skip to content

The Emerging Future of Spatial Audio for Virtual Presence

Viktor

One of the more exciting applications of reproduced spatial audio, with significant potential to improve human connection and communication over long distances, is spatial audio for virtual presence, a.k.a. telepresence. Imagine joining a meeting on the other side of the world — and instead of hearing a flat, single-channel voice in your headphones, you sense the room. The chairperson’s voice comes from slightly left, the murmured side conversation sits behind you, and the distant hum of a projector completes the acoustic picture. This is the promise of spatial audio for virtual presence — not just teleconferencing, but truly being there without leaving your chair. In this post, we’ll explore the technology, the breakthroughs, and the remaining hurdles.

Introduction

Remote presence has two main components: spatial audio and 360° (VR) video. The applications are numerous — from attending remote meetings in conference rooms to conducting remote inspections or enabling industrial collaboration (see, for example, the company Avatour). Just imagine how many resources, and how much CO2, can be saved by avoiding unnecessary travel to distant locations.

In the last few years, researchers have ticked off some of the last major obstacles to achieving high-quality spatial audio for virtual presence. In this post, I highlight some of the most significant developments from both research and industry.

To get a sense of the concept, I recommend listening to this short spatial audio recording I made featuring my voice. It was captured using a (somewhat noisy) Ambisonics microphone [1] and requires headphones. The recording demonstrates the spatial qualities of the sound field (i.e. the direction of the voice and the slight room reverb). Compare the spatial audio and mono versions below, and notice how much easier it is to distinguish and understand multiple simultaneous voices in spatial audio!

Spatial audio:

Mono:

The current paradigm for conferencing audio is to use advanced algorithms that increase voice intelligibility by isolating the voice of the current speaker and suppressing all other sounds. The result is usually mono audio. This approach works well enough for many applications, but it introduces artifacts and represents a drastic information reduction compared to being in the remote room yourself.

In fact, our brain already has sophisticated capabilities for understanding speech in reverberant environments with multiple simultaneous speakers, as long as we listen with two ears. See the “cocktail party effect” on Wikipedia. Thus, with virtual presence, the idea is to transmit enough information to reproduce binaural (ear) signals instead, and let the brain do its processing. It can actually be desirable to keep environmental background sounds and room reverberation, as these are important for a natural and immersive experience [2].

How it Works

Delivering spatial audio for virtual presence involves:

  1. Capturing spatial audio with a dedicated microphone and encoding it to a spatial audio format.
  2. Transmitting it over a network.
  3. Decoding and reproducing the spatial audio.

Recording

Ambisonics has become a de facto standard for spatial audio capture, so I will focus on it here. In simple terms, it encodes how much sound energy is arriving from different directions at each moment in time. Higher “orders” of Ambisonics provide finer directional resolution.
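
To make this concrete, here is a minimal numpy sketch of first-order encoding in the common ACN/SN3D (“AmbiX”) convention. The function and test signal are my own illustration, but the channel gains follow the standard first-order spherical-harmonic definitions:

```python
import numpy as np

def encode_foa(mono, azimuth, elevation):
    """Encode a mono signal into first-order Ambisonics (ACN/SN3D, 'AmbiX').

    azimuth: radians, counter-clockwise from the front; elevation: radians up.
    Returns shape (4, n_samples) with channel order W, Y, Z, X.
    """
    w = mono.copy()                                  # omnidirectional component
    y = mono * np.sin(azimuth) * np.cos(elevation)   # left-right
    z = mono * np.sin(elevation)                     # up-down
    x = mono * np.cos(azimuth) * np.cos(elevation)   # front-back
    return np.stack([w, y, z, x])

# Example: a 1 kHz tone arriving from 45 degrees to the left
fs = 48000
t = np.arange(fs) / fs
tone = 0.5 * np.sin(2 * np.pi * 1000 * t)
foa = encode_foa(tone, azimuth=np.radians(45), elevation=0.0)
print(foa.shape)  # (4, 48000)
```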

First-order Ambisonics uses four channels and already sounds quite good for voice capture and for diffuse sound fields dominated by background sounds. However, higher orders are needed to adequately capture the directionality of distinct full-band sound sources.

Microphone quality is critical. Until recently, higher-order Ambisonics microphones were prohibitively expensive. A couple of years ago, the third-order “spcmic” by Harpex audio became available, arguably the first and currently only reasonably priced option for a wideband higher-order (order ≥ 3) microphone. Third-order Ambisonics has 16 channels, though the choice of order also depends on available network bandwidth.
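
The quadratic growth in channel count, and the resulting raw bandwidth, is easy to see with a quick back-of-envelope calculation (assuming 48 kHz, 16-bit PCM before any codec compression):

```python
# Channel count and raw bitrate for Ambisonics order N: channels = (N + 1)^2
for order in (1, 3, 5):
    channels = (order + 1) ** 2
    mbps = channels * 48000 * 16 / 1e6  # 48 kHz, 16-bit PCM, no compression
    print(f"order {order}: {channels:2d} channels, ~{mbps:.1f} Mbit/s raw")
# order 1:  4 channels, ~3.1 Mbit/s raw
# order 3: 16 channels, ~12.3 Mbit/s raw
# order 5: 36 channels, ~27.6 Mbit/s raw
```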

More options will surely follow. Apple was recently granted a patent for what appears to be a professional-grade spatial audio microphone array. In iOS 18, Apple also introduced a spatial audio capture API, enabling iPhone 16 to record first-order Ambisonics. While this is great news for the proliferation of spatial audio, mobile devices have restrictions on microphone placement, so a range of tricks and assumptions about the sound field typically need to be employed for spatial audio capture (cf. parametric spatial audio processing). Professional-grade microphone arrays can be expected to provide higher-quality recordings.

The miniDSP ambiMik-1 ambisonic microphone, used for the voice recording above

Encoding and Transmission

Real-time spatial audio transmission has made significant strides. A major milestone came in 2022 with the release of the 3GPP standard for Immersive Voice and Audio Services (IVAS). Backed by industry leaders including Fraunhofer, Dolby, Ericsson, Qualcomm, Nokia, and Huawei, IVAS enables high-quality spatial audio over mobile networks. It supports both higher-order Ambisonics and other immersive formats — so if you’ve ever wondered what you’ll be using that glorious 5G bandwidth for, here’s one answer.
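
IVAS itself is a full codec standard and I won’t attempt to reproduce its API here. But as a rough illustration of what the transmission step involves, here is a toy sketch that sends uncompressed third-order Ambisonics frames over UDP. The frame size, header, and addresses are all simplifying assumptions; a real system would compress the frames with a codec such as IVAS and handle packet loss, reordering, and jitter:

```python
import socket
import struct

import numpy as np

CHANNELS = 16   # third-order Ambisonics
FRAME = 480     # 10 ms at 48 kHz, a typical conferencing frame size

def send_frame(sock, addr, seq, frame):
    """Send one Ambisonics frame as float32 samples with a sequence number."""
    assert frame.shape == (CHANNELS, FRAME)
    header = struct.pack("!I", seq)                  # 4-byte sequence counter
    payload = frame.T.astype(np.float32).tobytes()   # interleaved, sample-major
    # ~30 kB per datagram: a real system would compress (e.g. with IVAS)
    # and split packets to fit the path MTU.
    sock.sendto(header + payload, addr)

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
silence = np.zeros((CHANNELS, FRAME), dtype=np.float32)
send_frame(sock, ("127.0.0.1", 9999), seq=0, frame=silence)
```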

In this 2024 press release, Nokia announced it had made the world’s first spatial audio phone call — unsurprising given Finland’s strong research presence in spatial audio and Nokia’s history with immersive audio, such as the Nokia Ozo project.

For virtual presence, low latency is essential: it should be possible to interact naturally with the people at the other end. Around 100–200 ms, as with Zoom, is usually fine for voice conferencing. Other applications demand far stricter timing, such as playing musical instruments together (e.g. Jamulus [3]).
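
To see how a latency budget like that adds up, here is an illustrative back-of-envelope breakdown. Every number is an assumption for the sake of the example, not a measurement of any particular system:

```python
# Illustrative one-way latency budget for spatial audio conferencing (ms).
budget = {
    "capture buffer (10 ms frame)": 10,
    "encoder lookahead":            20,
    "network + jitter buffer":      60,
    "decoding + binaural render":   10,
    "playback buffer":              10,
}
print(f"total: {sum(budget.values())} ms")  # ~110 ms: OK for speech, too slow for music
```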

Reproduction

For headphone playback, an Ambisonics decoder converts the Ambisonics channels into left and right ear signals. It wasn’t until 2018 that important progress was published on how to perform this conversion at reasonable Ambisonics orders without major quality loss [4].
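
Conceptually, such a decoder is just a bank of FIR filters: one filter pair per Ambisonics channel, with the outputs summed per ear. Here is a minimal time-domain sketch; designing good decoding filters (e.g. with the magnitude-least-squares method cited in the footnotes) is the hard part and is assumed done here:

```python
import numpy as np
from scipy.signal import fftconvolve

def ambisonics_to_binaural(ambi, dec_left, dec_right):
    """Render Ambisonics to binaural: one FIR filter pair per channel, summed.

    ambi:      (n_channels, n_samples) Ambisonics signals
    dec_left:  (n_channels, filter_len) left-ear decoding filters
    dec_right: (n_channels, filter_len) right-ear decoding filters
    """
    left = sum(fftconvolve(ch, h) for ch, h in zip(ambi, dec_left))
    right = sum(fftconvolve(ch, h) for ch, h in zip(ambi, dec_right))
    return np.stack([left, right])
```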

Still, the reproduction side poses some of the biggest challenges. The most obvious are that 1) consumer headphones vary enormously in their frequency response, and 2) everyone’s ears are different. Measuring and correcting for this individual variation can be done in a lab setting, but is difficult to do at scale for many people. Using generic, non-individual reproduction can result in sound coloration, sounds perceived from the wrong directions, and in-head localization.

The way each ear responds to sound from different directions — essentially its acoustic “fingerprint” — is described by Head-Related Transfer Functions (HRTFs). Measuring HRTFs involves putting miniature microphones in the ears and recording responses to sound sources from many angles. This process is, naturally, impractical for the mass market. To overcome this, researchers have explored simplified methods such as scanning or photographing the head to build a 3D model, then using mathematics and machine learning to estimate the HRTFs. Apple has implemented such an approach at scale, but so far the results remain less convincing than direct measurements.
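
Operationally, an HRTF pair for one direction is just two filters: convolving a mono signal with the left- and right-ear impulse responses (HRIRs) places the sound in that direction. A toy sketch with placeholder impulse responses, standing in for a measured pair:

```python
import numpy as np
from scipy.signal import fftconvolve

fs = 48000
# Placeholder HRIRs; a real pair comes from a measured dataset (SOFA file).
hrir_left = np.zeros(256)
hrir_left[0] = 1.0    # direct sound at the nearer (left) ear
hrir_right = np.zeros(256)
hrir_right[4] = 0.8   # ~83 us later and quieter at the far ear

mono = np.random.randn(fs)  # one second of noise as a test signal
binaural = np.stack([fftconvolve(mono, hrir_left),
                     fftconvolve(mono, hrir_right)])  # shape (2, n_samples)
```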

Luckily, perfect reproduction is not needed for every application. As you may have heard in my spatial audio example above, you can still get a decent sense of direction even with non-individual reproduction. Voice has a limited frequency range and less strict requirements. Many impressive binaural demos — such as the famous “Virtual Barber Shop” — use non-individual HRTFs effectively. Head-tracking also helps create an out-of-head listening experience, though in many consumer products latency remains high enough to be noticeable, often due to wireless audio codec delays.
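
Head-tracking works by counter-rotating the sound field before decoding, so that the scene stays fixed in the world as you turn your head. At first order, a yaw rotation only mixes the two horizontal components; a small sketch, with the channel order and sign convention as stated in the comments:

```python
import numpy as np

def rotate_foa_yaw(foa, yaw):
    """Counter-rotate a first-order scene (channels W, Y, Z, X) by the
    listener's yaw in radians, so the scene stays fixed in the world.
    Sign convention assumed here: positive yaw = head turns left."""
    w, y, z, x = foa
    c, s = np.cos(yaw), np.sin(yaw)
    return np.stack([w, c * y - s * x, z, c * x + s * y])
```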

A few tips for judging the quality of binaural spatial audio are in order. First, do you get out-of-head localization? When manufacturers try to produce convincing demos, note that out-of-head localization is much easier to achieve for virtual sound sources to your sides, so try listening with sources to the front! Drenching the sound in reverberation is another “trick” used to increase out-of-head localization. Furthermore, with non-individual HRTFs, elevation perception often contains errors. Try listening to sources that should be directly in front and check whether you perceive them at the correct height.

Final Words

In my own experiments streaming first-order Ambisonics between conference rooms (with standard 2D video via Zoom), people consistently find the experience novel and exciting. You get sonic cues of where people are in the sending room and a whole different feeling of “being there” than mono audio can provide.

It looks like spatial audio for real-time communications will become a thing in the relatively near future. I look forward to it and believe that better long-distance communication can really benefit the world. As this post hopefully made clear, the quality of the experience depends on every part of the signal chain, and it is important to use the best available methods at every step. Since some of the research is very new, this is not a given.

My doctoral thesis deals with optimizing the whole signal chain in binaural reproduction. For more on individual HRTF measurements and individual headphone calibration, see the post about my research on this matter. Also see section 4.4.1 of my thesis.

Note: this article represents my own thoughts and was not sponsored.

Footnotes/References

  1. A miniDSP ambiMik-1, for which I developed the VST-plugin software. It is a first-order Ambisonics microphone, and I also applied Ambisonics upmixing to 4th order. ↩︎
  2. “On the Relative Importance of Visual and Spatial Audio Rendering on VR Immersion” https://www.frontiersin.org/journals/signal-processing/articles/10.3389/frsip.2022.904866/full ↩︎
  3. See https://en.wikipedia.org/wiki/Jamulus ↩︎
  4. Notable advancements in Ambisonics binaural decoders:
    https://www.researchgate.net/publication/325080691_Binaural_Rendering_of_Ambisonic_Signals_via_Magnitude_Least_Squares
    https://www.researchgate.net/publication/325864081_Binaural_rendering_of_Ambisonic_signals_by_head-related_impulse_response_time_alignment_and_a_diffuseness_constraint ↩︎
