DOAVINCI: Direction of Arrival based Videoconferencing Incorporating Neural Networks for Increased Conversational Intelligibility

Nils Poschadel, Stephan Preihs, Jürgen Peissig (2025): DOAVINCI: Direction of Arrival based Videoconferencing Incorporating Neural Networks for Increased Conversational Intelligibility, submitted to Fortschritte der Akustik - DAGA 2025, 51. Jahrestagung für Akustik, Kopenhagen.

This page provides additional material on DOAVINCI, a direction-of-arrival-based videoconferencing tool that incorporates neural networks to enhance conversational intelligibility. It uses a spherical microphone array and a 360° camera to improve both audio and visual focus on active speakers. DOAVINCI employs deep-learning-based direction of arrival (DOA) estimation in the spherical harmonics domain, complemented by voice activity detection. The detected DOA steers a beamformer toward the active speaker, aiming to improve speech intelligibility by attenuating background noise. The DOA also directs a zoomed and perspective-corrected view of the active speaker within the 360° video stream, aligning visual attention with auditory focus. The tool's effect on speech intelligibility is evaluated using the Short-Time Objective Intelligibility (STOI) metric across different realistic scenarios, including varying SNR conditions.
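To illustrate how a DOA estimate could select the region of the 360° video to zoom into, the following minimal sketch maps an azimuth/elevation pair to a pixel position. It assumes an equirectangular frame, with azimuth 0° at the image centre and increasing to the left, and elevation 0° at the horizon. Neither the projection, the angle convention, nor the function name `doa_to_pixel` is taken from the paper, and the perspective correction itself is not covered.

```python
import math


def doa_to_pixel(azimuth_deg, elevation_deg, width, height):
    """Map a DOA to (column, row) in an equirectangular frame.

    Assumed convention (hypothetical, not from the paper):
    azimuth 0 deg = image centre, positive azimuth = towards the left edge,
    elevation 0 deg = horizon, +90 deg = top row.
    """
    # Wrap azimuth into [-180, 180) so any input angle is accepted.
    az = (azimuth_deg + 180.0) % 360.0 - 180.0
    # Elevation is physically limited to [-90, 90].
    el = max(-90.0, min(90.0, elevation_deg))

    x = (0.5 - az / 360.0) * width
    y = (0.5 - el / 180.0) * height

    # Columns wrap around (left and right edge are the same direction),
    # rows are clamped to the valid range.
    col = int(math.floor(x)) % width
    row = min(height - 1, max(0, int(math.floor(y))))
    return col, row


if __name__ == "__main__":
    # Speaker 90 degrees to the left, at horizon height, in a 3840x1920 frame.
    print(doa_to_pixel(90.0, 0.0, 3840, 1920))
```

The returned pixel could then serve as the centre of the crop that is zoomed and perspective-corrected for the active speaker.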

Without beamforming (None)
With Max-EV beamforming

Speech samples with varying SNR and different beamformers

  • 0 dB
    None: STOI 0.51
    Max-EV: STOI 0.61
    Cardioid: STOI 0.59
    Hypercardioid: STOI 0.59
  • 4 dB
    None: STOI 0.59
    Max-EV: STOI 0.71
    Cardioid: STOI 0.68
    Hypercardioid: STOI 0.68
  • 8 dB
    None: STOI 0.66
    Max-EV: STOI 0.77
    Cardioid: STOI 0.76
    Hypercardioid: STOI 0.76
  • 12 dB
    None: STOI 0.72
    Max-EV: STOI 0.82
    Cardioid: STOI 0.81
    Hypercardioid: STOI 0.81
  • 16 dB
    None: STOI 0.75
    Max-EV: STOI 0.86
    Cardioid: STOI 0.85
    Hypercardioid: STOI 0.84
  • 20 dB
    None: STOI 0.78
    Max-EV: STOI 0.88
    Cardioid: STOI 0.87
    Hypercardioid: STOI 0.86
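
To make the comparison across beamformers easier to read, the short sketch below computes the STOI gain of each beamformer over the unprocessed signal (None) from the values listed above. The numbers are copied from this page. The helper names `gains` and `mean_gain` are illustrative only and not part of DOAVINCI.

```python
# STOI values copied from the sample list above: SNR in dB -> beamformer -> STOI.
STOI_BY_SNR = {
    0:  {"None": 0.51, "Max-EV": 0.61, "Cardioid": 0.59, "Hypercardioid": 0.59},
    4:  {"None": 0.59, "Max-EV": 0.71, "Cardioid": 0.68, "Hypercardioid": 0.68},
    8:  {"None": 0.66, "Max-EV": 0.77, "Cardioid": 0.76, "Hypercardioid": 0.76},
    12: {"None": 0.72, "Max-EV": 0.82, "Cardioid": 0.81, "Hypercardioid": 0.81},
    16: {"None": 0.75, "Max-EV": 0.86, "Cardioid": 0.85, "Hypercardioid": 0.84},
    20: {"None": 0.78, "Max-EV": 0.88, "Cardioid": 0.87, "Hypercardioid": 0.86},
}


def gains(table, beamformer):
    """STOI difference (beamformer minus None) for every SNR, rounded to 2 decimals."""
    return {snr: round(row[beamformer] - row["None"], 2) for snr, row in table.items()}


def mean_gain(table, beamformer):
    """Average STOI gain over all listed SNR conditions, rounded to 3 decimals."""
    g = gains(table, beamformer)
    return round(sum(g.values()) / len(g), 3)


if __name__ == "__main__":
    for name in ("Max-EV", "Cardioid", "Hypercardioid"):
        print(name, gains(STOI_BY_SNR, name), "mean:", mean_gain(STOI_BY_SNR, name))
```

For these samples, all three beamformers raise STOI at every listed SNR, with Max-EV giving the largest gain (0.10 to 0.12).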

Speech samples without beamforming and with Max-EV beamforming, after processing by different videoconferencing software, at varying SNR

  • 0 dB
    None: STOI 0.51
    None-Teams: STOI 0.59
    None-Webex: STOI 0.61
    None-Zoom: STOI 0.54
    Max-EV: STOI 0.61
    Max-EV-Teams: STOI 0.65
    Max-EV-Webex: STOI 0.72
    Max-EV-Zoom: STOI 0.68
  • 10 dB
    None: STOI 0.69
    None-Teams: STOI 0.74
    None-Webex: STOI 0.75
    None-Zoom: STOI 0.74
    Max-EV: STOI 0.81
    Max-EV-Teams: STOI 0.85
    Max-EV-Webex: STOI 0.85
    Max-EV-Zoom: STOI 0.84
  • 20 dB
    None: STOI 0.78
    None-Teams: STOI 0.80
    None-Webex: STOI 0.79
    None-Zoom: STOI 0.79
    Max-EV: STOI 0.88
    Max-EV-Teams: STOI 0.88
    Max-EV-Webex: STOI 0.88
    Max-EV-Zoom: STOI 0.88