Self-supervised Visual Grounding of Sound and Language (2024)


Mark Hamilton
MIT, Microsoft
markth@mit.edu
  Andrew Zisserman
Oxford, Google
  John R. Hershey
Google
  William T. Freeman
MIT, Google

Abstract

We present DenseAV, a novel dual encoder grounding architecture that learns high-resolution, semantically meaningful, and audio-visually aligned features solely through watching videos. We show that DenseAV can discover the “meaning” of words and the “location” of sounds without explicit localization supervision. Furthermore, it automatically discovers and distinguishes between these two types of associations without supervision. We show that DenseAV’s localization abilities arise from a new multi-head feature aggregation operator that directly compares dense image and audio representations for contrastive learning. In contrast, many other systems that learn “global” audio and video representations cannot localize words and sound. Finally, we contribute two new datasets to improve the evaluation of AV representations through speech and sound prompted semantic segmentation. On these and other datasets we show DenseAV dramatically outperforms the prior art on speech and sound prompted semantic segmentation. DenseAV outperforms the previous state-of-the-art, ImageBind, on cross-modal retrieval using fewer than half of the parameters. Project Page: https://aka.ms/denseav

[Figure 1]

1 Introduction

[Figure 2]

Associating audio and video events is a fundamental task in human perception. As infants develop, the synchronization and correspondence of visible sounds enables multi-modal association – a voice with a face, and a “moo” with a cow [55]. Later, as they acquire language, they associate spoken words with objects they represent [11, 49]. Amazingly, these association abilities, constituting speech recognition, sound event recognition, and visual object recognition, develop without much direct supervision. This work aims to create a model with this capability by learning high-resolution, semantically meaningful, audio-visually (AV) aligned representations. Features with these properties can be used to discover fine-grained correspondences between modalities without localization supervision or prior knowledge of the semantic representation of language.

Consider the spoken caption and accompanying sounds of the image shown in Figure 1. We wish to “ground” both the speech and the sounds by identifying them with the corresponding visual objects. For instance, both the spoken word “dog” and the sound of a bark in the audio signal should be associated with the pixels of the dog in the visual signal, if present. We seek high-quality local representations where this behavior, which is notably absent from popular approaches in the literature, emerges from simple inner products between cross-modal features.

To achieve this, we make three innovations. First, we introduce DenseAV, a dual-encoder architecture that computes a dense similarity volume over audio and visual features. Looking at a slice of this similarity volume for a spoken word, as in Figure 1, we can visualize the AV activation strength between a word or sound and an image’s pixels. The novelty we introduce is to extend this dense similarity mechanism to have multiple similarity volume heads, much like those of multi-head attention. This allows each head to specialize on a particular type of coupling between the visual and audio modalities. Interestingly, we discover that if we give DenseAV two heads and train on a dataset that contains both language and sound, the heads naturally learn to distinguish language from more general sound using only cross-modal supervision. For example, as shown in Figure 1, head 1 focuses on sounds, such as a dog bark, emitted by visible objects, whereas head 2 focuses on speech, such as the word “dog”, that refers to visible objects.

Second, we show the importance of the “aggregation function” one uses to create a summary similarity score between an audio clip and a video frame for contrastive learning. The traditional choices, using inner products between global representations such as class tokens [14, 6, 54] or pooled features [65, 21], do not promote AV alignment of dense local features. Because of this, several popular audio-video backbones that excel at cross-modal retrieval cannot directly associate objects and sounds using their local features. This limits their ability to be used for downstream tasks such as semantic segmentation, sound localization, or unsupervised language learning and discovery.

Third, we introduce two semantic segmentation datasets to evaluate visual grounding with AV representations for speech and (non-speech) sounds. We build these datasets from the high-quality segmentation masks provided by the ADE20K dataset [66] and measure mean average precision (mAP) and mean intersection over union (mIoU) on a binary mask prediction task. This evaluation is simpler and more thorough than previous efforts to measure visual grounding, such as the concept counting metrics of [26] and the “pointing games” of [42, 3, 17] that only check whether a heatmap’s peak occurs within a target box or segment. Furthermore, our evaluation avoids brittle WordNet ontologies [37], clustering, Wu and Palmer distance [62], threshold choices, and a variety of other complicating factors.

To summarize, our main contributions are as follows:

  • We introduce DenseAV, a novel self-supervised architecture that learns high-resolution AV correspondences.

  • We introduce a local-feature-based image similarity function that significantly improves a network’s zero-shot localization ability compared to common strategies such as average pooling or CLS tokens.

  • We introduce new datasets for evaluating speech and sound prompted semantic segmentation. We show DenseAV significantly outperforms the current state-of-the-art on these tasks as well as on cross-modal retrieval.

  • We discover that our multi-head architecture naturally disentangles audio-visual correspondence into sound and language components using only contrastive supervision.

[Figure 3]

2 Related Work

Audio-visual (AV), text-visual, and other multi-modal models have a long history [16, 15], and have recently surged in popularity [67]. Broadly speaking, DenseAV is an audio-video contrastive learning architecture; this class of methods learns AV representations by aligning paired signals and pushing apart unpaired signals [12, 30]. Of the models in this class, several stand out for their ability to localize sounds [3, 8, 46] or capture the semantics of language [26, 47]. Many models in this class compare AV signals using inner products between “global” representations formed by pooled deep features [21, 60, 39] or class tokens [54, 46, 47, 20, 35]. Most notably, ImageBind [20] has gained popularity due to its state-of-the-art performance on a variety of tasks and datasets and its unified class-token-based contrastive architecture. In this work we show that many of these architectures do not exhibit strong localization properties in their local features, despite excelling at cross-modal retrieval at a “global” level. This limits their applicability to new out-of-domain sounds, sounds that do not have a textual representation, and low-resource languages. We diverge from these works by directly supervising local tokens. In particular, we build on previous works [26, 3] that show max-pooling improves localization capabilities, and introduce a new multi-head aggregation operator that generalizes previous losses using a self-attention-like operator [59].

Another class of methods discovers structure in signals through uni- and multi-modal clustering. Early works on audio clustering [45] discovered meaningful utterances without supervision. Similar visual analyses have discovered visual objects [31, 9, 5, 23]. Recent works have applied these ideas to the AV domain [24, 2], but do not focus on extracting high-resolution AV representations.

Finally, several works investigate generative audio-video learning. The Sound of Pixels [64] generates the sound of a specific object using a source separation loss. Newer approaches using GANs [33, 34] and diffusion models [10, 20, 36] have generated audio from video and vice versa. Here we focus on improving the local representations of contrastive learners because of their relative scalability, simplicity, and ability to learn high-quality representations.

3 Methods

At a high level, DenseAV tries to determine when a given audio and visual signal belong “together” using dense audio-visual representations. To perform this task robustly, DenseAV must learn to predict the contents of an audio signal from a visual signal and vice versa. Doing so causes DenseAV to learn dense modality-specific features that capture the mutual information shared between the modalities [58]. Once learned, we can directly query these informative features to perform speech and sound prompted semantic segmentation, as illustrated in Figure 1.

More specifically, DenseAV is built from two modality-specific deep featurizers. These backbones produce temporally varying audio features across an audio clip and spatially varying video features for a single randomly selected frame. Our loss computes a similarity between audio and visual signals based on the intuition that two signals are similar if they have a variety of strong couplings or shared objects. More formally, we form a scalar similarity for a pair of audio and video signals by carefully aggregating a volume of pairwise inner products between dense features. We use the InfoNCE [40] contrastive loss to encourage similarity between “positive” pairs of signals and dissimilarity between “negative” pairs formed by in-batch shuffling. Figure 3 graphically depicts this loss function, and subsequent sections detail each component of our architecture.

3.1 Multi-Headed Aggregation of Similarities

DenseAV’s key architectural distinction is its loss function, which directly supervises the “local” tokens of the visual and audio featurizers. This is a significant departure from other works [54, 22, 50, 6, 43, 20] that pool modality-specific information into “global” representations prior to the contrastive loss. Unlike prior works, our loss function aggregates the full pairwise similarities between the local tokens into an aggregate measure of similarity for a given pair of audio and visual signals. We show in Figure 2 that this architectural choice enables DenseAV’s local features to align across modalities, whereas other approaches such as average pooling, class tokens, and SimPool [48] do not.

We first describe our loss function informally and define it more precisely in the next paragraph. Our loss function computes the (un-normalized) inner product between every pair of visual and audio features to form a “volume” of inner products. This volume represents how strongly each part of an audio signal “couples” to each part of a visual signal. We aim to find many large couplings between positive pairs of audio and visual signals. Ideally, these couplings should connect visual objects with their references in the audio signal. Conversely, we do not want to find couplings between negative pairs of signals. To compute a single global coupling strength for a pair of signals, we aggregate this volume of pairwise similarities into a single number. There are myriad ways to aggregate this volume, ranging from “soft” average-pooling to “hard” max-pooling. Average pooling yields dense gradients and can improve convergence speed and stability. However, max-pooling allows the network to focus on the best couplings regardless of an object’s size or a sound’s duration. Our aggregation function combines the benefits of average and max pooling by max-pooling the visual dimensions and average-pooling the audio dimensions, as proposed in [26]. Intuitively speaking, this averages the strongest image couplings over an audio signal. It allows small visual objects to have large effects yet provides a strong training gradient to many regions of the signals. Finally, we draw inspiration from multi-head self-attention [59] and generalize this operation to multiple “heads” that we max-pool before pooling the visual and audio dimensions. This allows DenseAV to discover multiple “ways” to associate objects across modalities.

More formally, let $\mathcal{S}(a,v) \in \mathbb{R}$ represent the similarity between a tensor of audio features $a \in \mathbb{R}^{C \times K \times F \times T}$ of size (Channel $\times$ K-heads $\times$ Frequency $\times$ Time) and a tensor of visual features $v \in \mathbb{R}^{C \times K \times H \times W}$ of size (Channel $\times$ K-heads $\times$ Height $\times$ Width). To define this scalar similarity score, we first create a local similarity volume, $s(a,v) \in \mathbb{R}^{K \times F \times T \times H \times W}$. For simplicity, we consider the aggregated similarity between a single image and audio clip, but note that one can easily generalize this to max-pool over video frames. We define the full pairwise volume of similarities as:

$$s(a,v)[k,f,t,h,w] = \sum_{c=1}^{C} a[c,k,f,t] \cdot v[c,k,h,w] \tag{1}$$

where $a[c,k,f,t]$ represents the value of $a$ at location $[c,k,f,t]$ and $\cdot$ is scalar multiplication. We aggregate this similarity volume into a single score $\mathcal{S}(a,v) \in \mathbb{R}$:

$$\mathcal{S}(a,v) = \frac{1}{FT}\sum_{f=1}^{F}\sum_{t=1}^{T}\max_{k,h,w}\left(s(a,v)[k,f,t,h,w]\right) \tag{2}$$

We note that this operation can be viewed as a multi-head generalization of the MISA loss of [26], and a multi-head, multi-time generalization of the MIL loss of [3].
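To make the aggregation concrete, the following minimal PyTorch sketch implements Equations 1 and 2 with an added batch dimension; the tensor layout and function names are illustrative rather than our exact released code.

```python
import torch

def similarity_volume(a: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Eq. (1): inner products between audio features a (B, C, K, F, T) and
    visual features v (B, C, K, H, W), giving a volume of shape (B, K, F, T, H, W)."""
    return torch.einsum("bckft,bckhw->bkfthw", a, v)

def aggregate_similarity(sim: torch.Tensor) -> torch.Tensor:
    """Eq. (2): max-pool heads and visual positions, then average over audio time.
    Input (B, K, F, T, H, W) -> one scalar similarity per batch item, shape (B,)."""
    pooled = sim.amax(dim=(1, 4, 5))   # max over k, h, w
    return pooled.mean(dim=(1, 2))     # average over f, t
```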

3.2 Loss

We can use the similarity between audio and visual signals defined in Equation 2 to construct a contrastive loss. We follow recent works [20, 18, 61] and use the temperature-weighted InfoNCE [40] loss to encourage similarity between positive pairs of signals and dissimilarity between negative pairs. In DenseAV, we form $B$ positive pairs by splitting the audio and visual components of a batch of training data. We form $B^2 - B$ negative pairs by comparing a signal to all of the other signals in the training batch. More formally, let $(a_b, v_b)_{1}^{B}$ be $B$ pairs of audio and visual signals. The visual-retrieval term of our InfoNCE loss is then:

$$\mathcal{L}_{A \to V} = -\frac{1}{2B}\sum_{b=1}^{B}\left(\log\frac{\exp\left(\gamma\,\mathcal{S}(a_b, v_b)\right)}{\sum_{b'=1}^{B}\exp\left(\gamma\,\mathcal{S}(a_b, v_{b'})\right)}\right) \tag{3}$$

where $\gamma \in \mathbb{R}^{+}$ is a trainable inverse temperature parameter. We symmetrize this loss by adding the analogous audio-retrieval term, $\mathcal{L}_{V \to A}$, which iterates over negative audio signals in the denominator.
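A corresponding sketch of the symmetric loss, with the $\frac{1}{2B}$ weighting of Equation 3 and its mirror folded into two cross-entropy terms, again with assumed tensor layouts:

```python
import torch
import torch.nn.functional as F

def dense_av_infonce(a_feats: torch.Tensor, v_feats: torch.Tensor, gamma: torch.Tensor) -> torch.Tensor:
    """Symmetric InfoNCE over aggregated similarities (Eq. 3 plus its audio-retrieval mirror).
    a_feats: (B, C, K, F, T), v_feats: (B, C, K, H, W), gamma: trainable inverse temperature."""
    B = a_feats.shape[0]
    # all B x B cross-pair similarity volumes, aggregated as in Eq. 2: S[i, j] = S(a_i, v_j)
    sim = torch.einsum("ickft,jckhw->ijkfthw", a_feats, v_feats)
    S = sim.amax(dim=(2, 5, 6)).mean(dim=(2, 3))          # (B, B)
    logits = gamma * S
    targets = torch.arange(B, device=S.device)
    loss_a_to_v = F.cross_entropy(logits, targets)        # negatives range over visual signals
    loss_v_to_a = F.cross_entropy(logits.T, targets)      # negatives range over audio signals
    return 0.5 * (loss_a_to_v + loss_v_to_a)
```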

3.3 Audio and Visual Featurizers

The core of DenseAV is two modality-specific backbone networks. We use the DINO vision transformer [6] with ImageNet-pretrained weights (trained without labels) to provide a strong, yet fully unsupervised, vision backbone. Unlike other approaches that use CLIP [50] as a backbone, DINO does not require paired text captions and learns from unlabeled images only. Practically, we find that DINO outperforms CLIP because of its better-behaved local tokens [13], an effect we explore in the Supplement. We append an additional layer norm operation across the channel dimension [4] and a $1 \times 1$ convolution to DINO. The layer norm and $1 \times 1$ convolution ensure the architecture does not start with a saturated loss function. We use the HuBERT audio transformer [28] as DenseAV’s audio backbone. HuBERT operates on waveforms and is trained on the LibriSpeech [44] dataset using only self-supervision. HuBERT outputs a single feature per time frame, corresponding to $F=1$ in Section 3. Though HuBERT was only trained on speech, its audio features can be fine-tuned for more general sounds, much like how vision backbones can be fine-tuned for new datasets [63]. As in the visual branch, we append a channel-wise LayerNorm block and two $3 \times 3$ convolutions to the audio branch. These layers help the network avoid saturation and speed convergence. Furthermore, the two convolutions help the model aggregate information, which reduces the cost of the pairwise feature comparison used in our loss function. We refer to these added layers after the pretrained backbones as the “aligners” in later sections.
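A sketch of how these aligners could be structured in PyTorch follows; the output width, the nonlinearity between the audio convolutions, and any temporal downsampling are assumptions not specified above.

```python
import torch
import torch.nn as nn

class ChannelNorm(nn.Module):
    """LayerNorm applied across the channel dimension of (B, C, ...) feature maps."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(x.movedim(1, -1)).movedim(-1, 1)

class VisualAligner(nn.Module):
    """Channel-wise LayerNorm + 1x1 convolution applied to DINO patch features (B, C, H, W)."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.net = nn.Sequential(ChannelNorm(in_dim),
                                 nn.Conv2d(in_dim, out_dim, kernel_size=1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class AudioAligner(nn.Module):
    """Channel-wise LayerNorm + two 3x3 convolutions applied to HuBERT features (B, C, F=1, T).
    The ReLU between the convolutions is an assumption."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.net = nn.Sequential(ChannelNorm(in_dim),
                                 nn.Conv2d(in_dim, out_dim, kernel_size=3, padding=1),
                                 nn.ReLU(),
                                 nn.Conv2d(out_dim, out_dim, kernel_size=3, padding=1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)
```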

3.4 Regularizers

Disentanglement Regularizer, $\mathcal{L}_{Dis}$: We add a small regularization term to encourage each head of Equation 1 to specialize and learn independent types of audio-visual associations. Interestingly, we find that our 2-head model naturally learns to distinguish the meaning of words with one head and capture the sounds objects produce with the other head. To further encourage this unsupervised discovery of concepts, we penalize the network when multiple attention heads are simultaneously active. More precisely, let $(a_b, v_b)_{1}^{B}$ be a batch of $B$ paired audio and visual signals. Our disentanglement loss for two heads is then:

$$\mathcal{L}_{Dis} = \text{Mean}\left(\left|s(a_b, v_b)[1] \circ s(a_b, v_b)[2]\right|\right) \tag{4}$$

where $\circ$ is elementwise multiplication and $|\cdot|$ is the elementwise absolute value function. The index $[k]$ mirrors PyTorch slicing notation and refers to selecting the activations of only the $k$th attention head. Intuitively, this loss encourages one head to be silent when the other head is active, and is a “cross-term” generalization of the $\ell^2$ regularizer [27] for encouraging activation shrinkage. When $K>2$ we average contributions from every combination of heads. We ablate this, and our decision to max-pool heads, in Table 3.
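A minimal sketch of Equation 4 and its averaging over head pairs when $K>2$, assuming the batched similarity volume from the earlier sketch:

```python
import torch
from itertools import combinations

def disentanglement_loss(sim: torch.Tensor) -> torch.Tensor:
    """Eq. (4): penalize heads of the positive-pair similarity volume (B, K, F, T, H, W)
    for being active at the same locations; averaged over head pairs when K > 2."""
    K = sim.shape[1]
    terms = [(sim[:, i] * sim[:, j]).abs().mean()      # Mean(|s[i] ∘ s[j]|)
             for i, j in combinations(range(K), 2)]
    return torch.stack(terms).mean()
```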

Stability Regularizers, $\mathcal{L}_{Stability}$: Finally, we add several other small regularization terms to encourage stable convergence. We detail and ablate these terms in the Supplement. Briefly, these terms include standard regularizers like Total Variation [52] smoothness over time and a non-negative pressure to encourage the network to focus on similarity instead of dissimilarity. In addition, we add a regularizer to prevent the calibration temperature, $\gamma$, from drifting too quickly, and a regularizer to discourage activations during silence and noise. In the Supplement we show that each regularizer alone does not have a dramatic effect on final metrics, but together they can stop collapses during training.
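For illustration only, the first two of these terms might be implemented roughly as follows when applied to the similarity volume; the exact formulations, their weights, and the temperature and silence regularizers are specified in the Supplement and omitted here.

```python
import torch

def example_stability_terms(sim: torch.Tensor) -> torch.Tensor:
    """Illustrative sketch only: total-variation smoothness over the audio time axis and a
    non-negative pressure, applied to the similarity volume sim of shape (B, K, F, T, H, W)."""
    tv = (sim[..., 1:, :, :] - sim[..., :-1, :, :]).abs().mean()   # smoothness over T
    nonneg = sim.clamp(max=0).abs().mean()                          # penalize negative couplings
    return tv + nonneg
```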

Combining these losses into a single loss function yields:

$$\mathcal{L} = \mathcal{L}_{A \to V} + \mathcal{L}_{V \to A} + \lambda_{Dis}\mathcal{L}_{Dis} + \mathcal{L}_{Stability} \tag{5}$$

In our experiments we use $\lambda_{Dis} = 0.05$ and refer interested readers to the Supplement for the details of our small stability regularizers, $\mathcal{L}_{Stability}$.

3.5 Training

In our experiments we train DenseAV and relevant baselines on the AudioSet [19] dataset for sound prompted segmentation and AudioSet retrieval. We train on the Places Audio [25] dataset for speech prompted segmentation, Places Audio retrieval, and the ablation studies of Table 4. In our disentanglement experiments of Table 3 and the feature visualizations of Figures 1 and 2, we train on both AudioSet and Places Audio so that DenseAV is exposed both to language, the prominent audio signal in Places Audio, and to more general sounds from AudioSet. In these experiments we sample training data from the two corpora so that each batch has an even split between AudioSet and Places Audio.

Warming up Aligners: We find that we can dramatically improve training stability by first training the added aligners (convolutions and layer norms) for 3,000 steps while keeping the pretrained DINO and HuBERT backbones fixed. This allows the aligners to adapt to the pretrained backbones before modifying each backbone’s sensitive weights. We use random resized crops, color jitter, random flips, and random greyscaling as image augmentations. We randomly sample a single video frame to feed to our visual branch. Audio clips are converted to single-channel format and are trimmed or padded with silence to create uniform 10-second clips. We re-sample audio clips according to the requirements of the backbone models used; for HuBERT, we re-sample to 16 kHz. We train on 8 V100 GPUs with an effective batch size of 80, and aggregate negative samples across all GPUs prior to computing the loss to ensure efficient parallelization. We provide additional training information and hyperparameters in the Supplement.
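The audio preprocessing described above can be sketched as follows; the helper name and the use of torchaudio are illustrative assumptions rather than our exact pipeline.

```python
import torch
import torchaudio

def load_audio_clip(path: str, target_sr: int = 16_000, clip_seconds: int = 10) -> torch.Tensor:
    """Load a clip, convert to mono, resample (16 kHz for HuBERT), and trim or pad
    with silence to a uniform 10 second length."""
    wav, sr = torchaudio.load(path)                       # (channels, samples)
    wav = wav.mean(dim=0, keepdim=True)                   # convert to single channel
    if sr != target_sr:
        wav = torchaudio.functional.resample(wav, sr, target_sr)
    n = target_sr * clip_seconds
    if wav.shape[1] < n:                                  # pad with silence on the right
        wav = torch.nn.functional.pad(wav, (0, n - wav.shape[1]))
    return wav[:, :n]                                     # trim clips that are too long
```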

Full Training: After warming up the aligners, we train the full model for an additional 800,000 steps using the same loss, batch size, and training logic. We train all aligner weights and fine-tune all HuBERT audio backbone weights. We use low-rank adaptation (LoRA) [29] with rank 8 to fine-tune the “Q”, “K”, and “V” projection layers of the DINO visual backbone’s attention blocks. This allows us to efficiently adapt DINO and stabilize training, as it is quite easy to collapse the carefully trained DINO weights.
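For reference, a minimal LoRA wrapper of the kind described here (rank 8) can be written as below; the scaling constant and initialization follow common practice and are assumptions, and libraries such as PEFT provide equivalent wrappers.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal low-rank adaptation of a frozen linear layer."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                   # freeze the pretrained weight
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))      # zero init: no change at the start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```

Such a wrapper would be applied to the query, key, and value projections of each DINO attention block.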

Table 1: Speech and sound prompted semantic segmentation.

| Method | Speech mAP | Speech mIoU | Sound mAP | Sound mIoU |
| --- | --- | --- | --- | --- |
| DAVENet [26] | 32.2% | 26.3% | 16.8% | 17.0% |
| CAVMAE [21] | 27.2% | 19.9% | 26.0% | 20.5% |
| ImageBind [20] | 20.2% | 19.7% | 18.3% | 18.1% |
| Ours | 48.7% | 36.8% | 32.7% | 24.2% |

4 Experiments

To evaluate AV representation quality, we perform a variety of analyses including comparative activation visualization, quantitative measurements of speech and sound prompted semantic segmentation, and cross-modal retrieval. Additionally, we quantify our observation that DenseAV can distinguish the meanings of words (language), from the sounds of objects (sound) without supervision.

To adequately measure a representation’s AV alignment quality, we found it necessary to introduce two evaluation datasets that measure speech and sound prompted semantic segmentation performance. Our two datasets introduce pairs of speech and sound prompts coupled with matching images and segmentation masks derived from ADE20K. We create these datasets because previous works [26] have not published their datasets or evaluation code. For our cross-modal retrieval experiments, however, we use an established experimental setting from the literature.

We compare against a variety of prior art including the popular state-of-the-art multi-modal retrieval network ImageBind [20]. We also compare against CAVMAE [21], a leading multimodal backbone trained specifically for AudioSet retrieval, and DAVENet [26], which is trained to localize the meanings of words. We include two other baselines [25, 24] that have reported cross-modal retrieval metrics on Places Audio. Finally, we compare our multi-head aggregation strategy to common “global” retrieval methods such as inner products between class tokens, average-pooled tokens, and SimPooled [48] tokens. We note that SimPool achieves state-of-the-art localization results when compared to 14 other pooling methods. Nevertheless, our multi-head aggregator yields better localization results than any of these “global” methods.

4.1 Qualitative Comparison of Feature Maps

Our first experiment, in Figure 2, highlights the dramatic differences in quality between DenseAV’s features and those of other approaches in the literature. DenseAV is the only backbone whose local tokens are semantically meaningful and show cross-modal alignment for both speech and sound. Though both CAVMAE and ImageBind show high-quality retrieval performance, neither exhibits well-aligned local tokens. As a result, DenseAV can associate and localize both sound and language significantly better than the other backbones. DAVENet shows coarse correspondences between language and visual objects, but cannot associate sound with visual objects and does not match DenseAV’s high-resolution maps. Furthermore, the right half of Figure 1 demonstrates that DenseAV naturally discovers and separates word semantics from the sounds of objects without labels to supervise this separation. In the supplement, we provide additional visualizations of all backbones considered across a wide range of words and sounds.

Table 2: Cross-modal retrieval accuracy @10 on Places Audio and AudioSet.

| Method | Places I→A | Places A→I | AudioSet I→A | AudioSet A→I |
| --- | --- | --- | --- | --- |
| [25]* | 46.3% | 54.8% | - | - |
| [24]* | 54.2% | 56.4% | - | - |
| DAVENet [26]* | 52.8% | 60.4% | - | - |
| CAVMAE [21] | 81.7% | 77.7% | 55.7% | 50.7% |
| ImageBind [20] | 1.10% | 1.10% | 64.5% | 66.5% |
| Ours | 94.2% | 94.3% | 69.8% | 68.1% |

4.2 Speech Prompted Image Segmentation

Dataset: We introduce a speech prompted segmentation dataset using the ADE20K dataset, which is known for its comprehensive ontology and pixel-precise annotations [66]. From this dataset, we curate an evaluation subset of image-class pairs by sampling up to 10 images for each object class in ADE20K, excluding images where the selected class was tiny (<5% of pixels). We only consider classes with at least 2 images that pass the tiny object criterion. For each class and image, we formed a binary target mask by selecting the semantic segmentation mask for that class. This resulted in 3030 image-object pairs spanning 478 ADE20K classes.

We created paired speech signals by speaking the prompt “A picture of a(n) [object]”, where [object] is the name of the ADE20K class. We create clear, controlled, and consistent audio prompts using Microsoft’s neural text-to-speech service [51]. This service also provides the exact timing of the “[object]” utterance within the broader prompt and ensures each class is measured equally. Grammar was manually verified for the utterances to ensure proper singular/plural and a/an agreement with the class name. We release images, masks, and audio prompts for reproducibility.

Evaluation Measure: We evaluate methods based on how well their speech-prompted activations align with ground-truth masks for the visual object’s class. We quantify this with the binary Average Precision (AP) and Intersection over Union (IoU) metrics, which measure how closely activations match the binary label masks from the ADE20K dataset. To compute an aggregate score over all of the object classes considered, we compute the mean average precision (mAP) and mean intersection over union (mIoU) by averaging the per-class AP and IoU scores across all object categories.

The mAP is particularly well suited for evaluating feature similarities because it is unaffected by monotonic transformations of the similarity scores. This eliminates the need for arbitrary thresholding and calibration. This is particularly important because many networks’ inner products are not centered at zero, and the best thresholding strategy can be nontrivial and dependent on the network and object class. Average Precision avoids these confounding factors and ensures a fair comparison across methods. Unlike the mAP, however, the mIoU metric requires selecting a threshold. To make our mIoU measurement similarly invariant to monotonic transformations, we evaluate 20 uniformly spaced thresholds between the smallest and largest activations of each model. For each baseline, we report results for the best threshold to ensure a fair comparison between all networks considered.
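The per-class metrics can be computed with a sketch like the following; for brevity the threshold sweep is shown within a single class, whereas we report a single best threshold per model across all classes.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def class_ap_and_iou(scores, masks, n_thresholds=20):
    """Per-class evaluation: `scores` and `masks` are lists of per-pixel similarity maps
    and binary ground-truth masks of matching shape. AP needs no threshold; IoU is swept
    over 20 thresholds between the model's smallest and largest activations."""
    s = np.concatenate([x.ravel() for x in scores])
    m = np.concatenate([x.ravel() for x in masks]).astype(bool)
    ap = average_precision_score(m, s)
    ious = []
    for t in np.linspace(s.min(), s.max(), n_thresholds):
        pred = s >= t
        union = (pred | m).sum()
        ious.append((pred & m).sum() / union if union > 0 else 0.0)
    return ap, ious   # mAP averages AP over classes; mIoU uses the best threshold
```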

Implementation: We compute image heatmaps by evaluating each modality-specific network on the image-audio pairs from our dataset. We extract dense features from the final layer of each network and form their similarity volume according to Equation 1. For DenseAV, we max-pool the head dimension to properly compare with single-headed models. We average activations over the temporal extent of the “[object]” utterance using the word-timing information from the ground-truth audio clip. This creates a heatmap over the image features that can be bilinearly resized to the original image’s size. We then compare these per-pixel activation scores to the ground-truth object masks from our dataset.
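Concretely, a sketch of this heatmap computation for a single image-audio pair (shapes follow Section 3.1; the function name is illustrative):

```python
import torch
import torch.nn.functional as F

def speech_prompted_heatmap(a_feats, v_feats, t_start, t_end, image_hw):
    """Activation map for one audio-image pair. a_feats: (C, K, F, T) audio features,
    v_feats: (C, K, H, W) visual features, [t_start, t_end): feature frames covering the
    '[object]' utterance, image_hw: (height, width) of the original image."""
    sim = torch.einsum("ckft,ckhw->kfthw", a_feats, v_feats)     # Eq. (1) for one pair
    sim = sim.amax(dim=0)                                        # max-pool the head dimension
    heat = sim[:, t_start:t_end].mean(dim=(0, 1))                # average over the word's extent -> (H, W)
    heat = F.interpolate(heat[None, None], size=image_hw,
                         mode="bilinear", align_corners=False)[0, 0]
    return heat                                                  # per-pixel activation scores
```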

Results: The Speech mAP and mIoU columns of Table 1 show that DenseAV achieves a 51% (+16.5 mAP) relative improvement in speech-prompted semantic segmentation over previous methods. Approaches that use global-token-based contrastive strategies, such as CAVMAE and ImageBind, perform particularly poorly on this task, an observation that aligns with the qualitative results of Figure 2.

Table 3: Disentanglement ablation.

| Method | Pred. Dis. | Act. Dis. |
| --- | --- | --- |
| No $\mathcal{L}_{Dis}$, No Head Max Pool | 64.1% | 70.3% |
| No $\mathcal{L}_{Dis}$ | 99.9% | 86.5% |
| Ours | 99.9% | 91.2% |

4.3 Sound Prompted Image Segmentation

Dataset: To evaluate how well deep features localize sound, we build on Section 4.2 and create a dataset of sound prompts that align with ADE20K classes. We first select the same (large) image-object pairs from ADE20K. We then create a mapping between the ADE20K and VGGSound [7] ontologies. To compute a robust mapping, we first embed ADE20K class names and VGGSound class names with the GPT Ada 2 text embedding model [41]. For each ADE20K class, we create a list of at most three candidates from the VGGSound ontology that have a cosine similarity above 0.85. We then manually review these candidates to select the best VGGSound class for each ADE20K class and remove any spurious or mistaken matches. This produces a set of 95 ADE20K classes with strong matches in the VGGSound ontology. For each of our original 3030 image-object pairs, we select a random VGGSound validation clip with a matching class according to our mapped ontology. This yields 106 image-object pairs across 20 ADE20K classes.

Evaluation Measure: We use the same mAP and mIoU evaluation metrics as Section 4.2, but instead average over the 20 ADE20K classes considered.

Implementation: We compute sound prompted image activations as in Section 4.2, with one key change: we average activations over the entire clip because we do not have ground-truth sound timing information.

Results: The Sound mAP and mIoU columns of Table 1 show that DenseAV achieves a 25% (+6.4 mAP) relative improvement in sound prompted segmentation compared to the prior art. Most notably, ImageBind’s features cannot localize sound despite their high cross-modal retrieval performance learned from millions of hours of sound.

Table 4: Ablation of the similarity aggregation function on Places Audio.

| Method | Speech mAP | Places V→A @10 | Places A→V @10 |
| --- | --- | --- | --- |
| Average Pool | 20.1% | 92.0% | 91.2% |
| CLS Token | 20.6% | 86.4% | 89.8% |
| SimPool [48] | 35.3% | 92.6% | 92.8% |
| Multi-Head (Ours) | 48.2% | 93.5% | 93.8% |

4.4 Cross-Modal Retrieval

We show that DenseAV’s representations are not only better for localization but also significantly outperform other approaches on cross-modal retrieval. We adopt the evaluation setting of [26] and measure cross-modal retrieval accuracy at 1, 5, and 10 in a thousand-way retrieval task. In particular, we use the same thousand images from the validation set of [26] and also replicate this analysis on one thousand random clips from the AudioSet validation data. Table 2 shows results for the 1000-way retrieval tasks on both the Places Audio and AudioSet datasets. We show cross-modal accuracy at 10, and provide larger tables in the supplement that echo these results at accuracy at 1 and 5. DenseAV significantly outperforms all baselines across all metrics. Interestingly, DenseAV outperforms ImageBind with less than half of its trainable parameters and no reliance on text.
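Given a matrix of aggregated similarities between the evaluation audio clips and images, accuracy at 10 can be computed with a sketch like the following (variable names are illustrative):

```python
import torch

def retrieval_accuracy_at_k(S: torch.Tensor, k: int = 10):
    """S[i, j] holds the aggregated similarity between audio clip i and image j over the
    evaluation set; the true match for clip i is image i."""
    n = S.shape[0]
    truth = torch.arange(n)
    top_images = S.topk(k, dim=1).indices                        # rank images for each audio clip (A -> I)
    a_to_i = (top_images == truth[:, None]).any(dim=1).float().mean()
    top_audio = S.topk(k, dim=0).indices.T                       # rank audio clips for each image (I -> A)
    i_to_a = (top_audio == truth[:, None]).any(dim=1).float().mean()
    return i_to_a.item(), a_to_i.item()
```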

4.5 Measuring Disentanglement

We observe that DenseAV’s heads naturally learn to differentiate audio-visual couplings that capture the meaning of words (language) from those that capture the sounds of objects (sound). Furthermore, this effect generalizes to novel clips, including those containing both sound and language, as shown in Figure 1. We quantify this observation in two ways: the first measures whether a head’s average activation strength predicts whether a clip contains mainly “language” or “sound”; the second quantifies how often the “sound” head is incorrectly active when the “language” head should be active, and vice versa. We leverage the fact that the AudioSet dataset contains mostly clips with ambient sound and rarely contains language. In contrast, Places Audio is entirely language-based without external ambient sound. We note that these analyses are specific to our architecture with two heads ($K=2$) trained on both AudioSet and Places Audio.

For both measures of disentanglement, we first compute a clip’s aggregated similarity for each head. In particular, we remove the max-pooling over heads in Equation 2 to create a single-head similarity, $\mathcal{S}(a,v)_k$. We then min-max scale the scores of each head across both datasets to lie in the $[0,1]$ interval, which we refer to as $\hat{\mathcal{S}}(a,v)_k$. Using these normalized scores, we can create metrics that capture how well a given head responds only to a specific dataset.

Our first metric measures how well a head’s scores predict whether a clip is from the “sound” or “language” dataset. Let $(a_b, v_b)_{1}^{B}$ be tuples of paired audio and visual signals, and let $l[k']_b$ be an indicator variable of whether the signal $(a_b, v_b)$ arises from the sound dataset, AudioSet, $(k'=1)$, or the language dataset, Places Audio, $(k'=2)$. We then define:

$$\delta_{pred}(k, k') = \text{AP}\left(\left(\hat{\mathcal{S}}(a_b, v_b)_k\right)_{1}^{B},\ \left(l[k']_b\right)_{1}^{B}\right) \tag{6}$$

where $\text{AP}(\cdot,\cdot)$ is the binary average precision with prediction and label arguments, respectively. Intuitively, this measures whether the scores of head $k$ directly predict whether the data is from dataset $k'$. We can find the best assignment between heads and datasets such that each head is maximally predictive of its assigned dataset:

$$\text{PredDis} = \frac{1}{2}\max\Big(\delta_{pred}(1,1) + \delta_{pred}(2,2),\ \ \delta_{pred}(1,2) + \delta_{pred}(2,1)\Big) \tag{7}$$

The prediction disentanglement score, PredDis, is a percentage that ranges from 50% for completely entangled signals to 100% when one can perfectly classify the signals using the scores of either head. The maximum over the two possible assignments makes this metric invariant to permutations of the heads. We note that this metric is a Hungarian matching assignment [32] over two entries, a common technique to assess unsupervised classification performance [31, 23].

Our second measure quantifies “spurious activations” in the non-dominant head. A truly disentangled system should have one head that only fires on sound and another head that only fires on language. We create another disentanglement measure, ActDis, by replacing $\delta_{pred}$ in Equation 7 with:

$$\delta_{act}(k, k') = 1 - \frac{1}{\sum_{b'} l[k']_{b'}}\sum_{b=1}^{B}\hat{\mathcal{S}}(a_b, v_b)_k \cdot l[k']_b \tag{8}$$

Intuitively, this measures the “inactivity” of head $k$ on dataset $k'$. If head $k$ is totally silent on dataset $k'$, then $\delta_{act}(k, k') = 1$. Like PredDis, ActDis is a percentage ranging from 50% to 100%, with 100% representing perfect disentanglement where the sound head is completely silent during the language clips, and vice versa.
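Both metrics can be computed with a short sketch, assuming the min-max normalized per-head scores have already been collected for every clip:

```python
import numpy as np
from sklearn.metrics import average_precision_score

def disentanglement_scores(S_hat: np.ndarray, labels: np.ndarray):
    """S_hat: (N, 2) min-max normalized per-head clip scores; labels: (N,) with 1 for the
    sound dataset (AudioSet) and 2 for the language dataset (Places Audio)."""
    def d_pred(k, kp):                       # Eq. (6)
        return average_precision_score(labels == kp, S_hat[:, k - 1])

    def d_act(k, kp):                        # Eq. (8): one minus mean activation of head k on dataset kp
        return 1.0 - S_hat[labels == kp, k - 1].mean()

    pred_dis = 0.5 * max(d_pred(1, 1) + d_pred(2, 2), d_pred(1, 2) + d_pred(2, 1))   # Eq. (7)
    act_dis = 0.5 * max(d_act(1, 1) + d_act(2, 2), d_act(1, 2) + d_act(2, 1))
    return pred_dis, act_dis
```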

Table 3 shows that DenseAV achieves near-perfect predictive (99%) and activation (91%) disentanglement. It also shows that our disentanglement regularizer and max-pooling over heads improve DenseAV’s natural ability to distinguish sound from language without supervision.

5 Conclusion

We presented DenseAV, a novel contrastive learning architecture that can discover the meaning of words and localize the sounds of objects using only video supervision. We are the first to observe both qualitatively and quantitatively that it’s possible to disentangle the meaning of words from the sound of objects with only a contrastive learning signal. DenseAV’s success stems from its novel multi-head attention aggregation mechanism that encourages its modality-specific backbones to create high-resolution, semantically meaningful, and AV aligned representations. These properties of DenseAV’s representation are not seen in other state-of-the-art models in the literature. Consequently, DenseAV significantly surpasses other leading models in dense prediction tasks such as speech and sound-prompted semantic segmentation as well as in cross-modal retrieval.

Acknowledgements

We would like to thank the Microsoft Research Grand Central Resources team for their gracious help performing the experiments in this work. Special thanks to Oleg Losinets and Lifeng Li for their consistent, gracious, and timely help, debugging, and expertise. Without them, none of the experiments could have been run.

We would also like to thank David Harwath, Andrew Rouditchenko, Yuan Gong, Didac Suris, Adria Recasens Continente, and Jim Glass for their help in running DAVENet and CAVMAE baselines and evaluations as well as for many helpful tips on audio visual contrastive learning.

This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. 2021323067. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. This work is supported by the National Science Foundation under Cooperative Agreement PHY-2019786 (The NSF AI Institute for Artificial Intelligence and Fundamental Interactions, http://iaifi.org/). This work is funded by a Royal Society Research Professorship RSRP\R\241003 and EPSRC Programme Grant VisualAI EP/T028572/1.

References

  • Afouras etal. [2020]Triantafyllos Afouras, Andrew Owens, JoonSon Chung, and Andrew Zisserman.Self-supervised learning of audio-visual objects from video, 2020.
  • Alwassel etal. [2020]Humam Alwassel, Dhruv Mahajan, Bruno Korbar, Lorenzo Torresani, Bernard Ghanem, and Du Tran.Self-supervised learning by cross-modal audio-video clustering.Advances in Neural Information Processing Systems, 33:9758–9770, 2020.
  • Arandjelovic and Zisserman [2018]Relja Arandjelovic and Andrew Zisserman.Objects that sound.In Proceedings of the European conference on computer vision (ECCV), pages 435–451, 2018.
  • Ba etal. [2016]JimmyLei Ba, JamieRyan Kiros, and GeoffreyE Hinton.Layer normalization.arXiv preprint arXiv:1607.06450, 2016.
  • Caron etal. [2018]Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze.Deep clustering for unsupervised learning of visual features.In Proceedings of the European conference on computer vision (ECCV), pages 132–149, 2018.
  • Caron etal. [2021]Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin.Emerging properties in self-supervised vision transformers.In Proceedings of the International Conference on Computer Vision (ICCV), 2021.
  • Chen etal. [2020]Honglie Chen, Weidi Xie, Andrea Vedaldi, and Andrew Zisserman.Vggsound: A large-scale audio-visual dataset.In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 721–725. IEEE, 2020.
  • Chen etal. [2021]Honglie Chen, Weidi Xie, Triantafyllos Afouras, Arsha Nagrani, Andrea Vedaldi, and Andrew Zisserman.Localizing visual sounds the hard way.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16867–16876, 2021.
  • Cho etal. [2021]JangHyun Cho, Utkarsh Mall, Kavita Bala, and Bharath Hariharan.Picie: Unsupervised semantic segmentation using invariance and equivariance in clustering.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16794–16804, 2021.
  • Choi etal. [2023]Jeongsoo Choi, Joanna Hong, and YongMan Ro.Diffv2s: Diffusion-based video-to-speech synthesis with vision-guided speaker embedding.In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7812–7821, 2023.
  • Chomsky [1987]Noam Chomsky.Language and problems of knowledge: The Managua lectures.MIT press, 1987.
  • Chopra etal. [2005]Sumit Chopra, Raia Hadsell, and Yann LeCun.Learning a similarity metric discriminatively, with application to face verification.In 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR’05), pages 539–546. IEEE, 2005.
  • Darcet etal. [2023]Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski.Vision transformers need registers.arXiv preprint arXiv:2309.16588, 2023.
  • Dosovitskiy etal. [2020]Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, etal.An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020.
  • Fisher and Darrell [2002]JohnW Fisher and Trevor Darrell.Probabalistic models and informative subspaces for audiovisual correspondence.In Computer Vision—ECCV 2002: 7th European Conference on Computer Vision Copenhagen, Denmark, May 28–31, 2002 Proceedings, Part III 7, pages 592–603. Springer, 2002.
  • FisherIII etal. [2000]JohnW FisherIII, Trevor Darrell, William Freeman, and Paul Viola.Learning joint statistical models for audio-visual fusion and segregation.Advances in neural information processing systems, 13, 2000.
  • Fong and Vedaldi [2017]RuthC Fong and Andrea Vedaldi.Interpretable explanations of black boxes by meaningful perturbation.In Proceedings of the IEEE international conference on computer vision, pages 3429–3437, 2017.
  • Frosst etal. [2019]Nicholas Frosst, Nicolas Papernot, and Geoffrey Hinton.Analyzing and improving representations with the soft nearest neighbor loss.In International conference on machine learning, pages 2012–2020. PMLR, 2019.
  • Gemmeke etal. [2017]JortF Gemmeke, DanielPW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, RChanning Moore, Manoj Plakal, and Marvin Ritter.Audio set: An ontology and human-labeled dataset for audio events.In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 776–780. IEEE, 2017.
  • Girdhar etal. [2023]Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, KalyanVasudev Alwala, Armand Joulin, and Ishan Misra.Imagebind: One embedding space to bind them all.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15180–15190, 2023.
  • Gong etal. [2022]Yuan Gong, Andrew Rouditchenko, AlexanderH Liu, David Harwath, Leonid Karlinsky, Hilde Kuehne, and JamesR Glass.Contrastive audio-visual masked autoencoder.In The Eleventh International Conference on Learning Representations, 2022.
  • Guzhov etal. [2022]Andrey Guzhov, Federico Raue, Jörn Hees, and Andreas Dengel.Audioclip: Extending clip to image, text and audio.In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 976–980. IEEE, 2022.
  • Hamilton etal. [2022]Mark Hamilton, Zhoutong Zhang, Bharath Hariharan, Noah Snavely, and WilliamT Freeman.Unsupervised semantic segmentation by distilling feature correspondences.arXiv preprint arXiv:2203.08414, 2022.
  • Harwath and Glass [2017]David Harwath and JamesR Glass.Learning word-like units from joint audio-visual analysis.arXiv preprint arXiv:1701.07481, 2017.
  • Harwath etal. [2016]David Harwath, Antonio Torralba, and James Glass.Unsupervised learning of spoken language with visual context.Advances in Neural Information Processing Systems, 29, 2016.
  • Harwath etal. [2018]David Harwath, Adria Recasens, Dídac Surís, Galen Chuang, Antonio Torralba, and James Glass.Jointly discovering visual objects and spoken words from raw sensory input.In Proceedings of the European conference on computer vision (ECCV), pages 649–665, 2018.
  • Hoerl and Kennard [1970]Arthur E. Hoerl and Robert W. Kennard.Ridge regression: Biased estimation for nonorthogonal problems.Technometrics, 12(1):55–67, 1970.
  • Hsu et al. [2021] Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3451–3460, 2021.
  • Hu et al. [2021] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
  • Jaiswal et al. [2020] Ashish Jaiswal, Ashwin Ramesh Babu, Mohammad Zaki Zadeh, Debapriya Banerjee, and Fillia Makedon. A survey on contrastive self-supervised learning. Technologies, 9(1):2, 2020.
  • Ji et al. [2019] Xu Ji, João F. Henriques, and Andrea Vedaldi. Invariant information clustering for unsupervised image classification and segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 9865–9874, 2019.
  • Kuhn [1955] Harold W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83–97, 1955.
  • Kumar et al. [2020] Neeraj Kumar, Srishti Goel, Ankur Narang, and Mujtaba Hasan. Robust one shot audio to video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 770–771, 2020.
  • Liu et al. [2021] Shiguang Liu, Sijia Li, and Haonan Cheng. Towards an end-to-end visual-to-raw-audio generation with GAN. IEEE Transactions on Circuits and Systems for Video Technology, 32(3):1299–1312, 2021.
  • Ma et al. [2020] Shuang Ma, Zhaoyang Zeng, Daniel McDuff, and Yale Song. Active contrastive learning of audio-visual video representations. arXiv preprint arXiv:2009.09805, 2020.
  • Mao et al. [2023] Yuxin Mao, Jing Zhang, Mochu Xiang, Yunqiu Lv, Yiran Zhong, and Yuchao Dai. Contrastive conditional latent diffusion for audio-visual segmentation. arXiv preprint arXiv:2307.16579, 2023.
  • Miller [1995] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41, 1995.
  • Mo and Morgado [2022] Shentong Mo and Pedro Morgado. A closer look at weakly-supervised audio-visual source localization. Advances in Neural Information Processing Systems, 35:37524–37536, 2022.
  • Monfort et al. [2021] Mathew Monfort, SouYoung Jin, Alexander Liu, David Harwath, Rogerio Feris, James Glass, and Aude Oliva. Spoken moments: Learning joint audio-visual representations from video descriptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14871–14881, 2021.
  • Oord et al. [2018] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  • OpenAI [2023] OpenAI. GPT-4 technical report, 2023.
  • Oquab et al. [2015] Maxime Oquab, Léon Bottou, Ivan Laptev, and Josef Sivic. Is object localization for free? Weakly-supervised learning with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 685–694, 2015.
  • Oquab et al. [2023] Maxime Oquab, Timothée Darcet, Theo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Russell Howes, Po-Yao Huang, Hu Xu, Vasu Sharma, Shang-Wen Li, Wojciech Galuba, Mike Rabbat, Mido Assran, Nicolas Ballas, Gabriel Synnaeve, Ishan Misra, Herve Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, and Piotr Bojanowski. DINOv2: Learning robust visual features without supervision, 2023.
  • Panayotov et al. [2015] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. LibriSpeech: An ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210. IEEE, 2015.
  • Park and Glass [2007] Alex S. Park and James R. Glass. Unsupervised pattern discovery in speech. IEEE Transactions on Audio, Speech, and Language Processing, 16(1):186–197, 2007.
  • Peng and Harwath [2022a] Puyuan Peng and David Harwath. Self-supervised representation learning for speech using visual grounding and masked language modeling. arXiv preprint arXiv:2202.03543, 2022a.
  • Peng and Harwath [2022b] Puyuan Peng and David Harwath. Word discovery in visually grounded, self-supervised speech models. arXiv preprint arXiv:2203.15081, 2022b.
  • Psomas et al. [2023] Bill Psomas, Ioannis Kakogeorgiou, Konstantinos Karantzalos, and Yannis Avrithis. Keep it SimPool: Who said supervised transformers suffer from attention deficit? In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 5350–5360, 2023.
  • Pullum and Scholz [2002] Geoffrey K. Pullum and Barbara C. Scholz. Empirical assessment of stimulus poverty arguments. The Linguistic Review, 19(1-2):9–50, 2002.
  • Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
  • Ren et al. [2020] Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. FastSpeech 2: Fast and high-quality end-to-end text to speech. arXiv preprint arXiv:2006.04558, 2020.
  • Rudin et al. [1992] Leonid I. Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1-4):259–268, 1992.
  • Senocak et al. [2018] Arda Senocak, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, and In So Kweon. Learning to localize sound source in visual scenes, 2018.
  • Shih et al. [2023] Yi-Jen Shih, Hsuan-Fu Wang, Heng-Jui Chang, Layne Berry, Hung-yi Lee, and David Harwath. SpeechCLIP: Integrating speech with pre-trained vision and language model. In 2022 IEEE Spoken Language Technology Workshop (SLT), pages 715–722. IEEE, 2023.
  • Smith and Yu [2008] Linda Smith and Chen Yu. Infants rapidly learn word-referent mappings via cross-situational statistics. Cognition, 106(3):1558–1568, 2008.
  • Snyder et al. [2015] David Snyder, Guoguo Chen, and Daniel Povey. MUSAN: A music, speech, and noise corpus, 2015.
  • Sun et al. [2023] Weixuan Sun, Jiayi Zhang, Jianyuan Wang, Zheyuan Liu, Yiran Zhong, Tianpeng Feng, Yandong Guo, Yanhao Zhang, and Nick Barnes. Learning audio-visual source localization via false negative aware contrastive learning, 2023.
  • Tian et al. [2020] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI, pages 776–794. Springer, 2020.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
  • Wang and Oord [2021] Luyu Wang and Aaron van den Oord. Multi-format contrastive learning of audio representations. arXiv preprint arXiv:2103.06508, 2021.
  • Wang and Isola [2020] Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In International Conference on Machine Learning, pages 9929–9939. PMLR, 2020.
  • Wu and Palmer [1994] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. arXiv preprint cmp-lg/9406033, 1994.
  • Yosinski et al. [2014] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? Advances in Neural Information Processing Systems, 27, 2014.
  • Zhao et al. [2018] Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh McDermott, and Antonio Torralba. The sound of pixels. In Proceedings of the European Conference on Computer Vision (ECCV), pages 570–586, 2018.
  • Zhou et al. [2016] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2921–2929, 2016.
  • Zhou et al. [2017] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ADE20K dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 633–641, 2017.
  • Zhu et al. [2021] Hao Zhu, Man-Di Luo, Rui Wang, Ai-Hua Zheng, and Ran He. Deep audio-visual learning: A survey. International Journal of Automation and Computing, 18:351–376, 2021.


Supplementary Material

6 Full Cross Modal Retrieval Results

                       Places Audio Retrieval                         AudioSet Retrieval
                       I → A                 A → I                    I → A                 A → I
Method                 @1     @5     @10     @1     @5     @10        @1     @5     @10     @1     @5     @10
[25]                   12.1%  33.5%  46.3%   14.8%  40.3%  54.8%      -      -      -       -      -      -
[24]                   13.0%  37.8%  54.2%   16.1%  40.4%  56.4%      -      -      -       -      -      -
DAVENet [26]           12.7%  37.5%  52.8%   20.0%  46.9%  60.4%      -      -      -       -      -      -
DAVENet* [26]          13.3%  38.3%  51.2%   20.5%  45.3%  57.2%      0.10%  0.70%  1.30%   0.10%  0.30%  1.20%
CAVMAE* [21]           36.7%  70.3%  81.7%   33.9%  65.7%  77.7%      22.8%  44.9%  55.7%   21.1%  41.7%  50.7%
ImageBind [20]         0.10%  0.50%  1.10%   0.10%  0.40%  1.10%      29.6%  55.4%  64.5%   31.8%  57.3%  66.5%
Ours                   65.3%  90.0%  94.2%   64.4%  89.4%  94.3%      35.1%  58.0%  68.2%   33.6%  59.3%  68.4%

7 VGGSound Source Evaluation

Table 6 adds evaluations on the VGGSound Source (VGGSS) dataset. We note that the large bounding boxes of the VGGSS annotations do not reward high-resolution results. Nevertheless, DenseAV outperforms all methods, including five additional baselines (Attention10K [53], AVObject [1], LVS [8], FNAC AVL [57], and SLAVC [38]).

Method          cIoU     AUC
DAVENet         6.8%     21.2%
CAVMAE          7.9%     25.0%
ImageBind       3.4%     20.5%
Attention10K    18.5%    30.2%
AVObject        29.7%    35.7%
LVS             34.4%    38.2%
SLAVC           38.8%    38.8%
FNAC AVL        39.4%    39.4%
Ours            40.6%    40.6%

8 Speech Prompted Semantic Segmentation Noise Robustness

DenseAV was trained with natural speech and sounds and is robust to environmental noise and common speech errors like stutters. We report additional noise-robustness experiments in Table 7.

Method       mAP      mIoU
DAVENet      31.8%    26.1%
CAVMAE       27.2%    23.8%
ImageBind    20.2%    19.7%
Ours         48.1%    36.6%

9 Speech Prompted Semantic Segmentation Examples

[Figure: qualitative examples of speech prompted semantic segmentation.]

10 Sound Prompted Semantic Segmentation Examples

[Figure: qualitative examples of sound prompted semantic segmentation.]

11 Comparison Across Backbones

[Figure: comparison of results across different backbones.]

12 Associating Spoken Words to Visual Objects

Visual Object      Top 5 Retrieved Words
ottoman            sofa, chair, chair, seat, living
ruins              brick, stone, castle, clay, stone
dirt track         dirt, dirt, trail, field, dirt
monitor            screen, screen, computer, television, screen
control panel      cockpit, airplane, cockpit, airplane, airplane
bar                desk, picture, counter, poker, kitchen
waterfall          waterfall, fountain, water, waterfall, waterfall
embankment         trench, land, field, land, hill
bleachers          amphitheater, steps, colosseum, step, stairway
snow               snow, snow, snow, mountain, snow
[Figure: visualizations of spoken words associated with visual objects.]

13 Failure Cases

[Figure: failure cases.]

14 Comparing to DINO CLS Token Activations

[Figure: comparison to DINO CLS token activations.]

15 Visualizing Activations when an Object is not Present

[Figure: activations when the prompted object is not present.]

16 Additional Regularizer Details

Negative Audio Splicing

Though using Eq. 3 is enough to make a reasonable cross-modal retrieval system, the extreme flexibility of the self-attention operator in modern transformers can lead to degenerate solutions. For example, we found that without regularizers that encourage local features to be meaningful, the network could develop its own “global” tokens by selecting a handful of local tokens to carry all of the information. This is similar to the observation of [13], and we observed it occasionally in our audio branch, which would collapse to use only the first tokens. To keep the network from collapsing the semantics of the audio clip into a single token, we introduce small negative sample clips into our audio samples. These small negative audio regions are randomly spliced into the larger audio clip, and we encourage the network to set the couplings in these regions to zero with an $\ell_2$ regularizer. We include further details of DenseAV’s architecture, hyperparameters, and regularizers below.

More formally, let $(a_b, v_b)_1^B$ be a batch of $B$ paired audio and visual signals as before. Let $m_b \in [0,1]^T$ be a soft mask that measures whether a given location in the audio signal is part of a spliced negative clip. For example, $m_b[t] = 1$ when the clip at time $t$ is part of the negative clip, $m_b[t] = 0$ in the positive part of the clip, and $0 < m_b[t] < 1$ in the small boundary regions where the true clip is spliced into the negative clip and both sounds are present. Our negative audio splicing regularizer squares each entry of the similarity tensor and averages these entries, weighted by the strength of the negative clip indicator $m_b$:

\mathcal{L}_{Splice} = \text{WeightedMean}\left(s(a_b, v_b)^2,\ m_b\right) \qquad (9)

where the mean assumes that the weighting strength $m_b$ has been broadcast to the shape of $s(a_b, v_b)^2$. We give explicit formulations of these regularizers here because they are too verbose for the double-column format of the main paper. Intuitively, this term penalizes the network for having activations during a period of spliced negative audio. We also apply this regularizer to any padded silence at the ends of short audio clips.
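
As a concrete illustration, below is a minimal PyTorch sketch of this weighted mean. The tensor layout (a per-pair similarity volume of shape (B, K, F, T, H, W) and a per-frame mask of shape (B, T)) and the helper name splice_loss are assumptions for illustration, not the released implementation.

```python
import torch

def splice_loss(sim: torch.Tensor, neg_mask: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Weighted mean of squared similarities over spliced-negative audio regions.

    sim:      (B, K, F, T, H, W) similarity volume s(a_b, v_b) for paired clips.
    neg_mask: (B, T) soft indicator m_b; 1 where the audio is a spliced negative.
    Shapes are illustrative assumptions; only the weighting logic matters.
    """
    # Broadcast the audio-time mask to the full shape of the similarity volume.
    w = neg_mask[:, None, None, :, None, None]  # (B, 1, 1, T, 1, 1)
    weighted_sq = (sim ** 2) * w
    return weighted_sq.sum() / (w.expand_as(sim).sum() + eps)
```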

Calibration Regularization

The calibration temperature provides the network with the crucial ability to increase or decrease its certainty by updating a single parameter. However, the network can also achieve this effect by increasing or decreasing the magnitudes of its features. We found that the temperature would sometimes accelerate downward, forcing the feature magnitudes to increase to compensate. As a result, the network would eventually saturate or become unstable. We hypothesize that this is due to optimizer momentum, and we prevent this “runaway calibration” by adding a small regularizer to the temperature parameter $\gamma$:

\mathcal{L}_{Cal} = \max\left(\log(1) - \log(\gamma),\ 0\right)^2 \qquad (10)

This term penalizes the calibration temperature when it drops below 1.0 and encourages it to stay at or above 1.0.
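
A one-line PyTorch sketch of this penalty follows; since log(1) = 0, the term reduces to max(-log(gamma), 0)^2. The argument name gamma is illustrative.

```python
import torch

def calibration_loss(gamma: torch.Tensor) -> torch.Tensor:
    """Penalize the calibration temperature when it drops below 1.0.
    Because log(1) = 0, the penalty is simply max(-log(gamma), 0)^2."""
    return torch.clamp(-torch.log(gamma), min=0.0) ** 2
```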

Nonnegative Pressure

The InfoNCE loss function is invariant to the addition of a scalar to every inner product. Thus, the network can choose either to find evidence of “positive” couplings connecting similar objects or “negative” couplings connecting regions that definitely do not belong together. We found that encouraging the network to look for “positive” evidence, as opposed to counterfactual evidence, improved training stability and performance across the key metrics we investigate. To encourage this behavior, we add a small regularizer that pushes inner products between features to be $\geq 0$. More specifically, let $\Omega$ be a set of 250 randomly selected coordinates $(b, b', k, f, t, h, w)$. We then form our non-negativity regularizer:

\mathcal{L}_{NonNeg} = \frac{1}{|\Omega|} \sum_{\Omega} \min\left(s(a_b, v_{b'})[k, f, t, h, w],\ 0\right)^2 \qquad (11)

This regularizer penalizes the similarity tensor when it drops below zero, encouraging features to exhibit positive couplings. We note that other works [23] have also observed the benefits of using only non-negative feature couplings.
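
A minimal sketch of this sampled penalty, assuming the full pairwise similarity tensor is materialized with shape (B, B', K, F, T, H, W); the 250 samples follow the text, while the flattened-index sampling is an illustrative simplification.

```python
import torch

def nonneg_loss(sim_all_pairs: torch.Tensor, n_samples: int = 250) -> torch.Tensor:
    """Penalize negative inner products at randomly sampled coordinates.

    sim_all_pairs: (B, B', K, F, T, H, W) similarities between every audio clip
                   and every image in the batch (layout is an assumption).
    """
    flat = sim_all_pairs.reshape(-1)
    idx = torch.randint(0, flat.numel(), (n_samples,), device=flat.device)
    sampled = flat[idx]
    # min(x, 0)^2: zero for non-negative similarities, quadratic below zero.
    return torch.clamp(sampled, max=0.0).pow(2).mean()
```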

Disentanglement Regularization

DenseAV’s multi-head similarity aggregation allows the network to use its different heads to model different, independent ways that the audio and video modalities can couple together. Interestingly, we find that if we give DenseAV two heads, one naturally specializes to language and the other to more generic sounds. In particular, one head rediscovers the meaning of words by “grounding” them to visual objects, and the other localizes which objects created a given sound. To purify this disentanglement of concepts without supervision, we encourage the different attention heads of our algorithm to specialize. More specifically, we penalize the network when multiple attention heads are simultaneously active. In our experiments we use two attention heads. As before, let $(a_b, v_b)_1^B$ be a batch of $B$ paired audio and visual signals. Our disentanglement loss for two heads is then:

\mathcal{L}_{Dis} = \text{Mean}\left(\left|s(a_b, v_b)[1] \circ s(a_b, v_b)[2]\right|\right) \qquad (12)

where $\circ$ represents elementwise multiplication and $|\cdot|$ is the elementwise absolute value function. The notation $[k]$ mirrors PyTorch slicing and refers to selecting the activations of only the $k$-th attention head. Intuitively, this loss encourages one head to be silent when the other head is active, and it can be viewed as a “cross-term” generalization of the $\ell^2$ regularizer [27] for encouraging activation shrinkage.
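
For a two-head model this loss is a single elementwise product; a sketch follows, assuming the head axis is the second dimension of the similarity volume.

```python
import torch

def disentanglement_loss(sim: torch.Tensor) -> torch.Tensor:
    """Discourage the two similarity heads from firing simultaneously.

    sim: (B, K=2, F, T, H, W) similarity volume for paired clips,
         with the head axis in position 1 (an illustrative layout).
    """
    head_1, head_2 = sim[:, 0], sim[:, 1]
    return (head_1 * head_2).abs().mean()
```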

Total Variation Smoothness

To improve the quality and temporal consistency of discovered audio-visual couplings, we impose a smoothness regularizer, $\mathcal{L}_{TV}$, in the audio-time dimension.

\mathcal{L}_{TV} = \text{Mean}\left(\left(\text{act}(1{:}t{-}1) - \text{act}(2{:}t)\right)^2\right) \qquad (13)

where the activations for a given time slice $[1, t-1]$ are given by:

\text{act}(1{:}t{-}1) = \left(s(a_b, v_b)[:, :, t', :, :]\right)_{t'=1}^{t-1} \qquad (14)

Informally, this regularizer penalizes inner-product strengths that change quickly over time.
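
A sketch of this temporal smoothness term, assuming the audio-time axis sits at position 3 of the similarity volume:

```python
import torch

def tv_loss(sim: torch.Tensor) -> torch.Tensor:
    """Penalize rapid changes of the similarity volume along audio time.

    sim: (B, K, F, T, H, W); axis 3 is audio time (layout is an assumption).
    """
    # Difference of consecutive time slices: act(2:t) - act(1:t-1).
    diffs = sim[:, :, :, 1:] - sim[:, :, :, :-1]
    return diffs.pow(2).mean()
```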

Full Stability Regularizer

Putting these terms together into a single equation we have:

\mathcal{L}_{Stability} = \lambda_{Splice}\mathcal{L}_{Splice} + \lambda_{Cal}\mathcal{L}_{Cal} + \lambda_{NonNeg}\mathcal{L}_{NonNeg} + \lambda_{TV}\mathcal{L}_{TV} \qquad (15)

where $\lambda_{Splice} = 0.01$, $\lambda_{Cal} = 0.1$, $\lambda_{NonNeg} = 0.01$, and $\lambda_{TV} = 0.01$.
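
Combining the pieces with the weights above gives a one-line sketch of the full stability term; the helper functions are the illustrative ones sketched in the preceding subsections, not the released implementation.

```python
def stability_loss(sim, sim_all_pairs, neg_mask, gamma):
    """Weighted sum of the four stability regularizers (illustrative sketch)."""
    return (0.01 * splice_loss(sim, neg_mask)
            + 0.10 * calibration_loss(gamma)
            + 0.01 * nonneg_loss(sim_all_pairs)
            + 0.01 * tv_loss(sim))
```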

17 Regularizer Ablation

Regularizers enabled         Speech Semseg.          Places Acc. @ 10
(Cal, NonNeg, Splice, TV)    mAP       mIoU          I → A      A → I
✓✓✓✓                         48.7%     36.8%         94.2%      94.3%
✓✓✓                          49.1%     37.3%         94.3%      94.1%
✓✓✓                          48.2%     36.8%         94.1%      93.4%
✓✓✓                          48.6%     36.7%         94.8%      94.5%
✓✓✓                          49.0%     36.9%         94.2%      93.7%
                             -         -             -          -