Self-supervised Visual Grounding of Sound and Language (2024)


Mark Hamilton
MIT, Microsoft
markth@mit.edu
  Andrew Zisserman
Oxford, Google
  John R. Hershey
Google
  William T. Freeman
MIT, Google

Abstract

We present DenseAV, a novel dual encoder grounding architecture that learns high-resolution, semantically meaningful, and audio-visually aligned features solely through watching videos. We show that DenseAV can discover the “meaning” of words and the “location” of sounds without explicit localization supervision. Furthermore, it automatically discovers and distinguishes between these two types of associations without supervision. We show that DenseAV’s localization abilities arise from a new multi-head feature aggregation operator that directly compares dense image and audio representations for contrastive learning. In contrast, many other systems that learn “global” audio and video representations cannot localize words and sound. Finally, we contribute two new datasets to improve the evaluation of AV representations through speech and sound prompted semantic segmentation. On these and other datasets we show DenseAV dramatically outperforms the prior art on speech and sound prompted semantic segmentation. DenseAV outperforms the previous state-of-the-art, ImageBind, on cross-modal retrieval using fewer than half of the parameters. Project Page: https://aka.ms/denseav

[Figure 1]

1 Introduction

[Figure 2]

Associating audio and video events is a fundamental task in human perception. As infants develop, the synchronization and correspondence of visible sounds enables multi-modal association – a voice with a face, and a “moo” with a cow [55]. Later, as they acquire language, they associate spoken words with objects they represent [11, 49]. Amazingly, these association abilities, constituting speech recognition, sound event recognition, and visual object recognition, develop without much direct supervision. This work aims to create a model with this capability by learning high-resolution, semantically meaningful, audio-visually (AV) aligned representations. Features with these properties can be used to discover fine-grained correspondences between modalities without localization supervision or prior knowledge of the semantic representation of language.

Consider the spoken caption and accompanying sounds of the image shown in Figure 1. We wish to “ground” both the speech and the sounds by identifying them with the corresponding visual objects. For instance, both the spoken word “dog” and the sound of a bark in the audio signal should be associated with the pixels of the dog in the visual signal, if present. We seek high-quality local representations where this behavior, which is notably absent from popular approaches in the literature, emerges from simple inner products between cross-modal features.

To achieve this, we make three innovations. First, we introduce DenseAV, a dual-encoder architecture that computes a dense similarity volume over audio and visual features. Looking at a slice of this similarity volume for a spoken word, as in Figure 1, we can visualize the AV activation strength between a word or sound and an image’s pixels. The novelty we introduce is to extend this dense similarity mechanism to have multiple similarity volume heads, much like those of multi-head attention. This allows each head to specialize on a particular type of coupling between the visual and audio modalities. Interestingly, we discover that if we give DenseAV two heads and train on a dataset that contains both language and sound, the heads naturally learn to distinguish language from more general sound using only cross-modal supervision. For example, as shown in Figure 1, head 1 focuses on sounds, such as a dog bark, emitted by visible objects, whereas head 2 focuses on speech, such as the word “dog”, that refers to visible objects.

Second, we show the importance of the “aggregation function” one uses to create a summary similarity score between an audio clip and a video frame for contrastive learning. The traditional choices, using inner products between global representations such as class tokens [14, 6, 54] or pooled features [65, 21], do not promote AV alignment of dense local features. Because of this, several popular audio-video backbones that excel at cross-modal retrieval cannot directly associate objects and sounds using their local features. This limits their ability to be used for downstream tasks such as semantic segmentation, sound localization, or unsupervised language learning and discovery.

Third, we introduce two semantic segmentation datasets to evaluate visual grounding with AV representations for speech and (non-speech) sounds. We build these datasets from the high-quality segmentation masks provided by the ADE20K dataset [66] and measure mean average precision (mAP) and mean intersection over union (mIoU) on a binary mask prediction task. This evaluation is simpler and more thorough than previous efforts to measure visual grounding, such as the concept counting metrics of [26] and the “pointing games” of [42, 3, 17] that only check whether a heatmap’s peak occurs within a target box or segment. Furthermore, our evaluation avoids brittle WordNet ontologies [37], clustering, Wu and Palmer distance [62], threshold choices, and a variety of other complicating factors.

To summarize, our main contributions are as follows:

  • We introduce DenseAV, a novel self-supervised architecture that learns high-resolution AV correspondences.

  • We introduce a local-feature-based image similarity function that significantly improves a network’s zero-shot localization ability compared to common strategies such as average pooling or CLS tokens.

  • We introduce new datasets for evaluating speech and sound prompted semantic segmentation. We show DenseAV significantly outperforms the current state-of-the-art on these tasks as well as on cross-modal retrieval.

  • We discover that our multi-head architecture naturally disentangles audio-visual correspondence into sound and language components using only contrastive supervision.

[Figure 3]

2 Related Work

Audio-visual (AV), text-visual, and other multi-modal models have a long history [16, 15], and have recently surged in popularity [67]. Broadly speaking, DenseAV is an audio-video contrastive learning architecture; this class of methods learns AV representations by aligning paired signals and pushing apart unpaired signals [12, 30]. Of the models in this class, several stand out for their ability to localize sounds [3, 8, 46] or capture the semantics of language [26, 47]. Many models in this class compare AV signals using inner products between “global” representations formed by pooled deep features [21, 60, 39] or class tokens [54, 46, 47, 20, 35]. Most notably, ImageBind [20] has gained popularity due to its state-of-the-art performance on a variety of tasks and datasets and its unified class-token-based contrastive architecture. In this work we show that many of these architectures do not exhibit strong localization properties in their local features, despite excelling at cross-modal retrieval at a “global” level. This limits their applicability to new out-of-domain sounds, sounds that do not have a textual representation, and low-resource languages. We diverge from these works by directly supervising local tokens. In particular, we build on previous works [26, 3] that show max-pooling improves localization capabilities, and introduce a new multi-head aggregation operator that generalizes previous losses using a self-attention-like operator [59].

Another class of methods discovers structure in signals through uni- and multi-modal clustering. Early works on audio clustering [45] discovered meaningful utterances without supervision. Similar visual analyses have discovered visual objects [31, 9, 5, 23]. Recent works have applied these ideas to the AV domain [24, 2], but do not focus on extracting high-resolution AV representations.

Finally, several works investigate generative audio-video learning. The Sound of Pixels [64] generates the sound of a specific object using a source separation loss. Newer approaches using GANs [33, 34] and diffusion models [10, 20, 36] have generated audio from video and vice versa. Here we focus on improving the local representations of contrastive learners because of their relative scalability, simplicity, and ability to learn high-quality representations.

3 Methods

At a high level, DenseAV tries to determine when a given audio and visual signal belong “together” using dense audio-visual representations. To perform this task robustly, DenseAV must learn to predict the contents of an audio signal from a visual signal and vice versa. Doing so causes DenseAV to learn dense modality-specific features that capture the mutual information shared between the modalities [58]. Once learned, we can directly query these informative features to perform speech and sound prompted semantic segmentation, as illustrated in Figure 1.

More specifically, DenseAV is built from two modality-specific deep featurizers. These backbones produce temporally varying audio features across an audio clip and spatially varying video features for a single randomly selected frame. Our loss computes a similarity between audio and visual signals based on the intuition that two signals are similar if they have a variety of strong couplings or shared objects. More formally, we form a scalar similarity for a pair of audio and video signals by carefully aggregating a volume of pairwise inner products between dense features. We use the InfoNCE [40] contrastive loss to encourage similarity between “positive” pairs of signals and dissimilarity between “negative” pairs formed by in-batch shuffling. Figure 3 graphically depicts this loss function, and subsequent sections detail each component of our architecture.

3.1 Multi-Headed Aggregation of Similarities

DenseAV’s key architectural distinction is its loss function, which directly supervises the “local” tokens of the visual and audio featurizers. This is a significant departure from other works [54, 22, 50, 6, 43, 20] that pool modality-specific information into “global” representations prior to the contrastive loss. Unlike prior works, our loss function aggregates the full pairwise similarities between the local tokens into an aggregate measure of similarity for a given pair of audio and visual signals. We show in Figure 2 that this architectural choice enables DenseAV’s local features to align across modalities, whereas other approaches such as average pooling, class tokens, and SimPool [48] do not.

We first describe our loss function informally and define it more precisely in the next paragraph. Our loss function computes the (un-normalized) inner product between every pair of visual and audio features to form a “volume” of inner products. This volume represents how strongly each part of an audio signal “couples” to each part of a visual signal. We aim to find many large couplings between positive pairs of audio and visual signals. Ideally, these couplings should connect visual objects with their references in the audio signal. Conversely, we do not want to find couplings between negative pairs of signals. To compute a single global coupling strength for a pair of signals, we aggregate this volume of pairwise similarities into a single number. There are myriad ways to aggregate this volume, ranging from “soft” average-pooling to “hard” max-pooling. Average pooling yields dense gradients and can improve convergence speed and stability. However, max-pooling allows the network to focus on the best couplings regardless of an object’s size or a sound’s duration. Our aggregation function combines the benefits of average and max pooling by max-pooling the visual dimensions and average-pooling the audio dimensions, as proposed in [26]. Intuitively speaking, this averages the strongest image couplings over an audio signal. It allows small visual objects to have large effects yet provides a strong training gradient to many regions of the signals. Finally, we draw inspiration from multi-head self-attention [59] and generalize this operation to multiple “heads” that we max-pool before pooling the visual and audio dimensions. This allows DenseAV to discover multiple “ways” to associate objects across modalities.

More formally, let $\mathcal{S}(a,v) \in \mathbb{R}$ represent the similarity between a tensor of audio features $a \in \mathbb{R}^{C \times K \times F \times T}$ of size (Channel $\times$ K-heads $\times$ Frequency $\times$ Time) and a tensor of visual features $v \in \mathbb{R}^{C \times K \times H \times W}$ of size (Channel $\times$ K-heads $\times$ Height $\times$ Width). To define this scalar similarity score, we first create a local similarity volume, $s(a,v) \in \mathbb{R}^{K \times F \times T \times H \times W}$. For simplicity, we consider the aggregated similarity between a single image and audio clip, but note that one can easily generalize this to max-pool over video frames. We define the full pairwise volume of similarities as:

$$s(a,v)[k,f,t,h,w] = \sum_{c=1}^{C} a[c,k,f,t] \cdot v[c,k,h,w] \tag{1}$$

where $a[c,k,f,t]$ represents the value of $a$ at location $[c,k,f,t]$ and $\cdot$ is scalar multiplication. We aggregate this similarity volume into a single score $\mathcal{S}(a,v) \in \mathbb{R}$:

$$\mathcal{S}(a,v) = \frac{1}{FT}\sum_{f=1}^{F}\sum_{t=1}^{T}\max_{k,h,w}\left(s(a,v)[k,f,t,h,w]\right) \tag{2}$$

We note that this operation can be viewed as a multi-head generalization of the MISA loss of [26], and a multi-head, multi-time generalization of the MIL loss of [3].
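To make the aggregation concrete, the following minimal PyTorch sketch implements Equations 1 and 2 with an added batch dimension; the tensor layout and function names are illustrative rather than our exact released code.

```python
import torch

def similarity_volume(a: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Eq. (1): inner products between audio features a (B, C, K, F, T) and
    visual features v (B, C, K, H, W), giving a volume of shape (B, K, F, T, H, W)."""
    return torch.einsum("bckft,bckhw->bkfthw", a, v)

def aggregate_similarity(sim: torch.Tensor) -> torch.Tensor:
    """Eq. (2): max-pool heads and visual positions, then average over audio time.
    Input (B, K, F, T, H, W) -> one scalar similarity per batch item, shape (B,)."""
    pooled = sim.amax(dim=(1, 4, 5))   # max over k, h, w
    return pooled.mean(dim=(1, 2))     # average over f, t
```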

3.2 Loss

We can use the similarity between audio and visual signals defined in Equation 2 to construct a contrastive loss. We follow recent works [20, 18, 61] and use the temperature-weighted InfoNCE [40] loss to encourage similarity between positive pairs of signals and dissimilarity between negative pairs. In DenseAV, we form $B$ positive pairs by splitting the audio and visual components of a batch of training data. We form $B^2 - B$ negative pairs by comparing a signal to all of the other signals in the training batch. More formally, let $(a_b, v_b)_{1}^{B}$ be $B$ pairs of audio and visual signals. The visual-retrieval term of our InfoNCE loss is then:

$$\mathcal{L}_{A \to V} = -\frac{1}{2B}\sum_{b=1}^{B}\left(\log\frac{\exp\left(\gamma\,\mathcal{S}(a_b, v_b)\right)}{\sum_{b'=1}^{B}\exp\left(\gamma\,\mathcal{S}(a_b, v_{b'})\right)}\right) \tag{3}$$

where $\gamma \in \mathbb{R}^{+}$ is a trainable inverse temperature parameter. We symmetrize this loss by adding the analogous audio-retrieval term, $\mathcal{L}_{V \to A}$, which iterates over negative audio signals in the denominator.
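A corresponding sketch of the symmetric loss, with the $\frac{1}{2B}$ weighting of Equation 3 and its mirror folded into two cross-entropy terms, again with assumed tensor layouts:

```python
import torch
import torch.nn.functional as F

def dense_av_infonce(a_feats: torch.Tensor, v_feats: torch.Tensor, gamma: torch.Tensor) -> torch.Tensor:
    """Symmetric InfoNCE over aggregated similarities (Eq. 3 plus its audio-retrieval mirror).
    a_feats: (B, C, K, F, T), v_feats: (B, C, K, H, W), gamma: trainable inverse temperature."""
    B = a_feats.shape[0]
    # all B x B cross-pair similarity volumes, aggregated as in Eq. 2: S[i, j] = S(a_i, v_j)
    sim = torch.einsum("ickft,jckhw->ijkfthw", a_feats, v_feats)
    S = sim.amax(dim=(2, 5, 6)).mean(dim=(2, 3))          # (B, B)
    logits = gamma * S
    targets = torch.arange(B, device=S.device)
    loss_a_to_v = F.cross_entropy(logits, targets)        # negatives range over visual signals
    loss_v_to_a = F.cross_entropy(logits.T, targets)      # negatives range over audio signals
    return 0.5 * (loss_a_to_v + loss_v_to_a)
```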

3.3 Audio and Visual Featurizers

The core of DenseAV is two modality-specific backbone networks. We use the DINO vision transformer [6] with ImageNet-pretrained weights (trained without labels) to provide a strong, yet fully unsupervised, vision backbone. Unlike other approaches that use CLIP [50] as a backbone, DINO does not require paired text captions and learns from unlabeled images only. Practically, we find that DINO outperforms CLIP because of its better-behaved local tokens [13], an effect we explore in the Supplement. We append an additional layer norm operation across the channel dimension [4] and a $1 \times 1$ convolution to DINO. The layer norm and $1 \times 1$ convolution ensure the architecture does not start with a saturated loss function. We use the HuBERT audio transformer [28] as DenseAV’s audio backbone. HuBERT operates on waveforms and is trained on the LibriSpeech [44] dataset using only self-supervision. HuBERT outputs a single feature per time frame, corresponding to $F=1$ in Section 3. Though HuBERT was only trained on speech, its audio features can be fine-tuned for more general sounds, much like how vision backbones can be fine-tuned for new datasets [63]. As in the visual branch, we append a channel-wise LayerNorm block and two $3 \times 3$ convolutions to the audio branch. These layers help the network avoid saturation and speed convergence. Furthermore, the two convolutions help the model aggregate information, which reduces the cost of the pairwise feature comparison used in our loss function. We refer to these added layers after the pretrained backbones as the “aligners” in later sections.
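A sketch of how these aligners could be structured in PyTorch follows; the output width, the nonlinearity between the audio convolutions, and any temporal downsampling are assumptions not specified above.

```python
import torch
import torch.nn as nn

class ChannelNorm(nn.Module):
    """LayerNorm applied across the channel dimension of (B, C, ...) feature maps."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(x.movedim(1, -1)).movedim(-1, 1)

class VisualAligner(nn.Module):
    """Channel-wise LayerNorm + 1x1 convolution applied to DINO patch features (B, C, H, W)."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.net = nn.Sequential(ChannelNorm(in_dim),
                                 nn.Conv2d(in_dim, out_dim, kernel_size=1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class AudioAligner(nn.Module):
    """Channel-wise LayerNorm + two 3x3 convolutions applied to HuBERT features (B, C, F=1, T).
    The ReLU between the convolutions is an assumption."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.net = nn.Sequential(ChannelNorm(in_dim),
                                 nn.Conv2d(in_dim, out_dim, kernel_size=3, padding=1),
                                 nn.ReLU(),
                                 nn.Conv2d(out_dim, out_dim, kernel_size=3, padding=1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)
```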

3.4 Regularizers

Disentanglement Regularizer, $\mathcal{L}_{Dis}$: We add a small regularization term to encourage each head of Equation 1 to specialize and learn independent types of audio-visual associations. Interestingly, we find that our 2-head model naturally learns to distinguish the meaning of words with one head and capture the sounds objects produce with the other head. To further encourage this unsupervised discovery of concepts, we penalize the network when multiple attention heads are simultaneously active. More precisely, let $(a_b, v_b)_{1}^{B}$ be a batch of $B$ paired audio and visual signals. Our disentanglement loss for two heads is then:

$$\mathcal{L}_{Dis} = \text{Mean}\left(\left|s(a_b, v_b)[1] \circ s(a_b, v_b)[2]\right|\right) \tag{4}$$

where $\circ$ is elementwise multiplication and $|\cdot|$ is the elementwise absolute value function. The index $[k]$ mirrors PyTorch slicing notation and refers to selecting the activations of only the $k$th attention head. Intuitively, this loss encourages one head to be silent when the other head is active, and is a “cross-term” generalization of the $\ell^2$ regularizer [27] for encouraging activation shrinkage. When $K>2$ we average contributions from every combination of heads. We ablate this, and our decision to max-pool heads, in Table 3.
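A minimal sketch of Equation 4 and its averaging over head pairs when $K>2$, assuming the batched similarity volume from the earlier sketch:

```python
import torch
from itertools import combinations

def disentanglement_loss(sim: torch.Tensor) -> torch.Tensor:
    """Eq. (4): penalize heads of the positive-pair similarity volume (B, K, F, T, H, W)
    for being active at the same locations; averaged over head pairs when K > 2."""
    K = sim.shape[1]
    terms = [(sim[:, i] * sim[:, j]).abs().mean()      # Mean(|s[i] ∘ s[j]|)
             for i, j in combinations(range(K), 2)]
    return torch.stack(terms).mean()
```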

Stability Regularizers, $\mathcal{L}_{Stability}$: Finally, we add several other small regularization terms to encourage stable convergence. We detail and ablate these terms in the Supplement. Briefly, these terms include standard regularizers like Total Variation [52] smoothness over time and a non-negative pressure to encourage the network to focus on similarity instead of dissimilarity. In addition, we add a regularizer to prevent the calibration temperature, $\gamma$, from drifting too quickly, and a regularizer to discourage activations during silence and noise. In the Supplement we show that each regularizer alone does not have a dramatic effect on final metrics, but together they can stop collapses during training.
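For illustration only, the first two of these terms might be implemented roughly as follows when applied to the similarity volume; the exact formulations, their weights, and the temperature and silence regularizers are specified in the Supplement and omitted here.

```python
import torch

def example_stability_terms(sim: torch.Tensor) -> torch.Tensor:
    """Illustrative sketch only: total-variation smoothness over the audio time axis and a
    non-negative pressure, applied to the similarity volume sim of shape (B, K, F, T, H, W)."""
    tv = (sim[..., 1:, :, :] - sim[..., :-1, :, :]).abs().mean()   # smoothness over T
    nonneg = sim.clamp(max=0).abs().mean()                          # penalize negative couplings
    return tv + nonneg
```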

Combining these losses into a single loss function yields:

$$\mathcal{L} = \mathcal{L}_{A \to V} + \mathcal{L}_{V \to A} + \lambda_{Dis}\mathcal{L}_{Dis} + \mathcal{L}_{Stability} \tag{5}$$

In our experiments we use $\lambda_{Dis} = 0.05$ and refer interested readers to the Supplement for the details of our small stability regularizers, $\mathcal{L}_{Stability}$.

3.5 Training

In our experiments we train DenseAV and relevant baselines on the AudioSet [19] dataset for sound prompted segmentation and AudioSet retrieval. We train on the Places Audio [25] dataset for speech prompted segmentation, Places Audio retrieval, and the ablation studies of Table 4. In our disentanglement experiments of Table 3 and the feature visualizations of Figures 1 and 2, we train on both AudioSet and Places Audio so that DenseAV is exposed both to language, the prominent audio signal in Places Audio, and to more general sounds from AudioSet. In these experiments we sample training data from the two corpora so that each batch has an even split between AudioSet and Places Audio.

Warming up Aligners: We find that we can dramatically improve training stability by first training the added aligners (convolutions and layer norms) for 3,000 steps while keeping the pretrained DINO and HuBERT backbones fixed. This allows the aligners to adapt to the pretrained backbones before modifying each backbone’s sensitive weights. We use random resized crops, color jitter, random flips, and random greyscaling as image augmentations. We randomly sample a single video frame to feed to our visual branch. Audio clips are converted to single-channel format and are trimmed or padded with silence to create uniform 10-second clips. We re-sample audio clips according to the requirements of the backbone models used; for HuBERT, we re-sample to 16 kHz. We train on 8 V100 GPUs with an effective batch size of 80, and aggregate negative samples across all GPUs prior to computing the loss to ensure efficient parallelization. We provide additional training information and hyperparameters in the Supplement.
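The audio preprocessing described above can be sketched as follows; the helper name and the use of torchaudio are illustrative assumptions rather than our exact pipeline.

```python
import torch
import torchaudio

def load_audio_clip(path: str, target_sr: int = 16_000, clip_seconds: int = 10) -> torch.Tensor:
    """Load a clip, convert to mono, resample (16 kHz for HuBERT), and trim or pad
    with silence to a uniform 10 second length."""
    wav, sr = torchaudio.load(path)                       # (channels, samples)
    wav = wav.mean(dim=0, keepdim=True)                   # convert to single channel
    if sr != target_sr:
        wav = torchaudio.functional.resample(wav, sr, target_sr)
    n = target_sr * clip_seconds
    if wav.shape[1] < n:                                  # pad with silence on the right
        wav = torch.nn.functional.pad(wav, (0, n - wav.shape[1]))
    return wav[:, :n]                                     # trim clips that are too long
```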

Full Training: After warming up the aligners, we train the full model for an additional 800,000 steps using the same loss, batch size, and training logic. We train all aligner weights and fine-tune all HuBERT audio backbone weights. We use low-rank adaptation (LoRA) [29] with rank 8 to fine-tune the “Q”, “K”, and “V” projection layers of the DINO visual backbone’s attention blocks. This allows us to efficiently adapt DINO and stabilize training, as it is quite easy to collapse the carefully trained DINO weights.
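For reference, a minimal LoRA wrapper of the kind described here (rank 8) can be written as below; the scaling constant and initialization follow common practice and are assumptions, and libraries such as PEFT provide equivalent wrappers.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal low-rank adaptation of a frozen linear layer."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                   # freeze the pretrained weight
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))      # zero init: no change at the start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```

Such a wrapper would be applied to the query, key, and value projections of each DINO attention block.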

Table 1: Speech and sound prompted semantic segmentation.

| Method | Speech mAP | Speech mIoU | Sound mAP | Sound mIoU |
| --- | --- | --- | --- | --- |
| DAVENet [26] | 32.2% | 26.3% | 16.8% | 17.0% |
| CAVMAE [21] | 27.2% | 19.9% | 26.0% | 20.5% |
| ImageBind [20] | 20.2% | 19.7% | 18.3% | 18.1% |
| Ours | 48.7% | 36.8% | 32.7% | 24.2% |

4 Experiments

To evaluate AV representation quality, we perform a variety of analyses including comparative activation visualization, quantitative measurements of speech and sound prompted semantic segmentation, and cross-modal retrieval. Additionally, we quantify our observation that DenseAV can distinguish the meanings of words (language), from the sounds of objects (sound) without supervision.

To adequately measure a representation’s AV alignment quality, we found it necessary to introduce two evaluation datasets that measure speech and sound prompted semantic segmentation performance. Our two datasets introduce pairs of speech and sound prompts coupled with matching images and segmentation masks derived from ADE20K. We create these datasets because previous works [26] have not published their datasets or evaluation code. For our cross-modal retrieval experiments, however, we use an established experimental setting from the literature.

We compare against a variety of prior art including the popular state-of-the-art multi-modal retrieval network ImageBind [20]. We also compare against CAVMAE [21], a leading multimodal backbone trained specifically for AudioSet retrieval, and DAVENet [26], which is trained to localize the meanings of words. We include two other baselines [25, 24] that have reported cross-modal retrieval metrics on Places Audio. Finally, we compare our multi-head aggregation strategy to common “global” retrieval methods such as inner products between class tokens, average-pooled tokens, and SimPooled [48] tokens. We note that SimPool achieves state-of-the-art localization results when compared to 14 other pooling methods. Nevertheless, our multi-head aggregator yields better localization results than any of these “global” methods.

4.1 Qualitative Comparison of Feature Maps

Our first experiment, in Figure 2, highlights the dramatic differences in quality between DenseAV’s features and those of other approaches in the literature. DenseAV is the only backbone whose local tokens are semantically meaningful and show cross-modal alignment for both speech and sound. Though both CAVMAE and ImageBind show high-quality retrieval performance, neither exhibits well-aligned local tokens. As a result, DenseAV can associate and localize both sound and language significantly better than the other backbones. DAVENet shows coarse correspondences between language and visual objects, but cannot associate sound with visual objects and does not match DenseAV’s high-resolution maps. Furthermore, the right half of Figure 1 demonstrates that DenseAV naturally discovers and separates word semantics from the sounds of objects without labels to supervise this separation. In the supplement, we provide additional visualizations of all backbones considered across a wide range of words and sounds.

Table 2: Cross-modal retrieval accuracy @10 on Places Audio and AudioSet.

| Method | Places I→A | Places A→I | AudioSet I→A | AudioSet A→I |
| --- | --- | --- | --- | --- |
| [25]* | 46.3% | 54.8% | - | - |
| [24]* | 54.2% | 56.4% | - | - |
| DAVENet [26]* | 52.8% | 60.4% | - | - |
| CAVMAE [21] | 81.7% | 77.7% | 55.7% | 50.7% |
| ImageBind [20] | 1.10% | 1.10% | 64.5% | 66.5% |
| Ours | 94.2% | 94.3% | 69.8% | 68.1% |

4.2 Speech Prompted Image Segmentation

Dataset: We introduce a speech prompted segmentation dataset using the ADE20K dataset, which is known for its comprehensive ontology and pixel-precise annotations [66]. From this dataset, we curate an evaluation subset of image-class pairs by sampling up to 10 images for each object class in ADE20K, excluding images where the selected class was tiny (<5% of pixels). We only consider classes with at least 2 images that pass the tiny object criterion. For each class and image, we formed a binary target mask by selecting the semantic segmentation mask for that class. This resulted in 3030 image-object pairs spanning 478 ADE20K classes.

We created paired speech signals by speaking the prompt “A picture of a(n) [object]”, where [object] is the name of the ADE20K class. We create clear, controlled, and consistent audio prompts using Microsoft’s neural text-to-speech service [51]. This service also provides the exact timing of the “[object]” utterance within the broader prompt and ensures each class is measured equally. Grammar was manually verified for the utterances to ensure proper singular/plural and a/an agreement with the class name. We release images, masks, and audio prompts for reproducibility.

Evaluation Measure: We evaluate methods based on how well their speech-prompted activations align with ground-truth masks for the visual object’s class. We quantify this with the binary Average Precision (AP) and Intersection over Union (IoU) metrics, which measure how closely activations match the binary label masks from the ADE20K dataset. To compute an aggregate score over all of the object classes considered, we compute the mean average precision (mAP) and mean intersection over union (mIoU) by averaging the per-class AP and IoU scores across all object categories.

The mAP is particularly well suited for evaluating feature similarities because it is unaffected by monotonic transformations of the similarity scores. This eliminates the need for arbitrary thresholding and calibration. This is particularly important because many networks’ inner products are not centered at zero, and the best thresholding strategy can be nontrivial and dependent on the network and object class. Average Precision avoids these confounding factors and ensures a fair comparison across methods. Unlike the mAP, however, the mIoU metric requires selecting a threshold. To make our mIoU measurement similarly invariant to monotonic transformations, we evaluate 20 uniformly spaced thresholds between the smallest and largest activations of each model. For each baseline, we report results for the best threshold to ensure a fair comparison between all networks considered.
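The per-class metrics can be computed with a sketch like the following; for brevity the threshold sweep is shown within a single class, whereas we report a single best threshold per model across all classes.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def class_ap_and_iou(scores, masks, n_thresholds=20):
    """Per-class evaluation: `scores` and `masks` are lists of per-pixel similarity maps
    and binary ground-truth masks of matching shape. AP needs no threshold; IoU is swept
    over 20 thresholds between the model's smallest and largest activations."""
    s = np.concatenate([x.ravel() for x in scores])
    m = np.concatenate([x.ravel() for x in masks]).astype(bool)
    ap = average_precision_score(m, s)
    ious = []
    for t in np.linspace(s.min(), s.max(), n_thresholds):
        pred = s >= t
        union = (pred | m).sum()
        ious.append((pred & m).sum() / union if union > 0 else 0.0)
    return ap, ious   # mAP averages AP over classes; mIoU uses the best threshold
```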

Implementation: We compute image heatmaps by evaluating each modality-specific network on the image-audio pairs from our dataset. We extract dense features from the final layer of each network and form their similarity volume according to Equation 1. For DenseAV, we max-pool the head dimension to properly compare with single-headed models. We average activations over the temporal extent of the “[object]” utterance using the word-timing information from the ground-truth audio clip. This creates a heatmap over the image features that can be bilinearly resized to the original image’s size. We then compare these per-pixel activation scores to the ground-truth object masks from our dataset.
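Concretely, a sketch of this heatmap computation for a single image-audio pair (shapes follow Section 3.1; the function name is illustrative):

```python
import torch
import torch.nn.functional as F

def speech_prompted_heatmap(a_feats, v_feats, t_start, t_end, image_hw):
    """Activation map for one audio-image pair. a_feats: (C, K, F, T) audio features,
    v_feats: (C, K, H, W) visual features, [t_start, t_end): feature frames covering the
    '[object]' utterance, image_hw: (height, width) of the original image."""
    sim = torch.einsum("ckft,ckhw->kfthw", a_feats, v_feats)     # Eq. (1) for one pair
    sim = sim.amax(dim=0)                                        # max-pool the head dimension
    heat = sim[:, t_start:t_end].mean(dim=(0, 1))                # average over the word's extent -> (H, W)
    heat = F.interpolate(heat[None, None], size=image_hw,
                         mode="bilinear", align_corners=False)[0, 0]
    return heat                                                  # per-pixel activation scores
```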

Results: The Speech mAP and mIoU columns of Table 1 show that DenseAV achieves a 51% (+16.5 mAP) relative improvement in speech-prompted semantic segmentation over previous methods. Approaches that use global-token-based contrastive strategies, such as CAVMAE and ImageBind, perform particularly poorly on this task, an observation that aligns with the qualitative results of Figure 2.

Table 3: Disentanglement ablation.

| Method | Pred. Dis. | Act. Dis. |
| --- | --- | --- |
| No $\mathcal{L}_{Dis}$, No Head Max Pool | 64.1% | 70.3% |
| No $\mathcal{L}_{Dis}$ | 99.9% | 86.5% |
| Ours | 99.9% | 91.2% |

4.3 Sound Prompted Image Segmentation

Dataset: To evaluate how well deep features localize sound, we build on Section 4.2 and create a dataset of sound prompts that align with ADE20K classes. We first select the same (large) image-object pairs from ADE20K. We then create a mapping between the ADE20K and VGGSound [7] ontologies. To compute a robust mapping, we first embed ADE20K class names and VGGSound class names with the GPT Ada 2 text embedding model [41]. For each ADE20K class, we create a list of at most three candidates from the VGGSound ontology that have a cosine similarity above 0.85. We then manually review these candidates to select the best VGGSound class for each ADE20K class and remove any spurious or mistaken matches. This produces a set of 95 ADE20K classes with strong matches in the VGGSound ontology. For each of our original 3030 image-object pairs, we select a random VGGSound validation clip with a matching class according to our mapped ontology. This yields 106 image-object pairs across 20 ADE20K classes.

Evaluation Measure: We use the same mAP and mIoU evaluation metrics as Section 4.2, but instead average over the 20 ADE20K classes considered.

Implementation: We compute sound prompted image activations as in Section 4.2, with one key change: we average activations over the entire clip because we do not have ground-truth sound timing information.

Results: The Sound mAP and mIoU columns of Table 1 show that DenseAV achieves a 25% (+6.4 mAP) relative improvement in sound prompted segmentation compared to the prior art. Most notably, ImageBind’s features cannot localize sound despite their high cross-modal retrieval performance learned from millions of hours of sound.

Table 4: Ablation of the similarity aggregation function on Places Audio.

| Method | Speech mAP | Places V→A @10 | Places A→V @10 |
| --- | --- | --- | --- |
| Average Pool | 20.1% | 92.0% | 91.2% |
| CLS Token | 20.6% | 86.4% | 89.8% |
| SimPool [48] | 35.3% | 92.6% | 92.8% |
| Multi-Head (Ours) | 48.2% | 93.5% | 93.8% |

4.4 Cross-Modal Retrieval

We show that DenseAV’s representations are not only better for localization but also significantly outperform other approaches on cross-modal retrieval. We adopt the evaluation setting of [26] and measure cross-modal retrieval accuracy at 1, 5, and 10 in a thousand-way retrieval task. In particular, we use the same thousand images from the validation set of [26] and also replicate this analysis on one thousand random clips from the AudioSet validation data. Table 2 shows results for the 1000-way retrieval tasks on both the Places Audio and AudioSet datasets. We show cross-modal accuracy at 10, and provide larger tables in the supplement that echo these results at accuracy at 1 and 5. DenseAV significantly outperforms all baselines across all metrics. Interestingly, DenseAV outperforms ImageBind with less than half of its trainable parameters and no reliance on text.
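Given a matrix of aggregated similarities between the evaluation audio clips and images, accuracy at 10 can be computed with a sketch like the following (variable names are illustrative):

```python
import torch

def retrieval_accuracy_at_k(S: torch.Tensor, k: int = 10):
    """S[i, j] holds the aggregated similarity between audio clip i and image j over the
    evaluation set; the true match for clip i is image i."""
    n = S.shape[0]
    truth = torch.arange(n)
    top_images = S.topk(k, dim=1).indices                        # rank images for each audio clip (A -> I)
    a_to_i = (top_images == truth[:, None]).any(dim=1).float().mean()
    top_audio = S.topk(k, dim=0).indices.T                       # rank audio clips for each image (I -> A)
    i_to_a = (top_audio == truth[:, None]).any(dim=1).float().mean()
    return i_to_a.item(), a_to_i.item()
```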

4.5 Measuring Disentanglement

We observe that DenseAV’s heads naturally learn to differentiate audio-visual couplings that capture the meaning of words (language) from those that capture the sounds of objects (sound). Furthermore, this effect generalizes to novel clips, including those containing both sound and language, as shown in Figure 1. We quantify this observation in two ways: the first measures whether a head’s average activation strength predicts whether a clip contains mainly “language” or “sound”; the second quantifies how often the “sound” head is incorrectly active when the “language” head should be active, and vice versa. We leverage the fact that the AudioSet dataset contains mostly clips with ambient sound and rarely contains language. In contrast, Places Audio is entirely language-based without external ambient sound. We note that these analyses are specific to our architecture with two heads ($K=2$) trained on both AudioSet and Places Audio.

For both measures of disentanglement, we first compute a clip’s aggregated similarity for each head. In particular, we remove the max-pooling over heads in Equation 2 to create a single-head similarity, $\mathcal{S}(a,v)_k$. We then min-max scale the scores of each head across both datasets to lie in the $[0,1]$ interval, which we refer to as $\hat{\mathcal{S}}(a,v)_k$. Using these normalized scores, we can create metrics that capture how well a given head responds only to a specific dataset.

Our first metric measures how well a head’s scores predict whether a clip is from the “sound” or “language” dataset. Let $(a_b, v_b)_{1}^{B}$ be tuples of paired audio and visual signals, and let $l[k']_b$ be an indicator variable of whether the signal $(a_b, v_b)$ arises from the sound dataset, AudioSet, $(k'=1)$, or the language dataset, Places Audio, $(k'=2)$. We then define:

$$\delta_{pred}(k, k') = \text{AP}\left(\left(\hat{\mathcal{S}}(a_b, v_b)_k\right)_{1}^{B},\ \left(l[k']_b\right)_{1}^{B}\right) \tag{6}$$

where $\text{AP}(\cdot,\cdot)$ is the binary average precision with prediction and label arguments, respectively. Intuitively, this measures whether the scores of head $k$ directly predict whether the data is from dataset $k'$. We can find the best assignment between heads and datasets such that each head is maximally predictive of its assigned dataset:

$$\text{PredDis} = \frac{1}{2}\max\Big(\delta_{pred}(1,1) + \delta_{pred}(2,2),\ \ \delta_{pred}(1,2) + \delta_{pred}(2,1)\Big) \tag{7}$$

The prediction disentanglement score, PredDis, is a percentage that ranges from 50% for completely entangled signals to 100% when one can perfectly classify the signals using the scores of either head. The maximum over the two possible assignments makes this metric invariant to permutations of the heads. We note that this metric is a Hungarian matching assignment [32] over two entries, a common technique to assess unsupervised classification performance [31, 23].

Our second measure quantifies “spurious activations” in the non-dominant head. A truly disentangled system should have one head that only fires on sound and another head that only fires on language. We create another disentanglement measure, ActDis, by replacing $\delta_{pred}$ in Equation 7 with:

$$\delta_{act}(k, k') = 1 - \frac{1}{\sum_{b'} l[k']_{b'}}\sum_{b=1}^{B}\hat{\mathcal{S}}(a_b, v_b)_k \cdot l[k']_b \tag{8}$$

Intuitively, this measures the “inactivity” of head $k$ on dataset $k'$. If head $k$ is totally silent on dataset $k'$, then $\delta_{act}(k, k') = 1$. Like PredDis, ActDis is a percentage ranging from 50% to 100%, with 100% representing perfect disentanglement where the sound head is completely silent during the language clips, and vice versa.
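Both metrics can be computed with a short sketch, assuming the min-max normalized per-head scores have already been collected for every clip:

```python
import numpy as np
from sklearn.metrics import average_precision_score

def disentanglement_scores(S_hat: np.ndarray, labels: np.ndarray):
    """S_hat: (N, 2) min-max normalized per-head clip scores; labels: (N,) with 1 for the
    sound dataset (AudioSet) and 2 for the language dataset (Places Audio)."""
    def d_pred(k, kp):                       # Eq. (6)
        return average_precision_score(labels == kp, S_hat[:, k - 1])

    def d_act(k, kp):                        # Eq. (8): one minus mean activation of head k on dataset kp
        return 1.0 - S_hat[labels == kp, k - 1].mean()

    pred_dis = 0.5 * max(d_pred(1, 1) + d_pred(2, 2), d_pred(1, 2) + d_pred(2, 1))   # Eq. (7)
    act_dis = 0.5 * max(d_act(1, 1) + d_act(2, 2), d_act(1, 2) + d_act(2, 1))
    return pred_dis, act_dis
```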

Table 3 shows that DenseAV achieves near-perfect predictive (99%) and activation (91%) disentanglement. It also shows that our disentanglement regularizer and max-pooling over heads improve DenseAV’s natural ability to distinguish sound from language without supervision.

5 Conclusion

We presented DenseAV, a novel contrastive learning architecture that can discover the meaning of words and localize the sounds of objects using only video supervision. We are the first to observe both qualitatively and quantitatively that it’s possible to disentangle the meaning of words from the sound of objects with only a contrastive learning signal. DenseAV’s success stems from its novel multi-head attention aggregation mechanism that encourages its modality-specific backbones to create high-resolution, semantically meaningful, and AV aligned representations. These properties of DenseAV’s representation are not seen in other state-of-the-art models in the literature. Consequently, DenseAV significantly surpasses other leading models in dense prediction tasks such as speech and sound-prompted semantic segmentation as well as in cross-modal retrieval.

Acknowledgements

We would like to thank the Microsoft Research Grand Central Resources team for their gracious help performing the experiments in this work. Special thanks to Oleg Losinets and Lifeng Li for their consistent, gracious, and timely help, debugging, and expertise. Without them, none of the experiments could have been run.

We would also like to thank David Harwath, Andrew Rouditchenko, Yuan Gong, Didac Suris, Adria Recasens Continente, and Jim Glass for their help in running DAVENet and CAVMAE baselines and evaluations as well as for many helpful tips on audio visual contrastive learning.

This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. 2021323067. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. This work is supported by the National Science Foundation under Cooperative Agreement PHY-2019786 (The NSF AI Institute for Artificial Intelligence and Fundamental Interactions, http://iaifi.org/). This work is funded by a Royal Society Research Professorship RSRP\R\241003 and EPSRC Programme Grant VisualAI EP/T028572/1.

References

  • Afouras etal. [2020]Triantafyllos Afouras, Andrew Owens, JoonSon Chung, and Andrew Zisserman.Self-supervised learning of audio-visual objects from video, 2020.
  • Alwassel etal. [2020]Humam Alwassel, Dhruv Mahajan, Bruno Korbar, Lorenzo Torresani, Bernard Ghanem, and Du Tran.Self-supervised learning by cross-modal audio-video clustering.Advances in Neural Information Processing Systems, 33:9758–9770, 2020.
  • Arandjelovic and Zisserman [2018]Relja Arandjelovic and Andrew Zisserman.Objects that sound.In Proceedings of the European conference on computer vision (ECCV), pages 435–451, 2018.
  • Ba etal. [2016]JimmyLei Ba, JamieRyan Kiros, and GeoffreyE Hinton.Layer normalization.arXiv preprint arXiv:1607.06450, 2016.
  • Caron etal. [2018]Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze.Deep clustering for unsupervised learning of visual features.In Proceedings of the European conference on computer vision (ECCV), pages 132–149, 2018.
  • Caron etal. [2021]Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin.Emerging properties in self-supervised vision transformers.In Proceedings of the International Conference on Computer Vision (ICCV), 2021.
  • Chen etal. [2020]Honglie Chen, Weidi Xie, Andrea Vedaldi, and Andrew Zisserman.Vggsound: A large-scale audio-visual dataset.In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 721–725. IEEE, 2020.
  • Chen etal. [2021]Honglie Chen, Weidi Xie, Triantafyllos Afouras, Arsha Nagrani, Andrea Vedaldi, and Andrew Zisserman.Localizing visual sounds the hard way.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16867–16876, 2021.
  • Cho etal. [2021]JangHyun Cho, Utkarsh Mall, Kavita Bala, and Bharath Hariharan.Picie: Unsupervised semantic segmentation using invariance and equivariance in clustering.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16794–16804, 2021.
  • Choi etal. [2023]Jeongsoo Choi, Joanna Hong, and YongMan Ro.Diffv2s: Diffusion-based video-to-speech synthesis with vision-guided speaker embedding.In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7812–7821, 2023.
  • Chomsky [1987]Noam Chomsky.Language and problems of knowledge: The Managua lectures.MIT press, 1987.
  • Chopra etal. [2005]Sumit Chopra, Raia Hadsell, and Yann LeCun.Learning a similarity metric discriminatively, with application to face verification.In 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR’05), pages 539–546. IEEE, 2005.
  • Darcet etal. [2023]Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski.Vision transformers need registers.arXiv preprint arXiv:2309.16588, 2023.
  • Dosovitskiy etal. [2020]Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, etal.An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020.
  • Fisher and Darrell [2002]JohnW Fisher and Trevor Darrell.Probabalistic models and informative subspaces for audiovisual correspondence.In Computer Vision—ECCV 2002: 7th European Conference on Computer Vision Copenhagen, Denmark, May 28–31, 2002 Proceedings, Part III 7, pages 592–603. Springer, 2002.
  • FisherIII etal. [2000]JohnW FisherIII, Trevor Darrell, William Freeman, and Paul Viola.Learning joint statistical models for audio-visual fusion and segregation.Advances in neural information processing systems, 13, 2000.
  • Fong and Vedaldi [2017]RuthC Fong and Andrea Vedaldi.Interpretable explanations of black boxes by meaningful perturbation.In Proceedings of the IEEE international conference on computer vision, pages 3429–3437, 2017.
  • Frosst etal. [2019]Nicholas Frosst, Nicolas Papernot, and Geoffrey Hinton.Analyzing and improving representations with the soft nearest neighbor loss.In International conference on machine learning, pages 2012–2020. PMLR, 2019.
  • Gemmeke etal. [2017]JortF Gemmeke, DanielPW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, RChanning Moore, Manoj Plakal, and Marvin Ritter.Audio set: An ontology and human-labeled dataset for audio events.In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 776–780. IEEE, 2017.
  • Girdhar etal. [2023]Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, KalyanVasudev Alwala, Armand Joulin, and Ishan Misra.Imagebind: One embedding space to bind them all.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15180–15190, 2023.
  • Gong etal. [2022]Yuan Gong, Andrew Rouditchenko, AlexanderH Liu, David Harwath, Leonid Karlinsky, Hilde Kuehne, and JamesR Glass.Contrastive audio-visual masked autoencoder.In The Eleventh International Conference on Learning Representations, 2022.
  • Guzhov etal. [2022]Andrey Guzhov, Federico Raue, Jörn Hees, and Andreas Dengel.Audioclip: Extending clip to image, text and audio.In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 976–980. IEEE, 2022.
  • Hamilton etal. [2022]Mark Hamilton, Zhoutong Zhang, Bharath Hariharan, Noah Snavely, and WilliamT Freeman.Unsupervised semantic segmentation by distilling feature correspondences.arXiv preprint arXiv:2203.08414, 2022.
  • Harwath and Glass [2017]David Harwath and JamesR Glass.Learning word-like units from joint audio-visual analysis.arXiv preprint arXiv:1701.07481, 2017.
  • Harwath etal. [2016]David Harwath, Antonio Torralba, and James Glass.Unsupervised learning of spoken language with visual context.Advances in Neural Information Processing Systems, 29, 2016.
  • Harwath etal. [2018]David Harwath, Adria Recasens, Dídac Surís, Galen Chuang, Antonio Torralba, and James Glass.Jointly discovering visual objects and spoken words from raw sensory input.In Proceedings of the European conference on computer vision (ECCV), pages 649–665, 2018.
  • Hoerl and Kennard [1970]Arthur E. Hoerl and Robert W. Kennard.Ridge regression: Biased estimation for nonorthogonal problems.Technometrics, 12(1):55–67, 1970.
  • Hsu et al. [2021] Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3451–3460, 2021.
  • Hu et al. [2021] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
  • Jaiswal et al. [2020] Ashish Jaiswal, Ashwin Ramesh Babu, Mohammad Zaki Zadeh, Debapriya Banerjee, and Fillia Makedon. A survey on contrastive self-supervised learning. Technologies, 9(1):2, 2020.
  • Ji et al. [2019] Xu Ji, João F. Henriques, and Andrea Vedaldi. Invariant information clustering for unsupervised image classification and segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 9865–9874, 2019.
  • Kuhn [1955] Harold W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83–97, 1955.
  • Kumar et al. [2020] Neeraj Kumar, Srishti Goel, Ankur Narang, and Mujtaba Hasan. Robust one shot audio to video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 770–771, 2020.
  • Liu et al. [2021] Shiguang Liu, Sijia Li, and Haonan Cheng. Towards an end-to-end visual-to-raw-audio generation with GAN. IEEE Transactions on Circuits and Systems for Video Technology, 32(3):1299–1312, 2021.
  • Ma et al. [2020] Shuang Ma, Zhaoyang Zeng, Daniel McDuff, and Yale Song. Active contrastive learning of audio-visual video representations. arXiv preprint arXiv:2009.09805, 2020.
  • Mao et al. [2023] Yuxin Mao, Jing Zhang, Mochu Xiang, Yunqiu Lv, Yiran Zhong, and Yuchao Dai. Contrastive conditional latent diffusion for audio-visual segmentation. arXiv preprint arXiv:2307.16579, 2023.
  • Miller [1995] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41, 1995.
  • Mo and Morgado [2022] Shentong Mo and Pedro Morgado. A closer look at weakly-supervised audio-visual source localization. Advances in Neural Information Processing Systems, 35:37524–37536, 2022.
  • Monfort et al. [2021] Mathew Monfort, SouYoung Jin, Alexander Liu, David Harwath, Rogerio Feris, James Glass, and Aude Oliva. Spoken moments: Learning joint audio-visual representations from video descriptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14871–14881, 2021.
  • Oord et al. [2018] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  • OpenAI [2023] OpenAI. GPT-4 technical report, 2023.
  • Oquab et al. [2015] Maxime Oquab, Léon Bottou, Ivan Laptev, and Josef Sivic. Is object localization for free? Weakly-supervised learning with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 685–694, 2015.
  • Oquab et al. [2023] Maxime Oquab, Timothée Darcet, Theo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Russell Howes, Po-Yao Huang, Hu Xu, Vasu Sharma, Shang-Wen Li, Wojciech Galuba, Mike Rabbat, Mido Assran, Nicolas Ballas, Gabriel Synnaeve, Ishan Misra, Herve Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, and Piotr Bojanowski. DINOv2: Learning robust visual features without supervision, 2023.
  • Panayotov et al. [2015] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. LibriSpeech: An ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210. IEEE, 2015.
  • Park and Glass [2007] Alex S. Park and James R. Glass. Unsupervised pattern discovery in speech. IEEE Transactions on Audio, Speech, and Language Processing, 16(1):186–197, 2007.
  • Peng and Harwath [2022a] Puyuan Peng and David Harwath. Self-supervised representation learning for speech using visual grounding and masked language modeling. arXiv preprint arXiv:2202.03543, 2022a.
  • Peng and Harwath [2022b] Puyuan Peng and David Harwath. Word discovery in visually grounded, self-supervised speech models. arXiv preprint arXiv:2203.15081, 2022b.
  • Psomas et al. [2023] Bill Psomas, Ioannis Kakogeorgiou, Konstantinos Karantzalos, and Yannis Avrithis. Keep it SimPool: Who said supervised transformers suffer from attention deficit? In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 5350–5360, 2023.
  • Pullum and Scholz [2002] Geoffrey K. Pullum and Barbara C. Scholz. Empirical assessment of stimulus poverty arguments. The Linguistic Review, 19(1-2):9–50, 2002.
  • Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
  • Ren et al. [2020] Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. FastSpeech 2: Fast and high-quality end-to-end text to speech. arXiv preprint arXiv:2006.04558, 2020.
  • Rudin et al. [1992] Leonid I. Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1-4):259–268, 1992.
  • Senocak et al. [2018] Arda Senocak, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, and In So Kweon. Learning to localize sound source in visual scenes, 2018.
  • Shih et al. [2023] Yi-Jen Shih, Hsuan-Fu Wang, Heng-Jui Chang, Layne Berry, Hung-yi Lee, and David Harwath. SpeechCLIP: Integrating speech with pre-trained vision and language model. In 2022 IEEE Spoken Language Technology Workshop (SLT), pages 715–722. IEEE, 2023.
  • Smith and Yu [2008] Linda Smith and Chen Yu. Infants rapidly learn word-referent mappings via cross-situational statistics. Cognition, 106(3):1558–1568, 2008.
  • Snyder et al. [2015] David Snyder, Guoguo Chen, and Daniel Povey. MUSAN: A music, speech, and noise corpus, 2015.
  • Sun et al. [2023] Weixuan Sun, Jiayi Zhang, Jianyuan Wang, Zheyuan Liu, Yiran Zhong, Tianpeng Feng, Yandong Guo, Yanhao Zhang, and Nick Barnes. Learning audio-visual source localization via false negative aware contrastive learning, 2023.
  • Tian et al. [2020] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI, pages 776–794. Springer, 2020.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
  • Wang and Oord [2021] Luyu Wang and Aaron van den Oord. Multi-format contrastive learning of audio representations. arXiv preprint arXiv:2103.06508, 2021.
  • Wang and Isola [2020] Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In International Conference on Machine Learning, pages 9929–9939. PMLR, 2020.
  • Wu and Palmer [1994] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. arXiv preprint cmp-lg/9406033, 1994.
  • Yosinski et al. [2014] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? Advances in Neural Information Processing Systems, 27, 2014.
  • Zhao et al. [2018] Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh McDermott, and Antonio Torralba. The sound of pixels. In Proceedings of the European Conference on Computer Vision (ECCV), pages 570–586, 2018.
  • Zhou et al. [2016] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2921–2929, 2016.
  • Zhou et al. [2017] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ADE20K dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 633–641, 2017.
  • Zhu et al. [2021] Hao Zhu, Man-Di Luo, Rui Wang, Ai-Hua Zheng, and Ran He. Deep audio-visual learning: A survey. International Journal of Automation and Computing, 18:351–376, 2021.


Supplementary Material

6 Full Cross Modal Retrieval Results

                       Places Audio Retrieval                         AudioSet Retrieval
                       I → A                 A → I                    I → A                 A → I
Method                 @1     @5     @10     @1     @5     @10        @1     @5     @10     @1     @5     @10
[25]                   12.1%  33.5%  46.3%   14.8%  40.3%  54.8%      -      -      -       -      -      -
[24]                   13.0%  37.8%  54.2%   16.1%  40.4%  56.4%      -      -      -       -      -      -
DAVENet [26]           12.7%  37.5%  52.8%   20.0%  46.9%  60.4%      -      -      -       -      -      -
DAVENet* [26]          13.3%  38.3%  51.2%   20.5%  45.3%  57.2%      0.10%  0.70%  1.30%   0.10%  0.30%  1.20%
CAVMAE* [21]           36.7%  70.3%  81.7%   33.9%  65.7%  77.7%      22.8%  44.9%  55.7%   21.1%  41.7%  50.7%
ImageBind [20]         0.10%  0.50%  1.10%   0.10%  0.40%  1.10%      29.6%  55.4%  64.5%   31.8%  57.3%  66.5%
Ours                   65.3%  90.0%  94.2%   64.4%  89.4%  94.3%      35.1%  58.0%  68.2%   33.6%  59.3%  68.4%

7 VGGSound Source Evaluation

Table 6 adds evaluations on the VGGSound Source (VGGSS) dataset. We note that the large bounding boxes of the VGGSS annotations do not reward high-resolution results. Nevertheless, DenseAV outperforms all methods, including five additional baselines (Attention10K [53], AVObject [1], LVS [8], FNAC AVL [57], and SLAVC [38]).

Method          cIoU     AUC
DAVENet         6.8%     21.2%
CAVMAE          7.9%     25.0%
ImageBind       3.4%     20.5%
Attention10K    18.5%    30.2%
AVObject        29.7%    35.7%
LVS             34.4%    38.2%
SLAVC           38.8%    38.8%
FNAC AVL        39.4%    39.4%
Ours            40.6%    40.6%

8 Speech Prompted Semantic Segmentation Noise Robustness

DenseAV was trained with natural speech and sounds and is robust to environmental noise and common speech errors like stutters. We report additional noise-robustness experiments in Table 7.

Method       mAP      mIoU
DAVENet      31.8%    26.1%
CAVMAE       27.2%    23.8%
ImageBind    20.2%    19.7%
Ours         48.1%    36.6%

9 Speech Prompted Semantic Segmentation Examples

[Figure: qualitative examples of speech prompted semantic segmentation.]

10 Sound Prompted Semantic Segmentation Examples

[Figure: qualitative examples of sound prompted semantic segmentation.]

11 Comparison Across Backbones

[Figure: comparison of results across different backbones.]

12 Associating Spoken Words to Visual Objects

Visual Object      Top 5 Retrieved Words
ottoman            sofa, chair, chair, seat, living
ruins              brick, stone, castle, clay, stone
dirt track         dirt, dirt, trail, field, dirt
monitor            screen, screen, computer, television, screen
control panel      cockpit, airplane, cockpit, airplane, airplane
bar                desk, picture, counter, poker, kitchen
waterfall          waterfall, fountain, water, waterfall, waterfall
embankment         trench, land, field, land, hill
bleachers          amphitheater, steps, colosseum, step, stairway
snow               snow, snow, snow, mountain, snow
[Figure: visualizations of spoken words associated with visual objects.]

13 Failure Cases

[Figure: failure cases.]

14 Comparing to DINO CLS Token Activations

[Figure: comparison to DINO CLS token activations.]

15 Visualizing Activations when an Object is not Present

[Figure: activations when the prompted object is not present.]

16 Additional Regularizer Details

Negative Audio Splicing

Though using Eq. 3 is enough to make a reasonable cross-modal retrieval system, the extreme flexibility of the self-attention operator in modern transformers can lead to degenerate solutions. For example, we found that without regularizers that encourage local features to be meaningful, the network could develop its own “global” tokens by selecting a handful of local tokens to carry all of the information. This is similar to the observation of [13], and we observed it occasionally in our audio branch, which would collapse to use only the first tokens. To keep the network from collapsing the semantics of the audio clip into a single token, we introduce small negative sample clips into our audio samples. These small negative audio regions are randomly spliced into the larger audio clip, and we encourage the network to set the couplings in these regions to zero with an $\ell_2$ regularizer. We include further details of DenseAV’s architecture, hyperparameters, and regularizers below.

More formally, let $(a_b, v_b)_1^B$ be a batch of $B$ paired audio and visual signals as before. Let $m_b \in [0,1]^T$ be a soft mask that measures whether a given location in the audio signal is part of a spliced negative clip. For example, $m_b[t] = 1$ when the clip at time $t$ is part of the negative clip, $m_b[t] = 0$ in the positive part of the clip, and $0 < m_b[t] < 1$ in the small boundary regions where the true clip is spliced into the negative clip and both sounds are present. Our negative audio splicing regularizer squares each entry of the similarity tensor and averages these entries, weighted by the strength of the negative clip indicator $m_b$:

\mathcal{L}_{Splice} = \text{WeightedMean}\left(s(a_b, v_b)^2,\ m_b\right) \qquad (9)

where the mean assumes that the weighting strength $m_b$ has been broadcast to the shape of $s(a_b, v_b)^2$. We give explicit formulations of these regularizers here because they are too verbose for the double-column format of the main paper. Intuitively, this term penalizes the network for having activations during a period of spliced negative audio. We also apply this regularizer to any padded silence at the ends of short audio clips.
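
As a concrete illustration, below is a minimal PyTorch sketch of this weighted mean. The tensor layout (a per-pair similarity volume of shape (B, K, F, T, H, W) and a per-frame mask of shape (B, T)) and the helper name splice_loss are assumptions for illustration, not the released implementation.

```python
import torch

def splice_loss(sim: torch.Tensor, neg_mask: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Weighted mean of squared similarities over spliced-negative audio regions.

    sim:      (B, K, F, T, H, W) similarity volume s(a_b, v_b) for paired clips.
    neg_mask: (B, T) soft indicator m_b; 1 where the audio is a spliced negative.
    Shapes are illustrative assumptions; only the weighting logic matters.
    """
    # Broadcast the audio-time mask to the full shape of the similarity volume.
    w = neg_mask[:, None, None, :, None, None]  # (B, 1, 1, T, 1, 1)
    weighted_sq = (sim ** 2) * w
    return weighted_sq.sum() / (w.expand_as(sim).sum() + eps)
```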

Calibration Regularization

The calibration temperature provides the network with the crucial ability to increase or decrease its certainty by updating a single parameter. However, the network can also achieve this effect by increasing or decreasing the magnitudes of its features. We found that the temperature would sometimes accelerate downward, forcing the feature magnitudes to increase to compensate. As a result, the network would eventually saturate or become unstable. We hypothesize that this is due to optimizer momentum, and we prevent this “runaway calibration” by adding a small regularizer to the temperature parameter $\gamma$:

\mathcal{L}_{Cal} = \max\left(\log(1) - \log(\gamma),\ 0\right)^2 \qquad (10)

This term penalizes the calibration temperature when it drops below 1.0 and encourages it to stay at or above 1.0.
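
A one-line PyTorch sketch of this penalty follows; since log(1) = 0, the term reduces to max(-log(gamma), 0)^2. The argument name gamma is illustrative.

```python
import torch

def calibration_loss(gamma: torch.Tensor) -> torch.Tensor:
    """Penalize the calibration temperature when it drops below 1.0.
    Because log(1) = 0, the penalty is simply max(-log(gamma), 0)^2."""
    return torch.clamp(-torch.log(gamma), min=0.0) ** 2
```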

Nonnegative Pressure

The InfoNCE loss function is invariant to the addition of a scalar to every inner product. Thus, the network can choose either to find evidence of “positive” couplings connecting similar objects or “negative” couplings connecting regions that definitely do not belong together. We found that encouraging the network to look for “positive” evidence, as opposed to counterfactual evidence, improved training stability and performance across the key metrics we investigate. To encourage this behavior, we add a small regularizer that pushes inner products between features to be $\geq 0$. More specifically, let $\Omega$ be a set of 250 randomly selected coordinates $(b, b', k, f, t, h, w)$. We then form our non-negativity regularizer:

\mathcal{L}_{NonNeg} = \frac{1}{|\Omega|} \sum_{\Omega} \min\left(s(a_b, v_{b'})[k, f, t, h, w],\ 0\right)^2 \qquad (11)

This regularizer penalizes the similarity tensor when it drops below zero, encouraging features to exhibit positive couplings. We note that other works [23] have also observed the benefits of using only non-negative feature couplings.
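
A minimal sketch of this sampled penalty, assuming the full pairwise similarity tensor is materialized with shape (B, B', K, F, T, H, W); the 250 samples follow the text, while the flattened-index sampling is an illustrative simplification.

```python
import torch

def nonneg_loss(sim_all_pairs: torch.Tensor, n_samples: int = 250) -> torch.Tensor:
    """Penalize negative inner products at randomly sampled coordinates.

    sim_all_pairs: (B, B', K, F, T, H, W) similarities between every audio clip
                   and every image in the batch (layout is an assumption).
    """
    flat = sim_all_pairs.reshape(-1)
    idx = torch.randint(0, flat.numel(), (n_samples,), device=flat.device)
    sampled = flat[idx]
    # min(x, 0)^2: zero for non-negative similarities, quadratic below zero.
    return torch.clamp(sampled, max=0.0).pow(2).mean()
```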

Disentanglement Regularization

DenseAV’s multi-head similarity aggregation allows the network to use its different heads to model different, independent ways that the audio and video modalities can couple together. Interestingly, we find that if we give DenseAV two heads, one naturally specializes to language and the other to more generic sounds. In particular, one head rediscovers the meaning of words by “grounding” them to visual objects, and the other localizes which objects created a given sound. To purify this disentanglement of concepts without supervision, we encourage the different attention heads of our algorithm to specialize. More specifically, we penalize the network when multiple attention heads are simultaneously active. In our experiments we use two attention heads. As before, let $(a_b, v_b)_1^B$ be a batch of $B$ paired audio and visual signals. Our disentanglement loss for two heads is then:

\mathcal{L}_{Dis} = \text{Mean}\left(\left|s(a_b, v_b)[1] \circ s(a_b, v_b)[2]\right|\right) \qquad (12)

where $\circ$ represents elementwise multiplication and $|\cdot|$ is the elementwise absolute value function. The notation $[k]$ mirrors PyTorch slicing and refers to selecting the activations of only the $k$-th attention head. Intuitively, this loss encourages one head to be silent when the other head is active, and it can be viewed as a “cross-term” generalization of the $\ell^2$ regularizer [27] for encouraging activation shrinkage.
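
For a two-head model this loss is a single elementwise product; a sketch follows, assuming the head axis is the second dimension of the similarity volume.

```python
import torch

def disentanglement_loss(sim: torch.Tensor) -> torch.Tensor:
    """Discourage the two similarity heads from firing simultaneously.

    sim: (B, K=2, F, T, H, W) similarity volume for paired clips,
         with the head axis in position 1 (an illustrative layout).
    """
    head_1, head_2 = sim[:, 0], sim[:, 1]
    return (head_1 * head_2).abs().mean()
```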

Total Variation Smoothness

To improve the quality and temporal consistency of discovered audio-visual couplings, we impose a smoothness regularizer, $\mathcal{L}_{TV}$, in the audio-time dimension.

\mathcal{L}_{TV} = \text{Mean}\left(\left(\text{act}(1{:}t{-}1) - \text{act}(2{:}t)\right)^2\right) \qquad (13)

where the activations for a given time slice $[1, t-1]$ are given by:

\text{act}(1{:}t{-}1) = \left(s(a_b, v_b)[:, :, t', :, :]\right)_{t'=1}^{t-1} \qquad (14)

Informally, this regularizer penalizes inner-product strengths that change quickly over time.
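
A sketch of this temporal smoothness term, assuming the audio-time axis sits at position 3 of the similarity volume:

```python
import torch

def tv_loss(sim: torch.Tensor) -> torch.Tensor:
    """Penalize rapid changes of the similarity volume along audio time.

    sim: (B, K, F, T, H, W); axis 3 is audio time (layout is an assumption).
    """
    # Difference of consecutive time slices: act(2:t) - act(1:t-1).
    diffs = sim[:, :, :, 1:] - sim[:, :, :, :-1]
    return diffs.pow(2).mean()
```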

Full Stability Regularizer

Putting these terms together into a single equation we have:

\mathcal{L}_{Stability} = \lambda_{Splice}\mathcal{L}_{Splice} + \lambda_{Cal}\mathcal{L}_{Cal} + \lambda_{NonNeg}\mathcal{L}_{NonNeg} + \lambda_{TV}\mathcal{L}_{TV} \qquad (15)

where $\lambda_{Splice} = 0.01$, $\lambda_{Cal} = 0.1$, $\lambda_{NonNeg} = 0.01$, and $\lambda_{TV} = 0.01$.
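
Combining the pieces with the weights above gives a one-line sketch of the full stability term; the helper functions are the illustrative ones sketched in the preceding subsections, not the released implementation.

```python
def stability_loss(sim, sim_all_pairs, neg_mask, gamma):
    """Weighted sum of the four stability regularizers (illustrative sketch)."""
    return (0.01 * splice_loss(sim, neg_mask)
            + 0.10 * calibration_loss(gamma)
            + 0.01 * nonneg_loss(sim_all_pairs)
            + 0.01 * tv_loss(sim))
```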

17 Regularizer Ablation

Regularizers enabled         Speech Semseg.          Places Acc. @ 10
(Cal, NonNeg, Splice, TV)    mAP       mIoU          I → A      A → I
✓✓✓✓                         48.7%     36.8%         94.2%      94.3%
✓✓✓                          49.1%     37.3%         94.3%      94.1%
✓✓✓                          48.2%     36.8%         94.1%      93.4%
✓✓✓                          48.6%     36.7%         94.8%      94.5%
✓✓✓                          49.0%     36.9%         94.2%      93.7%
                             -         -             -          -