Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Synchronicity

Pritam Sarkar

doi:10.48550/ARXIV.2111.05329

2021, arXiv (Cornell University)

https://doi.org/10.48550/ARXIV.2111.05329

16 pages

Abstract

We present CrissCross, a self-supervised framework for learning audio-visual representations. Our framework introduces a novel notion: in addition to learning the intra-modal and standard 'synchronous' cross-modal relations, CrissCross also learns 'asynchronous' cross-modal relationships. We perform in-depth studies showing that by relaxing the temporal synchronicity between the audio and visual modalities, the network learns strong generalized representations useful for a variety of downstream tasks. To pretrain our proposed solution, we use 3 datasets of varying sizes: Kinetics-Sound, Kinetics400, and AudioSet. The learned representations are evaluated on a number of downstream tasks, namely action recognition, sound classification, and action retrieval. Our experiments show that CrissCross either outperforms or performs on par with the current state-of-the-art self-supervised methods on action recognition and action retrieval with UCF101 and HMDB51, as well as sound classification with ESC50 and DCASE. Moreover, CrissCross outperforms fully-supervised pretraining when pretrained on Kinetics-Sound. The code, pretrained models, and supplementary material are available on the project website.
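
The three kinds of relations named in the abstract (intra-modal, synchronous cross-modal, and asynchronous cross-modal) can be illustrated with a small NumPy sketch. It assumes the loss form given in the paper's training objective: a negative cosine similarity between a predictor output p and a stop-gradient target z, symmetrized over pairs of views and averaged over the three relation types. The function names, the dictionary layout, and the random stand-in embeddings are illustrative assumptions rather than the authors' code, and since NumPy has no autograd the stop-gradient appears only as a comment.

```python
import numpy as np

def neg_cosine(p, z):
    """D(p, z): negative cosine similarity, averaged over the batch.
    In a real training loop, z would be wrapped in a stop-gradient."""
    p = p / np.linalg.norm(p, axis=-1, keepdims=True)
    z = z / np.linalg.norm(z, axis=-1, keepdims=True)
    return float(-(p * z).sum(axis=-1).mean())

def pair_loss(p, z, pairs):
    # Each term is weighted by 1/2 and the sum is halved again,
    # following the form of the paper's loss definitions.
    return sum(0.5 * neg_cosine(p[a], z[b]) for a, b in pairs) / 2.0

# (predictor view, target view); v1/a1 share timestamp t1, v2/a2 share t2.
INTRA = [("v1", "v2"), ("v2", "v1"), ("a1", "a2"), ("a2", "a1")]
SYNC = [("v1", "a1"), ("a1", "v1"), ("v2", "a2"), ("a2", "v2")]
ASYNC = [("v1", "a2"), ("a2", "v1"), ("v2", "a1"), ("a1", "v2")]

def crisscross_loss(p, z):
    """Average of the intra-modal, synchronous, and asynchronous losses."""
    l_intra = pair_loss(p, z, INTRA)
    l_sync = pair_loss(p, z, SYNC)
    l_async = pair_loss(p, z, ASYNC)
    return (l_intra + l_sync + l_async) / 3.0

# Hypothetical stand-in embeddings: batch of 4, feature dimension 16.
rng = np.random.default_rng(0)
keys = ["v1", "v2", "a1", "a2"]
z = {k: rng.normal(size=(4, 16)) for k in keys}  # encoder outputs
p = {k: rng.normal(size=(4, 16)) for k in keys}  # predictor outputs
print(crisscross_loss(p, z))
```

Keeping the pairings as three explicit lists makes the only difference between the synchronous and asynchronous terms visible: the same four views are reused, and just the timestamp of the cross-modal target changes.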

Pritam Sarkar (1, 2), Ali Etemad (1)
(1) Queen's University, Canada; (2) Vector Institute
{pritam.sarkar, ali.etemad}@queensu.ca
https://pritamqu.github.io/CrissCross
arXiv:2111.05329v5 [cs.CV] 25 Nov 2022
Copyright © 2023, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

1 Introduction

In recent years, self-supervised learning has shown great promise in learning strong representations without human-annotated labels (Chen et al. 2020; Chen and He 2021; Caron et al. 2018), and has emerged as a strong competitor for fully-supervised pretraining. There are a number of benefits to such methods. Firstly, they reduce the time and resources required for expensive human annotations and allow researchers to directly use large uncurated datasets for learning meaningful representations. Moreover, the models trained in a self-supervised fashion learn more abstract representations, which are useful for a variety of downstream tasks without needing to train the models from scratch. Given the abundance of videos, their spatio-temporal information-rich nature, and the fact that in most cases they contain both audio and visual streams, self-supervised approaches are strong alternatives to fully-supervised methods for video representation learning. Moreover, the high dimensionality and multi-modal nature of videos make them difficult to annotate, further motivating the use of self-supervision.

The common and standard practice in self-supervised audio-visual representation learning is to learn intra-modal and cross-modal relationships between the audio and visual streams by maintaining tight temporal synchronicity between the two modalities (Alayrac et al. 2020; Korbar, Tran, and Torresani 2018; Alwassel et al. 2020; Asano et al. 2020). Yet, the impact of learning temporally asynchronous cross-modal relationships in the context of self-supervised learning has not been explored. This notion deserves deeper exploration as learning such temporally asynchronous cross-modal relationships may in fact result in increased invariance and distinctiveness in the learned representations.

In this study, in an attempt to explore the notion above, we present CrissCross, a self-supervised framework to learn robust generalized audio-visual representations from videos. CrissCross is built upon SimSiam (Chen and He 2021) to jointly learn self-supervised audio-visual representations through a mixture of intra- and cross-modal optimizations. In addition to learning intra-modal and standard synchronous cross-modal relations, CrissCross introduces the novel idea of learning cross-modal representations through relaxing time-synchronicity between corresponding audio and visual segments. We refer to this as 'asynchronous cross-modal' optimization, a concept that has not been explored in prior works. We use 3 datasets of different sizes: Kinetics-Sound (Arandjelovic and Zisserman 2017), Kinetics400 (Kay et al. 2017), and AudioSet (Gemmeke et al. 2017), to pretrain CrissCross. We evaluate CrissCross on different downstream tasks, namely action recognition, sound classification, and action retrieval. We use 2 popular benchmarks, UCF101 (Soomro, Zamir, and Shah 2012) and HMDB51 (Kuehne et al. 2011), to perform action recognition and retrieval, while ESC50 (Piczak 2015) and DCASE (Stowell et al. 2015) are used for sound classification.

The key contributions of this work are as follows:

• We present a novel framework for multi-modal self-supervised learning by relaxing the audio-visual temporal synchronicity to learn effective generalized representations. Our method is simple, data efficient and less resource intensive, yet learns robust multi-modal representations for a variety of downstream tasks.
• We perform an in-depth study to explore the performance of the proposed framework and its major concepts. Moreover, we perform thorough analyses, both quantitatively and qualitatively, in different setups, showing the benefit of learning asynchronous cross-modal relations.
• Comparing the performance of our method to prior works, CrissCross achieves state-of-the-art results on UCF101, HMDB51, ESC50, and DCASE when pretrained on Kinetics400. Moreover, when trained with AudioSet, CrissCross achieves better or competitive performance versus the current state of the art.
• Lastly, when pretrained on the small-scale Kinetics-Sound (Arandjelovic and Zisserman 2017), CrissCross outperforms fully-supervised pretraining (Ma et al. 2020) by 1.4% and 7.4%, as well as the prior self-supervised state of the art (Ma et al. 2020) by 11.1% and 19.9% on UCF101 and HMDB51 respectively. To the best of our knowledge, very few prior works have attempted to pretrain on such small datasets, and in fact, this is the first time that self-supervised pretraining outperforms full supervision on action recognition in this setup.

We hope our proposed self-supervised method can motivate researchers to further explore the notion of asynchronous multi-modal representation learning.

2 Related Work

2.1 Self-supervised Learning

Self-supervised learning aims to learn generalized representations of data without any human-annotated labels through properly designed pseudo tasks (also known as pretext tasks). Self-supervised learning has recently drawn significant attention in different areas such as image (Chen et al. 2020; Chen and He 2021; Misra and Maaten 2020; Caron et al. 2020; Grill et al. 2020; Caron et al. 2018), video (Morgado, Vasconcelos, and Misra 2021; Morgado, Misra, and Vasconcelos 2021; Alwassel et al. 2020; Asano et al. 2020; Patrick et al. 2021a; Alayrac et al. 2020; Min et al. 2021), and wearable data (Sarkar and Etemad 2020b,a; Sarkar et al. 2020) analysis, among others.

In self-supervised learning, the main focus of interest lies in designing novel pseudo-tasks to learn useful representations. We briefly mention some of the popular categories in the context of self-supervised video representation learning, namely, i) context-based, ii) generation-based, iii) clustering-based, and iv) contrastive learning-based. Various pretext tasks have been proposed in the literature exploring the spatio-temporal context of video frames, for example, temporal order prediction (Lee et al. 2017), puzzle solving (Kim, Cho, and Kweon 2019; Misra, Zitnick, and Hebert 2016; Ahsan, Madhok, and Essa 2019), rotation prediction (Jing et al. 2018), and others. Generation-based video feature learning methods refer to the process of learning feature representations through video generation (Vondrick, Pirsiavash, and Torralba 2016; Tulyakov et al. 2018; Saito, Matsumoto, and Saito 2017), video colorization (Tran et al. 2016), and frame or clip prediction (Mathieu, Couprie, and LeCun 2016; Reda et al. 2018; Babaeizadeh et al. 2018; Liang et al. 2017; Finn, Goodfellow, and Levine 2016), among a few others. Clustering-based approaches (Alwassel et al. 2020; Asano et al. 2020) rely on self-labeling, where data is fed to the network and the extracted feature embeddings are clustered using a classical clustering algorithm such as k-means, followed by using the cluster assignments as the pseudo-labels for training the neural network. The key concept of contrastive learning (Chen and He 2021; Misra and Maaten 2020; Grill et al. 2020; Caron et al. 2020; Morgado, Vasconcelos, and Misra 2021; Patrick et al. 2021a) is that in the embedding space, 'positive' samples should be similar to each other, and 'negative' samples should have discriminative properties. Using this concept, several prior works (Morgado, Vasconcelos, and Misra 2021; Morgado, Misra, and Vasconcelos 2021; Patrick et al. 2021a; Ma et al. 2020) have attempted to learn representations by minimizing the distance between positive pairs and maximizing the distance between negative pairs.

2.2 Audio-Visual Representation Learning

Typically in multi-modal self-supervised learning, multiple networks are jointly trained on the pseudo tasks towards maximizing the mutual information between multiple data streams (Alwassel et al. 2020; Morgado, Vasconcelos, and Misra 2021; Korbar, Tran, and Torresani 2018; Xu et al. 2019; Wang et al. 2021; Khare, Parthasarathy, and Sundaram 2021; Siriwardhana et al. 2020). Below, we briefly discuss some of the prior works (Korbar, Tran, and Torresani 2018; Alwassel et al. 2020; Morgado, Vasconcelos, and Misra 2021; Ma et al. 2020) on audio-visual representation learning. A multi-modal self-supervised task is introduced in AVTS (Korbar, Tran, and Torresani 2018), leveraging the natural synergy between audio-visual data. The network is trained to distinguish whether the given audio and visual sequences are 'in sync' or 'out of sync'. In XDC (Alwassel et al. 2020), the authors introduce a framework to learn cross-modal representations through a self-labeling process. Specifically, cross-modal pseudo-labeling is performed where the pseudo-labels computed from audio embeddings are used to train the visual backbone, while the pseudo-labels computed using visual embeddings are used to train the audio network. A self-supervised learning framework based on contrastive learning is proposed in AVID (Morgado, Vasconcelos, and Misra 2021) to learn audio-visual representations from videos. AVID performs instance discrimination as the pretext task by maximizing the cross-modal agreement of the audio-visual segments in addition to visual similarity. Though earlier works focus on learning cross-modal relations while maintaining a tight synchronicity between the audio and visual data, our proposed framework also considers asynchronous cross-modal relationships in addition to the standard synchronous relations.

3 Method

3.1 Approach

Let $v$ be a sequence of visual frames and $a$ the corresponding audio waveform. We can obtain $n$ augmented views of $v$ as $\{v_i\}_{i=0}^{n}$, and an equal number of augmented views of $a$ as $\{a_i\}_{i=0}^{n}$. A common way to learn individual representations from $v$ and $a$ is to minimize the embedding distances ($D$) between the augmented views of each modality as $\mathcal{L}_{vv} = \sum_{i,j=0, i \neq j}^{n} D(v_i, v_j)$ and $\mathcal{L}_{aa} = \sum_{i,j=0, i \neq j}^{n} D(a_i, a_j)$ respectively in a self-supervised setting (Caron et al. 2020; Bardes, Ponce, and LeCun 2021; Chen and He 2021; Grill et al. 2020; Niizumi et al. 2021). Further, to learn multi-modal representations from $\{v, a\}$, a standard technique is to simply optimize a joint intra-modal loss $\mathcal{L}_{intra} = \mathcal{L}_{vv} + \mathcal{L}_{aa}$. Prior works (Alwassel et al. 2020; Morgado, Vasconcelos, and Misra 2021; Morgado, Misra, and Vasconcelos 2021) have demonstrated that in addition to $\mathcal{L}_{intra}$, a cross-modal optimization can be performed directly across visual and audio segments to further learn strong joint representations as $\mathcal{L}_{av} = \sum_{i=0}^{n} D(a_i, v_i)$.

Figure 1: Distribution of the learned representations with and without the asynchronous cross-modal optimization.

All of these learning procedures maintain a tight synchronicity between the two modalities, given that both $a_i$ and $v_i$ are segmented from the same timestamps. We conjecture, however, that relaxing the synchronicity between modalities by a reasonable margin will enable more generalized representations to be learned across time, to achieve better and more robust performance. Accordingly, we introduce an asynchronous cross-modal loss $\mathcal{L}_{async}$, which exploits the relationship between audio and visual segments sampled at different timestamps. We define the final objective as $\mathcal{L}_{CrissCross}$, which exploits the combination of $\mathcal{L}_{intra}$, synchronous $\mathcal{L}_{av}$ (which we refer to as $\mathcal{L}_{sync}$), and $\mathcal{L}_{async}$ in an attempt to learn more generalized representations. While we present the detailed experiments and analysis of our proposed approach in the subsequent sections of the paper, here we perform a quick visualization to demonstrate the benefits of this concept.
Figure 1 depicts the distributions of representations learned with and without $\mathcal{L}_{async}$, demonstrating that relaxing the tight synchronicity indeed helps in widening the distribution of the learned representations, which could result in improved performance in a wide variety of downstream tasks.

Figure 2: Our proposed framework. CrissCross learns strong audio-visual representations by exploiting intra-modal, as well as sync. and async. cross-modal relations. (Diagram labels: RGB Frames, Visual Encoder, Mel-spectrogram, Audio Encoder; legend: Intra-modal Loss ($\mathcal{L}_{intra}$), Synchronous Cross-modal Loss ($\mathcal{L}_{sync}$), Asynchronous Cross-modal Loss ($\mathcal{L}_{async}$).)

3.2 Training Objective

To accomplish the notion above, let's define two neural networks, a visual encoder $f_v$ and an audio encoder $f_a$. Here, $f_v$ and $f_a$ are composed of convolutional backbones and MLP projection heads. Moreover, we adopt a Siamese (Bromley et al. 1993) representation learning setup, where the networks share weights on two or more inputs. Next, we obtain two augmented views of $v = \{v_t\}_{t=0}^{T}$, denoted by $v_1$ and $v_2$, defined as $\{v_t\}_{t=t_1}^{t_1+t_v}$ and $\{v_t\}_{t=t_2}^{t_2+t_v}$ respectively. Here, $v_1$ and $v_2$ have a duration of $t_v$, and are sampled at times $t_1$ and $t_2$ respectively. Note that $v_1$ and $v_2$ are augmented differently. Similarly, two augmented views of $a = \{a_t\}_{t=0}^{T}$ can be obtained as $a_1$ and $a_2$, defined as $\{a_t\}_{t=t_1}^{t_1+t_a}$ and $\{a_t\}_{t=t_2}^{t_2+t_a}$, respectively. Next, to learn intra-modal representations, the distance between $f_v(v_1)$ and $f_v(v_2)$, as well as between $f_a(a_1)$ and $f_a(a_2)$, can be minimized to train $f_v$ and $f_a$ respectively. However, such a naive approach would lead to mode collapse as pointed out in (Grill et al. 2020; Niizumi et al. 2021; Chen and He 2021; Caron et al. 2020). To tackle this, we follow the technique proposed in (Chen and He 2021). In particular, we minimize the cosine embedding distance $D$ of two output vectors $p$ and $S(z)$, where $p$ is the output vector obtained from the predictor head and $z$ represents the output vector obtained from the feature encoder followed by the stop-gradient operation. Here, the predictor head consists of an MLP head, which is used as an identity mapping, while the stop-gradient operation prevents the model from collapsing to a degenerated solution (Chen and He 2021). Here, $D$ is defined as:

$$D(p, z) = -\frac{p}{\|p\|_2} \cdot \frac{z}{\|z\|_2}. \tag{1}$$

We use $h_v$ and $h_a$ as the predictor heads corresponding to visual and audio representations. Next, we obtain $p_{v_1}$ and $z_{v_2}$ as $h_v(f_v(v_1))$ and $S(f_v(v_2))$. Similarly, $p_{a_1}$ and $z_{a_2}$ are obtained as $h_a(f_a(a_1))$ and $S(f_a(a_2))$. To calculate the symmetrized loss, we further obtain $p_{v_2}$ and $z_{v_1}$, as well as $p_{a_2}$ and $z_{a_1}$. Therefore, to learn the intra-modal relations, we optimize the intra-modal loss $\mathcal{L}_{intra}$ defined as:

$$\mathcal{L}_{intra} = \Big(\tfrac{1}{2} D(p_{v_1}, S(z_{v_2})) + \tfrac{1}{2} D(p_{v_2}, S(z_{v_1})) + \tfrac{1}{2} D(p_{a_1}, S(z_{a_2})) + \tfrac{1}{2} D(p_{a_2}, S(z_{a_1}))\Big)/2. \tag{2}$$

Next, to learn synchronous cross-modal relations, we optimize the synchronous cross-modal loss $\mathcal{L}_{sync}$, defined as:

$$\mathcal{L}_{sync} = \Big(\tfrac{1}{2} D(p_{v_1}, S(z_{a_1})) + \tfrac{1}{2} D(p_{a_1}, S(z_{v_1})) + \tfrac{1}{2} D(p_{v_2}, S(z_{a_2})) + \tfrac{1}{2} D(p_{a_2}, S(z_{v_2}))\Big)/2. \tag{3}$$

Additionally, based on our earlier intuition, to relax the temporal synchronicity, we minimize the distance between the audio and visual segments originating from different timestamps. We define the asynchronous cross-modal loss $\mathcal{L}_{async}$ as:

$$\mathcal{L}_{async} = \Big(\tfrac{1}{2} D(p_{v_1}, S(z_{a_2})) + \tfrac{1}{2} D(p_{a_2}, S(z_{v_1})) + \tfrac{1}{2} D(p_{v_2}, S(z_{a_1})) + \tfrac{1}{2} D(p_{a_1}, S(z_{v_2}))\Big)/2. \tag{4}$$

Finally, to exploit intra-modal, as well as synchronous and asynchronous cross-modal relations, we define the final objective function as:

$$\mathcal{L}_{CrissCross} = \tfrac{1}{3}\big(\mathcal{L}_{intra} + \mathcal{L}_{sync} + \mathcal{L}_{async}\big). \tag{5}$$

We present the proposed CrissCross framework in Figure 2. Please note, for the sake of simplicity, we omit showing the stop-grad and predictor head connections in Figure 2. We present the pseudocode in Appendix A.

4 Experiments and Results

The details of the experiment setup and the findings of our thorough ablation studies investigating the major concepts of our proposed framework are presented here. Additionally, we extensively investigate a wide range of audio-visual augmentation techniques capable of learning strong audio-visual representations within our framework. The details are as follows.

4.1 Experiment Setup

Following the standard practice among the prior works (Morgado, Vasconcelos, and Misra 2021; Alwassel et al. 2020; Asano et al. 2020; Patrick et al. 2021a; Ma et al. 2020), we use Kinetics-Sound, Kinetics400, and AudioSet for pretraining. Additionally, Kinetics400, UCF101, HMDB51, ESC50 and DCASE are used for downstream evaluation. We use R(2+1)D (Tran et al. 2018) and ResNet (He et al. 2016) as the visual and audio backbones. To pretrain the network in a self-supervised fashion with audio-visual inputs, we downsample the visual streams to 16 frames per second and feed 8 frames of resolution 112 × 112 to the visual encoder. Next, we downsample the audio signals to 16 kHz, and segment them into 2-second segments. We transform the segmented raw audio waveforms to mel-spectrograms using 80 mel filters, setting the hop size to 10 milliseconds and the FFT window length to 1024. Finally, we feed spectrograms of shape 80 × 200 to the audio encoder. We use the Adam (Kingma and Ba 2015) optimizer with a cosine learning rate scheduler (Loshchilov and Hutter 2017) to pretrain the encoders and use a fixed learning rate to train the predictors. Please note that during the design exploration, we use Kinetics-Sound for pretraining, while the downstream evaluations are performed on UCF101 and ESC50 unless stated otherwise. We perform linear evaluations using 8 frames of visual input and 2 seconds of audio input. Next, a linear SVM classifier is trained using the extracted features, and we report the top-1 accuracy for sample-level predictions. We provide the additional details of the experiment setup, datasets, architectures, and evaluation protocols in the Appendix.

4.2 Ablation Study

We present the ablation results in Tables 1 and 2 to show the improvements made by optimizing the asynchronous cross-modal loss in addition to the intra-modal and synchronous cross-modal losses. First, using Kinetics-Sound, we train the framework in uni-modal setups, denoted as $\mathcal{L}_{v_1 v_2}$ and $\mathcal{L}_{a_1 a_2}$. We report the top-1 accuracy of UCF101 and ESC50 as 69.1% and 62.0% respectively. Next, we train the network in a multi-modal setup, where we find that $\mathcal{L}_{sync}$ outperforms the other multi-modal variants including $\mathcal{L}_{intra}$ and $\mathcal{L}_{async}$, as well as the uni-modal baselines $\mathcal{L}_{v_1 v_2}$ and $\mathcal{L}_{a_1 a_2}$. Further study shows that combining all the multi-modal losses improves the model performance. $\mathcal{L}_{CrissCross}$ outperforms $\mathcal{L}_{sync}$ by 4.7% and 3.2% on action recognition and sound classification, respectively.

Method                                                                  UCF101   ESC50
$\mathcal{L}_{v_1 v_2}$                                                 69.1     -
$\mathcal{L}_{a_1 a_2}$                                                 -        62.0
$\mathcal{L}_{intra}$                                                   69.7     71.8
$\mathcal{L}_{sync}$                                                    70.1     75.8
$\mathcal{L}_{async}$                                                   69.1     74.8
$\mathcal{L}_{sync} + \mathcal{L}_{intra}$                              73.8     78.0
$\mathcal{L}_{sync} + \mathcal{L}_{async}$                              69.1     74.8
$\mathcal{L}_{async} + \mathcal{L}_{intra}$                             72.4     75.3
$\mathcal{L}_{v_1 v_2} + \mathcal{L}_{sync} + \mathcal{L}_{async}$      71.3     78.5
$\mathcal{L}_{a_1 a_2} + \mathcal{L}_{sync} + \mathcal{L}_{async}$      70.8     75.3
$\mathcal{L}_{CrissCross}$                                              74.8     79.0

Table 1: We present the top-1 accuracy of CrissCross and its ablation variants, pretrained on Kinetics-Sound.

Pretrain   Downstream   w/o $\mathcal{L}_{async}$   w/ $\mathcal{L}_{async}$
KS         UCF101       73.8 (↓ 1.0)                74.8
KS         ESC50        78.0 (↓ 1.0)                79.0
K400       UCF101       75.8 (↓ 4.1)                79.9
K400       ESC50        78.5 (↓ 3.5)                82.0
K400       KS (a)       43.2 (↓ 3.9)                47.1
K400       KS (v)       53.3 (↓ 2.4)                55.7
K400       KS (a+v)     65.0 (↓ 1.7)                66.7

Table 2: Impact of $\mathcal{L}_{async}$ optimization in different pretraining and evaluation setups. Here, K400: Kinetics400, KS: Kinetics-Sound.

Further, to study the effect of $\mathcal{L}_{async}$ in particular, we perform ablation studies using small-scale Kinetics-Sound and large-scale Kinetics400. We present the results in Table 2, where we observe that $\mathcal{L}_{async}$ improves the performance on both the pretraining datasets.
In particular, while pretrained on Kinetics400, optimizing $\mathcal{L}_{async}$ in addition to $\mathcal{L}_{sync}$ and $\mathcal{L}_{intra}$ improves the performance by 4.1% and 3.5% on action recognition and sound classification respectively, showing the significance of asynchronous cross-modal optimization in a multi-modal setup. While pretrained on Kinetics-Sound, adding $\mathcal{L}_{async}$ improves the performance by 1% on both UCF101 and ESC50. Interestingly, we find that learning the asynchronous cross-modal loss significantly improves the model performance when pretrained on large-scale Kinetics400. Our intuition is that as Kinetics-Sound consists of a few hand-picked classes which are prominently manifested in both audio and visual modalities, the performance gain due to $\mathcal{L}_{async}$ is less prominent. However, Kinetics400 is considerably larger in scale and comprises highly diverse action classes which are not always very prominent both audibly and visually. It therefore benefits more from the generalized representations learned by asynchronous cross-modal optimization. Moreover, to demonstrate the benefit of optimizing $\mathcal{L}_{async}$ throughout the pretraining process, we present the top-1 accuracy vs. pretraining epoch in Figure 4. It shows that $\mathcal{L}_{async}$ significantly improves the model performance throughout the pretraining.

Multi-modal fusion. Next, we investigate if learning asynchronous cross-modal relations helps in multi-modal fusion. To test this, we use Kinetics-Sound as the downstream dataset and Kinetics400 as the pretraining dataset. We choose Kinetics-Sound for downstream evaluation as it consists of action classes that are represented prominently in both audio and visual domains. The results are presented in Table 2, where it is shown that learning asynchronous cross-modal relations improves multi-modal fusion by 1.7%. Additionally, we show the linear evaluation results obtained from the uni-modal feature representations for reference. It shows that optimizing $\mathcal{L}_{async}$ improves the action classification accuracy by 2.4% and 3.9% using visual and audio representations, respectively.

Qualitative analysis. Lastly, to perform a qualitative analysis on the impact of $\mathcal{L}_{async}$, we visualize the saliency maps obtained from the models when pretrained with and without the asynchronous loss. In this experiment, we directly use the models pretrained on Kinetics400 and use Grad-CAM (Omeiza et al. 2019) to visualize randomly selected samples from Kinetics400. A few examples are presented in Figure 3, where we observe that learning asynchronous relations helps the model focus better on the salient information. Specifically, we notice that optimizing $\mathcal{L}_{async}$ helps in correctly locating the sound sources on the visual streams, as shown by the examples of 'dribbling basketball', 'laughing', 'tapping guitar', etc.

Figure 3: Visualization of saliency maps while pretrained without (left) and with (right) asynchronous loss. (Example classes shown: blowing nose, dribbling basketball, singing, tapping pen, laughing, tapping guitar.)

Figure 4: Left: Linear eval. top-1 acc. vs. pretraining epochs. Right: Exploring different temporal relaxation techniques. (Legend: w/ asynchronous loss, w/o asynchronous loss; UCF101, ESC50. Vertical axis: Acc. (%). Horizontal axis of the right panel: None, Mild, Medium, Mixed, Extreme.)

4.3 Exploring Relaxed Time-synchronicity

Audio and visual modalities from the same source clip generally maintain a very strong correlation, which makes them suitable for multi-modal representation learning as one modality can be used as a supervisory signal for the other in a self-supervised setup. However, our intuition behind CrissCross is that these cross-modal temporal correlations do not necessarily need to follow a strict frame-wise coupling. Instead, we hypothesize that relaxing cross-modal temporal synchronicity to some extent can help in learning more generalized representations.

To facilitate this idea within CrissCross, we exploit 5 different temporal sampling methods to explore varying amounts of temporal synchronicity when learning cross-modal relationships. (i) None: where both the audio and visual segments are sampled from the exact same time window. (ii) Mild: where the two views of the audio-visual segments share 50% overlap amongst them. (iii) Medium: where adjacent frame sequences and audio segments are sampled. (iv) Extreme: in which we sample one view from the first half of the source clip, while the other view is sampled from the second half of the source clip. (v) Mixed: where the two audio-visual segments are sampled in a temporally random manner. The results presented in Figure 4 show that the mild relaxation works best for both action recognition and sound classification. Interestingly, we find that medium relaxation shows worse performance in comparison to others, whereas extreme relaxation works somewhat well in our setup.

4.4 Exploring Design Choices

Predictor. Our empirical study shows that the predictor head plays an important role in effectively training the audio and visual encoders to learn good representations. The predictor architecture is similar to (Chen and He 2021). For the sake of completeness, we provide the details of the predictor head in Appendix F. We explore (i) different learning rates, and (ii) using a common vs. a separate predictor in the multi-modal setup. It should be noted that none of the variants cause a collapse, even though we notice considerable differences in performance. We present the findings below.

Following (Chen and He 2021), we use a constant learning rate for the predictors. However, unlike (Chen and He 2021), where the predictor learning rate is the same as the base learning rate of the encoder, we find that a higher predictor learning rate helps the network to learn better representations. In particular, setting the predictor learning rate to be the same as the base learning rate results in unstable training, and the loss curve shows oscillating behavior. We empirically find that setting the predictor learning rate to 10 times the base learning rate works well. We present the results in Table 3 and training curves in Figure 5.

Next, we evaluate whether the framework can be trained with a common predictor head instead of separate predictor heads (default setup). In simple terms, one predictor head would work towards identity mapping for both audio and visual feature vectors. To test this, l2-normalized feature vectors $f_v(v)$ and $f_a(a)$ are fed to the predictor, which are then used in the usual manner to optimize the cost function. The results are presented in Table 3. We observe that though such a setup works somewhat well, having separate predictors is beneficial for learning better representations. The training curves in Figure 5 show that using a common predictor head makes the training loss saturate very quickly, ultimately yielding worse performance than separate predictor heads.

Projector. We present a comparative study of projection heads with 2 layers vs. 3 layers (default setup). We notice 2.4% and 4% improvements in top-1 accuracy when using 3 layers instead of 2 on action recognition and sound classification respectively (please see Table 3). The architecture details are presented in Appendix F.

           lr_p = lr_b   comm. pred.   2 layers proj.   default
UCF101     59.0          73.6          72.4             74.8
ESC50      62.3          75.3          75.0             79.0

Table 3: A comparative study of different predictor and projector setups. Here, lr_b: base LR and lr_p: pred LR.

               Pretraining Dataset
               KS (22K)   K400 (240K)   AS (1.8M)
HMDB51         45.7       50.0          56.2
UCF101         78.1       83.9          87.7
Kinetics400    39.0       44.5          50.1
ESC50          82.8       86.8          90.5
DCASE          93.0       96.0          97.0

Table 4: We present the top-1 acc. of linear evaluation on action recognition and sound classification.

Figure 5: Left: Pretraining loss curves during predictor head design exploration. Right: Linear evaluation top-1 accuracy vs. pretraining dataset (size).

4.5 Exploring Audio-Visual Augmentations

We perform an in-depth study to explore the impact of different audio and visual augmentations.

Visual Augmentations. We explore a wide range of visual augmentations. As a starting point, we adopt the basic spatial augmentations used in (Morgado, Vasconcelos, and Misra 2021), which consist of Multi-Scale Crop (MSC), Horizontal Flip (HF), and Color Jitter (CJ). Additionally, we explore other augmentations, namely Gray Scale (GS), Gaussian Blur (GB) (Chen et al. 2020), and Cutout (C) (DeVries and Taylor 2017), which show great performance in image-based self-supervised learning (Chen et al. 2020; Van Gansbeke et al. 2020). We explore almost all the possible combinations of different visual augmentations in a uni-modal setup and present the results in Table 5. The results show that strong augmentations improve the top-1 accuracy by 6.8% in comparison to the basic augmentations used in (Morgado, Vasconcelos, and Misra 2021).

Temporal Consistency of Spatial Augmentations. While investigating different spatial augmentations, we are also interested to know if the spatial augmentations should be consistent at the frame level or whether they should be random (i.e., vary among consecutive frames within a sequence). We refer to these concepts as temporally consistent or temporally random. We perform an experiment where we apply MSC-HF-CJ-GS randomly at the frame level and compare the results to applying the same augmentations consistently across all the frames of a sequence. Our results show that maintaining temporal consistency in spatial augmentations across consecutive frames is beneficial, which is in line with the findings in (Qian et al. 2021). Specifically, temporally random augmentations result in a top-1 accuracy of 53.69%, whereas the same augmentations applied in a temporally consistent manner result in 68.09%.

Audio Augmentations. Similar to visual augmentations, we thoroughly investigate a variety of audio augmentations. Our audio augmentations include Volume Jitter (VJ), Time and Frequency Masking (Mask) (Park et al. 2019), Random Crop (RC) (Niizumi et al. 2021), and Time Warping (TW) (Park et al. 2019). We also explore almost all the possible combinations of these augmentations and present the results in Table 5. Our findings show that time-frequency masking and random crop improve the top-1 accuracy by 17.25% compared to the base variant. We also notice that time warping doesn't improve performance and is also quite computationally expensive. Hence, going forward we do not use time warping during pretraining.

Audio-Visual Augmentations. We conduct further experiments on a few combinations of augmentations in a multi-modal setup. We pick the top-performing augmentations obtained from the uni-modal variants and apply them concurrently. The results are presented in Table 5, where we find that the results are consistent with the uni-modal setups, as the combination of MSC-HF-CJ-GS-GB-C and VJ-M-RC performs the best in comparison to the other combinations. Finally, we summarize the augmentation schemes used for pretraining and evaluation in Tables S4 and S3.

Uni-modal setup
Visual                UCF101      Audio      ESC50
MSC-HF-CJ             62.3        VJ         44.8
MSC-HF-CJ-GS          68.1        VJ-M       49.5
MSC-HF-CJ-GS-C        68.3        VJ-M-TW    49.5
MSC-HF-CJ-GS-GB       68.7        VJ-M-RC    62.0
MSC-HF-CJ-GS-GB-C     69.1

Multi-modal setup
Visual + Audio                    UCF101     ESC50
MSC-HF-CJ-GS-C + VJ-M-RC          73.9       79.0
MSC-HF-CJ-GS-GB + VJ-M-RC         73.5       79.0
MSC-HF-CJ-GS-GB-C + VJ-M-RC       74.8       79.0

Table 5: Exploring audio-visual augmentations.

Method               Compute    Backbone (#Params (M))   U101   H51

Pretrained Dataset: Kinetics-Sound (Finetune input 32 × 224²)
CM-ACC (2020)        40 GPUs    3D-ResNet18 (33.4)       77.2   40.6
CrissCross           4 GPUs     R(2+1)D-18 (15.4)        88.3   60.5
Supervised (2020)    -          3D-ResNet18 (33.4)       86.9   53.1

Pretrained Dataset: Kinetics400 (Finetune input 8 × 224²)
XDC (2020)           64 GPUs    R(2+1)D-18 (31.5)        74.2   39.0
AVID (2021)          64 GPUs    R(2+1)D-18 (15.4)        83.7   49.5
Robust-xID (2021)    8 GPUs     R(2+1)D-18 (15.4)        81.9   49.5
CrissCross           8 GPUs     R(2+1)D-18 (15.4)        86.9   54.3

Pretrained Dataset: Kinetics400 (Finetune input 32 × 224²)
SeLaVi (2020)        64 GPUs    R(2+1)D-18 (31.5)        83.1   47.1
XDC (2020)           64 GPUs    R(2+1)D-18 (31.5)        86.8   52.6
* CM-ACC (2020)      40 GPUs    3D-ResNet18 (33.4)       90.2   61.8
AVID (2021)          64 GPUs    R(2+1)D-18 (15.4)        87.5   60.8
GDT (2021a)          64 GPUs    R(2+1)D-18 (31.5)        90.9   62.3
CMAC (2021)          8 GPUs     R(2+1)D-18 (31.5)        90.3   61.1
Robust-xID (2021)    8 GPUs     R(2+1)D-18 (15.4)        85.6   55.0
CrissCross           8 GPUs     R(2+1)D-18 (15.4)        91.5   64.7
Supervised (2021a)   -          R(2+1)D-18 (31.5)        95.0   74.0

Pretrained Dataset: AudioSet (Finetune input 8 × 224²)
XDC (2020)           64 GPUs    R(2+1)D-18 (31.5)        84.9   48.8
AVID (2021)          64 GPUs    R(2+1)D-18 (15.4)        88.6   57.6
CrissCross           8 GPUs     R(2+1)D-18 (15.4)        89.4   58.3

Pretrained Dataset: AudioSet (Finetune input 32 × 224²)
XDC (2020)           64 GPUs    R(2+1)D-18 (31.5)        93.0   63.7
MMV (2020)           32 TPUs    R(2+1)D-18 (31.5)        91.5   70.1
CM-ACC (2020) **     40 GPUs    R(2+1)D-18 (33.4)        93.5   67.2
BraVe (2021)         16 TPUs    R(2+1)D-18 (31.5)        93.6   70.8
AVID (2021)          64 GPUs    R(2+1)D-18 (15.4)        91.5   64.7
CrissCross           8 GPUs     R(2+1)D-18 (15.4)        92.4   67.4
Supervised (2021)    -          R(2+1)D-18 (31.5)        96.8   75.9

* refers to 240K samples from Kinetics700. ** pretrained with very high temporal resolutions (2 views of 32 & 128 frames) compared to others (8/16/32).

Table 6: SOTA comparison on action recognition.

4.7 Comparison to the State-of-the-Art

Action Recognition. In line with (Alwassel et al. 2020; Asano et al. 2020; Morgado, Vasconcelos, and Misra 2021; Patrick et al. 2021a; Ma et al.
2020), we benchmark Criss- 4.6 Linear Evaluation and Scalability Cross using UCF101 and HMDB51 on action recognition. To evaluate the quality of the representations learned For a fair comparison to earlier works, we adopt 2 setups through pretraining, we perform linear evaluation on ac- for finetuning, once with 8 frames, and the other with 32 tion recognition (HMDB51, UCF101, and Kinetics400) and frames. In both these setups, we use a spatial resolution of sound classification (ESC50 and DCASE). As mentioned, 2242 . We tune the model using the split-1 of both datasets we use 3 different-sized datasets, i.e., Kinetics-Sound, Ki- and report the top-1 accuracy averaged over all the splits. netics400, and AudioSet for pretraining. In Table 4 we re- We notice large variability in experimental setups in the lit- port the top-1 accuracies averaged over all the splits. More- erature in terms of different backbones (e.g., deeper Con- over, to evaluate the scalability of CrissCross, we plot the vNets, Transformer-based architectures, etc.) (Piergiovanni, linear evaluation results against the size of pretraining data Angelova, and Ryoo 2020; Qian et al. 2021; Patrick et al. as shown in Figure 5. We notice a steady improvement 2021b), pretraining inputs (e.g., the addition of optical flow in performance as the dataset size increases, which shows or text in addition to audio-visual data, etc.) (Piergiovanni, CrissCross can likely be scaled on even larger datasets like Angelova, and Ryoo 2020; Qian et al. 2021; Alayrac et al. IG65M (Ghadiyaram, Tran, and Mahajan 2019). Please note 2020), and pretraining datasets, making it impractical to that in order to evaluate scalability we choose linear eval- compare to all the prior works. Following the inclusion cri- uation accuracy instead of full-finetuning as it gives more teria of earlier works (Patrick et al. 2021a; Alwassel et al. 
accurate measurements of learned representations obtained 2020; Morgado, Vasconcelos, and Misra 2021), we compare through self-supervised pretraining. In Figure 5, we do not CrissCross with methods that use similar backbones, inputs, include DCASE as it is a very small dataset (total of 100 and pretraining datasets. recordings spread over 10 classes) and already reached very The comparison of CrissCross with recent works is pre- high accuracy on both Kinetics400 and AudioSet. sented in Table 6. When pretrained with Kinetics400, Criss- UCF101 HMDB51 pretraining on action recognition using the same small-scale Method R@1 R@5 R@20 R@1 R@5 R@20 pretraining dataset, showing that our method performs well on limited pretraining data. ST Order (2018) 25.7 36.2 49.2 - - - SpeedNet (2020) 13.0 28.1 49.5 - - - Action Retrieval. In addition to full finetuning, we also Clip Order (2019) 14.1 30.3 51.1 7.6 22.9 48.8 compare the performance of CrissCross in an unsupervised VCP (2020) 18.6 33.6 53.5 7.6 24.4 53.6 setup. Following prior works (Morgado, Misra, and Vascon- VSP (2020) 24.6 41.9 76.9 10.3 26.6 54.6 celos 2021; Patrick et al. 2021a; Asano et al. 2020), we CoCLR (2020) 55.9 70.8 82.5 26.1 45.8 69.7 perform action retrieval using the split-1 of both UCF101 SeLaVi (2020) 52.0 68.6 84.5 24.8 47.6 75.5 Robust-xID (2021) 60.9 79.4 90.8 30.8 55.8 79.7 and HMDB51. The results are presented in Table 7 shows GDT (2021a) 57.4 73.4 88.1 25.4 51.4 75.0 that CrissCross outperforms the current state-of-the-arts on UCF101 while achieving competitive results for HMDB51. CrissCross 63.8 78.7 89.9 26.4 50.5 77.7 Sound Classification. We use two popular benchmarks Table 7: SOTA comparison on action retrieval. ESC50 and DCASE to perform sound classification. We find large variability of experimental setups in the literature for evaluating audio representations. 
For instance, different ESC50 DCASE backbones, input lengths, datasets, and evaluation protocols Method K400 AS K400 AS (linear evaluation, full-finetuning) have been used, making it impractical to compare to all the prior works. Following (Re- AVTS (2018) 76.7 80.6 91 93 casens et al. 2021; Alayrac et al. 2020), we perform linear XDC (2020) 78.0 84.8 91 95 evaluations using 5-second inputs on ESC50 and 1-second AVID (2021) 79.1 89.1 93 96 MMV (2020) - 85.6 - - input for DCASE. As presented in Table 8, CrissCross out- BraVe (2021) - 90.4 - - performs current state-of-the-art AVID (Morgado, Vascon- CrissCross 86.8 90.5 96 97 celos, and Misra 2021) and BraVe (Recasens et al. 2021) on ESC50, while pretrained on Kinetics400 and AudioSet Table 8: SOTA comparison on sound classification. respectively. Additionally, CrissCross sets new state-of-the- art by outperforming all the prior works on DCASE when pretrained on both Kinetics400 and AudioSet. Cross outperforms all the prior works by considerable mar- gins on UCF101 and HMDB51 in both the fine-tuning se- 5 Summary tups. Moreover, CrissCross outperforms the current state- We propose a novel self-supervised framework to learn of-the-art AVID (Morgado, Vasconcelos, and Misra 2021), audio-visual representations by exploiting intra-modal, as when pretrained on AudioSet and fine-tuned with 8-frame well as, synchronous and asynchronous cross-modal rela- inputs, on both the UCF101 and HMDB51. When fine-tuned tionships. We conduct a thorough study investigating the with 32-frame inputs, CrissCross achieves competitive re- major concepts of our framework. Our findings show that re- sults amongst the leading methods. We note that some of laxation of cross-modal temporal synchronicity is beneficial the prior works show slightly better performance compared for learning effective audio-visual representations. These to ours in some settings. 
We conjecture this to be due to representations can then be used for a variety of downstream the use of higher spatio-temporal resolution pretraining in- tasks including action recognition, sound classification, and puts in these models. E.g., BraVe (Recasens et al. 2021) is action retrieval. pretrained with 2 views of 32 × 1122 and 128 × 1122 , and the input size for MMV (Alayrac et al. 2020) and CM-ACC Acknowledgments We are grateful to the Bank of Mon- (Ma et al. 2020) are 32 × 2242 and 16 × 2242 , respectively. treal and Mitacs for funding this research. We are thankful In comparison, CrissCross is pretrained with visual inputs to SciNet HPC Consortium for helping with the computation of size 8×1122 . However, we expect the performance of our resources. model to improve further by using such higher resolutions, given the trend shown in (Recasens et al. 2021). References In addition to the commonly used Kinectis400 and Au- Ahsan, U.; Madhok, R.; and Essa, I. 2019. Video jigsaw: dioSet, we further evaluate CrissCross while pretrained on Unsupervised learning of spatiotemporal context for video the small-scale Kinetics-Sound. Here, we observe signifi- action recognition. In WACV, 179–189. cant improvements compared to the current state-of-the-art CM-ACC (Ma et al. 2020) on both UCF101 (88.3 vs. 77.2) Alayrac, J.-B.; Recasens, A.; Schneider, R.; Arandjelovic, and HMDB51 (60.5 vs. 40.6). Additionally, CrissCross out- R.; Ramapuram, J.; De Fauw, J.; Smaira, L.; Dieleman, S.; performs fully-supervised pretraining by 1.4% and 7.4% on and Zisserman, A. 2020. Self-Supervised MultiModal Ver- UCF101 and HMDB51 respectively when both the fully- satile Networks. NeurIPS, 2(6): 7. supervised and self-supervised methods are pretrained on Alwassel, H.; Mahajan, D.; Korbar, B.; Torresani, L.; Kinetics-Sound. To the best of our knowledge, this is the Ghanem, B.; and Tran, D. 2020. 
Self-Supervised Learning first time that self-supervision outperforms full-supervised by Cross-Modal Audio-Video Clustering. NeruIPS, 33. Arandjelovic, R.; and Zisserman, A. 2017. Look, listen and He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learn. In ICCV, 609–617. learning for image recognition. In CVPR, 770–778. Asano, Y. M.; Patrick, M.; Rupprecht, C.; and Vedaldi, A. Jing, L.; Yang, X.; Liu, J.; and Tian, Y. 2018. Self-supervised 2020. Labelling unlabelled videos from scratch with multi- spatiotemporal feature learning via video rotation predic- modal self-supervision. In NeurIPS. tion. arXiv preprint arXiv:1811.11387. Babaeizadeh, M.; Finn, C.; Erhan, D.; Campbell, R. H.; and Kay, W.; Carreira, J.; Simonyan, K.; Zhang, B.; Hillier, C.; Levine, S. 2018. Stochastic Variational Video Prediction. In Vijayanarasimhan, S.; Viola, F.; Green, T.; Back, T.; Natsev, ICLR. P.; et al. 2017. The kinetics human action video dataset. Bardes, A.; Ponce, J.; and LeCun, Y. 2021. Vi- arXiv preprint arXiv:1705.06950. creg: Variance-invariance-covariance regularization for self- Khare, A.; Parthasarathy, S.; and Sundaram, S. 2021. Self- supervised learning. arXiv preprint arXiv:2105.04906. Supervised learning with cross-modal transformers for emo- Benaim, S.; Ephrat, A.; Lang, O.; Mosseri, I.; Freeman, tion recognition. In SLT, 381–388. W. T.; Rubinstein, M.; Irani, M.; and Dekel, T. 2020. Speed- Kim, D.; Cho, D.; and Kweon, I. S. 2019. Self-supervised net: Learning the speediness in videos. In CVPR, 9922– video representation learning with space-time cubic puzzles. 9931. In AAAI, volume 33, 8545–8552. Bromley, J.; Guyon, I.; LeCun, Y.; Säckinger, E.; and Shah, Kingma, D. P.; and Ba, J. 2015. Adam: A Method for R. 1993. Signature verification using a” siamese” time delay Stochastic Optimization. In ICLR. neural network. Advances in neural information processing Korbar, B.; Tran, D.; and Torresani, L. 2018. Cooperative systems, 6. 
learning of audio and video models from self-supervised Buchler, U.; Brattoli, B.; and Ommer, B. 2018. Improv- synchronization. In NeruIPS, 7774–7785. ing spatiotemporal self-supervision by deep reinforcement Kuehne, H.; Jhuang, H.; Garrote, E.; Poggio, T.; and Serre, learning. In ECCV. T. 2011. HMDB: a large video database for human motion Caron, M.; Bojanowski, P.; Joulin, A.; and Douze, M. 2018. recognition. In ICCV, 2556–2563. Deep clustering for unsupervised learning of visual features. In ECCV, 132–149. Lee, H.-Y.; Huang, J.-B.; Singh, M.; and Yang, M.-H. 2017. Unsupervised representation learning by sorting sequences. Caron, M.; Misra, I.; Mairal, J.; Goyal, P.; Bojanowski, P.; In CVPR. and Joulin, A. 2020. Unsupervised Learning of Visual Fea- tures by Contrasting Cluster Assignments. In NeurIPS. Liang, X.; Lee, L.; Dai, W.; and Xing, E. P. 2017. Dual motion gan for future-flow embedded video prediction. In Chen, T.; Kornblith, S.; Norouzi, M.; and Hinton, G. 2020. ICCV, 1744–1752. A simple framework for contrastive learning of visual repre- sentations. In ICML, 1597–1607. Loshchilov, I.; and Hutter, F. 2017. Sgdr: Stochastic gradient descent with warm restarts. In ICLR. Chen, X.; and He, K. 2021. Exploring simple siamese rep- resentation learning. In CVPR, 15750–15758. Luo, D.; Liu, C.; Zhou, Y.; Yang, D.; Ma, C.; Ye, Q.; and Wang, W. 2020. Video cloze procedure for self-supervised Cho, H.; Kim, T.; Chang, H. J.; and Hwang, W. 2020. Self- spatio-temporal learning. In AAAI. Supervised Spatio-Temporal Representation Learning Us- ing Variable Playback Speed Prediction. arXiv preprint Ma, S.; Zeng, Z.; McDuff, D.; and Song, Y. 2020. Ac- arXiv:2003.02692. tive Contrastive Learning of Audio-Visual Video Represen- tations. In ICLR. DeVries, T.; and Taylor, G. W. 2017. Improved regulariza- tion of convolutional neural networks with cutout. arXiv Mathieu, M.; Couprie, C.; and LeCun, Y. 2016. Deep multi- preprint arXiv:1708.04552. 
scale video prediction beyond mean square error. In ICLR. Finn, C.; Goodfellow, I.; and Levine, S. 2016. Unsupervised McFee, B.; Raffel, C.; Liang, D.; Ellis, D. P.; McVicar, M.; learning for physical interaction through video prediction. Battenberg, E.; and Nieto, O. 2015. librosa: Audio and mu- NeurIPS, 29: 64–72. sic signal analysis in python. In Python in Science Confer- Gemmeke, J. F.; Ellis, D. P.; Freedman, D.; Jansen, A.; ence, volume 8, 18–25. Lawrence, W.; Moore, R. C.; Plakal, M.; and Ritter, M. Micikevicius, P.; Narang, S.; Alben, J.; Diamos, G.; Elsen, 2017. Audio set: An ontology and human-labeled dataset E.; Garcia, D.; Ginsburg, B.; Houston, M.; Kuchaiev, O.; for audio events. In ICASSP, 776–780. Venkatesh, G.; et al. 2018. Mixed Precision Training. In Ghadiyaram, D.; Tran, D.; and Mahajan, D. 2019. Large- ICLR. Scale Weakly-Supervised Pre-Training for Video Action Min, S.; Dai, Q.; Xie, H.; Gan, C.; Zhang, Y.; and Wang, J. Recognition. In CVPR, 12038–12047. 2021. Cross-Modal Attention Consistency for Video-Audio Grill, J.-B.; Strub, F.; Altché, F.; Tallec, C.; Richemond, P.; Unsupervised Learning. arXiv preprint arXiv:2106.06939. Buchatskaya, E.; Doersch, C.; Pires, B.; Guo, Z.; Azar, M.; Misra, I.; and Maaten, L. v. d. 2020. Self-supervised learning et al. 2020. Bootstrap Your Own Latent: A new approach to of pretext-invariant representations. In CVPR, 6707–6717. self-supervised learning. In NeurIPS. Misra, I.; Zitnick, C. L.; and Hebert, M. 2016. Shuffle and Han, T.; Xie, W.; and Zisserman, A. 2020. Self-supervised learn: unsupervised learning using temporal order verifica- Co-training for Video Representation Learning. In NeurIPS. tion. In ECCV, 527–544. Morgado, P.; Misra, I.; and Vasconcelos, N. 2021. Robust Sarkar, P.; and Etemad, A. 2020b. Self-supervised learning Audio-Visual Instance Discrimination. In CVPR, 12934– for ecg-based emotion recognition. In ICASSP, 3217–3221. 12945. 
Sarkar, P.; Lobmaier, S.; Fabre, B.; Berg, G.; Mueller, A.; Morgado, P.; Vasconcelos, N.; and Misra, I. 2021. Audio- Frasch, M. G.; Antonelli, M. C.; and Etemad, A. 2020. De- visual instance discrimination with cross-modal agreement. tection of Maternal and Fetal Stress from ECG with Self- In CVPR, 12475–12486. supervised Representation Learning. arXiv e-prints, arXiv– Niizumi, D.; Takeuchi, D.; Ohishi, Y.; Harada, N.; and 2011. Kashino, K. 2021. BYOL for Audio: Self-Supervised Siriwardhana, S.; Kaluarachchi, T.; Billinghurst, M.; and Learning for General-Purpose Audio Representation. arXiv Nanayakkara, S. 2020. Multimodal Emotion Recognition preprint arXiv:2103.06695. With Transformer-Based Self Supervised Feature Fusion. Omeiza, D.; Speakman, S.; Cintas, C.; and Weldermariam, IEEE Access, 8: 176274–176285. K. 2019. Smooth grad-cam++: An enhanced inference level Soomro, K.; Zamir, A. R.; and Shah, M. 2012. UCF101: visualization technique for deep convolutional neural net- A dataset of 101 human actions classes from videos in the work models. arXiv preprint arXiv:1908.01224. wild. arXiv preprint arXiv:1212.0402. Park, D. S.; Chan, W.; Zhang, Y.; Chiu, C.-C.; Zoph, B.; Stowell, D.; Giannoulis, D.; Benetos, E.; Lagrange, M.; and Cubuk, E. D.; and Le, Q. V. 2019. Specaugment: A simple Plumbley, M. D. 2015. Detection and classification of acous- data augmentation method for automatic speech recognition. tic scenes and events. IEEE Transactions on Multimedia, arXiv preprint arXiv:1904.08779. 17(10): 1733–1746. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; and Paluri, Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; M. 2016. Deep end2end voxel2voxel prediction. In CVPRW, et al. 2019. Pytorch: An imperative style, high-performance 17–24. deep learning library. NeurIPS, 32: 8026–8037. Tran, D.; Wang, H.; Torresani, L.; Ray, J.; LeCun, Y.; and Patrick, M.; Asano, Y. 
M.; Kuznetsova, P.; Fong, R.; Hen- Paluri, M. 2018. A closer look at spatiotemporal convolu- riques, J. F.; Zweig, G.; and Vedaldi, A. 2021a. On composi- tions for action recognition. In CVPR, 6450–6459. tions of transformations in contrastive self-supervised learn- Tulyakov, S.; Liu, M.-Y.; Yang, X.; and Kautz, J. 2018. ing. In Proceedings of the IEEE/CVF International Confer- MoCoGAN: Decomposing motion and content for video ence on Computer Vision, 9577–9587. generation. In CVPR, 1526–1535. Patrick, M.; Huang, P.-Y.; Misra, I.; Metze, F.; Vedaldi, A.; Van Gansbeke, W.; Vandenhende, S.; Georgoulis, S.; Proes- Asano, Y. M.; and Henriques, J. F. 2021b. Space-Time mans, M.; and Van Gool, L. 2020. Scan: Learning to classify Crop & Attend: Improving Cross-modal Video Representa- images without labels. In ECCV, 268–285. tion Learning. In ICCV, 10560–10572. Vondrick, C.; Pirsiavash, H.; and Torralba, A. 2016. Gener- Piczak, K. J. 2015. ESC: Dataset for Environmental Sound ating videos with scene dynamics. NeurIPS, 29: 613–621. Classification. In ACM Conference on Multimedia, 1015– Wang, J.; Jiao, J.; Bao, L.; He, S.; Liu, W.; and Liu, Y.-H. 1018. . 2021. Self-supervised Video Representation Learning by Piergiovanni, A.; Angelova, A.; and Ryoo, M. S. 2020. Uncovering Spatio-temporal Statistics. PAMI. Evolving losses for unsupervised video representation learn- Xu, D.; Xiao, J.; Zhao, Z.; Shao, J.; Xie, D.; and Zhuang, ing. In CVPR, 133–142. Y. 2019. Self-supervised spatiotemporal learning via video clip order prediction. In CVPR, 10334–10343. Qian, R.; Meng, T.; Gong, B.; Yang, M.-H.; Wang, H.; Be- longie, S.; and Cui, Y. 2021. Spatiotemporal contrastive You, Y.; Gitman, I.; and Ginsburg, B. 2017. Large video representation learning. In CVPR, 6964–6974. batch training of convolutional networks. arXiv preprint arXiv:1708.03888. Recasens, A.; Luc, P.; Alayrac, J.-B.; Wang, L.; Strub, F.; Tallec, C.; Malinowski, M.; Patraucean, V.; Altché, F.; Valko, M.; et al. 2021. 
Broaden Your Views for Self-Supervised Video Learning. arXiv preprint arXiv:2103.16559. Reda, F. A.; Liu, G.; Shih, K. J.; Kirby, R.; Barker, J.; Tar- jan, D.; Tao, A.; and Catanzaro, B. 2018. Sdc-net: Video prediction using spatially-displaced convolution. In ECCV, 718–733. Saito, M.; Matsumoto, E.; and Saito, S. 2017. Temporal generative adversarial nets with singular value clipping. In ICCV, 2830–2839. Sarkar, P.; and Etemad, A. 2020a. Self-supervised ECG rep- resentation learning for emotion recognition. IEEE Trans- actions on Affective Computing. Supplementary Material Query Neighborhoods The organization of the supplementary material is as fol- lows: • Appendix A: Pseudocode; • Appendix B: Qualitative Analysis; • Appendix C: Datasets; • Appendix D: Data Augmentations; • Appendix E: Evaluation Protocols; • Appendix G: Hyperparameters; • Appendix F: Architectures; • Appendix H: Limitations; • Appendix I: Broader Impact. A Pseudocode We present the pseudocode of our proposed CrissCross framework in Algorithm 1. Algorithm 1: CrissCross pseudocode (PyTorch style). # fv: visual encoder (backbone+projection mlp) # fa: audio encoder (backbone+projection mlp) # hv: visual predictor head (prediction mlp) # ha: audio predictor head (prediction mlp) # D: loss function, following Eqn. 1 def forward(v1, v2, a1, a2): """ v1,V2: minibatch of augmented visual samples a1,a2: minibatch of augmented audio samples """ # visual zv1, zv2 = fv(v1), fv(v2) # visual embeddings pv1, pv2 = hv(zv1), hv(zv2) # predictor output # audio za1, za2 = fa(a1), fa(a2) # audio embeddings pa1, pa2 = ha(za1), ha(za2) # predictor output # loss calculation # intra-modal loss, following Eqn. 2 L_intra = D(pv1, zv2)/2 + D(pv2, zv1)/2 + \ D(pa1, za2)/2 + D(pa2, za1)/2 Figure S1: We present a few randomly selected samples of video-to-video retrieval. Here, the frames with black bor- # synchronous cross-modal loss, following Eqn. 
3 L_sync = (D(pv1, za1)/2 + D(pa1, zv1)/2 + Lv2a2 +\ ders represent the query, and the next 5 frames represent the D(pv2, za2)/2 + D(pa2, zv2)/2)/2 top-5 neighborhoods. The correct retrievals are marked with # asynchronous cross-modal loss, following Eqn. 4 green, while the wrong ones are marked with red. L_async = (D(pv1, za2)/2 + D(pa2, zv1)/2 +\ D(pa1, zv2)/2 + D(pv2, za1)/2)/2 # total loss, following Eqn. 5 occur when the visual scenes or sound events are very sim- L_CrissCross = (L_async + L_sync + L_intra)/3 ilar. For instance, ‘playing piano’ and ‘playing organ’ for return L_CrissCross video-to-video retrieval and ‘playing keyboard’ and ‘play- ing xylophone’ for audio-to-audio retrieval. B Qualitative Analysis C Datasets To perform a qualitative analysis of the learned representa- C.1 Pretraining Datasets tions in an unsupervised setup, we present the nearest neigh- We use 3 datasets of different sizes for pretraining, namely, borhoods of video-to-video and audio-to-audio retrieval in Kinetics-Sound (Arandjelovic and Zisserman 2017), Kinet- Figures S1 and S2. In this experiment, we use Kinetics400 ics400 (Kay et al. 2017), and AudioSet (Gemmeke et al. (Kay et al. 2017) to pretrain CrissCross. Next, we use the 2017). Kinetics-Sound is a small-scale action recognition features extracted from randomly selected samples of the dataset, which has a total of 22K video clips, distributed validation split to query the training features. We find that over 32 action classes. Kinetics400 is a medium-scale hu- in most of the cases CrissCross performs fairly well, we no- man action recognition dataset, originally collected from tice very few instances of wrong retrieval, which generally YouTube. It has a total of 240K training samples and 400 Query Neighborhoods stream tasks: (i) action recognition based on visual repre- sentations and (ii) sound classification based on audio repre- sentations. 
To perform action recognition, we use two pop- ular benchmarks, i.e., UCF101 (Soomro, Zamir, and Shah 2012) and HMDB51 (Kuehne et al. 2011). UCF101 con- sists of a total of 13K clips distributed among 101 action classes, while HMDB contains nearly 7K video clips dis- tributed over 51 action categories. To perform sound clas- sification, we use two popular benchmarks ESC50 (Piczak 2015) and DCASE2014 (Stowell et al. 2015). ESC50 is a collection of 2K audio events comprised of 50 classes and DCASE2014 is an audio event dataset of 100 recordings spread over 10 categories. D Data Augmentation Here we present the details of the augmentation parameters for both visual and audio modalities. D.1 Visual Augmentations The parameters for visual augmentations are presented in Table S1. Some of the parameters are chosen from the liter- ature, while the rest are found through empirical search. We set the parameters of Multi-Scale Crop, Gaussian Blur, and Gray Scale as suggested in (Chen et al. 2020), and the pa- rameters for Color Jitter are taken from (Morgado, Vascon- celos, and Misra 2021). We use TorchVision (Paszke et al. 2019) for all the implementations of visual augmentations, except Cutout where we use the implementation available here1 . Please note that for the Cutout transformation, the mask is created with the mean value of the first frame in the sequence. D.2 Audio Augmentations We present the parameters used for audio augmentations in Figure S2: We present a few randomly selected samples of Table S2. We use the Librosa(McFee et al. 2015) library to audio-to-audio retrieval. Here, the frames with black bor- generate mel-spectrograms. We use the techniques proposed ders represent the query, and the next 5 frames represent the in (Park et al. 2019) to perform Time Mask, Frequency top-5 neighborhoods. The correct retrievals are marked with Mask, and Time Warp transformations2 . The parameters for green, while the wrong ones are marked with red. 
the audio augmentations are set empirically, except for Ran- dom Crop which we adopt from (Niizumi et al. 2021). action classes. Please note that Kinetics-Sound is a subset of E Evaluation Protocol Kinetics400, and consists of action classes which are promi- nently manifested audibly and visually (Arandjelovic and To evaluate the representations learned with self-supervised Zisserman 2017). Lastly, AudioSet (Gemmeke et al. 2017) pretraining, we test the proposed framework in different se- is a large-scale video dataset of audio events consisting of tups, namely linear evaluation, full finetuning, and retrieval. a total of 1.8M audio-video segments originally obtained The details of the evaluation protocols are mentioned below. from YouTube spread over 632 audio classes. Please note that none of the provided labels are used in self-supervised E.1 Linear Evaluation pretraining. To perform linear evaluations of the learned representations on downstream tasks, we extract fixed features (also called C.2 Downstream Datasets frozen features) using the pretrained backbones. We train a Following the standard practices of prior works (Morgado, linear classifier using the fixed feature representations. The Vasconcelos, and Misra 2021; Morgado, Misra, and Vas- details are presented below. concelos 2021; Alayrac et al. 2020; Alwassel et al. 2020; 1 Asano et al. 2020; Korbar, Tran, and Torresani 2018), we https://github.com/uoguelph-mlrg/Cutout 2 evaluate our self-supervised methods on two types of down- https://github.com/s3prl/s3prl Augmentation Parameters MSC HF CJ GS GB C Multi Scale Crop min area = 0.08 Pretraining 3 3 3 3 3 3 Horizontal Flip p = 0.5 Full-finetune 3 3 3 3 7 3 Linear evaluation 3 3 3 3 7 3 brightness = 0.4 contrast = 0.4 Color Jitter Table S3: Audio augmentation summary. 
                 saturation = 0.4
                 hue = 0.2
Gray Scale       p = 0.2
Gaussian Blur    p = 0.5
Cutout           max size = 20
                 num = 1

Table S1: Visual augmentation parameters.

Augmentation     Parameters
Volume Jitter    range = ±0.2
Time Mask        max size = 20
                 num = 2
Frequency Mask   max size = 10
                 num = 2
Timewarp         wrap window = 20
Random Crop      range = [0.6,1.5]
                 crop scale = [1.0,1.5]

Table S2: Audio augmentation parameters.

                    VJ   Mask   RC   TW
Pretraining         ✓    ✓      ✓    ✗
Linear evaluation   ✓    ✓      ✓    ✓

Table S4: Visual augmentation summary.

Action Recognition.
To perform linear evaluations on action recognition, we follow standard evaluation protocols laid out in prior works (Alayrac et al. 2020; Recasens et al. 2021; Patrick et al. 2021a; Morgado, Vasconcelos, and Misra 2021). The details are presented below.

HMDB51 and UCF101. We perform linear evaluations in 2 setups, i.e., 8-frame and 32-frame inputs. We evaluate on 8-frame inputs for the design explorations and 32-frame inputs for large-scale experiments.

Following the protocols mentioned in (Alayrac et al. 2020; Recasens et al. 2021), we feed 8-frame inputs to the video backbone, with a spatial resolution of 224². During training, we randomly pick 25 clips per sample to extract augmented representations, while during testing, we uniformly select 10 clips per sample and report top-1 accuracy at sample-level prediction by averaging clip-level predictions. The augmentation techniques are mentioned in Section D. We don't apply the Gaussian Blur while extracting the training features since it deteriorates the performance. Moreover, to perform a deterministic evaluation, we don't apply any augmentations during validation. The visual features are extracted from the final convolution layer and passed to a max-pool layer with a kernel size of (1, 4, 4) (Morgado, Vasconcelos, and Misra 2021). Finally, we use the learned visual representations to train a linear SVM classifier; we sweep the cost values between {0.00001, 0.00005, 0.0001, 0.0005, 0.001, 0.005, 0.01, 1} and report the best accuracy.

When validating on 32-frame inputs, we could not perform SVM as the feature vector is too large to hold in the memory. Hence, we use a linear fully-connected layer at the end of the video backbone. Note that during training the backbone is kept frozen and only the linear layer is trained. We keep the rest of the setup the same as described earlier, with the exception of training where we randomly select 10 clips per sample.

Kinetics400. As Kinetics400 (Kay et al. 2017) is a large-scale dataset, the feature vector is too large to save in memory. Following (Morgado, Vasconcelos, and Misra 2021), we use a fully connected layer at the end of the frozen backbone and feed 8 × 224² frame inputs. During training, we randomly pick 1 clip per sample, while during validation, we uniformly select 10 clips per sample. Note that the rest of the setups remain the same, as described for HMDB51 and UCF101. Finally, we obtain the sample-level prediction by averaging the clip-level predictions and report the top-1 accuracy.

Sound Classification.
In case of evaluating audio representations, we follow the evaluation protocol laid out in prior works (Morgado, Vasconcelos, and Misra 2021; Alwassel et al. 2020; Alayrac et al. 2020; Recasens et al. 2021) for respective datasets. The details are mentioned below.

ESC50. We perform linear evaluations on ESC50 in 2 setups: we use 2-second audio input for design exploration and 5-second audio input for large-scale experiments. Following (Patrick et al. 2021a), we extract 10 epochs worth of augmented feature vectors from the training clips. During testing, when using 2-second inputs, we extract 10 equally spaced audio segments (Morgado, Vasconcelos, and Misra 2021; Patrick et al. 2021a; Alwassel et al. 2020), and when using 5-second inputs, we extract 1 segment (Alayrac et al. 2020; Recasens et al. 2021) from each sample. We perform the augmentations mentioned in Section D to extract the training features. We notice that unlike self-supervised pretraining, time warping improves the model performance in the linear evaluation. We do not apply any augmentations during validation. We extract the representations from the final convolution layer and pass them through a max-pool layer with a kernel size of (1, 3) and a stride of (1, 2) (Patrick et al. 2021a). Similar to action recognition, we perform classification using a one-vs-all linear SVM classifier; we sweep the cost values between {0.00001, 0.00005, 0.0001, 0.0005, 0.001, 0.005, 0.01, 1} and report the best accuracy.

DCASE. To validate on DCASE, we follow the protocol mentioned in (Morgado, Vasconcelos, and Misra 2021). We extract 60 clips per sample and train a linear classifier on the extracted representations. Note that the augmentation and feature extraction schemes remain the same as mentioned for ESC50. We report the top-1 sample-level accuracies by averaging the clip-level predictions.

Multi-modal Fusion. To perform a multi-modal linear evaluation with late fusion, we extract features from Kinetics-Sound. During training, we randomly pick 10 audio-visual clips per sample, each 2 seconds long. Next, we extract feature vectors of dimension 2048 from the last convolution layer by using max-pooling with kernel sizes of (1, 2, 2) and (1, 4) for visual and audio respectively. Following this, the feature vectors are concatenated to train a linear SVM classifier. Finally, we report the top-1 sample-level accuracy for action classification.

E.2 Full Finetuning
Following earlier works (Alwassel et al. 2020; Morgado, Vasconcelos, and Misra 2021; Morgado, Misra, and Vasconcelos 2021; Asano et al. 2020), we use the pretrained visual backbone along with a newly added fully-connected layer for full finetuning on UCF101 (Soomro, Zamir, and Shah 2012) and HMDB51 (Kuehne et al. 2011). We adopt two setups for full finetuning, 8-frame inputs and 32-frame inputs. In both cases, we use a spatial resolution of 224². Lastly, we replace the final adaptive average-pooling layer with an adaptive max-pooling layer. We find that applying strong augmentations improves the model performance in full-finetuning. Please see the augmentation details in Section D. During testing, we extract 10 equally spaced clips from each sample and do not apply any augmentations. We report the top-1 accuracy at sample-level prediction by averaging the clip-level predictions. We use an SGD optimizer with a multi-step learning rate scheduler to finetune the model. We present the hyperparameters of full-finetuning in Table S11.

E.3 Retrieval
We follow the protocol laid out in (Patrick et al. 2021a; Xu et al. 2019). We uniformly select 10 clips per sample from both training and test splits. We feed 2-second inputs to the backbone to extract representations. We empirically test additional steps such as l2-normalization and applying batch-normalization on the extracted features, and notice that they do not help the performance. Hence, we simply average the features extracted from the test split to query the features of the training split. We compute the cosine distance between the feature vectors of the test clips (query) and the representations of all the training clips (neighbors). We consider a correct prediction if k neighboring clips of a query clip belong to the same class. We calculate accuracies for k = 1, 5, 20. We use the NearestNeighbors API (footnote 3) provided in SciKit-Learn in this experiment.

Footnote 3: sklearn.neighbors.NearestNeighbors

F Architecture Details
In this study, we use a slightly modified version of R(2+1)D-18 (Tran et al. 2018) as the video backbone as proposed in (Morgado, Vasconcelos, and Misra 2021), and ResNet-18 (He et al. 2016) as the audio backbone. For the sake of completeness, we present the architecture details in Tables S5 and S6, respectively. The projector and predictor heads are made of fully-connected layers following (Chen and He 2021), and their architecture details are presented in Tables S7 and S8, respectively.

Layer        Xs    Xt   C     Ks   Kt   Ss   St
frames       112   8    3     -    -    -    -
conv1        56    8    64    7    3    2    1
maxpool      28    8    64    3    1    2    1
block2.1.1   28    8    64    3    3    1    1
block2.1.2   28    8    64    3    3    1    1
block2.2.1   28    8    64    3    3    1    1
block2.2.2   28    8    64    3    3    1    1
block3.1.1   14    4    128   3    3    2    2
block3.1.2   14    4    128   3    3    1    1
block3.2.1   14    4    128   3    3    1    1
block3.2.2   14    4    128   3    3    1    1
block4.1.1   7     2    256   3    3    2    2
block4.1.2   7     2    256   3    3    1    1
block4.2.1   7     2    256   3    3    1    1
block4.2.2   7     2    256   3    3    1    1
block5.1.1   4     1    512   3    3    2    2
block5.1.2   4     1    512   3    3    1    1
block5.2.1   4     1    512   3    3    1    1
block5.2.2   4     1    512   3    3    1    1
avg-pool     -     -    512   -    -    -    -

Table S5: Architecture of the video backbone: R(2+1)D-18.

Layer         Xf   Xt    C     Ks   Kt   Sf   St
spectrogram   80   200   1     -    -    -    -
conv1         40   100   64    7    7    2    2
maxpool       20   50    64    3    3    2    2
block2.1.1    20   50    64    3    3    2    2
block2.1.2    20   50    64    3    3    2    2
block2.2.1    20   50    64    3    3    2    2
block2.2.2    20   50    64    3    3    2    2
block3.1.1    10   25    128   3    3    2    2
block3.1.2    10   25    128   3    3    2    2
block3.2.1    10   25    128   3    3    2    2
block3.2.2    10   25    128   3    3    2    2
block4.1.1    5    13    256   3    3    2    2
block4.1.2    5    13    256   3    3    2    2
block4.2.1    5    13    256   3    3    2    2
block4.2.2    5    13    256   3    3    2    2
block5.1.1    3    7     512   3    3    2    2
block5.1.2    3    7     512   3    3    2    2
block5.2.1    3    7     512   3    3    2    2
block5.2.2    3    7     512   3    3    2    2
avg-pool      -    -     512   -    -    -    -

Table S6: Architecture of the audio backbone: ResNet-18.

Layer        Dimensions
input        512
fc-bn-relu   2048
fc-bn-relu   2048
fc-bn        2048

Table S7: Architecture of projector heads.

Layer        Dimensions
input        2048
fc-bn-relu   512
fc           2048

Table S8: Architecture of predictor heads.

G Hyperparameters and Training Details
In this section, we present the details of the hyperparameters, computation requirements, as well as additional training details of self-supervised pretraining and full finetuning.

G.1 Pretraining Details
We present the pretraining hyperparameters of CrissCross in Table S10. Most of the parameters remain the same across all 3 datasets, with the exception of a few hyperparameters such as learning rates and epoch size which are set depending on the size of the datasets. We train on Kinetics-Sound with a batch size of 512, on a single node with 4 Nvidia RTX-6000 GPUs. Next, when training on Kinetics400 and AudioSet, we use 2 nodes and set the batch size to 2048. Adam (Kingma and Ba 2015) optimizer is used to train our proposed framework. We use LARC (footnote 4) (You, Gitman, and Ginsburg 2017) as a wrapper to the Adam optimizer to clip the gradients while pretraining with a batch size of 2048. In this work, we stick to batch sizes of 512 and 2048, because (i) they show stable performance based on the findings of (Chen and He 2021); (ii) they fit well with our available GPU setups. Additionally, we perform mixed-precision training (Micikevicius et al. 2018) using PyTorch AMP (Paszke et al. 2019) to reduce the computation overhead.

Footnote 4: https://github.com/NVIDIA/apex/blob/master/apex/parallel/LARC.py

Ablation Parameters. In the ablation study, we keep the training setup exactly identical across all the variants, with the exception of the learning rates, which we tune to find the best performance for that particular variant. For example, we set the base learning rate for the L_{v1v2} and L_{a1a2} models as 0.0001 and 0.00001 respectively. Next, the predictor learning rates are set to 0.001 and 0.0001 for the L_{v1v2} and L_{a1a2} variants.

G.2 Full Finetuning Details
The full fine-tuning hyperparameters for both benchmarks are presented in Table S11. We use a batch size of 32 for the 32-frame input and 64 for the 8-frame input. We use an SGD optimizer with a multi-step learning rate scheduler to finetune the video backbones. Please note that we perform the full finetuning on a single Nvidia RTX-6000 GPU.

H Limitations.
The notion of asynchronous cross-modal optimization has not been explored beyond audio-visual modalities. For example, our model can be expanded to consider more than 2 modalities (e.g., audio, visual, and text), which are yet to be studied. Additionally, we notice a considerable performance gap between full-supervision and self-supervision when both methods are pretrained with the same large-scale dataset (Kinetics400 or AudioSet), showing room for further improvement.

I Broader Impact.
Better self-supervised audio-visual learning can be used for detection of harmful contents on the Internet. Additionally, such methods can be used to develop better multimedia systems. Lastly, the notion that relaxed cross-modal temporal synchronicity is useful can challenge our existing/standard approaches in learning multi-modal representations and result in new directions of inquiry. The authors don't foresee any major negative impacts.

Abbreviations   Name                      Description
bs              batch size                The size of a mini-batch.
es              epoch size                The total number of samples per epoch.
ep              total epochs              The total number of epochs.
lr              learning rate             The learning rates to train the networks.
lrab            audio backbone lr
lrvb            video backbone lr
lrap            audio predictor lr
lrvp            video predictor lr
lrs             learning rate scheduler   The learning rate scheduler to train the network.
ms              milestones                At every ms epoch the learning rate is decayed.
γ               lr decay rate             The learning rate is decayed by a factor of γ.
wd              weight decay              The weight decay used in the SGD optimizer.
mtm             momentum                  The momentum used in the SGD optimizer.
drp             dropout                   The dropout rate.

Table S9: Abbreviations and descriptions of the hyperparameters.

dataset   bs     es     ep    optim   lrs      lrvb (start/end)   lrab (start/end)   lrvp    lrap    wd       betas
KS        512    220K   100   Adam    Cosine   0.0002/0           0.0002/0           0.002   0.002   0.0001   0.9, 0.999
K400      2048   1M     100   Adam∗   Cosine   0.0002/0.0001      0.0002/0.0001      0.002   0.002   0.0001   0.9, 0.999
AS        2048   3.5M   100   Adam∗   Cosine   0.0001/0           0.0001/0           0.001   0.001   0.0001   0.9, 0.999

Table S10: Pretext training parameters. Note the abbreviations used below, KS: Kinetics-Sound, K400: Kinetics400, AS: AudioSet, Adam∗: Adam with LARC.

dataset   input     es    bs   ep   ms        optim   lrs          lr        γ     wd    mtm   drp
UCF101    8×224²    95K   64   20   6/10/14   SGD     multi-step   0.0005    0.3   0.0   0.9   0.0
UCF101    32×224²   95K   32   20   8/12/16   SGD     multi-step   0.00007   0.3   0.0   0.9   0.0
HMDB51    8×224²    35K   64   20   6/10/14   SGD     multi-step   0.0005    0.1   0.0   0.9   0.0
HMDB51    32×224²   35K   32   20   8/12/16   SGD     multi-step   0.0001    0.3   0.0   0.9   0.0

Table S11: Full-finetuning hyperparameters for action recognition when pretrained on Kinetics400.
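The linear evaluation and full finetuning protocols repeatedly obtain a sample-level prediction by averaging clip-level predictions before computing top-1 accuracy. The following is a minimal pure-Python sketch of that step, not the authors' implementation. The function name `sample_level_top1`, its arguments, and the toy scores are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def sample_level_top1(clip_scores, clip_sample_ids, sample_labels):
    """Average clip-level class scores per sample and return top-1 accuracy.

    clip_scores: list of per-class score lists, one entry per clip.
    clip_sample_ids: the sample id of each clip, in the same order.
    sample_labels: dict mapping sample id -> ground-truth class index.
    """
    sums = {}
    counts = defaultdict(int)
    for scores, sid in zip(clip_scores, clip_sample_ids):
        if sid not in sums:
            sums[sid] = [0.0] * len(scores)
        sums[sid] = [a + b for a, b in zip(sums[sid], scores)]
        counts[sid] += 1

    correct = 0
    for sid, total in sums.items():
        mean = [s / counts[sid] for s in total]
        predicted = max(range(len(mean)), key=mean.__getitem__)
        correct += int(predicted == sample_labels[sid])
    return correct / len(sums)


# Toy data (made up): 2 samples, 2 clips each, 3 classes.
scores = [
    [0.7, 0.2, 0.1],  # sample "a", clip 1
    [0.3, 0.5, 0.2],  # sample "a", clip 2 (alone it would predict class 1)
    [0.1, 0.6, 0.3],  # sample "b", clip 1
    [0.2, 0.1, 0.7],  # sample "b", clip 2
]
ids = ["a", "a", "b", "b"]
labels = {"a": 0, "b": 1}

accuracy = sample_level_top1(scores, ids, labels)
print(accuracy)  # 0.5: "a" averages to class 0 (correct), "b" to class 2 (wrong)
```

In the toy data, the second clip of sample "a" would be misclassified on its own, but the averaged scores recover the correct class, which is the motivation for reporting accuracy at the sample level.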
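The retrieval protocol in Section E.3 queries training features with test features under cosine distance and reports accuracy for several values of k. The sketch below illustrates that computation in pure Python rather than with the SciKit-Learn API the paper uses. It assumes that a query counts as correct when at least one of its k nearest training neighbors shares its class, which is our reading of the protocol and not something the text states explicitly. The function names and the toy features are hypothetical, and features are assumed to be already averaged per sample.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def retrieval_accuracy(query_feats, query_labels, train_feats, train_labels, k):
    """Fraction of queries with a same-class item among the k nearest neighbors."""
    hits = 0
    for query, label in zip(query_feats, query_labels):
        # Rank all training items by cosine distance to the query.
        order = sorted(
            range(len(train_feats)),
            key=lambda i: cosine_distance(query, train_feats[i]),
        )
        top_k_labels = [train_labels[i] for i in order[:k]]
        hits += int(label in top_k_labels)  # assumption: any match counts
    return hits / len(query_feats)


# Toy 2-D features (made up for illustration).
train_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
train_labels = [0, 1, 1]
query_feats = [[1.0, 0.05], [0.0, 1.0]]
query_labels = [1, 1]

recall_at_1 = retrieval_accuracy(query_feats, query_labels, train_feats, train_labels, k=1)
recall_at_2 = retrieval_accuracy(query_feats, query_labels, train_feats, train_labels, k=2)
print(recall_at_1, recall_at_2)  # 0.5 1.0
```

The first query's nearest neighbor belongs to a different class, so it only counts as a hit once k is raised to 2, which shows why accuracy is reported at several values of k.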
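Table S11 specifies a multi-step learning rate schedule through a base learning rate, milestones (ms), and a decay factor γ. As a hedged illustration, the snippet below computes the per-epoch learning rate for the UCF101 8-frame row (lr = 0.0005, ms = 6/10/14, γ = 0.3, 20 epochs). The helper `multistep_lr` is hypothetical, and treating milestones as 0-indexed epochs at which the decay takes effect is an assumption that mirrors the usual multi-step convention rather than a detail confirmed by the paper.

```python
def multistep_lr(base_lr, milestones, gamma, epoch):
    """Learning rate at a 0-indexed epoch under a multi-step schedule.

    The rate is multiplied by gamma once for every milestone already reached.
    """
    decays = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** decays


# UCF101, 8-frame setup from Table S11 (milestone indexing is an assumption).
schedule = [multistep_lr(0.0005, (6, 10, 14), 0.3, e) for e in range(20)]

for epoch in (0, 6, 10, 14):
    print(epoch, schedule[epoch])
```

Under these assumptions the rate stays at 0.0005 for epochs 0 to 5 and is then scaled by 0.3 at each milestone, ending at 0.0005 × 0.3³ for the final epochs.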

References (65)

Ahsan, U.; Madhok, R.; and Essa, I. 2019. Video jigsaw: Unsupervised learning of spatiotemporal context for video action recognition. In WACV, 179-189.
Alayrac, J.-B.; Recasens, A.; Schneider, R.; Arandjelovic, R.; Ramapuram, J.; De Fauw, J.; Smaira, L.; Dieleman, S.; and Zisserman, A. 2020. Self-Supervised MultiModal Versatile Networks. NeurIPS, 2(6): 7.
Alwassel, H.; Mahajan, D.; Korbar, B.; Torresani, L.; Ghanem, B.; and Tran, D. 2020. Self-Supervised Learning by Cross-Modal Audio-Video Clustering. NeurIPS, 33.
Arandjelovic, R.; and Zisserman, A. 2017. Look, listen and learn. In ICCV, 609-617.
Asano, Y. M.; Patrick, M.; Rupprecht, C.; and Vedaldi, A. 2020. Labelling unlabelled videos from scratch with multi- modal self-supervision. In NeurIPS.
Babaeizadeh, M.; Finn, C.; Erhan, D.; Campbell, R. H.; and Levine, S. 2018. Stochastic Variational Video Prediction. In ICLR.
Bardes, A.; Ponce, J.; and LeCun, Y. 2021. Vicreg: Variance-invariance-covariance regularization for self-supervised learning. arXiv preprint arXiv:2105.04906.
Benaim, S.; Ephrat, A.; Lang, O.; Mosseri, I.; Freeman, W. T.; Rubinstein, M.; Irani, M.; and Dekel, T. 2020. Speednet: Learning the speediness in videos. In CVPR, 9922-9931.
Bromley, J.; Guyon, I.; LeCun, Y.; Säckinger, E.; and Shah, R. 1993. Signature verification using a "siamese" time delay neural network. Advances in neural information processing systems, 6.
Buchler, U.; Brattoli, B.; and Ommer, B. 2018. Improving spatiotemporal self-supervision by deep reinforcement learning. In ECCV.
Caron, M.; Bojanowski, P.; Joulin, A.; and Douze, M. 2018. Deep clustering for unsupervised learning of visual features. In ECCV, 132-149.
Caron, M.; Misra, I.; Mairal, J.; Goyal, P.; Bojanowski, P.; and Joulin, A. 2020. Unsupervised Learning of Visual Features by Contrasting Cluster Assignments. In NeurIPS.
Chen, T.; Kornblith, S.; Norouzi, M.; and Hinton, G. 2020. A simple framework for contrastive learning of visual representations. In ICML, 1597-1607.
Chen, X.; and He, K. 2021. Exploring simple siamese representation learning. In CVPR, 15750-15758.
Cho, H.; Kim, T.; Chang, H. J.; and Hwang, W. 2020. Self-Supervised Spatio-Temporal Representation Learning Using Variable Playback Speed Prediction. arXiv preprint arXiv:2003.02692.
DeVries, T.; and Taylor, G. W. 2017. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552.
Finn, C.; Goodfellow, I.; and Levine, S. 2016. Unsupervised learning for physical interaction through video prediction. NeurIPS, 29: 64-72.
Gemmeke, J. F.; Ellis, D. P.; Freedman, D.; Jansen, A.; Lawrence, W.; Moore, R. C.; Plakal, M.; and Ritter, M. 2017. Audio set: An ontology and human-labeled dataset for audio events. In ICASSP, 776-780.
Ghadiyaram, D.; Tran, D.; and Mahajan, D. 2019. Large-Scale Weakly-Supervised Pre-Training for Video Action Recognition. In CVPR, 12038-12047.
Grill, J.-B.; Strub, F.; Altché, F.; Tallec, C.; Richemond, P.; Buchatskaya, E.; Doersch, C.; Pires, B.; Guo, Z.; Azar, M.; et al. 2020. Bootstrap Your Own Latent: A new approach to self-supervised learning. In NeurIPS.
Han, T.; Xie, W.; and Zisserman, A. 2020. Self-supervised Co-training for Video Representation Learning. In NeurIPS.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR, 770-778.
Jing, L.; Yang, X.; Liu, J.; and Tian, Y. 2018. Self-supervised spatiotemporal feature learning via video rotation prediction. arXiv preprint arXiv:1811.11387.
Kay, W.; Carreira, J.; Simonyan, K.; Zhang, B.; Hillier, C.; Vijayanarasimhan, S.; Viola, F.; Green, T.; Back, T.; Natsev, P.; et al. 2017. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950.
Khare, A.; Parthasarathy, S.; and Sundaram, S. 2021. Self-Supervised learning with cross-modal transformers for emotion recognition. In SLT, 381-388.
Kim, D.; Cho, D.; and Kweon, I. S. 2019. Self-supervised video representation learning with space-time cubic puzzles. In AAAI, volume 33, 8545-8552.
Kingma, D. P.; and Ba, J. 2015. Adam: A Method for Stochastic Optimization. In ICLR.
Korbar, B.; Tran, D.; and Torresani, L. 2018. Cooperative learning of audio and video models from self-supervised synchronization. In NeurIPS, 7774-7785.
Kuehne, H.; Jhuang, H.; Garrote, E.; Poggio, T.; and Serre, T. 2011. HMDB: a large video database for human motion recognition. In ICCV, 2556-2563.
Lee, H.-Y.; Huang, J.-B.; Singh, M.; and Yang, M.-H. 2017. Unsupervised representation learning by sorting sequences. In CVPR.
Liang, X.; Lee, L.; Dai, W.; and Xing, E. P. 2017. Dual motion gan for future-flow embedded video prediction. In ICCV, 1744-1752.
Loshchilov, I.; and Hutter, F. 2017. Sgdr: Stochastic gradient descent with warm restarts. In ICLR.
Luo, D.; Liu, C.; Zhou, Y.; Yang, D.; Ma, C.; Ye, Q.; and Wang, W. 2020. Video cloze procedure for self-supervised spatio-temporal learning. In AAAI.
Ma, S.; Zeng, Z.; McDuff, D.; and Song, Y. 2020. Active Contrastive Learning of Audio-Visual Video Representations. In ICLR.
Mathieu, M.; Couprie, C.; and LeCun, Y. 2016. Deep multi-scale video prediction beyond mean square error. In ICLR.
McFee, B.; Raffel, C.; Liang, D.; Ellis, D. P.; McVicar, M.; Battenberg, E.; and Nieto, O. 2015. librosa: Audio and music signal analysis in python. In Python in Science Conference, volume 8, 18-25.
Micikevicius, P.; Narang, S.; Alben, J.; Diamos, G.; Elsen, E.; Garcia, D.; Ginsburg, B.; Houston, M.; Kuchaiev, O.; Venkatesh, G.; et al. 2018. Mixed Precision Training. In ICLR.
Min, S.; Dai, Q.; Xie, H.; Gan, C.; Zhang, Y.; and Wang, J. 2021. Cross-Modal Attention Consistency for Video-Audio Unsupervised Learning. arXiv preprint arXiv:2106.06939.
Misra, I.; and Maaten, L. v. d. 2020. Self-supervised learning of pretext-invariant representations. In CVPR, 6707-6717.
Misra, I.; Zitnick, C. L.; and Hebert, M. 2016. Shuffle and learn: unsupervised learning using temporal order verification. In ECCV, 527-544.
Morgado, P.; Misra, I.; and Vasconcelos, N. 2021. Robust Audio-Visual Instance Discrimination. In CVPR, 12934-12945.
Morgado, P.; Vasconcelos, N.; and Misra, I. 2021. Audio-visual instance discrimination with cross-modal agreement. In CVPR, 12475-12486.
Niizumi, D.; Takeuchi, D.; Ohishi, Y.; Harada, N.; and Kashino, K. 2021. BYOL for Audio: Self-Supervised Learning for General-Purpose Audio Representation. arXiv preprint arXiv:2103.06695.
Omeiza, D.; Speakman, S.; Cintas, C.; and Weldermariam, K. 2019. Smooth grad-cam++: An enhanced inference level visualization technique for deep convolutional neural network models. arXiv preprint arXiv:1908.01224.
Park, D. S.; Chan, W.; Zhang, Y.; Chiu, C.-C.; Zoph, B.; Cubuk, E. D.; and Le, Q. V. 2019. Specaugment: A simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779.
Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. 2019. Pytorch: An imperative style, high-performance deep learning library. NeurIPS, 32: 8026-8037.
Patrick, M.; Asano, Y. M.; Kuznetsova, P.; Fong, R.; Henriques, J. F.; Zweig, G.; and Vedaldi, A. 2021a. On compositions of transformations in contrastive self-supervised learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 9577-9587.
Patrick, M.; Huang, P.-Y.; Misra, I.; Metze, F.; Vedaldi, A.; Asano, Y. M.; and Henriques, J. F. 2021b. Space-Time Crop & Attend: Improving Cross-modal Video Representation Learning. In ICCV, 10560-10572.
Piczak, K. J. 2015. ESC: Dataset for Environmental Sound Classification. In ACM Conference on Multimedia, 1015-1018.
Piergiovanni, A.; Angelova, A.; and Ryoo, M. S. 2020. Evolving losses for unsupervised video representation learning. In CVPR, 133-142.
Qian, R.; Meng, T.; Gong, B.; Yang, M.-H.; Wang, H.; Belongie, S.; and Cui, Y. 2021. Spatiotemporal contrastive video representation learning. In CVPR, 6964-6974.
Recasens, A.; Luc, P.; Alayrac, J.-B.; Wang, L.; Strub, F.; Tallec, C.; Malinowski, M.; Patraucean, V.; Altché, F.; Valko, M.; et al. 2021. Broaden Your Views for Self-Supervised Video Learning. arXiv preprint arXiv:2103.16559.
Reda, F. A.; Liu, G.; Shih, K. J.; Kirby, R.; Barker, J.; Tarjan, D.; Tao, A.; and Catanzaro, B. 2018. Sdc-net: Video prediction using spatially-displaced convolution. In ECCV, 718-733.
Saito, M.; Matsumoto, E.; and Saito, S. 2017. Temporal generative adversarial nets with singular value clipping. In ICCV, 2830-2839.
Sarkar, P.; and Etemad, A. 2020a. Self-supervised ECG representation learning for emotion recognition. IEEE Transactions on Affective Computing.
Sarkar, P.; and Etemad, A. 2020b. Self-supervised learning for ecg-based emotion recognition. In ICASSP, 3217-3221.
Sarkar, P.; Lobmaier, S.; Fabre, B.; Berg, G.; Mueller, A.; Frasch, M. G.; Antonelli, M. C.; and Etemad, A. 2020. Detection of Maternal and Fetal Stress from ECG with Self-supervised Representation Learning. arXiv e-prints, arXiv-2011.
Siriwardhana, S.; Kaluarachchi, T.; Billinghurst, M.; and Nanayakkara, S. 2020. Multimodal Emotion Recognition With Transformer-Based Self Supervised Feature Fusion. IEEE Access, 8: 176274-176285.
Soomro, K.; Zamir, A. R.; and Shah, M. 2012. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402.
Stowell, D.; Giannoulis, D.; Benetos, E.; Lagrange, M.; and Plumbley, M. D. 2015. Detection and classification of acoustic scenes and events. IEEE Transactions on Multimedia, 17(10): 1733-1746.
Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; and Paluri, M. 2016. Deep end2end voxel2voxel prediction. In CVPRW, 17-24.
Tran, D.; Wang, H.; Torresani, L.; Ray, J.; LeCun, Y.; and Paluri, M. 2018. A closer look at spatiotemporal convolutions for action recognition. In CVPR, 6450-6459.
Tulyakov, S.; Liu, M.-Y.; Yang, X.; and Kautz, J. 2018. MoCoGAN: Decomposing motion and content for video generation. In CVPR, 1526-1535.
Van Gansbeke, W.; Vandenhende, S.; Georgoulis, S.; Proesmans, M.; and Van Gool, L. 2020. Scan: Learning to classify images without labels. In ECCV, 268-285.
Vondrick, C.; Pirsiavash, H.; and Torralba, A. 2016. Generating videos with scene dynamics. NeurIPS, 29: 613-621.
Wang, J.; Jiao, J.; Bao, L.; He, S.; Liu, W.; and Liu, Y.-H. 2021. Self-supervised Video Representation Learning by Uncovering Spatio-temporal Statistics. PAMI.
Xu, D.; Xiao, J.; Zhao, Z.; Shao, J.; Xie, D.; and Zhuang, Y. 2019. Self-supervised spatiotemporal learning via video clip order prediction. In CVPR, 10334-10343.
You, Y.; Gitman, I.; and Ginsburg, B. 2017. Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888.
