Multi-Format Contrastive Learning of Audio Representations

Luyu Wang

doi:10.48550/ARXIV.2103.06508

2021, arXiv: Sound

https://doi.org/10.48550/ARXIV.2103.06508

9 pages

Abstract

Recent advances suggest the advantage of multi-modal training in comparison with single-modal methods. In contrast to this view, in our work we find that a similar gain can be obtained from training with different formats of a single modality. In particular, we investigate the use of the contrastive learning framework to learn audio representations by maximizing the agreement between the raw audio and its spectral representation. We find a significant gain using this multi-format strategy against the single-format counterparts. Moreover, on the downstream AudioSet and ESC-50 classification tasks, our audio-only approach achieves new state-of-the-art results with a mean average precision of 0.376 and an accuracy of 90.5%, respectively.

Key takeaways

Multi-format training enhances audio representation learning, outperforming single-format methods significantly.
Achieved a mean average precision of 0.376 on AudioSet, surpassing previous state-of-the-art scores.
New state-of-the-art accuracy of 90.5% on the ESC-50 classification task was established.
Contrastive learning maximizes agreement between raw audio and log-mel representations for better results.
This work highlights the potential of audio-only unsupervised learning with large-scale datasets.

Luyu Wang (Google DeepMind), Aäron van den Oord (Google DeepMind)

arXiv:2103.06508v3 [cs.SD] 24 Mar 2021

1 Introduction

Self-supervised learning leverages proxy tasks to learn useful representations of the data without requiring manually annotated labels. In computer vision, methods using contrastive losses [1, 2, 3, 4, 5, 6, 7], which learn by maximizing the similarity between augmented views of the same image, stand out on the ImageNet benchmark [8]. Contrastive learning has also facilitated the recent rapid progress in unsupervised speech recognition [9, 10, 11, 12, 13, 14]. Good speech representations should be able to extract transient linguistic information. Therefore, these works rely on context prediction models that output representations with a fine-grained temporal resolution, while excluding non-speech sound that can distract the model from the task. For both image and speech recognition, the gaps between supervised and unsupervised representations have largely been eliminated.
Unlike speech recognition, the multi-instance audio event classification problem requires discriminative representations that can tell apart a broad class of audio events. AudioSet [15] is the ImageNet-scale dataset for general audio understanding, which contains 527 highly imbalanced event classes. Recent development in this direction has mainly focused on supervised learning [16, 17, 18, 19]. In [20], a triplet-based unsupervised approach is introduced to learn audio features from augmented spectrograms. Later, the effectiveness of contrastive predictive coding (CPC), which operates on raw waveforms and is widely used for speech models, is investigated in [21]. Meanwhile, recent works show that better representations can be learned by the proxy task of predicting whether the visual and audio signals come from the same video [22, 23, 24, 25, 26, 27, 28, 29, 30]. On the AudioSet benchmark, [26] shows that it is beneficial to maximize the coincidence between video and audio with the contrastive loss. There is a clear advantage in further taking the additional text modality into account [28]. However, the state-of-the-art unsupervised audio model still lags behind the supervised one [19] (mean average precision (mAP) 0.309 vs 0.439).

Apart from the raw audio format, traditional signal processing allows us to convert the waveforms into spectral representations via short-time Fourier transforms (STFTs) [31]. Such spectrograms can be further converted into log-mel filter banks and mel-frequency cepstral coefficients (MFCCs), which are the dominant formats for supervised and unsupervised audio recognition [16, 17, 18, 19, 20, 26].

Self-Supervised Learning for Speech and Audio Processing Workshop @ NeurIPS 2020.
[Figure 1: Illustration of the multi-format contrastive audio learning framework. Diagram labels: Crop 1, Crop 2, extract spectrogram, waveform augmentations, spectrogram augmentations, waveform encoder, spectrogram encoder, waveform features, spectral features, max agreement.]

However, to the best of our knowledge, all previous works consider only one format of the audio modality. In this paper, we investigate the use of contrastive learning to learn audio representations from multiple formats. Different from the models that drive the current progress on learning unsupervised speech representations [9, 10, 11, 12, 13, 14], our method does not rely on a context prediction network and directly contrasts two augmented views (which resembles the image models in [5, 6]). We conduct experiments on different input formats, architectures, and augmentations. We find that much better representations can be learned by maximizing the agreement between two views of the same audio, one represented by the raw waveform and the other by log-mel filterbanks. As a result, our single-modal model has a test mAP of 0.376 on AudioSet, outperforming the previous best multi-modal score of 0.309 [28] by a large margin. Moreover, it generalizes to the ESC-50 downstream classification task with a new state-of-the-art accuracy of 90.5%.

2 Learning framework

The multi-format contrastive audio learning framework is depicted in Figure 1. The input audio in the waveform format is first cropped into two shorter clips $x_i$ and $x_j$. One of them can be further transformed into the spectral representation. Waveform or spectrogram augmentations are then applied accordingly, creating the positive pair $(\tilde{x}_i, \tilde{x}_j)$. There are different ways to form the distractors, or negatives, including sampling from the same or from other data samples. In this work we form negative pairs from views of two different samples, in the same way as SimCLR [6].
The model then learns by maximizing the similarity between the encoded representations of the positive pair $(z_i, z_j)$. The loss function is defined as

$$L_{i,j} = -\log \frac{\exp\left(\mathrm{sim}(z_i, z_j)/\tau\right)}{\sum_{k \neq i} \exp\left(\mathrm{sim}(z_i, z_k)/\tau\right)} \qquad (1)$$

where $\tau$ denotes the temperature parameter, and $\mathrm{sim}(\cdot, \cdot)$ is the nonlinear cosine similarity measure of the form $\mathrm{sim}(u, v) = g(u) \cdot g(v) / \left(\|g(u)\| \, \|g(v)\|\right)$, in which the projector $g$ is a multi-layer perceptron (MLP) with 1 hidden layer and ReLU nonlinearity. It is shared by both branches. The summation over $k$ in the denominator runs over the $2N - 1$ other crops in the batch (all crops except $z_i$). Both $L_{i,j}$ and $L_{j,i}$ are computed and summed up as the overall loss for the positive pair $(\tilde{x}_i, \tilde{x}_j)$. The final loss is computed across all positive pairs in the batch.

2.1 Architecture

For the spectral input format (including spectrograms, log-mel, and MFCCs), we adopt the state-of-the-art supervised models from [19], removing the last two linear layers and directly using the outputs of the global pooling layer. These models include CNN6, CNN10, CNN14, ResNet22, ResNet38, and ResNet54. For convenience, in this paper we use their original names even though they have two fewer layers.

When raw audio is presented, we employ networks previously used as the encoders in various unsupervised speech models [9, 11, 14]. The building block is a 1D convolutional layer followed by Group Normalization and ReLU activation. The first layer has a kernel size of 10 and a stride of 5, followed by 5 to 9 layers with a kernel size of 4 and a stride of 2. These models shrink the temporal dimension by a factor of 160 to 2560 (5 × 2^5 = 160 up to 5 × 2^9 = 2560), so we refer to them as Conv160, Conv320, ..., Conv2560 in this work. The number of filters in each layer is 512. Global average pooling is applied over the time dimension at the end. In addition, we consider the Res1dNet-31 and Res1dNet-51 models from [19]. We remove the last two linear layers and find that it is important to use Group Normalization for these models.
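As a rough illustration of Equation 1, the following standard-library sketch computes the loss for a small batch. It is not the authors' implementation: the function names `cosine`, `pair_loss`, and `batch_loss` are hypothetical, the inputs are assumed to be vectors that have already passed through the projector g, and averaging over the N positive pairs is an assumption (the text only says the loss is computed across all positive pairs).

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pair_loss(h, i, j, tau=0.1):
    """L_{i,j} of Equation 1 for a list h of 2N projected vectors.

    The denominator sums over every k != i, so it covers 2N - 1 crops
    (the positive j is included among them).
    """
    logits = [cosine(h[i], h[k]) / tau for k in range(len(h)) if k != i]
    positive = cosine(h[i], h[j]) / tau
    # Numerically stable log-sum-exp of the denominator terms.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - positive


def batch_loss(h_wave, h_spec, tau=0.1):
    """Sum L_{i,j} + L_{j,i} per positive pair, averaged over N pairs.

    Averaging (rather than summing) over pairs is an assumption.
    """
    n = len(h_wave)
    h = list(h_wave) + list(h_spec)  # crop i pairs with crop i + n
    total = 0.0
    for i in range(n):
        j = i + n
        total += pair_loss(h, i, j, tau) + pair_loss(h, j, i, tau)
    return total / n


if __name__ == "__main__":
    wave = [[1.0, 0.0], [0.0, 1.0]]
    aligned = [[1.0, 0.0], [0.0, 1.0]]
    swapped = [[0.0, 1.0], [1.0, 0.0]]
    print(batch_loss(wave, aligned) < batch_loss(wave, swapped))
```

With the default temperature of 0.1, the batch whose two branches agree on each sample yields a much smaller loss than the batch whose pairs are swapped, which is the behaviour the objective is meant to reward.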
The CNN6, CNN10, and Conv160 to Conv2560 models output a feature space of 512 dimensions, and the rest, namely CNN14, ResNet22, ResNet38, ResNet54, Res1dNet-31, and Res1dNet-51, output 2048 dimensions. The latent sizes need to be matched for the loss function when two different models are used. Note that the features from the encoders are pooled over the time and/or frequency dimension. In the ablation studies, if only one audio format is used, we let the two branches share the same model. We have also experimented with training two models and concatenating the output features, but found that it yields similar results. If two different formats are presented, we use two networks and concatenate the features for the downstream tasks. In the final results, we also report the performance of each network.

2.2 Audio augmentations

It has been observed previously that augmentations are important to the contrastive learning framework [3, 4, 6]. We consider different types of augmentations for raw audio and spectrograms [20, 32, 33]. As the pitch shift requires additional STFTs on the fly, in this work we consider the following less expensive augmentations:

Audio mixing. Small additive noise of any sort will not alter the original categories of the audio. Given two audio clips $x_1$ and $x_2$, the mixed-up version is

$$\hat{x}_1 = \alpha x_1 + (1 - \alpha) x_2 \qquad (2)$$

where $\hat{x}_1$ inherits the labels of $x_1$. In this work, $\alpha$ is sampled from a Beta(5, 2) distribution. This simulates various realistic noise conditions.

Time masking. $t$ consecutive time steps $[t_0, t_0 + t)$ of the audio, where $t_0$ is randomly sampled, can be dropped out without changing the event classes. This can be applied both to raw audio and to spectrograms.

Frequency masking. A small number $f$ of frequency components $[f_0, f_0 + f)$ of the spectrogram can be masked out without losing semantic information.
Frequency shift. One can apply a truncated shift in frequency to the spectrograms by an integer number sampled from $[-F, F]$, where $F$ is the maximum shift size. Missing values after the shift are set to zero energy. Intuitively, this is a less expensive alternative to changing the pitch of the audio.

3 Experiments

We use the audio segments sampled at 16 kHz from AudioSet [15] for both training and evaluation. We split the original training set into training and validation subsets of 95% and 5%, respectively. We evaluate the representations on the downstream task of training shallow fully connected audio classifiers, following the same setup as in [20, 26], where a 1-hidden-layer MLP with 512 units is used and the parameters of the pretrained network are fixed. In this section we detail some of the key factors that affect the model performance, based on the development set. Then we show how the proposed method compares to the state of the art on the test set.

Unless noted, the models are trained for up to 400k steps with a batch size of 1024. The Adam optimizer is used, starting from an initial learning rate of $10^{-4}$ and following a cosine learning rate decay down to $10^{-6}$. We randomly crop two windows of 3 seconds from each data sample during training. At evaluation time we split the data evenly into overlapping subclips with a stride of half the crop size, and average the logits from the subclips to obtain the overall score of the clip. The loss has a temperature of 0.1. When the spectral representations are used, they are extracted with a window size of 20 ms and a stride of 10 ms. The spectrogram and log-mel features have 80 dimensions. For MFCCs we follow the convention and take 13 features [31].

[Figure 2: Validation mAP of the small (left) and base (right) models with different combinations of audio formats as input.]

[Figure 3: Validation mAP of the raw-audio-vs-log-mel models with different combinations of raw audio (along rows) and spectrogram (along columns) augmentations.]
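The four augmentations of Section 2.2 can be sketched with plain Python lists, as below. This is only an illustration under stated assumptions: the function names are hypothetical, a spectrogram is represented as a list of frames (each a list of frequency bins), the shift is drawn uniformly from the integers in [-F, F] (the text does not name the distribution), and masked or vacated bins are filled with 0.0 as a stand-in for "zero energy".

```python
import random


def mix(x1, x2, rng):
    """Audio mixing (Equation 2) with alpha drawn from Beta(5, 2)."""
    alpha = rng.betavariate(5, 2)
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(x1, x2)]


def time_mask(samples, t, rng, fill=0.0):
    """Replace t consecutive raw-audio steps [t0, t0 + t) with `fill`."""
    t0 = rng.randint(0, len(samples) - t)  # random start, inclusive bounds
    return [fill if t0 <= k < t0 + t else s for k, s in enumerate(samples)]


def freq_mask(spec, f, rng, fill=0.0):
    """Mask f consecutive frequency bins [f0, f0 + f) in every frame."""
    f0 = rng.randint(0, len(spec[0]) - f)
    return [[fill if f0 <= b < f0 + f else v for b, v in enumerate(frame)]
            for frame in spec]


def freq_shift(spec, shift, fill=0.0):
    """Shift each frame by `shift` bins; bins left empty receive `fill`."""
    n = len(spec[0])
    return [[frame[b - shift] if 0 <= b - shift < n else fill
             for b in range(n)]
            for frame in spec]


def random_freq_shift(spec, max_shift, rng, fill=0.0):
    """Draw an integer shift from [-max_shift, max_shift] (uniform: assumed)."""
    return freq_shift(spec, rng.randint(-max_shift, max_shift), fill)


if __name__ == "__main__":
    rng = random.Random(0)
    spec = [[1.0, 2.0, 3.0, 4.0]]
    print(freq_shift(spec, 1))    # energy moves up by one bin
    print(freq_shift(spec, -2))   # energy moves down by two bins
    print(time_mask([1.0] * 10, 3, rng))
```

A positive shift moves energy toward higher bins and truncates whatever falls off the top, which is why the operation is much cheaper than a true pitch shift: no additional STFT is needed.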
The default small model uses CNN10 for spectrograms and Conv320 for raw input, and the base model employs CNN14 and Res1dNet-31.

3.1 Audio formats

We compare the performance of learning with different combinations of audio formats in Figure 2. Both the small and the base model benefit from using two kinds of input formats. Meanwhile, maximizing agreement between raw waveforms and the frequency representations outperforms combinations of two spectrogram-based formats by a large margin. In particular, the combination of raw audio and log-mel spectrograms stands out: on the base model, it improves upon the raw-audio-only and log-mel-only counterparts by a relative 15% and 41%, respectively. The MFCC-based models have the lowest scores, possibly because they are low in feature dimensions.

We think this works well because taking another format of the audio can be viewed as an aggressive way of transforming or augmenting the data to create semantically related but vastly different views, such that the contrastive learning framework cannot leverage trivial cues to solve the proxy task without learning meaningful representations.

3.2 Creation of the views

Table 1 shows that the model is better trained when taking two randomly cropped clips of 3 to 5 seconds to create the views. Taking the full length (10 seconds) results in the worst performance. This is possibly due to the multi-instance nature of AudioSet. Because our models output features averaged over the time dimension, and some classes may only last a very short duration within the clip, taking a long temporal scale may completely bury the short-lasting classes, resulting in a lower validation score. In addition, Table 1 also shows that the maximum frequency shift is optimal at around half of the frequency dimension size.
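The view creation described above (two random crops per clip, with one crop later converted to a spectral format) can be sketched as follows. The helper names are hypothetical, the two crops are assumed to be drawn independently (so they may overlap), and the frame count assumes no padding at the clip edges; the text does not specify either detail.

```python
import random

SAMPLE_RATE = 16_000  # Hz, the sampling rate stated in Section 3


def two_random_crops(waveform, crop_seconds=3.0, rng=None):
    """Return two independently positioned crops of the same waveform."""
    rng = rng or random.Random()
    n = int(round(crop_seconds * SAMPLE_RATE))
    if n > len(waveform):
        raise ValueError("clip is shorter than the requested crop")
    starts = [rng.randint(0, len(waveform) - n) for _ in range(2)]
    return [waveform[s:s + n] for s in starts]


def num_frames(num_samples, window_ms=20, stride_ms=10):
    """Number of analysis frames for a crop, assuming no edge padding."""
    window = SAMPLE_RATE * window_ms // 1000  # 320 samples at 16 kHz
    stride = SAMPLE_RATE * stride_ms // 1000  # 160 samples at 16 kHz
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // stride


if __name__ == "__main__":
    clip = [0.0] * (10 * SAMPLE_RATE)  # a silent 10-second clip
    crop_a, crop_b = two_random_crops(clip, rng=random.Random(0))
    print(len(crop_a), len(crop_b))   # 48000 samples each
    print(num_frames(len(crop_a)))    # frames in a 3-second crop
```

Under these assumptions a 3-second crop holds 48,000 samples and yields 299 frames with the 20 ms window and 10 ms stride, so the spectral branch sees an input of roughly 299 × 80 for log-mel features.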
Table 1: Effects of different crop sizes, maximum frequency shifts, temperature parameters, and projection latent sizes on the base model when trained with both raw audio and log-mel spectrograms.

Crop size (s)     1      2      3      4      5      6      8      10
Val mAP           0.310  0.336  0.340  0.344  0.341  0.328  0.305  0.262

Max freq shift    0      2      4      10     20     40     60     80
Val mAP           0.331  0.329  0.333  0.340  0.340  0.342  0.338  0.326

Temperature       0.05   0.1    0.25   0.5    0.75   1
Val mAP           0.330  0.340  0.326  0.312  0.297  0.285

Latent size       128    256    512    1024   2048
Val mAP           0.331  0.338  0.340  0.342  0.345

Table 2: Impacts of network architectures when trained with both raw waveforms and log-mel spectrograms (validation mAP).

Small model    Conv160  Conv320  Conv640  Conv1280  Conv2560
CNN6           0.300    0.305    0.313    0.314     0.314
CNN10          0.303    0.313    0.320    0.320     0.322

Base model     Res1dNet-31  Res1dNet-51
CNN14          0.340        0.340
ResNet-22      0.335        0.336
ResNet-38      0.332        0.333
ResNet-54      0.335        0.340

Figure 3 shows that the frequency shift has the biggest impact on both the small and the base model. The base model benefits more from the audio mixing, possibly because the small model does not have enough capacity to account for it. In our experiments, masking on either time or frequency does not improve the downstream performance.

3.3 Model architectures

We run ablations on the choice of model architectures and the results are shown in Table 2. For smaller models, we observe gains from increasing the model capacity. However, the same does not hold for large models: the performance does not improve further when the model size goes beyond Res1dNet-31 or CNN14. The same behavior has also been documented in the supervised setting [19]. We leave how to scale it further for future work.

3.4 More ablations

We observe trends similar to the SimCLR image model [6] in the choices of the temperature parameter (Table 1), projection MLP latent size (Table 1), and batch size (Figure 4).
In particular, we also find that it is better to train the contrastive framework with a very large batch size, possibly because it needs a large pool of negatives for the softmax loss in Equation 1 to pick out the hard ones.

[Figure 4: Validation mean average precision of different batch sizes on the base model. Axes: batch size (256 to 4096) against validation mAP (0.0 to 0.4); curves: log-mel, raw audio, raw audio & log-mel.]

3.5 Comparison to the state of the art

Based on the findings from the ablations above, we train our final model with CNN14 and Res1dNet-31 using a large batch size of 32768. Frequency shift, audio mixing, and time and frequency masking are applied to the log-mel branch (CNN14), and audio mixing is used for the raw waveform branch (Res1dNet-31). The model takes 3-second crops and is trained for up to 700k steps; we save the checkpoint with the best validation score. The latent size in the projection head is 1024.

Table 3: Test performance of shallow model classification on AudioSet with fixed representations.

Model            Train inputs             Eval inputs          Test mAP
Triplet [20]     log-mel                  log-mel              0.244
L3 [22]          log-mel + video          log-mel              0.249
CPC [21]         waveform                 waveform             0.277
C3 [26]          log-mel + video          log-mel              0.285
MMV [28]         log-mel + video + text   log-mel              0.309
Ours             log-mel                  log-mel              0.329
Ours             waveform                 waveform             0.336
Ours             waveform + log-mel       log-mel              0.368
Ours             waveform + log-mel       waveform             0.355
Ours             waveform + log-mel       waveform + log-mel   0.376
Supervised [19]  waveform + log-mel       waveform + log-mel   0.439

We report the test scores on AudioSet in Table 3. When only one audio format is used in the contrastive learning framework, the results are already better than the previous best score, which was obtained by training with multiple modalities. Specifically, our log-mel-only model has a test mAP of 0.329 and the waveform-only one scores 0.336, outperforming the multimodal versatile network at 0.309, which is trained with audio, video, and text [28].
This is counter-intuitive because previous works have shown that multi-modal learning is better than single-modal learning. We think this is at least partially because using only the audio modality allows one to train models using the contrastive loss with a very large batch size and more training steps given the same computation budget.

When two audio formats are presented and both the log-mel and waveform networks are trained simultaneously, each individual network performs better than when trained with only one format. Note that the log-mel network, performing at 0.368 mAP, employs the CNN14 architecture. The same network trained under the supervised setting is reported with an mAP of 0.375 in [19] (without class balancing). If we use the concatenated features from the two networks, the performance further increases to 0.376.

The current supervised state of the art of 0.439 mAP is achieved by the Wavegram model trained with both log-mel spectrograms and waveforms [19]. However, it is noted that class balancing is crucial to this model, which requires access to class labels. The self-supervised framework used in this work does not require any labels and makes no assumption about the class distribution. This ensures that this self-supervised method can scale well with the large amount of unlabelled data. We leave this for future work.

We also present in Table 4 the results on generalization to a smaller downstream dataset, ESC-50 [34]. It is widely used for evaluating audio representations trained by various cross-modal frameworks. Our multi-format model achieves a new state-of-the-art accuracy of 90.5% without requiring any additional modality or dataset for pre-training. Moreover, without feature concatenation, the raw audio network of this model alone has an accuracy of 89.3%, while the log-mel network achieves 89.7%, which are higher than the corresponding results of 84.9% and 86.3% using single-format training, respectively.
Table 4: Test accuracy of linear classification on ESC-50 with fixed audio representations. Hyperparameters of the classifier are selected with split 1 and the average accuracy over 5 splits is reported.

Model            Train inputs             Eval inputs          Test accuracy (%)
L3 [22]          log-mel + video          log-mel              79.3
AVTS [24]        log-mel + video          log-mel              82.3
XDC [27]         log-mel + video          log-mel              84.8
GDT [30]         log-mel + video          log-mel              88.5
MMV [28]         log-mel + video + text   log-mel              88.9
AVID [29]        log-mel + video          log-mel              89.2
Ours             log-mel                  log-mel              86.3
Ours             waveform                 waveform             84.9
Ours             waveform + log-mel       log-mel              89.7
Ours             waveform + log-mel       waveform             89.3
Ours             waveform + log-mel       waveform + log-mel   90.5
Supervised [19]  waveform + log-mel       log-mel              90.8

4 Conclusions

In this work, we study the use of multiple formats of the audio for contrastive learning. We observe a significant advantage when training with both raw waveforms and log-mel spectrograms. Our model improves the state of the art on the AudioSet benchmark by a relative 21.7%, narrowing the gap between unsupervised and supervised learning. Our work shows that multi-format training is promising for fully unlocking the potential of large-scale, audio-only unsupervised learning.

Acknowledgments

The authors would like to thank Yan Wu for fruitful discussions. We also appreciate the feedback from the anonymous reviewers.

References

[1] R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. In International Conference on Learning Representations, 2018.
[2] Aäron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
[3] Philip Bachman, R Devon Hjelm, and William Buchwalter. Learning representations by maximizing mutual information across views.
In Advances in Neural Information Processing Systems (NeurIPS), pages 15509–15519, 2019.
[4] Olivier Henaff. Data-efficient image recognition with contrastive predictive coding. In International Conference on Machine Learning, pages 4182–4192. PMLR, 2020.
[5] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738, 2020.
[6] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020.
[7] Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey Hinton. Big self-supervised models are strong semi-supervised learners. arXiv preprint arXiv:2006.10029, 2020.
[8] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[9] Steffen Schneider, Alexei Baevski, Ronan Collobert, and Michael Auli. wav2vec: Unsupervised pre-training for speech recognition. Proc. Interspeech 2019, pages 3465–3469, 2019.
[10] Alexei Baevski, Steffen Schneider, and Michael Auli. vq-wav2vec: Self-supervised learning of discrete speech representations. International Conference on Learning Representations (ICLR), 2019.
[11] Kazuya Kawakami, Luyu Wang, Chris Dyer, Phil Blunsom, and Aäron van den Oord. Learning robust and multilingual speech representations. arXiv preprint arXiv:2001.11128, 2020.
[12] Morgane Rivière, Armand Joulin, Pierre-Emmanuel Mazaré, and Emmanuel Dupoux. Unsupervised pretraining transfers well across languages. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7414–7418. IEEE, 2020.
[13] Jacob Kahn, Morgane Rivière, Weiyi Zheng, Evgeny Kharitonov, Qiantong Xu, Pierre-Emmanuel Mazaré, Julien Karadayi, Vitaliy Liptchinsky, Ronan Collobert, Christian Fuegen, et al. Libri-light: A benchmark for ASR with limited or no supervision. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7669–7673. IEEE, 2020.
[14] Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. arXiv preprint arXiv:2006.11477, 2020.
[15] Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R Channing Moore, Manoj Plakal, and Marvin Ritter. Audio set: An ontology and human-labeled dataset for audio events. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 776–780. IEEE, 2017.
[16] Shawn Hershey, Sourish Chaudhuri, Daniel PW Ellis, Jort F Gemmeke, Aren Jansen, R Channing Moore, Manoj Plakal, Devin Platt, Rif A Saurous, Bryan Seybold, et al. CNN architectures for large-scale audio classification. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 131–135. IEEE, 2017.
[17] Yun Wang, Juncheng Li, and Florian Metze. A comparison of five multiple instance learning pooling functions for sound event detection with weak labeling. In 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 31–35. IEEE, 2019.
[18] Logan Ford, Hao Tang, François Grondin, and James R Glass. A deep residual network for large-scale acoustic scene analysis. In INTERSPEECH, pages 2568–2572, 2019.
[19] Qiuqiang Kong, Yin Cao, Turab Iqbal, Yuxuan Wang, Wenwu Wang, and Mark D Plumbley. PANNs: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28:2880–2894, 2020.
[20] Aren Jansen, Manoj Plakal, Ratheet Pandya, Daniel PW Ellis, Shawn Hershey, Jiayang Liu, R Channing Moore, and Rif A Saurous. Unsupervised learning of semantic audio representations. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 126–130. IEEE, 2018.
[21] Luyu Wang, Kazuya Kawakami, and Aäron van den Oord. Contrastive predictive coding of audio with an adversary. Proc. Interspeech 2020, pages 826–830, 2020.
[22] Relja Arandjelovic and Andrew Zisserman. Look, listen and learn. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 609–617, 2017.
[23] Relja Arandjelovic and Andrew Zisserman. Objects that sound. In Proceedings of the European Conference on Computer Vision (ECCV), pages 435–451, 2018.
[24] Bruno Korbar, Du Tran, and Lorenzo Torresani. Cooperative learning of audio and video models from self-supervised synchronization. In Advances in Neural Information Processing Systems, pages 7763–7774, 2018.
[25] Andrew Owens and Alexei A Efros. Audio-visual scene analysis with self-supervised multisensory features. In Proceedings of the European Conference on Computer Vision (ECCV), pages 631–648, 2018.
[26] Aren Jansen, Daniel PW Ellis, Shawn Hershey, R Channing Moore, Manoj Plakal, Ashok C Popat, and Rif A Saurous. Coincidence, categorization, and consolidation: Learning to recognize sounds with minimal supervision. In 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 121–125. IEEE, 2020.
[27] Humam Alwassel, Dhruv Mahajan, Lorenzo Torresani, Bernard Ghanem, and Du Tran. Self-supervised learning by cross-modal audio-video clustering. arXiv preprint arXiv:1911.12667, 2019.
[28] Jean-Baptiste Alayrac, Adrià Recasens, Rosalia Schneider, Relja Arandjelović, Jason Ramapuram, Jeffrey De Fauw, Lucas Smaira, Sander Dieleman, and Andrew Zisserman. Self-supervised multimodal versatile networks. arXiv preprint arXiv:2006.16228, 2020.
[29] Pedro Morgado, Nuno Vasconcelos, and Ishan Misra. Audio-visual instance discrimination with cross-modal agreement. arXiv preprint arXiv:2004.12943, 2020.
[30] Mandela Patrick, Yuki M. Asano, Ruth Fong, João F. Henriques, Geoffrey Zweig, and Andrea Vedaldi. Multi-modal self-supervision from generalized data transformations. arXiv preprint arXiv:2003.04298, 2020.
[31] Daniel Jurafsky and James H. Martin. Speech and language processing (2nd edition). Pearson Education, 2008.
[32] Daniel S Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D Cubuk, and Quoc V Le. SpecAugment: A simple data augmentation method for automatic speech recognition. Proc. Interspeech 2019, pages 2613–2617, 2019.
[33] Eugene Kharitonov, Morgane Rivière, Gabriel Synnaeve, Lior Wolf, Pierre-Emmanuel Mazaré, Matthijs Douze, and Emmanuel Dupoux. Data augmenting contrastive learning of speech representations in the time domain. arXiv preprint arXiv:2007.00991, 2020.
[34] Karol J Piczak. ESC: Dataset for environmental sound classification. In Proceedings of the 23rd ACM International Conference on Multimedia, pages 1015–1018, 2015.

References (34)

R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. In International Conference on Learning Representations, 2018.
Aäron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
Philip Bachman, R Devon Hjelm, and William Buchwalter. Learning representations by maximizing mutual information across views. In Advances in Neural Information Processing Systems (NeurIPS), pages 15509-15519, 2019.
Olivier Henaff. Data-efficient image recognition with contrastive predictive coding. In Interna- tional Conference on Machine Learning, pages 4182-4192. PMLR, 2020.
Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729-9738, 2020.
Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020.
Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey Hinton. Big self-supervised models are strong semi-supervised learners. arXiv preprint arXiv:2006.10029, 2020.
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large- scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248-255. Ieee, 2009.
Steffen Schneider, Alexei Baevski, Ronan Collobert, and Michael Auli. wav2vec: Unsupervised pre-training for speech recognition. Proc. Interspeech 2019, pages 3465-3469, 2019.
Alexei Baevski, Steffen Schneider, and Michael Auli. vq-wav2vec: Self-supervised learning of discrete speech representations. International Conference on Learning Representations (ICLR), 2019.
Kazuya Kawakami, Luyu Wang, Chris Dyer, Phil Blunsom, and Aäron van den Oord. Learning robust and multilingual speech representations. arXiv preprint arXiv:2001.11128, 2020.
Morgane Rivière, Armand Joulin, Pierre-Emmanuel Mazaré, and Emmanuel Dupoux. Unsupervised pretraining transfers well across languages. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7414-7418. IEEE, 2020.
Jacob Kahn, Morgane Rivière, Weiyi Zheng, Evgeny Kharitonov, Qiantong Xu, Pierre-Emmanuel Mazaré, Julien Karadayi, Vitaliy Liptchinsky, Ronan Collobert, Christian Fuegen, et al. Libri-light: A benchmark for ASR with limited or no supervision. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7669-7673. IEEE, 2020.
Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. arXiv preprint arXiv:2006.11477, 2020.
Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R Channing Moore, Manoj Plakal, and Marvin Ritter. Audio set: An ontology and human-labeled dataset for audio events. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 776-780. IEEE, 2017.
Shawn Hershey, Sourish Chaudhuri, Daniel PW Ellis, Jort F Gemmeke, Aren Jansen, R Channing Moore, Manoj Plakal, Devin Platt, Rif A Saurous, Bryan Seybold, et al. CNN architectures for large-scale audio classification. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 131-135. IEEE, 2017.
Yun Wang, Juncheng Li, and Florian Metze. A comparison of five multiple instance learning pooling functions for sound event detection with weak labeling. In 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 31-35. IEEE, 2019.
Logan Ford, Hao Tang, François Grondin, and James R Glass. A deep residual network for large-scale acoustic scene analysis. In INTERSPEECH, pages 2568-2572, 2019.
Qiuqiang Kong, Yin Cao, Turab Iqbal, Yuxuan Wang, Wenwu Wang, and Mark D Plumbley. PANNs: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28:2880-2894, 2020.
Aren Jansen, Manoj Plakal, Ratheet Pandya, Daniel PW Ellis, Shawn Hershey, Jiayang Liu, R Channing Moore, and Rif A Saurous. Unsupervised learning of semantic audio representations. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 126-130. IEEE, 2018.
Luyu Wang, Kazuya Kawakami, and Aäron van den Oord. Contrastive predictive coding of audio with an adversary. Proc. Interspeech 2020, pages 826-830, 2020.
Relja Arandjelovic and Andrew Zisserman. Look, listen and learn. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 609-617, 2017.
Relja Arandjelovic and Andrew Zisserman. Objects that sound. In Proceedings of the European Conference on Computer Vision (ECCV), pages 435-451, 2018.
Bruno Korbar, Du Tran, and Lorenzo Torresani. Cooperative learning of audio and video models from self-supervised synchronization. In Advances in Neural Information Processing Systems, pages 7763-7774, 2018.
Andrew Owens and Alexei A Efros. Audio-visual scene analysis with self-supervised multisensory features. In Proceedings of the European Conference on Computer Vision (ECCV), pages 631-648, 2018.
Aren Jansen, Daniel PW Ellis, Shawn Hershey, R Channing Moore, Manoj Plakal, Ashok C Popat, and Rif A Saurous. Coincidence, categorization, and consolidation: Learning to recognize sounds with minimal supervision. In 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 121-125. IEEE, 2020.
Humam Alwassel, Dhruv Mahajan, Lorenzo Torresani, Bernard Ghanem, and Du Tran. Self-supervised learning by cross-modal audio-video clustering. arXiv preprint arXiv:1911.12667, 2019.
Jean-Baptiste Alayrac, Adrià Recasens, Rosalia Schneider, Relja Arandjelović, Jason Ramapuram, Jeffrey De Fauw, Lucas Smaira, Sander Dieleman, and Andrew Zisserman. Self-supervised multimodal versatile networks. arXiv preprint arXiv:2006.16228, 2020.
Pedro Morgado, Nuno Vasconcelos, and Ishan Misra. Audio-visual instance discrimination with cross-modal agreement. arXiv preprint arXiv:2004.12943, 2020.
Mandela Patrick, Yuki M. Asano, Ruth Fong, João F. Henriques, Geoffrey Zweig, and Andrea Vedaldi. Multi-modal self-supervision from generalized data transformations. arXiv preprint arXiv:2003.04298, 2020.
Daniel Jurafsky and James H. Martin. Speech and language processing (2nd edition). Pearson Education, 2008.
Daniel S Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D Cubuk, and Quoc V Le. SpecAugment: A simple data augmentation method for automatic speech recognition. Proc. Interspeech 2019, pages 2613-2617, 2019.
Eugene Kharitonov, Morgane Rivière, Gabriel Synnaeve, Lior Wolf, Pierre-Emmanuel Mazaré, Matthijs Douze, and Emmanuel Dupoux. Data augmenting contrastive learning of speech representations in the time domain. arXiv preprint arXiv:2007.00991, 2020.
Karol J Piczak. ESC: Dataset for environmental sound classification. In Proceedings of the 23rd ACM International Conference on Multimedia, pages 1015-1018, 2015.

FAQs

What explains the improved performance using multi-format audio representations?

The research reveals that combining raw waveforms and log-mel spectrograms yields 21.7% better performance, achieving a mean average precision of 0.376 on AudioSet. This advantage arises from maximizing agreement between distinct audio representations.

How does the performance of single-modal versus multi-modal audio models compare?

The study finds that single-modal models outperform the best multi-modal model (0.329 vs. 0.309 mAP). The raw audio-only model recorded a score of 0.336, suggesting advantages in using only audio data for contrastive learning.

When were the enhancements in audio representation accuracy observed?

Significant enhancements were noted when using multi-format training, resulting in a new state-of-the-art accuracy of 90.5% on the ESC-50 classification task. This accuracy is achieved without additional modalities or pre-training datasets.

What methodologies were used for augmenting audio data in the experiments?

The paper implemented augmentations such as audio mixing, time masking, and frequency masking to enhance contrastive learning. Notably, audio mixing improved the base model's performance due to its increased capacity.
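
As a rough illustration of the three augmentations named above, the following NumPy sketch mixes two waveforms and masks one time band and one frequency band of a log-mel spectrogram. The function names, the zero fill value, the single band per axis, and the mixing weight `alpha` are all assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def mix_audio(wave, other, alpha):
    """Blend two equal-length waveforms; alpha weights the primary clip.

    The convex-combination form and alpha are illustrative assumptions.
    """
    return alpha * wave + (1.0 - alpha) * other

def mask_spectrogram(log_mel, max_time_width, max_freq_width, rng):
    """Zero out one random time band and one random frequency band.

    log_mel has shape (n_frames, n_bins). The fill value of 0.0 and the
    use of a single band per axis are assumptions of this sketch.
    """
    out = log_mel.copy()  # leave the caller's array untouched
    n_frames, n_bins = out.shape

    # Time masking: hide a contiguous block of frames.
    t_width = rng.integers(0, max_time_width + 1)
    t_start = rng.integers(0, n_frames - t_width + 1)
    out[t_start:t_start + t_width, :] = 0.0

    # Frequency masking: hide a contiguous block of mel bins.
    f_width = rng.integers(0, max_freq_width + 1)
    f_start = rng.integers(0, n_bins - f_width + 1)
    out[:, f_start:f_start + f_width] = 0.0
    return out
```

In this sketch, mixing would be applied to the raw waveform before any spectrogram is computed, while the two masks act on the spectrogram itself.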

How impactful are batch sizes on model training in contrastive learning?add

The research indicates that training with a batch size of 32,768 significantly improves model performance due to a larger pool of negative samples for the softmax loss. This aligns with trends observed in similar image models.
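
To make the link between batch size and negatives concrete, here is a minimal NumPy sketch of a softmax contrastive loss between waveform and spectrogram embeddings of the same clips. It assumes cosine similarity, a temperature of 0.1, and a single (waveform-to-spectrogram) direction; these choices and the function name are illustrative assumptions, not the paper's confirmed configuration.

```python
import numpy as np

def cross_format_contrastive_loss(wave_emb, spec_emb, temperature=0.1):
    """Softmax contrastive loss with in-batch negatives.

    wave_emb and spec_emb have shape (B, D); row i of each comes from the
    same clip. Each row therefore sees 1 positive and B - 1 negatives, so
    a larger batch directly enlarges the pool of negatives.
    """
    # Cosine similarity via L2-normalised embeddings (an assumption here).
    wave_emb = wave_emb / np.linalg.norm(wave_emb, axis=1, keepdims=True)
    spec_emb = spec_emb / np.linalg.norm(spec_emb, axis=1, keepdims=True)

    logits = wave_emb @ spec_emb.T / temperature      # shape (B, B)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The matching pair for row i sits on the diagonal.
    return -np.mean(np.diag(log_probs))
```

When the embeddings carry no information (all rows identical), the loss equals log B, the chance level for picking the positive among B candidates; well-separated matching pairs drive it toward zero.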
