Multi-Format Contrastive Learning of Audio Representations
Luyu Wang (Google DeepMind, [email protected])
Aäron van den Oord (Google DeepMind, [email protected])
arXiv:2103.06508v3 [cs.SD] 24 Mar 2021
Abstract
Recent advances suggest the advantage of multi-modal training in comparison with
single-modal methods. In contrast to this view, in our work we find that similar
gains can be obtained from training with different formats of a single modality. In
particular, we investigate the use of the contrastive learning framework to learn
audio representations by maximizing the agreement between the raw audio and its
spectral representation. We find a significant gain using this multi-format strategy
against the single-format counterparts. Moreover, on the downstream AudioSet
and ESC-50 classification tasks, our audio-only approach achieves new state-of-the-art
results with a mean average precision of 0.376 and an accuracy of 90.5%,
respectively.
1 Introduction
Self-supervised learning leverages proxy tasks to learn useful representations of the data without
requiring manually annotated labels. In computer vision, methods using contrastive losses [1, 2, 3, 4,
5, 6, 7], which learn by maximizing the similarity between augmented views of the same image,
stand out on the ImageNet benchmark [8]. Contrastive learning has also facilitated the recent rapid
progress in unsupervised speech recognition [9, 10, 11, 12, 13, 14]. Good speech representations
should be able to extract transient linguistic information. Therefore, these works rely on context
prediction models that output representations with a fine-grained temporal resolution, while excluding
non-speech sounds that can distract the model from the task. For both image and speech recognition,
the gaps between supervised and unsupervised representations have largely been eliminated.
Unlike speech recognition, the multi-instance audio event classification problem requires
discriminative representations to tell the differences among a broad class of audio events.
AudioSet [15] is an ImageNet-scale dataset for general audio understanding, which contains 527
highly imbalanced event classes. Recent development in this direction has mainly focused on
supervised learning [16, 17, 18, 19]. In [20], a triplet-based unsupervised approach is introduced to
learn audio features from augmented spectrograms. Later, the effectiveness of contrastive predictive
coding (CPC), which operates on raw waveforms and is widely used for speech models, is
investigated in [21]. Meanwhile, recent works show that better representations can be learned by the
proxy task of predicting whether the visual and audio signals come from the same video
[22, 23, 24, 25, 26, 27, 28, 29, 30]. On the AudioSet benchmark, [26] shows that it is beneficial to
maximize the coincidence between video and audio with the contrastive loss. There is a clear further
advantage in taking the additional text modality into account [28]. However, the state-of-the-art
unsupervised audio model still lags behind the supervised one [19] (mean average precision (mAP)
0.309 vs 0.439).
Apart from the raw audio format, traditional signal processing allows us to convert the waveforms into
spectral representations via short-time Fourier transforms (STFTs) [31]. Such spectrograms can
further be converted into log-mel filter banks and mel-frequency cepstral coefficients (MFCCs), which
are the dominant formats for supervised and unsupervised audio recognition [16, 17, 18, 19, 20, 26].
Self-Supervised Learning for Speech and Audio Processing Workshop @ NeurIPS 2020.
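[Figure 1 diagram: the input audio is cut into two crops (Crop 1, Crop 2). One crop passes through
waveform augmentations and the waveform encoder to give waveform features; a spectrogram is
extracted from the other crop, which passes through spectrogram augmentations and the spectrogram
encoder to give spectral features. The agreement between the two sets of features is maximized.]
Figure 1: Illustration of the multi-format contrastive audio learning framework.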
However, to the best of our knowledge, all previous works consider only one format of the audio
modality. In this paper, we investigate the use of contrastive learning to learn audio representations
from multiple formats. Unlike the models that drive the current progress on learning
unsupervised speech representations [9, 10, 11, 12, 13, 14], our method does not rely on a context
prediction network, and directly contrasts two augmented views (which resembles the image models
in [5, 6]). We conduct experiments on different input formats, architectures, and augmentations.
We find that much better representations can be learned by maximizing the agreement between
two views of the same audio, one represented by the raw waveform and the other by log-mel filter
banks. As a result, our single-modal model has a test mAP of 0.376 on AudioSet, outperforming the
previous best multi-modal score of 0.309 [28] by a large margin. Moreover, it generalizes to the
ESC-50 downstream classification task with a new state-of-the-art accuracy of 90.5%.
2 Learning framework
The multi-format contrastive audio learning framework is depicted in Figure 1. The input audio
in the waveform format is first cropped into two shorter clips $x_i$ and $x_j$. One of them can be
further transformed into the spectral representation. Waveform or spectrogram augmentations are
then applied accordingly, creating the positive pair $(\tilde{x}_i, \tilde{x}_j)$. There are different
ways to form the distractors, or negatives, including sampling from the same or other data samples.
In this work we form the negative pairs from views of two different samples, the same way as in
SimCLR [6]. The model then learns by maximizing the similarity between the encoded representations
of the positive pair $(z_i, z_j)$. The loss function is defined as
$$L_{i,j} = -\log \frac{\exp\left(\mathrm{sim}(z_i, z_j)/\tau\right)}{\sum_{k \neq i} \exp\left(\mathrm{sim}(z_i, z_k)/\tau\right)} \qquad (1)$$
where $\tau$ denotes the temperature parameter, and $\mathrm{sim}(\cdot, \cdot)$ is the nonlinear
cosine similarity measure with the form $\mathrm{sim}(u, v) = g(u) \cdot g(v) / (\|g(u)\| \, \|g(v)\|)$,
in which the projector $g$ is a multi-layer perceptron (MLP) with 1 hidden layer and ReLU
nonlinearity. It is shared by both branches. For a batch of $N$ audio samples, the summation in the
denominator over $k$ is computed from the $2N - 1$ crops in the batch (excluding $z_i$). Both
$L_{i,j}$ and $L_{j,i}$ are computed and summed up as the overall loss for the positive pair
$(\tilde{x}_i, \tilde{x}_j)$. The final loss is computed across all positive pairs in the batch.
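As an illustration of Equation 1, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the inputs are already the projected outputs of g for the two branches, and it averages the 2N per-crop terms, which is our choice of normalization; the function name `nt_xent_loss` and its arguments are hypothetical.

```python
import numpy as np

def nt_xent_loss(proj_a, proj_b, temperature=0.1):
    """Contrastive loss of Equation 1 over a batch of N positive pairs.

    proj_a[n] and proj_b[n] are the projected views g(z) of the same audio
    sample n (assumed shape: (N, D)). Returns the mean over all 2N crops.
    """
    n = proj_a.shape[0]
    z = np.concatenate([proj_a, proj_b], axis=0)        # (2N, D)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)    # unit norm, so dot = cosine
    logits = z @ z.T / temperature                      # (2N, 2N) scaled similarities
    np.fill_diagonal(logits, -np.inf)                   # exclude k = i from the sum
    # Row i's positive sits N rows away (a[n] <-> b[n]).
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    # Numerically stable log-sum-exp over the remaining 2N - 1 crops.
    row_max = logits.max(axis=1, keepdims=True)
    lse = row_max[:, 0] + np.log(np.exp(logits - row_max).sum(axis=1))
    per_crop = lse - logits[np.arange(2 * n), pos]      # L_{i,j} for every crop i
    return per_crop.mean()
```

Because both L_{i,j} and L_{j,i} appear among the 2N rows, the mean covers every positive pair in both directions.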
2.1 Architecture
For the spectral input format (including spectrograms, log-mel, and MFCCs), we adopt the
state-of-the-art supervised models from [19], removing the last two linear layers and directly using
the outputs from the global pooling layer. These models include CNN6, CNN10, CNN14, ResNet22,
ResNet38, and ResNet54. For convenience, in this paper we use their original names even though
there are two fewer layers.
When raw audio is presented, we employ networks previously used as the encoders in various
unsupervised speech models [9, 11, 14]. The building block is a 1D convolutional layer followed
by Group Normalization and ReLU activation. The first layer has a kernel size of 10 and stride 5,
followed by 5 to 9 layers with a kernel size of 4 and stride 2. These models shrink the temporal
dimension by 160 to 2560 times, so we refer to them as Conv160, Conv320, ..., Conv2560 in this
work. The number of filters in each layer is 512. Global average pooling is applied on the time
dimension at the end. In addition, we consider the Res1dNet-31 and Res1dNet-51 models from [19].
We remove the last two linear layers and find it is important to use Group Normalization for these
models.
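The model names follow directly from the strides described above. The short sketch below reproduces that arithmetic; the helper names are hypothetical, and the output-length calculation assumes unpadded convolutions, which the text does not specify.

```python
def downsampling_factor(num_stride2_layers, first_stride=5, later_stride=2):
    """Total temporal reduction: one stride-5 layer, then n stride-2 layers."""
    return first_stride * later_stride ** num_stride2_layers

def output_length(num_samples, num_stride2_layers):
    """Number of output frames, assuming no padding (an assumption)."""
    length = (num_samples - 10) // 5 + 1          # first layer: kernel 10, stride 5
    for _ in range(num_stride2_layers):
        length = (length - 4) // 2 + 1            # later layers: kernel 4, stride 2
    return length

# 5 to 9 stride-2 layers give the five encoder variants named in the text.
names = [f"Conv{downsampling_factor(n)}" for n in range(5, 10)]
print(names)  # ['Conv160', 'Conv320', 'Conv640', 'Conv1280', 'Conv2560']
```

Under the no-padding assumption, a 3-second crop at 16 kHz (48000 samples) yields slightly fewer frames than 48000 divided by the nominal factor, because each kernel consumes a few samples at the boundary.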
The CNN6, CNN10, and Conv160 to Conv2560 models output features of 512 dimensions, and
the rest, namely CNN14, ResNet22, ResNet38, ResNet54, Res1dNet-31, and Res1dNet-51, output
2048 dimensions. The latent sizes need to match for the loss function when two different
models are used. Note that the features from the encoders are pooled over the time and/or frequency
dimension. In the ablation studies, if only one audio format is used, we let the two branches share the
same model. We have also experimented with training two models and concatenating the output
features, but found that it yields similar results. If two different formats are presented, we use two
networks and concatenate the features for the downstream tasks. In the final results, we also report
the performance of each network.
2.2 Audio augmentations
It has been observed previously that augmentations are important to the contrastive learning framework
[3, 4, 6]. We consider different types of augmentations for raw audio and spectrograms [20, 32, 33].
As the pitch shift requires additional STFTs on the fly, in this work we consider the following less
expensive augmentations:
Audio mixing. Small additive noise of any sort will not alter the original categories of the audio.
Given two audio clips $x_1$ and $x_2$, the mixed-up version is
$$\hat{x}_1 = \alpha x_1 + (1 - \alpha) x_2 \qquad (2)$$
where $\hat{x}_1$ inherits the labels from $x_1$. In this work, $\alpha$ is sampled from a
Beta(5, 2) distribution. This simulates various realistic noise conditions.
Time masking. A run of $t$ consecutive time steps $[t_0, t_0 + t)$ of the audio, where $t_0$ is
randomly sampled, can be dropped out without changing the event classes. This can be applied both
to raw audio and spectrograms.
Frequency masking. A small number $f$ of frequency components $[f_0, f_0 + f)$ on the spectrogram
can be masked out without losing semantic information.
Frequency shift. One can apply a truncated shift in frequency to the spectrograms by an integer
number sampled from $[-F, F]$, where $F$ is the maximum shift size. Missing values after the shift
are set to zero energy. Intuitively, this is a less expensive alternative to changing the pitch of the
audio.
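The four augmentations above can be sketched in a few lines of NumPy. This is an illustrative sketch under our own assumptions rather than the authors' code: spectrograms are taken to have shape (time, frequency), mask lengths are drawn uniformly up to a hypothetical `max_len`, and masked or vacated entries are filled with a `fill` value that stands in for "zero energy" (for log-scaled features a floor value may be more appropriate).

```python
import numpy as np

def mix_audio(x1, x2, rng, a=5.0, b=2.0):
    """Equation 2: blend x2 into x1 with alpha ~ Beta(5, 2); labels follow x1."""
    alpha = rng.beta(a, b)
    return alpha * x1 + (1.0 - alpha) * x2

def time_mask(x, max_len, rng, fill=0.0):
    """Mask t consecutive steps [t0, t0 + t) along axis 0 (waveform or spectrogram)."""
    out = x.copy()
    t = int(rng.integers(0, max_len + 1))            # assumed: t uniform in [0, max_len]
    t0 = int(rng.integers(0, x.shape[0] - t + 1))
    out[t0:t0 + t] = fill
    return out

def freq_mask(spec, max_len, rng, fill=0.0):
    """Mask f consecutive frequency bins [f0, f0 + f) of a (time, freq) array."""
    out = spec.copy()
    f = int(rng.integers(0, max_len + 1))
    f0 = int(rng.integers(0, spec.shape[1] - f + 1))
    out[:, f0:f0 + f] = fill
    return out

def freq_shift(spec, max_shift, rng, fill=0.0):
    """Shift bins by an integer in [-F, F]; vacated bins are set to `fill`."""
    shift = int(rng.integers(-max_shift, max_shift + 1))
    out = np.full_like(spec, fill)
    if shift > 0:
        out[:, shift:] = spec[:, :-shift]
    elif shift < 0:
        out[:, :shift] = spec[:, -shift:]
    else:
        out[:] = spec
    return out
```

All four functions take a `numpy.random.Generator` (for example `np.random.default_rng(0)`) so that the sampled parameters are reproducible.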
3 Experiments
We use the audio segments sampled at 16 kHz from AudioSet [15] for both training and evaluation.
We split the original training set into training and validation subsets of 95% and 5%, respectively.
We evaluate the representations on the downstream task of training shallow fully connected audio
classifiers following the same setup as in [20, 26], where a 1-hidden-layer MLP with 512 units is
used and the parameters of the pretrained network are fixed. In this section we detail some of the key
factors that affect the model performance based on the development set. Then we show how the
proposed method compares to the state of the art on the test set.
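To make the evaluation setup concrete, here is a minimal sketch of a shallow classifier applied to frozen features. It is an assumption-laden illustration: the parameter names, the initialization, and the use of concatenated 2048-dimensional features from two encoders are ours, and only the 512 hidden units and the 527 AudioSet classes come from the text.

```python
import numpy as np

def init_probe(feature_dim, num_classes=527, hidden=512, seed=0):
    """Parameters of a 1-hidden-layer MLP; only these would be trained."""
    rng = np.random.default_rng(seed)
    return {
        "w1": rng.normal(0.0, 0.01, (feature_dim, hidden)),
        "b1": np.zeros(hidden),
        "w2": rng.normal(0.0, 0.01, (hidden, num_classes)),
        "b2": np.zeros(num_classes),
    }

def probe_logits(params, features):
    """Forward pass: linear, ReLU, linear. The encoders are not touched."""
    hidden = np.maximum(features @ params["w1"] + params["b1"], 0.0)
    return hidden @ params["w2"] + params["b2"]

# Placeholder frozen features for a batch of 4 clips (zeros stand in for
# real encoder outputs); the two formats are concatenated before the probe.
wave_feat = np.zeros((4, 2048))
logmel_feat = np.zeros((4, 2048))
features = np.concatenate([wave_feat, logmel_feat], axis=1)

params = init_probe(features.shape[1])
logits = probe_logits(params, features)   # one score per class per clip
```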
Figure 2: Validation mAP of the small (left) and base (right) models with different combinations of
audio formats as input.
Figure 3: Validation mAP of the raw-audio-vs-log-mel models with different combinations of raw
audio (along rows) and spectrogram (along columns) augmentations.
Unless noted, the models are trained for up to 400k steps with a batch size of 1024. The Adam
optimizer is used, starting from an initial learning rate of $10^{-4}$ and following a cosine learning
rate decay down to $10^{-6}$. We randomly crop two windows of 3 seconds from each data sample
during training. During evaluation we split the data evenly into overlapping subclips with a stride of
half the crop size, and average the logits from the subclips to obtain the overall score of the clip. The
loss has a temperature of 0.1. When the spectral representations are used, they are extracted with a
window size of 20 ms and a stride of 10 ms. The spectrogram and log-mel features have 80
dimensions. For MFCCs we follow the convention and take 13 features [31]. The default small model
uses CNN10 for spectrograms and Conv320 for raw input, and the base model employs CNN14 and
Res1dNet-31.
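The learning rate schedule and the evaluation windows described in this paragraph can be sketched as follows. The function names are hypothetical, the schedule is the standard cosine form without warm-up (the text does not mention one), and the handling of a trailing partial window, which is dropped here, is an assumption.

```python
import math

def cosine_lr(step, total_steps=400_000, lr_start=1e-4, lr_end=1e-6):
    """Cosine decay from lr_start at step 0 to lr_end at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_end + 0.5 * (lr_start - lr_end) * (1.0 + math.cos(math.pi * progress))

def subclip_starts(num_samples, crop_samples):
    """Start offsets of overlapping subclips with a stride of half the crop size."""
    stride = crop_samples // 2
    if num_samples <= crop_samples:
        return [0]
    return list(range(0, num_samples - crop_samples + 1, stride))

def clip_logits(subclip_logits):
    """Average per-class logits over subclips to score the whole clip."""
    count = len(subclip_logits)
    return [sum(column) / count for column in zip(*subclip_logits)]

# A 10-second clip at 16 kHz evaluated with 3-second crops.
starts = subclip_starts(10 * 16_000, 3 * 16_000)
print(starts)  # [0, 24000, 48000, 72000, 96000]
```

With these settings the last second of a 10-second clip is not covered by any window; an implementation could instead pad the clip or add a final window aligned to its end.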
3.1 Audio formats
We compare the performance of learning with different combinations of audio formats in Figure 2.
Both the small and base models benefit from using two kinds of input formats.
Meanwhile, maximizing agreement between raw waveforms and the frequency representations
outperforms combinations of two spectral representations by a large margin. In particular, the
combination of raw audio and log-mel spectrograms stands out: on the base model, it improves
relatively upon the raw-audio-only and log-mel-only counterparts by 15% and 41%, respectively. The
MFCC-based models have the lowest scores, possibly because they are low in feature dimensions.
We think this works well because taking another format of audio can be viewed as an aggressive
way of transforming or augmenting the data to create semantically related but vastly different views,
such that the contrastive learning framework cannot leverage trivial cues to solve the proxy task
without learning meaningful representations.
3.2 Creation of the views
Table 1 shows that the model is better trained when taking two randomly cropped clips of 3 to 5
seconds to create the views. Taking the full length (10 seconds) results in the worst performance.
This is possibly due to the multi-instance nature of AudioSet. Because our models output features
averaged over the time dimension, and some classes may last only a very short duration within the
clip, taking a long temporal scale may completely bury the short-lasting classes, resulting in a lower
validation score. In addition, Table 1 shows that the best maximum frequency shift is around half
of the frequency dimension size.
Table 1: Effects of different crop sizes, maximum frequency shifts, temperature parameters, and
projection latent sizes on the base model when trained with both raw audio and log-mel spectrograms.

Crop size (s)    1      2      3      4      5      6      8      10
Val mAP          0.310  0.336  0.340  0.344  0.341  0.328  0.305  0.262

Max freq shift   0      2      4      10     20     40     60     80
Val mAP          0.331  0.329  0.333  0.340  0.340  0.342  0.338  0.326

Temperature      0.05   0.1    0.25   0.5    0.75   1
Val mAP          0.330  0.340  0.326  0.312  0.297  0.285

Latent size      128    256    512    1024   2048
Val mAP          0.331  0.338  0.340  0.342  0.345
Table 2: Impacts of network architectures when trained with both raw waveforms and log-mel
spectrograms.

Small model   Conv160  Conv320  Conv640  Conv1280  Conv2560
CNN6          0.300    0.305    0.313    0.314     0.314
CNN10         0.303    0.313    0.320    0.320     0.322

Base model    Res1dNet-31  Res1dNet-51
CNN14         0.340        0.340
ResNet-22     0.335        0.336
ResNet-38     0.332        0.333
ResNet-54     0.335        0.340
Figure 3 shows that the frequency shift has the biggest impact on both the small and base models.
The base model benefits more from the audio mixing, possibly because the small model does not
have enough capacity to account for it. In our experiments, masking on either time or frequency does
not improve the downstream performance.
3.3 Model architectures
We run ablations on the choice of model architectures and show the results in Table 2. For
smaller models, we observe gains from increasing the model capacity. However, the same does not
hold for large models: the performance does not improve further when the model size goes beyond
Res1dNet-31 or CNN14. The same behavior has also been documented in the supervised setting [19].
We leave how to scale the models further for future work.
3.4 More ablations
We observe trends similar to the SimCLR image model [6] in the choices of the temperature parameter
(Table 1), projection MLP latent size (Table 1), and batch size (Figure 4). In particular, we also find
that it is better to train the contrastive framework with a very large batch size, possibly because it
needs a large pool of negatives for the softmax loss in Equation 1 to pick out the hard ones.
3.5 Comparison to the state of the art
Based on the findings from the ablations above, we train our final model with CNN14 and
Res1dNet-31 using a large batch size of 32768. Frequency shift, audio mixing, and time and
frequency masking are applied to the log-mel branch (CNN14), and audio mixing is used for the raw
waveform branch (Res1dNet-31). The model takes 3-second crops and is trained for up to 700k steps,
and we save the checkpoint with the best validation score. The latent size in the projection head is
1024.
[Figure 4 plot: validation mAP (y-axis, 0.0 to 0.4) against batch size (x-axis: 256, 512, 1024, 2048,
4096) for three models: log-mel, raw audio, and raw audio & log-mel.]
Figure 4: Validation mean average precision of different batch sizes on the base model.
Table 3: Test performance of shallow model classification on AudioSet with fixed representations.

Model             Train inputs              Eval inputs          Test mAP
Triplet [20]      log-mel                   log-mel              0.244
L3 [22]           log-mel + video           log-mel              0.249
CPC [21]          waveform                  waveform             0.277
C3 [26]           log-mel + video           log-mel              0.285
MMV [28]          log-mel + video + text    log-mel              0.309
Ours              log-mel                   log-mel              0.329
Ours              waveform                  waveform             0.336
Ours              waveform + log-mel        log-mel              0.368
Ours              waveform + log-mel        waveform             0.355
Ours              waveform + log-mel        waveform + log-mel   0.376
Supervised [19]   waveform + log-mel        waveform + log-mel   0.439
We report the test scores on AudioSet in Table 3. Even when only one audio format is used in the
contrastive learning framework, the results are already better than the previous
best score trained with multiple modalities. Specifically, our log-mel-only model has a test mAP of
0.329 and the waveform-only one scores 0.336, outperforming the multimodal versatile network at
0.309, which is trained with audio, video, and text [28]. This is counter-intuitive because previous
works have shown that multi-modal learning is better than single-modal learning. We think this is at
least partially because using only the audio modality allows one to train models using the contrastive
loss with a very large batch size and more training steps given the same computation budget.
When two audio formats are presented and both the log-mel and waveform networks are trained
simultaneously, each individual network performs better than when trained with
only one format. Note that the log-mel network, performing at 0.368 mAP, employs the CNN14
architecture. The same network trained under the supervised setting is reported with an mAP of
0.375 in [19] (without class balancing). If we use the concatenated features from the two networks,
the performance further increases to 0.376. The current supervised state of the art of 0.439 mAP
is achieved by the Wavegram model trained with both log-mel spectrograms and waveforms [19].
However, class balancing is crucial to this model, which requires access to class
labels. The self-supervised framework used in this work does not require any labels and makes no
assumption about the class distribution. This suggests that the method can scale well
to large amounts of unlabelled data. We leave this for future work.
We also present in Table 4 the results on generalization to a smaller downstream dataset, ESC-50 [34].
It is widely used for evaluating audio representations trained by various cross-modal frameworks.
Our multi-format model achieves a new SOTA accuracy of 90.5% without requiring any additional
modality or dataset for pre-training. Moreover, without feature concatenation, the raw audio network
of this model alone has an accuracy of 89.3%, while the log-mel network achieves 89.7%, which are
higher than the corresponding results of 84.9% and 86.3% using single-format training, respectively.
Table 4: Test accuracy of linear classification on ESC-50 with fixed audio representations.
Hyperparameters of the classifier are selected with split 1 and the average accuracy over 5 splits is
reported.

Model             Train inputs              Eval inputs          Test accuracy (%)
L3 [22]           log-mel + video           log-mel              79.3
AVTS [24]         log-mel + video           log-mel              82.3
XDC [27]          log-mel + video           log-mel              84.8
GDT [30]          log-mel + video           log-mel              88.5
MMV [28]          log-mel + video + text    log-mel              88.9
AVID [29]         log-mel + video           log-mel              89.2
Ours              log-mel                   log-mel              86.3
Ours              waveform                  waveform             84.9
Ours              waveform + log-mel        log-mel              89.7
Ours              waveform + log-mel        waveform             89.3
Ours              waveform + log-mel        waveform + log-mel   90.5
Supervised [19]   waveform + log-mel        log-mel              90.8
4 Conclusions
In this work, we study the use of multiple formats of the audio for contrastive learning. We observe
a significant advantage when training with both raw waveforms and log-mel spectrograms. Our
model improves the state of the art on the AudioSet benchmark relatively by 21.7%, bridging the
gap between unsupervised and supervised learning. Our work shows that multi-format training is
promising to fully unlock the potential of large-scale and audio-only unsupervised learning.
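For reference, the relative improvement quoted above follows from the test mAP values reported in Table 3 (0.376 for our model against the previous best unsupervised score of 0.309):

```latex
\frac{0.376 - 0.309}{0.309} = \frac{0.067}{0.309} \approx 0.217 = 21.7\%
```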
Acknowledgments
The authors would like to thank Yan Wu for fruitful discussions. We also appreciate the feedback
from the anonymous reviewers.
References
[1] R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman,
Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information
estimation and maximization. In International Conference on Learning Representations, 2018.
[2] Aäron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive
predictive coding. arXiv preprint arXiv:1807.03748, 2018.
[3] Philip Bachman, R Devon Hjelm, and William Buchwalter. Learning representations by
maximizing mutual information across views. In Advances in Neural Information Processing
Systems (NeurIPS), pages 15509–15519, 2019.
[4] Olivier Henaff. Data-efficient image recognition with contrastive predictive coding. In Interna-
tional Conference on Machine Learning, pages 4182–4192. PMLR, 2020.
[5] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for
unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pages 9729–9738, 2020.
[6] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework
for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020.
[7] Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey Hinton. Big
self-supervised models are strong semi-supervised learners. arXiv preprint arXiv:2006.10029,
2020.
[8] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale
hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern
Recognition, pages 248–255. IEEE, 2009.
[9] Steffen Schneider, Alexei Baevski, Ronan Collobert, and Michael Auli. wav2vec: Unsupervised
pre-training for speech recognition. Proc. Interspeech 2019, pages 3465–3469, 2019.
[10] Alexei Baevski, Steffen Schneider, and Michael Auli. vq-wav2vec: Self-supervised learning of
discrete speech representations. International Conference on Learning Representations (ICLR),
2019.
[11] Kazuya Kawakami, Luyu Wang, Chris Dyer, Phil Blunsom, and Aäron van den Oord. Learning
robust and multilingual speech representations. arXiv preprint arXiv:2001.11128, 2020.
[12] Morgane Rivière, Armand Joulin, Pierre-Emmanuel Mazaré, and Emmanuel Dupoux. Unsuper-
vised pretraining transfers well across languages. In ICASSP 2020-2020 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7414–7418. IEEE,
2020.
[13] Jacob Kahn, Morgane Rivière, Weiyi Zheng, Evgeny Kharitonov, Qiantong Xu, Pierre-
Emmanuel Mazaré, Julien Karadayi, Vitaliy Liptchinsky, Ronan Collobert, Christian Fuegen,
et al. Libri-light: A benchmark for ASR with limited or no supervision. In ICASSP 2020-2020
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages
7669–7673. IEEE, 2020.
[14] Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A frame-
work for self-supervised learning of speech representations. arXiv preprint arXiv:2006.11477,
2020.
[15] Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R Channing
Moore, Manoj Plakal, and Marvin Ritter. Audio set: An ontology and human-labeled dataset
for audio events. In 2017 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), pages 776–780. IEEE, 2017.
[16] Shawn Hershey, Sourish Chaudhuri, Daniel PW Ellis, Jort F Gemmeke, Aren Jansen,
R Channing Moore, Manoj Plakal, Devin Platt, Rif A Saurous, Bryan Seybold, et al. CNN
architectures for large-scale audio classification. In 2017 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), pages 131–135. IEEE, 2017.
[17] Yun Wang, Juncheng Li, and Florian Metze. A comparison of five multiple instance learning
pooling functions for sound event detection with weak labeling. In 2019 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 31–35. IEEE, 2019.
[18] Logan Ford, Hao Tang, François Grondin, and James R Glass. A deep residual network for
large-scale acoustic scene analysis. In INTERSPEECH, pages 2568–2572, 2019.
[19] Qiuqiang Kong, Yin Cao, Turab Iqbal, Yuxuan Wang, Wenwu Wang, and Mark D Plumbley.
PANNs: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM
Transactions on Audio, Speech, and Language Processing, 28:2880–2894, 2020.
[20] Aren Jansen, Manoj Plakal, Ratheet Pandya, Daniel PW Ellis, Shawn Hershey, Jiayang Liu,
R Channing Moore, and Rif A Saurous. Unsupervised learning of semantic audio representations.
In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),
pages 126–130. IEEE, 2018.
[21] Luyu Wang, Kazuya Kawakami, and Aäron van den Oord. Contrastive predictive coding of
audio with an adversary. Proc. Interspeech 2020, pages 826–830, 2020.
[22] Relja Arandjelovic and Andrew Zisserman. Look, listen and learn. In Proceedings of the IEEE
International Conference on Computer Vision (ICCV), pages 609–617, 2017.
[23] Relja Arandjelovic and Andrew Zisserman. Objects that sound. In Proceedings of the European
Conference on Computer Vision (ECCV), pages 435–451, 2018.
[24] Bruno Korbar, Du Tran, and Lorenzo Torresani. Cooperative learning of audio and video models
from self-supervised synchronization. In Advances in Neural Information Processing Systems,
pages 7763–7774, 2018.
[25] Andrew Owens and Alexei A Efros. Audio-visual scene analysis with self-supervised multisen-
sory features. In Proceedings of the European Conference on Computer Vision (ECCV), pages
631–648, 2018.
[26] Aren Jansen, Daniel PW Ellis, Shawn Hershey, R Channing Moore, Manoj Plakal, Ashok C
Popat, and Rif A Saurous. Coincidence, categorization, and consolidation: Learning to recognize
sounds with minimal supervision. In 2020 IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP), pages 121–125. IEEE, 2020.
[27] Humam Alwassel, Dhruv Mahajan, Lorenzo Torresani, Bernard Ghanem, and Du Tran. Self-
supervised learning by cross-modal audio-video clustering. arXiv preprint arXiv:1911.12667,
2019.
[28] Jean-Baptiste Alayrac, Adrià Recasens, Rosalia Schneider, Relja Arandjelović, Jason Ramapu-
ram, Jeffrey De Fauw, Lucas Smaira, Sander Dieleman, and Andrew Zisserman. Self-supervised
multimodal versatile networks. arXiv preprint arXiv:2006.16228, 2020.
[29] Pedro Morgado, Nuno Vasconcelos, and Ishan Misra. Audio-visual instance discrimination
with cross-modal agreement. arXiv preprint arXiv:2004.12943, 2020.
[30] Mandela Patrick, Yuki M. Asano, Ruth Fong, João F. Henriques, Geoffrey Zweig, and Andrea
Vedaldi. Multi-modal self-supervision from generalized data transformations. arXiv preprint
arXiv:2003.04298, 2020.
[31] Daniel Jurafsky and James H. Martin. Speech and language processing (2nd edition). Pearson
Education, 2008.
[32] Daniel S Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D Cubuk,
and Quoc V Le. SpecAugment: A simple data augmentation method for automatic speech
recognition. Proc. Interspeech 2019, pages 2613–2617, 2019.
[33] Eugene Kharitonov, Morgane Rivière, Gabriel Synnaeve, Lior Wolf, Pierre-Emmanuel Mazaré,
Matthijs Douze, and Emmanuel Dupoux. Data augmenting contrastive learning of speech
representations in the time domain. arXiv preprint arXiv:2007.00991, 2020.
[34] Karol J Piczak. ESC: Dataset for environmental sound classification. In Proceedings of the 23rd
ACM international conference on Multimedia, pages 1015–1018, 2015.