Cross-modal Embeddings for Video and Audio Retrieval

Amaia Salvador; Jordi Torres; Dídac Surís; Xavier Giró-i-Nieto

doi:10.1007/978-3-030-11018-5_62

Amanda Duarte

2019, Lecture Notes in Computer Science

https://doi.org/10.1007/978-3-030-11018-5_62

6 pages

Abstract

The increasing amount of online videos brings several opportunities for training self-supervised neural networks. The creation of large-scale video datasets such as YouTube-8M allows us to deal with this large amount of data in a manageable way. In this work, we find new ways of exploiting this dataset by taking advantage of the multi-modal information it provides. By means of a neural network, we are able to create links between audio and visual documents by projecting them into a common region of the feature space, obtaining joint audiovisual embeddings. These links are used to retrieve audio samples that fit well to a given silent video, and also to retrieve images that match a given query audio. The results in terms of Recall@K obtained over a subset of YouTube-8M videos show the potential of this unsupervised approach for cross-modal feature learning. We train embeddings for both scales and assess their quality in a retrieval problem, formulated as using the feature extracted from one modality to retrieve the most similar videos based on the features computed in the other modality.
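The retrieval formulation and the Recall@K metric mentioned in the abstract can be illustrated with a short sketch. This is not the authors' evaluation code: the function name `recall_at_k`, the use of NumPy, and the convention that row j of the query matrix matches row j of the gallery matrix are all assumptions made for illustration.

```python
import numpy as np


def recall_at_k(query_emb, gallery_emb, ks=(1, 5, 10)):
    """Fraction of queries whose true match (same row index) ranks in the top k.

    query_emb and gallery_emb are (n, d) arrays; row j of each is assumed
    to come from the same video (hypothetical convention).
    """
    # L2-normalise so that the dot product equals the cosine similarity.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sims = q @ g.T  # shape (n_queries, n_gallery)

    # Rank of the true match = number of gallery items scoring strictly higher.
    true_scores = np.diag(sims)[:, None]
    ranks = (sims > true_scores).sum(axis=1)

    return {k: float((ranks < k).mean()) for k in ks}


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = rng.normal(size=(256, 250))
    # Toy "video" embeddings: the audio embeddings plus noise.
    video = audio + 0.5 * rng.normal(size=audio.shape)
    print(recall_at_k(audio, video))
```

With unrelated embeddings the expected score is roughly k divided by the number of gallery items, which is the random-guess reference the paper uses when reading its recall tables.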

Surís, D. [et al.]. Cross-modal embeddings for video and audio retrieval. In: Women in Computer Vision Workshop. "Computer Vision, ECCV 2018 Workshops: Munich, Germany, September 8-14, 2018: proceedings, part IV". Berlin: Springer, 2019, p. 711-716. The final authenticated version is available online at https://doi.org/10.1007/978-3-030-11018-5_62

CROSS-MODAL EMBEDDINGS FOR VIDEO AND AUDIO RETRIEVAL

Dídac Surís (1), Amanda Duarte (2), Amaia Salvador (1), Jordi Torres (2) and Xavier Giró-i-Nieto (1)

(1) Universitat Politècnica de Catalunya (UPC)
(2) Barcelona Supercomputing Center (BSC-CNS)

Index Terms: Sonorization, embedding, retrieval, cross-modal, YouTube-8M

1. INTRODUCTION

Videos have become the next frontier in artificial intelligence. Their rich semantics make them a challenging data type at the perceptual, reasoning and even computational levels. Mimicking the learning process and knowledge extraction that humans develop from visual and audio perception remains an open research question, and videos contain all this information in a format manageable for science and research.

Videos are used in this work for two main reasons. Firstly, they naturally integrate both visual and audio data, providing a weak labeling of one modality with respect to the other. Secondly, the high volume of both visual and audio data allows training machine learning algorithms whose models are governed by a large number of parameters. The huge video archives available online and the increasing number of video cameras that constantly monitor our world offer more data than there is computation power available to process them.

The popularization of deep neural networks among the computer vision and audio communities has defined a common framework boosting multimodal research. Tasks like video sonorization, speaker impersonation or self-supervised feature learning have exploited the opportunities offered by artificial neurons to project images, text and audio in a feature space where bridges across modalities can be built.

This work exploits the relation between the visual and audio contents in a video clip to learn a joint embedding space with deep neural networks. Two multilayer perceptrons (MLPs), one for visual features and a second one for audio features, are trained to map their inputs into the same cross-modal representation. We adopt a self-supervised approach, as we exploit the unsupervised correspondence between the audio and visual tracks in any video clip.

We propose a joint audiovisual space to address a retrieval task formulating a query from any of the two modalities. As depicted in Figure 1, either a video or an audio clip can be used as a query to search for its matching pair in a large collection of videos. For example, an animated GIF could be sonorized by finding an adequate audio track, or an audio recording illustrated with a related video.

Fig. 1. A learned cross-modal embedding allows retrieving video from audio, and vice versa.

In this paper, we present a simple yet very effective model for retrieving documents with a fast and light search. We do not address an exact alignment between the two modalities, which would require a much higher computation effort.

The paper is structured as follows. Section 2 introduces the related work on learned audiovisual embeddings with neural networks. Section 3 presents the architecture of our model and Section 4 how it was trained. Experiments are reported in Section 5 and final conclusions drawn in Section 6. The source code and trained model used in this paper are publicly available from https://github.com/surisdi/youtube-8m.

2. RELATED WORK

In the past years, the relationship between the audio and the visual content in videos has been researched in several contexts. Overall, conventional approaches can be divided into four categories according to the task: generation, classification, matching and retrieval.

As online music streaming and video sharing websites have become increasingly popular, some research has been done on the relationship between music and album covers [1, 2, 3, 4] and also on music and videos (instead of just images) as the visual modality [5, 6, 7, 8] to explore the multimodal information present in both types of data.

A recent study [9] also explored the cross-modal relations between the two modalities, but using images of people talking and speech. It is done through Canonical Correlation Analysis (CCA) and cross-modal factor analysis. Also applying CCA, [10] uses visual and sound features and common subspace features for aiding clustering in image-audio datasets. In the work presented in [11], the key idea was to use greedy layer-wise training with Restricted Boltzmann Machines (RBMs) between vision and sound.

The present work is focused on using the information present in each modality to create a joint embedding space to perform cross-modal retrieval. This idea has been exploited especially using text and image joint embeddings [12, 13, 14], but also between other kinds of data, for example creating a visual-semantic embedding [15] or using synchronous data to learn discriminative representations shared across vision, sound and text [16].

However, joint representations between the images (frames) of a video and its audio have yet to be fully exploited; to the best of the authors' knowledge, [17] is the work that has explored this option the most. In that paper, the authors seek a joint embedding space but only use music videos, obtaining the closest and farthest video given a query video, based only on either image or audio.

The main idea of the current work is borrowed from [14], which is the baseline to understand our approach. There, the authors create a joint embedding space for recipes and their images. They can then use it to retrieve recipes from any food image, looking for the recipe that has the closest embedding. Apart from the retrieval results, they also perform other experiments, such as studying the localized unit activations, or doing arithmetic with the images.

3. ARCHITECTURE

In this section we present the architecture of our joint embedding model, which is depicted in Figure 2.

Fig. 2. Schematic of the used architecture.

As inputs, we have the vector of features representing the images of the video and the vector of features representing the audio. These features are already precomputed and provided in the YouTube-8M dataset [18]. In particular, we use the video-level features, which represent the whole video clip with two vectors: one for the audio and another one for the video. These feature representations are the result of an average pooling of the local audio features computed over windows of one second, and local visual features computed over frames sampled at 1 Hz.

The main objective of the system is to transform the two different features (image and audio, separately) into other features lying in a joint space. This means that for the same video, ideally the image features and the audio features will be transformed to the same joint features, in the same space. We will call these new features embeddings, and will represent them with Φ_i, for the image embeddings, and Φ_a, for the audio embeddings.

The idea of the joint space is to represent the concept of the video, not just the image or the audio, but a generalization of it. As a consequence, videos with similar concepts will have closer embeddings and videos with different concepts will have embeddings further apart in the joint space. For example, the representation of a tennis match video will be close to the one of a football match, but not to the one of a maths lesson.

Thus, we use a set of fully connected layers of different sizes, stacked one after the other, going from the original features to the embeddings. The audio and the image networks are completely separate. These fully connected layers perform a non-linear transformation on the input features, mapping them to the embeddings; the parameters of this non-linear mapping are learned in the optimization process.

After that, a classification from the two embeddings is done, also using a fully connected layer from them to the different classes, using a sigmoid as activation function. We give more insight on this step in Section 4.

Neither the number of hidden layers nor the number of neurons per layer is fixed, since we experimented with different configurations. Each hidden layer uses ReLU as activation function, and all the weights in each layer are regularized using the L2 norm.

4. TRAINING

In this section we present the losses used, as well as their meaning and intuition.

4.1. Similarity Loss

The objective of this work is to get the two embeddings of the same video to be as close as possible (ideally, the same), while keeping embeddings from different videos as far apart as possible.

Formally, we are given a video v_k, represented by the audio and visual features v_k = {i_k, a_k} (i_k represents the image features and a_k the audio features of v_k). The objective is to maximize the similarity between Φ_{i_k}, the embedding obtained by transformations on i_k, and Φ_{a_k}, the embedding obtained by transformations on a_k.

At the same time, however, we have to prevent embeddings from different videos from being "close" in the joint space. In other words, we want them to have low similarity. However, the objective is not to force them to be opposite to each other. Instead of forcing them to have similarity equal to zero, we allow a margin of similarity small enough to force the embeddings to be clearly not in the same place in the joint space. We call this margin α.

During the training, both positive and negative pairs are used. The positive pairs are the ones for which i_k and a_k correspond to the same video v_k, and the negative pairs are the ones for which i_{k1} and a_{k2} do not correspond to the same video, that is, k1 ≠ k2. The proportion of negative samples is p_negative.

For the negative pairs, we selected random pairs that did not have any common label, in order to help the network learn how to distinguish different videos in the embedding space. The notion of "similarity" or "closeness" is mathematically translated into a cosine similarity between the embeddings, the cosine similarity being defined as:

$$\text{similarity} = \cos(x, z) = \frac{\sum_{k=1}^{N} x_k z_k}{\sqrt{\sum_{k=1}^{N} x_k^2}\,\sqrt{\sum_{k=1}^{N} z_k^2}} \tag{1}$$

for any pair of real-valued vectors x and z.

From this reasoning we get to the first and most important loss:

$$L_{cos}((\Phi_a, \Phi_i), y) = \begin{cases} 1 - \cos(\Phi_a, \Phi_i), & \text{if } y = 1 \\ \max(0, \cos(\Phi_a, \Phi_i) - \alpha), & \text{if } y = -1 \end{cases} \tag{2}$$

where y = 1 denotes positive sampling, and y = −1 denotes negative sampling.

4.2. Classification Regularization

Inspired by the work presented in [14], we provide additional information to our system by incorporating the video labels (classes) provided by the YouTube-8M dataset. This information is added as a regularization term that seeks to solve the high-level classification problem, both from the audio and from the video embeddings, sharing the weights between the two branches. The key idea here is to have the classification weights from the embeddings to the labels shared by the two modalities.

This loss is optimized together with the previously explained similarity loss, serving as a regularization term. Basically, the system learns to classify the audio and the images of a video (separately) into different classes or labels provided by the dataset. We limit its effect by using a regularization parameter λ.

To incorporate the previously explained regularization into the joint embedding, we use a single fully connected layer, as shown in Figure 2. Formally, we can obtain the label probabilities as p_i = softmax(W Φ_i) and p_a = softmax(W Φ_a), where W represents the learned weights, which are shared between the two branches. The softmax activation is used in order to obtain probabilities at the output. The objective is to make p_i as similar as possible to c_i, and p_a as similar as possible to c_a, where c_i and c_a are the category labels for the video represented by the image features and the audio features, respectively. For positive pairs, c_i and c_a are the same.

The loss function used for the classification is the well-known cross entropy loss:

$$L(x, z) = -\sum_{k} x_k \log(z_k) \tag{3}$$

Thus, the classification loss is:

$$L_{class}(p_i, p_a, c_i, c_a) = -\sum_{k} \left( p_{i,k} \log(c_{i,k}) + p_{a,k} \log(c_{a,k}) \right) \tag{4}$$

Finally, the loss function to be optimized is:

$$L = L_{cos} + \lambda L_{class} \tag{5}$$

4.3. Parameters and Implementation Details

For our experiments we used the following parameters:

• Batch size of 1024.
• We saw that starting with λ different from zero led to a bad embedding similarity because the classification accuracy was preferred. Thus, we began the training with λ = 0 and set it to 0.02 at step number 10,000.
• Margin α = 0.2.
• Proportion of negative samples p_negative = 0.6.
• 4 hidden layers in each network branch, the number of neurons per layer being, from features to embedding, 2000, 2000, 700, 700 in the image branch, and 450, 450, 200, 200 in the audio branch.
• Dimensionality of the feature vector = 250.
• We trained for a single epoch.

The simulation was programmed using TensorFlow [19], having as a baseline the code provided by the YouTube-8M challenge authors (https://www.kaggle.com/c/youtube8m).

5. RESULTS

5.1. Dataset

The experiments presented in this section were developed over a subset of 6,000 video clips from the YouTube-8M dataset [18]. This dataset does not contain the raw video files, but their representations as precomputed features, both from audio and video. Audio features were computed using the method explained in [20] over audio windows of 1 second, while visual features were computed over frames sampled at 1 Hz with the Inception model provided in TensorFlow [19].

The dataset provides video-level features, which represent the whole video using a single vector (one for audio and another for visual information), and thus do not maintain temporal information. It also provides frame-level features, which consist of a single vector representing each second of audio, and a single vector representing each frame of the video, sampled at 1 frame per second.

The main goal of this dataset is to provide enough data to reach state-of-the-art results in video classification. Nevertheless, such a huge dataset also permits approaching other tasks related to videos and cross-modal tasks, such as the one we approach in this paper. For this work, and as a baseline, we only use the video-level features.

5.2. Quantitative Performance Evaluation

We divide our results in two different categories: quantitative (numeric) results and qualitative results.

To obtain the quantitative results we use the Recall@K metric. We define Recall@K as the recall rate at top K for all the retrieval experiments, that is, the percentage of all the queries where the corresponding video is retrieved in the top K; hence higher is better.

The experiments are performed with different dimensions of the feature vector. Table 1 shows the results of recall from audio to video. In other words, from the audio embedding of a video, how many times we retrieve the embedding corresponding to the images of that same video. Table 2 shows the recall from video to audio.

Table 1. Evaluation of Recall from audio to video

Number of elements    Recall@1    Recall@5    Recall@10
256                   21.5%       52.0%       63.1%
512                   15.2%       39.5%       52.0%
1024                  9.8%        30.4%       39.6%

Table 2. Evaluation of Recall from video to audio

Number of elements    Recall@1    Recall@5    Recall@10
256                   22.3%       51.7%       64.4%
512                   14.7%       38.0%       51.5%
1024                  10.2%       29.1%       40.3%

To have a reference, the random guess result would be K/Number of elements. The obtained results show a very clear correspondence between the embeddings coming from the audio features and the ones coming from the video features. It is also interesting to notice that the results from audio to video and from video to audio are very similar, because the system has been trained bidirectionally.

5.3. Qualitative Performance Evaluation

In addition to the objective results, we performed some insightful qualitative experiments. They consisted of generating the embeddings of both the audio and the video for a list of 6,000 different videos. Then, we randomly chose a video, and from its image embedding, we retrieved the video with the closest audio embedding, and the other way around (from one video's audio we retrieved the video with the closest image embedding). If the closest embedding corresponded to the same video, we took the second one in the ordered list.

Figure 3 shows some experiments. On the left, we can see the results given a video query and getting the closest audio; on the right, the input query is an audio. Examples depicting the real videos and audio are available online (https://goo.gl/NAcJah). The figure shows both the results when going from image to audio, and when going from audio to image. Four different random examples are shown in each case. For each result and each query, we also show their YouTube-8M labels, for completeness.

Fig. 3. Qualitative results. On the left we show the results obtained when we gave a video as a query. On the right, the results are based on an audio as a query.

The results show that when starting from the image features of a video, the retrieved audio represents a very accurate fit for those images. Subjectively, there are non-negligible cases where the retrieved audio actually fits the video better than the original one, for example when the original video has some artificially introduced music, or in cases where there is some background commentator explaining the video in a foreign (unknown) language. This analysis can also be done similarly the other way around, that is, with the audio colorization approach, providing images for a given audio.

6. CONCLUSIONS AND FUTURE WORK

We presented an effective method to retrieve audio samples that fit correctly to a given (muted) video. The qualitative results show that already existing online videos, due to their variety, represent a very good source of audio for new videos, even when retrieving only from a small subset of this large amount of videos. Due to the existing difficulty of creating new audio from scratch, we believe that a retrieval approach is the path to follow in order to give audio to videos.

The range of possibilities to extend the presented work is excitingly broad. The first idea would be to make use of the variety and information of the YouTube-8M dataset. The temporal information provided by the individual image and audio features is not used in the current work. The most promising future work implies using this temporal information to match audio and images, making use of the implicit synchronization that the audio and the images of a video have, without needing any supervised control. Thus, the next step in our research is introducing a recurrent neural network, which will allow us to create more accurate representations of the video, and also retrieve different audio samples for each image, creating a fully synchronized system.

Also, it would be very interesting to study the behavior of the system depending on the class of the input. Observing the dataset, it is clear that not all the classes have the same degree of correspondence between audio and image, as for example some videos have music added afterwards, which is not related at all to the images.

In short, we believe the YouTube-8M dataset allows for promising future research in the field of video sonorization and audio retrieval, because it has a huge number of samples and captures multi-modal information in a highly compact way.

7. ACKNOWLEDGEMENTS

This work was partially supported by the Spanish Ministry of Economy and Competitivity and the European Regional Development Fund (ERDF) under contract TEC2016-75976-R. Amanda Duarte was funded by the mobility grant of the Severo Ochoa Program at Barcelona Supercomputing Center (BSC-CNS).
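The training objective of Section 4 (Eqs. 1, 2 and 5) can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' TensorFlow implementation: all function names are hypothetical, embeddings are stored as rows so the shared classifier is applied as `phi @ W`, the cross entropy is computed in the standard orientation (labels as targets, predicted probabilities inside the logarithm), and the per-pair losses are averaged over the batch. The defaults `alpha=0.2` and `lam=0.02` are the values reported in Section 4.3.

```python
import numpy as np


def cosine_similarity(x, z):
    """Row-wise cosine similarity between two (batch, dim) arrays (Eq. 1)."""
    num = np.sum(x * z, axis=1)
    den = np.linalg.norm(x, axis=1) * np.linalg.norm(z, axis=1)
    return num / den


def similarity_loss(phi_a, phi_i, y, alpha=0.2):
    """Eq. 2: 1 - cos for positive pairs (y = 1), max(0, cos - alpha) for negatives (y = -1)."""
    cos = cosine_similarity(phi_a, phi_i)
    positive = 1.0 - cos
    negative = np.maximum(0.0, cos - alpha)
    return np.where(y == 1, positive, negative)


def softmax(logits):
    """Numerically stable row-wise softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def classification_loss(phi_i, phi_a, c_i, c_a, W, eps=1e-12):
    """Cross entropy of both branches through the shared weights W.

    Assumption: labels c are the targets and the softmax outputs p sit
    inside the logarithm (the usual cross-entropy convention).
    """
    p_i = softmax(phi_i @ W)
    p_a = softmax(phi_a @ W)
    ce_i = -np.sum(c_i * np.log(p_i + eps), axis=1)
    ce_a = -np.sum(c_a * np.log(p_a + eps), axis=1)
    return ce_i + ce_a


def total_loss(phi_a, phi_i, y, c_i, c_a, W, alpha=0.2, lam=0.02):
    """Eq. 5: L = L_cos + lambda * L_class, averaged over the batch (assumption)."""
    l_cos = similarity_loss(phi_a, phi_i, y, alpha)
    l_class = classification_loss(phi_i, phi_a, c_i, c_a, W)
    return float(np.mean(l_cos + lam * l_class))
```

Setting `lam=0` reproduces the warm-up phase described in Section 4.3, where only the similarity loss is optimized before the classification term is switched on.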

REFERENCES
[1] Eric Brochu, Nando De Freitas, and Kejie Bao, "The sound of an album cover: Probabilistic multimedia and information retrieval," in Artificial Intelligence and Statistics (AISTATS), 2003.
[2] Rudolf Mayer, "Analysing the similarity of album art with self-organising maps," in International Workshop on Self-Organizing Maps. Springer, 2011, pp. 357-366.
[3] Janis Libeks and Douglas Turnbull, "You can judge an artist by an album cover: Using images for music annotation," IEEE MultiMedia, vol. 18, no. 4, pp. 30-37, 2011.
[4] Jiansong Chao, Haofen Wang, Wenlei Zhou, Weinan Zhang, and Yong Yu, "Tunesensor: A semantic-driven music recommendation service for digital photo albums," in 10th International Semantic Web Conference, 2011.
[5] Alexander Schindler and Andreas Rauber, "An audio-visual approach to music genre classification through affective color features," in European Conference on Information Retrieval. Springer, 2015, pp. 61-67.
[6] Xixuan Wu, Yu Qiao, Xiaogang Wang, and Xiaoou Tang, "Bridging music and image via cross-modal ranking analysis," IEEE Transactions on Multimedia, vol. 18, no. 7, pp. 1305-1318, 2016.
[7] Esra Acar, Frank Hopfgartner, and Sahin Albayrak, "Understanding affective content of music videos through learned representations," in International Conference on Multimedia Modeling. Springer, 2014, pp. 303-314.
[8] Olivier Gillet, Slim Essid, and Gaël Richard, "On the correlation of automatic audio and visual segmentations of music videos," IEEE Transactions on Circuits and Systems for Video Technology, vol. 17, no. 3, pp. 347-355, 2007.
[9] Dongge Li, Nevenka Dimitrova, Mingkun Li, and Ishwar K Sethi, "Multimedia content processing through cross-modal association," in Proceedings of the eleventh ACM international conference on Multimedia. ACM, 2003, pp. 604-611.
[10] Hong Zhang, Yueting Zhuang, and Fei Wu, "Cross-modal correlation learning for clustering on image-audio dataset," in 15th ACM international conference on Multimedia. ACM, 2007, pp. 273-276.
[11] Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y Ng, "Multimodal deep learning," in Proceedings of the 28th international conference on machine learning, 2011, pp. 689-696.
[12] Liwei Wang, Yin Li, and Svetlana Lazebnik, "Learning deep structure-preserving image-text embeddings," CoRR, vol. abs/1511.06078, 2015.
[13] Ryan Kiros, Ruslan Salakhutdinov, and Richard S. Zemel, "Unifying visual-semantic embeddings with multimodal neural language models," CoRR, vol. abs/1411.2539, 2014.
[14] Amaia Salvador, Nicholas Hynes, Yusuf Aytar, Javier Marin, Ferda Ofli, Ingmar Weber, and Antonio Torralba, "Learning cross-modal embeddings for cooking recipes and food images," in CVPR, 2017.
[15] Andrea Frome, Greg Corrado, Jonathon Shlens, Samy Bengio, Jeffrey Dean, Marc'Aurelio Ranzato, and Tomas Mikolov, "Devise: A deep visual-semantic embedding model," in Neural Information Processing Systems, 2013.
[16] Yusuf Aytar, Carl Vondrick, and Antonio Torralba, "See, hear, and read: Deep aligned representations," arXiv preprint arXiv:1706.00932, 2017.
[17] Sungeun Hong, Woobin Im, and Hyun S. Yang, "Deep learning for content-based, cross-modal retrieval of videos and music," CoRR, vol. abs/1704.06761, 2017.
[18] Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan, "Youtube-8m: A large-scale video classification benchmark," CoRR, vol. abs/1609.08675, 2016.
[19] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al., "Tensorflow: Large-scale machine learning on heterogeneous distributed systems," arXiv preprint arXiv:1603.04467, 2016.
[20] S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, M. Slaney, R. J. Weiss, and K. Wilson, "Cnn architectures for large-scale audio classification," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2017, pp. 131-135.
