Cross-modal Embeddings for Video and Audio Retrieval
2019, Lecture Notes in Computer Science
https://doi.org/10.1007/978-3-030-11018-5_62…
6 pages
Abstract
The growing volume of online video offers several opportunities for training self-supervised neural networks. Large-scale video datasets such as YouTube-8M make this amount of data manageable. In this work, we find new ways of exploiting this dataset by taking advantage of the multi-modal information it provides. Using a neural network, we create links between audio and visual documents by projecting them into a common region of the feature space, obtaining joint audiovisual embeddings. These links are used to retrieve audio samples that fit a given silent video well, and also to retrieve images that match a given query audio. The results in terms of Recall@K, obtained over a subset of YouTube-8M videos, show the potential of this unsupervised approach for cross-modal feature learning. We train embeddings for both scales and assess their quality in a retrieval problem, formulated as using the feature extracted from one modality to retrieve the most similar videos based on the features computed in the other modality.
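As an illustration of the Recall@K evaluation mentioned in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes that embeddings from both modalities already lie in the shared space, that row i of the query matrix and row i of the gallery matrix come from the same video, and that cosine similarity is the ranking score (the abstract does not state the similarity measure). The names `l2_normalize` and `recall_at_k` are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def recall_at_k(query_emb, gallery_emb, ks=(1, 5, 10)):
    """Fraction of queries whose paired item (same row index) ranks in the top K.

    Assumption: cosine similarity is used for ranking; the source does not say.
    """
    sims = l2_normalize(query_emb) @ l2_normalize(gallery_emb).T
    # Similarity of each query to its own paired item (the diagonal).
    true_sims = np.diag(sims)[:, None]
    # Zero-based rank of the true match: how many gallery items score strictly higher.
    ranks = (sims > true_sims).sum(axis=1)
    return {k: float((ranks < k).mean()) for k in ks}


# Toy data standing in for audio and visual embeddings of the same 100 videos.
rng = np.random.default_rng(0)
audio = rng.normal(size=(100, 16))
visual = audio + 0.1 * rng.normal(size=(100, 16))

print("audio -> visual:", recall_at_k(audio, visual))
print("visual -> audio:", recall_at_k(visual, audio))
```

Calling the function in both directions mirrors the two retrieval tasks described above: audio for a silent video, and images for a query audio.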
Amanda Duarte