Multiple kernel visual-auditory representation learning for retrieval
2016, Multimedia Tools and Applications
https://doi.org/10.1007/S11042-016-3294-5…
16 pages
Sign up for access to the world's latest research
Abstract
Cross-media data representation, which focuses on semantics understanding of multimedia data in different modalities, is a rising hot topic in web media data analysis. The most challenging issues for cross-media data representation include: how to find underlying content-level data correlations and how to use such correlations in the representation model. Most traditional web media data analysis works are based on single modality data sources, such as Flickr images or YouTube videos, leaving cross-media data representation and semantics understanding wide open. In this paper, we propose a multiple kernel visual-auditory representation learning approach, which learns cross-media correlations from visual and auditory feature spaces with multiple kernel strategies. Besides, we give cross-media distance measure for image-audio retrieval in the mutual subspace of co-occurrence. Experiment results on the collected image-audio database are encouraging, and show that the performance of our approach is effective from multiple perspectives. Keywords Multiple kernel learning. Visual-auditory data representation. Cross-media retrieval 1 Introduction Multimedia representation learning has drawn tremendous research attention in the past decades. In areas of Content-based Image Retrieval (CBIR) [9, 19, 31], multimedia data Multimed Tools Appl
Related papers
Proceedings of the 35th Annual ACM Symposium on Applied Computing
We present an approach to unsupervised audio representation learning. Based on a triplet neural network architecture, we harnesses semantically related cross-modal information to estimate audio track-relatedness. By applying Latent Semantic Indexing (LSI) we embed corresponding textual information into a latent vector space from which we derive track relatedness for online triplet selection. This LSI topic modelling facilitates fine-grained selection of similar and dissimilar audio-track pairs to learn the audio representation using a Convolution Recurrent Neural Network (CRNN). By this we directly project the semantic context of the unstructured text modality onto the learned representation space of the audio modality without deriving structured ground-truth annotations from it. We evaluate our approach on the Europeana Sounds collection and show how to improve search in digital audio libraries by harnessing the multilingual meta-data provided by numerous European digital libraries. We show that our approach is invariant to the variety of annotation styles as well as to the different languages of this collection. The learned representations perform comparable to the baseline of handcrafted features, respectively exceeding this baseline in similarity retrieval precision at higher cutoffs with only 15% of the baseline's feature vector length. CCS CONCEPTS • Information systems → Content analysis and feature selection; Multilingual and cross-lingual retrieval; Speech / audio search;
2020 IEEE Winter Conference on Applications of Computer Vision (WACV)
We present an audiovisual multimodal approach for the task of zero-shot learning (ZSL) for classification and retrieval of videos. ZSL has been studied extensively in the recent past but has primarily been limited to visual modality and to images. We demonstrate that both audio and visual modalities are important for ZSL for videos. Since a dataset to study the task is currently not available, we also construct an appropriate multimodal dataset with 33 classes containing 156, 416 videos, from an existing large scale audio event dataset. We empirically show that the performance improves by adding audio modality for both tasks of zero-shot classification and retrieval, when using multimodal extensions of embedding learning methods. We also propose a novel method to predict the 'dominant' modality using a jointly learned modality attention network. We learn the attention in a semi-supervised setting and thus do not require any additional explicit labelling for the modalities. We provide qualitative validation of the modality specific attention, which also successfully generalizes to unseen test classes.
IEEE Transactions on Multimedia, 2022
Due to the rapid advancements of sensory and computing technology, multi-modal data sources that represent the same pattern or phenomenon have attracted growing attention. As a result, finding means to explore useful information from these multi-modal data sources has quickly become a necessity. In this paper, a discriminative vectorial framework is proposed for multi-modal feature representation in knowledge discovery by employing multi-modal hashing (MH) and discriminative correlation maximization (DCM) analysis. Specifically, the proposed framework is capable of minimizing the semantic similarity among different modalities by MH and exacting intrinsic discriminative representations across multiple data sources by DCM analysis jointly, enabling a novel vectorial framework of multimodal feature representation. Moreover, the proposed feature representation strategy is analyzed and further optimized based on canonical and non-canonical cases, respectively. Consequently, the generated feature representation leads to effective utilization of the input data sources of high quality, producing improved, sometimes quite impressive, results in various applications. The effectiveness and generality of the proposed framework are demonstrated by utilizing classical features and deep neural network (DNN) based features with applications to image and multimedia analysis and recognition tasks, including data visualization, face recognition, object recognition; cross-modal (textimage) recognition and audio emotion recognition. Experimental results show that the proposed solutions are superior to state-ofthe-art statistical machine learning (SML) and DNN algorithms.
Currently, image retrieval by content is a research problem of great interest in academia and the industry, due to the large collections of images available in different contexts. One of the main challenges to develop effective image retrieval systems is the automatic identification of semantic image contents. This research proposal aims to design a model for image retrieval able to take advantage of different data sources, i.e. using multimodal information, to improve the response of an image retrieval system. In particular two data modalities associated to contents and context of images are considered in this proposal: visual features and unstructured text annotations. The proposed framework is based on kernel methods that provide two main important advantages over the traditional multimodal approaches: first, the structure of each modality is preserved in a high dimensional feature space, and second, they provide natural ways to fuse feature spaces in a unique information space. This document presents the research agenda to build a Multimodal Information Space for searching images by content.
IEEE transactions on pattern analysis and machine intelligence, 2015
Cross-modal retrieval has recently drawn much attention due to the widespread existence of multimodal data. It takes one type of data as the query to retrieve relevant data objects of another type, and generally involves two basic problems: the measure of relevance and coupled feature selection. Most previous methods just focus on solving the first problem. In this paper, we aim to deal with both problems in a novel joint learning framework. To address the first problem, we learn projection matrices to map multimodal data into a common subspace, in which the similarity between different modalities of data can be measured. In the learning procedure, the ℓ21-norm penalties are imposed on the projection matrices separately to solve the second problem, which selects relevant and discriminative features from different feature spaces simultaneously. A multimodal graph regularization term is further imposed on the projected data, which preserves the inter-modality and intra-modality simila...
In Fourth International Symposium on Independent Component Analysis and Blind Source Separation 2003, 2003
We use kernel Canonical Correlation Analysis to learn a semantic representation of Web images and their associated text. This representation is used in two applications. In first application we consider classification of images into one of three categories. We use SVM in the semantic space and compare against the SVM on raw data and against previously published results using ICA. In the second application we retrieve images based only on their content from a text query. The semantic space provides a common representation and enables a comparison between the text and image. We compare against a standard cross-representation retrieval technique known as the Generalised Vector Space Model.
2020
This HDR manuscript summarizes our work concerning the applications of machine learning techniques to solve various problems in audio and multimodal data. First, we will present nonnegative matrix factorization (NMF) modeling of audio spectrograms to address audio source separation problem, both in single-channel and multichannel settings. Second, we will focus on the multiple instance learning (MIL)-based audio-visual representation learning approach, which allows to tackle several tasks such as event/object classification, audio event detection, and visual object localization. Third, we will present contributions in the multimodal multimedia interestingness and memorability, including novel dataset constructions, analysis, and computational models. This summary is based on major contributions published in three journal papers in the IEEE/ACM Transactions on Audio, Speech, and Language Processing, and a paper presented at the International Conference on Computer Vision (ICCV). Fina...
2011
This work presents the rationale, tasks and procedures of MusiCLEF, a novel benchmarking activity that has been developed along with the Cross-Language Evaluation Forum (CLEF). The main goal of MusiCLEF is to promote the development of new methodologies for music access and retrieval on real public music collections, which can combine content-based information, automatically extracted from music files, with contextual information, provided by users via tags, comments, or reviews. Moreover, MusiCLEF aims at maintaining a tight connection with real application scenarios, focusing on issues on music access and retrieval that are faced by professional users. To this end, this year’s evaluation campaign focused on two main tasks: automatic categorization of music to be used as soundtrack of TV shows and automatic identification of the digitized material of a music digital library.
Proceedings of the 2019 on International Conference on Multimedia Retrieval - ICMR '19, 2019
Cross-modal retrieval methods have been significantly improved in last years with the use of deep neural networks and large-scale annotated datasets such as ImageNet and Places. However, collecting and annotating such datasets requires a tremendous amount of human effort and, besides, their annotations are usually limited to discrete sets of popular visual classes that may not be representative of the richer semantics found on large-scale cross-modal retrieval datasets. In this paper, we present a self-supervised cross-modal retrieval framework that leverages as training data the correlations between images and text on the entire set of Wikipedia articles. Our method consists in training a CNN to predict: (1) the semantic context of the article in which an image is more probable to appear as an illustration (global context), and (2) the semantic context of its caption (local context). Our experiments demonstrate that the proposed method is not only capable of learning discriminative visual representations for solving vision tasks like image classification and object detection, but that the learned representations are better for cross-modal retrieval when compared to supervised pre-training of the network on the ImageNet dataset.
IAEME PUBLICATION, 2020
Cross media retrieval has gained much attention in the digital era due to growth of broadcasting and advancements in the technologies. It plays a main part in large data sets and is made up of searching and locating data from several kinds of media. In this paper we proposed a novel deep belief system frame to fix the challenges faced with multi-modal deep learning techniques in resolving cross-media retrieval, networking representation, orientation, and translation. All these issues are evaluated on deep learning-based. We then supply some renowned cross-media data sets utilized for retrieval, taking under account the need for the data sets from the context of deep learning-based cross-media retrieval approaches. In addition, we provide a thorough breakdown of the higher level challenges and their particular accompanying selections for encouraging deep learning from cross-media retrieval. The simple intention of the task would be to expand Deep Neural Networks such as bridging the "networking gap", and furnish researchers and developers using a better comprehension of the underlying difficulties and also the probable options of deep learning-based cross-media retrieval.
References (42)
- Chang X, Yang Y, Hauptmann AG, Xing E, Yu Y (2015) Semantic concept discovery for large-scale zero- shot event detection. International Joint Conference on Artificial Intelligence, IJCAI
- Chang X, Yang Y, Xing E, Yu Y (2015) Complex event detection using semantic saliency and nearly- isotonic SVM. International Conference on Machine Learning (ICML)
- Chang X, Yu Y, Yang Y, Hauptmann A (2015) Searching persuasively: joint event detection and evidence justification with limited supervision. ACM MM Fig. 6 Querying audio by new image
- Gao DD, Huang RB (2000) Some results on canonical correlation and their application to a linear model. Linear Algebra Appl 321:47-59
- Gonen M, Alpaydın E (2011) Multiple kernel learning algorithms. J Mach Learn Res 12:2211-2268
- Jain P, Kulis B, Davis JV, Dhillon IS (2012) Metric and kernel learning using a linear transformation. J Mach Learn Res 13:519-547
- Jain A, Vishwanathan SVN, Varma M (2012) Spg-gmkl: generalized multiple kernel learning with a million kernels. In: Proceedings of the ACM SIGKDD conference on knowledge discovery and data mining
- Lanckriet GRG, Cristianini N, Bartlett P, Ghaoui LE, Jordan MI (2004) Learning the kernel matrix with semi-definite programming. J Mach Learn Res 5:27-72
- Lew MS, Sebe N, Djeraba C, Jain R (2006) Content-based multimedia information retrieval: state of the art and challenges. ACM Trans Multimed Comput Commun Appl 2(1):1-19
- Liu Y, Wu F, Zhuang Y, Xiao J (2008) Active post-refined multimodality video semantic concept detection with tensor representation. ACM International Conference on Multimedia. pp.91-100
- Liu G, Yan Y, Gao C, Tong W, Hauptmann AG, Sebe N (2014) The mystery of faces: investigating face contribution for multimedia event detection. ICMR
- Liu H, Yu L (2005) Toward integrating feature selection algorithms for classication and clustering. IEEE Trans Knowl Data Eng 17(4):491-502
- Ma Q, Akiyo N, Katsumi T (2006) Complementary information retrieval for cross-media news content. Inf Syst 31(7):659-678
- Melzer T, Reiter M, Bischof H (2003) Appearance models based on kernel canonical correlation analysis. Pattern Recogn 36:1961-1971
- Shen H, Yan Y, Xu S, Ballas N, Chen W (2015) Evaluation of semi-supervised learning method on action recognition. Multimedia Tools and Applications 74(2):523-542
- Sonnenburg S, Rätsch G, Schafer C, Scholkopf B (2006) Largescale multiple kernel learning. J Mach Learn Res 7:1531-1565
- Sun T, Chen S (2007) Locality preserving CCA with applications to data visualization and pose estimation. Image Vis Comput 25:531-543
- Thomas M, Michael R, Horst B (2003) Appearance models based on kernel canonical correlation analysis. Pattern Recogn 27(2):1-8
- Tolias G, Bursuc A, Furon T, Jégou H (2015) Rotation and translation covariant match kernels for image retrieval. Comp Vis Image Underst 140:9-20
- Tong S, Chang E (2001) Support vector machine active learning for image retrieval. ACM International Conference on Multimedia, pp. 107-118
- Vapnik V (1997) The nature of statistical learning theory. IEEE Trans Neural Netw 8(6)
- Varma M, Babu BR (2009) More generality in efficient multiple kernel learning. In Proceedings of International Conference on Machine Learning, pp.1065-1072
- Vishwanathan SVN, Sun Z, Ampornpunt N, Varma M (2010) Multiple kernel learning and the SMO algorithm. In: NIPS, pp. 2361-2369
- Wang D, Hoi SC, He Y, Zhu J, Mei T, Luo J (2014) Retrieval-based face annotation by weak label regularized local coordinate coding. IEEE Trans Pattern Ana Mach Intell (TPAMI) 36(3):550-563
- Wu Y, Chang EY, Chang CC, Kevin, Smith JR (2004) Optimal multimodal fusion for multi-media data analysis. In: ACM Multimedia Conference, pp. 572-579
- Wu Y, Chang EY, Chen-Chuan Chang K, Smith JR (2004) Optimal multimodal fusion for multimedia data analysis. ACM International Conference on Multimedia, pp.572-579
- Xia H, Hoi SC, Jin R, Zhao P (2012) Online multiple kernel similarity learning for visual search. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 1(1)
- Yan Y, Ricci E, Liu G, Sebe N (2015) Egocentric daily activity recognition via multitask clustering. IEEE Trans Image Process 24(10):2984-2995
- Yan Y, Ricci E, Subramanian R, Liu G, Lanz O, Sebe N. A multi-task learning framework for head pose estimation under target motion, IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), in press
- Yan Y, Shen H, Liu G, Ma Z, Gao C, Sebe N (2014) GLocal tells you more: coupling glocal structural for feature selection with sparsity for image and video classification. Comp Vision Image Underst (CVIU) 124(7):99-109
- Yang Y, Ma Z, Hauptmann AG, Sebe N (2012) Feature selection for multimedia analysis by sharing information among multiple tasks. IEEE Trans Multimed 15(3):661-669
- Yang Y, Nie F, Xu D, Luo J, Zhuang Y, Pan Y (2012) A multimedia retrieval framework based on semi- supervised ranking and relevance feedback. IEEE Trans Pattern Anal Mach Intell (TPAMI) 34(5):723-742
- Yang Y, Song J, Huang Z, Ma Z, Sebe N, Hauptmann AG (2013) Multi-feature fusion via hierarchical regression for multimedia analysis. IEEE Trans Multimedia 15(3):572-58
- Yang Y, Zhuang Y, Wu F, Pan Y (2008) Harmonizing hierarchical manifolds for multimedia document semantics understanding and cross-media retrieval. IEEE Transactions on Multimedia 10(3):437-446
- Yu Z, Wu F, Yang Y, Tian Q, Luo J, Zhuang Y (2014) Discriminative coupled dictionary hashing for fast cross-media retrieval. SIGIR, 395-404
- Zhang H, Liu Y, Ma Z (2013) Fusing inherent and external knowledge with nonlinear learning for cross- media retrieval. Neurocomputing 119:10-16
- Zhang H, Wu P, Beck A, Zhang Z, Gao X (2016) Adaptive incremental learning of image semantics with application to social robot. Neurocomputing 173:93-101
- Zhang H, Yu J, Wang M, Liu Y (2012) Semi-supervised distance metric learning based on local linear regression for data clustering. Neurocomputing 93:100-105
- Zhang H, Yuan J, Gao X, Chen Z (2014) Boosting cross-media retrieval via visual-auditory feature analysis and relevance feedback. ACM International Conference on Multimedia
- Zhuang Y, Yang Y, Wu F (2008) Mining semantic correlation of heterogeneous multimedia data for cross- media retrieval. IEEE Transactions on Multimedia 10(2):221-229
- Hong Zhang corresponding author, received the BS degree from Wuhan University of Technology, China, in 2001, the MS degree from Wuhan University of Technology, China, in 2004, and PhD degree from Zhejiang University, China, in 2007. She is currently a professor in the college of computer science and technology, Wuhan University of Science and Technology, China. Her research interests include content-based multimedia analysis, machine learning and cross-media retrieval.
- Wenping Zhang is currently a master student in the college of computer science and technology, Wuhan University of Science and Technology, China. He received his BS degree from Huazhong Agricultural University Chutian College, China, in 2014. His research interests include machine learning and data mining.
Hehe Fan