Multiple kernel visual-auditory representation learning for retrieval

Hehe Fan

doi:10.1007/S11042-016-3294-5


2016, Multimedia Tools and Applications



16 pages


Multimed Tools Appl
DOI 10.1007/s11042-016-3294-5

Multiple kernel visual-auditory representation learning for retrieval

Hong Zhang 1,2 & Wenping Zhang 1 & Wenhe Liu 3 & Xin Xu 1 & Hehe Fan 4

Received: 4 October 2015 / Revised: 3 December 2015 / Accepted: 21 January 2016
© Springer Science+Business Media New York 2016

Abstract Cross-media data representation, which focuses on understanding the semantics of multimedia data in different modalities, is a rising hot topic in web media data analysis. The most challenging issues for cross-media data representation are how to find underlying content-level data correlations and how to use such correlations in the representation model. Most traditional web media data analysis works are based on single-modality data sources, such as Flickr images or YouTube videos, leaving cross-media data representation and semantics understanding wide open. In this paper, we propose a multiple kernel visual-auditory representation learning approach, which learns cross-media correlations from visual and auditory feature spaces with multiple kernel strategies. In addition, we give a cross-media distance measure for image-audio retrieval in the mutual subspace of co-occurrence. Experimental results on the collected image-audio database are encouraging, and show that our approach is effective from multiple perspectives.

Keywords Multiple kernel learning · Visual-auditory data representation · Cross-media retrieval

1 Introduction

Multimedia representation learning has drawn tremendous research attention in the past decades.
In areas such as Content-based Image Retrieval (CBIR) [9, 19, 31], multimedia data clustering and classification [12, 28, 38], face and motion recognition [11, 24], and event detection [1–3], abundant representation learning methods have been proposed to explore semantic-level data representation models that can be used to better understand underlying data correlations. For example, in CBIR research, subspace learning is frequently used to bridge the gap between low-level visual features and high-level image semantics so as to build a semantic data representation. However, most of these works have focused on multimedia data of a single modality, such as image or audio, and cross-media data representation learning is mostly ignored [32, 35]. It is interesting and challenging to retrieve multimedia data of different modalities at the same time, especially now that different kinds of multimedia data usually coexist in web sources representing similar semantics.

* Hong Zhang
[email protected]
1 College of Computer Science & Technology, Wuhan University of Science & Technology, Wuhan 430081, China
2 Hubei Province Key Laboratory of Intelligent Information Processing and Real-time Industrial System, Wuhan, China
3 The Centre for Quantum Computation & Intelligent Systems, the University of Technology, Sydney (UTS), Sydney, Australia
4 Baidu, Beijing, China

Content features are the carriers of multimedia semantics. The main challenge for cross-media data representation lies in two aspects: how to find underlying feature correlations among multimedia data of different modalities, and how to use such correlations in cross-media data representation. Considering these two issues, some researchers have proposed representation learning models under certain cross-media data environments. For example, Yi Yang et al. [33] proposed a multi-feature fusion algorithm based on hierarchical regression to learn general multimodal semantics, and verified it on a multimodal document database containing text, image and audio. The work in [36] proposed a cross-media representation learning framework, which explored inherent feature correlations and discovered useful external knowledge based on nonlinear low-level feature analysis. The work in [40] learned a uniform cross-media correlation graph, in which different kinds of multimedia objects are represented in exactly the same way. Most of these works explored underlying cross-media correlations and built multimodal data representations with the help of prior knowledge, such as page links, user comments and tagging. However, the underlying cross-media correlation among heterogeneous low-level content features is mostly ignored or underestimated. Experimental evidence has shown that different kinds of multimedia data each contribute to high-level semantics, so that the presence of one modality usually has a "complementary effect" on the other [33]. Our previous work on cross-media data analysis also showed that such complementary information can be explored and utilized to improve multimedia semantics understanding [37, 39].

However, it is difficult to learn an effective cross-media content representation because multimedia data of different modalities originally reside in heterogeneous low-level feature spaces. Although image and audio data may represent similar semantics, such as an image of a bird and an audio clip of a bird singing, it is challenging to find a unified representation for both bird images and audio clips. In this paper, we propose a Multiple Kernel Visual-Auditory Representation Learning (MKVARL) method for retrieval. Our framework is formulated based on two typical modalities, i.e., image and audio.
In preprocessing, considering that audio is time series data while image data is static, we use the fuzzy clustering method proposed in our previous work to obtain audio indexes so that all audio data is represented in the same dimension. Then, inspired by the recent multiple kernel learning algorithm for visual search [27], we propose multiple kernel visual-auditory learning. Specifically, we first map the low-level image feature matrix and audio feature matrix into high-dimensional kernel spaces with multiple kernel functions in order to better explore underlying cross-media correlations; second, we calculate visual-auditory canonical correlations between each pair of kernel spaces, and maximize such correlations when we map the kernel spaces into the low-dimensional Isomorphic Visual-Auditory Subspace (IVA-Subspace). With multiple kernel learning, cross-media data correlations are analyzed from different aspects, and more useful information can be explored in the high-dimensional kernel spaces than in the original visual and auditory feature spaces. Furthermore, we discuss how to apply our MKVARL method to cross-media retrieval between image and audio. Experiments and comparisons verify the validity, superiority and applicability of our approach from different aspects.

The rest of this paper is organized as follows. Section 2 discusses related works from two aspects. Section 3 presents multiple kernel visual-auditory representation learning based on image and audio samples, and describes how to enable flexible cross-media retrieval between image and audio datasets. Section 4 presents the experimental results and comparisons. We give concluding remarks in Section 5.

2 Related works

As previously discussed, visual-auditory search belongs to the area of cross-media retrieval, and our paper mainly focuses on the challenge of multi-kernel visual-auditory representation learning.
Therefore, in this section, we discuss related works from the perspectives of cross-media retrieval [32, 33, 35, 36] and multiple kernel distance metric learning [10, 34].

2.1 Cross-media retrieval

Cross-media retrieval originates from content-based multimedia analysis and retrieval, which is a long-standing research topic in computer vision [30]. As previously discussed, most content-based multimedia retrieval works, such as Content-based Image Retrieval (CBIR) [9, 31], focus on multimedia data of a single modality to bridge the semantic gap between low-level features and high-level semantics [15, 29]. Considering the content gap between different multimedia data, cross-media retrieval aims to build a flexible retrieval framework, in which users can search multimedia data with a query example of a different modality [32, 35]. For example, in a cross-media retrieval system, we can obtain relevant image and audio results by submitting an image query example or an audio query example. The main challenge for cross-media retrieval is how to measure the similarity between different kinds of low-level feature spaces. For example, although image and audio data could represent similar semantics, it is very difficult to measure the low-level feature similarity between visual features of images and auditory features of audio clips.

In the past few years, researchers have proposed some cross-media retrieval algorithms that provide possible solutions to bridge the content gap for flexible retrieval. Most of this research can be grouped into three categories: context-based cross-media retrieval, cross-modal video data analysis and retrieval, and content-based cross-media retrieval.

In the first group, context correlations, such as web links, conclusion relation and text comments, are explored and used to estimate cross-media similarity between multimedia data of different modalities. For example, Yang et al. proposed a distance measure between heterogeneous Multimedia Documents (MMD), which consist of text, image or audio samples, and constructed an MMD semantic subspace for cross-media retrieval [34]. MMD is a typical cross-media data environment with rich context correlations. If an image and an audio clip are included in the same MMD, we can assume these two multimedia objects represent similar semantics. Web pages and PPT documents are examples of MMD.

In the second group, video data contains different tracks of information, including key frame images, sounds and voices, text subtitles, etc. Such data is frequently used to jointly analyze different tracks of low-level video features, such as visual features of key frames, auditory features of speakers and caption features. Many researchers are dedicated to cross-modal retrieval between different tracks of video data [10, 26]. For example, the work in [13] proposed a subject model which learned probabilistic collections between high-frequency semantic concepts (keywords) and multimedia objects so that users could retrieve news of different types.

In the third group, a few researchers focus on how to analyze content-level statistical cross-media correlation with labeled and unlabeled data [36, 37, 39]. Although multimedia data of different modalities may "look" different in visual and auditory representations, they may have statistical content-level correlation which could be explored and used for retrieval. For example, the work in [36] proposed the isomorphic cross-media subspace mapping algorithm, which calculated and maintained the underlying canonical correlation between the visual feature matrix of images and the auditory feature matrix of audio clips during subspace mapping.

2.2 Multiple kernel distance metric learning

Kernel methods typically consist of two parts.
The first part maps the input feature space into another space, often of much higher or even infinite dimensionality, by applying a nonlinear function; the second part usually applies a linear method in the high-dimensional space. Kernel-based methods are not new for multimedia retrieval; for example, kernel SVM algorithms have been successfully introduced into CBIR tasks [20]. In the kernel-based multimedia representation and distance metric learning literature, some algorithms were proposed for similarity learning in CBIR. Connections between representation learning and kernel learning, which can provide kernelization for a set of metric learning methods, have been revealed in recent studies [6].

Multiple kernel learning (MKL) [8, 16] is now a hot research topic in machine learning. It has been used with great success in various studies and applications, such as bioinformatics, computer vision, and natural language processing. The work in [8] found the optimal combination of multiple kernels for learning classifiers for a given classification task. In addition, several recent studies address multiple kernel learning for multi-class and multi-labeled data so as to improve system efficiency and generality [7, 22, 23]. Compared with single-kernel methods, such as a standard kernel SVM, MKL attempts to achieve better results by combining several base kernels instead of using only one specific kernel [21]. MKL allows the practitioner to optimize over linear combinations of kernels, and research has focused on both the learning formulation and the corresponding optimization. Because different applications need different formulations, existing MKL methods use different learning functions to determine the kernel combination [5]. In terms of combination functions, most MKL studies work with linear combinations, which fall into two basic categories: unweighted sum and weighted sum.
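The two linear combination categories can be sketched on precomputed kernel (Gram) matrices. This is a minimal Python/NumPy illustration, not code from the paper: the function names `combine_unweighted` and `combine_weighted` are hypothetical, and the weights are assumed to be given rather than learned.

```python
import numpy as np

def combine_unweighted(kernels, use_mean=True):
    """Unweighted combination: the sum or the mean of the base Gram matrices."""
    stacked = np.asarray(kernels, dtype=float)   # shape (k, n, n)
    total = stacked.sum(axis=0)
    return total / stacked.shape[0] if use_mean else total

def combine_weighted(kernels, weights):
    """Weighted combination: sum over d of weights[d] * kernels[d]."""
    stacked = np.asarray(kernels, dtype=float)   # shape (k, n, n)
    w = np.asarray(weights, dtype=float)         # shape (k,)
    if w.shape[0] != stacked.shape[0]:
        raise ValueError("need exactly one weight per kernel")
    # Contract the weight vector with the kernel index of the stack.
    return np.tensordot(w, stacked, axes=1)
```

With non-negative weights, a combination of positive semi-definite Gram matrices remains positive semi-definite, which is one reason linear combinations are convenient to work with.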
In the unweighted sum case, the sum or mean of the kernels is used as the combined kernel; in the weighted case, a weight is optimized for each kernel in the linear combination. There are also nonlinear combination studies, which apply nonlinear functions of kernels (e.g., multiplication, power and exponentiation). As for target functions, MKL algorithms are typically categorized into three groups: similarity-based functions, structural risk functions and Bayesian functions. All MKL algorithms share the goal of learning the optimal combination of multiple kernels; our method differs from the others in that we aim to learn a kernel-based similarity function for image retrieval, while conventional MKL studies often handle classification tasks.

2.3 Discussion

The related works above obtained satisfying results on multimedia representation and retrieval. Our approach of multiple kernel visual-auditory representation and retrieval differs from most related works in the following aspects. On the one hand, we aim to learn a kernel-based similarity function for visual-auditory retrieval, while conventional MKL studies often handle single-modality multimedia data analysis tasks. On the other hand, content-based multimedia analysis and retrieval works mostly focus on single-modality data and ignore the issue of cross-media correlation analysis and semantics understanding, which is addressed in this paper.

3 Multiple kernel visual-auditory representation learning

We aim to learn a general visual-auditory representation framework in which different types of multimedia data are represented in an isomorphic subspace and cross-media correlation can be easily measured for ranking query results. Figure 1 illustrates the flowchart of the proposed Multiple Kernel Visual-Auditory Representation Learning (MKVARL) method.
The main idea of our approach is as follows: first, we map the audio feature matrix and the image feature matrix into k Hilbert spaces respectively; second, we analyze canonical correlations between each pair of audio Hilbert space and image Hilbert space; third, we map both image samples and audio samples from the Hilbert spaces into the Isomorphic Visual-Auditory Subspace (IVA-Subspace), where the original canonical correlations are maximally preserved. In the IVA-Subspace, we propose a cross-media distance metric to estimate visual-auditory correlation for retrieval. In this way we can return the image or audio samples most similar to the query example that the user submitted.

Fig. 1 The framework of the proposed MKVARL method

3.1 Visual-auditory kernel canonical correlation analysis and mapping

Suppose $X_{n \times p} = (x_1, x_2, \dots, x_n)^T$ and $Y_{n \times q} = (y_1, y_2, \dots, y_n)^T$ are the original low-level feature matrices of images and audio clips respectively, where $n$ is the number of samples and $p$, $q$ are the feature dimensions. Let $\varphi_x(x) = (\varphi_x(x_1), \varphi_x(x_2), \dots, \varphi_x(x_n))$ denote the transformed Hilbert space $H_x$ for the image feature matrix $X_{n \times p}$, and $\varphi_y(y) = (\varphi_y(y_1), \varphi_y(y_2), \dots, \varphi_y(y_n))$ denote the transformed Hilbert space $H_y$ for the audio feature matrix $Y_{n \times q}$. Motivated by the canonical correlation analysis method, we hope to find two projection vectors $w_x$ ($p \times m$) and $w_y$ ($q \times m$), with which the underlying correlations between $H_x$ and $H_y$ are maximally maintained in the $m$-dimensional mutual subspace, named the Isomorphic Visual-Auditory Subspace (IVA-Subspace). Let $u = w_x^T \varphi_x(x)$ and $v = w_y^T \varphi_y(y)$ denote the IVA-Subspace mapping process; $w_x$ and $w_y$ can be found by solving the following Lagrangian function:

$$L(w_x, w_y, \lambda_x, \lambda_y) = E[(u - E(u))(v - E(v))] - \frac{\lambda_x}{2} E[(u - E(u))^2] - \frac{\lambda_y}{2} E[(v - E(v))^2] + L_0 \tag{1}$$

where $L_0 = \frac{\eta}{2}\left(\|w_x\|^2 + \|w_y\|^2\right)$ and $\eta$ is a regularization constant. $L_0$ is used because the dimensionalities of the Hilbert spaces are large.
Equation (1) may lead to some nonsense projection vectors without $L_0$. Based on the reproducing kernel theory [4, 18], we have:

$$w_x = \sum_i \alpha_i \varphi_x(x_i), \quad w_y = \sum_i \beta_i \varphi_y(y_i) \tag{2}$$

where $\alpha_i$, $\beta_i$ are weight parameters. Thus, we can rewrite $u$ and $v$ as:

$$u = \sum_i \alpha_i \varphi_x(x_i)^T \varphi_x(x) \tag{3}$$

$$v = \sum_i \beta_i \varphi_y(y_i)^T \varphi_y(y) \tag{4}$$

Then $u$ and $v$ can be calculated using only inner products in the Hilbert spaces. In practice, since we do not need an explicit form of $\varphi(x)$, we first determine a kernel $k_x$ that can be decomposed into the form of an inner product. By Mercer's theorem, a symmetric positive definite kernel $k_x$ can be decomposed into the inner product form. We define the kernel functions $k_x(x_i, x_j)$ and $k_y(y_i, y_j)$ as below:

$$k_x(x_i, x_j) = \varphi_x(x_i)^T \varphi_x(x_j), \quad k_y(y_i, y_j) = \varphi_y(y_i)^T \varphi_y(y_j) \tag{5}$$

The corresponding kernel matrices are $(K_x)_{ij} = k_x(x_i, x_j)$ and $(K_y)_{ij} = k_y(y_i, y_j)$. Furthermore, we can get

$$M\beta = \lambda L \alpha, \quad M^T \alpha = \lambda N \beta \tag{6}$$

$$M = \frac{1}{n} K_x^T J K_y, \quad L = \frac{1}{n} K_x^T J K_x + \eta_1 K_x, \quad N = \frac{1}{n} K_y^T J K_y + \eta_2 K_y, \quad J = I - \frac{1}{n}\mathbf{1}\mathbf{1}^T \tag{7}$$

where $\mathbf{1}$ is the all-ones vector of length $n$. Based on Eq. (6), we can obtain

$$L^{-1} M N^{-1} M^T \alpha = \lambda^2 \alpha, \quad N^{-1} M^T L^{-1} M \beta = \lambda^2 \beta \tag{8}$$

Therefore, the visual-auditory kernel canonical correlation analysis and mapping process is as below:

3.2 Extension to multiple kernel visual-auditory analysis

As previously defined, $X_{n \times p}$ and $Y_{n \times q}$ are the original image feature matrix and audio feature matrix respectively. Let $x_i = (x_{i1}, x_{i2}, \dots, x_{ip})$ ($x_{ik} \in \mathbb{R}$) and $y_i = (y_{i1}, y_{i2}, \dots, y_{iq})$ ($y_{ik} \in \mathbb{R}$) denote visual feature vectors and auditory feature vectors respectively. Suppose $K^d_{x,y}$ ($d = 1, 2, \dots, k$) are $k$ kernel functions, each associated with a Hilbert space $H_d$. First, we map $X_{n \times p}$ and $Y_{n \times q}$ into Hilbert spaces $I_d$ and $A_d$ with the kernel function $K^d_{x,y}$. Then we calculate the canonical correlation between each pair of image Hilbert space and audio Hilbert space, and obtain the corresponding projection vectors $w_x$ and $w_y$.
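For a single pair of kernel matrices, the computation in Eqs. (6)-(8) can be sketched in Python with NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `kcca` and `project`, the default values of `eta1` and `eta2`, and the use of a dense eigen-solver are choices made here, and the kernel matrices are assumed to be positive definite so that the linear systems are solvable.

```python
import numpy as np

def kcca(Kx, Ky, eta1=0.1, eta2=0.1, m=2):
    """Solve Eqs. (6)-(8) for one pair of n x n kernel matrices.

    Returns the top-m coefficient vectors alpha and beta (as columns)
    and the corresponding lambda values.
    """
    n = Kx.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix J of Eq. (7)
    M = Kx.T @ J @ Ky / n
    L = Kx.T @ J @ Kx / n + eta1 * Kx
    N = Ky.T @ J @ Ky / n + eta2 * Ky

    # Eq. (8): L^{-1} M N^{-1} M^T alpha = lambda^2 alpha
    A = np.linalg.solve(L, M @ np.linalg.solve(N, M.T))
    evals, evecs = np.linalg.eig(A)
    order = np.argsort(-evals.real)[:m]        # largest lambda^2 first
    lam = np.sqrt(np.clip(evals.real[order], 0.0, None))
    alpha = evecs[:, order]

    # Second relation of Eq. (6): beta = N^{-1} M^T alpha / lambda
    beta = np.linalg.solve(N, M.T @ alpha) / lam
    return alpha, beta, lam

def project(K_new, coeffs):
    """Eqs. (3)-(4): coordinates from kernel values against the training samples.

    K_new has one row per sample to project and one column per training sample.
    """
    return K_new @ coeffs
```

In the multiple kernel setting of this section, such a routine would be called once per kernel function, giving one pair of coefficient matrices per Hilbert space pair.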
Therefore, we transform the kernel matrices into the $m$-dimensional IVA-Subspace, where cross-media correlations between image and audio kernel features are preserved. We define $x^d_i = (x^d_{i1}, x^d_{i2}, \dots, x^d_{im})$, with complex entries $x^d_{ij} = a + b\,\mathrm{i}$ ($a, b \in \mathbb{R}$), which is obtained from the Hilbert space $I_d$, as the image feature vector in the IVA-Subspace. For the audio representation, we likewise have $m$-dimensional representations $y^d_i$. To estimate the cross-media distance in the IVA-Subspace, we transform the complex numbers in $x^d_i$ into polar coordinate representation:

$$x^d_{ij} = \left(\beta_{ij}, |x^d_{ij}|\right), \quad \beta_{ij} = \arctan\frac{b}{a}, \quad |x^d_{ij}| = \sqrt{a^2 + b^2} \tag{9}$$

We perform the same polar coordinate transformation on all the vectors in $y^d_i$, which gives moduli $|y^d_{ij}|$ and angles $\beta'_{ij}$, and define the distance between image $x^d_i$ and audio $y^d_i$ as:

$$\mathrm{dis}(x^d_i, y^d_i) = \sqrt{\sum_{j=1}^{m} \left( |x^d_{ij}|^2 + |y^d_{ij}|^2 - 2\,|x^d_{ij}|\,|y^d_{ij}| \cos(\beta_{ij} - \beta'_{ij}) \right)} \tag{10}$$

Thus, the similarity of an image $x_i$ and an audio clip $y_i$ is:

$$S(x_i, y_i) = \sum_{d=1}^{k} \eta_d \, \mathrm{dis}(x^d_i, y^d_i) \tag{11}$$

where $\eta_d$ are the combination weights.

Based on the above analysis, we discuss how to enable cross-media retrieval under two situations: query example inside the database and query example outside the database. If the query example is outside the database, we use the method in our previous work to estimate its coordinates in the IVA-Subspace [39], and then we can measure the cross-media correlation with the same method used for the database samples. Our MKVARL algorithm is described as below:

4 Experiments

4.1 Experimental setup

We conduct a set of experiments to evaluate the performance of the proposed algorithm in cross-media retrieval. We use the Mean Average Precision (MAP) and top-k retrieval accuracy for performance evaluation. Since there is no benchmark cross-media database available to evaluate the proposed MKVARL approach, we collect an image-audio dataset crawled from websites, including Flickr, http://image.baidu.com, http://encarta.msn.com, http://www.animalbehaviorarchive.org, etc.
Some other audio clips are extracted from movies. The collected dataset consists of 10 semantic categories, such as bird, car, dog, violin, etc. In each category there are 100 images and 70 audio clips. We randomly select 60 images and 60 audio clips from each category as training data, and the rest are used as new media objects to test the performance of mapping new media objects into the IVA-Subspace.

The extracted visual features include Color Histogram (in HSV space), Edge Histogram, a texture feature based on the gray-level co-occurrence matrix, Speeded Up Robust Features (SURF) and GIST. Auditory features are made up of Centroid, Rolloff, Spectral Flux, and Root Mean Square. We concatenate the different visual features into high-dimensional vectors as input. Since audio is time series data, the dimensionalities of the auditory feature vectors are inconsistent. We employ Fuzzy Clustering on the auditory features in preprocessing to get isomorphic audio feature indexes [39].

As described in Section 3, we use three kinds of kernels for visual-auditory correlation analysis. Specifically, we use the radial basis function in (12), the polynomial kernel function in (13) and the sigmoid function in (14).

$$k(x, y) = \exp\left(-\frac{\|x - y\|^2}{\gamma \sigma^2}\right) \tag{12}$$

$$k(x, y) = (\gamma \langle x, y \rangle + c)^n \tag{13}$$

$$k(x, y) = \tanh(\gamma \langle x, y \rangle + c) \tag{14}$$

where we choose the empirically optimal values $\gamma = 2$, $\sigma = 2.4$ in (12), $\gamma = 1$, $c = 1$, $n = 4.2$ in (13) and $\gamma = 0.6$, $c = 1.9$ in (14), and the empirically optimal combination weights $\eta = (0.35, 0.2, 0.45)$ in (11).

4.2 Performance comparison results

To evaluate the efficacy of the proposed algorithm, we compare the image-audio retrieval performance of the proposed MKVARL approach with the PCA [25], CCA [17] and KCCA [14] methods. When users submit an image query example that is in the training set, relevant audio clips are retrieved and returned, and vice versa.
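The kernel functions of Eqs. (12)-(14) with the parameter values reported above, together with the distance of Eqs. (9)-(11) that is used to rank retrieval results, can be sketched in Python with NumPy. This is an illustration under stated assumptions rather than the authors' code: the function names are hypothetical, the sum in Eq. (10) is taken inside the square root, and the polar angle is computed with `np.angle` (a four-quadrant arctangent) instead of the plain arctan(b/a) of Eq. (9).

```python
import numpy as np

def rbf_kernel(x, y, gamma=2.0, sigma=2.4):
    """Eq. (12): exp(-||x - y||^2 / (gamma * sigma^2))."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-(diff @ diff) / (gamma * sigma ** 2)))

def polynomial_kernel(x, y, gamma=1.0, c=1.0, n=4.2):
    """Eq. (13): (gamma * <x, y> + c)^n.

    With the non-integer exponent n = 4.2 the base must be non-negative,
    otherwise the real-valued power is undefined and NumPy returns nan.
    """
    base = gamma * np.dot(x, y) + c
    return float(np.power(base, n))

def sigmoid_kernel(x, y, gamma=0.6, c=1.9):
    """Eq. (14): tanh(gamma * <x, y> + c)."""
    return float(np.tanh(gamma * np.dot(x, y) + c))

def polar_distance(x, y):
    """Eqs. (9)-(10) for two complex m-dimensional subspace vectors."""
    x = np.asarray(x, dtype=complex)
    y = np.asarray(y, dtype=complex)
    rx, ry = np.abs(x), np.abs(y)          # moduli sqrt(a^2 + b^2)
    bx, by = np.angle(x), np.angle(y)      # polar angles (assumption: arctan2)
    terms = rx ** 2 + ry ** 2 - 2.0 * rx * ry * np.cos(bx - by)
    # Clip tiny negative values caused by rounding before the square root.
    return float(np.sqrt(np.sum(np.maximum(terms, 0.0))))

def combined_score(xs, ys, eta=(0.35, 0.2, 0.45)):
    """Eq. (11): weighted sum of the per-kernel distances."""
    return sum(w * polar_distance(x, y) for w, x, y in zip(eta, xs, ys))
```

Under the `np.angle` assumption, each term of Eq. (10) is the law of cosines applied to one complex coordinate, so `polar_distance` coincides with the Euclidean norm of the complex difference `x - y`.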
In our experiments, if a returned result and the query example are in the same semantic category, it is regarded as a correct result. The precision is defined as the percentage of correctly retrieved samples in the top-k returned results. Figure 2 shows the Mean Average Precision (MAP) of the different algorithms and Fig. 3 shows the comparison results of the recall ratio. In Figs. 2 and 3, the MAP and the recall values are the average results of 10 queries in each semantic category, including 5 queries of images with audio examples and 5 queries of audio with image examples. The query examples are randomly selected.

From Figs. 2 and 3 we can see that the performances of the CCA, KCCA and MKVARL methods are much better than that of PCA. Meanwhile, KCCA outperforms CCA, while our proposed MKVARL algorithm achieves the best performance. These results are probably obtained because: (1) the projection vectors of CCA, KCCA and MKVARL are computed based on the potential relevance between image features and audio features, and can therefore better reflect the high-level semantics; (2) the use of a kernel function in KCCA makes it more appropriate for nonlinear correlation; (3) different kernels correspond to different notions of similarity between two data samples. In particular, in a high-dimensional feature space, it is not optimal to choose one kernel for all the datasets. A single type of kernel function may fail to exploit the potential of all correlations, whereas multiple types of kernel functions can better explore them, which validates the importance of the proposed method. Our approach generally returns more relevant results, which verifies the effectiveness of the proposed method.

Fig. 2 MAP performance comparison results of image-audio retrieval

Figure 4 is a specific example of image-audio retrieval. The query example is a 5-s audio clip in the violin category.
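The top-k precision defined in this section, and a MAP computed from ranked result lists, can be sketched in plain Python. The paper does not spell out its MAP formula, so the average precision below follows the common definition and normalizes by the number of correct results found in the ranked list; that choice and the function names are assumptions made for this sketch.

```python
def precision_at_k(relevant, k):
    """Fraction of correct results among the top-k returns.

    `relevant` is a ranked list of booleans (or 0/1): True when the returned
    item is in the same semantic category as the query.
    """
    return sum(1 for r in relevant[:k] if r) / k

def average_precision(relevant):
    """Mean of the precision values taken at each rank holding a correct result."""
    hits = 0
    total = 0.0
    for rank, rel in enumerate(relevant, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def mean_average_precision(runs):
    """Average of the per-query average precision over all queries."""
    return sum(average_precision(r) for r in runs) / len(runs)
```

For example, `precision_at_k([True, True, False, True], 4)` evaluates to 0.75, since three of the four returned items are correct.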
We compute the similarity score between the query audio and the images in the database, and return the top-15 relevant images. The numbers below the returned images are the correlation values between the images and the audio query example. It can be seen from Fig. 4 that among the top 15 returned results there are 12 violin images.

Fig. 3 Recall performance comparison results of image-audio retrieval

Fig. 4 An example of image-audio retrieval

4.3 Performance evaluation of new media objects

To test image-audio retrieval performance when query examples are outside the training set, we first use the method in our previous work to estimate their coordinates in the IVA-Subspace [39], and then use the cosine distance metric to compute the cross-media correlation scores. Figures 5 and 6 show the experimental results with new query examples, including querying images by new audio and querying audio by new images. From Figs. 5 and 6 we can make a similar observation: the overall retrieval performance with new multimedia data is good. When querying images by an example of new audio, there are 8.58 correct results in the top 20 returns on average. The performance of querying audio by new image is similar to that of querying image by new audio.

Fig. 5 Querying image by new audio

Fig. 6 Querying audio by new image

5 Conclusions

Different from most existing multimedia representation learning methods, this paper proposes a multiple kernel visual-auditory representation learning framework, which learns a general representation model from the visual and auditory feature spaces by explicitly learning statistical cross-media correlations from high-dimensional kernel spaces. In addition, we design a distance metric learning strategy in the mutual subspace. The performance of our approach is tested with cross-media retrieval between image and audio data. Experiments and comparisons verify the validity, superiority and applicability of our approach from different aspects.
The main limitation is that the image-audio database is comparatively small (many web image galleries are not usable because it is difficult to find suitable audio clips). Future work includes further study on large-scale social media datasets.

Acknowledgments This research is supported by the National Natural Science Foundation of China (No. 61003127, No. 61373109, No. 61440016) and the China Scholarship Council (201508420248).

References

1. Chang X, Yang Y, Hauptmann AG, Xing E, Yu Y (2015) Semantic concept discovery for large-scale zero-shot event detection. International Joint Conference on Artificial Intelligence, IJCAI
2. Chang X, Yang Y, Xing E, Yu Y (2015) Complex event detection using semantic saliency and nearly-isotonic SVM. International Conference on Machine Learning (ICML)
3. Chang X, Yu Y, Yang Y, Hauptmann A (2015) Searching persuasively: joint event detection and evidence justification with limited supervision. ACM MM
4. Gao DD, Huang RB (2000) Some results on canonical correlation and their application to a linear model. Linear Algebra Appl 321:47–59
5. Gönen M, Alpaydın E (2011) Multiple kernel learning algorithms. J Mach Learn Res 12:2211–2268
6. Jain P, Kulis B, Davis JV, Dhillon IS (2012) Metric and kernel learning using a linear transformation. J Mach Learn Res 13:519–547
7. Jain A, Vishwanathan SVN, Varma M (2012) Spg-gmkl: generalized multiple kernel learning with a million kernels. In: Proceedings of the ACM SIGKDD conference on knowledge discovery and data mining
8. Lanckriet GRG, Cristianini N, Bartlett P, Ghaoui LE, Jordan MI (2004) Learning the kernel matrix with semi-definite programming. J Mach Learn Res 5:27–72
9. Lew MS, Sebe N, Djeraba C, Jain R (2006) Content-based multimedia information retrieval: state of the art and challenges. ACM Trans Multimed Comput Commun Appl 2(1):1–19
10. Liu Y, Wu F, Zhuang Y, Xiao J (2008) Active post-refined multimodality video semantic concept detection with tensor representation. ACM International Conference on Multimedia, pp. 91–100
11. Liu G, Yan Y, Gao C, Tong W, Hauptmann AG, Sebe N (2014) The mystery of faces: investigating face contribution for multimedia event detection. ICMR
12. Liu H, Yu L (2005) Toward integrating feature selection algorithms for classification and clustering. IEEE Trans Knowl Data Eng 17(4):491–502
13. Ma Q, Akiyo N, Katsumi T (2006) Complementary information retrieval for cross-media news content. Inf Syst 31(7):659–678
14. Melzer T, Reiter M, Bischof H (2003) Appearance models based on kernel canonical correlation analysis. Pattern Recogn 36:1961–1971
15. Shen H, Yan Y, Xu S, Ballas N, Chen W (2015) Evaluation of semi-supervised learning method on action recognition. Multimedia Tools and Applications 74(2):523–542
16. Sonnenburg S, Rätsch G, Schäfer C, Schölkopf B (2006) Large scale multiple kernel learning. J Mach Learn Res 7:1531–1565
17. Sun T, Chen S (2007) Locality preserving CCA with applications to data visualization and pose estimation. Image Vis Comput 25:531–543
18. Thomas M, Michael R, Horst B (2003) Appearance models based on kernel canonical correlation analysis. Pattern Recogn 27(2):1–8
19. Tolias G, Bursuc A, Furon T, Jégou H (2015) Rotation and translation covariant match kernels for image retrieval. Comp Vis Image Underst 140:9–20
20. Tong S, Chang E (2001) Support vector machine active learning for image retrieval. ACM International Conference on Multimedia, pp. 107–118
21. Vapnik V (1997) The nature of statistical learning theory. IEEE Trans Neural Netw 8(6)
22. Varma M, Babu BR (2009) More generality in efficient multiple kernel learning. In: Proceedings of International Conference on Machine Learning, pp. 1065–1072
23. Vishwanathan SVN, Sun Z, Ampornpunt N, Varma M (2010) Multiple kernel learning and the SMO algorithm. In: NIPS, pp. 2361–2369
24. Wang D, Hoi SC, He Y, Zhu J, Mei T, Luo J (2014) Retrieval-based face annotation by weak label regularized local coordinate coding. IEEE Trans Pattern Anal Mach Intell (TPAMI) 36(3):550–563
25. Wu Y, Chang EY, Chang KCC, Smith JR (2004) Optimal multimodal fusion for multimedia data analysis. In: ACM Multimedia Conference, pp. 572–579
26. Wu Y, Chang EY, Chen-Chuan Chang K, Smith JR (2004) Optimal multimodal fusion for multimedia data analysis. ACM International Conference on Multimedia, pp. 572–579
27. Xia H, Hoi SC, Jin R, Zhao P (2012) Online multiple kernel similarity learning for visual search. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 1(1)
28. Yan Y, Ricci E, Liu G, Sebe N (2015) Egocentric daily activity recognition via multitask clustering. IEEE Trans Image Process 24(10):2984–2995
29. Yan Y, Ricci E, Subramanian R, Liu G, Lanz O, Sebe N. A multi-task learning framework for head pose estimation under target motion. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), in press
30. Yan Y, Shen H, Liu G, Ma Z, Gao C, Sebe N (2014) GLocal tells you more: coupling glocal structural for feature selection with sparsity for image and video classification. Comp Vision Image Underst (CVIU) 124(7):99–109
31. Yang Y, Ma Z, Hauptmann AG, Sebe N (2012) Feature selection for multimedia analysis by sharing information among multiple tasks. IEEE Trans Multimed 15(3):661–669
32. Yang Y, Nie F, Xu D, Luo J, Zhuang Y, Pan Y (2012) A multimedia retrieval framework based on semi-supervised ranking and relevance feedback. IEEE Trans Pattern Anal Mach Intell (TPAMI) 34(5):723–742
33. Yang Y, Song J, Huang Z, Ma Z, Sebe N, Hauptmann AG (2013) Multi-feature fusion via hierarchical regression for multimedia analysis. IEEE Trans Multimedia 15(3):572–58
34. Yang Y, Zhuang Y, Wu F, Pan Y (2008) Harmonizing hierarchical manifolds for multimedia document semantics understanding and cross-media retrieval. IEEE Transactions on Multimedia 10(3):437–446
35. Yu Z, Wu F, Yang Y, Tian Q, Luo J, Zhuang Y (2014) Discriminative coupled dictionary hashing for fast cross-media retrieval. SIGIR, pp. 395–404
36. Zhang H, Liu Y, Ma Z (2013) Fusing inherent and external knowledge with nonlinear learning for cross-media retrieval. Neurocomputing 119:10–16
37. Zhang H, Wu P, Beck A, Zhang Z, Gao X (2016) Adaptive incremental learning of image semantics with application to social robot. Neurocomputing 173:93–101
38. Zhang H, Yu J, Wang M, Liu Y (2012) Semi-supervised distance metric learning based on local linear regression for data clustering. Neurocomputing 93:100–105
39. Zhang H, Yuan J, Gao X, Chen Z (2014) Boosting cross-media retrieval via visual-auditory feature analysis and relevance feedback. ACM International Conference on Multimedia
40. Zhuang Y, Yang Y, Wu F (2008) Mining semantic correlation of heterogeneous multimedia data for cross-media retrieval. IEEE Transactions on Multimedia 10(2):221–229

Hong Zhang, corresponding author, received the BS degree from Wuhan University of Technology, China, in 2001, the MS degree from Wuhan University of Technology, China, in 2004, and the PhD degree from Zhejiang University, China, in 2007. She is currently a professor in the College of Computer Science and Technology, Wuhan University of Science and Technology, China. Her research interests include content-based multimedia analysis, machine learning and cross-media retrieval.

Wenping Zhang is currently a master's student in the College of Computer Science and Technology, Wuhan University of Science and Technology, China. He received his BS degree from Huazhong Agricultural University Chutian College, China, in 2014. His research interests include machine learning and data mining.

Wenhe Liu received the MS degree in Artificial Intelligence from The University of Edinburgh, United Kingdom, in 2012.
He is now a Ph.D.student with The Centre for Quantum Computation & Intelligent Systems (QCIS), the University of Technology, Sydney (UTS), Sydney, Australia. His research interests include machine learning and its applications to multimedia and computer vision. Xin Xu received the Ph.D. degree in computer science and engineering from Shanghai Jiao Tong University, China. He is a lecturer in the School of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan, China. His current research interests include computer vision, pattern recognition, and visual surveillance. Multimed Tools Appl Hehe Fan received the Master degree of Computer Architecture from Huazhong University of Science and Technology, China, in 2015. His research interests include distributed computing, parallel processing and machine learning. Hehe Fan is currently a Research and Development Engineer in Baidu Inc.

