A Tutorial on Canonical Correlation Methods

João M. Monteiro; John Shawe‐Taylor; Jaz Kandola; Viivi Uurtio; Juho Rousu

doi:10.1145/3136624

Outline

A Tutorial on Canonical Correlation Methods

format_quote

citations

Delmiro Fernandez-Reyes

2018, ACM Computing Surveys

https://doi.org/10.1145/3136624

visibility

…

description

36 pages

Abstract

Canonical correlation analysis is a family of multivariate statistical methods for the analysis of paired sets of variables. Since its proposition, canonical correlation analysis has, for instance, been extended to extract relations between two sets of variables when the sample size is insufficient in relation to the data dimensionality, when the relations have been considered to be non-linear, and when the dimensionality is too large for human interpretation. This tutorial explains the theory of canonical correlation analysis, including its regularised, kernel, and sparse variants. Additionally, the deep and Bayesian CCA extensions are briefly reviewed. Together with the numerical examples, this overview provides a coherent compendium on the applicability of the variants of canonical correlation analysis. By bringing together techniques for solving the optimisation problems, evaluating the statistical significance and generalisability of the canonical correlation model, and interpr...

Computing Surveys A Tutorial on Canonical Correlation Methods Journal: Computing Surveys Manuscript ID CSUR-2017-0065.R1 Fo Paper: Tutorial Paper Date Submitted by the Author: n/a r Complete List of Authors: Uurtio, Viivi; Aalto University Helsinki Institute for Information Technology, Computer Science Pe M. Monteiro, João; University College London, Department of Computer Science Kandola, Jaz; Imperial College London Shawe-Taylor, John; University College London, Department of Computer er Science Fernandez-Reyes, Delmiro; University College London, Department of Computer Science Rousu, Juho; Aalto University Helsinki Institute for Information Technology, Computer Science Re Computing Classification • Computing methodologies~Dimensionality reduction and manifold Systems: learning vi ew Page 1 of 35 Computing Surveys 1 2 3 4 5 6 7 8 9 10 11 12 13 Editorial Office 14 ACM Computing Surveys 15 16 17 18 19 20 Fo 21 July 7, 2017 22 23 24 Dear Prof. Sartaj Sahni, rP 25 26 we thank the Associate Editor and the referees for taking their time to provide detailed 27 and constructive comments on the submitted paper ”A Tutorial on Canonical Correla- 28 tion Methods”. We have revised the paper according to the critiques. Please find the 29 ee 30 detailed comments in response to the reviewers’ critiques in the attachment. 31 32 Best regards, rR 33 34 35 Viivi Uurtio, Aalto University 36 Joao M. Monteiro, University College London 37 ev 38 Jaz Kandola, Imperial College London 39 John Shawe-Taylor, University College London 40 Delmiro Fernandez-Reyes, University College London 41 iew Juho Rousu, Aalto University 42 43 44 encl: Author response to the Reviews 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 Computing Surveys Page 2 of 35 1 2 3 1 4 5 A Tutorial on Canonical Correlation Methods 6 7 VIIVI UURTIO, Aalto University JOÃO M. MONTEIRO, University College London 8 JAZ KANDOLA, Imperial College London 9 JOHN SHAWE-TAYLOR, University College London 10 DELMIRO FERNANDEZ-REYES, University College London 11 JUHO ROUSU, Aalto University 12 13 14 Canonical correlation analysis is a family of multivariate statistical methods for the analysis of paired sets 15 of variables. Since its proposition, canonical correlation analysis has for instance been extended to extract 16 relations between two sets of variables when the sample size is insufficient in relation to the data dimen- 17 sionality, when the relations have been considered to be non-linear, and when the dimensionality is too large 18 for human interpretation. This tutorial explains the theory of canonical correlation analysis including its Fo 19 regularised, kernel, and sparse variants. Additionally, the deep and Bayesian CCA extensions are briefly reviewed. Together with the numerical examples, this overview provides a coherent compendium on the ap- 20 plicability of the variants of canonical correlation analysis. By bringing together techniques for solving the 21 optimisation problems, evaluating the statistical significance and generalisability of the canonical correla- 22 tion model, and interpreting the relations, we hope that this article can serve as a hands-on tool for applying r 23 canonical correlation methods in data analysis. 24 CCS Concepts: rComputing methodologies → Dimensionality reduction and manifold learning; Pe 25 General Terms: Multivariate Statistical Analysis, Machine Learning, Statistical Learning Theory 26 Additional Key Words and Phrases: Canonical Correlation, Regularisation, Kernel Methods, Sparsity 27 28 ACM Reference Format: Viivi Uurtio, João M. Monteiro, Jaz Kandola, John Shawe-Taylor, Delmiro Fernandez-Reyes, and Juho er 29 Rousu, 2017. A Tutorial on Canonical Correlation Methods. ACM Comput. Surv. 1, 1, Article 1 (February 30 2017), 33 pages. 31 DOI: 0000001.0000001 32 Re 33 1. INTRODUCTION 34 When a process can be described by two sets of variables corresponding to two differ- 35 ent aspects, or views, analysing the relations between these two views may improve 36 the understanding of the underlying system. In this context, a relation is a mapping vi 37 of the observations corresponding to a variable of one view to the observations corre- 38 sponding to a variable of the other view. For example in the field of medicine, one view 39 could comprise variables corresponding to the symptoms of the disease and the other ew 40 to the risk factors that can have an effect on the disease incidence. Identifying the 41 relations between the symptoms and the risk factors can improve the understanding 42 of the disease exposure and give indications for prevention and treatment. Examples 43 44 Author’s addresses: V. Uurtio and J. Rousu, Helsinki Institute for Information Technology HIIT, Department of Computer Science, Aalto University, Konemiehentie 2, 02150 Espoo, Finland; J. M. Monteiro, Department 45 of Computer Science, University College London, and Max Planck Centre for Computational Psychiatry and 46 Ageing Research, University College London, Gower Street, London WC1E 6BT, UK; J. Shawe-Taylor and 47 D. Fernandez-Reyes, Department of Computer Science, University College London, Gower Street, London 48 WC1E 6BT, UK; J. Kandola, Division of Brain Sciences, Imperial College London, DuCane Road, London WC12 0NN. 49 Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted 50 without fee provided that copies are not made or distributed for profit or commercial advantage and that 51 copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). 52 c 2017 Copyright held by the owner/author(s). 0360-0300/2017/02-ART1 $15.00 53 DOI: 0000001.0000001 54 55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017. 56 57 58 59 60 Page 3 of 35 Computing Surveys 1 2 3 1:2 4 5 of these kind of two-view settings, where the analysis of the relations could provide 6 new information about the functioning of the system, occur in several other fields of 7 science. These relations can be determined by means of canonical correlation methods 8 that have been developed specifically for this purpose. 9 Since the proposition of canonical correlation analysis (CCA) by H. Hotelling 10 [Hotelling 1935; 1936], relations between variables have been explored in various 11 fields of science. CCA was first applied to examine the relation of wheat characteristics 12 to flour characteristics in an economics study by F. Waugh in 1942 [Waugh 1942]. Since then, studies in the fields of psychology [Hopkins 1969; Dunham and Kravetz 1975], 13 geography [Monmonier and Finn 1973], medicine [Lindsey et al. 1985], physics [Wong 14 et al. 1980], chemistry [Tu et al. 1989], biology [Sullivan 1982], time-series modeling 15 [Heij and Roorda 1991], and signal processing [Schell and Gardner 1995] constitute 16 examples of the early application fields of CCA. 17 In the beginning of the 21st century, the applicability of CCA has been demonstrated 18 in modern fields of science such as neuroscience, machine learning and bioinformat- Fo 19 ics. Relations have been explored for developing brain-computer interfaces [Cao et al. 20 2015; Nakanishi et al. 2015] and in the field imaging genetics [Fang et al. 2016]. CCA 21 has also been applied for feature selection [Ogura et al. 2013], feature extraction and 22 fusion [Shen et al. 2013], and dimension reduction [Wang et al. 2013]. Examples of r 23 application studies conducted in the fields of bioinformatics and computational biol- 24 ogy include [Rousu et al. 2013; Seoane et al. 2014; Baur and Bozdag 2015; Sarkar and Chakraborty 2015; Cichonska et al. 2016]. The vast range of application domains Pe 25 26 emphasises the utility of CCA in extracting relations between variables. 27 Originally, CCA was developed to extract linear relations in overdetermined set- tings, that is when the number of observations exceeds the number of variables in 28 either view. To extend CCA to underdetermined settings that often occur in modern er 29 data analysis, methods of regularisation have been proposed. When the sample size 30 is small, Bayesian CCA also provides an alternative to perform CCA. The applicabil- 31 ity of CCA to underdetermined settings has been further improved through sparsity- 32 inducing norms that facilitate the interpretation of the final result. Kernel methods Re 33 and neural networks have been introduced for uncovering non-linear relations. At 34 present, canonical correlation methods can be used to extract linear and non-linear 35 relations in both over- and underdetermined settings. 36 In addition to the already described variants of CCA, alternative extensions have vi 37 been proposed, such as the semi-paired and multi-view CCA. In general, CCA algo- 38 rithms assume one-to-one correspondence between the observations in the views, in 39 other words, the data is assumed to be paired. However, in real datasets some of ew 40 the observations may be missing in either view, which means that the observations 41 are semi-paired. Examples of semi-paired CCA algorithms comprise [Blaschko et al. 42 2008], [Kimura et al. 2013], [Chen et al. 2012], and [Zhang et al. 2014]. CCA has also 43 been extended to more than two views by [Horst 1961], [Carroll 1968], [Kettenring 1971], and [Van de Geer 1984]. In multi-view CCA the relations are sought among 44 more than two views. Some of the modern extensions of multi-view CCA comprise its 45 regularised [Tenenhaus and Tenenhaus 2011], kernelised [Tenenhaus et al. 2015], and 46 sparse [Tenenhaus et al. 2014] variants. Application studies of multi-view CCA and its 47 modern variants can be found in neuroscience [Kang et al. 2013], [Chen et al. 2014], 48 feature fusion [Yuan et al. 2011] and dimensionality reduction [Yuan et al. 2014]. How- 49 ever, both the semi-paired and multi-view CCA are beyond the scope of this tutorial. 50 This tutorial begins with an introduction to the original formulation of CCA. The 51 basic framework and statistical assumptions are presented. The techniques for solv- 52 ing the CCA optimisation problem are discussed. After solving the CCA problem, the 53 approaches to interpret and evaluate the result are explained. The variants of CCA 54 55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017. 56 57 58 59 60 Computing Surveys Page 4 of 35 1 2 3 A Tutorial on Canonical Correlation Methods 1:3 4 5 are illustrated using worked examples. Of the extended versions of CCA, the tuto- 6 rial concentrates on the topics of regularised, kernel, and sparse CCA. Additionally, 7 the deep and Bayesian CCA variants are briefly reviewed. This tutorial acquaints the 8 reader with canonical correlation methods, discusses where they are applicable and 9 what kind of information can be extracted. 10 11 2. CANONICAL CORRELATION ANALYSIS 12 2.1. The Basic Principles of CCA 13 CCA is a two-view multivariate statistical method. In multivariate statistical analysis, 14 the data comprises multiple variables measured on a set of observations or individu- 15 als. In the case of CCA, the variables of an observation can be partitioned into two 16 sets that can be seen as the two views of the data. This can be illustrated using the 17 following notations. Let the views a and b be denoted by the matrices Xa and Xb , of 18 sizes n×p and n×q respectively. The row vectors xka ∈ Rp and xkb ∈ Rq for k = 1, 2, . . . , n Fo 19 denote the sets of empirical multivariate observations in Xa and Xb respectively. The 20 observations are assumed to be jointly sampled from a normal multivariate distribu- 21 tion. A reason for this is that the normal multivariate model approximates well the distribution of continuous measurements in several sampled distributions [Anderson 22 2003]. The column vectors ai ∈ Rn for i = 1, 2, . . . , p and bj ∈ Rn for j = 1, 2, . . . , q de- r 23 note the variable vectors of the n observations respectively. The inner product between 24 two vectors is either denoted by ha, bi or aT b. Throughout this tutorial, we assume Pe 25 that the variables are standardised to zero mean and unit variance. In CCA, the aim 26 is to extract the linear relations between the variables of Xa and Xb . 27 CCA is based on linear transformations. We consider the following transformations 28 er 29 Xa wa = za and Xb wb = zb 30 31 where Xa ∈ Rn×p , wa ∈ Rp , za ∈ Rn , Xb ∈ Rn×q , wb ∈ Rq , and zb ∈ Rn . The data 32 matrices Xa and Xb represent linear transformations of the positions wa and wb onto the images za and zb in the space Rn . The positions wa and wb are often referred to as Re 33 canonical weight vectors and the images za and zb are also termed as canonical vari- 34 ates or scores. The constraints of CCA on the mappings are that the position vectors 35 of the images za and zb are unit norm vectors and that the enclosing angle, θ ∈ [0, π2 ] 36 [Golub and Zha 1995; Dauxois and Nkiet 1997], between za and zb is minimised. The vi 37 cosine of the angle, also referred to as the canonical correlation, between the images 38 za and zb is given by the formula cos(za , zb ) = hza , zb i/||za ||||zb || and due to the unit 39 norm constraint cos(za , zb ) = hza , zb i. Hence the basic principle of CCA is to find two ew 40 positions wa ∈ Rp and wb ∈ Rq that after the linear transformations Xa ∈ Rn×p and 41 Xb ∈ Rn×q are mapped onto an n-dimensional unit ball and located in such a way that 42 the cosine of the angle between the position vectors of their images za ∈ Rn and zb ∈ Rn 43 is maximised. 44 The images za and zb of the positions wa and wb that result in the smallest angle, θ1 , 45 determine the first canonical correlation which equals cos θ1 [Björck and Golub 1973]. 46 The smallest angle is given by 47 cos θ1 = max hza , zb i, 48 za ,zb ∈Rn (1) 49 ||za ||2 = 1 ||zb ||2 = 1 50 51 Let the maximum be obtained by z1a and z1b . The pair of images z2a and z2b , that has 52 the second smallest enclosing angle θ2 , is found in the orthogonal complements of z1a 53 and z1b . The procedure is continued until no more pairs are found. Hence the r angles 54 55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017. 56 57 58 59 60 Page 5 of 35 Computing Surveys 1 2 3 1:4 4 5 θr ∈ [0, π2 ] for r = 1, 2, · · · , q when p > q that can be found are recursively defined by 6 7 cos θr = max hzra , zrb i, za ,zb ∈Rn 8 9 ||zra ||2 = 1 ||zrb ||2 = 1 10 hzra , zja i = 0 hzrb , zjb i = 0, 11 ∀j 6= r : j, r = 1, 2, . . . , min(p, q). 12 13 The number of canonical correlations, r, corresponds to the dimensionality of CCA. 14 Qualitatively, the dimensionality of CCA can be also seen as the number of patterns 15 that can be extracted from the data. 16 When the dimensionality of CCA is large, it may not be relevant to solve all the posi- 17 tions wa and wb and images za and zb . In general, the value of the canonical correlation 18 and the statistical significance are considered to convey the importance of the pattern. Fo 19 The first estimation strategy for finding the number of statistically significant canoni- 20 cal correlation coefficients was proposed in [Bartlett 1941]. The techniques have been 21 further developed in [Fujikoshi and Veitch 1979; Tu 1991; Gunderson and Muirhead 22 1997; Yamada and Sugiyama 2006; Lee 2007; Sakurai 2009]. r 23 In summary, the principle behind CCA is to find two positions in the two data spaces 24 respectively that have images on a unit ball such that the angle between them is min- imised and consequently the canonical correlation is maximised. The linear transfor- Pe 25 mations of the positions are given by the data matrices. The number of relevant po- 26 sitions can be determined by analysing the values of the canonical correlations or by 27 applying statistical significance tests. 28 er 29 2.2. Finding the positions and the images in CCA 30 31 The position vectors wa and wb having images za and zb in the new coordinate system 32 of a unit ball that have a maximum cosine of the angle in between can be obtained using techniques of functional analysis. The eigenvalue-based methods comprise solv- Re 33 ing a standard eigenvalue problem, as originally proposed by Hotelling in [Hotelling 34 1936], or a generalised eigenvalue problem [Bach and Jordan 2002; Hardoon et al. 35 2004]. Alternatively, the positions and the images can be found using the singular 36 value decomposition (SVD), as introduced in [Healy 1957]. The techniques can be con- vi 37 sidered as standard ways of solving the CCA problem. 38 39 Solving CCA Through the Standard Eigenvalue Problem. In the technique of ew 40 Hotelling, both the positions wa and wb and the images za and zb are obtained by 41 solving a standard eigenvalue problem. The Lagrange multiplier technique [Hotelling 42 1936; Hooper 1959] is employed to obtain the characteristic equation. Let Xa and Xb 43 denote the data matrices of sizes n × p and n × q respectively. The sample covariance 1 44 matrix Cab between the variable column vectors in Xa and Xb is Cab = n−1 XaT Xb . 45 The empirical variance matrices between the variables in Xa and Xb are given by 1 1 46 Caa = n−1 XaT Xa and Cbb = n−1 XbT Xb respectively. The joint covariance matrix is then 47 48 Caa Cab 49 . (2) Cba Cbb 50 51 The first and greatest canonical correlation that corresponds to the smallest angle is 52 between the first pair of images za = Xa wa and zb = Xa wb . Since the correlation 53 between za and zb does not change with the scaling of za and zb , we can constrain wa 54 55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017. 56 57 58 59 60 Computing Surveys Page 6 of 35 1 2 3 A Tutorial on Canonical Correlation Methods 1:5 4 5 and wb to be such that za and zb have unit variance. This is given by 6 zTa za =waT XaT Xa wa = waT Caa wa = 1, (3) 7 8 zTb zb =wbT XbT Xb wb = wbT Cbb wb = 1. (4) 9 Due to the normality assumption and comparability, the variables of Xa and Xb should 10 be centered such that they have zero means. In this case, the covariance between za 11 and zb is given by 12 13 zTa zb = waT XaT Xb wb = waT Cab wb . (5) 14 Substituting (5), (3) and (4) into the algebraic problem (1), we obtain: 15 cos θ = max n hza , zb i = max waT Cab wb , 16 za ,zb ∈R wa ∈Rp ,wb ∈Rq 17 q q 18 ||za ||2 = waT Caa wa = 1 ||zb ||2 = wbT Cbb wb = 1. Fo 19 In general, the constraints (3) and (4) are expressed in squared form, waT Caa wa = 1 20 and wbT Cbb wb = 1. The problem can be solved using the Lagrange multiplier technique. 21 Let 22 ρ1 ρ2 r 23 L = waT Cab wb − (waT Caa wa − 1) − (wbT Cbb wb − 1) (6) 2 2 24 where ρ1 and ρ2 denote the Lagrange multipliers. Differentiating L with respect to wa Pe 25 26 and wb gives 27 δL 28 = Cab wb − ρ1 Caa wa = 0 (7) δwa er 29 δL 30 = Cba wa − ρ2 Cbb wb = 0 (8) δwb 31 32 Multiplying (7) from the left by waT and (8) from the left by wbT gives Re 33 waT Cab wb − ρ1 waT Caa wa = 0 34 35 wbT Cba wa − ρ2 wbT Cbb wb = 0. 36 vi 37 38 Since waT Caa wa = 1 and wbT Cbb wb = 1, we obtain that 39 ρ1 = ρ2 = ρ. (9) ew 40 41 Substituting (9) into Equation (7) we obtain 42 −1 Caa Cab wb 43 wa = . (10) ρ 44 45 Substituting (10) into (8) we obtain 46 1 −1 47 Cba Caa Cab wb − ρCbb wb = 0 ρ 48 49 which is equivalent to the generalised eigenvalue problem of the form −1 50 Cba Caa Cab wb = ρ2 Cbb wb . 51 If Cbb is invertible, the problem reduces to a standard eigenvalue problem of the form 52 −1 −1 53 Cbb Cba Caa Cab wb = ρ2 wb . 54 55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017. 56 57 58 59 60 Page 7 of 35 Computing Surveys 1 2 3 1:6 4 5 −1 −1 The eigenvalues of the matrix Cbb Cba Caa Cab are found by solving the characteristic 6 equation 7 −1 −1 |Cbb Cba Caa Cab − ρ2 I| = 0. 8 9 The square roots of the eigenvalues correspond to the canonical correlations. The tech- 10 nique of solving the standard eigenvalue problem is shown in Example 2.1. 11 Example 2.1. We generate two data matrices Xa and Xb of sizes n × p and n × 12 q, where n = 60, p = 4 and q = 3, respectively as follows. The variables of Xa are 13 generated from a random univariate normal distribution, a1 , a2 , a3 , a4 ∼ N (0, 1). We 14 generate the following linear relations 15 16 b1 = a3 + ξ 1 17 b2 = a1 + ξ 2 18 b3 = −a4 + ξ 3 Fo 19 where ξ 1 ∼ N (0, 0.2), ξ 2 ∼ N (0, 0.4), and ξ 3 ∼ N (0, 0.3) denote vectors of normal noise. 20 The data is standardised such that every variable has zero mean and unit variance. 21 The joint covariance matrix C in (2) of the generated data is given by 22 r   23 1.00 0.34 −0.11 0.21 −0.10 0.92 −0.21 24  0.34 1.00 −0.08 0.03 −0.10 0.34 0.06   −0.11 −0.08 1.00 −0.30 0.98 −0.03 0.30  Pe   25 Caa Cab C =  0.21 0.03 −0.30 1.00 −0.25 0.12 −0.94  = .   26  −0.10 −0.10 0.98 −0.25 1.00 −0.03 0.25  Cba Cbb 27   0.92 0.34 −0.03 0.12 −0.03 1.00 −0.13   28 −0.21 0.06 0.30 −0.94 0.25 −0.13 1.00 er 29 30 Now we compute the eigenvalues of the characteristic equation 31 −1 −1 32 |Cbb Cba Caa Cab − ρ2 I| = 0. Re 33 The square roots of the eigenvalues of Cbb −1 −1 Cba Caa Cab are ρ1 = 0.99, ρ2 = 0.94, and 34 ρ3 = 0.92. The eigenvectors wb satisfy the equation 35 −1 −1 36 (Cbb Cba Caa Cab − ρ2 I)wb = 0. vi 37 Hence we obtain 38 −0.97 −0.39 0.19 ! ! ! 39 wb1 = −0.04 wb2 = −0.37 wb3 = −0.86 ew 40 −0.22 0.85 −0.46 41 42 and wa vectors satisfy 43  −0.04   −0.41   −0.84  44 −1 C Cab wb 1 −0.00 2 −1 Caa Cab wb2  0.09  3 −1 3 C Cab wb −0.10 45 wa1 = aa = w = = w = aa = . ρ1 −0.99 a ρ2 −0.41 a ρ3 0.14  46 0.18 −0.83 0.52 47 48 The vectors wb1 , wb2 , and wb3 and wa1 , wa2 , and wa3 correspond to the pairs of positions 49 (wa1 , wb1 ), (wa2 , wb2 ) and (wa3 , wb3 ) that have the images (z1a , z1b ), (z2a , z2b ) and (z3a , z3b ). In 50 linear CCA, the canonical correlations equal to the square roots of the eigenvalues, that is hz1a , z1b i = 0.99, hz2a , z2b i = 0.94, and hz3a , z3b i = 0.92. 51 52 Solving CCA Through the Generalised Eigenvalue Problem. The positions wa and 53 wb and their images za and zb can also be solved through a generalised eigenvalue 54 55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017. 56 57 58 59 60 Computing Surveys Page 8 of 35 1 2 3 A Tutorial on Canonical Correlation Methods 1:7 4 5 problem [Bach and Jordan 2002; Hardoon et al. 2004]. The equations in (7) and (8) can 6 be represented as simultaneous equations 7 Cab wb =ρCaa wa 8 9 Cba wa =ρCbb wb 10 that are equivalent to 11 0 Cab wa Caa 0 wa 12 =ρ . (11) Cba 0 wb 0 Cbb wb 13 14 The equation (11) represents a generalised eigenvalue problem of the form βAx = αBx 15 where the pair (β, α) = (1, α) is an eigenvalue of the pair (A, B) [Saad 2011; Golub 16 and Van Loan 2012]. The pair of matrices A ∈ R(p+q)×(p+q) and B ∈ R(p+q)×(p+q) 17 is also referred to as matrix pencil. In particular, A is symmetric and B is symmet- 18 ric positive-definite. The pair (A, B) is then called the symmetric pair. As shown in Fo 19 [Watkins 2004], a symmetric pair has real eigenvalues and (p+q) linearly independent 20 eigenvectors. To express the generalised eigenvalue problem in the form Ax = ρBx, the 21 generalised eigenvalue is given by ρ = α β . Since the generalised eigenvalues come in 22 pairs {ρ1 , −ρ1 , ρ2 , −ρ2 , . . . , ρp , −ρp , 0} where p < q, the positive generalised eigenvalues r 23 correspond to the canonical correlations. 24 Example 2.2. Using the data in Example 2.1, we apply the formulation of the gen- Pe 25 eralised eigenvalue problem to obtain the positions wa and wb . The resulting gener- 26 alised eigenvalues are 27 {0.99, 0.94, 0.92, 0.00, −0.92, −0.94, −0.99}. 28 er 29 The generalised eigenvectors that correspond to the positive generalised eigenvalues 30 in descending order are 31  −0.04   0.48   −0.97  32 −0.00 2 −0.11 3 −0.11 wa1 =  Re 33 w = w = −1.00 a  0.48  a  0.16  34 0.18 0.98 0.60 35 36 −0.98 ! 0.46 ! 0.22 ! vi 37 wb1 = −0.04 wb2 = 0.43 wb3 = −1.00 38 −0.23 −1.00 −0.54 39 The vectors wa1 , wa2 , and wa3 and wb1 , wb2 , and wb3 correspond to the pairs of positions ew 40 (wa1 , wb1 ), (wa2 , wb2 ) and (wa3 , wb3 ). The canonical correlations are hz1a , z1b i = 0.99, hz2a , z2b i = 41 0.94, and hz3a , z3b i = 0.92. 42 The entries of the position pairs differ to some extent from the solutions to the stan- 43 dard eigenvalue problem in the Example 2.1. This is due to the numerical algorithms 44 that are applied to solve the eigenvalues and eigenvectors. Additionally, the signs may 45 also be opposite. This can be seen when comparing the second pairs of positions with 46 the Example 2.1. This results from the symmetric nature of CCA. 47 Solving CCA Using the SVD. The technique of applying the SVD to solve the CCA 48 problem was first introduced by [Healy 1957] and described by [Ewerbring and Luk 49 1989] as follows. First, the covariance matrices Caa and Cbb are transformed into iden- 50 tity forms. Due to the symmetric positive definite property, the square root factors of 51 the matrices can be found using a Cholesky or eigenvalue decomposition: 52 1/2 1/2 1/2 1/2 53 Caa = Caa Caa and Cbb = Cbb Cbb . 54 55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017. 56 57 58 59 60 Page 9 of 35 Computing Surveys 1 2 3 1:8 4 5 Applying the inverses of the square root factors symmetrically on the joint covariance 6 matrix in (2) we obtain 7 −1/2 ! −1/2 ! −1/2 −1/2 ! 8 Caa 0 Caa Cab Caa 0 Iq Caa Cab Cbb −1/2 −1/2 = −1/2 −1/2 . 9 0 Cbb Cba Cbb 0 Cbb Cbb Cba Caa Ip 10 The position vectors wa and wb can hence be obtained by solving the following SVD 11 −1/2 −1/2 12 Caa Cab Cbb = U T SV (12) 13 where the columns of the matrices U and V correspond to the sets of orthonormal left 14 and right singular vectors respectively. The singular values of matrix S correspond to 15 the canonical correlations. The positions wa and wb are obtained from 16 −1/2 −1/2 17 wa = Caa U wb = Cbb V 18 The method is shown in Example 2.3. Fo 19 20 Example 2.3. The method of solving CCA using the SVD is demonstrated using the 21 data of Example 2.1. We compute the matrix 22  −0.02 0.90 −0.06  r 23 −1/2 −1/2 −0.07 0.20 0.11  24 Caa Cab Cbb = . 0.98 0.04 0.04  Pe 25 0.01 −0.02 −0.93 26 The SVD gives 27 −1/2 −1/2 28 Caa Cab Cbb = er 29 ! 0.99 0.00  0.00  −0.03 −0.03 0.95 −0.30 0.95 −0.29 0.15 ! 30 0.00 0.94 0.00 31 −0.47 0.03 −0.28 0.84  0.01 −0.44 −0.90 . 0.00 0.00 0.92 32 −0.86 −0.26 0.11 0.44 0.33 0.85 −0.41 } 0.00 0.00 0.00 | Re | {z {z } 33 UT | {z } V S 34 35 The singular values of the matrix S correspond to the canonical correlations. The po- 36 sitions wa and wb are given by vi 37  0.04   −0.43   −0.91  38 −1/2 1  0.00  2 −1/2 2  0.10  3 −1/2 3 −0.10 39 wa1 = Caa u = w = Caa u = w = Caa u = 0.94  a −0.43 a 0.14  ew 40 −0.17 −0.87 0.56 41 42 0.93 ! −0.40 ! 0.21 ! −1/2 −1/2 2 −1/2 3 43 wb1 = Cbb v1 = 0.04 wb2 = Cbb v = −0.38 wb3 = Cbb v = −0.93 44 0.21 0.89 −0.50 45 46 where ui and vi for i = 1, 2, 3 correspond to the left and right singular vectors. 47 The vectors wa1 , wa2 , and wa3 and wb1 , wb2 , and wb3 correspond to the pairs of posi- tions (wa1 , wb1 ), (wa2 , wb2 ) and (wa3 , wb3 ). The canonical correlations are hz1a , z1b i = 0.99, 48 hz2a , z2b i = 0.94, and hz3a , z3b i = 0.92. 49 50 The main motivation for improving the eigenvalue-based technique was the compu- 51 tational complexity. The standard and generalised eigenvalue methods scale with the 52 cube of the input matrix dimension, in other words, the time complexity is O(n3 ), for −1/2 −1/2 53 a matrix of size n × n. The input matrix Caa Cab Cbb in the SVD-based technique is 54 55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017. 56 57 58 59 60 Computing Surveys Page 10 of 35 1 2 3 A Tutorial on Canonical Correlation Methods 1:9 4 5 rectangular. This gives a time complexity of O(mn2 ), for a matrix of size m × n. Hence 6 the SVD-based technique is computationally more tractable for very large datasets. 7 To recapitulate, the images za and zb of the positions wa and wb that successively 8 maximise the canonical correlation can be obtained by solving a standard [Hotelling 9 1936] or a generalised eigenvalue problem [Bach and Jordan 2002; Hardoon et al. 10 2004] or by applying the SVD [Healy 1957; Ewerbring and Luk 1989]. The CCA prob- 11 lem can also be solved using alternative techniques. The only requirements are that 12 the successive images on the unit ball are orthogonal and that the angle is minimised. 13 14 2.3. Evaluating the Canonical Correlation Model 15 The pair of position vectors that have images on the unit ball with a minimum en- 16 closing angle correspond to the canonical correlation model obtained from the training 17 data. The entries of these position vectors convey the relations between the variables 18 obtained from the sampling distribution. In general, a statistical model is validated Fo 19 in terms of statistical significance and generalisability. To assess the statistical sig- 20 nificance of the relations obtained from the training data, Bartlett’s sequential test 21 procedure [Bartlett 1941] can be applied. Although the technique was presented in 22 1941, it is still applied in timely CCA application studies such as [Marttinen et al. r 23 2013; Kabir et al. 2014; Song et al. 2016]. The generalisability of the canonical cor- 24 relation model determines whether the relations obtained from the training data can be considered to represent general patterns occurring in the sampling distribution. Pe 25 26 The methods of testing the statistical significance and generalisability of the extracted 27 relations represent standard ways to evaluate the canonical correlation model. 28 The entries of the position vectors wa and wb can be used as a means to analyse the linear relations between the variables. The linear relation corresponding to the value er 29 of the canonical correlation is found between the entries that are of the greatest value. 30 The values of the entries of the position vectors wa and wb are visualised in Figure 31 1. The linear relation that corresponds to the canonical correlation of hz1a , z2b i = 0.99 is 32 found between the variables a3 and b1 . Since the signs of both entries are negative, the Re 33 relation is positive. The second pair of positions (wa2 , wb2 ) conveys the negative relation 34 between a4 and b3 . The positive relation between a1 and b2 can be identified from the 35 entries of the third pair of positions (wa3 , wb3 ). 36 In [Meredith 1964], structure correlations were introduced as a means to analyse vi 37 the relations between the variables. Structure correlations are the correlations of the 38 original variables, ai ∈ Rn for i = 1, 2, . . . , p and bj ∈ Rn for j = 1, 2, . . . , q, with the 39 images, za ∈ Rn or zb ∈ Rn . In general, the structure correlations convey how the ew 40 images za and zb are aligned in the space Rn in relation to the variable axes. 41 In [Ter Braak 1990], the structure correlations were visualised on a biplot to facili- 42 tate the interpretation of the relations. To plot the variables on the biplot, the correla- 43 tions of the original variables of both sets with two successive images, for example the images (z1a , z2a ), of one of the sets are computed. The plot is interpreted by the cosine 44 of the angles between the variable vectors which is given by cos(a, b) = ha, bi/||a||||b||. 45 Hence a positive linear relation is shown by an acute angle while an obtuse angle de- 46 picts a negative linear relation. A right angle corresponds to a zero correlation. Three 47 biplots of the data and results of Example 2.1 are shown in Figure 2. In each of the bi- 48 plots, the same relations that were identified in Figure 1 can be found by analysing the 49 angles between the variable vectors. The extraction of the relations can be enhanced 50 by changing the pairs of images with which the correlations are computed. 51 The statistical significance tests of the canonical correlations evaluate whether the 52 obtained pattern can be considered to occur non-randomly. The sequential test pro- 53 cedure of Bartlett [Bartlett 1938] determines the number of statistically significant 54 55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017. 56 57 58 59 60 Page 11 of 35 Computing Surveys 1 2 3 1:10 4 5 wa1 wa2 wa3 6 1 1 1 7 0.5 0.5 0.5 8 0 0 0 9 10 -0.5 -0.5 -0.5 11 -1 -1 -1 12 a3 a4 a1 13 14 15 wb1 wb2 wb3 16 1 1 1 17 18 0.5 0.5 0.5 Fo 19 0 0 0 20 -0.5 -0.5 -0.5 21 22 -1 -1 -1 r 23 b1 b3 b2 24 Pe 25 Fig. 1. The entries of the pairs of positions (wa1 , wb1 ), (wa2 , wb2 ) and (wa3 , wb3 ) are shown. The entry of max- 26 imum absolute value is coloured blue. 27 28 b3 er 29 30 a4 a4 ab31 ba13 z2a z3a z3a 31 ab31 a2 32 b3 a2 a2 b3 ba21 Re 33 ba21 a4 ab12 34 35 z1a z1a z2a 36 vi 37 Fig. 2. The biplots are generated using the results of Example 2.1. The biplot on the left shows the relations 38 between the variables when viewed with respect to the images z1a and z2a . The biplot in the middle shows 39 the relations between the variables when viewed with respect to the images z1a and z3a . The biplot on the ew 40 right shows the relations between the variables when viewed with respect to the images z2a and z3a . 41 42 canonical correlations in the data. The procedure to evaluate the statistical signifi- 43 cance of the canonical correlations is described in [Fujikoshi and Veitch 1979]. We test 44 the hypothesis 45 H0 : min(p, q) = k against H1 : min(p, q) > k (13) 46 where k = 0, 1, . . . , p when p < q. If the hypothesis H0 : min(p, q) = j is rejected for 47 j = 0, 1, . . . , k − 1 but accepted for H1 : min(p, q) > k − 1 the number of statistically 48 significant canonical correlations can be estimated as k. For the test, the Bartlett- 49 Lawley statistic, Lk is applied 50 51 k min(p,q) 1 X Y rj−2 ln (1 − rj2 ) . 52 Lk = − n − k − (p + q + 1) + (14) 53 2 j=1 j=k+1 54 55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017. 56 57 58 59 60 Computing Surveys Page 12 of 35 1 2 3 A Tutorial on Canonical Correlation Methods 1:11 4 5 where rj denotes the j th canonical correlation. The asymptotic null distribution of Lk 6 is the chi-squared with (p − k)(q − k) degrees of freedom. Hence we first test that no 7 canonical relation exists between the two views. If we reject the hypothesis H0 we 8 continue to test that one canonical relation exists. If all the canonical patterns are 9 statistically significant even the hypothesis H0 : min(p, q) = k − 1 is rejected. 10 Example 2.4. We demonstrate the sequential test procedure of Bartlett using the 11 simulated setting of Examples 2.1, 2.2 and 2.3. In the setting, n = 60, p = 4 and p = 3. 12 Hence min(p, q) = 3. First, we test that there are no canonical correlations 13 H0 : min(p, q) = 0 against H1 : min(p, q) > 0 (15) 14 2 15 The Bartlett-Lawley statistic is L0 = 296.82. Since L0 ∼ χ (12) the critical value at 16 the significance level α = 0.01 is P (χ2 ≥ 26.2) = 0.01. Since L0 = 296.82 > 26.2 the 17 hypothesis H0 is rejected. Next we test that there is one canonical correlation. 18 H0 : min(p, q) = 1 against H1 : min(p, q) > 1 (16) Fo 19 2 20 The Bartlett-Lawley statistic is L1 = 154.56 and L1 ∼ χ (6). The critical value at the significance level α = 0.01 is P (χ2 ≥ 16.8) = 0.01. Since L1 = 154.56 > 16.8 the 21 hypothesis H0 is rejected. We continue to test that there are two canonical correlations 22 r 23 H0 : min(p, q) = 2 against H1 : min(p, q) > 2 (17) 24 2 The Bartlett-Lawley statistic is L2 = 70.95 and L2 ∼ χ (2). The critical value at the Pe 25 significance level α = 0.01 is P (χ2 ≥ 9.21) = 0.01. Since L1 = 70.95 > 9.21 the hypoth- 26 esis H0 is rejected. Hence the hypothesis H1 : min(p, q) > 2 is accepted and all three 27 canonical patterns are statistically significant. 28 To determine whether the extracted relations can be considered generalisable, or in er 29 other words general patterns in the sampling distribution, the linear transformations 30 of the position vectors wa and wb need to be performed using test data. Unlike training 31 data, test data originates from the sampling distribution but were not used in the 32 model computation. Let the matrices Xatest ∈ Rm×p and Xbtest ∈ Rm×q denote the test Re 33 data of m observations. The linear transformations of the position vectors wa and wb 34 are then 35 Xatest wa = ztest a and Xbtest wb = ztest b 36 vi 37 where the images ztest a and ztest b are in the space Rm . The cosine of the angle between 38 the test images cos(za , zb ) = hztest test test a , zb test i implies the generalisability. If the canoni- 39 cal correlations computed from test data also result in high correlation values we can ew 40 deduce that the relations can generally be found from the particular sampling distri- 41 bution. 42 Example 2.5. We evaluate the generalisability of the canonical correlation model 43 obtained in Example 2.1. The test data matrices Xatest and Xbtest of sizes m × p and 44 m × q where m = 40, p = 4, and q = 3 are from the same distributions as described in 45 Example 2.1. The 40 observations were not included in the computation of the model. 46 The test canonical correlations corresponding to the positions (wa1 , wb1 ), (wa2 , wb2 ) and 47 (wa3 , wb3 ) are hz1a , z1b i = 0.98, hz2a , z2b i = 0.98, hz3a , z3b i = 0.98. The high values indicate 48 that the extracted relations can be considered generalisable. 49 The canonical correlation model can be evaluated by assessing the statistical signif- 50 icance and testing the generalisability of the relations. The statistical significance of 51 the model can be determined by testing whether the extracted canonical correlations 52 are not non-zero by chance. The generalisability of the relations can be assessed us- 53 ing new observations from the sampling distribution. These evaluation methods can 54 55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017. 56 57 58 59 60 Page 13 of 35 Computing Surveys 1 2 3 1:12 4 5 generally be applied to test the validity of the extracted relations obtained using any 6 variant of CCA. 7 3. EXTENSIONS OF CANONICAL CORRELATION ANALYSIS 8 9 3.1. Regularisation Techniques in Underdetermined Systems 10 CCA finds linear relations in the data when the number of observations exceeds the 11 number of variables in either view. This possibly guarantees the non-singularity of 12 the variance matrices Caa and Cbb when solving the CCA problem. In the case of the 13 standard eigenvalue problem, the matrices Caa and Cbb should be non-singular so that 14 they can be inverted. In the case of the SVD method, singular Caa and Cbb may not 15 have the square root factors. If the number of observations is less than the number of 16 variables it is likely that some of the variables are collinear. Hence a sufficient sam- ple size reduces the collinearity of the variables and guarantees the non-singularity of 17 the variance matrices. The first proposition to solve the problem of insufficient sample 18 size was presented in [Vinod 1976]. A more recent technique to regularise CCA has Fo 19 been proposed in [Cruz-Cano and Lee 2014]. In the following, we present the origi- 20 nal method of regularisation [Vinod 1976] due to its popularity in CCA applications 21 [González et al. 2009], [Yamamoto et al. 2008], and [Soneson et al. 2010]. 22 In the work of [Vinod 1976], the singularity problem was proposed to be solved by r 23 regularisation. In general, the idea is to improve the invertibility of the variance ma- 24 trices Caa and Cbb by adding arbitrary constants c1 > 0 and c2 > 0 to the diagonal Pe 25 Caa + c1 I and Cbb + c2 I. The constraints of CCA become 26 waT Caa + c1 I wa =1 27 wbT Cbb + c2 I wb =1 28 er 29 and hence the magnitudes of the position vectors wa and wb are smaller when regu- 30 larisation, c1 > 0 and c2 > 0, is applied. The regularised CCA optimisation problem is 31 given by 32 cos θ = max waT Cab wb , Re 33 wa ∈Rp ,wb ∈Rq 34 waT Caa + c1 I wa = 1 wbT Cbb + c2 I wb = 1. 35 36 The positions wa and wb can be found by solving the standard eigenvalue problem vi 37 −1 −1 Cbb + c2 I Cba Caa + c1 I Cab wb = ρ2 wb . 38 39 or the generalised eigenvalue problem ew 40 0 Cab wa Caa + c1 I 0 wa 41 =ρ . Cba 0 wb 0 Cbb + c2 I wb 42 43 As in the case of linear CCA, the canonical correlations correspond to the inner prod- ucts between the consecutive image pairs hzia , zib i where i = 1, 2, . . . , min(p, q). 44 The regularisation proposed by [Vinod 1976] makes the CCA problem solvable but 45 introduces new parameters c1 > 0 and c2 > 0 that have to be chosen. The first proposi- 46 tion of applying a leave-one-out cross-validation procedure to automatically select the 47 regularisation parameters was presented in [Leurgans et al. 1993]. Cross-validation 48 is a well-established nonparametric model selection procedure to evaluate the valid- 49 ity of statistical predictions. One of its earliest applications have been presented in 50 [Larson 1931]. A cross-validation procedure entails the partitioning of the observa- 51 tions into subsamples, selecting and estimating a statistic which is first measured 52 on one subsample, and then validated on the other hold-out subsample. The method 53 of cross-validation is discussed in detail for example in [Stone 1974], [Efron 1979], 54 55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017. 56 57 58 59 60 Computing Surveys Page 14 of 35 1 2 3 A Tutorial on Canonical Correlation Methods 1:13 4 5 [Browne 2000], and more recently in [Arlot et al. 2010]. The cross-validation approach 6 specifically developed for CCA has been further extended in [Waaijenborg et al. 2008; 7 Yamamoto et al. 2008; González et al. 2009; Soneson et al. 2010]. 8 In cross-validation, the size of the hold-out subsample varies depending on the size 9 of the dataset. A leave-one-out cross-validation procedure is an option when the sample 10 size is small and partitioning of the data into several folds, as is done in k-fold cross- 11 validation, is not feasible. 5-fold cross-validation saves computation time in relation 12 to leave-one-out cross-validation if the sample size is large enough to partition the observations into five folds where each fold is used as a test set in turn. 13 In general, as demonstrated for example in [Krstajic et al. 2014], a k-fold cross- 14 validation procedure should be repeated when an optimal set of parameters are 15 searched for. Repetitions decrease the variance of the average values measured across 16 the test folds. Algorithm 1 outlines an approach to determine the optimal regularisa- 17 tion parameters in CCA. 18 Fo 19 20 ALGORITHM 1: Repeated k-fold cross-validation for regularised CCA 21 22 Input: Data matrices Xa and Xb , number of repetitions R, number of folds F Output: Regularisation parameter values c1 and c2 maximising the correlation on test data r 23 Pre-defined ranges for values of c1 ; c2 ; 24 Initialise r = 1; Pe 25 repeat 26 Randomly partition the observations into F folds ; 27 for all values of c1 do 28 for all values of c2 do for i = 1, 2, . . . , F do er 29 Training set: F − i folds, test set: i fold; 30 Standardize the variables in the training and test sets; 31 For the training data, solve |Cbb −1 Cba Caa + c1 I −1 Cab − ρ2 I| = 0; 32 Find wb corresponding to the greatest eigenvalue satisfying Re −1 33 −1 (Cbb Cba Caa + c1 I Cab − ρ2 I)wb = 0 ; 34 Caa +c1 I −1 1 Cab wb 1 35 Find wa using wa = ρ1 ; 36 Transform the training positions wa and wb using the test observations Xa,test wa = za and Xb,test wb = zb ; vi 37 38 Compute cos(za , zb ) = ||zhzaa||||z ,zb i b || ; 39 end ew 40 Store the mean of the F values for cos(za , zb ) obtained at c1 and c2 ; end 41 end 42 r =r+1; 43 until r = R; 44 Compute the mean of the R values for cos(za , zb ) obtained at c1 and c2 ; 45 Return the combination c1 and c2 that maximises cos(za , zb ) 46 47 48 49 Example 3.1. To demonstrate the procedure of regularisation in underdetermined 50 settings, we use the same simulated data as in the previous examples but we include 51 additional normally distributed variables. The data matrices Xa and Xb of sizes n × p 52 and n × q, where n = 60, p = 70 and q = 10, respectively as follows. The variables of Xa 53 are generated from a random univariate normal distribution, a1 , a2 , . . . , a70 ∼ N (0, 1). 54 55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017. 56 57 58 59 60 Page 15 of 35 Computing Surveys 1 2 3 1:14 4 5 6 7 0.0776 8 9 10 11 12 13 14 0.09 1 15 16 Fig. 3. The maximum test canonical correlation, computed over 50 times repeated 5-fold cross-validation, 17 is obtained at c1 = 0.09. 18 We generate the following linear relations Fo 19 20 b1 = a3 + ξ 1 21 b2 = a1 + ξ 2 22 b3 = −a4 + ξ 3 r 23 24 where ξ 1 ∼ N (0, 0.01), ξ 2 ∼ N (0, 0.03), ξ 3 ∼ N (0, 0.02) denote vectors of normal noise. The remaining variables of Xb are generated from random univariate normal distri- Pe 25 bution, a4 , a5 , . . . , a10 ∼ N (0, 1). The data is standardised such that every variable has 26 zero mean and unit variance. 27 To construct the matrix Cbb −1 −1 Cba Caa Cab , the variance matrices Caa and Cbb need to 28 be non-singular. Since Caa is obtained from a rectangular matrix, collinearity makes er 29 it close to singular. We therefore add a positive constant to the diagonal Caa + c1 I 30 to make it invertible. Cbb is invertible since the data matrix Xb has more rows than 31 columns. The optimal value for the regularisation parameter c1 can be determined for 32 instance through repeated k-fold cross-validation. As shown in Figure 3, the optimal Re 33 value c1 = 0.09 was obtained through 50 times repeated 5-fold cross-validation using 34 the procedure presented in the Algorithm 1. 35 The positions wa and wb and their respective images za and zb on a unit ball are 36 found by solving the eigenvalues of the characteristic equation vi 37 −1 |Cbb Cba Caa + c1 I −1 Cab − ρ2 I| = 0. (18) 38 39 The number of relations equals min(p, q) = 10. The square roots of the first three ew 40 eigenvalues are ρ1 = 0.98, ρ2 = 0.97 and ρ3 = 0.96. The respective three eigenvectors 41 that correspond to the positions wb satisfy the equation 42 −1 −1 (Cbb Cba Caa + c1 I Cab − ρ2 I)wb = 0. (19) 43 44 The positions wa are found using the formula −1 45 Caa + c1 I Cab wbi 46 wai = (20) ρi 47 48 where i = 1, 2, 3 corresponds to the sorted eigenvalues and eigenvectors. By rounding 49 correct to three decimal places, the first three canonical correlations are hz1a , z1b i = 50 0.999, hz2a , z2b i = 0.998, hz3a , z3b i = 0.996. The extracted linear relations are visualised in 51 Figure 4. 52 When either or both of the data views consists of more variables than observations, 53 regularisation can be applied to make the variance matrices non-singular. This in- 54 55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017. 56 57 58 59 60 Computing Surveys Page 16 of 35 1 2 3 A Tutorial on Canonical Correlation Methods 1:15 4 5 wa1 wa2 wa3 6 1 1 1 7 0.5 0.5 0.5 8 0 0 0 9 10 -0.5 -0.5 -0.5 11 -1 -1 -1 12 a3 a1 a4 13 14 15 wb1 wb2 wb3 16 1 1 1 17 18 0.5 0.5 0.5 Fo 19 0 0 0 20 -0.5 -0.5 -0.5 21 22 -1 -1 -1 r 23 b1 b2 b3 24 Pe 25 Fig. 4. The entries of the pairs of positions (wa1 , wb1 ), (wa2 , wb2 ) and (wa3 , wb3 ) are shown. The entry of max- 26 imum absolute value is coloured blue. The positive linear relation between a3 and b1 , the positive linear 27 relation between a1 and b2 and the negative linear relation between a4 and b3 are extracted by the pairs (wa1 , wb1 ), (wa2 , wb2 ), and (wa3 , wb3 ) respectively. 28 er 29 30 volves finding optimal non-negative scalar parameters that, when added to the diag- 31 onal entries, improve the invertibility of the variance matrices. After improving the 32 invertibility of the variance matrices, the regularised CCA problem can be solved us- Re 33 ing the standard techniques. 34 35 3.2. Bayesian Approaches for Robustness 36 Probabilistic approaches have been proposed to improve the robustness of CCA when vi 37 the sample size is small and to be able to make more flexible distributional assump- 38 tions. A robust method generates a valid model regardless of outlying observations. In 39 the following, a brief introduction to Bayesian CCA is provided. A detailed review on ew 40 Bayesian CCA and its recent extensions can be found in [Klami et al. 2013]. 41 An extension of CCA to probabilistic models was first proposed in [Bach and Jor- 42 dan 2005]. The probabilistic model contains the latent variables yk ∈ Ro , where 43 o = min(p, q), that generate the observations xka ∈ Rp and xkb ∈ Rq for k = 1, 2, . . . , n. 44 The latent variable model is defined by 45 46 y ∼ N (0, Id ), o≥d≥1 47 xa |y ∼ N (Sa y + µa , Ψa ), Sa ∈ Rp×d , Ψa 0 48 49 xb |y ∼ N (Sb y + µb , Ψb ), Sb ∈ Rq×d , Ψb 0 50 51 where N (µ, Σ) denotes the normal multivariate distribution with mean µ and co- 52 variance Σ. The Sa and Sb correspond to the transformations of the latent variables 53 yk ∈ Ro . The Ψa and Ψb denote the noise covariance matrices. The maximum likeli- 54 55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017. 56 57 58 59 60 Page 17 of 35 Computing Surveys 1 2 3 1:16 4 5 hood estimates of the parameters Sa , Sb , Ψa , Ψb , µa and µb are given by 6 7 Sˆa = Caa Wad Ma Sˆb = Cbb Wbd Mb 8 T T Ψˆa = Caa − Sˆa Sˆa Ψ̂b = Cbb − Sˆb Sˆb 9 n n 10 1X k 1X k µ̂a = xa µ̂b = xb 11 n n k=1 k=1 12 13 where Ma , Mb ∈ Rd×d are arbitrary matrices such that Ma MbT = Pd and the spectral 14 norms of Ma and Mb are smaller than one. Pd is the diagonal matrix of the first d 15 canonical correlations. The d columns of Wad and Wbd correspond to the positions wai 16 and wbi for i = 1, 2, . . . , d obtained using any of the standard techniques described in 17 section 2.1. 18 The posterior expectations of y given xa and xb are E(y|xa ) = MaT Wad T (xa − µˆa ) and Fo T T 19 E(y|xb ) = Mb Wbd (xb − µˆb ). As stated in [Bach and Jordan 2005], regardless of what 20 Ma and Mb are, E(y|xa ) and E(y|xb ) lie in the d-dimensional subspaces of Rp and Rq 21 which are identical to those obtained by linear CCA. The generative model of [Bach 22 and Jordan 2005] was further developed in [Archambeau et al. 2006] by replacing the normal noise with the multivariate Student’s t distribution. This improves the r 23 24 robustness against outlying observations that are then better modeled by the noise term [Klami et al. 2013]. Pe 25 A Bayesian extension of CCA was proposed by [Klami and Kaski 2007] and [Wang 26 2007]. To perform Bayesian analysis, the probabilistic model has to be supplemented 27 with prior distributions of the model parameters. In [Klami and Kaski 2007] and 28 [Wang 2007], the prior distribution of the covariance matrices Ψa and Ψb was chosen er 29 to be the inverse-Wishart distribution. The automatic relevance determination [Neal 30 2012] prior was selected for the linear transformations Sa and Sb . The inference on the 31 posterior distribution was made by applying a variational mean-field algorithm [Wang 32 2007] and Gibbs sampling [Klami and Kaski 2007]. Re 33 As in the case of the linear CCA, the covariance matrices obtained from high- 34 dimensional data make the inference of the probabilistic and Bayesian CCA models 35 difficult [Klami et al. 2013]. This is because the covariance matrices need to be in- 36 verted in the inference algorithms. To perform Bayesian CCA on high-dimensional data, dimensionality reduction techniques should be applied as a preprocessing step, vi 37 38 as has been done for example in [Huopaniemi et al. 2010]. 39 An advantage of Bayesian CCA, in relation to linear CCA, is the application of the ew prior distributions that enable to take the possible underlying structure in the data 40 into account. Examples of studies where sparse models were obtained by means of the 41 prior distribution include [Archambeau and Bach 2009] and [Rai and Daume 2009]. In 42 addition to modeling the structure of the data, in [Klami et al. 2012] the Bayesian CCA 43 was extended such that any exponential family distribution could model the noise, not 44 only the normal. 45 In summary, probabilistic and Bayesian CCA provide alternative ways to interpret 46 the CCA by means of latent variables. Bayesian CCA may be more feasible in settings 47 where additional knowledge regarding the data can be incorporated through the prior 48 distributions or some other exponential family distribution functions better as a noise 49 model than the normal distribution. 50 51 3.3. Uncovering Linear and Non-Linear Relations 52 CCA [Hotelling 1936] finds linear relations between variables belonging to two views 53 that both are overdetermined. The first proposition to extend CCA to uncover non- 54 55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017. 56 57 58 59 60 Computing Surveys Page 18 of 35 1 2 3 A Tutorial on Canonical Correlation Methods 1:17 4 5 linear relations using an optimal scaling method was presented in [Burg and Leeuw 6 1983]. At the turn of the 21st century, artificial neural networks were incorporated in 7 the CCA framework for finding non-linear relations [Lai and Fyfe 1999; Fyfe and Lai 8 2000; Hsieh 2000]. Deep CCA [Andrew et al. 2013] is an example of a recent non-linear 9 CCA variant employing artificial neural networks. Shortly after the introduction of the 10 neural networks, propositions of applying kernel methods in CCA were presented in 11 [Lai and Fyfe 2000; Akaho 2001; Van Gestel et al. 2001; Melzer et al. 2001; Bach and 12 Jordan 2002]. Since then, the kernelised version of CCA has received considerable attention in terms of theoretical foundations [Hardoon et al. 2004; Fukumizu et al. 13 2007; Alam et al. 2008; Blaschko et al. 2008; Hardoon and Shawe-Taylor 2009; Cai 14 2013] and applications [Melzer et al. 2003; Wang et al. 2005; Hardoon et al. 2007; 15 Larson et al. 2014]. In the following, we present how kernel CCA can be applied to 16 uncover nonlinear relations between the variables. We then provide a brief overview 17 on deep CCA. 18 To extract linear relations, CCA is performed in the data spaces of Xa ∈ Rn×p and Fo 19 Xb ∈ Rn×q where the n rows correspond to the observations and the p and q columns 20 correspond to the variables. The relations between the variables are found analysing 21 the positions wa ∈ Rp and wb ∈ Rq that have such images za = Xa wa and zb = Xb wb 22 on a unit ball in Rn that have a minimum enclosing angle. The extracted relations are r 23 linear since the positions wa and wb and their images za and zb were obtained in the 24 Euclidean space. To extract non-linear relations, the positions wa and wb should be found in a space Pe 25 26 where the distances, or measures of similarity, between objects are non-linear. This can 27 be achieved using kernel methods, that is by transforming the original observations xia ∈ Rp and xib ∈ Rq , where i = 1, 2, . . . , n, to Hilbert spaces Ha and Hb through feature 28 maps φa : Rp 7→ Ha and φb : Rq 7→ Hb . The similarity of the objects is captured by er 29 a symmetric positive semi-definite kernel, corresponding to the inner products in the 30 Hilbert spaces Ka (xia , xja ) = hφa (xia ), φa (xja )iHa and Kb (xib , xjb ) = hφb (xib ), φb (xjb )iHb . 31 The feature maps are typically non-linear and result in a high-dimensional intrinsic 32 spaces φa (xia ) ∈ Ha and φb (xib ) ∈ Hb for i = 1, 2, . . . , n. Through kernels, CCA can be Re 33 used to extract non-linear correlations, relying on the fact that the CCA solution can 34 always be found within the span of the data [Bach and Jordan 2002; Schölkopf et al. 35 1998]. 36 The basic principles behind kernel CCA are similar to CCA. First, the observations vi 37 are transformed to Hilbert spaces Ha and Hb using symmetric positive semi-definite 38 kernels 39 Ka (xia , xja ) = hφa (xia ), φa (xja )iHa and Kb (xib , xjb ) = hφb (xib ), φb (xjb )iHb ew 40 41 where i, j = 1, 2, . . . , n. As derived in [Bach and Jordan 2002], the original data matri- 42 ces Xa ∈ Rn×p and Xb ∈ Rn×q can be substituted by the Gram matrices Ka ∈ Rn×n 43 and Kb ∈ Rn×n . Let α and β denote the positions in the kernel space Rn that have the 44 images za = Ka α and zb = Kb β on the unit ball in Rn with a minimum enclosing angle 45 in between. The kernel CCA problem is hence 46 cos(za , zb ) = max n hza , zb i = αT KaT Kb β, (21) za ,zb ∈R 47 q q 48 ||za ||2 = αT Ka2 α = 1 ||zb ||2 = β T Kb2 β = 1 (22) 49 50 As in CCA, the optimisation problem can be solved using the Lagrange multiplier 51 technique. 52 ρ1 ρ2 53 L = αT KaT Kb β − (αT Ka2 α − 1) − (β T Kb2 β − 1) (23) 2 2 54 55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017. 56 57 58 59 60 Page 19 of 35 Computing Surveys 1 2 3 1:18 4 5 where ρ1 and ρ2 denote the Lagrange multipliers. Differentiating L with respect to α 6 and β gives 7 δL 8 = Ka Kb β − ρ1 Ka2 α = 0 (24) δα 9 δL 10 = Kb Ka α − ρ2 Kb2 β = 0 (25) δβ 11 12 Multiplying (7) from the left by αT and (8) from the left by β T gives 13 αT Ka Kb β − ρ1 αT Ka2 α = 0 (26) 14 T T 15 β Kb Ka α − ρ2 β Kb2 β = 0. (27) 16 T Since αT Ka2 α = 1 and β Kb2 β = 1, we obtain that 17 18 ρ1 = ρ2 = ρ. (28) Fo 19 Substituting (28) into Equation (24) we obtain 20 Ka−1 Ka−1 Ka Kb β K −1 Kb β 21 α= = a . (29) ρ ρ 22 r 23 Substituting (29) into (25) we obtain 24 1 Kb Ka Ka−1 Kb β − ρKb2 β = 0 (30) Pe 25 ρ 26 which is equivalent to the generalised eigenvalue problem of the form 27 28 Kb2 β = ρ2 Kb2 β. (31) er 29 If Kb2 is invertible, the problem reduces to a standard eigenvalue problem of the form 30 Iβ = ρ2 β. (32) 31 32 Clearly, in the kernel space, if the Gram matrices are invertible the resulting canonical Re 33 correlations are all equal to one. Regularisation is therefore needed to solve the kernel 34 CCA problem. 35 Kernel CCA can be regularised in a similar manner as presented in Section 3.1 [Bach 36 and Jordan 2002; Hardoon et al. 2004]. We constrain the norms of the position vectors α and β by adding constants c1 and c2 to the diagonals of the Gram matrices Ka and vi 37 Kb 38 2 39 αT Ka + c1 I α =1 (33) ew 40 T 2 41 β Kb + c2 I β =1. (34) 42 The solution can then be found by solving the standard eigenvalue problem 43 −2 −2 Kb + c1 I Kb Ka Ka + c2 I Ka Kb α = ρ2 α. 44 45 As in the case of CCA, kernel CCA can also be solved through the generalised eigen- 46 value problem [Bach and Jordan 2002]. Since the data matrices Xa and Xb can be 47 substituted by the corresponding Gram matrices Ka and Kb , the formulation becomes 48 2 ! 0 Ka Kb α Ka + c1 I 0 α 49 =ρ 2 (35) Kb Ka 0 β 0 Kb + c2 I β 50 51 where the constants c1 and c2 denote the regularisation parameters. In Example 3.2, 52 kernel CCA, solved through the generalised eigenvalue problem, is performed on sim- 53 ulated data. 54 55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017. 56 57 58 59 60 Computing Surveys Page 20 of 35 1 2 3 A Tutorial on Canonical Correlation Methods 1:19 4 5 Table I. Extracted relations by ker- nel CCA 6 7 z1a z2a z3a exp(a3 ) 0.00 0.81 0.09 8 a31 0.05 0.14 0.74 9 −a4 0.99 0.07 0.04 10 z1b z2b z3b 11 b1 0.02 0.93 0.12 b2 0.08 0.15 0.87 12 b3 0.98 0.01 0.03 13 14 15 Example 3.2. We generate a simulated dataset as follows. The data matrices Xa 16 and Xb of sizes n × p and n × q, where n = 150, p = 7 and q = 8, respectively as 17 follows. The seven variables of Xa are generated from a random univariate normal 18 distribution, a1 , a2 , . . . , a7 ∼ N (0, 1). We generate the following relations Fo 19 b1 = exp(a3 ) + ξ 1 20 21 b2 = a31 + ξ 2 22 b3 = −a4 + ξ 3 r 23 where ξ 1 ∼ N (0, 0.4), ξ 2 ∼ N (0, 0.2) and ξ 3 ∼ N (0, 0.3) denote vectors of normal noise. 24 The five other variables of Xb are generated from a random univariate normal distri- Pe 25 bution, b4 , b5 , . . . , b8 ∼ N (0, 1). The data is standardised such that every variable has 26 zero mean and unit variance. 27 In kernel CCA, the choice of the kernel function affects what kind of relations can be 28 extracted. In general, a Gaussian RBF kernel K(x, y) = exp(− ||x−y|| 2 2σ 2 ) is used when the er 29 data is assumed to contain non-linear relations. The width parameter σ determines the 30 non-linearity in the distances between the data points computed in the form of inner 31 products. Increasing the value of σ makes the space closer to Euclidean while decreas- 32 ing makes the distances more non-linear. The optimal value for σ is best determined Re 33 using a re-sampling method such as a cross-validation scheme, for example procedure 34 similar to the one presented in Algorithm 1. In this example, we applied the ”median 35 trick”, presented in [Song et al. 2010], according to which the σ corresponds to the me- 36 dian of Euclidean distances computed between all pairs of observations. The median vi 37 distances for the data in this example were σa = 3.53 and σb = 3.62 for the views Xa 38 and Xb respectively. The kernels were centered by K̃ = K− 1l jjT K− 1l KjjT + l12 (jT Kj)jjT 39 where j contains only entries of value one [Shawe-Taylor and Cristianini 2004]. ew 40 In addition to the kernel parameters, also the regularisation parameters c1 and c2 41 need to be optimised to extract the correct relations. As in the case of regularised 42 CCA, a repeated cross-validation procedure can be applied to identify the optimal pair 43 of parameters. For the data in this example, the optimal regularisation parameters were c1 = 1.50 and c2 = 0.60 when a 20 times repeated 5-fold cross-validation was 44 applied. The first three canonical correlations at the optimal parameter values were 45 hz1a , z1b i = 0.95, hz2a , z2b i = 0.89, and hz3a , z3b i = 0.87. 46 The interpretation of the relations cannot be performed from the positions α and β 47 since they are obtained in the kernel spaces. In the case of simulated data, we know 48 what kind of relations are contained in the data. We can compute the linear correla- 49 tion coefficient between the simulated relations and the transformed pairs of positions 50 za and zb [Chang et al. 2013]. The correlation coefficients are shown in Table I. The 51 exponential relation was extracted in the second pair (z2a , z2b ), the 3rd order polynomial 52 relation was extracted in the third pair (z3a , z3b ) and the linear relation in the first pair 53 (z1a , z1b ). 54 55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017. 56 57 58 59 60 Page 21 of 35 Computing Surveys 1 2 3 1:20 4 5 In [Hardoon et al. 2004], an alternative formulation of the standard eigenvalue prob- 6 lem was presented when the data contains a large number of observations. If the sam- 7 ple size is large, the dimensionality of the Gram matrices Ka and Kb can cause compu- 8 tational problems. Partial Gram-Schmidt orthogonalization (PGSO) [Cristianini et al. 9 2002] was proposed as a matrix decomposition method. PGSO results in 10 Ka 'Ra RaT 11 Kb 'Rb RbT . 12 13 Substituting these into the Equations (24) and (25) and multiplying by RaT and RbT 14 respectively we obtain 15 RaT Ra RaT Rb RbT β − ρRaT RaT Ra RaT Ra α = 0 (36) 16 17 RbT Rb RbT Ra RaT α − µRbT RbT Rb RbT Rb β = 0. (37) 18 Let Daa = RaT Ra , Dab = RaT Rb , Dba = RbT Ra , and Dbb = RbT Rb denote the blocks of the Fo 19 new sample covariance matrix. Let α̃ = RaT α and β̃ = RbT β denote the positions α and 20 β in the reduced space. Using these substitutions in (36) and (37) we obtain 21 2 22 Daa Dab β̃ − ρDaa α̃ = 0 (38) r 23 2 Dbb Dba α̃ − ρDbb β̃ = 0. (39) 24 −1 −1 If Daa and Dbb are invertible we can multiply (38) by Daa and (39) by Dbb which gives Pe 25 26 Dab β̃ − ρDaa α̃ = 0 (40) 27 28 Dba α̃ − ρDbb β̃ = 0. (41) er 29 and hence 30 −1 Dbb Dba α̃ 31 β̃ = (42) ρ 32 which, after a substitution into (38), results in a generalised eigenvalue problem Re 33 −1 34 Dab Dbb Dba α̃ = ρ2 Daa α̃. (43) 35 To formulate the problem as a standard eigenvalue problem, let Daa = SS denote T 36 the complete Cholesky decomposition where S is a lower triangular matrix and let vi 37 α̂ = S T α̃. Substituting these into (43) we obtain 38 −1 S −1 Dab Dbb Dba S 0−1 α̂ = ρ2 α̂. 39 ew 40 If regularisation using the parameter κ is combined with dimensionality reduction the 41 problem becomes 42 −1 S −1 Dab Dbb + κI Dba S 0−1 α̂ = ρ2 α̂. (44) 43 44 A numerical example of the method presented by [Hardoon et al. 2004] is given in Example 3.3. 45 46 Example 3.3. We generate a simulated dataset as follows. The data matrices Xa 47 and Xb of sizes n × p and n × q, where n = 10000, p = 7 and q = 8, respectively as 48 follows. The seven variables of Xa are generated from a random univariate normal 49 distribution, a1 , a2 , . . . , a7 ∼ N (0, 1). We generate the following relations 50 b1 = exp(a3 ) + ξ 1 51 b2 = a31 + ξ 2 52 53 b3 = −a4 + ξ 3 54 55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017. 56 57 58 59 60 Computing Surveys Page 22 of 35 1 2 3 A Tutorial on Canonical Correlation Methods 1:21 4 5 Table II. Extracted relations by ker- nel CCA 6 7 z1a z2a z3a exp(a3 ) 0.91 0.01 0.05 8 a31 0.01 0.92 0.04 9 −a4 0.06 0.03 0.99 10 z1b z2b z3b 11 b1 0.91 0.01 0.04 b2 0.01 0.94 0.05 12 b3 0.07 0.04 0.99 13 14 where ξ 1 ∼ N (0, 0.4), ξ 2 ∼ N (0, 0.2) and ξ 3 ∼ N (0, 0.3) denote vectors of normal noise. 15 The five other variables of Xb are generated from a random univariate normal distri- 16 bution, b4 , b5 , . . . , b8 ∼ N (0, 1). The data is standardised such that every variable has 17 zero mean and unit variance. 18 A Gaussian RBF kernel is used for both views. The width parameter is set using Fo 19 the median trick to σa = 3.56 and σb = 3.60. The kernels were centered. The positions 20 α and β are found solving the standard eigenvalue problem in (44) and applying the 21 Equation (42). We set the regularisation parameter κ = 0.5. 22 The first three canonical correlations at the optimal parameter values were hz1a , z1b i = r 23 0.97, hz2a , z2b i = 0.97, and hz3a , z3b i = 0.96. The correlation coefficients between the simu- 24 lated relations and the transformed variables are shown in Table II. The exponential relation was extracted in the first pair (z1a , z1b ), the 3rd order polynomial relation was Pe 25 26 extracted in the second pair (z2a , z2b ) and the linear relation in the third pair (z3a , z3b ). 27 28 Non-linear relations are also taken into account through neural networks which are er 29 employed in deep CCA [Andrew et al. 2013]. In deep CCA, every observation xka ∈ Rp 30 and xkb ∈ Rq for k = 1, 2, . . . , n is non-linearly transformed multiple times in an iter- 31 ative manner through a layered network. The number of units in a layer determines 32 the dimension of the output vector which is put in the next layer. As is explained in Re 33 [Andrew et al. 2013], let the first layer have c1 units and the final layer o units. The 34 output vector of the first layer for the observation x1a ∈ Rp , is h1 = s(S11 x1a + b11 ) ∈ Rc1 , 35 where S11 ∈ Rc1 ×p is a matrix of weights, b11 ∈ Rc1 is a vector of bias, and s : R 7→ R 36 is a non-linear function applied to each element. The logistic and tanh functions are examples of popular non-linear functions. The output vector h1 is then used to com- vi 37 pute the output of the following layer in similar manner. The final transformed vector 38 f1 (x1a ) = s(Sd1 hd−1 + b1d ) is in the space of Ro , for a network with d layers. The same 39 procedure is applied to the observations xkb ∈ Rq for k = 1, 2, . . . , n. ew 40 In deep CCA, the aim is to learn the optimal parameters Sd and bd for both views 41 such that the correlation between the transformed observations is maximised. Let 42 Ha ∈ Ro×n and Hb ∈ Ro×n denote the matrices that have the final transformed output 43 vectors in their columns. Let H̃a = Ha − n1 Ha 1 denote the centered data matrix and 44 1 1 let Ĉab = m−1 H̃a H̃bT and Ĉaa = m−1 H̃a H̃aT + ra I, where ra is a regularisation constant, 45 46 denote the covariance and variance matrices. Same formulae are used to compute the 47 covariance and variance matrices for view b. As in section 2.1, the total correlation of the top k components of Ha and Hb is the sum of the top k singular values of the matrix 48 −1/2 −1/2 49 T = Ĉaa Ĉab Ĉbb . If k = o, the correlation is given by the trace norm of T , that is 50 corr(Ha , Hb ) = tr(T T T )1/2 . 51 52 The optimal parameters Sd and bd maximise the trace norm using gradient-based op- 53 timisation. The details of the algorithm can be found in [Andrew et al. 2013]. 54 55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017. 56 57 58 59 60 Page 23 of 35 Computing Surveys 1 2 3 1:22 4 5 In summary, kernel and deep CCA provide alternatives to the linear CCA when the 6 relations in the data can be considered to be non-linear and the sample size is small 7 in relation to the data dimensionality. When applying kernel CCA on a real dataset, 8 prior knowledge of the relations of interest can help in the analysis of the results. If 9 the data is assumed to contain both linear and non-linear relations a Gaussian RBF 10 kernel could be a first option. The choice of the kernel function depends on what kind of 11 relations the data can be considered to contain. The possible relations can be extracted 12 by testing how the image pairs correlate with the functions of variables. Deep CCA provides an alternative to compute maximal correlation between the views although 13 the neural network makes the identification of the type of relations difficult. 14 15 3.4. Improving the Interpretability by Enforcing Sparsity 16 The extraction of the linear relations between the variables in CCA and regularised 17 CCA relies on the values of the entries of the position vectors that have images on 18 the unit ball with a minimum enclosing angle. The relations can be inferred when the Fo 19 number of variables is not too large for a human to interpret. However, in modern data 20 analysis, it is common that the number of variables is of the order of tens of thousands. 21 In this case, the values of the entries of the position vectors should be constrained such 22 that only a subset of the variables would have a non-zero value. This would facilitate r 23 the interpretation since only a fraction of the total number of variables need to be 24 considered when inferring the relations. To constrain some of the values of the entries of the position vectors to zero, which Pe 25 26 is also referred to as to enforce sparsity, tools of convex analysis can be applied. In 27 literature, sparsity has been enforced on the position vectors using soft-thresholding 28 operators [Parkhomenko et al. 2007], elastic net regularisation [Waaijenborg et al. 2008], penalised matrix decomposition combined with soft-thresholding [Witten et al. er 29 2009], and convex least squares optimisation [Hardoon and Shawe-Taylor 2011]. The 30 sparse CCA formulations presented in [Parkhomenko et al. 2007; Waaijenborg et al. 31 2008; Witten et al. 2009] find sparse position vectors that can be applied to infer linear 32 relations between the variables with non-zero entries. The formulation in [Hardoon Re 33 and Shawe-Taylor 2011] differs from the preceding propositions in terms of the op- 34 timisation criterion. The canonical correlation is found between the image obtained 35 from the linear transformation defined by the data space of one view and the image 36 obtained from the linear transformation defined by the kernel of the other view. The vi 37 selection of which sparse CCA should be applied for a specific task depends on the 38 research question and prior knowledge regarding the variables. 39 The sparse CCA algorithm of [Parkhomenko et al. 2007] can be applied when the ew 40 aim is to find sparse position vectors and no prior knowledge regarding the variables is 41 available. The positions and images are solved using the SVD, as presented in Section 42 2.2. Sparsity is enforced on the entries of the positions by iteratively applying the 43 soft-thresholding operator [Donoho and Johnstone 1995] on the pair of left and right orthonormal singular vectors until convergence. The soft-thresholding operator is a 44 proximal mapping of the L1 norm [Bach et al. 2011]. The consecutive pairs of sparse 45 left and right singular vectors are obtained by deflating the found pattern from the 46 matrix on which the SVD is computed. The sparse CCA hence results in a sparse set 47 of linearly related variables. 48 The elastic net CCA [Waaijenborg et al. 2008] finds sparse position vectors but also 49 considers possible groupings in the variables. The elastic net [Zou and Hastie 2005] 50 combines the LASSO [Tibshirani 1996] and the ridge [Hoerl and Kennard 1970] penal- 51 ties. The elastic net penalty incorporates a grouping effect in the variable selection. 52 The term variable selection refers to that a selected variable has a non-zero entry in 53 the position vector. In the soft-thresholding CCA of [Parkhomenko et al. 2007], the as- 54 55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017. 56 57 58 59 60 Computing Surveys Page 24 of 35 1 2 3 A Tutorial on Canonical Correlation Methods 1:23 4 5 signment of a non-zero entry is independent of the other entries within the vector. In 6 the elastic net CCA, the ridge penalty groups the variables by the values of the en- 7 tries and the LASSO penalty either eliminates a group by shrinking the entries of the 8 variables within the group to zero or leaves them as non-zero. The algorithm is based 9 on an iterative scheme of multiple regression. As in [Parkhomenko et al. 2007], the 10 computations are performed in the data space and therefore the extracted relations 11 are also linear. 12 The penalised matrix decomposition (PMD) formulation of sparse CCA [Witten et al. 2009] is based on finding low-rank approximations of the covariance matrix Cab . An 13 n × p sized matrix X with rank K < min(p, q) can be approximated using the SVD 14 [Eckart and Young 1936] by 15 r 16 X 17 σk uk vkT = argmin||X − X̃||2F k=1 X̃∈M (r) 18 Fo 19 where uk denotes the column of the matrix U , vk denotes the column of the matrix 20 U , σk denotes the k th singular value on the diagonal of S, M (r) is the set of rank r 21 n × p matrices and r << K. In the case of CCA, the matrix to be approximated is the 22 covariance matrix X = Cab . The optimisation problem in the PMD context is given by r 23 1 24 min ||Cab − σwa wbT ||2F , p wa ∈R ,wb ∈Rq 2 Pe 25 ||wa ||2 = 1 ||wb ||2 = 1, 26 ||wa ||1 ≤ c1 ||wb ||1 ≤ c2 , σ ≥ 0 27 28 which is equivalent to er 29 cos θ = max waT Cab wb , 30 wa ∈Rp ,wb ∈Rq 31 ||wa ||2 ≤ 1 ||wb ||2 ≤ 1, 32 ||wa ||1 ≤ c1 ||wb ||1 ≤ c2 . Re 33 34 The aim is to find r pairs of sparse position vectors wa and wb such that their outer 35 products represent low-rank approximations of the original Cab and hence extracts the 36 r linear relations from the data. The exact derivation of the algorithm to solve the PMD optimisation problem is vi 37 given in [Witten et al. 2009]. In general, the position vectors, that generate 1-rank 38 approximations of the covariance matrix, are found in an iterative manner. To find 39 one 1-rank approximation, the soft-thresholding operator is applied as follows. Let the ew 40 soft-thresholding operator be given by 41 42 S(a, c) = sign(a)(|a| − c)+ 43 where c > 0 is a constant. The following formula is applied in the derivation of the 44 algorithm 45 maxhu, ai, 46 u 47 s.t. ||u||22 ≤ 1, ||u||1 < c. 48 S(a,δ) 49 The solution is given by u = ||S(a,δ)||2 with δ = 0 if ||u1 || ≤ c. Otherwise, δ is selected 50 such that ||u1 || = c. Sparse position vectors are the obtained by Algorithm 2. At ev- 51 ery iteration, the δ1 and δ2 are selected by binary search. To obtain several 1-rank 52 approximations, a deflation step is included such that when the converged vectors wa 53 and wb are found, the extracted relations are subtracted from the covariance matrix 54 55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017. 56 57 58 59 60 Page 25 of 35 Computing Surveys 1 2 3 1:24 4 5 ALGORITHM 2: Computation of a 1-rank approximation of the covariance matrix 6 7 Initialise ||wb ||2 = 1 ; repeat 8 S(Cab wb ,δ1 ) wa ← ||S(C where δ1 = 0 if ||wa ||1 ≤ c1 , otherwise δ1 is chosen such that 9 ab wb ,δ1 )||2 ||wa ||1 = c1 and c1 > 0 ; 10 T S(Cab wa ,δ2 ) 11 wb ← T w ,δ )|| ||S(Cab where δ2 = 0 if ||wb ||1 ≤ c2 , otherwise δ2 is chosen such that a 2 2 12 ||wb ||1 = c2 and c2 > 0 ; 13 until convergence; 14 σ ← waT Cab wb 15 16 k+1 17 Cab ← C k − σk wak wbkT . In this way, the successive solutions remain orthogonal which 18 is a contstraint of CCA. Fo 19 20 Example 3.4. To demonstrate the PMD formulation of sparse CCA, we generate 21 the following data. The data matrices Xa and Xb of sizes n × p and n × q, where n = 50, 22 p = 100 and q = 150, respectively as follows. The variables of Xa are generated from r 23 a random univariate normal distribution, a1 , a2 , · · · , a100 ∼ N (0, 1). We generate the 24 following linear relations Pe 25 26 b1 = a3 + ξ 1 (45) 27 b2 = a1 + ξ 2 (46) 28 b3 = −a4 + ξ 3 (47) er 29 30 where ξ 1 ∼ N (0, 0.08), ξ 2 ∼ N (0, 0.07), and ξ 3 ∼ N (0, 0.05) denote vectors of normal 31 noise. The other variables of Xb are generated from a random univariate normal dis- 32 tribution, b4 , b5 , · · · , b150 ∼ N (0, 1). The data is standardised such that every variable Re 33 has zero mean and unit variance. 34 We apply the R implementation of [Witten et al. 2009] which is available in the 35 PMA package. We extract three rank-1 approximations. The values of the entries of 36 the pairs of position vectors (wa1 , wb1 ), (wa2 , wb2 ) and (wa3 , wb3 ) corresponding to canonical correlations hz1a , z1b i = 0.95, hz2a , z2b i = 0.92, hz3a , z3b i = 0.91 are shown in Figure 5. The vi 37 38 first 1-rank approximation extracted (47), the second (46), and the third (47). 39 ew 40 The sparse CCA of [Hardoon and Shawe-Taylor 2011] is a sparse convex least 41 squares formulation that differs from the preceding versions. The canonical correla- 42 tion is found between the linear transformations between a data space view and a kernel space view. The aim is to find a sparse set of variables in the data space view 43 that relate to a sparse set of observations, represented in terms of relative similari- 44 ties, in the kernel space view. An example of a setting, where relations of this type 45 can provide useful information, is in bilingual analysis as described in [Hardoon and 46 Shawe-Taylor 2011]. When finding relations between words of two languages, it may 47 be useful to know in what kind of contexts can a word be used in the translated lan- 48 guage. The optimisation problem is given by 49 50 cos(za , zb ) = max n hza , zb i = waT XaT Kb β, 51 za ,zb ∈R q 52 q ||za ||2 = waT XaT Xa wa = 1 ||zb ||2 = β T Kb2 β = 1 53 54 55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017. 56 57 58 59 60 Computing Surveys Page 26 of 35 1 2 3 A Tutorial on Canonical Correlation Methods 1:25 4 5 wa1 wa2 wa3 6 1 1 1 7 0.5 0.5 0.5 8 0 0 0 9 10 -0.5 -0.5 -0.5 11 -1 -1 -1 12 a4 a1 a3 13 14 15 wb1 wb2 wb3 16 1 1 1 17 18 0.5 0.5 0.5 Fo 19 0 0 0 20 -0.5 -0.5 -0.5 21 22 -1 -1 -1 r 23 b3 b2 b1 24 Pe 25 Fig. 5. The values of the entries of the position vector pairs (wa1 , wb1 ), (wa2 , wb2 ) and (wa3 , wb3 ) obtained using 26 the PMD method for sparse CCA are shown. The entry of maximum absolute value is coloured blue. The 27 negative linear relation between a4 and b3 is extracted in the first 1-rank approximation. The positive linear relations between a1 and b2 and a3 and b1 are extracted in the second and third 1-rank approximations. 28 er 29 30 ALGORITHM 3: Pseudo-code to solve the convex sparse least squares problem 31 repeat 32 1. Use the dual Lagrangian variables to solve the primal variables ; Re 33 2. Check whether all constraints on the primal variables hold ; 34 3. Use the primal variables to solve the dual Lagrangian variables ; 4. Check whether all dual Lagrangian variable constraints hold ; 35 5. Check whether 2 holds, if not go to 1 ; 36 until convergence; vi 37 38 39 which is equivalent to the convex sparse least squares problem ew 40 41 min ||Xa wa − Kb β||2 + µ||wa ||1 + γ||β̃||1 (48) wa ,β 42 43 s.t ||β||∞ = 1 (49) 44 where µ and γ are fixed parameters that control the trade-off between function ob- 45 jective and the level of sparsity of the entries of the position vectors wa and β. The 46 constraint ||β||∞ = 1 is set to avoid the trivial solution (wa = 0, β = 0). The k th en- 47 try of β is set to βk = 1 and the remaining entries in β̃ are constrained by 1-norm. 48 The idea is to fix one sample as a basis for comparison and rank the other similar 49 samples based on the fixed sample. The optimisation problem is solved by iteratively 50 minimising the gap between the primal and dual Lagrangian solutions. The procedure 51 is outlined in Algorithm 3. The exact computational steps can be found in [Hardoon 52 and Shawe-Taylor 2011]. The Algorithm 3 is used to extract one relation or pattern 53 from the data. To extract the successive patterns, deflation is applied to obtain the 54 55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017. 56 57 58 59 60 Page 27 of 35 Computing Surveys 1 2 3 1:26 4 5 wa β 6 0.014 1 7 0.012 0.9 8 0.8 0.01 0.7 9 0.6 10 0.008 0.5 11 0.006 0.4 12 0.3 13 0.004 0.2 14 0.002 0.1 15 0 0 16 β2 ββ1112 β26β29 β35 β40 β46 a9 aaa11165 8 a220 a24 9 a62 a77 71 2 4 a6 a9 17 18 Fo 19 Fig. 6. The values of the entries of the positions wa and β at the optimal value of k are shown. 20 21 residual matrices from which the already found pattern is removed. In Example 3.5, 22 the extraction of the first pattern is shown. r 23 Example 3.5. In the sparse CCA of [Hardoon and Shawe-Taylor 2011], the idea is 24 to determine the relations of the variables in the data space view Xa to the observa- Pe 25 tions in the kernel space view Kb where the observations comprise the variables of the 26 view b. This setting differs from all of the previous examples where the idea was to find 27 relations between the variables. Since one of the views is kernelised, the relations can- 28 not be explicitly simulated. We therefore demonstrate the procedure on data generated er 29 from random univariate normal distribution as follows. The data matrices Xa and Xb 30 of sizes n × p and n × q, where n = 50, p = 100 and q = 150, respectively are generated 31 as follows. The variables of Xa and Xb are generated from random univariate normal 32 distribution, a1 , a2 , · · · , a100 ∼ N (0, 1) and b1 , b2 , · · · , b150 ∼ N (0, 1) respectively. The data is standardised such that every variable has zero mean and unit variance. Re 33 The Gaussian kernel function K(x, y) = exp(−||x − y||2 /2σ 2 ) is used to compute the 34 similarities for the view b. The choice of the kernel is justified since the underlying 35 distribution is normal. The width parameter is set to σ = 17.25 using the median trick. 36 The kernel matrix is centered by K̃ = K − 1l jjT K − 1l KjjT + l12 (jT Kj)jjT where j contains vi 37 only entries of value one [Shawe-Taylor and Cristianini 2004]. 38 To find the positions wa and β, we solve 39 ew 40 f = min ||Xa wa − Kb β||2 + µ||wa ||1 + γ||β̃||1 wa ,β 41 42 s.t ||β||∞ = 1 43 using the implementation proposed in [Uurtio et al. 2015]. As stated in [Hardoon and 44 Shawe-Taylor 2011], to determine which variable in the data space view Xa is most 45 related to the observation in Kb , the algorithm needs to be run for all possible values 46 of k. This means that every observation is in turn set as a basis for comparison and a 47 sparse set of the remaining observations β̃ is computed. The optimal value of k gives 48 the minimum objective value f . 49 We run the algorithm by initially setting the value of the entry βk = 1 for k = 50 1, 2, . . . , n. The minimum objective value f = 0.03 was obtained at k = 29. This corre- 51 sponds to a canonical correlation of hza , zb i = 0.88. The values of the entries of wa and 52 β are shown in Figure 6. The observation corresponding to k = 29 in the kernelised 53 view Kb is most related to the variables a15 , a16 , a18 , a20 , and a24 . 54 55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017. 56 57 58 59 60 Computing Surveys Page 28 of 35 1 2 3 A Tutorial on Canonical Correlation Methods 1:27 4 5 The sparse versions of CCA can be applied to settings when the large number of 6 variables hinders the inference of the relations. When the interest is to extract sparse 7 linear relations between the variables, the proposed algorithms of [Parkhomenko et al. 8 2007; Waaijenborg et al. 2008; Witten et al. 2009] provide a solution. The algorithm 9 of [Hardoon and Shawe-Taylor 2011] can be applied if the focus is to find how the 10 variables of one view relate to the observations that correspond to the combined sets 11 of the variables in the other view. In other words, the approach is useful if the focus is 12 not to uncover the explicit relations between the variables but to gain insight how a variable relates to a complete set of variables of an observation. 13 14 4. DISCUSSION 15 This tutorial presented an overview on the methodological evolution of canonical cor- 16 relation methods focusing on the original linear, regularised, kernel, and sparse CCA. 17 Succinct reviews were also conducted on the Bayesian and neural network-based deep 18 CCA variants. The aim was to explain the theoretical foundations of the variants us- Fo 19 ing the linear algebraic interpretation of CCA. The methods to solve the CCA problems 20 were described using numerical examples. Additionally, techniques to assess the sta- 21 tistical significance of the extracted relations and the generalisability of the patterns 22 were explained. The aim was to delineate the applicabilities of the different CCA vari- r 23 ants in relation to the properties of the data. 24 In CCA, the aim is to determine linear relations between variables belonging to two sets. From a linear algebraic point of view, the relations can be found by analysing the Pe 25 26 linear transformations defined by the two views of the data. The most distinct relations 27 are obtained by analysing the entries of the first pair of position vectors in the two data 28 spaces that are mapped onto a unit ball such that their images have a minimum en- closing angle. The less distinct relations can be identified from the successive pairs er 29 of position vectors that correspond to the images with a minimum enclosing angle ob- 30 tained from the orthogonal complements of the preceding pairs of images. This tutorial 31 presented three standard ways of solving the CCA problem, that is by solving either a 32 standard [Hotelling 1935; 1936] or a generalised eigenvalue problem [Bach and Jordan Re 33 2002; Hardoon et al. 2004], or by applying the SVD [Healy 1957; Ewerbring and Luk 34 1989]. 35 The position vectors of the two data spaces, that convey the related pairs of variables, 36 can be obtained using alternative techniques than the ones selected for this tutorial. vi 37 The three methods were chosen because they have been much applied in CCA litera- 38 ture and they are relatively straightforward to explain and implement. Additionally, 39 to understand the further extensions of CCA, it is important to know how it originally ew 40 has been solved. The extensions are often further developed versions of the standard 41 techniques. 42 For didactic purposes, the synthetic datasets used for the worked examples were de- 43 signed to represent optimal data settings for the particular CCA variants to uncover the relations. The relations were generated to be one-to-one, in other words one vari- 44 able in one view was related with only one variable in the other view. In real datasets, 45 which are often much larger than the synthetic ones in this paper, the relations may 46 not be one-to-one but rather many-to-many (one-to-two, two-to-three, etc.). As in the 47 worked examples, these relations can also be inferred by examining the entries of the 48 position vectors of the two data spaces. However, the understanding of how the one-to- 49 one relations are extracted provides means to uncover the more complex relations. 50 To apply the linear CCA, the sample size needs to exceed the number of variables 51 of both views which means that the system is required to be overdetermined. This is 52 to guarantee the non-singularity of the variance matrices. If the sample size is not 53 sufficient, regularisation [Vinod 1976] or Bayesian CCA [Klami et al. 2013] can be ap- 54 55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017. 56 57 58 59 60 Page 29 of 35 Computing Surveys 1 2 3 1:28 4 5 plied. The feasibility of regularisation has not been studied in relation to the number 6 of variables exceeding the number of observations. Improving the invertibility by in- 7 troducing additional bias has been shown to work in various settings but the limit 8 when the system is too underdetermined that regularisation cannot assist in recov- 9 ering the underlying relations has not been resolved. Bayesian CCA is more robust 10 against outlying observations, when compared with linear CCA, due to its generative 11 model structure. 12 In addition to linear relations, non-linear relations are taken into account in ker- nelised and neural network-based CCA. Kernel methods enable the extraction of non- 13 linear relations through the mapping to a Hilbert space [Bach and Jordan 2002; 14 Hardoon et al. 2004]. When applying kernel methods in CCA, the disparity between 15 the number of observations and variables can be huge due to very dimensional kernel 16 induced feature spaces, a challenge that is tackled by regularisation. The types of rela- 17 tions that can be extracted, is determined by the kernel function that was selected for 18 the mapping. Linear relations are extracted by a linear kernel and non-linear relations Fo 19 by non-linear kernel functions such as the Gaussian kernel. Although kernelisation 20 extends the range of extractable relations, it also complicates the identification of the 21 type of relation. A method to determine the type of relation involves testing how the 22 image vectors correlate with a certain type of function. However, this may be difficult r 23 if no prior knowledge of the relations is available. Further research on how to select 24 the optimal kernel functions to determine the most distinct relations underlying in the data could facilitate the final inference making. Neural network-based deep CCA is an Pe 25 26 alternative to kernelised CCA, when the aim is to find a high correlation between the 27 final output vectors obtained through multiple non-linear transformations. However, due to the network structure, it is not straightforward to identify the relations between 28 the variables. er 29 As a final branch of the CCA evolution, this tutorial covered sparse versions of CCA. 30 Sparse CCA variants have been developed to facilitate the extraction of the relations 31 when the data dimensionality is too high for human interpretation. This has been ad- 32 dressed by enforcing sparsity on the entries of the position vectors [Parkhomenko et al. Re 33 2007; Waaijenborg et al. 2008; Witten et al. 2009]. As an alternative to operating in the 34 data spaces, [Hardoon and Shawe-Taylor 2011] proposed a primal-dual sparse CCA in 35 which the relations are obtained between the variables of one view and observations of 36 the other. The sparse variants of CCA in this tutorial were selected based on how much vi 37 they have been applied in literature. As a limitation of the selected variants, sparsity 38 is enforced on the entries of the position vectors without regarding the possible under- 39 lying dependencies between the variables which has been addressed in the literature ew 40 of structured sparsity [Chen et al. 2012]. 41 In addition to studying the techniques of solving the optimisation problems of CCA 42 variants, this tutorial gave a brief introduction to evaluating the canonical correla- 43 tion model. Bartlett’s sequential test procedure [Bartlett 1938; 1941] was given as an example of a standard method to assess the statistical significance of the canonical 44 correlations. The techniques of identifying the related variables through visual inspec- 45 tion of biplots [Meredith 1964; Ter Braak 1990] were presented. To assess whether 46 the extracted relations can be considered to occur in any data with the same under- 47 lying sampling distribution, the method of applying both training and test data was 48 explained. As an alternative method, the statistical significance of the canonical cor- 49 relation model could be assessed using permutation tests [Rousu et al. 2013]. The 50 visualisation of the results using the biplots is mainly applicable in the case of lin- 51 ear relations. Alternative approaches could be considered to visualise the non-linear 52 relations extracted by kernel CCA. 53 54 55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017. 56 57 58 59 60 Computing Surveys Page 30 of 35 1 2 3 A Tutorial on Canonical Correlation Methods 1:29 4 5 To conclude, this tutorial compiled the original, regularised, kernel, and sparse CCA 6 into a unified framework to emphasise the applicabilities of the four variants in dif- 7 ferent data settings. The work highlights which CCA variant is most applicable de- 8 pending on the sample size, data dimensionality and the type of relations of interest. 9 Techniques for extracting the relations are also presented. Additionally, the impor- 10 tance of assessing the statistical significance and generalisability of the relations is 11 emphasised. The tutorial hopefully advances both the practice of CCA variants in data 12 analysis and further development of novel extensions. The software used to produce the examples in this paper are available for download 13 at https://github.com/aalto-ics-kepaco/cca-tutorial. 14 15 16 ACKNOWLEDGMENTS 17 The work by Viivi Uurtio and Juho Rousu has been supported in part by Academy of Finland (grant 18 295496/D4Health). João M. Monteiro was supported by a PhD studentship awarded by Fundação para a Fo 19 Ciência e a Tecnologia (SFRH/BD/88345/2012). John Shawe-Taylor acknowledges the support of the EPSRC 20 through the C-PLACID project Reference: EP/M006093/1. 21 22 r 23 REFERENCES 24 S Akaho. 2001. A Kernel Method For Canonical Correlation Analysis. In In Proceedings of the International Meeting of the Psychometric Society (IMPS2001. Pe 25 Md A Alam, M Nasser, and K Fukumizu. 2008. Sensitivity analysis in robust and kernel canonical correla- 26 tion analysis. In Computer and Information Technology, 2008. ICCIT 2008. 11th International Confer- 27 ence on. IEEE, 399–404. 28 TW Anderson. 2003. An introduction to statistical multivariate analysis. (2003). er 29 G Andrew, R Arora, J Bilmes, and K Livescu. 2013. Deep canonical correlation analysis. In International 30 Conference on Machine Learning. 1247–1255. 31 C Archambeau and FR Bach. 2009. Sparse probabilistic projections. In Advances in neural information processing systems. 73–80. 32 C Archambeau, N Delannay, and M Verleysen. 2006. Robust probabilistic projections. In Proceedings of the Re 33 23rd International conference on machine learning. ACM, 33–40. 34 S Arlot, A Celisse, and others. 2010. A survey of cross-validation procedures for model selection. Statistics 35 surveys 4 (2010), 40–79. 36 F Bach, R Jenatton, J Mairal, G Obozinski, and others. 2011. Convex optimization with sparsity-inducing norms. Optimization for Machine Learning 5 (2011). vi 37 FR Bach and MI Jordan. 2002. Kernel independent component analysis. Journal of machine learning re- 38 search 3, Jul (2002), 1–48. 39 FR Bach and MI Jordan. 2005. A probabilistic interpretation of canonical correlation analysis. (2005). ew 40 MS Bartlett. 1938. Further aspects of the theory of multiple regression. In Mathematical Proceedings of the 41 Cambridge Philosophical Society, Vol. 34. Cambridge Univ Press, 33–40. 42 MS Bartlett. 1941. The statistical significance of canonical correlations. Biometrika 32, 1 (1941), 29–37. 43 B Baur and S Bozdag. 2015. A canonical correlation analysis-based dynamic bayesian network prior to infer gene regulatory networks from multiple types of biological data. Journal of Computational Biology 22, 44 4 (2015), 289–299. 45 Å Björck and GH Golub. 1973. Numerical methods for computing angles between linear subspaces. Mathe- 46 matics of computation 27, 123 (1973), 579–594. 47 MB Blaschko, CH Lampert, and A Gretton. 2008. Semi-supervised laplacian regularization of kernel canon- 48 ical correlation analysis. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 133–145. 49 MW Browne. 2000. Cross-validation methods. Journal of mathematical psychology 44, 1 (2000), 108–132. 50 E Burg and J Leeuw. 1983. Non-linear canonical correlation. British journal of mathematical and statistical 51 psychology 36, 1 (1983), 54–80. 52 J Cai. 2013. The distance between feature subspaces of kernel canonical correlation analysis. Mathematical 53 and Computer Modelling 57, 3 (2013), 970–975. 54 55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017. 56 57 58 59 60 Page 31 of 35 Computing Surveys 1 2 3 1:30 4 5 L Cao, Z Ju, J Li, R Jian, and C Jiang. 2015. Sequence detection analysis based on canonical correlation 6 for steady-state visual evoked potential brain computer interfaces. Journal of neuroscience methods 253 (2015), 10–17. 7 JD Carroll. 1968. Generalization of canonical correlation analysis to three or more sets of variables. In 8 Proceedings of the 76th annual convention of the American Psychological Association, Vol. 3. 227–228. 9 B Chang, U Krüger, R Kustra, and J Zhang. 2013. Canonical Correlation Analysis based on Hilbert-Schmidt 10 Independence Criterion and Centered Kernel Target Alignment.. In ICML (2). 316–324. 11 X Chen, S Chen, H Xue, and X Zhou. 2012. A unified dimensionality reduction framework for semi-paired and semi-supervised multi-view data. Pattern Recognition 45, 5 (2012), 2005–2018. 12 X Chen, C He, and H Peng. 2014. Removal of muscle artifacts from single-channel EEG based on ensemble 13 empirical mode decomposition and multiset canonical correlation analysis. Journal of Applied Mathe- 14 matics 2014 (2014). 15 X Chen, H Liu, and JG Carbonell. 2012. Structured sparse canonical correlation analysis. In International 16 Conference on Artificial Intelligence and Statistics. 199–207. 17 A Cichonska, J Rousu, P Marttinen, AJ Kangas, P Soininen, T Lehtimäki, OT Raitakari, M-R Järvelin, V Salomaa, M Ala-Korpela, and others. 2016. metaCCA: Summary statistics-based multivariate meta- 18 analysis of genome-wide association studies using canonical correlation analysis. Bioinformatics (2016), Fo 19 btw052. 20 N Cristianini, J Shawe-Taylor, and H Lodhi. 2002. Latent semantic kernels. Journal of Intelligent Informa- 21 tion Systems 18, 2-3 (2002), 127–152. R Cruz-Cano and MLT Lee. 2014. Fast regularized canonical correlation analysis. Computational Statistics 22 & Data Analysis 70 (2014), 88–100. r 23 J Dauxois and GM Nkiet. 1997. Canonical analysis of two Euclidean subspaces and its applications. Linear 24 Algebra Appl. 264 (1997), 355–388. Pe 25 DL Donoho and IM Johnstone. 1995. Adapting to unknown smoothness via wavelet shrinkage. Journal of 26 the american statistical association 90, 432 (1995), 1200–1224. 27 RB Dunham and DJ Kravetz. 1975. Canonical correlation analysis in a predictive system. The Journal of Experimental Education 43, 4 (1975), 35–42. 28 C Eckart and G Young. 1936. The approximation of one matrix by another of lower rank. Psychometrika 1, er 29 3 (1936), 211–218. 30 B Efron. 1979. Computers and the theory of statistics: thinking the unthinkable. SIAM review 21, 4 (1979), 31 460–480. 32 LM Ewerbring and FT Luk. 1989. Canonical correlations and generalized SVD: applications and new algo- Re 33 rithms. In 32nd Annual Technical Symposium. International Society for Optics and Photonics, 206–222. J Fang, D Lin, SC Schulz, Z Xu, VD Calhoun, and Y-P Wang. 2016. Joint sparse canonical correlation analysis 34 for detecting differential imaging genetics modules. Bioinformatics 32, 22 (2016), 3480–3488. 35 Y Fujikoshi and LG Veitch. 1979. Estimation of dimensionality in canonical correlation analysis. Biometrika 36 66, 2 (1979), 345–351. vi 37 K Fukumizu, FR Bach, and A Gretton. 2007. Statistical consistency of kernel canonical correlation analysis. 38 Journal of Machine Learning Research 8, Feb (2007), 361–383. 39 C Fyfe and PL Lai. 2000. Canonical correlation analysis neural networks. In Pattern Recognition, 2000. ew Proceedings. 15th International Conference on, Vol. 2. IEEE, 977–980. 40 GH Golub and CF Van Loan. 2012. Matrix computations. Vol. 3. JHU Press. 41 GH Golub and H Zha. 1995. The canonical correlations of matrix pairs and their numerical computation. In 42 Linear algebra for signal processing. Springer, 27–49. 43 I González, S Déjean, PGP Martin, O Gonçalves, P Besse, and A Baccini. 2009. Highlighting relationships 44 between heterogeneous biological data through graphical displays based on regularized canonical cor- relation analysis. Journal of Biological Systems 17, 02 (2009), 173–199. 45 BK Gunderson and RJ Muirhead. 1997. On estimating the dimensionality in canonical correlation analysis. 46 Journal of Multivariate Analysis 62, 1 (1997), 121–136. 47 DR Hardoon, J Mourao-Miranda, M Brammer, and J Shawe-Taylor. 2007. Unsupervised analysis of fMRI 48 data using kernel canonical correlation. NeuroImage 37, 4 (2007), 1250–1259. 49 DR Hardoon and J Shawe-Taylor. 2009. Convergence analysis of kernel canonical correlation analysis: the- 50 ory and practice. Machine learning 74, 1 (2009), 23–38. 51 DR Hardoon and J Shawe-Taylor. 2011. Sparse canonical correlation analysis. Machine Learning 83, 3 (2011), 331–353. 52 53 54 55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017. 56 57 58 59 60 Computing Surveys Page 32 of 35 1 2 3 A Tutorial on Canonical Correlation Methods 1:31 4 5 DR Hardoon, S Szedmak, and J Shawe-Taylor. 2004. Canonical correlation analysis: An overview with ap- 6 plication to learning methods. Neural computation 16, 12 (2004), 2639–2664. MJR Healy. 1957. A rotation method for computing canonical correlations. Math. Comp. 11, 58 (1957), 83– 7 86. 8 C Heij and B Roorda. 1991. A modified canonical correlation approach to approximate state space modelling. 9 In Decision and Control, 1991., Proceedings of the 30th IEEE Conference on. IEEE, 1343–1348. 10 AE Hoerl and RW Kennard. 1970. Ridge regression: Biased estimation for nonorthogonal problems. Techno- 11 metrics 12, 1 (1970), 55–67. 12 JW Hooper. 1959. Simultaneous equations and canonical correlation theory. Econometrica: Journal of the Econometric Society (1959), 245–256. 13 CE Hopkins. 1969. Statistical analysis by canonical correlation: a computer application. Health services 14 research 4, 4 (1969), 304. 15 P Horst. 1961. Relations among sets of measures. Psychometrika 26, 2 (1961), 129–149. 16 H Hotelling. 1935. The most predictable criterion. Journal of educational Psychology 26, 2 (1935), 139. 17 H Hotelling. 1936. Relations between two sets of variates. Biometrika 28, 3/4 (1936), 321–377. 18 WW Hsieh. 2000. Nonlinear canonical correlation analysis by neural networks. Neural Networks 13, 10 Fo 19 (2000), 1095–1105. 20 I Huopaniemi, T Suvitaival, J Nikkilä, M Orešič, and S Kaski. 2010. Multivariate multi-way analysis of multi-source data. Bioinformatics 26, 12 (2010), i391–i398. 21 A Kabir, RD Merrill, AA Shamim, RDW Klemn, AB Labrique, P Christian, KP West Jr, and M Nasser. 2014. 22 Canonical correlation analysis of infant’s size at birth and maternal factors: a study in rural Northwest r 23 Bangladesh. PloS one 9, 4 (2014), e94243. 24 M Kang, B Zhang, X Wu, C Liu, and J Gao. 2013. Sparse generalized canonical correlation analysis for biological model integration: a genetic study of psychiatric disorders. In Engineering in Medicine and Pe 25 Biology Society (EMBC), 2013 35th Annual International Conference of the IEEE. IEEE, 1490–1493. 26 JR Kettenring. 1971. Canonical analysis of several sets of variables. Biometrika (1971), 433–451. 27 A Kimura, M Sugiyama, T Nakano, H Kameoka, H Sakano, E Maeda, and K Ishiguro. 2013. SemiCCA: 28 Efficient semi-supervised learning of canonical correlations. Information and Media Technologies 8, 2 er 29 (2013), 311–318. 30 A Klami and S Kaski. 2007. Local dependent components. In Proceedings of the 24th international conference on Machine learning. ACM, 425–432. 31 A Klami, S Virtanen, and S Kaski. 2012. Bayesian exponential family projections for coupled data sources. 32 arXiv preprint arXiv:1203.3489 (2012). Re 33 A Klami, S Virtanen, and S Kaski. 2013. Bayesian canonical correlation analysis. Journal of Machine Learn- 34 ing Research 14, Apr (2013), 965–1003. 35 D Krstajic, LJ Buturovic, DE Leahy, and S Thomas. 2014. Cross-validation pitfalls when selecting and 36 assessing regression and classification models. Journal of cheminformatics 6, 1 (2014), 1. PL Lai and C Fyfe. 1999. A neural implementation of canonical correlation analysis. Neural Networks 12, vi 37 10 (1999), 1391–1397. 38 PL Lai and C Fyfe. 2000. Kernel and nonlinear canonical correlation analysis. International Journal of 39 Neural Systems 10, 05 (2000), 365–377. ew 40 NB Larson, GD Jenkins, MC Larson, RA Vierkant, TA Sellers, CM Phelan, JM Schildkraut, R Sutphen, 41 PPD Pharoah, S A Gayther, and others. 2014. Kernel canonical correlation analysis for assessing gene– gene interactions and application to ovarian cancer. European Journal of Human Genetics 22, 1 (2014), 42 126–131. 43 SC Larson. 1931. The shrinkage of the coefficient of multiple correlation. Journal of Educational Psychology 44 22, 1 (1931), 45. 45 H-S Lee. 2007. Canonical correlation analysis using small number of samples. Communications in Statistic- 46 sSimulation and Computation R 36, 5 (2007), 973–985. 47 SE Leurgans, RA Moyeed, and BW Silverman. 1993. Canonical correlation analysis when the data are curves. Journal of the Royal Statistical Society. Series B (Methodological) (1993), 725–740. 48 H Lindsey, JT Webster, and S Halpern. 1985. Canonical Correlation as a Discriminant Tool in a Periodontal 49 Problem. Biometrical journal 27, 3 (1985), 257–264. 50 P Marttinen, J Gillberg, A Havulinna, J Corander, and S Kaski. 2013. Genome-wide association studies with 51 high-dimensional phenotypes. Statistical applications in genetics and molecular biology 12, 4 (2013), 413–431. 52 53 54 55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017. 56 57 58 59 60 Page 33 of 35 Computing Surveys 1 2 3 1:32 4 5 T Melzer, M Reiter, and H Bischof. 2001. Nonlinear feature extraction using generalized canonical correla- 6 tion analysis. In International Conference on Artificial Neural Networks. Springer, 353–360. T Melzer, M Reiter, and H Bischof. 2003. Appearance models based on kernel canonical correlation analysis. 7 Pattern recognition 36, 9 (2003), 1961–1971. 8 W Meredith. 1964. Canonical correlations with fallible data. Psychometrika 29, 1 (1964), 55–65. 9 MS Monmonier and FE Finn. 1973. Improving the interpretation of geographical canonical correlation mod- 10 els. The Professional Geographer 25, 2 (1973), 140–142. 11 M Nakanishi, Y Wang, Y-T Wang, and T-P Jung. 2015. A Comparison Study of Canonical Correlation 12 Analysis Based Methods for Detecting Steady-State Visual Evoked Potentials. PloS one 10, 10 (2015), e0140703. 13 RM Neal. 2012. Bayesian learning for neural networks. Vol. 118. Springer Science & Business Media. 14 T Ogura, Y Fujikoshi, and T Sugiyama. 2013. A variable selection criterion for two sets of principal compo- 15 nent scores in principal canonical correlation analysis. Communications in Statistics-Theory and Meth- 16 ods 42, 12 (2013), 2118–2135. 17 E Parkhomenko, D Tritchler, and J Beyene. 2007. Genome-wide sparse canonical correlation of gene expres- sion with genotypes. In BMC proceedings, Vol. 1. BioMed Central Ltd, S119. 18 P Rai and H Daume. 2009. Multi-label prediction via sparse infinite CCA. In Advances in Neural Information Fo 19 Processing Systems. 1518–1526. 20 J Rousu, DD Agranoff, O Sodeinde, J Shawe-Taylor, and D Fernandez-Reyes. 2013. Biomarker discovery by 21 sparse canonical correlation analysis of complex clinical phenotypes of tuberculosis and malaria. PLoS 22 Comput Biol 9, 4 (2013), e1003018. r 23 Y Saad. 2011. Numerical methods for large eigenvalue problems. Vol. 158. SIAM. 24 T Sakurai. 2009. Asymptotic expansions of test statistics for dimensionality and additional information in canonical correlation analysis when the dimension is large. Journal of Multivariate Analysis 100, 5 Pe 25 (2009), 888–901. 26 BK Sarkar and C Chakraborty. 2015. DNA pattern recognition using canonical correlation algorithm. Jour- 27 nal of biosciences 40, 4 (2015), 709–719. 28 SV Schell and WA Gardner. 1995. Programmable canonical correlation analysis: A flexible framework for er blind adaptive spatial filtering. IEEE transactions on signal processing 43, 12 (1995), 2898–2908. 29 B Schölkopf, A Smola, and K-R Müller. 1998. Nonlinear component analysis as a kernel eigenvalue problem. 30 Neural computation 10, 5 (1998), 1299–1319. 31 JA Seoane, C Campbell, INM Day, JP Casas, and TR Gaunt. 2014. Canonical correlation analysis for gene- 32 based pleiotropy discovery. PLoS Comput Biol 10, 10 (2014), e1003876. Re 33 J Shawe-Taylor and N Cristianini. 2004. Kernel methods for pattern analysis. Cambridge university press. 34 X-B Shen, Q-S Sun, and Y-H Yuan. 2013. Orthogonal canonical correlation analysis and its application in 35 feature fusion. In Information Fusion (FUSION), 2013 16th International Conference on. IEEE, 151– 157. 36 C Soneson, H Lilljebjörn, T Fioretos, and M Fontes. 2010. Integrative analysis of gene expression and copy vi 37 number alterations using canonical correlation analysis. BMC bioinformatics 11, 1 (2010), 1. 38 L Song, B Boots, SM Siddiqi, GJ Gordon, and A Smola. 2010. Hilbert space embeddings of hidden Markov 39 models. (2010). ew 40 Y Song, PJ Schreier, D Ramı́rez, and T Hasija. 2016. Canonical correlation analysis of high-dimensional data with very small sample support. Signal Processing 128 (2016), 449–458. 41 M Stone. 1974. Cross-validatory choice and assessment of statistical predictions. Journal of the royal statis- 42 tical society. Series B (Methodological) (1974), 111–147. 43 MJ Sullivan. 1982. Distribution of Edaphic Diatoms in a Missisippi Salt Marsh: A Canonical Correlation 44 Analysis. Journal of Phycology 18, 1 (1982), 130–133. 45 A Tenenhaus, C Philippe, and V Frouin. 2015. Kernel generalized canonical correlation analysis. Computa- 46 tional Statistics & Data Analysis 90 (2015), 114–131. 47 A Tenenhaus, C Philippe, V Guillemot, K-A Le Cao, J Grill, and V Frouin. 2014. Variable selection for generalized canonical correlation analysis. Biostatistics (2014), kxu001. 48 A Tenenhaus and M Tenenhaus. 2011. Regularized generalized canonical correlation analysis. Psychome- 49 trika 76, 2 (2011), 257–284. 50 CJF Ter Braak. 1990. Interpreting canonical correlation analysis through biplots of structure correlations 51 and weights. Psychometrika 55, 3 (1990), 519–531. 52 R Tibshirani. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological) (1996), 267–288. 53 54 55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017. 56 57 58 59 60 Computing Surveys Page 34 of 35 1 2 3 A Tutorial on Canonical Correlation Methods 1:33 4 5 XM Tu. 1991. A bootstrap resampling scheme for using the canonical correlation technique in rank estima- 6 tion. Journal of chemometrics 5, 4 (1991), 333–343. XM Tu, DS Burdick, DW Millican, and LB McGown. 1989. Canonical correlation technique for rank estima- 7 tion of excitation-emission matrixes. Analytical Chemistry 61, 19 (1989), 2219–2224. 8 V Uurtio, M Bomberg, K Nybo, M Itävaara, and J Rousu. 2015. Canonical correlation methods for exploring 9 microbe-environment interactions in deep subsurface. In International Conference on Discovery Science. 10 Springer, 299–307. 11 JP Van de Geer. 1984. Linear relations amongk sets of variables. Psychometrika 49, 1 (1984), 79–94. 12 T Van Gestel, JAK Suykens, J De Brabanter, B De Moor, and J Vandewalle. 2001. Kernel canonical cor- relation analysis and least squares support vector machines. In International Conference on Artificial 13 Neural Networks. Springer, 384–389. 14 HD Vinod. 1976. Canonical ridge and econometrics of joint production. Journal of Econometrics 4, 2 (1976), 15 147–166. 16 S Waaijenborg, PC Verselewel de Witt Hamer, and AH Zwinderman. 2008. Quantifying the association 17 between gene expressions and DNA-markers by penalized canonical correlation analysis. Statistical Applications in Genetics and Molecular Biology 7, 1 (2008). 18 C Wang. 2007. Variational Bayesian approach to canonical correlation analysis. IEEE Transactions on Neu- Fo 19 ral Networks 18, 3 (2007), 905–910. 20 D Wang, L Shi, DS Yeung, and ECC Tsang. 2005. Nonlinear canonical correlation analysis of fMRI signals 21 using HDR models. In 2005 IEEE Engineering in Medicine and Biology 27th Annual Conference. IEEE, 5896–5899. 22 r GC Wang, N Lin, and B Zhang. 2013. Dimension reduction in functional regression using mixed data canon- 23 ical correlation analysis. Stat Interface 6 (2013), 187–196. 24 DS Watkins. 2004. Fundamentals of matrix computations. Vol. 64. John Wiley & Sons. Pe 25 FV Waugh. 1942. Regressions between sets of variables. Econometrica, Journal of the Econometric Society 26 (1942), 290–310. 27 DM Witten, R Tibshirani, and T Hastie. 2009. A penalized matrix decomposition, with applications to sparse 28 principal components and canonical correlation analysis. Biostatistics (2009), kxp008. er KW Wong, PCW Fung, and CC Lau. 1980. Study of the mathematical approximations made in the basis- 29 correlation method and those made in the canonical-transformation method for an interacting Bose gas. 30 Physical Review A 22, 3 (1980), 1272. 31 T Yamada and T Sugiyama. 2006. On the permutation test in canonical correlation analysis. Computational 32 statistics & data analysis 50, 8 (2006), 2111–2123. Re 33 H Yamamoto, H Yamaji, E Fukusaki, H Ohno, and H Fukuda. 2008. Canonical correlation analysis for mul- tivariate regression and its application to metabolic fingerprinting. Biochemical Engineering Journal 34 40, 2 (2008), 199–204. 35 Y-H Yuan, Q-S Sun, and H-W Ge. 2014. Fractional-order embedding canonical correlation analysis and its 36 applications to multi-view dimensionality reduction and recognition. Pattern Recognition 47, 3 (2014), vi 37 1411–1424. 38 Y-H Yuan, Q-S Sun, Q Zhou, and D-S Xia. 2011. A novel multiset integrated canonical correlation analysis framework and its application in feature fusion. Pattern Recognition 44, 5 (2011), 1031–1040. 39 ew B Zhang, J Hao, G Ma, J Yue, and Z Shi. 2014. Semi-paired probabilistic canonical correlation analysis. In 40 International Conference on Intelligent Information Processing. Springer, 1–10. 41 H Zou and T Hastie. 2005. Regularization and variable selection via the elastic net. Journal of the Royal 42 Statistical Society: Series B (Statistical Methodology) 67, 2 (2005), 301–320. 43 44 Received February 2017; revised –; accepted – 45 46 47 48 49 50 51 52 53 54 55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017. 56 57 58 59 60 Page 35 of 35 Computing Surveys 1 2 3 4 5 Online Appendix to: 6 A Tutorial on Canonical Correlation Methods 7 8 VIIVI UURTIO, Aalto University 9 JOÃO M. MONTEIRO, University College London 10 JAZ KANDOLA, Imperial College London 11 JOHN SHAWE-TAYLOR, University College London 12 DELMIRO FERNANDEZ-REYES, University College London 13 JUHO ROUSU, Aalto University 14 15 16 A. THIS IS AN EXAMPLE OF APPENDIX SECTION HEAD 17 B. APPENDIX SECTION HEAD 18 Fo 19 20 21 22 r 23 24 Pe 25 26 27 28 er 29 30 31 32 Re 33 34 35 36 vi 37 38 39 ew 40 41 42 43 44 45 46 47 48 49 50 51 52 c 2017 Copyright held by the owner/author(s). 0360-0300/2017/02-ART1 $15.00 53 DOI: 0000001.0000001 54 55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017. 56 57 58 59 60

References (92)

S Akaho. 2001. A Kernel Method For Canonical Correlation Analysis. In In Proceedings of the International Meeting of the Psychometric Society (IMPS2001.
Md A Alam, M Nasser, and K Fukumizu. 2008. Sensitivity analysis in robust and kernel canonical correla- tion analysis. In Computer and Information Technology, 2008. ICCIT 2008. 11th International Confer- ence on. IEEE, 399-404.
TW Anderson. 2003. An introduction to statistical multivariate analysis. (2003).
G Andrew, R Arora, J Bilmes, and K Livescu. 2013. Deep canonical correlation analysis. In International Conference on Machine Learning. 1247-1255.
C Archambeau and FR Bach. 2009. Sparse probabilistic projections. In Advances in neural information processing systems. 73-80.
C Archambeau, N Delannay, and M Verleysen. 2006. Robust probabilistic projections. In Proceedings of the 23rd International conference on machine learning. ACM, 33-40.
S Arlot, A Celisse, and others. 2010. A survey of cross-validation procedures for model selection. Statistics surveys 4 (2010), 40-79.
F Bach, R Jenatton, J Mairal, G Obozinski, and others. 2011. Convex optimization with sparsity-inducing norms. Optimization for Machine Learning 5 (2011).
FR Bach and MI Jordan. 2002. Kernel independent component analysis. Journal of machine learning re- search 3, Jul (2002), 1-48.
FR Bach and MI Jordan. 2005. A probabilistic interpretation of canonical correlation analysis. (2005).
MS Bartlett. 1938. Further aspects of the theory of multiple regression. In Mathematical Proceedings of the Cambridge Philosophical Society, Vol. 34. Cambridge Univ Press, 33-40.
MS Bartlett. 1941. The statistical significance of canonical correlations. Biometrika 32, 1 (1941), 29-37.
B Baur and S Bozdag. 2015. A canonical correlation analysis-based dynamic bayesian network prior to infer gene regulatory networks from multiple types of biological data. Journal of Computational Biology 22, 4 (2015), 289-299.
Å Björck and GH Golub. 1973. Numerical methods for computing angles between linear subspaces. Mathe- matics of computation 27, 123 (1973), 579-594.
MB Blaschko, CH Lampert, and A Gretton. 2008. Semi-supervised laplacian regularization of kernel canon- ical correlation analysis. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 133-145.
MW Browne. 2000. Cross-validation methods. Journal of mathematical psychology 44, 1 (2000), 108-132.
E Burg and J Leeuw. 1983. Non-linear canonical correlation. British journal of mathematical and statistical psychology 36, 1 (1983), 54-80.
J Cai. 2013. The distance between feature subspaces of kernel canonical correlation analysis. Mathematical and Computer Modelling 57, 3 (2013), 970-975.
1:30 L Cao, Z Ju, J Li, R Jian, and C Jiang. 2015. Sequence detection analysis based on canonical correlation for steady-state visual evoked potential brain computer interfaces. Journal of neuroscience methods 253 (2015), 10-17.
JD Carroll. 1968. Generalization of canonical correlation analysis to three or more sets of variables. In Proceedings of the 76th annual convention of the American Psychological Association, Vol. 3. 227-228.
B Chang, U Kr üger, R Kustra, and J Zhang. 2013. Canonical Correlation Analysis based on Hilbert-Schmidt Independence Criterion and Centered Kernel Target Alignment.. In ICML (2). 316-324.
X Chen, S Chen, H Xue, and X Zhou. 2012. A unified dimensionality reduction framework for semi-paired and semi-supervised multi-view data. Pattern Recognition 45, 5 (2012), 2005-2018.
X Chen, C He, and H Peng. 2014. Removal of muscle artifacts from single-channel EEG based on ensemble empirical mode decomposition and multiset canonical correlation analysis. Journal of Applied Mathe- matics 2014 (2014).
X Chen, H Liu, and JG Carbonell. 2012. Structured sparse canonical correlation analysis. In International Conference on Artificial Intelligence and Statistics. 199-207.
A Cichonska, J Rousu, P Marttinen, AJ Kangas, P Soininen, T Lehtim äki, OT Raitakari, M-R J ärvelin, V Salomaa, M Ala-Korpela, and others. 2016. metaCCA: Summary statistics-based multivariate meta- analysis of genome-wide association studies using canonical correlation analysis. Bioinformatics (2016), btw052.
N Cristianini, J Shawe-Taylor, and H Lodhi. 2002. Latent semantic kernels. Journal of Intelligent Informa- tion Systems 18, 2-3 (2002), 127-152.
R Cruz-Cano and MLT Lee. 2014. Fast regularized canonical correlation analysis. Computational Statistics & Data Analysis 70 (2014), 88-100.
J Dauxois and GM Nkiet. 1997. Canonical analysis of two Euclidean subspaces and its applications. Linear Algebra Appl. 264 (1997), 355-388.
DL Donoho and IM Johnstone. 1995. Adapting to unknown smoothness via wavelet shrinkage. Journal of the american statistical association 90, 432 (1995), 1200-1224.
RB Dunham and DJ Kravetz. 1975. Canonical correlation analysis in a predictive system. The Journal of Experimental Education 43, 4 (1975), 35-42.
C Eckart and G Young. 1936. The approximation of one matrix by another of lower rank. Psychometrika 1, 3 (1936), 211-218.
B Efron. 1979. Computers and the theory of statistics: thinking the unthinkable. SIAM review 21, 4 (1979), 460-480.
LM Ewerbring and FT Luk. 1989. Canonical correlations and generalized SVD: applications and new algo- rithms. In 32nd Annual Technical Symposium. International Society for Optics and Photonics, 206-222.
J Fang, D Lin, SC Schulz, Z Xu, VD Calhoun, and Y-P Wang. 2016. Joint sparse canonical correlation analysis for detecting differential imaging genetics modules. Bioinformatics 32, 22 (2016), 3480-3488.
Y Fujikoshi and LG Veitch. 1979. Estimation of dimensionality in canonical correlation analysis. Biometrika 66, 2 (1979), 345-351.
K Fukumizu, FR Bach, and A Gretton. 2007. Statistical consistency of kernel canonical correlation analysis. Journal of Machine Learning Research 8, Feb (2007), 361-383.
C Fyfe and PL Lai. 2000. Canonical correlation analysis neural networks. In Pattern Recognition, 2000. Proceedings. 15th International Conference on, Vol. 2. IEEE, 977-980.
GH Golub and CF Van Loan. 2012. Matrix computations. Vol. 3. JHU Press.
GH Golub and H Zha. 1995. The canonical correlations of matrix pairs and their numerical computation. In Linear algebra for signal processing. Springer, 27-49.
I Gonz ález, S Déjean, PGP Martin, O Gonc ¸alves, P Besse, and A Baccini. 2009. Highlighting relationships between heterogeneous biological data through graphical displays based on regularized canonical cor- relation analysis. Journal of Biological Systems 17, 02 (2009), 173-199.
BK Gunderson and RJ Muirhead. 1997. On estimating the dimensionality in canonical correlation analysis. Journal of Multivariate Analysis 62, 1 (1997), 121-136.
DR Hardoon, J Mourao-Miranda, M Brammer, and J Shawe-Taylor. 2007. Unsupervised analysis of fMRI data using kernel canonical correlation. NeuroImage 37, 4 (2007), 1250-1259.
DR Hardoon and J Shawe-Taylor. 2009. Convergence analysis of kernel canonical correlation analysis: the- ory and practice. Machine learning 74, 1 (2009), 23-38.
DR Hardoon and J Shawe-Taylor. 2011. Sparse canonical correlation analysis. Machine Learning 83, 3 (2011), 331-353.
DR Hardoon, S Szedmak, and J Shawe-Taylor. 2004. Canonical correlation analysis: An overview with ap- plication to learning methods. Neural computation 16, 12 (2004), 2639-2664.
MJR Healy. 1957. A rotation method for computing canonical correlations. Math. Comp. 11, 58 (1957), 83- 86.
C Heij and B Roorda. 1991. A modified canonical correlation approach to approximate state space modelling. In Decision and Control, 1991., Proceedings of the 30th IEEE Conference on. IEEE, 1343-1348.
AE Hoerl and RW Kennard. 1970. Ridge regression: Biased estimation for nonorthogonal problems. Techno- metrics 12, 1 (1970), 55-67.
JW Hooper. 1959. Simultaneous equations and canonical correlation theory. Econometrica: Journal of the Econometric Society (1959), 245-256.
CE Hopkins. 1969. Statistical analysis by canonical correlation: a computer application. Health services research 4, 4 (1969), 304.
P Horst. 1961. Relations among sets of measures. Psychometrika 26, 2 (1961), 129-149.
H Hotelling. 1935. The most predictable criterion. Journal of educational Psychology 26, 2 (1935), 139.
H Hotelling. 1936. Relations between two sets of variates. Biometrika 28, 3/4 (1936), 321-377.
WW Hsieh. 2000. Nonlinear canonical correlation analysis by neural networks. Neural Networks 13, 10 (2000), 1095-1105.
I Huopaniemi, T Suvitaival, J Nikkil ä, M Orešič, and S Kaski. 2010. Multivariate multi-way analysis of multi-source data. Bioinformatics 26, 12 (2010), i391-i398.
A Kabir, RD Merrill, AA Shamim, RDW Klemn, AB Labrique, P Christian, KP West Jr, and M Nasser. 2014. Canonical correlation analysis of infant's size at birth and maternal factors: a study in rural Northwest Bangladesh. PloS one 9, 4 (2014), e94243.
M Kang, B Zhang, X Wu, C Liu, and J Gao. 2013. Sparse generalized canonical correlation analysis for biological model integration: a genetic study of psychiatric disorders. In Engineering in Medicine and Biology Society (EMBC), 2013 35th Annual International Conference of the IEEE. IEEE, 1490-1493.
JR Kettenring. 1971. Canonical analysis of several sets of variables. Biometrika (1971), 433-451.
A Kimura, M Sugiyama, T Nakano, H Kameoka, H Sakano, E Maeda, and K Ishiguro. 2013. SemiCCA: Efficient semi-supervised learning of canonical correlations. Information and Media Technologies 8, 2 (2013), 311-318.
A Klami and S Kaski. 2007. Local dependent components. In Proceedings of the 24th international conference on Machine learning. ACM, 425-432.
A Klami, S Virtanen, and S Kaski. 2012. Bayesian exponential family projections for coupled data sources. arXiv preprint arXiv:1203.3489 (2012).
A Klami, S Virtanen, and S Kaski. 2013. Bayesian canonical correlation analysis. Journal of Machine Learn- ing Research 14, Apr (2013), 965-1003.
D Krstajic, LJ Buturovic, DE Leahy, and S Thomas. 2014. Cross-validation pitfalls when selecting and assessing regression and classification models. Journal of cheminformatics 6, 1 (2014), 1.
PL Lai and C Fyfe. 1999. A neural implementation of canonical correlation analysis. Neural Networks 12, 10 (1999), 1391-1397.
PL Lai and C Fyfe. 2000. Kernel and nonlinear canonical correlation analysis. International Journal of Neural Systems 10, 05 (2000), 365-377.
NB Larson, GD Jenkins, MC Larson, RA Vierkant, TA Sellers, CM Phelan, JM Schildkraut, R Sutphen, PPD Pharoah, S A Gayther, and others. 2014. Kernel canonical correlation analysis for assessing gene- gene interactions and application to ovarian cancer. European Journal of Human Genetics 22, 1 (2014), 126-131.
SC Larson. 1931. The shrinkage of the coefficient of multiple correlation. Journal of Educational Psychology 22, 1 (1931), 45.
H-S Lee. 2007. Canonical correlation analysis using small number of samples. Communications in Statistic- sSimulation and Computation R 36, 5 (2007), 973-985.
SE Leurgans, RA Moyeed, and BW Silverman. 1993. Canonical correlation analysis when the data are curves. Journal of the Royal Statistical Society. Series B (Methodological) (1993), 725-740.
H Lindsey, JT Webster, and S Halpern. 1985. Canonical Correlation as a Discriminant Tool in a Periodontal Problem. Biometrical journal 27, 3 (1985), 257-264.
P Marttinen, J Gillberg, A Havulinna, J Corander, and S Kaski. 2013. Genome-wide association studies with high-dimensional phenotypes. Statistical applications in genetics and molecular biology 12, 4 (2013), 413-431.
XM Tu. 1991. A bootstrap resampling scheme for using the canonical correlation technique in rank estima- tion. Journal of chemometrics 5, 4 (1991), 333-343.
XM Tu, DS Burdick, DW Millican, and LB McGown. 1989. Canonical correlation technique for rank estima- tion of excitation-emission matrixes. Analytical Chemistry 61, 19 (1989), 2219-2224.
V Uurtio, M Bomberg, K Nybo, M It ävaara, and J Rousu. 2015. Canonical correlation methods for exploring microbe-environment interactions in deep subsurface. In International Conference on Discovery Science. Springer, 299-307.
JP Van de Geer. 1984. Linear relations amongk sets of variables. Psychometrika 49, 1 (1984), 79-94.
T Van Gestel, JAK Suykens, J De Brabanter, B De Moor, and J Vandewalle. 2001. Kernel canonical cor- relation analysis and least squares support vector machines. In International Conference on Artificial Neural Networks. Springer, 384-389.
HD Vinod. 1976. Canonical ridge and econometrics of joint production. Journal of Econometrics 4, 2 (1976), 147-166.
S Waaijenborg, PC Verselewel de Witt Hamer, and AH Zwinderman. 2008. Quantifying the association between gene expressions and DNA-markers by penalized canonical correlation analysis. Statistical Applications in Genetics and Molecular Biology 7, 1 (2008).
C Wang. 2007. Variational Bayesian approach to canonical correlation analysis. IEEE Transactions on Neu- ral Networks 18, 3 (2007), 905-910.
D Wang, L Shi, DS Yeung, and ECC Tsang. 2005. Nonlinear canonical correlation analysis of fMRI signals using HDR models. In 2005 IEEE Engineering in Medicine and Biology 27th Annual Conference. IEEE, 5896-5899.
GC Wang, N Lin, and B Zhang. 2013. Dimension reduction in functional regression using mixed data canon- ical correlation analysis. Stat Interface 6 (2013), 187-196.
DS Watkins. 2004. Fundamentals of matrix computations. Vol. 64. John Wiley & Sons.
FV Waugh. 1942. Regressions between sets of variables. Econometrica, Journal of the Econometric Society (1942), 290-310.
DM Witten, R Tibshirani, and T Hastie. 2009. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics (2009), kxp008.
KW Wong, PCW Fung, and CC Lau. 1980. Study of the mathematical approximations made in the basis- correlation method and those made in the canonical-transformation method for an interacting Bose gas. Physical Review A 22, 3 (1980), 1272.
T Yamada and T Sugiyama. 2006. On the permutation test in canonical correlation analysis. Computational statistics & data analysis 50, 8 (2006), 2111-2123.
H Yamamoto, H Yamaji, E Fukusaki, H Ohno, and H Fukuda. 2008. Canonical correlation analysis for mul- tivariate regression and its application to metabolic fingerprinting. Biochemical Engineering Journal 40, 2 (2008), 199-204.
Y-H Yuan, Q-S Sun, and H-W Ge. 2014. Fractional-order embedding canonical correlation analysis and its applications to multi-view dimensionality reduction and recognition. Pattern Recognition 47, 3 (2014), 1411-1424.
Y-H Yuan, Q-S Sun, Q Zhou, and D-S Xia. 2011. A novel multiset integrated canonical correlation analysis framework and its application in feature fusion. Pattern Recognition 44, 5 (2011), 1031-1040.
B Zhang, J Hao, G Ma, J Yue, and Z Shi. 2014. Semi-paired probabilistic canonical correlation analysis. In International Conference on Intelligent Information Processing. Springer, 1-10.
H Zou and T Hastie. 2005. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67, 2 (2005), 301-320.
Received February 2017; revised -; accepted - ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017.

About the author

Delmiro Fernandez-Reyes

University College London, Faculty Member

Papers

Followers

View all papers from Delmiro Fernandez-Reyesarrow_forward

A Tutorial on Canonical Correlation Methods

Sign up for access to the world's latest research

Abstract

Related papers

References (92)

Related papers

Related topics

Cited by