Computing Surveys
A Tutorial on Canonical Correlation Methods
Journal: Computing Surveys
Manuscript ID CSUR-2017-0065.R1
Fo
Paper: Tutorial Paper
Date Submitted by the Author: n/a
r
Complete List of Authors: Uurtio, Viivi; Aalto University Helsinki Institute for Information Technology,
Computer Science
Pe
M. Monteiro, João; University College London, Department of Computer
Science
Kandola, Jaz; Imperial College London
Shawe-Taylor, John; University College London, Department of Computer
er
Science
Fernandez-Reyes, Delmiro; University College London, Department of
Computer Science
Rousu, Juho; Aalto University Helsinki Institute for Information Technology,
Computer Science
Re
Computing Classification • Computing methodologies~Dimensionality reduction and manifold
Systems: learning
vi
ew
Page 1 of 35 Computing Surveys
1
2
3
4
5
6
7
8
9
10
11
12
13 Editorial Office
14 ACM Computing Surveys
15
16
17
18
19
20
Fo
21
July 7, 2017
22
23
24 Dear Prof. Sartaj Sahni,
rP
25
26 we thank the Associate Editor and the referees for taking their time to provide detailed
27 and constructive comments on the submitted paper ”A Tutorial on Canonical Correla-
28 tion Methods”. We have revised the paper according to the critiques. Please find the
29
ee
30 detailed comments in response to the reviewers’ critiques in the attachment.
31
32 Best regards,
rR
33
34
35 Viivi Uurtio, Aalto University
36
Joao M. Monteiro, University College London
37
ev
38 Jaz Kandola, Imperial College London
39 John Shawe-Taylor, University College London
40 Delmiro Fernandez-Reyes, University College London
41
iew
Juho Rousu, Aalto University
42
43
44 encl: Author response to the Reviews
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
Computing Surveys Page 2 of 35
1
2
3 1
4
5 A Tutorial on Canonical Correlation Methods
6
7 VIIVI UURTIO, Aalto University
JOÃO M. MONTEIRO, University College London
8
JAZ KANDOLA, Imperial College London
9 JOHN SHAWE-TAYLOR, University College London
10 DELMIRO FERNANDEZ-REYES, University College London
11 JUHO ROUSU, Aalto University
12
13
14 Canonical correlation analysis is a family of multivariate statistical methods for the analysis of paired sets
15 of variables. Since its proposition, canonical correlation analysis has for instance been extended to extract
16 relations between two sets of variables when the sample size is insufficient in relation to the data dimen-
17 sionality, when the relations have been considered to be non-linear, and when the dimensionality is too large
18 for human interpretation. This tutorial explains the theory of canonical correlation analysis including its
Fo
19 regularised, kernel, and sparse variants. Additionally, the deep and Bayesian CCA extensions are briefly
reviewed. Together with the numerical examples, this overview provides a coherent compendium on the ap-
20 plicability of the variants of canonical correlation analysis. By bringing together techniques for solving the
21 optimisation problems, evaluating the statistical significance and generalisability of the canonical correla-
22 tion model, and interpreting the relations, we hope that this article can serve as a hands-on tool for applying
r
23 canonical correlation methods in data analysis.
24 CCS Concepts: rComputing methodologies → Dimensionality reduction and manifold learning;
Pe
25 General Terms: Multivariate Statistical Analysis, Machine Learning, Statistical Learning Theory
26
Additional Key Words and Phrases: Canonical Correlation, Regularisation, Kernel Methods, Sparsity
27
28 ACM Reference Format:
Viivi Uurtio, João M. Monteiro, Jaz Kandola, John Shawe-Taylor, Delmiro Fernandez-Reyes, and Juho
er
29
Rousu, 2017. A Tutorial on Canonical Correlation Methods. ACM Comput. Surv. 1, 1, Article 1 (February
30 2017), 33 pages.
31 DOI: 0000001.0000001
32
Re
33 1. INTRODUCTION
34 When a process can be described by two sets of variables corresponding to two differ-
35 ent aspects, or views, analysing the relations between these two views may improve
36 the understanding of the underlying system. In this context, a relation is a mapping
vi
37 of the observations corresponding to a variable of one view to the observations corre-
38 sponding to a variable of the other view. For example in the field of medicine, one view
39 could comprise variables corresponding to the symptoms of the disease and the other
ew
40 to the risk factors that can have an effect on the disease incidence. Identifying the
41 relations between the symptoms and the risk factors can improve the understanding
42 of the disease exposure and give indications for prevention and treatment. Examples
43
44 Author’s addresses: V. Uurtio and J. Rousu, Helsinki Institute for Information Technology HIIT, Department
of Computer Science, Aalto University, Konemiehentie 2, 02150 Espoo, Finland; J. M. Monteiro, Department
45 of Computer Science, University College London, and Max Planck Centre for Computational Psychiatry and
46 Ageing Research, University College London, Gower Street, London WC1E 6BT, UK; J. Shawe-Taylor and
47 D. Fernandez-Reyes, Department of Computer Science, University College London, Gower Street, London
48 WC1E 6BT, UK; J. Kandola, Division of Brain Sciences, Imperial College London, DuCane Road, London
WC12 0NN.
49 Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted
50 without fee provided that copies are not made or distributed for profit or commercial advantage and that
51 copies bear this notice and the full citation on the first page. Copyrights for third-party components of this
work must be honored. For all other uses, contact the owner/author(s).
52
c 2017 Copyright held by the owner/author(s). 0360-0300/2017/02-ART1 $15.00
53 DOI: 0000001.0000001
54
55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017.
56
57
58
59
60
Page 3 of 35 Computing Surveys
1
2
3 1:2
4
5 of these kind of two-view settings, where the analysis of the relations could provide
6 new information about the functioning of the system, occur in several other fields of
7 science. These relations can be determined by means of canonical correlation methods
8 that have been developed specifically for this purpose.
9 Since the proposition of canonical correlation analysis (CCA) by H. Hotelling
10 [Hotelling 1935; 1936], relations between variables have been explored in various
11 fields of science. CCA was first applied to examine the relation of wheat characteristics
12 to flour characteristics in an economics study by F. Waugh in 1942 [Waugh 1942]. Since
then, studies in the fields of psychology [Hopkins 1969; Dunham and Kravetz 1975],
13
geography [Monmonier and Finn 1973], medicine [Lindsey et al. 1985], physics [Wong
14 et al. 1980], chemistry [Tu et al. 1989], biology [Sullivan 1982], time-series modeling
15 [Heij and Roorda 1991], and signal processing [Schell and Gardner 1995] constitute
16 examples of the early application fields of CCA.
17 In the beginning of the 21st century, the applicability of CCA has been demonstrated
18 in modern fields of science such as neuroscience, machine learning and bioinformat-
Fo
19 ics. Relations have been explored for developing brain-computer interfaces [Cao et al.
20 2015; Nakanishi et al. 2015] and in the field imaging genetics [Fang et al. 2016]. CCA
21 has also been applied for feature selection [Ogura et al. 2013], feature extraction and
22 fusion [Shen et al. 2013], and dimension reduction [Wang et al. 2013]. Examples of
r
23 application studies conducted in the fields of bioinformatics and computational biol-
24 ogy include [Rousu et al. 2013; Seoane et al. 2014; Baur and Bozdag 2015; Sarkar
and Chakraborty 2015; Cichonska et al. 2016]. The vast range of application domains
Pe
25
26 emphasises the utility of CCA in extracting relations between variables.
27 Originally, CCA was developed to extract linear relations in overdetermined set-
tings, that is when the number of observations exceeds the number of variables in
28
either view. To extend CCA to underdetermined settings that often occur in modern
er
29
data analysis, methods of regularisation have been proposed. When the sample size
30 is small, Bayesian CCA also provides an alternative to perform CCA. The applicabil-
31 ity of CCA to underdetermined settings has been further improved through sparsity-
32 inducing norms that facilitate the interpretation of the final result. Kernel methods
Re
33 and neural networks have been introduced for uncovering non-linear relations. At
34 present, canonical correlation methods can be used to extract linear and non-linear
35 relations in both over- and underdetermined settings.
36 In addition to the already described variants of CCA, alternative extensions have
vi
37 been proposed, such as the semi-paired and multi-view CCA. In general, CCA algo-
38 rithms assume one-to-one correspondence between the observations in the views, in
39 other words, the data is assumed to be paired. However, in real datasets some of
ew
40 the observations may be missing in either view, which means that the observations
41 are semi-paired. Examples of semi-paired CCA algorithms comprise [Blaschko et al.
42 2008], [Kimura et al. 2013], [Chen et al. 2012], and [Zhang et al. 2014]. CCA has also
43 been extended to more than two views by [Horst 1961], [Carroll 1968], [Kettenring
1971], and [Van de Geer 1984]. In multi-view CCA the relations are sought among
44
more than two views. Some of the modern extensions of multi-view CCA comprise its
45 regularised [Tenenhaus and Tenenhaus 2011], kernelised [Tenenhaus et al. 2015], and
46 sparse [Tenenhaus et al. 2014] variants. Application studies of multi-view CCA and its
47 modern variants can be found in neuroscience [Kang et al. 2013], [Chen et al. 2014],
48 feature fusion [Yuan et al. 2011] and dimensionality reduction [Yuan et al. 2014]. How-
49 ever, both the semi-paired and multi-view CCA are beyond the scope of this tutorial.
50 This tutorial begins with an introduction to the original formulation of CCA. The
51 basic framework and statistical assumptions are presented. The techniques for solv-
52 ing the CCA optimisation problem are discussed. After solving the CCA problem, the
53 approaches to interpret and evaluate the result are explained. The variants of CCA
54
55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017.
56
57
58
59
60
Computing Surveys Page 4 of 35
1
2
3 A Tutorial on Canonical Correlation Methods 1:3
4
5 are illustrated using worked examples. Of the extended versions of CCA, the tuto-
6 rial concentrates on the topics of regularised, kernel, and sparse CCA. Additionally,
7 the deep and Bayesian CCA variants are briefly reviewed. This tutorial acquaints the
8 reader with canonical correlation methods, discusses where they are applicable and
9 what kind of information can be extracted.
10
11 2. CANONICAL CORRELATION ANALYSIS
12 2.1. The Basic Principles of CCA
13 CCA is a two-view multivariate statistical method. In multivariate statistical analysis,
14 the data comprises multiple variables measured on a set of observations or individu-
15 als. In the case of CCA, the variables of an observation can be partitioned into two
16 sets that can be seen as the two views of the data. This can be illustrated using the
17 following notations. Let the views a and b be denoted by the matrices Xa and Xb , of
18 sizes n×p and n×q respectively. The row vectors xka ∈ Rp and xkb ∈ Rq for k = 1, 2, . . . , n
Fo
19 denote the sets of empirical multivariate observations in Xa and Xb respectively. The
20 observations are assumed to be jointly sampled from a normal multivariate distribu-
21 tion. A reason for this is that the normal multivariate model approximates well the
distribution of continuous measurements in several sampled distributions [Anderson
22
2003]. The column vectors ai ∈ Rn for i = 1, 2, . . . , p and bj ∈ Rn for j = 1, 2, . . . , q de-
r
23
note the variable vectors of the n observations respectively. The inner product between
24 two vectors is either denoted by ha, bi or aT b. Throughout this tutorial, we assume
Pe
25 that the variables are standardised to zero mean and unit variance. In CCA, the aim
26 is to extract the linear relations between the variables of Xa and Xb .
27 CCA is based on linear transformations. We consider the following transformations
28
er
29 Xa wa = za and Xb wb = zb
30
31 where Xa ∈ Rn×p , wa ∈ Rp , za ∈ Rn , Xb ∈ Rn×q , wb ∈ Rq , and zb ∈ Rn . The data
32 matrices Xa and Xb represent linear transformations of the positions wa and wb onto
the images za and zb in the space Rn . The positions wa and wb are often referred to as
Re
33
canonical weight vectors and the images za and zb are also termed as canonical vari-
34
ates or scores. The constraints of CCA on the mappings are that the position vectors
35 of the images za and zb are unit norm vectors and that the enclosing angle, θ ∈ [0, π2 ]
36 [Golub and Zha 1995; Dauxois and Nkiet 1997], between za and zb is minimised. The
vi
37 cosine of the angle, also referred to as the canonical correlation, between the images
38 za and zb is given by the formula cos(za , zb ) = hza , zb i/||za ||||zb || and due to the unit
39 norm constraint cos(za , zb ) = hza , zb i. Hence the basic principle of CCA is to find two
ew
40 positions wa ∈ Rp and wb ∈ Rq that after the linear transformations Xa ∈ Rn×p and
41 Xb ∈ Rn×q are mapped onto an n-dimensional unit ball and located in such a way that
42 the cosine of the angle between the position vectors of their images za ∈ Rn and zb ∈ Rn
43 is maximised.
44 The images za and zb of the positions wa and wb that result in the smallest angle, θ1 ,
45 determine the first canonical correlation which equals cos θ1 [Björck and Golub 1973].
46 The smallest angle is given by
47
cos θ1 = max hza , zb i,
48 za ,zb ∈Rn
(1)
49 ||za ||2 = 1 ||zb ||2 = 1
50
51 Let the maximum be obtained by z1a and z1b . The pair of images z2a and z2b , that has
52 the second smallest enclosing angle θ2 , is found in the orthogonal complements of z1a
53 and z1b . The procedure is continued until no more pairs are found. Hence the r angles
54
55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017.
56
57
58
59
60
Page 5 of 35 Computing Surveys
1
2
3 1:4
4
5 θr ∈ [0, π2 ] for r = 1, 2, · · · , q when p > q that can be found are recursively defined by
6
7 cos θr = max hzra , zrb i,
za ,zb ∈Rn
8
9 ||zra ||2 = 1 ||zrb ||2 = 1
10 hzra , zja i = 0 hzrb , zjb i = 0,
11 ∀j 6= r : j, r = 1, 2, . . . , min(p, q).
12
13 The number of canonical correlations, r, corresponds to the dimensionality of CCA.
14 Qualitatively, the dimensionality of CCA can be also seen as the number of patterns
15 that can be extracted from the data.
16 When the dimensionality of CCA is large, it may not be relevant to solve all the posi-
17 tions wa and wb and images za and zb . In general, the value of the canonical correlation
18 and the statistical significance are considered to convey the importance of the pattern.
Fo
19 The first estimation strategy for finding the number of statistically significant canoni-
20 cal correlation coefficients was proposed in [Bartlett 1941]. The techniques have been
21 further developed in [Fujikoshi and Veitch 1979; Tu 1991; Gunderson and Muirhead
22 1997; Yamada and Sugiyama 2006; Lee 2007; Sakurai 2009].
r
23 In summary, the principle behind CCA is to find two positions in the two data spaces
24 respectively that have images on a unit ball such that the angle between them is min-
imised and consequently the canonical correlation is maximised. The linear transfor-
Pe
25
mations of the positions are given by the data matrices. The number of relevant po-
26
sitions can be determined by analysing the values of the canonical correlations or by
27 applying statistical significance tests.
28
er
29
2.2. Finding the positions and the images in CCA
30
31 The position vectors wa and wb having images za and zb in the new coordinate system
32 of a unit ball that have a maximum cosine of the angle in between can be obtained
using techniques of functional analysis. The eigenvalue-based methods comprise solv-
Re
33
ing a standard eigenvalue problem, as originally proposed by Hotelling in [Hotelling
34
1936], or a generalised eigenvalue problem [Bach and Jordan 2002; Hardoon et al.
35 2004]. Alternatively, the positions and the images can be found using the singular
36 value decomposition (SVD), as introduced in [Healy 1957]. The techniques can be con-
vi
37 sidered as standard ways of solving the CCA problem.
38
39 Solving CCA Through the Standard Eigenvalue Problem. In the technique of
ew
40 Hotelling, both the positions wa and wb and the images za and zb are obtained by
41 solving a standard eigenvalue problem. The Lagrange multiplier technique [Hotelling
42 1936; Hooper 1959] is employed to obtain the characteristic equation. Let Xa and Xb
43 denote the data matrices of sizes n × p and n × q respectively. The sample covariance
1
44 matrix Cab between the variable column vectors in Xa and Xb is Cab = n−1 XaT Xb .
45 The empirical variance matrices between the variables in Xa and Xb are given by
1 1
46 Caa = n−1 XaT Xa and Cbb = n−1 XbT Xb respectively. The joint covariance matrix is then
47
48 Caa Cab
49 . (2)
Cba Cbb
50
51 The first and greatest canonical correlation that corresponds to the smallest angle is
52 between the first pair of images za = Xa wa and zb = Xa wb . Since the correlation
53 between za and zb does not change with the scaling of za and zb , we can constrain wa
54
55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017.
56
57
58
59
60
Computing Surveys Page 6 of 35
1
2
3 A Tutorial on Canonical Correlation Methods 1:5
4
5 and wb to be such that za and zb have unit variance. This is given by
6
zTa za =waT XaT Xa wa = waT Caa wa = 1, (3)
7
8 zTb zb =wbT XbT Xb wb = wbT Cbb wb = 1. (4)
9 Due to the normality assumption and comparability, the variables of Xa and Xb should
10 be centered such that they have zero means. In this case, the covariance between za
11 and zb is given by
12
13 zTa zb = waT XaT Xb wb = waT Cab wb . (5)
14 Substituting (5), (3) and (4) into the algebraic problem (1), we obtain:
15
cos θ = max n hza , zb i = max waT Cab wb ,
16 za ,zb ∈R wa ∈Rp ,wb ∈Rq
17 q q
18 ||za ||2 = waT Caa wa = 1 ||zb ||2 = wbT Cbb wb = 1.
Fo
19
In general, the constraints (3) and (4) are expressed in squared form, waT Caa wa = 1
20
and wbT Cbb wb = 1. The problem can be solved using the Lagrange multiplier technique.
21 Let
22 ρ1 ρ2
r
23 L = waT Cab wb − (waT Caa wa − 1) − (wbT Cbb wb − 1) (6)
2 2
24
where ρ1 and ρ2 denote the Lagrange multipliers. Differentiating L with respect to wa
Pe
25
26 and wb gives
27 δL
28 = Cab wb − ρ1 Caa wa = 0 (7)
δwa
er
29 δL
30 = Cba wa − ρ2 Cbb wb = 0 (8)
δwb
31
32 Multiplying (7) from the left by waT and (8) from the left by wbT gives
Re
33
waT Cab wb − ρ1 waT Caa wa = 0
34
35 wbT Cba wa − ρ2 wbT Cbb wb = 0.
36
vi
37
38 Since waT Caa wa = 1 and wbT Cbb wb = 1, we obtain that
39 ρ1 = ρ2 = ρ. (9)
ew
40
41 Substituting (9) into Equation (7) we obtain
42 −1
Caa Cab wb
43 wa = . (10)
ρ
44
45 Substituting (10) into (8) we obtain
46 1 −1
47 Cba Caa Cab wb − ρCbb wb = 0
ρ
48
49 which is equivalent to the generalised eigenvalue problem of the form
−1
50 Cba Caa Cab wb = ρ2 Cbb wb .
51
If Cbb is invertible, the problem reduces to a standard eigenvalue problem of the form
52
−1 −1
53 Cbb Cba Caa Cab wb = ρ2 wb .
54
55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017.
56
57
58
59
60
Page 7 of 35 Computing Surveys
1
2
3 1:6
4
5 −1 −1
The eigenvalues of the matrix Cbb Cba Caa Cab are found by solving the characteristic
6 equation
7 −1 −1
|Cbb Cba Caa Cab − ρ2 I| = 0.
8
9 The square roots of the eigenvalues correspond to the canonical correlations. The tech-
10 nique of solving the standard eigenvalue problem is shown in Example 2.1.
11 Example 2.1. We generate two data matrices Xa and Xb of sizes n × p and n ×
12 q, where n = 60, p = 4 and q = 3, respectively as follows. The variables of Xa are
13 generated from a random univariate normal distribution, a1 , a2 , a3 , a4 ∼ N (0, 1). We
14 generate the following linear relations
15
16 b1 = a3 + ξ 1
17 b2 = a1 + ξ 2
18 b3 = −a4 + ξ 3
Fo
19
where ξ 1 ∼ N (0, 0.2), ξ 2 ∼ N (0, 0.4), and ξ 3 ∼ N (0, 0.3) denote vectors of normal noise.
20
The data is standardised such that every variable has zero mean and unit variance.
21
The joint covariance matrix C in (2) of the generated data is given by
22
r
23 1.00 0.34 −0.11 0.21 −0.10 0.92 −0.21
24 0.34 1.00 −0.08 0.03 −0.10 0.34 0.06
−0.11 −0.08 1.00 −0.30 0.98 −0.03 0.30
Pe
25
Caa Cab
C = 0.21 0.03 −0.30 1.00 −0.25 0.12 −0.94 = .
26 −0.10 −0.10 0.98 −0.25 1.00 −0.03 0.25 Cba Cbb
27
0.92 0.34 −0.03 0.12 −0.03 1.00 −0.13
28
−0.21 0.06 0.30 −0.94 0.25 −0.13 1.00
er
29
30 Now we compute the eigenvalues of the characteristic equation
31 −1 −1
32 |Cbb Cba Caa Cab − ρ2 I| = 0.
Re
33 The square roots of the eigenvalues of Cbb −1 −1
Cba Caa Cab are ρ1 = 0.99, ρ2 = 0.94, and
34 ρ3 = 0.92. The eigenvectors wb satisfy the equation
35 −1 −1
36 (Cbb Cba Caa Cab − ρ2 I)wb = 0.
vi
37 Hence we obtain
38
−0.97 −0.39 0.19
! ! !
39 wb1 = −0.04 wb2 = −0.37 wb3 = −0.86
ew
40 −0.22 0.85 −0.46
41
42 and wa vectors satisfy
43
−0.04
−0.41
−0.84
44 −1
C Cab wb 1
−0.00 2 −1
Caa Cab wb2 0.09 3 −1 3
C Cab wb −0.10
45 wa1 = aa = w = = w = aa = .
ρ1 −0.99 a ρ2 −0.41 a ρ3 0.14
46 0.18 −0.83 0.52
47
48 The vectors wb1 , wb2 , and wb3 and wa1 , wa2 , and wa3 correspond to the pairs of positions
49 (wa1 , wb1 ), (wa2 , wb2 ) and (wa3 , wb3 ) that have the images (z1a , z1b ), (z2a , z2b ) and (z3a , z3b ). In
50 linear CCA, the canonical correlations equal to the square roots of the eigenvalues,
that is hz1a , z1b i = 0.99, hz2a , z2b i = 0.94, and hz3a , z3b i = 0.92.
51
52 Solving CCA Through the Generalised Eigenvalue Problem. The positions wa and
53 wb and their images za and zb can also be solved through a generalised eigenvalue
54
55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017.
56
57
58
59
60
Computing Surveys Page 8 of 35
1
2
3 A Tutorial on Canonical Correlation Methods 1:7
4
5 problem [Bach and Jordan 2002; Hardoon et al. 2004]. The equations in (7) and (8) can
6 be represented as simultaneous equations
7 Cab wb =ρCaa wa
8
9 Cba wa =ρCbb wb
10 that are equivalent to
11
0 Cab wa Caa 0 wa
12 =ρ . (11)
Cba 0 wb 0 Cbb wb
13
14 The equation (11) represents a generalised eigenvalue problem of the form βAx = αBx
15 where the pair (β, α) = (1, α) is an eigenvalue of the pair (A, B) [Saad 2011; Golub
16 and Van Loan 2012]. The pair of matrices A ∈ R(p+q)×(p+q) and B ∈ R(p+q)×(p+q)
17 is also referred to as matrix pencil. In particular, A is symmetric and B is symmet-
18 ric positive-definite. The pair (A, B) is then called the symmetric pair. As shown in
Fo
19 [Watkins 2004], a symmetric pair has real eigenvalues and (p+q) linearly independent
20 eigenvectors. To express the generalised eigenvalue problem in the form Ax = ρBx, the
21 generalised eigenvalue is given by ρ = α β . Since the generalised eigenvalues come in
22 pairs {ρ1 , −ρ1 , ρ2 , −ρ2 , . . . , ρp , −ρp , 0} where p < q, the positive generalised eigenvalues
r
23 correspond to the canonical correlations.
24 Example 2.2. Using the data in Example 2.1, we apply the formulation of the gen-
Pe
25 eralised eigenvalue problem to obtain the positions wa and wb . The resulting gener-
26 alised eigenvalues are
27
{0.99, 0.94, 0.92, 0.00, −0.92, −0.94, −0.99}.
28
er
29 The generalised eigenvectors that correspond to the positive generalised eigenvalues
30 in descending order are
31
−0.04
0.48
−0.97
32 −0.00 2 −0.11 3 −0.11
wa1 =
Re
33 w = w =
−1.00 a 0.48 a 0.16
34 0.18 0.98 0.60
35
36 −0.98
!
0.46
!
0.22
!
vi
37 wb1 = −0.04 wb2 = 0.43 wb3 = −1.00
38 −0.23 −1.00 −0.54
39
The vectors wa1 , wa2 , and wa3 and wb1 , wb2 , and wb3 correspond to the pairs of positions
ew
40
(wa1 , wb1 ), (wa2 , wb2 ) and (wa3 , wb3 ). The canonical correlations are hz1a , z1b i = 0.99, hz2a , z2b i =
41
0.94, and hz3a , z3b i = 0.92.
42 The entries of the position pairs differ to some extent from the solutions to the stan-
43 dard eigenvalue problem in the Example 2.1. This is due to the numerical algorithms
44 that are applied to solve the eigenvalues and eigenvectors. Additionally, the signs may
45 also be opposite. This can be seen when comparing the second pairs of positions with
46 the Example 2.1. This results from the symmetric nature of CCA.
47
Solving CCA Using the SVD. The technique of applying the SVD to solve the CCA
48
problem was first introduced by [Healy 1957] and described by [Ewerbring and Luk
49
1989] as follows. First, the covariance matrices Caa and Cbb are transformed into iden-
50 tity forms. Due to the symmetric positive definite property, the square root factors of
51 the matrices can be found using a Cholesky or eigenvalue decomposition:
52
1/2 1/2 1/2 1/2
53 Caa = Caa Caa and Cbb = Cbb Cbb .
54
55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017.
56
57
58
59
60
Page 9 of 35 Computing Surveys
1
2
3 1:8
4
5 Applying the inverses of the square root factors symmetrically on the joint covariance
6 matrix in (2) we obtain
7 −1/2
! −1/2
!
−1/2 −1/2
!
8 Caa 0 Caa Cab Caa 0 Iq Caa Cab Cbb
−1/2 −1/2 = −1/2 −1/2 .
9 0 Cbb Cba Cbb 0 Cbb Cbb Cba Caa Ip
10 The position vectors wa and wb can hence be obtained by solving the following SVD
11
−1/2 −1/2
12 Caa Cab Cbb = U T SV (12)
13
where the columns of the matrices U and V correspond to the sets of orthonormal left
14 and right singular vectors respectively. The singular values of matrix S correspond to
15 the canonical correlations. The positions wa and wb are obtained from
16
−1/2 −1/2
17 wa = Caa U wb = Cbb V
18 The method is shown in Example 2.3.
Fo
19
20 Example 2.3. The method of solving CCA using the SVD is demonstrated using the
21 data of Example 2.1. We compute the matrix
22
−0.02 0.90 −0.06
r
23 −1/2 −1/2 −0.07 0.20 0.11
24 Caa Cab Cbb = .
0.98 0.04 0.04
Pe
25 0.01 −0.02 −0.93
26
The SVD gives
27
−1/2 −1/2
28 Caa Cab Cbb =
er
29 ! 0.99 0.00
0.00
−0.03 −0.03 0.95 −0.30 0.95 −0.29 0.15
!
30 0.00 0.94 0.00
31 −0.47 0.03 −0.28 0.84 0.01 −0.44 −0.90 .
0.00 0.00 0.92
32 −0.86 −0.26 0.11 0.44 0.33 0.85 −0.41
} 0.00 0.00 0.00 |
Re
| {z {z }
33 UT
| {z } V
S
34
35 The singular values of the matrix S correspond to the canonical correlations. The po-
36 sitions wa and wb are given by
vi
37
0.04
−0.43
−0.91
38 −1/2 1 0.00 2 −1/2 2 0.10 3 −1/2 3 −0.10
39 wa1 = Caa u = w = Caa u = w = Caa u =
0.94 a −0.43 a 0.14
ew
40 −0.17 −0.87 0.56
41
42 0.93
!
−0.40
!
0.21
!
−1/2 −1/2 2 −1/2 3
43 wb1 = Cbb v1 = 0.04 wb2 = Cbb v = −0.38 wb3 = Cbb v = −0.93
44 0.21 0.89 −0.50
45
46 where ui and vi for i = 1, 2, 3 correspond to the left and right singular vectors.
47 The vectors wa1 , wa2 , and wa3 and wb1 , wb2 , and wb3 correspond to the pairs of posi-
tions (wa1 , wb1 ), (wa2 , wb2 ) and (wa3 , wb3 ). The canonical correlations are hz1a , z1b i = 0.99,
48
hz2a , z2b i = 0.94, and hz3a , z3b i = 0.92.
49
50 The main motivation for improving the eigenvalue-based technique was the compu-
51 tational complexity. The standard and generalised eigenvalue methods scale with the
52 cube of the input matrix dimension, in other words, the time complexity is O(n3 ), for
−1/2 −1/2
53 a matrix of size n × n. The input matrix Caa Cab Cbb in the SVD-based technique is
54
55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017.
56
57
58
59
60
Computing Surveys Page 10 of 35
1
2
3 A Tutorial on Canonical Correlation Methods 1:9
4
5 rectangular. This gives a time complexity of O(mn2 ), for a matrix of size m × n. Hence
6 the SVD-based technique is computationally more tractable for very large datasets.
7 To recapitulate, the images za and zb of the positions wa and wb that successively
8 maximise the canonical correlation can be obtained by solving a standard [Hotelling
9 1936] or a generalised eigenvalue problem [Bach and Jordan 2002; Hardoon et al.
10 2004] or by applying the SVD [Healy 1957; Ewerbring and Luk 1989]. The CCA prob-
11 lem can also be solved using alternative techniques. The only requirements are that
12 the successive images on the unit ball are orthogonal and that the angle is minimised.
13
14 2.3. Evaluating the Canonical Correlation Model
15 The pair of position vectors that have images on the unit ball with a minimum en-
16 closing angle correspond to the canonical correlation model obtained from the training
17 data. The entries of these position vectors convey the relations between the variables
18 obtained from the sampling distribution. In general, a statistical model is validated
Fo
19 in terms of statistical significance and generalisability. To assess the statistical sig-
20 nificance of the relations obtained from the training data, Bartlett’s sequential test
21 procedure [Bartlett 1941] can be applied. Although the technique was presented in
22 1941, it is still applied in timely CCA application studies such as [Marttinen et al.
r
23 2013; Kabir et al. 2014; Song et al. 2016]. The generalisability of the canonical cor-
24 relation model determines whether the relations obtained from the training data can
be considered to represent general patterns occurring in the sampling distribution.
Pe
25
26 The methods of testing the statistical significance and generalisability of the extracted
27 relations represent standard ways to evaluate the canonical correlation model.
28 The entries of the position vectors wa and wb can be used as a means to analyse the
linear relations between the variables. The linear relation corresponding to the value
er
29
of the canonical correlation is found between the entries that are of the greatest value.
30 The values of the entries of the position vectors wa and wb are visualised in Figure
31 1. The linear relation that corresponds to the canonical correlation of hz1a , z2b i = 0.99 is
32 found between the variables a3 and b1 . Since the signs of both entries are negative, the
Re
33 relation is positive. The second pair of positions (wa2 , wb2 ) conveys the negative relation
34 between a4 and b3 . The positive relation between a1 and b2 can be identified from the
35 entries of the third pair of positions (wa3 , wb3 ).
36 In [Meredith 1964], structure correlations were introduced as a means to analyse
vi
37 the relations between the variables. Structure correlations are the correlations of the
38 original variables, ai ∈ Rn for i = 1, 2, . . . , p and bj ∈ Rn for j = 1, 2, . . . , q, with the
39 images, za ∈ Rn or zb ∈ Rn . In general, the structure correlations convey how the
ew
40 images za and zb are aligned in the space Rn in relation to the variable axes.
41 In [Ter Braak 1990], the structure correlations were visualised on a biplot to facili-
42 tate the interpretation of the relations. To plot the variables on the biplot, the correla-
43 tions of the original variables of both sets with two successive images, for example the
images (z1a , z2a ), of one of the sets are computed. The plot is interpreted by the cosine
44
of the angles between the variable vectors which is given by cos(a, b) = ha, bi/||a||||b||.
45
Hence a positive linear relation is shown by an acute angle while an obtuse angle de-
46 picts a negative linear relation. A right angle corresponds to a zero correlation. Three
47 biplots of the data and results of Example 2.1 are shown in Figure 2. In each of the bi-
48 plots, the same relations that were identified in Figure 1 can be found by analysing the
49 angles between the variable vectors. The extraction of the relations can be enhanced
50 by changing the pairs of images with which the correlations are computed.
51 The statistical significance tests of the canonical correlations evaluate whether the
52 obtained pattern can be considered to occur non-randomly. The sequential test pro-
53 cedure of Bartlett [Bartlett 1938] determines the number of statistically significant
54
55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017.
56
57
58
59
60
Page 11 of 35 Computing Surveys
1
2
3 1:10
4
5 wa1 wa2 wa3
6 1 1 1
7 0.5 0.5 0.5
8
0 0 0
9
10 -0.5 -0.5 -0.5
11 -1 -1 -1
12 a3 a4 a1
13
14
15
wb1 wb2 wb3
16
1 1 1
17
18 0.5 0.5 0.5
Fo
19 0 0 0
20
-0.5 -0.5 -0.5
21
22 -1 -1 -1
r
23 b1 b3 b2
24
Pe
25 Fig. 1. The entries of the pairs of positions (wa1 , wb1 ), (wa2 , wb2 ) and (wa3 , wb3 ) are shown. The entry of max-
26 imum absolute value is coloured blue.
27
28 b3
er
29
30 a4 a4
ab31 ba13
z2a
z3a
z3a
31 ab31 a2
32 b3 a2 a2 b3
ba21
Re
33 ba21
a4 ab12
34
35 z1a z1a z2a
36
vi
37
Fig. 2. The biplots are generated using the results of Example 2.1. The biplot on the left shows the relations
38 between the variables when viewed with respect to the images z1a and z2a . The biplot in the middle shows
39 the relations between the variables when viewed with respect to the images z1a and z3a . The biplot on the
ew
40 right shows the relations between the variables when viewed with respect to the images z2a and z3a .
41
42 canonical correlations in the data. The procedure to evaluate the statistical signifi-
43 cance of the canonical correlations is described in [Fujikoshi and Veitch 1979]. We test
44 the hypothesis
45 H0 : min(p, q) = k against H1 : min(p, q) > k (13)
46
where k = 0, 1, . . . , p when p < q. If the hypothesis H0 : min(p, q) = j is rejected for
47
j = 0, 1, . . . , k − 1 but accepted for H1 : min(p, q) > k − 1 the number of statistically
48
significant canonical correlations can be estimated as k. For the test, the Bartlett-
49 Lawley statistic, Lk is applied
50
51 k min(p,q)
1 X Y
rj−2 ln (1 − rj2 ) .
52 Lk = − n − k − (p + q + 1) + (14)
53 2 j=1 j=k+1
54
55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017.
56
57
58
59
60
Computing Surveys Page 12 of 35
1
2
3 A Tutorial on Canonical Correlation Methods 1:11
4
5 where rj denotes the j th canonical correlation. The asymptotic null distribution of Lk
6 is the chi-squared with (p − k)(q − k) degrees of freedom. Hence we first test that no
7 canonical relation exists between the two views. If we reject the hypothesis H0 we
8 continue to test that one canonical relation exists. If all the canonical patterns are
9 statistically significant even the hypothesis H0 : min(p, q) = k − 1 is rejected.
10 Example 2.4. We demonstrate the sequential test procedure of Bartlett using the
11 simulated setting of Examples 2.1, 2.2 and 2.3. In the setting, n = 60, p = 4 and p = 3.
12 Hence min(p, q) = 3. First, we test that there are no canonical correlations
13
H0 : min(p, q) = 0 against H1 : min(p, q) > 0 (15)
14
2
15 The Bartlett-Lawley statistic is L0 = 296.82. Since L0 ∼ χ (12) the critical value at
16 the significance level α = 0.01 is P (χ2 ≥ 26.2) = 0.01. Since L0 = 296.82 > 26.2 the
17 hypothesis H0 is rejected. Next we test that there is one canonical correlation.
18 H0 : min(p, q) = 1 against H1 : min(p, q) > 1 (16)
Fo
19 2
20 The Bartlett-Lawley statistic is L1 = 154.56 and L1 ∼ χ (6). The critical value at
the significance level α = 0.01 is P (χ2 ≥ 16.8) = 0.01. Since L1 = 154.56 > 16.8 the
21
hypothesis H0 is rejected. We continue to test that there are two canonical correlations
22
r
23 H0 : min(p, q) = 2 against H1 : min(p, q) > 2 (17)
24 2
The Bartlett-Lawley statistic is L2 = 70.95 and L2 ∼ χ (2). The critical value at the
Pe
25 significance level α = 0.01 is P (χ2 ≥ 9.21) = 0.01. Since L1 = 70.95 > 9.21 the hypoth-
26 esis H0 is rejected. Hence the hypothesis H1 : min(p, q) > 2 is accepted and all three
27 canonical patterns are statistically significant.
28
To determine whether the extracted relations can be considered generalisable, or in
er
29
other words general patterns in the sampling distribution, the linear transformations
30 of the position vectors wa and wb need to be performed using test data. Unlike training
31 data, test data originates from the sampling distribution but were not used in the
32 model computation. Let the matrices Xatest ∈ Rm×p and Xbtest ∈ Rm×q denote the test
Re
33 data of m observations. The linear transformations of the position vectors wa and wb
34 are then
35
Xatest wa = ztest
a and Xbtest wb = ztest
b
36
vi
37 where the images ztest
a and ztest
b are in the space Rm . The cosine of the angle between
38 the test images cos(za , zb ) = hztest
test test
a , zb
test
i implies the generalisability. If the canoni-
39 cal correlations computed from test data also result in high correlation values we can
ew
40 deduce that the relations can generally be found from the particular sampling distri-
41 bution.
42 Example 2.5. We evaluate the generalisability of the canonical correlation model
43 obtained in Example 2.1. The test data matrices Xatest and Xbtest of sizes m × p and
44 m × q where m = 40, p = 4, and q = 3 are from the same distributions as described in
45 Example 2.1. The 40 observations were not included in the computation of the model.
46 The test canonical correlations corresponding to the positions (wa1 , wb1 ), (wa2 , wb2 ) and
47 (wa3 , wb3 ) are hz1a , z1b i = 0.98, hz2a , z2b i = 0.98, hz3a , z3b i = 0.98. The high values indicate
48 that the extracted relations can be considered generalisable.
49 The canonical correlation model can be evaluated by assessing the statistical signif-
50 icance and testing the generalisability of the relations. The statistical significance of
51 the model can be determined by testing whether the extracted canonical correlations
52 are not non-zero by chance. The generalisability of the relations can be assessed us-
53 ing new observations from the sampling distribution. These evaluation methods can
54
55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017.
56
57
58
59
60
Page 13 of 35 Computing Surveys
1
2
3 1:12
4
5 generally be applied to test the validity of the extracted relations obtained using any
6 variant of CCA.
7
3. EXTENSIONS OF CANONICAL CORRELATION ANALYSIS
8
9 3.1. Regularisation Techniques in Underdetermined Systems
10 CCA finds linear relations in the data when the number of observations exceeds the
11 number of variables in either view. This possibly guarantees the non-singularity of
12 the variance matrices Caa and Cbb when solving the CCA problem. In the case of the
13 standard eigenvalue problem, the matrices Caa and Cbb should be non-singular so that
14 they can be inverted. In the case of the SVD method, singular Caa and Cbb may not
15 have the square root factors. If the number of observations is less than the number of
16 variables it is likely that some of the variables are collinear. Hence a sufficient sam-
ple size reduces the collinearity of the variables and guarantees the non-singularity of
17
the variance matrices. The first proposition to solve the problem of insufficient sample
18 size was presented in [Vinod 1976]. A more recent technique to regularise CCA has
Fo
19 been proposed in [Cruz-Cano and Lee 2014]. In the following, we present the origi-
20 nal method of regularisation [Vinod 1976] due to its popularity in CCA applications
21 [González et al. 2009], [Yamamoto et al. 2008], and [Soneson et al. 2010].
22 In the work of [Vinod 1976], the singularity problem was proposed to be solved by
r
23 regularisation. In general, the idea is to improve the invertibility of the variance ma-
24 trices Caa and Cbb by adding arbitrary constants c1 > 0 and c2 > 0 to the diagonal
Pe
25 Caa + c1 I and Cbb + c2 I. The constraints of CCA become
26 waT Caa + c1 I wa =1
27
wbT Cbb + c2 I wb =1
28
er
29 and hence the magnitudes of the position vectors wa and wb are smaller when regu-
30 larisation, c1 > 0 and c2 > 0, is applied. The regularised CCA optimisation problem is
31 given by
32
cos θ = max waT Cab wb ,
Re
33 wa ∈Rp ,wb ∈Rq
34 waT Caa + c1 I wa = 1 wbT Cbb + c2 I wb = 1.
35
36 The positions wa and wb can be found by solving the standard eigenvalue problem
vi
37 −1 −1
Cbb + c2 I Cba Caa + c1 I Cab wb = ρ2 wb .
38
39 or the generalised eigenvalue problem
ew
40
0 Cab wa Caa + c1 I 0 wa
41 =ρ .
Cba 0 wb 0 Cbb + c2 I wb
42
43 As in the case of linear CCA, the canonical correlations correspond to the inner prod-
ucts between the consecutive image pairs hzia , zib i where i = 1, 2, . . . , min(p, q).
44
The regularisation proposed by [Vinod 1976] makes the CCA problem solvable but
45
introduces new parameters c1 > 0 and c2 > 0 that have to be chosen. The first proposi-
46 tion of applying a leave-one-out cross-validation procedure to automatically select the
47 regularisation parameters was presented in [Leurgans et al. 1993]. Cross-validation
48 is a well-established nonparametric model selection procedure to evaluate the valid-
49 ity of statistical predictions. One of its earliest applications have been presented in
50 [Larson 1931]. A cross-validation procedure entails the partitioning of the observa-
51 tions into subsamples, selecting and estimating a statistic which is first measured
52 on one subsample, and then validated on the other hold-out subsample. The method
53 of cross-validation is discussed in detail for example in [Stone 1974], [Efron 1979],
54
55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017.
56
57
58
59
60
Computing Surveys Page 14 of 35
1
2
3 A Tutorial on Canonical Correlation Methods 1:13
4
5 [Browne 2000], and more recently in [Arlot et al. 2010]. The cross-validation approach
6 specifically developed for CCA has been further extended in [Waaijenborg et al. 2008;
7 Yamamoto et al. 2008; González et al. 2009; Soneson et al. 2010].
8 In cross-validation, the size of the hold-out subsample varies depending on the size
9 of the dataset. A leave-one-out cross-validation procedure is an option when the sample
10 size is small and partitioning of the data into several folds, as is done in k-fold cross-
11 validation, is not feasible. 5-fold cross-validation saves computation time in relation
12 to leave-one-out cross-validation if the sample size is large enough to partition the
observations into five folds where each fold is used as a test set in turn.
13
In general, as demonstrated for example in [Krstajic et al. 2014], a k-fold cross-
14 validation procedure should be repeated when an optimal set of parameters are
15 searched for. Repetitions decrease the variance of the average values measured across
16 the test folds. Algorithm 1 outlines an approach to determine the optimal regularisa-
17 tion parameters in CCA.
18
Fo
19
20
ALGORITHM 1: Repeated k-fold cross-validation for regularised CCA
21
22 Input: Data matrices Xa and Xb , number of repetitions R, number of folds F
Output: Regularisation parameter values c1 and c2 maximising the correlation on test data
r
23 Pre-defined ranges for values of c1 ; c2 ;
24 Initialise r = 1;
Pe
25 repeat
26 Randomly partition the observations into F folds ;
27 for all values of c1 do
28 for all values of c2 do
for i = 1, 2, . . . , F do
er
29 Training set: F − i folds, test set: i fold;
30 Standardize the variables in the training and test sets;
31 For the training data, solve |Cbb −1
Cba Caa + c1 I
−1
Cab − ρ2 I| = 0;
32 Find wb corresponding to the greatest eigenvalue satisfying
Re
−1
33 −1
(Cbb Cba Caa + c1 I Cab − ρ2 I)wb = 0 ;
34 Caa +c1 I
−1
1
Cab wb
1
35 Find wa using wa = ρ1
;
36 Transform the training positions wa and wb using the test observations
Xa,test wa = za and Xb,test wb = zb ;
vi
37
38 Compute cos(za , zb ) = ||zhzaa||||z
,zb i
b ||
;
39 end
ew
40 Store the mean of the F values for cos(za , zb ) obtained at c1 and c2 ;
end
41
end
42 r =r+1;
43 until r = R;
44 Compute the mean of the R values for cos(za , zb ) obtained at c1 and c2 ;
45 Return the combination c1 and c2 that maximises cos(za , zb )
46
47
48
49 Example 3.1. To demonstrate the procedure of regularisation in underdetermined
50 settings, we use the same simulated data as in the previous examples but we include
51 additional normally distributed variables. The data matrices Xa and Xb of sizes n × p
52 and n × q, where n = 60, p = 70 and q = 10, respectively as follows. The variables of Xa
53 are generated from a random univariate normal distribution, a1 , a2 , . . . , a70 ∼ N (0, 1).
54
55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017.
56
57
58
59
60
Page 15 of 35 Computing Surveys
1
2
3 1:14
4
5
6
7 0.0776
8
9
10
11
12
13
14 0.09 1
15
16 Fig. 3. The maximum test canonical correlation, computed over 50 times repeated 5-fold cross-validation,
17 is obtained at c1 = 0.09.
18
We generate the following linear relations
Fo
19
20 b1 = a3 + ξ 1
21 b2 = a1 + ξ 2
22 b3 = −a4 + ξ 3
r
23
24 where ξ 1 ∼ N (0, 0.01), ξ 2 ∼ N (0, 0.03), ξ 3 ∼ N (0, 0.02) denote vectors of normal noise.
The remaining variables of Xb are generated from random univariate normal distri-
Pe
25
bution, a4 , a5 , . . . , a10 ∼ N (0, 1). The data is standardised such that every variable has
26 zero mean and unit variance.
27 To construct the matrix Cbb −1 −1
Cba Caa Cab , the variance matrices Caa and Cbb need to
28 be non-singular. Since Caa is obtained from a rectangular matrix, collinearity makes
er
29 it close to singular. We therefore add a positive constant to the diagonal Caa + c1 I
30 to make it invertible. Cbb is invertible since the data matrix Xb has more rows than
31 columns. The optimal value for the regularisation parameter c1 can be determined for
32 instance through repeated k-fold cross-validation. As shown in Figure 3, the optimal
Re
33 value c1 = 0.09 was obtained through 50 times repeated 5-fold cross-validation using
34 the procedure presented in the Algorithm 1.
35 The positions wa and wb and their respective images za and zb on a unit ball are
36 found by solving the eigenvalues of the characteristic equation
vi
37 −1
|Cbb Cba Caa + c1 I
−1
Cab − ρ2 I| = 0. (18)
38
39 The number of relations equals min(p, q) = 10. The square roots of the first three
ew
40 eigenvalues are ρ1 = 0.98, ρ2 = 0.97 and ρ3 = 0.96. The respective three eigenvectors
41 that correspond to the positions wb satisfy the equation
42 −1
−1
(Cbb Cba Caa + c1 I Cab − ρ2 I)wb = 0. (19)
43
44 The positions wa are found using the formula
−1
45 Caa + c1 I Cab wbi
46 wai = (20)
ρi
47
48 where i = 1, 2, 3 corresponds to the sorted eigenvalues and eigenvectors. By rounding
49 correct to three decimal places, the first three canonical correlations are hz1a , z1b i =
50 0.999, hz2a , z2b i = 0.998, hz3a , z3b i = 0.996. The extracted linear relations are visualised in
51 Figure 4.
52 When either or both of the data views consists of more variables than observations,
53 regularisation can be applied to make the variance matrices non-singular. This in-
54
55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017.
56
57
58
59
60
Computing Surveys Page 16 of 35
1
2
3 A Tutorial on Canonical Correlation Methods 1:15
4
5 wa1 wa2 wa3
6 1 1 1
7 0.5 0.5 0.5
8
0 0 0
9
10 -0.5 -0.5 -0.5
11 -1 -1 -1
12 a3 a1 a4
13
14
15
wb1 wb2 wb3
16
1 1 1
17
18 0.5 0.5 0.5
Fo
19 0 0 0
20
-0.5 -0.5 -0.5
21
22 -1 -1 -1
r
23 b1 b2 b3
24
Pe
25 Fig. 4. The entries of the pairs of positions (wa1 , wb1 ), (wa2 , wb2 ) and (wa3 , wb3 ) are shown. The entry of max-
26 imum absolute value is coloured blue. The positive linear relation between a3 and b1 , the positive linear
27 relation between a1 and b2 and the negative linear relation between a4 and b3 are extracted by the pairs
(wa1 , wb1 ), (wa2 , wb2 ), and (wa3 , wb3 ) respectively.
28
er
29
30 volves finding optimal non-negative scalar parameters that, when added to the diag-
31 onal entries, improve the invertibility of the variance matrices. After improving the
32 invertibility of the variance matrices, the regularised CCA problem can be solved us-
Re
33 ing the standard techniques.
34
35 3.2. Bayesian Approaches for Robustness
36
Probabilistic approaches have been proposed to improve the robustness of CCA when
vi
37 the sample size is small and to be able to make more flexible distributional assump-
38 tions. A robust method generates a valid model regardless of outlying observations. In
39 the following, a brief introduction to Bayesian CCA is provided. A detailed review on
ew
40 Bayesian CCA and its recent extensions can be found in [Klami et al. 2013].
41 An extension of CCA to probabilistic models was first proposed in [Bach and Jor-
42 dan 2005]. The probabilistic model contains the latent variables yk ∈ Ro , where
43 o = min(p, q), that generate the observations xka ∈ Rp and xkb ∈ Rq for k = 1, 2, . . . , n.
44 The latent variable model is defined by
45
46 y ∼ N (0, Id ), o≥d≥1
47
xa |y ∼ N (Sa y + µa , Ψa ), Sa ∈ Rp×d , Ψa 0
48
49 xb |y ∼ N (Sb y + µb , Ψb ), Sb ∈ Rq×d , Ψb 0
50
51 where N (µ, Σ) denotes the normal multivariate distribution with mean µ and co-
52 variance Σ. The Sa and Sb correspond to the transformations of the latent variables
53 yk ∈ Ro . The Ψa and Ψb denote the noise covariance matrices. The maximum likeli-
54
55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017.
56
57
58
59
60
Page 17 of 35 Computing Surveys
1
2
3 1:16
4
5 hood estimates of the parameters Sa , Sb , Ψa , Ψb , µa and µb are given by
6
7 Sˆa = Caa Wad Ma Sˆb = Cbb Wbd Mb
8 T T
Ψˆa = Caa − Sˆa Sˆa Ψ̂b = Cbb − Sˆb Sˆb
9
n n
10 1X k 1X k
µ̂a = xa µ̂b = xb
11 n n
k=1 k=1
12
13 where Ma , Mb ∈ Rd×d are arbitrary matrices such that Ma MbT = Pd and the spectral
14 norms of Ma and Mb are smaller than one. Pd is the diagonal matrix of the first d
15 canonical correlations. The d columns of Wad and Wbd correspond to the positions wai
16 and wbi for i = 1, 2, . . . , d obtained using any of the standard techniques described in
17 section 2.1.
18 The posterior expectations of y given xa and xb are E(y|xa ) = MaT Wad T
(xa − µˆa ) and
Fo
T T
19 E(y|xb ) = Mb Wbd (xb − µˆb ). As stated in [Bach and Jordan 2005], regardless of what
20 Ma and Mb are, E(y|xa ) and E(y|xb ) lie in the d-dimensional subspaces of Rp and Rq
21 which are identical to those obtained by linear CCA. The generative model of [Bach
22 and Jordan 2005] was further developed in [Archambeau et al. 2006] by replacing
the normal noise with the multivariate Student’s t distribution. This improves the
r
23
24 robustness against outlying observations that are then better modeled by the noise
term [Klami et al. 2013].
Pe
25
A Bayesian extension of CCA was proposed by [Klami and Kaski 2007] and [Wang
26 2007]. To perform Bayesian analysis, the probabilistic model has to be supplemented
27 with prior distributions of the model parameters. In [Klami and Kaski 2007] and
28 [Wang 2007], the prior distribution of the covariance matrices Ψa and Ψb was chosen
er
29 to be the inverse-Wishart distribution. The automatic relevance determination [Neal
30 2012] prior was selected for the linear transformations Sa and Sb . The inference on the
31 posterior distribution was made by applying a variational mean-field algorithm [Wang
32 2007] and Gibbs sampling [Klami and Kaski 2007].
Re
33 As in the case of the linear CCA, the covariance matrices obtained from high-
34 dimensional data make the inference of the probabilistic and Bayesian CCA models
35 difficult [Klami et al. 2013]. This is because the covariance matrices need to be in-
36 verted in the inference algorithms. To perform Bayesian CCA on high-dimensional
data, dimensionality reduction techniques should be applied as a preprocessing step,
vi
37
38 as has been done for example in [Huopaniemi et al. 2010].
39 An advantage of Bayesian CCA, in relation to linear CCA, is the application of the
ew
prior distributions that enable to take the possible underlying structure in the data
40
into account. Examples of studies where sparse models were obtained by means of the
41
prior distribution include [Archambeau and Bach 2009] and [Rai and Daume 2009]. In
42 addition to modeling the structure of the data, in [Klami et al. 2012] the Bayesian CCA
43 was extended such that any exponential family distribution could model the noise, not
44 only the normal.
45 In summary, probabilistic and Bayesian CCA provide alternative ways to interpret
46 the CCA by means of latent variables. Bayesian CCA may be more feasible in settings
47 where additional knowledge regarding the data can be incorporated through the prior
48 distributions or some other exponential family distribution functions better as a noise
49 model than the normal distribution.
50
51 3.3. Uncovering Linear and Non-Linear Relations
52 CCA [Hotelling 1936] finds linear relations between variables belonging to two views
53 that both are overdetermined. The first proposition to extend CCA to uncover non-
54
55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017.
56
57
58
59
60
Computing Surveys Page 18 of 35
1
2
3 A Tutorial on Canonical Correlation Methods 1:17
4
5 linear relations using an optimal scaling method was presented in [Burg and Leeuw
6 1983]. At the turn of the 21st century, artificial neural networks were incorporated in
7 the CCA framework for finding non-linear relations [Lai and Fyfe 1999; Fyfe and Lai
8 2000; Hsieh 2000]. Deep CCA [Andrew et al. 2013] is an example of a recent non-linear
9 CCA variant employing artificial neural networks. Shortly after the introduction of the
10 neural networks, propositions of applying kernel methods in CCA were presented in
11 [Lai and Fyfe 2000; Akaho 2001; Van Gestel et al. 2001; Melzer et al. 2001; Bach and
12 Jordan 2002]. Since then, the kernelised version of CCA has received considerable
attention in terms of theoretical foundations [Hardoon et al. 2004; Fukumizu et al.
13
2007; Alam et al. 2008; Blaschko et al. 2008; Hardoon and Shawe-Taylor 2009; Cai
14 2013] and applications [Melzer et al. 2003; Wang et al. 2005; Hardoon et al. 2007;
15 Larson et al. 2014]. In the following, we present how kernel CCA can be applied to
16 uncover nonlinear relations between the variables. We then provide a brief overview
17 on deep CCA.
18 To extract linear relations, CCA is performed in the data spaces of Xa ∈ Rn×p and
Fo
19 Xb ∈ Rn×q where the n rows correspond to the observations and the p and q columns
20 correspond to the variables. The relations between the variables are found analysing
21 the positions wa ∈ Rp and wb ∈ Rq that have such images za = Xa wa and zb = Xb wb
22 on a unit ball in Rn that have a minimum enclosing angle. The extracted relations are
r
23 linear since the positions wa and wb and their images za and zb were obtained in the
24 Euclidean space.
To extract non-linear relations, the positions wa and wb should be found in a space
Pe
25
26 where the distances, or measures of similarity, between objects are non-linear. This can
27 be achieved using kernel methods, that is by transforming the original observations
xia ∈ Rp and xib ∈ Rq , where i = 1, 2, . . . , n, to Hilbert spaces Ha and Hb through feature
28
maps φa : Rp 7→ Ha and φb : Rq 7→ Hb . The similarity of the objects is captured by
er
29
a symmetric positive semi-definite kernel, corresponding to the inner products in the
30
Hilbert spaces Ka (xia , xja ) = hφa (xia ), φa (xja )iHa and Kb (xib , xjb ) = hφb (xib ), φb (xjb )iHb .
31
The feature maps are typically non-linear and result in a high-dimensional intrinsic
32 spaces φa (xia ) ∈ Ha and φb (xib ) ∈ Hb for i = 1, 2, . . . , n. Through kernels, CCA can be
Re
33 used to extract non-linear correlations, relying on the fact that the CCA solution can
34 always be found within the span of the data [Bach and Jordan 2002; Schölkopf et al.
35 1998].
36 The basic principles behind kernel CCA are similar to CCA. First, the observations
vi
37 are transformed to Hilbert spaces Ha and Hb using symmetric positive semi-definite
38 kernels
39
Ka (xia , xja ) = hφa (xia ), φa (xja )iHa and Kb (xib , xjb ) = hφb (xib ), φb (xjb )iHb
ew
40
41 where i, j = 1, 2, . . . , n. As derived in [Bach and Jordan 2002], the original data matri-
42 ces Xa ∈ Rn×p and Xb ∈ Rn×q can be substituted by the Gram matrices Ka ∈ Rn×n
43 and Kb ∈ Rn×n . Let α and β denote the positions in the kernel space Rn that have the
44 images za = Ka α and zb = Kb β on the unit ball in Rn with a minimum enclosing angle
45 in between. The kernel CCA problem is hence
46 cos(za , zb ) = max n hza , zb i = αT KaT Kb β, (21)
za ,zb ∈R
47 q q
48 ||za ||2 = αT Ka2 α = 1 ||zb ||2 = β T Kb2 β = 1 (22)
49
50 As in CCA, the optimisation problem can be solved using the Lagrange multiplier
51 technique.
52 ρ1 ρ2
53 L = αT KaT Kb β − (αT Ka2 α − 1) − (β T Kb2 β − 1) (23)
2 2
54
55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017.
56
57
58
59
60
Page 19 of 35 Computing Surveys
1
2
3 1:18
4
5 where ρ1 and ρ2 denote the Lagrange multipliers. Differentiating L with respect to α
6 and β gives
7 δL
8 = Ka Kb β − ρ1 Ka2 α = 0 (24)
δα
9 δL
10 = Kb Ka α − ρ2 Kb2 β = 0 (25)
δβ
11
12 Multiplying (7) from the left by αT and (8) from the left by β T gives
13 αT Ka Kb β − ρ1 αT Ka2 α = 0 (26)
14 T T
15 β Kb Ka α − ρ2 β Kb2 β = 0. (27)
16 T
Since αT Ka2 α = 1 and β Kb2 β = 1, we obtain that
17
18 ρ1 = ρ2 = ρ. (28)
Fo
19 Substituting (28) into Equation (24) we obtain
20 Ka−1 Ka−1 Ka Kb β K −1 Kb β
21 α= = a . (29)
ρ ρ
22
r
23 Substituting (29) into (25) we obtain
24 1
Kb Ka Ka−1 Kb β − ρKb2 β = 0 (30)
Pe
25 ρ
26 which is equivalent to the generalised eigenvalue problem of the form
27
28 Kb2 β = ρ2 Kb2 β. (31)
er
29 If Kb2 is invertible, the problem reduces to a standard eigenvalue problem of the form
30
Iβ = ρ2 β. (32)
31
32 Clearly, in the kernel space, if the Gram matrices are invertible the resulting canonical
Re
33 correlations are all equal to one. Regularisation is therefore needed to solve the kernel
34 CCA problem.
35 Kernel CCA can be regularised in a similar manner as presented in Section 3.1 [Bach
36 and Jordan 2002; Hardoon et al. 2004]. We constrain the norms of the position vectors
α and β by adding constants c1 and c2 to the diagonals of the Gram matrices Ka and
vi
37
Kb
38 2
39 αT Ka + c1 I α =1 (33)
ew
40 T
2
41 β Kb + c2 I β =1. (34)
42 The solution can then be found by solving the standard eigenvalue problem
43 −2 −2
Kb + c1 I Kb Ka Ka + c2 I Ka Kb α = ρ2 α.
44
45 As in the case of CCA, kernel CCA can also be solved through the generalised eigen-
46 value problem [Bach and Jordan 2002]. Since the data matrices Xa and Xb can be
47 substituted by the corresponding Gram matrices Ka and Kb , the formulation becomes
48 2 !
0 Ka Kb α Ka + c1 I 0 α
49 =ρ 2 (35)
Kb Ka 0 β 0 Kb + c2 I β
50
51 where the constants c1 and c2 denote the regularisation parameters. In Example 3.2,
52 kernel CCA, solved through the generalised eigenvalue problem, is performed on sim-
53 ulated data.
54
55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017.
56
57
58
59
60
Computing Surveys Page 20 of 35
1
2
3 A Tutorial on Canonical Correlation Methods 1:19
4
5 Table I. Extracted relations by ker-
nel CCA
6
7 z1a z2a z3a
exp(a3 ) 0.00 0.81 0.09
8 a31 0.05 0.14 0.74
9 −a4 0.99 0.07 0.04
10 z1b z2b z3b
11 b1 0.02 0.93 0.12
b2 0.08 0.15 0.87
12 b3 0.98 0.01 0.03
13
14
15 Example 3.2. We generate a simulated dataset as follows. The data matrices Xa
16 and Xb of sizes n × p and n × q, where n = 150, p = 7 and q = 8, respectively as
17 follows. The seven variables of Xa are generated from a random univariate normal
18 distribution, a1 , a2 , . . . , a7 ∼ N (0, 1). We generate the following relations
Fo
19 b1 = exp(a3 ) + ξ 1
20
21 b2 = a31 + ξ 2
22 b3 = −a4 + ξ 3
r
23
where ξ 1 ∼ N (0, 0.4), ξ 2 ∼ N (0, 0.2) and ξ 3 ∼ N (0, 0.3) denote vectors of normal noise.
24 The five other variables of Xb are generated from a random univariate normal distri-
Pe
25 bution, b4 , b5 , . . . , b8 ∼ N (0, 1). The data is standardised such that every variable has
26 zero mean and unit variance.
27 In kernel CCA, the choice of the kernel function affects what kind of relations can be
28 extracted. In general, a Gaussian RBF kernel K(x, y) = exp(− ||x−y||
2
2σ 2 ) is used when the
er
29 data is assumed to contain non-linear relations. The width parameter σ determines the
30 non-linearity in the distances between the data points computed in the form of inner
31 products. Increasing the value of σ makes the space closer to Euclidean while decreas-
32 ing makes the distances more non-linear. The optimal value for σ is best determined
Re
33 using a re-sampling method such as a cross-validation scheme, for example procedure
34 similar to the one presented in Algorithm 1. In this example, we applied the ”median
35 trick”, presented in [Song et al. 2010], according to which the σ corresponds to the me-
36 dian of Euclidean distances computed between all pairs of observations. The median
vi
37 distances for the data in this example were σa = 3.53 and σb = 3.62 for the views Xa
38 and Xb respectively. The kernels were centered by K̃ = K− 1l jjT K− 1l KjjT + l12 (jT Kj)jjT
39 where j contains only entries of value one [Shawe-Taylor and Cristianini 2004].
ew
40 In addition to the kernel parameters, also the regularisation parameters c1 and c2
41 need to be optimised to extract the correct relations. As in the case of regularised
42 CCA, a repeated cross-validation procedure can be applied to identify the optimal pair
43 of parameters. For the data in this example, the optimal regularisation parameters
were c1 = 1.50 and c2 = 0.60 when a 20 times repeated 5-fold cross-validation was
44
applied. The first three canonical correlations at the optimal parameter values were
45
hz1a , z1b i = 0.95, hz2a , z2b i = 0.89, and hz3a , z3b i = 0.87.
46 The interpretation of the relations cannot be performed from the positions α and β
47 since they are obtained in the kernel spaces. In the case of simulated data, we know
48 what kind of relations are contained in the data. We can compute the linear correla-
49 tion coefficient between the simulated relations and the transformed pairs of positions
50 za and zb [Chang et al. 2013]. The correlation coefficients are shown in Table I. The
51 exponential relation was extracted in the second pair (z2a , z2b ), the 3rd order polynomial
52 relation was extracted in the third pair (z3a , z3b ) and the linear relation in the first pair
53 (z1a , z1b ).
54
55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017.
56
57
58
59
60
Page 21 of 35 Computing Surveys
1
2
3 1:20
4
5 In [Hardoon et al. 2004], an alternative formulation of the standard eigenvalue prob-
6 lem was presented when the data contains a large number of observations. If the sam-
7 ple size is large, the dimensionality of the Gram matrices Ka and Kb can cause compu-
8 tational problems. Partial Gram-Schmidt orthogonalization (PGSO) [Cristianini et al.
9 2002] was proposed as a matrix decomposition method. PGSO results in
10 Ka 'Ra RaT
11
Kb 'Rb RbT .
12
13 Substituting these into the Equations (24) and (25) and multiplying by RaT and RbT
14 respectively we obtain
15 RaT Ra RaT Rb RbT β − ρRaT RaT Ra RaT Ra α = 0 (36)
16
17 RbT Rb RbT Ra RaT α − µRbT RbT Rb RbT Rb β = 0. (37)
18 Let Daa = RaT Ra ,
Dab = RaT Rb ,
Dba = RbT Ra , and Dbb = RbT Rb denote the blocks of the
Fo
19 new sample covariance matrix. Let α̃ = RaT α and β̃ = RbT β denote the positions α and
20 β in the reduced space. Using these substitutions in (36) and (37) we obtain
21 2
22 Daa Dab β̃ − ρDaa α̃ = 0 (38)
r
23 2
Dbb Dba α̃ − ρDbb β̃ = 0. (39)
24 −1 −1
If Daa and Dbb are invertible we can multiply (38) by Daa and (39) by Dbb which gives
Pe
25
26 Dab β̃ − ρDaa α̃ = 0 (40)
27
28 Dba α̃ − ρDbb β̃ = 0. (41)
er
29 and hence
30 −1
Dbb Dba α̃
31 β̃ = (42)
ρ
32
which, after a substitution into (38), results in a generalised eigenvalue problem
Re
33
−1
34 Dab Dbb Dba α̃ = ρ2 Daa α̃. (43)
35 To formulate the problem as a standard eigenvalue problem, let Daa = SS denote T
36 the complete Cholesky decomposition where S is a lower triangular matrix and let
vi
37 α̂ = S T α̃. Substituting these into (43) we obtain
38 −1
S −1 Dab Dbb Dba S 0−1 α̂ = ρ2 α̂.
39
ew
40 If regularisation using the parameter κ is combined with dimensionality reduction the
41 problem becomes
42 −1
S −1 Dab Dbb + κI Dba S 0−1 α̂ = ρ2 α̂. (44)
43
44 A numerical example of the method presented by [Hardoon et al. 2004] is given in
Example 3.3.
45
46 Example 3.3. We generate a simulated dataset as follows. The data matrices Xa
47 and Xb of sizes n × p and n × q, where n = 10000, p = 7 and q = 8, respectively as
48 follows. The seven variables of Xa are generated from a random univariate normal
49 distribution, a1 , a2 , . . . , a7 ∼ N (0, 1). We generate the following relations
50 b1 = exp(a3 ) + ξ 1
51
b2 = a31 + ξ 2
52
53 b3 = −a4 + ξ 3
54
55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017.
56
57
58
59
60
Computing Surveys Page 22 of 35
1
2
3 A Tutorial on Canonical Correlation Methods 1:21
4
5 Table II. Extracted relations by ker-
nel CCA
6
7 z1a z2a z3a
exp(a3 ) 0.91 0.01 0.05
8 a31 0.01 0.92 0.04
9 −a4 0.06 0.03 0.99
10 z1b z2b z3b
11 b1 0.91 0.01 0.04
b2 0.01 0.94 0.05
12 b3 0.07 0.04 0.99
13
14 where ξ 1 ∼ N (0, 0.4), ξ 2 ∼ N (0, 0.2) and ξ 3 ∼ N (0, 0.3) denote vectors of normal noise.
15 The five other variables of Xb are generated from a random univariate normal distri-
16 bution, b4 , b5 , . . . , b8 ∼ N (0, 1). The data is standardised such that every variable has
17 zero mean and unit variance.
18 A Gaussian RBF kernel is used for both views. The width parameter is set using
Fo
19 the median trick to σa = 3.56 and σb = 3.60. The kernels were centered. The positions
20 α and β are found solving the standard eigenvalue problem in (44) and applying the
21 Equation (42). We set the regularisation parameter κ = 0.5.
22 The first three canonical correlations at the optimal parameter values were hz1a , z1b i =
r
23 0.97, hz2a , z2b i = 0.97, and hz3a , z3b i = 0.96. The correlation coefficients between the simu-
24 lated relations and the transformed variables are shown in Table II. The exponential
relation was extracted in the first pair (z1a , z1b ), the 3rd order polynomial relation was
Pe
25
26 extracted in the second pair (z2a , z2b ) and the linear relation in the third pair (z3a , z3b ).
27
28 Non-linear relations are also taken into account through neural networks which are
er
29 employed in deep CCA [Andrew et al. 2013]. In deep CCA, every observation xka ∈ Rp
30 and xkb ∈ Rq for k = 1, 2, . . . , n is non-linearly transformed multiple times in an iter-
31 ative manner through a layered network. The number of units in a layer determines
32 the dimension of the output vector which is put in the next layer. As is explained in
Re
33 [Andrew et al. 2013], let the first layer have c1 units and the final layer o units. The
34 output vector of the first layer for the observation x1a ∈ Rp , is h1 = s(S11 x1a + b11 ) ∈ Rc1 ,
35 where S11 ∈ Rc1 ×p is a matrix of weights, b11 ∈ Rc1 is a vector of bias, and s : R 7→ R
36 is a non-linear function applied to each element. The logistic and tanh functions are
examples of popular non-linear functions. The output vector h1 is then used to com-
vi
37
pute the output of the following layer in similar manner. The final transformed vector
38
f1 (x1a ) = s(Sd1 hd−1 + b1d ) is in the space of Ro , for a network with d layers. The same
39
procedure is applied to the observations xkb ∈ Rq for k = 1, 2, . . . , n.
ew
40
In deep CCA, the aim is to learn the optimal parameters Sd and bd for both views
41 such that the correlation between the transformed observations is maximised. Let
42 Ha ∈ Ro×n and Hb ∈ Ro×n denote the matrices that have the final transformed output
43 vectors in their columns. Let H̃a = Ha − n1 Ha 1 denote the centered data matrix and
44 1 1
let Ĉab = m−1 H̃a H̃bT and Ĉaa = m−1 H̃a H̃aT + ra I, where ra is a regularisation constant,
45
46 denote the covariance and variance matrices. Same formulae are used to compute the
47 covariance and variance matrices for view b. As in section 2.1, the total correlation of
the top k components of Ha and Hb is the sum of the top k singular values of the matrix
48 −1/2 −1/2
49 T = Ĉaa Ĉab Ĉbb . If k = o, the correlation is given by the trace norm of T , that is
50
corr(Ha , Hb ) = tr(T T T )1/2 .
51
52 The optimal parameters Sd and bd maximise the trace norm using gradient-based op-
53 timisation. The details of the algorithm can be found in [Andrew et al. 2013].
54
55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017.
56
57
58
59
60
Page 23 of 35 Computing Surveys
1
2
3 1:22
4
5 In summary, kernel and deep CCA provide alternatives to the linear CCA when the
6 relations in the data can be considered to be non-linear and the sample size is small
7 in relation to the data dimensionality. When applying kernel CCA on a real dataset,
8 prior knowledge of the relations of interest can help in the analysis of the results. If
9 the data is assumed to contain both linear and non-linear relations a Gaussian RBF
10 kernel could be a first option. The choice of the kernel function depends on what kind of
11 relations the data can be considered to contain. The possible relations can be extracted
12 by testing how the image pairs correlate with the functions of variables. Deep CCA
provides an alternative to compute maximal correlation between the views although
13
the neural network makes the identification of the type of relations difficult.
14
15 3.4. Improving the Interpretability by Enforcing Sparsity
16 The extraction of the linear relations between the variables in CCA and regularised
17 CCA relies on the values of the entries of the position vectors that have images on
18 the unit ball with a minimum enclosing angle. The relations can be inferred when the
Fo
19 number of variables is not too large for a human to interpret. However, in modern data
20 analysis, it is common that the number of variables is of the order of tens of thousands.
21 In this case, the values of the entries of the position vectors should be constrained such
22 that only a subset of the variables would have a non-zero value. This would facilitate
r
23 the interpretation since only a fraction of the total number of variables need to be
24 considered when inferring the relations.
To constrain some of the values of the entries of the position vectors to zero, which
Pe
25
26 is also referred to as to enforce sparsity, tools of convex analysis can be applied. In
27 literature, sparsity has been enforced on the position vectors using soft-thresholding
28 operators [Parkhomenko et al. 2007], elastic net regularisation [Waaijenborg et al.
2008], penalised matrix decomposition combined with soft-thresholding [Witten et al.
er
29
2009], and convex least squares optimisation [Hardoon and Shawe-Taylor 2011]. The
30 sparse CCA formulations presented in [Parkhomenko et al. 2007; Waaijenborg et al.
31 2008; Witten et al. 2009] find sparse position vectors that can be applied to infer linear
32 relations between the variables with non-zero entries. The formulation in [Hardoon
Re
33 and Shawe-Taylor 2011] differs from the preceding propositions in terms of the op-
34 timisation criterion. The canonical correlation is found between the image obtained
35 from the linear transformation defined by the data space of one view and the image
36 obtained from the linear transformation defined by the kernel of the other view. The
vi
37 selection of which sparse CCA should be applied for a specific task depends on the
38 research question and prior knowledge regarding the variables.
39 The sparse CCA algorithm of [Parkhomenko et al. 2007] can be applied when the
ew
40 aim is to find sparse position vectors and no prior knowledge regarding the variables is
41 available. The positions and images are solved using the SVD, as presented in Section
42 2.2. Sparsity is enforced on the entries of the positions by iteratively applying the
43 soft-thresholding operator [Donoho and Johnstone 1995] on the pair of left and right
orthonormal singular vectors until convergence. The soft-thresholding operator is a
44
proximal mapping of the L1 norm [Bach et al. 2011]. The consecutive pairs of sparse
45
left and right singular vectors are obtained by deflating the found pattern from the
46 matrix on which the SVD is computed. The sparse CCA hence results in a sparse set
47 of linearly related variables.
48 The elastic net CCA [Waaijenborg et al. 2008] finds sparse position vectors but also
49 considers possible groupings in the variables. The elastic net [Zou and Hastie 2005]
50 combines the LASSO [Tibshirani 1996] and the ridge [Hoerl and Kennard 1970] penal-
51 ties. The elastic net penalty incorporates a grouping effect in the variable selection.
52 The term variable selection refers to that a selected variable has a non-zero entry in
53 the position vector. In the soft-thresholding CCA of [Parkhomenko et al. 2007], the as-
54
55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017.
56
57
58
59
60
Computing Surveys Page 24 of 35
1
2
3 A Tutorial on Canonical Correlation Methods 1:23
4
5 signment of a non-zero entry is independent of the other entries within the vector. In
6 the elastic net CCA, the ridge penalty groups the variables by the values of the en-
7 tries and the LASSO penalty either eliminates a group by shrinking the entries of the
8 variables within the group to zero or leaves them as non-zero. The algorithm is based
9 on an iterative scheme of multiple regression. As in [Parkhomenko et al. 2007], the
10 computations are performed in the data space and therefore the extracted relations
11 are also linear.
12 The penalised matrix decomposition (PMD) formulation of sparse CCA [Witten et al.
2009] is based on finding low-rank approximations of the covariance matrix Cab . An
13
n × p sized matrix X with rank K < min(p, q) can be approximated using the SVD
14 [Eckart and Young 1936] by
15
r
16 X
17 σk uk vkT = argmin||X − X̃||2F
k=1 X̃∈M (r)
18
Fo
19 where uk denotes the column of the matrix U , vk denotes the column of the matrix
20 U , σk denotes the k th singular value on the diagonal of S, M (r) is the set of rank r
21 n × p matrices and r << K. In the case of CCA, the matrix to be approximated is the
22 covariance matrix X = Cab . The optimisation problem in the PMD context is given by
r
23 1
24 min ||Cab − σwa wbT ||2F ,
p
wa ∈R ,wb ∈Rq 2
Pe
25 ||wa ||2 = 1 ||wb ||2 = 1,
26
||wa ||1 ≤ c1 ||wb ||1 ≤ c2 , σ ≥ 0
27
28 which is equivalent to
er
29 cos θ = max waT Cab wb ,
30 wa ∈Rp ,wb ∈Rq
31 ||wa ||2 ≤ 1 ||wb ||2 ≤ 1,
32 ||wa ||1 ≤ c1 ||wb ||1 ≤ c2 .
Re
33
34 The aim is to find r pairs of sparse position vectors wa and wb such that their outer
35 products represent low-rank approximations of the original Cab and hence extracts the
36 r linear relations from the data.
The exact derivation of the algorithm to solve the PMD optimisation problem is
vi
37
given in [Witten et al. 2009]. In general, the position vectors, that generate 1-rank
38
approximations of the covariance matrix, are found in an iterative manner. To find
39 one 1-rank approximation, the soft-thresholding operator is applied as follows. Let the
ew
40 soft-thresholding operator be given by
41
42 S(a, c) = sign(a)(|a| − c)+
43 where c > 0 is a constant. The following formula is applied in the derivation of the
44 algorithm
45
maxhu, ai,
46 u
47 s.t. ||u||22 ≤ 1, ||u||1 < c.
48
S(a,δ)
49 The solution is given by u = ||S(a,δ)||2
with δ = 0 if ||u1 || ≤ c. Otherwise, δ is selected
50 such that ||u1 || = c. Sparse position vectors are the obtained by Algorithm 2. At ev-
51 ery iteration, the δ1 and δ2 are selected by binary search. To obtain several 1-rank
52 approximations, a deflation step is included such that when the converged vectors wa
53 and wb are found, the extracted relations are subtracted from the covariance matrix
54
55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017.
56
57
58
59
60
Page 25 of 35 Computing Surveys
1
2
3 1:24
4
5
ALGORITHM 2: Computation of a 1-rank approximation of the covariance matrix
6
7 Initialise ||wb ||2 = 1 ;
repeat
8 S(Cab wb ,δ1 )
wa ← ||S(C where δ1 = 0 if ||wa ||1 ≤ c1 , otherwise δ1 is chosen such that
9 ab wb ,δ1 )||2
||wa ||1 = c1 and c1 > 0 ;
10 T
S(Cab wa ,δ2 )
11 wb ← T w ,δ )||
||S(Cab
where δ2 = 0 if ||wb ||1 ≤ c2 , otherwise δ2 is chosen such that
a 2 2
12 ||wb ||1 = c2 and c2 > 0 ;
13 until convergence;
14 σ ← waT Cab wb
15
16
k+1
17 Cab ← C k − σk wak wbkT . In this way, the successive solutions remain orthogonal which
18 is a contstraint of CCA.
Fo
19
20 Example 3.4. To demonstrate the PMD formulation of sparse CCA, we generate
21 the following data. The data matrices Xa and Xb of sizes n × p and n × q, where n = 50,
22 p = 100 and q = 150, respectively as follows. The variables of Xa are generated from
r
23 a random univariate normal distribution, a1 , a2 , · · · , a100 ∼ N (0, 1). We generate the
24 following linear relations
Pe
25
26 b1 = a3 + ξ 1 (45)
27 b2 = a1 + ξ 2 (46)
28 b3 = −a4 + ξ 3 (47)
er
29
30 where ξ 1 ∼ N (0, 0.08), ξ 2 ∼ N (0, 0.07), and ξ 3 ∼ N (0, 0.05) denote vectors of normal
31 noise. The other variables of Xb are generated from a random univariate normal dis-
32 tribution, b4 , b5 , · · · , b150 ∼ N (0, 1). The data is standardised such that every variable
Re
33 has zero mean and unit variance.
34 We apply the R implementation of [Witten et al. 2009] which is available in the
35 PMA package. We extract three rank-1 approximations. The values of the entries of
36 the pairs of position vectors (wa1 , wb1 ), (wa2 , wb2 ) and (wa3 , wb3 ) corresponding to canonical
correlations hz1a , z1b i = 0.95, hz2a , z2b i = 0.92, hz3a , z3b i = 0.91 are shown in Figure 5. The
vi
37
38 first 1-rank approximation extracted (47), the second (46), and the third (47).
39
ew
40 The sparse CCA of [Hardoon and Shawe-Taylor 2011] is a sparse convex least
41 squares formulation that differs from the preceding versions. The canonical correla-
42 tion is found between the linear transformations between a data space view and a
kernel space view. The aim is to find a sparse set of variables in the data space view
43
that relate to a sparse set of observations, represented in terms of relative similari-
44
ties, in the kernel space view. An example of a setting, where relations of this type
45 can provide useful information, is in bilingual analysis as described in [Hardoon and
46 Shawe-Taylor 2011]. When finding relations between words of two languages, it may
47 be useful to know in what kind of contexts can a word be used in the translated lan-
48 guage. The optimisation problem is given by
49
50 cos(za , zb ) = max n hza , zb i = waT XaT Kb β,
51 za ,zb ∈R
q
52
q
||za ||2 = waT XaT Xa wa = 1 ||zb ||2 = β T Kb2 β = 1
53
54
55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017.
56
57
58
59
60
Computing Surveys Page 26 of 35
1
2
3 A Tutorial on Canonical Correlation Methods 1:25
4
5 wa1 wa2 wa3
6 1 1 1
7 0.5 0.5 0.5
8
0 0 0
9
10 -0.5 -0.5 -0.5
11 -1 -1 -1
12 a4 a1 a3
13
14
15
wb1 wb2 wb3
16
1 1 1
17
18 0.5 0.5 0.5
Fo
19 0 0 0
20
-0.5 -0.5 -0.5
21
22 -1 -1 -1
r
23 b3 b2 b1
24
Pe
25 Fig. 5. The values of the entries of the position vector pairs (wa1 , wb1 ), (wa2 , wb2 ) and (wa3 , wb3 ) obtained using
26 the PMD method for sparse CCA are shown. The entry of maximum absolute value is coloured blue. The
27 negative linear relation between a4 and b3 is extracted in the first 1-rank approximation. The positive linear
relations between a1 and b2 and a3 and b1 are extracted in the second and third 1-rank approximations.
28
er
29
30 ALGORITHM 3: Pseudo-code to solve the convex sparse least squares problem
31 repeat
32 1. Use the dual Lagrangian variables to solve the primal variables ;
Re
33 2. Check whether all constraints on the primal variables hold ;
34 3. Use the primal variables to solve the dual Lagrangian variables ;
4. Check whether all dual Lagrangian variable constraints hold ;
35
5. Check whether 2 holds, if not go to 1 ;
36 until convergence;
vi
37
38
39 which is equivalent to the convex sparse least squares problem
ew
40
41 min ||Xa wa − Kb β||2 + µ||wa ||1 + γ||β̃||1 (48)
wa ,β
42
43 s.t ||β||∞ = 1 (49)
44 where µ and γ are fixed parameters that control the trade-off between function ob-
45 jective and the level of sparsity of the entries of the position vectors wa and β. The
46 constraint ||β||∞ = 1 is set to avoid the trivial solution (wa = 0, β = 0). The k th en-
47 try of β is set to βk = 1 and the remaining entries in β̃ are constrained by 1-norm.
48 The idea is to fix one sample as a basis for comparison and rank the other similar
49 samples based on the fixed sample. The optimisation problem is solved by iteratively
50 minimising the gap between the primal and dual Lagrangian solutions. The procedure
51 is outlined in Algorithm 3. The exact computational steps can be found in [Hardoon
52 and Shawe-Taylor 2011]. The Algorithm 3 is used to extract one relation or pattern
53 from the data. To extract the successive patterns, deflation is applied to obtain the
54
55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017.
56
57
58
59
60
Page 27 of 35 Computing Surveys
1
2
3 1:26
4
5 wa β
6 0.014 1
7 0.012
0.9
8 0.8
0.01 0.7
9
0.6
10 0.008
0.5
11 0.006
0.4
12
0.3
13 0.004
0.2
14 0.002
0.1
15
0 0
16 β2 ββ1112 β26β29 β35 β40 β46
a9
aaa11165
8
a220
a24
9
a62
a77
71
2
4
a6
a9
17
18
Fo
19 Fig. 6. The values of the entries of the positions wa and β at the optimal value of k are shown.
20
21 residual matrices from which the already found pattern is removed. In Example 3.5,
22 the extraction of the first pattern is shown.
r
23 Example 3.5. In the sparse CCA of [Hardoon and Shawe-Taylor 2011], the idea is
24 to determine the relations of the variables in the data space view Xa to the observa-
Pe
25 tions in the kernel space view Kb where the observations comprise the variables of the
26 view b. This setting differs from all of the previous examples where the idea was to find
27 relations between the variables. Since one of the views is kernelised, the relations can-
28 not be explicitly simulated. We therefore demonstrate the procedure on data generated
er
29 from random univariate normal distribution as follows. The data matrices Xa and Xb
30 of sizes n × p and n × q, where n = 50, p = 100 and q = 150, respectively are generated
31 as follows. The variables of Xa and Xb are generated from random univariate normal
32 distribution, a1 , a2 , · · · , a100 ∼ N (0, 1) and b1 , b2 , · · · , b150 ∼ N (0, 1) respectively. The
data is standardised such that every variable has zero mean and unit variance.
Re
33
The Gaussian kernel function K(x, y) = exp(−||x − y||2 /2σ 2 ) is used to compute the
34
similarities for the view b. The choice of the kernel is justified since the underlying
35 distribution is normal. The width parameter is set to σ = 17.25 using the median trick.
36
The kernel matrix is centered by K̃ = K − 1l jjT K − 1l KjjT + l12 (jT Kj)jjT where j contains
vi
37 only entries of value one [Shawe-Taylor and Cristianini 2004].
38 To find the positions wa and β, we solve
39
ew
40 f = min ||Xa wa − Kb β||2 + µ||wa ||1 + γ||β̃||1
wa ,β
41
42 s.t ||β||∞ = 1
43
using the implementation proposed in [Uurtio et al. 2015]. As stated in [Hardoon and
44 Shawe-Taylor 2011], to determine which variable in the data space view Xa is most
45 related to the observation in Kb , the algorithm needs to be run for all possible values
46 of k. This means that every observation is in turn set as a basis for comparison and a
47 sparse set of the remaining observations β̃ is computed. The optimal value of k gives
48 the minimum objective value f .
49 We run the algorithm by initially setting the value of the entry βk = 1 for k =
50 1, 2, . . . , n. The minimum objective value f = 0.03 was obtained at k = 29. This corre-
51 sponds to a canonical correlation of hza , zb i = 0.88. The values of the entries of wa and
52 β are shown in Figure 6. The observation corresponding to k = 29 in the kernelised
53 view Kb is most related to the variables a15 , a16 , a18 , a20 , and a24 .
54
55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017.
56
57
58
59
60
Computing Surveys Page 28 of 35
1
2
3 A Tutorial on Canonical Correlation Methods 1:27
4
5 The sparse versions of CCA can be applied to settings when the large number of
6 variables hinders the inference of the relations. When the interest is to extract sparse
7 linear relations between the variables, the proposed algorithms of [Parkhomenko et al.
8 2007; Waaijenborg et al. 2008; Witten et al. 2009] provide a solution. The algorithm
9 of [Hardoon and Shawe-Taylor 2011] can be applied if the focus is to find how the
10 variables of one view relate to the observations that correspond to the combined sets
11 of the variables in the other view. In other words, the approach is useful if the focus is
12 not to uncover the explicit relations between the variables but to gain insight how a
variable relates to a complete set of variables of an observation.
13
14 4. DISCUSSION
15 This tutorial presented an overview on the methodological evolution of canonical cor-
16 relation methods focusing on the original linear, regularised, kernel, and sparse CCA.
17 Succinct reviews were also conducted on the Bayesian and neural network-based deep
18 CCA variants. The aim was to explain the theoretical foundations of the variants us-
Fo
19 ing the linear algebraic interpretation of CCA. The methods to solve the CCA problems
20 were described using numerical examples. Additionally, techniques to assess the sta-
21 tistical significance of the extracted relations and the generalisability of the patterns
22 were explained. The aim was to delineate the applicabilities of the different CCA vari-
r
23 ants in relation to the properties of the data.
24 In CCA, the aim is to determine linear relations between variables belonging to two
sets. From a linear algebraic point of view, the relations can be found by analysing the
Pe
25
26 linear transformations defined by the two views of the data. The most distinct relations
27 are obtained by analysing the entries of the first pair of position vectors in the two data
28 spaces that are mapped onto a unit ball such that their images have a minimum en-
closing angle. The less distinct relations can be identified from the successive pairs
er
29
of position vectors that correspond to the images with a minimum enclosing angle ob-
30 tained from the orthogonal complements of the preceding pairs of images. This tutorial
31 presented three standard ways of solving the CCA problem, that is by solving either a
32 standard [Hotelling 1935; 1936] or a generalised eigenvalue problem [Bach and Jordan
Re
33 2002; Hardoon et al. 2004], or by applying the SVD [Healy 1957; Ewerbring and Luk
34 1989].
35 The position vectors of the two data spaces, that convey the related pairs of variables,
36 can be obtained using alternative techniques than the ones selected for this tutorial.
vi
37 The three methods were chosen because they have been much applied in CCA litera-
38 ture and they are relatively straightforward to explain and implement. Additionally,
39 to understand the further extensions of CCA, it is important to know how it originally
ew
40 has been solved. The extensions are often further developed versions of the standard
41 techniques.
42 For didactic purposes, the synthetic datasets used for the worked examples were de-
43 signed to represent optimal data settings for the particular CCA variants to uncover
the relations. The relations were generated to be one-to-one, in other words one vari-
44
able in one view was related with only one variable in the other view. In real datasets,
45
which are often much larger than the synthetic ones in this paper, the relations may
46 not be one-to-one but rather many-to-many (one-to-two, two-to-three, etc.). As in the
47 worked examples, these relations can also be inferred by examining the entries of the
48 position vectors of the two data spaces. However, the understanding of how the one-to-
49 one relations are extracted provides means to uncover the more complex relations.
50 To apply the linear CCA, the sample size needs to exceed the number of variables
51 of both views which means that the system is required to be overdetermined. This is
52 to guarantee the non-singularity of the variance matrices. If the sample size is not
53 sufficient, regularisation [Vinod 1976] or Bayesian CCA [Klami et al. 2013] can be ap-
54
55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017.
56
57
58
59
60
Page 29 of 35 Computing Surveys
1
2
3 1:28
4
5 plied. The feasibility of regularisation has not been studied in relation to the number
6 of variables exceeding the number of observations. Improving the invertibility by in-
7 troducing additional bias has been shown to work in various settings but the limit
8 when the system is too underdetermined that regularisation cannot assist in recov-
9 ering the underlying relations has not been resolved. Bayesian CCA is more robust
10 against outlying observations, when compared with linear CCA, due to its generative
11 model structure.
12 In addition to linear relations, non-linear relations are taken into account in ker-
nelised and neural network-based CCA. Kernel methods enable the extraction of non-
13
linear relations through the mapping to a Hilbert space [Bach and Jordan 2002;
14 Hardoon et al. 2004]. When applying kernel methods in CCA, the disparity between
15 the number of observations and variables can be huge due to very dimensional kernel
16 induced feature spaces, a challenge that is tackled by regularisation. The types of rela-
17 tions that can be extracted, is determined by the kernel function that was selected for
18 the mapping. Linear relations are extracted by a linear kernel and non-linear relations
Fo
19 by non-linear kernel functions such as the Gaussian kernel. Although kernelisation
20 extends the range of extractable relations, it also complicates the identification of the
21 type of relation. A method to determine the type of relation involves testing how the
22 image vectors correlate with a certain type of function. However, this may be difficult
r
23 if no prior knowledge of the relations is available. Further research on how to select
24 the optimal kernel functions to determine the most distinct relations underlying in the
data could facilitate the final inference making. Neural network-based deep CCA is an
Pe
25
26 alternative to kernelised CCA, when the aim is to find a high correlation between the
27 final output vectors obtained through multiple non-linear transformations. However,
due to the network structure, it is not straightforward to identify the relations between
28
the variables.
er
29
As a final branch of the CCA evolution, this tutorial covered sparse versions of CCA.
30 Sparse CCA variants have been developed to facilitate the extraction of the relations
31 when the data dimensionality is too high for human interpretation. This has been ad-
32 dressed by enforcing sparsity on the entries of the position vectors [Parkhomenko et al.
Re
33 2007; Waaijenborg et al. 2008; Witten et al. 2009]. As an alternative to operating in the
34 data spaces, [Hardoon and Shawe-Taylor 2011] proposed a primal-dual sparse CCA in
35 which the relations are obtained between the variables of one view and observations of
36 the other. The sparse variants of CCA in this tutorial were selected based on how much
vi
37 they have been applied in literature. As a limitation of the selected variants, sparsity
38 is enforced on the entries of the position vectors without regarding the possible under-
39 lying dependencies between the variables which has been addressed in the literature
ew
40 of structured sparsity [Chen et al. 2012].
41 In addition to studying the techniques of solving the optimisation problems of CCA
42 variants, this tutorial gave a brief introduction to evaluating the canonical correla-
43 tion model. Bartlett’s sequential test procedure [Bartlett 1938; 1941] was given as an
example of a standard method to assess the statistical significance of the canonical
44
correlations. The techniques of identifying the related variables through visual inspec-
45 tion of biplots [Meredith 1964; Ter Braak 1990] were presented. To assess whether
46 the extracted relations can be considered to occur in any data with the same under-
47 lying sampling distribution, the method of applying both training and test data was
48 explained. As an alternative method, the statistical significance of the canonical cor-
49 relation model could be assessed using permutation tests [Rousu et al. 2013]. The
50 visualisation of the results using the biplots is mainly applicable in the case of lin-
51 ear relations. Alternative approaches could be considered to visualise the non-linear
52 relations extracted by kernel CCA.
53
54
55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017.
56
57
58
59
60
Computing Surveys Page 30 of 35
1
2
3 A Tutorial on Canonical Correlation Methods 1:29
4
5 To conclude, this tutorial compiled the original, regularised, kernel, and sparse CCA
6 into a unified framework to emphasise the applicabilities of the four variants in dif-
7 ferent data settings. The work highlights which CCA variant is most applicable de-
8 pending on the sample size, data dimensionality and the type of relations of interest.
9 Techniques for extracting the relations are also presented. Additionally, the impor-
10 tance of assessing the statistical significance and generalisability of the relations is
11 emphasised. The tutorial hopefully advances both the practice of CCA variants in data
12 analysis and further development of novel extensions.
The software used to produce the examples in this paper are available for download
13
at https://github.com/aalto-ics-kepaco/cca-tutorial.
14
15
16 ACKNOWLEDGMENTS
17
The work by Viivi Uurtio and Juho Rousu has been supported in part by Academy of Finland (grant
18
295496/D4Health). João M. Monteiro was supported by a PhD studentship awarded by Fundação para a
Fo
19 Ciência e a Tecnologia (SFRH/BD/88345/2012). John Shawe-Taylor acknowledges the support of the EPSRC
20 through the C-PLACID project Reference: EP/M006093/1.
21
22
r
23 REFERENCES
24 S Akaho. 2001. A Kernel Method For Canonical Correlation Analysis. In In Proceedings of the International
Meeting of the Psychometric Society (IMPS2001.
Pe
25
Md A Alam, M Nasser, and K Fukumizu. 2008. Sensitivity analysis in robust and kernel canonical correla-
26 tion analysis. In Computer and Information Technology, 2008. ICCIT 2008. 11th International Confer-
27 ence on. IEEE, 399–404.
28 TW Anderson. 2003. An introduction to statistical multivariate analysis. (2003).
er
29 G Andrew, R Arora, J Bilmes, and K Livescu. 2013. Deep canonical correlation analysis. In International
30 Conference on Machine Learning. 1247–1255.
31 C Archambeau and FR Bach. 2009. Sparse probabilistic projections. In Advances in neural information
processing systems. 73–80.
32
C Archambeau, N Delannay, and M Verleysen. 2006. Robust probabilistic projections. In Proceedings of the
Re
33 23rd International conference on machine learning. ACM, 33–40.
34 S Arlot, A Celisse, and others. 2010. A survey of cross-validation procedures for model selection. Statistics
35 surveys 4 (2010), 40–79.
36 F Bach, R Jenatton, J Mairal, G Obozinski, and others. 2011. Convex optimization with sparsity-inducing
norms. Optimization for Machine Learning 5 (2011).
vi
37
FR Bach and MI Jordan. 2002. Kernel independent component analysis. Journal of machine learning re-
38 search 3, Jul (2002), 1–48.
39 FR Bach and MI Jordan. 2005. A probabilistic interpretation of canonical correlation analysis. (2005).
ew
40 MS Bartlett. 1938. Further aspects of the theory of multiple regression. In Mathematical Proceedings of the
41 Cambridge Philosophical Society, Vol. 34. Cambridge Univ Press, 33–40.
42 MS Bartlett. 1941. The statistical significance of canonical correlations. Biometrika 32, 1 (1941), 29–37.
43 B Baur and S Bozdag. 2015. A canonical correlation analysis-based dynamic bayesian network prior to infer
gene regulatory networks from multiple types of biological data. Journal of Computational Biology 22,
44 4 (2015), 289–299.
45 Å Björck and GH Golub. 1973. Numerical methods for computing angles between linear subspaces. Mathe-
46 matics of computation 27, 123 (1973), 579–594.
47 MB Blaschko, CH Lampert, and A Gretton. 2008. Semi-supervised laplacian regularization of kernel canon-
48 ical correlation analysis. In Joint European Conference on Machine Learning and Knowledge Discovery
in Databases. Springer, 133–145.
49
MW Browne. 2000. Cross-validation methods. Journal of mathematical psychology 44, 1 (2000), 108–132.
50
E Burg and J Leeuw. 1983. Non-linear canonical correlation. British journal of mathematical and statistical
51 psychology 36, 1 (1983), 54–80.
52 J Cai. 2013. The distance between feature subspaces of kernel canonical correlation analysis. Mathematical
53 and Computer Modelling 57, 3 (2013), 970–975.
54
55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017.
56
57
58
59
60
Page 31 of 35 Computing Surveys
1
2
3 1:30
4
5 L Cao, Z Ju, J Li, R Jian, and C Jiang. 2015. Sequence detection analysis based on canonical correlation
6 for steady-state visual evoked potential brain computer interfaces. Journal of neuroscience methods 253
(2015), 10–17.
7 JD Carroll. 1968. Generalization of canonical correlation analysis to three or more sets of variables. In
8 Proceedings of the 76th annual convention of the American Psychological Association, Vol. 3. 227–228.
9 B Chang, U Krüger, R Kustra, and J Zhang. 2013. Canonical Correlation Analysis based on Hilbert-Schmidt
10 Independence Criterion and Centered Kernel Target Alignment.. In ICML (2). 316–324.
11 X Chen, S Chen, H Xue, and X Zhou. 2012. A unified dimensionality reduction framework for semi-paired
and semi-supervised multi-view data. Pattern Recognition 45, 5 (2012), 2005–2018.
12
X Chen, C He, and H Peng. 2014. Removal of muscle artifacts from single-channel EEG based on ensemble
13 empirical mode decomposition and multiset canonical correlation analysis. Journal of Applied Mathe-
14 matics 2014 (2014).
15 X Chen, H Liu, and JG Carbonell. 2012. Structured sparse canonical correlation analysis. In International
16 Conference on Artificial Intelligence and Statistics. 199–207.
17 A Cichonska, J Rousu, P Marttinen, AJ Kangas, P Soininen, T Lehtimäki, OT Raitakari, M-R Järvelin,
V Salomaa, M Ala-Korpela, and others. 2016. metaCCA: Summary statistics-based multivariate meta-
18 analysis of genome-wide association studies using canonical correlation analysis. Bioinformatics (2016),
Fo
19 btw052.
20 N Cristianini, J Shawe-Taylor, and H Lodhi. 2002. Latent semantic kernels. Journal of Intelligent Informa-
21 tion Systems 18, 2-3 (2002), 127–152.
R Cruz-Cano and MLT Lee. 2014. Fast regularized canonical correlation analysis. Computational Statistics
22 & Data Analysis 70 (2014), 88–100.
r
23 J Dauxois and GM Nkiet. 1997. Canonical analysis of two Euclidean subspaces and its applications. Linear
24 Algebra Appl. 264 (1997), 355–388.
Pe
25 DL Donoho and IM Johnstone. 1995. Adapting to unknown smoothness via wavelet shrinkage. Journal of
26 the american statistical association 90, 432 (1995), 1200–1224.
27 RB Dunham and DJ Kravetz. 1975. Canonical correlation analysis in a predictive system. The Journal of
Experimental Education 43, 4 (1975), 35–42.
28
C Eckart and G Young. 1936. The approximation of one matrix by another of lower rank. Psychometrika 1,
er
29 3 (1936), 211–218.
30 B Efron. 1979. Computers and the theory of statistics: thinking the unthinkable. SIAM review 21, 4 (1979),
31 460–480.
32 LM Ewerbring and FT Luk. 1989. Canonical correlations and generalized SVD: applications and new algo-
Re
33 rithms. In 32nd Annual Technical Symposium. International Society for Optics and Photonics, 206–222.
J Fang, D Lin, SC Schulz, Z Xu, VD Calhoun, and Y-P Wang. 2016. Joint sparse canonical correlation analysis
34 for detecting differential imaging genetics modules. Bioinformatics 32, 22 (2016), 3480–3488.
35 Y Fujikoshi and LG Veitch. 1979. Estimation of dimensionality in canonical correlation analysis. Biometrika
36 66, 2 (1979), 345–351.
vi
37 K Fukumizu, FR Bach, and A Gretton. 2007. Statistical consistency of kernel canonical correlation analysis.
38 Journal of Machine Learning Research 8, Feb (2007), 361–383.
39 C Fyfe and PL Lai. 2000. Canonical correlation analysis neural networks. In Pattern Recognition, 2000.
ew
Proceedings. 15th International Conference on, Vol. 2. IEEE, 977–980.
40
GH Golub and CF Van Loan. 2012. Matrix computations. Vol. 3. JHU Press.
41 GH Golub and H Zha. 1995. The canonical correlations of matrix pairs and their numerical computation. In
42 Linear algebra for signal processing. Springer, 27–49.
43 I González, S Déjean, PGP Martin, O Gonçalves, P Besse, and A Baccini. 2009. Highlighting relationships
44 between heterogeneous biological data through graphical displays based on regularized canonical cor-
relation analysis. Journal of Biological Systems 17, 02 (2009), 173–199.
45
BK Gunderson and RJ Muirhead. 1997. On estimating the dimensionality in canonical correlation analysis.
46 Journal of Multivariate Analysis 62, 1 (1997), 121–136.
47 DR Hardoon, J Mourao-Miranda, M Brammer, and J Shawe-Taylor. 2007. Unsupervised analysis of fMRI
48 data using kernel canonical correlation. NeuroImage 37, 4 (2007), 1250–1259.
49 DR Hardoon and J Shawe-Taylor. 2009. Convergence analysis of kernel canonical correlation analysis: the-
50 ory and practice. Machine learning 74, 1 (2009), 23–38.
51 DR Hardoon and J Shawe-Taylor. 2011. Sparse canonical correlation analysis. Machine Learning 83, 3
(2011), 331–353.
52
53
54
55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017.
56
57
58
59
60
Computing Surveys Page 32 of 35
1
2
3 A Tutorial on Canonical Correlation Methods 1:31
4
5 DR Hardoon, S Szedmak, and J Shawe-Taylor. 2004. Canonical correlation analysis: An overview with ap-
6 plication to learning methods. Neural computation 16, 12 (2004), 2639–2664.
MJR Healy. 1957. A rotation method for computing canonical correlations. Math. Comp. 11, 58 (1957), 83–
7 86.
8 C Heij and B Roorda. 1991. A modified canonical correlation approach to approximate state space modelling.
9 In Decision and Control, 1991., Proceedings of the 30th IEEE Conference on. IEEE, 1343–1348.
10 AE Hoerl and RW Kennard. 1970. Ridge regression: Biased estimation for nonorthogonal problems. Techno-
11 metrics 12, 1 (1970), 55–67.
12 JW Hooper. 1959. Simultaneous equations and canonical correlation theory. Econometrica: Journal of the
Econometric Society (1959), 245–256.
13
CE Hopkins. 1969. Statistical analysis by canonical correlation: a computer application. Health services
14 research 4, 4 (1969), 304.
15 P Horst. 1961. Relations among sets of measures. Psychometrika 26, 2 (1961), 129–149.
16 H Hotelling. 1935. The most predictable criterion. Journal of educational Psychology 26, 2 (1935), 139.
17 H Hotelling. 1936. Relations between two sets of variates. Biometrika 28, 3/4 (1936), 321–377.
18 WW Hsieh. 2000. Nonlinear canonical correlation analysis by neural networks. Neural Networks 13, 10
Fo
19 (2000), 1095–1105.
20 I Huopaniemi, T Suvitaival, J Nikkilä, M Orešič, and S Kaski. 2010. Multivariate multi-way analysis of
multi-source data. Bioinformatics 26, 12 (2010), i391–i398.
21 A Kabir, RD Merrill, AA Shamim, RDW Klemn, AB Labrique, P Christian, KP West Jr, and M Nasser. 2014.
22 Canonical correlation analysis of infant’s size at birth and maternal factors: a study in rural Northwest
r
23 Bangladesh. PloS one 9, 4 (2014), e94243.
24 M Kang, B Zhang, X Wu, C Liu, and J Gao. 2013. Sparse generalized canonical correlation analysis for
biological model integration: a genetic study of psychiatric disorders. In Engineering in Medicine and
Pe
25 Biology Society (EMBC), 2013 35th Annual International Conference of the IEEE. IEEE, 1490–1493.
26 JR Kettenring. 1971. Canonical analysis of several sets of variables. Biometrika (1971), 433–451.
27 A Kimura, M Sugiyama, T Nakano, H Kameoka, H Sakano, E Maeda, and K Ishiguro. 2013. SemiCCA:
28 Efficient semi-supervised learning of canonical correlations. Information and Media Technologies 8, 2
er
29 (2013), 311–318.
30 A Klami and S Kaski. 2007. Local dependent components. In Proceedings of the 24th international conference
on Machine learning. ACM, 425–432.
31
A Klami, S Virtanen, and S Kaski. 2012. Bayesian exponential family projections for coupled data sources.
32 arXiv preprint arXiv:1203.3489 (2012).
Re
33 A Klami, S Virtanen, and S Kaski. 2013. Bayesian canonical correlation analysis. Journal of Machine Learn-
34 ing Research 14, Apr (2013), 965–1003.
35 D Krstajic, LJ Buturovic, DE Leahy, and S Thomas. 2014. Cross-validation pitfalls when selecting and
36 assessing regression and classification models. Journal of cheminformatics 6, 1 (2014), 1.
PL Lai and C Fyfe. 1999. A neural implementation of canonical correlation analysis. Neural Networks 12,
vi
37 10 (1999), 1391–1397.
38 PL Lai and C Fyfe. 2000. Kernel and nonlinear canonical correlation analysis. International Journal of
39 Neural Systems 10, 05 (2000), 365–377.
ew
40 NB Larson, GD Jenkins, MC Larson, RA Vierkant, TA Sellers, CM Phelan, JM Schildkraut, R Sutphen,
41 PPD Pharoah, S A Gayther, and others. 2014. Kernel canonical correlation analysis for assessing gene–
gene interactions and application to ovarian cancer. European Journal of Human Genetics 22, 1 (2014),
42 126–131.
43 SC Larson. 1931. The shrinkage of the coefficient of multiple correlation. Journal of Educational Psychology
44 22, 1 (1931), 45.
45 H-S Lee. 2007. Canonical correlation analysis using small number of samples. Communications in Statistic-
46 sSimulation and Computation
R 36, 5 (2007), 973–985.
47 SE Leurgans, RA Moyeed, and BW Silverman. 1993. Canonical correlation analysis when the data are
curves. Journal of the Royal Statistical Society. Series B (Methodological) (1993), 725–740.
48 H Lindsey, JT Webster, and S Halpern. 1985. Canonical Correlation as a Discriminant Tool in a Periodontal
49 Problem. Biometrical journal 27, 3 (1985), 257–264.
50 P Marttinen, J Gillberg, A Havulinna, J Corander, and S Kaski. 2013. Genome-wide association studies with
51 high-dimensional phenotypes. Statistical applications in genetics and molecular biology 12, 4 (2013),
413–431.
52
53
54
55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017.
56
57
58
59
60
Page 33 of 35 Computing Surveys
1
2
3 1:32
4
5 T Melzer, M Reiter, and H Bischof. 2001. Nonlinear feature extraction using generalized canonical correla-
6 tion analysis. In International Conference on Artificial Neural Networks. Springer, 353–360.
T Melzer, M Reiter, and H Bischof. 2003. Appearance models based on kernel canonical correlation analysis.
7 Pattern recognition 36, 9 (2003), 1961–1971.
8 W Meredith. 1964. Canonical correlations with fallible data. Psychometrika 29, 1 (1964), 55–65.
9 MS Monmonier and FE Finn. 1973. Improving the interpretation of geographical canonical correlation mod-
10 els. The Professional Geographer 25, 2 (1973), 140–142.
11 M Nakanishi, Y Wang, Y-T Wang, and T-P Jung. 2015. A Comparison Study of Canonical Correlation
12 Analysis Based Methods for Detecting Steady-State Visual Evoked Potentials. PloS one 10, 10 (2015),
e0140703.
13
RM Neal. 2012. Bayesian learning for neural networks. Vol. 118. Springer Science & Business Media.
14 T Ogura, Y Fujikoshi, and T Sugiyama. 2013. A variable selection criterion for two sets of principal compo-
15 nent scores in principal canonical correlation analysis. Communications in Statistics-Theory and Meth-
16 ods 42, 12 (2013), 2118–2135.
17 E Parkhomenko, D Tritchler, and J Beyene. 2007. Genome-wide sparse canonical correlation of gene expres-
sion with genotypes. In BMC proceedings, Vol. 1. BioMed Central Ltd, S119.
18
P Rai and H Daume. 2009. Multi-label prediction via sparse infinite CCA. In Advances in Neural Information
Fo
19 Processing Systems. 1518–1526.
20 J Rousu, DD Agranoff, O Sodeinde, J Shawe-Taylor, and D Fernandez-Reyes. 2013. Biomarker discovery by
21 sparse canonical correlation analysis of complex clinical phenotypes of tuberculosis and malaria. PLoS
22 Comput Biol 9, 4 (2013), e1003018.
r
23 Y Saad. 2011. Numerical methods for large eigenvalue problems. Vol. 158. SIAM.
24 T Sakurai. 2009. Asymptotic expansions of test statistics for dimensionality and additional information in
canonical correlation analysis when the dimension is large. Journal of Multivariate Analysis 100, 5
Pe
25 (2009), 888–901.
26 BK Sarkar and C Chakraborty. 2015. DNA pattern recognition using canonical correlation algorithm. Jour-
27 nal of biosciences 40, 4 (2015), 709–719.
28 SV Schell and WA Gardner. 1995. Programmable canonical correlation analysis: A flexible framework for
er
blind adaptive spatial filtering. IEEE transactions on signal processing 43, 12 (1995), 2898–2908.
29
B Schölkopf, A Smola, and K-R Müller. 1998. Nonlinear component analysis as a kernel eigenvalue problem.
30 Neural computation 10, 5 (1998), 1299–1319.
31 JA Seoane, C Campbell, INM Day, JP Casas, and TR Gaunt. 2014. Canonical correlation analysis for gene-
32 based pleiotropy discovery. PLoS Comput Biol 10, 10 (2014), e1003876.
Re
33 J Shawe-Taylor and N Cristianini. 2004. Kernel methods for pattern analysis. Cambridge university press.
34 X-B Shen, Q-S Sun, and Y-H Yuan. 2013. Orthogonal canonical correlation analysis and its application in
35 feature fusion. In Information Fusion (FUSION), 2013 16th International Conference on. IEEE, 151–
157.
36 C Soneson, H Lilljebjörn, T Fioretos, and M Fontes. 2010. Integrative analysis of gene expression and copy
vi
37 number alterations using canonical correlation analysis. BMC bioinformatics 11, 1 (2010), 1.
38 L Song, B Boots, SM Siddiqi, GJ Gordon, and A Smola. 2010. Hilbert space embeddings of hidden Markov
39 models. (2010).
ew
40 Y Song, PJ Schreier, D Ramı́rez, and T Hasija. 2016. Canonical correlation analysis of high-dimensional
data with very small sample support. Signal Processing 128 (2016), 449–458.
41
M Stone. 1974. Cross-validatory choice and assessment of statistical predictions. Journal of the royal statis-
42 tical society. Series B (Methodological) (1974), 111–147.
43 MJ Sullivan. 1982. Distribution of Edaphic Diatoms in a Missisippi Salt Marsh: A Canonical Correlation
44 Analysis. Journal of Phycology 18, 1 (1982), 130–133.
45 A Tenenhaus, C Philippe, and V Frouin. 2015. Kernel generalized canonical correlation analysis. Computa-
46 tional Statistics & Data Analysis 90 (2015), 114–131.
47 A Tenenhaus, C Philippe, V Guillemot, K-A Le Cao, J Grill, and V Frouin. 2014. Variable selection for
generalized canonical correlation analysis. Biostatistics (2014), kxu001.
48 A Tenenhaus and M Tenenhaus. 2011. Regularized generalized canonical correlation analysis. Psychome-
49 trika 76, 2 (2011), 257–284.
50 CJF Ter Braak. 1990. Interpreting canonical correlation analysis through biplots of structure correlations
51 and weights. Psychometrika 55, 3 (1990), 519–531.
52 R Tibshirani. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society.
Series B (Methodological) (1996), 267–288.
53
54
55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017.
56
57
58
59
60
Computing Surveys Page 34 of 35
1
2
3 A Tutorial on Canonical Correlation Methods 1:33
4
5 XM Tu. 1991. A bootstrap resampling scheme for using the canonical correlation technique in rank estima-
6 tion. Journal of chemometrics 5, 4 (1991), 333–343.
XM Tu, DS Burdick, DW Millican, and LB McGown. 1989. Canonical correlation technique for rank estima-
7 tion of excitation-emission matrixes. Analytical Chemistry 61, 19 (1989), 2219–2224.
8 V Uurtio, M Bomberg, K Nybo, M Itävaara, and J Rousu. 2015. Canonical correlation methods for exploring
9 microbe-environment interactions in deep subsurface. In International Conference on Discovery Science.
10 Springer, 299–307.
11 JP Van de Geer. 1984. Linear relations amongk sets of variables. Psychometrika 49, 1 (1984), 79–94.
12 T Van Gestel, JAK Suykens, J De Brabanter, B De Moor, and J Vandewalle. 2001. Kernel canonical cor-
relation analysis and least squares support vector machines. In International Conference on Artificial
13 Neural Networks. Springer, 384–389.
14 HD Vinod. 1976. Canonical ridge and econometrics of joint production. Journal of Econometrics 4, 2 (1976),
15 147–166.
16 S Waaijenborg, PC Verselewel de Witt Hamer, and AH Zwinderman. 2008. Quantifying the association
17 between gene expressions and DNA-markers by penalized canonical correlation analysis. Statistical
Applications in Genetics and Molecular Biology 7, 1 (2008).
18 C Wang. 2007. Variational Bayesian approach to canonical correlation analysis. IEEE Transactions on Neu-
Fo
19 ral Networks 18, 3 (2007), 905–910.
20 D Wang, L Shi, DS Yeung, and ECC Tsang. 2005. Nonlinear canonical correlation analysis of fMRI signals
21 using HDR models. In 2005 IEEE Engineering in Medicine and Biology 27th Annual Conference. IEEE,
5896–5899.
22
r
GC Wang, N Lin, and B Zhang. 2013. Dimension reduction in functional regression using mixed data canon-
23 ical correlation analysis. Stat Interface 6 (2013), 187–196.
24 DS Watkins. 2004. Fundamentals of matrix computations. Vol. 64. John Wiley & Sons.
Pe
25 FV Waugh. 1942. Regressions between sets of variables. Econometrica, Journal of the Econometric Society
26 (1942), 290–310.
27 DM Witten, R Tibshirani, and T Hastie. 2009. A penalized matrix decomposition, with applications to sparse
28 principal components and canonical correlation analysis. Biostatistics (2009), kxp008.
er
KW Wong, PCW Fung, and CC Lau. 1980. Study of the mathematical approximations made in the basis-
29 correlation method and those made in the canonical-transformation method for an interacting Bose gas.
30 Physical Review A 22, 3 (1980), 1272.
31 T Yamada and T Sugiyama. 2006. On the permutation test in canonical correlation analysis. Computational
32 statistics & data analysis 50, 8 (2006), 2111–2123.
Re
33 H Yamamoto, H Yamaji, E Fukusaki, H Ohno, and H Fukuda. 2008. Canonical correlation analysis for mul-
tivariate regression and its application to metabolic fingerprinting. Biochemical Engineering Journal
34 40, 2 (2008), 199–204.
35 Y-H Yuan, Q-S Sun, and H-W Ge. 2014. Fractional-order embedding canonical correlation analysis and its
36 applications to multi-view dimensionality reduction and recognition. Pattern Recognition 47, 3 (2014),
vi
37 1411–1424.
38 Y-H Yuan, Q-S Sun, Q Zhou, and D-S Xia. 2011. A novel multiset integrated canonical correlation analysis
framework and its application in feature fusion. Pattern Recognition 44, 5 (2011), 1031–1040.
39
ew
B Zhang, J Hao, G Ma, J Yue, and Z Shi. 2014. Semi-paired probabilistic canonical correlation analysis. In
40 International Conference on Intelligent Information Processing. Springer, 1–10.
41 H Zou and T Hastie. 2005. Regularization and variable selection via the elastic net. Journal of the Royal
42 Statistical Society: Series B (Statistical Methodology) 67, 2 (2005), 301–320.
43
44 Received February 2017; revised –; accepted –
45
46
47
48
49
50
51
52
53
54
55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017.
56
57
58
59
60
Page 35 of 35 Computing Surveys
1
2
3
4
5 Online Appendix to:
6 A Tutorial on Canonical Correlation Methods
7
8 VIIVI UURTIO, Aalto University
9 JOÃO M. MONTEIRO, University College London
10 JAZ KANDOLA, Imperial College London
11 JOHN SHAWE-TAYLOR, University College London
12 DELMIRO FERNANDEZ-REYES, University College London
13 JUHO ROUSU, Aalto University
14
15
16 A. THIS IS AN EXAMPLE OF APPENDIX SECTION HEAD
17 B. APPENDIX SECTION HEAD
18
Fo
19
20
21
22
r
23
24
Pe
25
26
27
28
er
29
30
31
32
Re
33
34
35
36
vi
37
38
39
ew
40
41
42
43
44
45
46
47
48
49
50
51
52
c 2017 Copyright held by the owner/author(s). 0360-0300/2017/02-ART1 $15.00
53 DOI: 0000001.0000001
54
55 ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: February 2017.
56
57
58
59
60