Submitted to the Annals of Statistics
arXiv: 1203.6502
QUANTIFYING CAUSAL INFLUENCES
By Dominik Janzing∗ ,
David Balduzzi∗† , Moritz Grosse-Wentrup∗ , and Bernhard Schölkopf∗
∗ Max Planck Institute for Intelligent Systems
† ETH Zürich
Many methods for causal inference generate directed acyclic graphs
(DAGs) that formalize causal relations between n variables. Given the
joint distribution on all these variables, the DAG contains all informa-
tion about how intervening on one variable changes the distribution
of the other n − 1 variables. However, quantifying the causal influence
of one variable on another one remains a non-trivial question.
Here we propose a set of natural, intuitive postulates that a mea-
sure of causal strength should satisfy. We then introduce a communi-
cation scenario, where edges in a DAG play the role of channels that
can be locally corrupted by interventions. Causal strength is then the
relative entropy distance between the old and the new distribution.
Many other measures of causal strength have been proposed, in-
cluding average causal effect, transfer entropy, directed information,
and information flow. We explain how they fail to satisfy the pos-
tulates on simple DAGs of ≤ 3 nodes. Finally, we investigate the
behavior of our measure on time-series, supporting our claims with
experiments on simulated data.
1. Introduction. Inferring causal relations is among the most important scientific goals
since causality, as opposed to mere statistical dependencies, provides the basis for reasonable
human decisions. During the past decade, it has become popular to phrase causal relations in
directed acyclic graphs (DAGs) [1] with random variables (formalizing statistical quantities
after repeated observations) as nodes and causal influences as arrows.
We briefly explain this formal setting. Here and throughout the paper, we assume causal
sufficiency, i.e., there are no hidden variables that influence more than one of the n observed
variables. Let G be a causal DAG with nodes X1 , . . . , Xn where Xi → Xj means that Xi
influences Xj “directly” in the sense that intervening on Xi changes the distribution of Xj
even if all other variables are held constant (also by interventions). To simplify notation, we
will mostly assume the Xj to be discrete. P (x1 , . . . , xn ) denotes the probability mass function
of the joint distribution P (X1 , . . . , Xn ). According to the Causal Markov Condition [2, 1],
which we take for granted in this paper, every node Xj is conditionally independent of its
non-descendants, given its parents with respect to the causal DAG G. If P Aj denotes the set
of parent variables of Xj (i.e., its direct causes) in G, the joint probability thus factorizes [3]
into
$$P_G(x_1, \dots, x_n) = \prod_{j=1}^{n} P(x_j \mid pa_j) , \qquad (1)$$
where paj denotes the values of P Aj . By slightly abusing the notion of conditional probabili-
ties, we assume that P (Xj |paj ) is also defined for those paj with P (paj ) = 0. In other words,
AMS 2000 subject classifications: Primary Causality, Bayesian networks; secondary Information flow, Trans-
fer entropy
imsart-aos ver. 2012/04/10 file: paper_final.tex date: August 27, 2013
we know how the causal mechanisms act on potential combinations of values of the parents
that never occur. Note that this assumption has implications because such causal conditionals
cannot be learned from observational data even if the causal DAG is known.
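The factorization (1) can be illustrated on a small example. The following is a minimal sketch, assuming a hypothetical chain DAG Z → X → Y over binary variables; the conditional probability tables are made-up numbers chosen only for illustration and are not taken from the paper.

```python
import itertools
import numpy as np

# Hypothetical chain DAG Z -> X -> Y (illustrative numbers only).
p_z = np.array([0.5, 0.5])              # P(z)
p_x_given_z = np.array([[0.9, 0.1],     # P(x | z=0)
                        [0.2, 0.8]])    # P(x | z=1)
p_y_given_x = np.array([[0.7, 0.3],     # P(y | x=0)
                        [0.1, 0.9]])    # P(y | x=1)

# Eq. (1): the joint is the product of each node's conditional given its parents.
joint = np.zeros((2, 2, 2))             # indexed as [z, x, y]
for z, x, y in itertools.product(range(2), repeat=3):
    joint[z, x, y] = p_z[z] * p_x_given_z[z, x] * p_y_given_x[x, y]

# Causal Markov condition for Y: P(y | x, z) must not depend on the
# non-descendant Z once the parent X is given.
p_y_given_zx = joint / joint.sum(axis=2, keepdims=True)
markov_holds = np.allclose(p_y_given_zx[0], p_y_given_zx[1])
print(joint.sum(), markov_holds)
```

Here all parent configurations have positive probability, so the conditionals are well defined; for configurations with P(pa_j) = 0 the tables would have to be specified by hand, in line with the assumption stated above.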
Given this formalism, why define causal strength? After all, the DAG together with the
causal conditionals contain the complete causal information: one can easily compute how the
joint distribution changes when an external intervention sets some of the variables to specific
values [1]. However, describing causal relations in nature with a DAG always requires first
deciding how detailed the description should be. Depending on the desired precision, one may
want to account for some weak causal links or not. Thus, an objective measure distinguishing
weak arrows from strong ones is required.
1.1. Related work. We discuss some definitions of causal strength that are either known
or suggest themselves as straightforward ideas.
Average causal effect: Following [1], P (Y |do(X = x)) denotes the distribution of Y when X is
set to the value x (it will be introduced more formally in eq. (5)). Note that it only coincides
with the usual conditional distribution P (Y | x) if the statistical dependence between X and
Y is due to a direct influence of X on Y , with no confounding common cause. If all Xi are
binary variables, causal strength can then be quantified by the Average Causal Effect [4, 1]
ACE(Xi → Xj ) := P (Xj = 1|do(Xi = 1)) − P (Xj = 1|do(Xi = 0)) .
If a real-valued variable Xj is affected by a binary variable Xi , one considers the shift of the
mean of Xj that is caused by switching Xi from 0 to 1. Formally, one considers the difference
[5]
E(Xj |do(Xi = 1)) − E(Xj |do(Xi = 0)) .
This measure only accounts for the linear aspect of an interaction since it does not reflect
whether Xi changes higher order moments of the distribution of Xj .
Analysis of Variance (ANOVA): Let Xi be caused by X1 , . . . , Xi−1 . The variance of Xi can
formally be split into the average of the variances of Xi , given Xk with k ≤ i − 1 and the
variance of the expectations of Xi , given Xk :
Var(Xi ) = E(Var(Xi |Xk )) + Var(E(Xi |Xk )) .
In the common scenario of drug testing experiments, for instance, the first term describes
the variability of Xi within a group of equal treatments (i.e. fixed xk ), while the second one
describes how much the means of Xi vary between different treatments. It is tempting to say
that the latter describes the part of the total variation of Xi that is caused by the variation
of Xk , but this is conceptually wrong for non-linear influences and if there are statistical
dependencies between Xk and the other parents of Xi [6, 5].
For linear structural equations,
$$X_i = \sum_{j<i} \alpha_{ij} X_j + E_i \quad \text{with } E_j \text{ being jointly independent,}$$
and additionally assuming Xk to be independent of the other parents of Xi , the second term
is given by Var(αik Xk ), which indeed describes the amount by which the variance of Xi
decreases when Xk is set to a fixed value by intervention. In this sense,
$$r_{ik} := \frac{\mathrm{Var}(\alpha_{ik} X_k)}{\mathrm{Var}(X_i)} \qquad (2)$$
is indeed the fraction of the variance of Xi that is caused by Xk . By rescaling all Xj such that
Var(Xj ) = 1, we have $r_{ik} = \alpha_{ik}^2$. Then, the squared structure coefficients themselves can be
seen as a simple measure for causal strength.
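The step from the variance decomposition to the squared coefficient can be spelled out as follows. This is a sketch under the assumptions stated in the text (linear structural equations, Xk independent of the other parents of Xi and of the noise Ei); the symbol R_{ik} for the remainder is introduced here only for illustration.

```latex
% Split X_i into the contribution of X_k and a remainder R_{ik}
% (R_{ik} is notation introduced for this sketch).
X_i = \alpha_{ik} X_k + R_{ik}, \qquad
R_{ik} := \sum_{j<i,\, j \neq k} \alpha_{ij} X_j + E_i .

% Since X_k is independent of R_{ik}:
E(X_i \mid X_k) = \alpha_{ik} X_k + E(R_{ik}), \qquad
\mathrm{Var}\big(E(X_i \mid X_k)\big) = \alpha_{ik}^2 \,\mathrm{Var}(X_k), \qquad
\mathrm{Var}(X_i) = \alpha_{ik}^2 \,\mathrm{Var}(X_k) + \mathrm{Var}(R_{ik}) .

% Fixing X_k by intervention leaves only Var(R_{ik}); with unit variances:
r_{ik} = \frac{\alpha_{ik}^2 \,\mathrm{Var}(X_k)}{\mathrm{Var}(X_i)} = \alpha_{ik}^2
\quad \text{if } \mathrm{Var}(X_k) = \mathrm{Var}(X_i) = 1 .
```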
(Conditional) Mutual information: the information of X on Y or vice versa is given by [7]
$$I(X; Y) := \sum_{x,y} P(x, y) \log \frac{P(x, y)}{P(x) P(y)} .$$
The information of X on Y or vice versa if Z is given is defined by [7]
$$I(X; Y \mid Z) := \sum_{x,y,z} P(x, y, z) \log \frac{P(x, y \mid z)}{P(x \mid z) P(y \mid z)} . \qquad (3)$$
There are situations where these expressions (with Z describing some background condition)
can indeed be interpreted as measuring the strength of the arrow X → Y . An essential part of
this paper describes the conditions where this makes sense and how to replace the expressions
with other information-theoretic ones when it does not.
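For discrete variables, both quantities can be evaluated directly from a joint probability table. The following is a minimal sketch; the function names and the array layout (`[x, y]` and `[x, y, z]`) are choices made here for illustration, and logarithms are taken to base 2 so that results are in bits.

```python
import numpy as np

def mutual_information(p_xy):
    """I(X;Y) in bits for a joint pmf given as a 2-D array indexed [x, y]."""
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0  # terms with P(x, y) = 0 contribute nothing
    return float(np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x * p_y)[mask])))

def conditional_mutual_information(p_xyz):
    """I(X;Y|Z) in bits, eq. (3), for a joint pmf indexed [x, y, z]."""
    p_z = p_xyz.sum(axis=(0, 1), keepdims=True)
    p_xz = p_xyz.sum(axis=1, keepdims=True)
    p_yz = p_xyz.sum(axis=0, keepdims=True)
    mask = p_xyz > 0
    # P(x,y|z) / (P(x|z) P(y|z)) = P(x,y,z) P(z) / (P(x,z) P(y,z))
    ratio = (p_xyz * p_z)[mask] / (p_xz * p_yz)[mask]
    return float(np.sum(p_xyz[mask] * np.log2(ratio)))

# Usage: Y = X xor Z with X and Z independent, unbiased bits.
p_xyz = np.zeros((2, 2, 2))
for x in range(2):
    for z in range(2):
        p_xyz[x, x ^ z, z] = 0.25
i_xy = mutual_information(p_xyz.sum(axis=2))
i_xy_given_z = conditional_mutual_information(p_xyz)
print(i_xy, i_xy_given_z)
```

In this XOR example X and Y are marginally independent but fully dependent once Z is known, which is why the two quantities differ.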
Granger causality / Transfer Entropy / Directed Information: Quantifying causal influence
between time series (for instance between (Xt )t∈Z and (Yt )t∈Z ) is special because one is in-
terested in quantifying the effect of all (Xt ) on all (Yt+s ). If we represent the causal relations
by a DAG where every time instant defines a separate pair of variables, then we ask for the
strength of a set of arrows. If Xt and Yt are considered as instances of the variables X, Y , we
leave the regime of i.i.d. sampling.
Measuring the reduction of uncertainty in one variable after knowing another is also a
key idea in several related methods for quantifying causal strength in time series. Granger
causality in its original formulation uses reduction of variance [8]. Non-linear information-
theoretic extensions in the same spirit are transfer entropy [9] and directed information [10].
Both are essentially based on conditional mutual information, where each variable X, Y, Z in
(3) is replaced with an appropriate set of variables.
Information flow: Since the above measures quantify dependencies rather than causality,
several authors have defined causal strength by replacing the observed probability distribution
with distributions that arise after interventions (computed via the causal DAG). [11] defined
Information Flow via an operation, “source exclusion”, which removes the influence of a
variable in a network. [12] defined a different notion of Information Flow explicitly via Pearl’s
do-calculus. Both measures are close to ours in spirit and in fact the version in [11] coincides
with ours when quantifying the strength of a single arrow. However, both do not satisfy our
postulates.
Mediation analysis: [13, 14, 15] explore how to separate the influence of X on Y into parts
that can be attributed to specific paths by “blocking” other paths. Consider, for instance,
the case where X influences Y directly and indirectly via X → Z → Y . To test its direct
influence, one changes X from some “reference” value x0 to an “active” value x while keeping
the distribution of Z that either corresponds to the reference value x0 or to the natural
distribution P (X). A natural distinction between a reference state and an active state occurs,
for instance, in a drug testing scenario where taking the drug means switching from reference
to active. In contrast, our goal is not to study the impact of one specific switching from x0
to x. Instead, we want to construct a measure that quantifies the direct effect of the variable
X on Y , while treating all possible values of X in the same way. Nevertheless, there are
interesting relations between these approaches and ours that we briefly discuss at the end of
Subsection 4.2.
2. Postulates for causal strength. Let us first discuss the properties we expect a
measure of causal strength to have. The key idea is that causal strength is supposed to
measure the impact of an intervention that removes the respective arrows. We present five
properties that we consider reasonable. Let CS denote the strength of the arrows in set S. By
slightly overloading notation, we write CX→Y instead of C{X→Y } .
P0. Causal Markov condition: If CS = 0, then the joint distribution satisfies the Markov
condition with respect to the DAG GS obtained by removing the arrows in S.
P1. Mutual information: If the true causal DAG reads X → Y , then
CX→Y = I(X; Y ) .
P2. Locality: The strength of X → Y only depends on (1) how Y depends on X and
its other parents, and (2) the joint distribution of all parents of Y . Formally, knowing
P (Y |P AY ) and P (P AY ) is sufficient to compute CX→Y . For strictly positive densities,
this is equivalent to knowing P (Y, P AY ).
P3. Quantitative causal Markov condition: If there is an arrow from X to Y then the
causal influence of X on Y is greater than or equal to the conditional mutual information
between Y and X, given all the other parents of Y . Formally
$$C_{X \to Y} \geq I(X; Y \mid PA_Y^X) .$$
P4. Heredity: If the causal influence of a set of arrows is zero, then the causal influence of
all its subsets (in particular, individual arrows) is also zero.
If S ⊂ T, then CT = 0 =⇒ CS = 0.
Note that we do not claim that every reasonable measure of causal strength should sat-
isfy these postulates, but we now explain why we consider them natural and show that the
postulates make sense for simple DAGs.
P0: If the purpose of our measure of causal strength is to quantify relevance of arrows then
removing a set of arrows with zero strength must make no difference. If, for instance, CX→Y =
0, removing X → Y should not yield a DAG that is ruled out by the causal Markov condition.
We should emphasize that CS can be non-zero even if S consists of arrows each individually
having zero strength.
P1: The mutual information actually measures the strength of statistical dependencies. Since
all these dependencies are generated by the influence of X on Y (and not by a common cause
or Y influencing X), it makes sense to measure causal strength by strength of dependencies.
Note that mutual information I(X; Y ) = H(Y ) − H(Y |X) also quantifies the variability in Y
that is due to the variability in X, see also §A.4.
Mutual information versus channel capacity. Given the premise that causal strength should
be an information-like quantity, a natural alternative to mutual information is the capacity of
the information channel x ↦ P (Y |x), i.e., the maximum of the mutual information
$I_{Q(X)}(X; Y)$ over all input distributions Q(X) of X when keeping the conditional P (Y |X) fixed.
While mutual information I(X; Y ) quantifies the observable dependencies, channel capac-
ity quantifies the strength of the strongest dependencies that can be generated using the
information channel P (Y |X). In this sense, I(X; Y ) quantifies the factual causal influence,
while channel capacity measures the potential influence. Channel capacity also accounts for
the impact of setting x to values that rarely or never occur in the observations. However, this
sensitivity regarding effects of rare inputs can certainly be a problem for estimating the effect
from sparse data. We therefore prefer mutual information I(X; Y ) as it better assesses the
extent to which frequently observed changes in X influence Y .
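The distinction between factual and potential influence can be made concrete numerically. The sketch below assumes a hypothetical binary symmetric channel with error probability 0.1 and a hypothetical, strongly skewed observed input distribution; the capacity is approximated by a simple grid search over input distributions rather than by a dedicated algorithm.

```python
import numpy as np

def mutual_information(p_x, channel):
    """I(X;Y) in bits for input pmf p_x and channel[x, y] = P(y | x)."""
    p_xy = p_x[:, None] * channel
    p_y = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return float(np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x[:, None] * p_y)[mask])))

# Hypothetical channel P(Y | X): binary symmetric with error probability 0.1.
channel = np.array([[0.9, 0.1],
                    [0.1, 0.9]])

# Hypothetical observed input distribution: X = 1 occurs rarely.
observed_input = np.array([0.95, 0.05])
factual = mutual_information(observed_input, channel)

# Potential influence: maximize over all binary input distributions (grid search).
grid = np.linspace(0.0, 1.0, 1001)
capacity = max(mutual_information(np.array([1.0 - q, q]), channel) for q in grid)
print(factual, capacity)
```

The mutual information under the observed input is smaller than the capacity because the rarely occurring input value contributes little to the observed dependence, even though the channel could transmit more.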
P2: Locality implies that we can ignore causes of X when computing CX→Y , unless they are
at the same time direct causes of Y . Likewise, other effects of Y are irrelevant. Moreover,
it does not matter how the dependencies between the parents are generated (which parent
influences which one or whether they are effects of a common cause), we only need to know
their joint distribution with X.
Violations of locality have paradoxical implications. For example, variable Z should be
irrelevant in DAG 1a). Otherwise, CX→Y would depend on the mechanism that generates
the distribution of X, while we are actually concerned with the information flowing from X
to Y instead of that flowing to X from other nodes. Likewise, (see DAGs 1b) and 1c)) it is
irrelevant whether X and Y have further effects.
P3: To justify the name of this postulate, observe that the restriction of P0 to the single
arrow case S = {X → Y } is equivalent to
$$C_{X \to Y} = 0 \implies I(Y; X \mid PA_Y^X) = 0 .$$
To see this, we use the ordered Markov condition [1], Theorem 1.2.6, which is known to be
equivalent to the version mentioned in the introduction. It states that every node is condition-
ally independent of its predecessors (according to some ordering consistent with the DAG),
given its parents. If P RY denotes the predecessors of Y for some ordering that is consistent
with G and GS , the ordered Markov condition for GS holds iff
$$Y \perp\!\!\!\perp PR_Y \mid PA_Y^X , \qquad (4)$$
since the conditions for all other nodes remain the same as in G. Due to the semi-graphoid
axioms (weak union and contraction rule [1]), (4) is equivalent to
$$Y \perp\!\!\!\perp PR_Y \setminus \{X\} \mid PA_Y \quad \wedge \quad Y \perp\!\!\!\perp X \mid PA_Y^X .$$
Since the condition on the left is guaranteed by the Markov condition on G, the Markov
condition on GS is equivalent to $I(Y; X \mid PA_Y^X) = 0$.
In words, the arrow X → Y is the only reason for the conditional dependence $I(Y; X \mid PA_Y^X)$
to be non-zero, hence it is natural to postulate that its strength cannot be smaller than the
dependence that it generates. Subsection 4.3 explains why we should not postulate equality.
P4: The postulate provides a compatibility condition: if a set of arrows has zero causal
influence, and so can be eliminated without affecting the causal DAG, then the same should
hold for all subsets of that set. We refer to this as the heredity property by analogy with
matroid theory, where heredity implies that every subset of an independent set is independent.
3. Problems of known definitions. Our definition of causal strength is presented in
Section 4. This section discusses problems with alternate measures of causal strength.
[Figure 1: four DAGs, panels a) to d), over the nodes X, Y, Z.]
Fig 1. DAGs for which the (conditional) mutual information is a reasonable measure of causal strength: For
a) to c), our postulates imply CX→Y = I(X; Y ). For d) we will obtain CX→Y = I(X; Y |Z). The nodes X and
Y are shaded because they are source and target of the arrow X → Y , respectively.
[Figure 2: two DAGs, panels a) and b), over the nodes X, Y, Z.]
Fig 2. DAGs for which finding a proper definition of CX→Y is challenging.
3.1. ACE and ANOVA. The first two measures are ruled out by P0. Consider a relation
between three binary variables X, Y, Z, where Y = X ⊕ Z with X and Z being unbiased and
independent. Then changing X has no influence on the statistics of Y . Likewise, knowing X
does not reduce the variance of Y . To satisfy P0, both measures would have to be modified so that
they register the influence of X on Y for each fixed value z rather than marginalizing over Z.
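The XOR argument can be checked in a few lines. This is a minimal sketch of the example as described above (Y = X ⊕ Z with X and Z unbiased and independent); the variable names are chosen here for illustration.

```python
import numpy as np

p_z = np.array([0.5, 0.5])  # Z is an unbiased bit, independent of X

# mechanism[x, z] = P(Y = 1 | x, z) for the deterministic gate Y = X xor Z
mechanism = np.array([[float(x ^ z) for z in range(2)] for x in range(2)])

# X and Z are independent, so P(Y = 1 | do(X = x)) = sum_z P(z) P(Y = 1 | x, z).
p_y1_do_x = mechanism @ p_z
ace = p_y1_do_x[1] - p_y1_do_x[0]

# Effect of switching X from 0 to 1 for each fixed value z.
effect_given_z = mechanism[1] - mechanism[0]
print(ace, effect_given_z)
```

The average causal effect vanishes because the effects for z = 0 and z = 1 have opposite signs and cancel when marginalizing over Z, while each z-specific effect has magnitude one.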
3.2. Mutual information and conditional mutual information. It suffices to consider a few
simple DAGs to illustrate why mutual information and conditional mutual information are
not suitable measures of causal strength in general.
Mutual information is not suitable in Figure 2a. It is clear that I(X; Y ) is inappropriate
because we can obtain I(X; Y ) ≠ 0 even when the arrow X → Y is missing, due to the
common cause Z.
Conditional mutual information is not suitable for Figure 2a. Consider the limiting case where
the direct influence Z → Y gets weaker until it almost disappears (P (y|x, z) ≈ P (y|x)). Then
the behavior of the system (observationally and interventionally) is approximately described
by the DAG 1a). Using I(X; Y |Z) makes no sense in this scenario since, for example, X may
be obtained from Z by a simple copy operation, in which case I(X; Y |Z) = 0 necessarily,
even when X influences Y strongly.
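The copy scenario can be verified numerically. The sketch below assumes a concrete instance chosen for illustration: Z is an unbiased bit, X is an exact copy of Z, and Y is an exact copy of X (so the direct influence Z → Y has disappeared entirely). Mutual informations are computed from entropies in bits.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a pmf given as an array of any shape."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Joint pmf indexed [z, x, y]: Z unbiased, X = Z, Y = X.
joint = np.zeros((2, 2, 2))
for z in range(2):
    joint[z, z, z] = 0.5

# I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)
cmi = (entropy(joint.sum(axis=2)) + entropy(joint.sum(axis=1))
       - entropy(joint) - entropy(joint.sum(axis=(1, 2))))

# I(X;Y) = H(X) + H(Y) - H(X,Y)
mi = (entropy(joint.sum(axis=(0, 2))) + entropy(joint.sum(axis=(0, 1)))
      - entropy(joint.sum(axis=0)))
print(cmi, mi)
```

Conditioning on Z leaves no remaining uncertainty in X, so the conditional mutual information is zero although Y is fully determined by X.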
3.3. Transfer entropy. Transfer entropy [9] is intended to measure the influence of one
time-series on another one. Let (Xt , Yt )t∈Z be a bivariate stochastic process where Xt influence
some Ys with s > t, see figure 3, left. Then transfer entropy is defined as the following
conditional mutual information:
I(X(−∞,t−1] → Yt |Y(−∞,t−1] ) := I(X(−∞,t−1] ; Yt |Y(−∞,t−1] ) .
[Figure 3: two time-series DAGs (left and right) over the nodes X_{t−2}, X_{t−1}, X_t, X_{t+1} and Y_{t−2}, Y_{t−1}, Y_t, Y_{t+1}.]
Fig 3. Left: Typical causal DAG for two time series with mutual causal influence. The structure is acyclic
because instantaneous influences are excluded. Right: counterexample in [12]. Transfer entropy vanishes if all
arrows are copy operations although the time series strongly influence each other.
[Figure 4: time-series DAG over the nodes X_{t−2}, X_{t−1}, X_t, X_{t+1} and Y_{t−2}, Y_{t−1}, Y_t, Y_{t+1}.]
Fig 4. Time series with only two causal arrows, where transfer entropy fails to satisfy our postulates.
It measures the amount of information the past of X provides about the present of Y given
the past of Y . To quantify causal influence by conditional information relevance is also in
the spirit of Granger causality, where information is usually understood in the sense of the
amount of reduction of the linear prediction error.
Transfer entropy is an unsatisfactory measure of causal strength. [12] pointed out that transfer
entropy fails to quantify causal influence for the following toy model: Assume the information
from Xt is perfectly copied to Yt+1 and the information from Yt to Xt+1 (see Figure 3, right).
Then the past of Y is already sufficient to perfectly predict the present value of Y and the past
of X does not provide any further information. Therefore, transfer entropy vanishes although
both variables heavily influence one another. If the copy operation is noisy, transfer entropy
is non-zero and thus seems more reasonable, but the quantitative behavior is still wrong (as
we will argue in Example 7).
Transfer entropy violates the postulates. Transfer entropy yields 0 bits of causal influence in a
situation where common sense and P1 together with P2 require that causal strength is 1 bit
(P2 reduces the DAG to one in which P1 applies). Since our postulates refer to the strength
of a single arrow while transfer entropy is supposed to measure the strength of all arrows from
X to Y , we reduce the DAG such that there is only one arrow from X to Y , see Figure 4.
Then,
I(X(−∞,t−1] → Yt |Y(−∞,t−1] ) = I(X(−∞,t−1] ; Yt |Y(−∞,t−1] ) = I(Xt−1 ; Yt |Yt−2 ) .
The causal structure coincides with DAG 1a) by setting Yt−2 ≡ Z, Xt−1 ≡ X, and Yt ≡ Y .
With these replacements, transfer entropy yields I(X; Y |Z) = 0 bits instead of I(X; Y ) = 1
bit, as required by P1 and P2.
Note that the same problem occurs if causal strength between time series is quantified by
directed information [10] because this measure also conditions on the entire past of Y .
3.4. Information flow. Note that [12] and [11] introduce two different quantities, both
called “information flow”. We consider them in turn.
After arguing that transfer entropy does not properly capture the strength of the impact
of interventions, [12] proposes to define causal strength using Pearl’s do-calculus [1]. Given a
causal directed acyclic graph G, Pearl computes the joint distribution obtained if variable Xj
is forcibly set to the value xj as
$$P(x_1, \dots, x_n \mid do(x'_j)) := \prod_{i \neq j} P(x_i \mid pa_i) \cdot \delta_{x_j, x'_j} . \qquad (5)$$
Intuitively, the intervention on Xj removes the dependence of Xj on its parents and therefore
replaces P (xj |paj ) with the Kronecker symbol. Likewise, one can define interventions on several
nodes by replacing all conditionals with Kronecker symbols.
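The truncated factorization (5) can be sketched for a confounded three-node DAG. The example assumes the structure Z → X, Z → Y, X → Y with binary variables; all probability tables are hypothetical numbers chosen for illustration, and `do_x` is a helper name introduced here.

```python
import numpy as np

# Hypothetical conditionals for the DAG Z -> X, Z -> Y, X -> Y.
p_z = np.array([0.5, 0.5])                         # P(z)
p_x_given_z = np.array([[0.9, 0.1],                # P(x | z), indexed [z, x]
                        [0.2, 0.8]])
p_y1 = np.array([[0.1, 0.5],                       # P(Y = 1 | z, x), indexed [z, x]
                 [0.4, 0.9]])
p_y_given_zx = np.stack([1.0 - p_y1, p_y1], axis=-1)   # indexed [z, x, y]

def do_x(x0):
    """Joint P(z, x, y | do(X = x0)): P(x | z) is replaced by a Kronecker delta."""
    delta = np.zeros(2)
    delta[x0] = 1.0
    return p_z[:, None, None] * delta[None, :, None] * p_y_given_zx

# Observational joint according to the factorization (1).
observational = p_z[:, None, None] * p_x_given_z[:, :, None] * p_y_given_zx

p_y1_do_x1 = do_x(1).sum(axis=(0, 1))[1]                       # P(Y=1 | do(X=1))
p_y1_given_x1 = observational[:, 1, 1].sum() / observational[:, 1, :].sum()
print(p_y1_do_x1, p_y1_given_x1)
```

With these numbers the interventional and the observational conditional differ, because Z is a common cause of X and Y and therefore confounds P(Y | x).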
Given three sets of nodes XA , XB and XC in a directed acyclic graph G, information flow
is defined by
$$I(X_A \to X_B \mid do(X_C)) := \sum_{x_C, x_A, x_B} P(x_C)\, P(x_A \mid do(x_C))\, P(x_B \mid do(x_A, x_C)) \log \frac{P(x_B \mid do(x_A, x_C))}{\sum_{x'_A} P(x'_A \mid do(x_C))\, P(x_B \mid do(x'_A, x_C))} .$$
To better understand this expression, we first consider the case where the set XC is empty.
Then we obtain
$$I(X_A \to X_B) := \sum_{x_A, x_B} P(x_A)\, P(x_B \mid do(x_A)) \log \frac{P(x_B \mid do(x_A))}{\sum_{x'_A} P(x'_A)\, P(x_B \mid do(x'_A))} ,$$
which measures the mutual information between XA and XB obtained when the information
channel xA ↦ P (XB |do(xA )) is used with the input distribution P (XA ).
Information flow, as defined in [12], is an unsatisfactory measure of causal strength. To quan-
tify X → Y in DAGs 2a) and 2b) using information flow, we may either choose I(X → Y )
or I(X → Y |do(Z)). Both choices are inconsistent with our postulates and intuitive expec-
tations.
Start with I(X → Y ) and DAG 2a). Let X, Y, Z be binary with Y := X ⊕ Z an XOR. Let
Z be an unbiased coin toss and X obtained from Z by a faulty copy operation with two-sided
symmetric error. One easily checks that I(X → Y ) is zero in the limit of error probability
1/2 (making X and Y independent). Nevertheless, dropping the arrow X → Y violates the
Markov condition, contradicting P0. For error rate close to 1/2, we still violate P3 because
I(Y ; X |Z) is close to 1, while I(X → Y ) is close to zero. A similar argument applies to DAG
2b).
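The comparison for DAG 2a) can be reproduced numerically. The sketch below follows the example in the text (Z unbiased, X a noisy copy of Z, Y = X ⊕ Z) with an error probability of 0.45, a value picked here for illustration; both quantities are computed from entropies in bits.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a pmf given as an array of any shape."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

eps = 0.45                                    # illustrative error rate of the copy Z -> X
p_z = np.array([0.5, 0.5])
p_x_given_z = np.array([[1 - eps, eps],
                        [eps, 1 - eps]])      # indexed [z, x]

# Observational joint indexed [z, x, y] with Y = X xor Z.
joint = np.zeros((2, 2, 2))
for z in range(2):
    for x in range(2):
        joint[z, x, x ^ z] = p_z[z] * p_x_given_z[z, x]

# Channel x -> P(y | do(x)) = sum_z P(z) P(y | x, z), indexed [x, y].
channel = np.zeros((2, 2))
for x in range(2):
    for z in range(2):
        channel[x, x ^ z] += p_z[z]

# Information flow I(X -> Y): mutual information of P(x) P(y | do(x)).
p_x = joint.sum(axis=(0, 2))
flow_joint = p_x[:, None] * channel
info_flow = entropy(p_x) + entropy(flow_joint.sum(axis=0)) - entropy(flow_joint)

# I(Y;X|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)
cmi = (entropy(joint.sum(axis=2)) + entropy(joint.sum(axis=1))
       - entropy(joint) - entropy(p_z))
print(info_flow, cmi)
```

For this error rate the information flow is numerically zero while the conditional mutual information is close to one bit, which is the gap that conflicts with P3.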
Now consider I(X → Y |do(Z)). Note that it yields different results for DAGs 2a) and 2b)
when the joint distribution is the same, contradicting P2. This is because P (x|do(z)) = P (x|z)
for 2a), while P (x|do(z)) = P (x) for 2b). In other words, I(X → Y |do(Z)) depends on the
causal relation between the two causes X and Z, rather than only on the relation between
causes and effects.
Apart from being inconsistent with our postulate, it is unsatisfactory that I(X → Y |do(Z))
tends to zero for the example above if the error rate of copying X from Z in DAG 2a) tends
to zero (conditioned on setting Z to some value, the information passed from X to Y is zero
because X attains a fixed value, too). In this limit, Y is always zero. Clearly, however, link
X → Y is important for explaining the behavior of the XOR: without the link, the gate would
not output “zero” for both Z = 0 and Z = 1.
Information flow, as defined in [11], is unsatisfactory as a measure of causal strength for
sets of edges. Since this measure is close to ours, we will explain (see caption of Figure 5) the
difference when introducing ours and show that P4 fails without our modification.
4. Defining the strength of causal arrows.
4.1. Definition in terms of conditional probabilities. This section proposes a way to quan-
tify the causal influence of a set of arrows that yields satisfactory answers in all the cases
discussed above. Our measure is motivated by a scenario where nodes represent different
parties communicating with each other via channels. Hence, we think of arrows as physical
channels that propagate information between distant points in space, e.g., wires that con-
nect electronic devices. Each such wire connects the output of a device with the input of
another one. For the intuitive ideas below, it is also important that the wire connecting Xi
and Xj physically contains full information about Xi (which may be more than the infor-
mation that is required to explain the output behavior P (Xj |P Aj )). We then think of the
strength of arrow Xi → Xj as the impact of corrupting it, i.e., the impact of cutting the wire.
To get a well-defined “post-cutting” distribution we have to say what to do with the open
end corresponding to Xj , because it needs to be fed with some input. It is natural to feed it
probabilistically with inputs xi according to P (Xi ) because this is the only distribution of Xi
that is locally observable (feeding it with some conditional distribution P (Xi |..) assumes that
the one cutting the edge has access to other nodes, and not only to the physical state of the
channel). Note that this notion of cutting edges coincides with the “source exclusion” defined
in [11] if only one edge is cut. However, we define the deletion of a set of arrows by feeding
all open ends with the product of the corresponding marginal distributions, while [11] keeps
the dependencies between the open ends and removes the dependencies between open ends
and the other variables. Our post-cutting distribution can be thought of as arising from a
scenario where each channel is cut by an independent attacker, who tries to blur the attack
by feeding her open end with P (Xi ) (which is the only distribution she can see), while [11]
requires communicating attackers who agree on feeding their open ends with the observed
joint distribution.
Lemma 1 and Remark 1 below provide a more mathematical argument for the product
distribution. Figure 5 visualizes the deletion of one edge (left) and two edges (right).
We now define the “post-cutting” distribution formally:
Definition 1 (removing causal arrows). Let G be a causal DAG and P be Markovian
with respect to G. Let S ⊂ G be a set of arrows. Set $PA_j^S$ as the set of those parents Xi of
Xj for which (i, j) ∈ S and $PA_j^{\bar S}$ those for which (i, j) ∉ S. Set
$$P_S(x_j \mid pa_j^{\bar S}) := \sum_{pa_j^S} P(x_j \mid pa_j^{\bar S}, pa_j^S)\, P_\Pi(pa_j^S) , \qquad (6)$$
[Figure 5: two panels showing the DAG over X, Z, Y with cut arrows; the open ends are fed with P(X) (left) and with P(X) and P(Z) (right).]
Fig 5. Left: deletion of the arrow X → Y . The conditional P (Y |X, Z) is fed with an independent copy of X,
distributed with P (X). The resulting distribution reads $P_{X \to Y}(x, y, z) = P(x, z) \sum_{x'} P(y \mid z, x') P(x')$.
Right: deletion of both incoming arrows. The conditional P (Y |X, Z) is then fed with the product distribution
P (X)P (Z) instead of the joint P (X, Z) as in [11], since the latter would require communication between the
open ends. We obtain $P_{X \to Y, Z \to Y}(x, y, z) = \sum_{x', z'} P(x, z) P(y \mid x', z') P(x') P(z')$. Feeding with
independent inputs is particularly relevant for the following example: let X and Z be binary with X = Z and
Y = X ⊕ Z. Then the cutting would have no impact if we kept the dependences.
where $P_\Pi(pa_j^S)$ denotes for a given j the product of marginal distributions of all variables in
$PA_j^S$. Define a new joint distribution, the interventional distribution¹
$$P_S(x_1, \dots, x_n) := \prod_j P_S(x_j \mid pa_j^{\bar S}) . \qquad (7)$$
See Figure 5, left, for a simple example with cutting only one edge. Eq. (7) formalizes the
fact that each open end of the wires is independently fed with the corresponding marginal
distribution, see also Figure 5, right. Information flow in the sense of [11] is obtained when
the product distribution $P_\Pi(pa_j^S)$ in (6) is replaced with the joint distribution $P(pa_j^S)$.
The modified joint distribution PS can be considered as generated by the reduced DAG:
Lemma 1 (Markovian). The interventional distribution PS is Markovian with respect to
the graph GS obtained from G by removing the edges in S.
Proof. By construction, PS factorizes according to GS in the sense of (1).
Remark 1. Markovianity is violated if the dependencies between open ends are kept.
Consider, for instance, the DAG X → Y → Z. Cutting both edges yields
$$P_S(x, y, z) = P(x) \sum_{x'} P(y \mid x') P(x') \sum_{y'} P(z \mid y') P(y') = P(x) P(y) P(z) ,$$
which is obviously Markovian with respect to the DAG without arrows. Feeding the “open
ends” with P (x′, y′) instead yields
$$\tilde{P}_S(x, y, z) = P(x) \sum_{x', y'} P(y \mid x') P(z \mid y') P(x', y') ,$$
which induces dependencies between Y and Z, although we have claimed to have removed all
links between the three variables.
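Remark 1 can be illustrated with a concrete chain. The sketch below assumes, purely for illustration, that X is an unbiased bit and that both arrows of X → Y → Z are noiseless copy operations; it compares feeding the open ends with the product of marginals against feeding them with the joint P(x′, y′).

```python
import numpy as np

def mutual_information(p_ab):
    """I(A;B) in bits for a joint pmf given as a 2-D array indexed [a, b]."""
    p_a = p_ab.sum(axis=1, keepdims=True)
    p_b = p_ab.sum(axis=0, keepdims=True)
    mask = p_ab > 0
    return float(np.sum(p_ab[mask] * np.log2(p_ab[mask] / (p_a * p_b)[mask])))

# Illustrative chain X -> Y -> Z in which both arrows are copy operations.
p_x = np.array([0.5, 0.5])
p_y_given_x = np.eye(2)                    # indexed [x, y]
p_z_given_y = np.eye(2)                    # indexed [y, z]
p_xy = p_x[:, None] * p_y_given_x          # observed joint P(x, y)
p_y = p_xy.sum(axis=0)

# Product feeding: each open end receives its own marginal.
fed_y = p_x @ p_y_given_x                  # sum_{x'} P(y | x') P(x')
fed_z = p_y @ p_z_given_y                  # sum_{y'} P(z | y') P(y')
product_cut = np.einsum('x,y,z->xyz', p_x, fed_y, fed_z)

# Joint feeding: the open ends receive the dependent pair (x', y').
fed_yz = np.einsum('ay,bz,ab->yz', p_y_given_x, p_z_given_y, p_xy)
joint_cut = np.einsum('x,yz->xyz', p_x, fed_yz)

dep_product = mutual_information(product_cut.sum(axis=0))   # I(Y;Z) after product feeding
dep_joint = mutual_information(joint_cut.sum(axis=0))       # I(Y;Z) after joint feeding
print(dep_product, dep_joint)
```

With joint feeding, Y and Z remain perfectly correlated in this example even though both arrows have been cut, whereas product feeding yields fully independent variables.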
¹ Note that this intervention differs from the kind of interventions considered by [1], where variables are set
to specific values. Here we intervene on the arrows, the information channels, and not on the nodes.
Definition 2 (causal influence of a set of arrows). The causal influence of the arrows in
S is given by the Kullback-Leibler divergence
$$C_S(P) := D(P \,\|\, P_S) . \qquad (8)$$
If S = {Xk → Xl } is a single edge we write Ck→l instead of CXk →Xl .
Remark 2 (observing versus intervening). Note that PS could easily be confused with
a different distribution obtained when the open ends are fed with conditional distributions
rather than marginal distributions. As an illustrative example, consider DAG 2a) and define
$\tilde{P}_{X \to Y}(X, Y, Z)$ as
$$\tilde{P}_{X \to Y}(x, y, z) := P(x, z) P(y \mid z) = P(x, z) \sum_{x'} P(y \mid z, x') P(x' \mid z) ,$$
and recall that replacing P (x′|z) with P (x′) in the rightmost expression yields PX→Y . We
call P˜X→Y the “partially observed distribution”. It is the distribution obtained by ignoring
the influence of X on Y : P˜X→Y is computed according to (1), but uses a DAG where X → Y
is missing. The difference between “ignoring” and “cutting” the edge is important for the
following reason. By a known rephrasing of mutual information as relative entropy [7] we
obtain
$$D(P \,\|\, \tilde{P}_{X \to Y}) = \sum_{x,y,z} P(x, y, z) \log \frac{P(y \mid z, x)}{P(y \mid z)} = I(X; Y \mid Z) , \qquad (9)$$
which, as we have already discussed, is not a satisfactory measure of causal strength. On the
other hand, we have
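$$C_{X \to Y} = D(P \,\|\, P_{X \to Y}) = D\big[P(Y \mid Z, X) \,\|\, P_{X \to Y}(Y \mid Z, X)\big] \qquad (10)$$
$$= D\big[P(Y \mid Z, X) \,\|\, P_{X \to Y}(Y \mid Z)\big] = \sum_{x,y,z} P(x, y, z) \log \frac{P(y \mid z, x)}{\sum_{x'} P(y \mid z, x') P(x')} . \qquad (11)$$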
Comparing the second expressions in (11) and (9) shows again that the difference between
ignoring and cutting is due to the difference between P (y|z) and $\sum_{x'} P(y \mid z, x') P(x')$.
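The difference between "ignoring" and "cutting" can be checked on DAG 2a). The sketch below assumes an illustrative instance in which Z is an unbiased bit, X is an exact copy of Z, and the mechanism of Y simply copies X; the helper `kl_divergence` and all variable names are introduced here for illustration.

```python
import numpy as np

def kl_divergence(p, q):
    """Relative entropy D(p || q) in bits; assumes q > 0 wherever p > 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

# Illustrative instance of DAG 2a): Z unbiased, X = Z, and P(y | z, x) = 1 if y == x.
p_z = np.array([0.5, 0.5])
p_x_given_z = np.eye(2)                              # indexed [z, x]
p_y_given_zx = np.zeros((2, 2, 2))                   # indexed [z, x, y]
for z in range(2):
    for x in range(2):
        p_y_given_zx[z, x, x] = 1.0                  # defined for all (z, x), even unobserved ones

p_zx = p_z[:, None] * p_x_given_z                    # P(z, x)
joint = p_zx[:, :, None] * p_y_given_zx              # P(z, x, y)
p_x = p_zx.sum(axis=0)

# Cutting X -> Y: feed the mechanism with an independent copy x' ~ P(X), eq. (11).
cut_cond = np.einsum('zay,a->zy', p_y_given_zx, p_x)     # sum_{x'} P(y | z, x') P(x')
p_cut = p_zx[:, :, None] * cut_cond[:, None, :]

# Ignoring X -> Y: use the observational conditional P(y | z), eq. (9).
p_y_given_z = joint.sum(axis=1) / p_z[:, None]
p_ignore = p_zx[:, :, None] * p_y_given_z[:, None, :]

causal_strength = kl_divergence(joint, p_cut)        # C_{X -> Y}
cond_mutual_info = kl_divergence(joint, p_ignore)    # I(X; Y | Z)
print(causal_strength, cond_mutual_info)
```

In this instance cutting the arrow costs one bit, whereas the conditional mutual information is zero because Z already determines X.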
The following scenario provides a better intuition for the rightmost expression in (11).
Example 1 (Redistributing a vaccine). Consider the task of quantifying the effectiveness
of a vaccine. Let X indicate whether a patient decides to get vaccinated or not and Y whether
the patient becomes infected. Further assume that the vaccine’s effectiveness is strongly con-
founded by age Z because the vaccination often fails for elderly people. At the same time,
elderly people request the vaccine more often because they are more afraid of infection. Ignor-
ing other confounders, the DAG in Fig. 2a visualizes the causal structure.
Deleting the edge X → Y corresponds to an experiment where the vaccine is randomly
assigned to patients regardless of their intent and age (while keeping the total fraction of patients
vaccinated constant). Then $P_{X \to Y}(y|z, x) = P_{X \to Y}(y|z) = \sum_{x'} P(y|z, x')\, P(x')$ represents the
conditional probability of infection, given age, when vaccines are distributed randomly. $C_{X \to Y}$
quantifies the difference to P (y|z, x), which is the conditional probability of infection, given
age and intention when patients act on their intentions. It thus measures the impact of de-
stroying the coupling between the intention to get the vaccine and getting it via randomized
redistribution.
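For discrete variables, the difference between ignoring and cutting the edge can be checked numerically. The following Python sketch is only an illustration: the function names and all numbers in the vaccine example are hypothetical and not taken from any data. It evaluates $I(X; Y|Z)$ as in (9) and $C_{X \to Y}$ as in (11) for the DAG Z → X, Z → Y, X → Y.

```python
import numpy as np

def kl_bits(p, q):
    """Relative entropy D(p || q) in bits for probability tables of equal shape."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def ignoring_vs_cutting(p_z, p_x_given_z, p_y_given_zx):
    """Return (I(X;Y|Z), C_{X->Y}) for the DAG Z -> X, Z -> Y, X -> Y.

    p_z[z], p_x_given_z[z, x] and p_y_given_zx[z, x, y] are probability tables.
    """
    p_zx = p_z[:, None] * p_x_given_z                  # P(z, x)
    p = p_zx[:, :, None] * p_y_given_zx                # P(z, x, y)
    p_x = p_zx.sum(axis=0)                             # P(x)
    # Ignoring the edge: average P(y|z,x') with the conditional P(x'|z).
    p_y_ignore = np.einsum("zx,zxy->zy", p_x_given_z, p_y_given_zx)
    # Cutting the edge: average P(y|z,x') with the marginal P(x').
    p_y_cut = np.einsum("x,zxy->zy", p_x, p_y_given_zx)
    p_ignore = p_zx[:, :, None] * p_y_ignore[:, None, :]
    p_cut = p_zx[:, :, None] * p_y_cut[:, None, :]
    return kl_bits(p, p_ignore), kl_bits(p, p_cut)

# Hypothetical numbers: Z = age group (0 young, 1 elderly),
# X = vaccinated (0/1), Y = infected (0/1).
p_z = np.array([0.6, 0.4])
p_x_given_z = np.array([[0.8, 0.2],      # young: 20% request the vaccine
                        [0.3, 0.7]])     # elderly: 70% request the vaccine
p_y_given_zx = np.array([[[0.70, 0.30], [0.95, 0.05]],   # young
                         [[0.60, 0.40], [0.70, 0.30]]])  # elderly
cmi, strength = ignoring_vs_cutting(p_z, p_x_given_z, p_y_given_zx)
print(f"I(X;Y|Z) = {cmi:.4f} bits, C_X->Y = {strength:.4f} bits")
```

The two averages differ only in the weights $P(x'|z)$ versus $P(x')$, which mirrors the comparison of (9) and (11).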
4.2. Definition via structural equations. The definition above uses the conditional density
P (xj |paj ). Estimating a conditional density from empirical data requires huge samples or
strong assumptions – particularly for continuous variables. Fortunately, however, structural
equations (also called functional models [1]) allow more direct estimation of causal strength
without referring to the conditional distribution.
Definition 3 (structural equation). A structural equation is a model that explains the
joint distribution $P(X_1, \ldots, X_n)$ by a deterministic dependence
\[ X_j = f_j(PA_j, E_j) , \]
where the variables $E_j$ are jointly independent unobserved noise variables. Note that functions
$f_j$ that correspond to parentless variables can be chosen to be the identity, i.e., $X_j = E_j$.
Suppose that we are given a causal inference method that directly infers the structural
equations (e.g., [16, 17]) in the sense that it outputs n-tuples $(e^i_1, \ldots, e^i_n)$ with $i = 1, \ldots, m$
(with m denoting the sample size) as well as the functions $f_j$ from the observed n-tuples
$(x^i_1, \ldots, x^i_n)$.
Definition 4 (removing a causal arrow in a structural equation). Deletion of the arrow
$X_k \to X_l$ is modeled by (i) introducing an i.i.d. copy $X'_k$ of $X_k$ and (ii) subsuming the new
random variable $X'_k$ into the noise term of $f_l$. The result is a new set of structural equations:
\begin{align}
x_j &= f_j(pa_j, e_j) \quad \text{if } j \neq l, \text{ and} \notag \\
x_l &= f_l\left( pa_l \setminus \{x_k\}, (x'_k, e_l) \right) , \tag{12}
\end{align}
where we have omitted the superscript i to simplify notation.
Remark 3. To measure the causal influence of a set of arrows, we apply the same pro-
cedure after first introducing jointly independent i.i.d. copies of all variables at the tails of
deleted arrows.
Remark 4. The change introduced by the deletion only affects $X_l$ and its descendants;
the virtual sample thus keeps all $x_j$ with $j < l$. Moreover, we can ignore all variables $X_j$ with
$j > l$ due to Lemma 3.
Note that $x'_k$ must be chosen to be independent of all $x_j$ with $j \leq k$, but, by virtue of the
structural equations, not independent of $x_l$ and its descendants. The new structural equations
thus generate n-tuples of “virtual” observations $x^S_1, \ldots, x^S_n$ from the input
\[ \left( e_1, \ldots, (x'_k, e_l), \ldots, e_n \right) . \]
We show below that n-tuples generated this way indeed follow the distribution PS (X1 , . . . , Xn ).
We can therefore estimate causal influence via any method that estimates relative entropy
using the observed samples $x_1, \ldots, x_n$ and the virtual ones $\tilde{x}_1, \ldots, \tilde{x}_n$. To illustrate the above
scheme, we consider the case where Z and X are causes of Y and we want to delete the edge
X → Y . The case where Y has more than 2 parents follows easily.
Example 2 (Two parents). The following table corresponds to the observed variables
X, Z, Y , as well as the unobserved noise $E^Y$ which we assumed to be estimated together with
learning the structural equations.
\[ \begin{array}{cccc}
Z & X & E^Y & Y \\ \hline
z_1 & x_1 & e^Y_1 & f_Y(z_1, x_1, e^Y_1) \\
z_2 & x_2 & e^Y_2 & f_Y(z_2, x_2, e^Y_2) \\
\vdots & \vdots & \vdots & \vdots \\
z_m & x_m & e^Y_m & f_Y(z_m, x_m, e^Y_m)
\end{array} \tag{13} \]
To simulate the deletion of X → Y we first generate a list of virtual observations for Y
after generating samples from an i.i.d. copy $X'$ of X:
\[ \begin{array}{ccccc}
Z & X & X' & E^Y & Y \\ \hline
z_1 & x_1 & x'_1 & e^Y_1 & f_Y(z_1, x'_1, e^Y_1) \\
\vdots & \vdots & \vdots & \vdots & \vdots \\
z_m & x_m & x'_m & e^Y_m & f_Y(z_m, x'_m, e^Y_m)
\end{array} \tag{14} \]
A simple method to simulate the i.i.d. copy is to apply some random permutation $\pi \in S_m$ to
$x_1, \ldots, x_m$ and obtain $x_{\pi(1)}, \ldots, x_{\pi(m)}$ (see [18], S.1). Deleting several arrows with source node
X requires several identical copies $X', X'', \ldots$ of X, each generated by a different permutation.
We then throw away the two noise columns, i.e., the original noise $E^Y$ and the additional
noise $X'$:
\[ \begin{array}{ccc}
Z & X & Y \\ \hline
z_1 & x_1 & f_Y(z_1, x'_1, e^Y_1) \\
\vdots & \vdots & \vdots \\
z_m & x_m & f_Y(z_m, x'_m, e^Y_m)
\end{array} \tag{15} \]
To see that this triple is indeed sampled from the desired distribution $P_S(X, Y, Z)$, we
recall that the original structural equation simulates the conditional $P(Y|X, Z)$. After
inserting $X'$ we obtain the new conditional $\sum_{x'} P(Y|x', Z)\, P(x')$. Multiplying it with $P(X, Z)$
yields $P_S(X, Y, Z)$, by definition. Using the above samples from $P_S(X, Y, Z)$ and samples from
$P(X, Y, Z)$ we can estimate
\[ C_{X \to Y} = D\left( P(X, Y, Z) \,\|\, P_S(X, Y, Z) \right) \]
using some known schemes [19] for estimating relative entropies from empirical data. It is
important that the samples from the two distributions are disjoint, meaning that we need to
split the original sample into two halves, one for P and one for PS .
The generation of PS for a set S of arrows works similarly: every input of a structural
equation that corresponds to an arrow to be removed is fed with an independent copy of
the respective variable. Although it is conceptually simple to estimate causal strength by
generating the entire joint distribution PS , Theorem 5(a) will show how to break the problem
into parts that make estimation of relative entropies from finite data more feasible.
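The construction of tables (13) to (15) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the structural equation `f_y`, the data-generating coefficients, and the helper name `virtual_samples` are hypothetical, and the noise values are taken as known rather than inferred.

```python
import numpy as np

def virtual_samples(z, x, e_y, f_y, rng):
    """Simulate deleting X -> Y for the structural equation Y = f_y(Z, X, E_Y).

    Returns (z, x, y_virtual), where y_virtual is computed from a randomly
    permuted copy of x that plays the role of the i.i.d. copy X'.
    """
    x_prime = x[rng.permutation(len(x))]   # permuted sample stands in for X'
    y_virtual = f_y(z, x_prime, e_y)       # table (14): feed X' instead of X
    return z, x, y_virtual                 # table (15): drop X' and E_Y

# Hypothetical structural equation and data, chosen only for illustration.
def f_y(z, x, e):
    return np.tanh(z) + x ** 2 + e

rng = np.random.default_rng(0)
m = 1000
z = rng.normal(size=m)
x = 0.8 * z + rng.normal(size=m)
e_y = 0.1 * rng.normal(size=m)
y = f_y(z, x, e_y)

# Disjoint halves: the first for P, the second for P_S.
half = m // 2
sample_p = np.column_stack([z[:half], x[:half], y[:half]])
z_s, x_s, y_s = virtual_samples(z[half:], x[half:], e_y[half:], f_y, rng)
sample_ps = np.column_stack([z_s, x_s, y_s])
```

The two arrays `sample_p` and `sample_ps` can then be passed to any estimator of relative entropy.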
We now revisit mediation analysis [1, 14, 15], which is also based on structural equations,
and mention an interesting relation to our work. Although we have pointed out that intervening
by “cutting edges” is complementary to the intervention on nodes considered there, distributions
like $P_S$ can also occur in an implicit way. To explore the indirect effect X → Z → Y
in Fig. 2 b), one can study the effect of X on Y in the reduced DAG X → Z → Y under the
distribution $P_{X \to Y}$ or under the distribution obtained by setting the copy $X'$ to some fixed
value $x'$. Remarkably, cutting X → Y is then used to study the strength of the other path
while we use it to study the strength [footnote 2] of X → Y .
4.3. Properties of causal strength. This subsection shows that our definition of causal
strength satisfies postulates P0–P4. We observe at the same time some other useful properties.
We start with a property that is used to show P0.
Causal strength majorizes observed dependence. Recalling that $P(X_1, \ldots, X_n)$ factorizes into
$\prod_j P(X_j | PA_j)$ with respect to the true causal DAG G, one may ask how much error one
would cause if one was not aware of all causal influences and erroneously assumed that the
true DAG would be the one where some set S of arrows is missing. The conditionals with
respect to the reduced set of parents define a different joint distribution:
Definition 5 (distribution after ignoring arrows).
Given distribution P Markovian with respect to G and set of arrows S, let the partially
observed distribution (where interactions across S are hidden) for node $X_j$ be
\[ \tilde{P}_S(x_j | pa_j^{\bar S}) = \sum_{pa_j^S} P(x_j | pa_j^{\bar S}, pa_j^S)\, P(pa_j^S | pa_j^{\bar S}) . \]
Let the partially observed distribution for all the nodes be the product
\[ \tilde{P}_S(x_1, \ldots, x_n) = \prod_j \tilde{P}_S(x_j | pa_j^{\bar S}) . \tag{16} \]
Remark 5. Intuitively, the observed influence of a set of arrows should be quantified
by comparing the data available to an observer who can see the entire DAG with the data
available to an observer who sees all the nodes of the graph, but only some of the arrows.
Definition 5 formalizes “seeing only some of the arrows”.
Building on Remark 2, the definition of the observed dependence of a set of arrows takes the
same general form as for causal influence. However, instead of inserting noise on the arrows,
we simply prevent ourselves from seeing them:
Definition 6 (observed influence).
Given distribution P Markovian with respect to G and set of arrows S, let the observed
influence of the arrows in S be
\[ O_S(P) := D(P \,\|\, \tilde{P}_S) , \]
with $\tilde{P}_S$ defined in (16).
The following result, proved in subsection A.1, is crucial to proving P0:
[Footnote 2] We are grateful to an anonymous referee for this observation.
Theorem 2 (causal influence majorizes observed dependence).
Causal influence decomposes into observed influence plus a non-negative term quantifying
the divergence between the partially observed and interventional distributions
\[ C_S(P) = O_S(P) + \sum_{j=1}^{n} P(pa_j^{\bar S}) \cdot D\left[ \tilde{P}_S(X_j | pa_j^{\bar S}) \,\|\, P_S(X_j | pa_j^{\bar S}) \right] . \tag{17} \]
The theorem shows that “snapping upstream dependencies” by using purely local data – i.e.
by marginalizing using the distribution of the source node P (Xi ) rather than the conditional
P (Xi |P Ai ) – is essential to quantifying causal influence.
Proof of postulates for causal strength.
P0: Let $G_S$ be the DAG obtained by removing the arrows in S from G. Let $PA_j^{\bar S}$ be the
parents of $X_j$ in $G_S$, i.e., those that are not in S, and introduce the set of nodes $Z_j$ such that
$PA_j = PA_j^{\bar S} \cup Z_j$. By Theorem 2, $C_S = 0$ implies $O_S = 0$, i.e., $\tilde{P}_S = P$, which implies
\[ P(X_j | pa_j) = P(X_j | pa_j^{\bar S}) \quad \forall\, pa_j^{\bar S} \text{ with } P(pa_j^{\bar S}) \neq 0 , \quad \text{i.e.} \quad X_j \perp\!\!\!\perp Z_j \,|\, PA_j^{\bar S} . \tag{18} \]
We use again the Ordered Markov condition
\[ X_j \perp\!\!\!\perp PR_j \,|\, PA_j \quad \forall j , \tag{19} \]
where $PR_j$ denote the predecessors of $X_j$ with respect to some ordering of nodes that is
consistent with G. By the contraction rule [1], (18) and (19) yield
\[ X_j \perp\!\!\!\perp PR_j \cup Z_j \,|\, PA_j^{\bar S} , \]
and hence
\[ X_j \perp\!\!\!\perp PR_j \,|\, PA_j^{\bar S} , \]
which is the Ordered Markov condition for $G_S$ if we use the same ordering of nodes for $G_S$.
P1: One easily checks $C_{X \to Y} = I(X; Y)$ for the 2-node DAG X → Y , because $P_{X \to Y}(x, y) =
P(x)\, P(y)$, and thus
\[ D(P \,\|\, P_{X \to Y}) = D\left( P(X, Y) \,\|\, P(X)\, P(Y) \right) = I(X; Y) . \]
P2: Follows from
Lemma 3 (causal strength as local relative entropy). Causal strength $C_{k \to l}$ can be written
as the following relative entropy distance or conditional relative entropy distance:
\begin{align*}
C_{k \to l} &= D\left[ P(X_l, PA_l) \,\|\, P_S(X_l, PA_l) \right] \\
&= \sum_{pa_l} D\left[ P(X_l | pa_l) \,\|\, P_S(X_l | pa_l) \right] P(pa_l) = D\left[ P(X_l | PA_l) \,\|\, P_S(X_l | PA_l) \right] .
\end{align*}
Note that PS (Xl |pal ) actually depends on the reduced set of parents P Al \ Xk only, but it
is more convenient for the notation and the proof to keep the formal dependence on all P Al .
Proof.
\begin{align*}
D(P \,\|\, P_S) &= \sum_{x_1 \ldots x_n} P(x_1 \ldots x_n) \log \frac{P(x_1 \ldots x_n)}{P_S(x_1 \ldots x_n)} = \sum_{x_1 \ldots x_n} P(x_1 \ldots x_n) \log \prod_{j=1}^{n} \frac{P(x_j | pa_j)}{P_S(x_j | pa_j)} \\
&= \sum_{j=1}^{n} \sum_{x_j, pa_j} P(x_j, pa_j) \log \frac{P(x_j | pa_j)}{P_S(x_j | pa_j)} = \sum_{j=1}^{n} D\left[ P(X_j | PA_j) \,\|\, P_S(X_j | PA_j) \right] .
\end{align*}
For all $j \neq l$ we have $D\left[ P(X_j | PA_j) \,\|\, P_S(X_j | PA_j) \right] = 0$, because $P(X_l | PA_l)$ is the only
conditional that is modified by the deletion.
P3: Apart from demonstrating the postulated inequality, the following result shows that we
have the equality $C_{X \to Y} = I(X; Y | PA_Y^X)$ for independent causes. To keep notation simple,
we have restricted our attention to the case where Y has only two causes X and Z, but Z
can also be interpreted as representing all parents of Y other than X.
Theorem 4 (decomposition of causal strength). For the DAGs in Figure 2 we have
\[ C_{X \to Y} = I(X; Y | Z) + D\left[ P(Y|Z) \,\|\, P_{X \to Y}(Y|Z) \right] . \tag{20} \]
If X and Z are independent, the second term vanishes.
Proof. Eq. (20) follows from Theorem 2: First, we observe $O_S(P) = I(X; Y | Z)$ because
both measure the relative entropy distance between $P(X, Y, Z)$ and $\tilde{P}_S(X, Y, Z) =
P(X, Z)\, P(Y|Z)$. Second, we have
\[ P_S(X, Y, Z) = P(X, Z)\, P_{X \to Y}(Y|Z) . \]
The second summand in (17) reduces to
\[ \sum_z P(z)\, D\left[ P(Y|z) \,\|\, P_S(Y|z) \right] = \sum_z P_S(z)\, D\left[ P(Y|z) \,\|\, P_S(Y|z) \right] = D\left[ P(Y|Z) \,\|\, P_S(Y|Z) \right] . \]
To see that the second term vanishes for independent X, Z, we observe $P_{X \to Y}(Y|Z) = P(Y|Z)$
because
\[ P_{X \to Y}(y|z) = \sum_x P(y|x, z)\, P(x) = \sum_x P(y|x, z)\, P(x|z) = P(y|z) . \]
Theorem 4 states that conditional mutual information underestimates causal strength.
Assume, for instance, that X and Z are almost always equal because Z has such a strong
influence on X that X is an almost perfect copy of Z. Then $I(X; Y | Z) \approx 0$ because knowing Z
leaves almost no uncertainty about X. In other words, strong dependencies between the causes
X and Z make the influence of cause X almost invisible when looking at the conditional
mutual information $I(X; Y | Z)$ only. The second term in (20) corrects for the underestimation.
When X depends deterministically on Z, it is even the only remaining term (here, we have
again assumed that the conditional distributions are defined for events that do not occur in
observational data).
To provide a further interpretation of Theorem 4, we recall that I(X; Y |Z) can be seen as
the impact of ignoring the edge X → Y , see Remark 2. Then the impact of cutting X → Y is
given by the impact of ignoring this link plus the impact that cutting has on the conditional
P (Y |Z).
P4: This postulate is part d) of the following collection of results that relate the strength of
sets to that of their subsets:
Theorem 5 (relation between strength of sets and subsets).
The causal influence given in Definition 2 has the following properties
a) Additivity regarding targets.
Given set of arrows S, let $S_i = \{s \in S \mid \mathrm{trg}(s) = X_i\}$, then
\[ C_S(P) = \sum_i C_{S_i}(P) . \]
b) Locality.
Every CSi only depends on the conditional P (Xi |P Ai ) and the joint distribution of all parents
P (P Ai ).
c) Monotonicity.
Given sets of arrows S1 ⊂ S2 targeting single node Z, such that the source nodes in S1 are
jointly independent and independent of the other parents of Z. Then we have
CS1 (P ) ≤ CS2 (P ).
d) Heredity property.
Given sets of arrows S ⊂ T , we have
CT (P ) = 0 =⇒ CS (P ) = 0.
The proof is presented in appendix A.3. The intuitive meaning of these properties is as
follows. Part (a) says that causal influence is additive if the arrows have different targets.
Otherwise, we can still decompose the set S into equivalence classes of arrows having the
same target and obtain additivity regarding the decomposition. This can be helpful for practical
applications because estimating each $D\left[ P(PA_i, X_i) \,\|\, P_{S_i}(PA_i, X_i) \right]$ from empirical data
requires less data than estimating the distance $D(P \,\|\, P_S)$ for the entire high-dimensional
distributions.
We will show in Subsection 4.4 that general additivity fails. Part (b) is an analog of P2
for multiple arrows. According to (c), the strength of a subset of arrows cannot be larger
than the strength of its superset, provided that there are no dependencies among the parent
nodes. Finally, part (d) is exactly our postulate P4.
Parts (c) and (d) suggest that monotonicity may generalize to the case of dependent parents:
S ⊂ T =⇒ CS (P ) ≤ CT (P ). However, the following counterexample due to Bastian Steudel
shows this is not the case.
Example 3 (XOR – counterexample to monotonicity when parents are dependent).
Consider the DAG a) in figure 2 and let the relation between X, Y, Z be given by the structural
equations
(21) X = Z,
(22) Y = X ⊕Z.
Fig 6 (nodes: E, $B_1, B_2, \ldots, B_{2k+1}$, D). Causal structure of an error-correcting scheme: the encoder
generates 2k + 1 bits from a single one. The decoder decodes the 2k + 1 bit words into a single bit again.
Let $P(Z = 0) = a$ and $P(Z = 1) = 1 - a$. Letting $S = \{Z \to Y\}$ and $T = \{Z \to Y, X \to Y\}$
we find that
\begin{align*}
C_S(P) &= -a \log(a) - (1 - a) \log(1 - a) \quad \text{and} \\
C_T(P) &= -\log\left( a^2 + (1 - a)^2 \right) .
\end{align*}
For $a \notin \{\tfrac{1}{2}, 0, 1\}$, strict concavity of the logarithm implies $C_T(P) < C_S(P)$.
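The inequality can be inspected numerically by evaluating the two closed-form expressions of Example 3. The short Python sketch below does only that (logarithms to base 2); the function names `c_s` and `c_t` are ours and purely illustrative.

```python
import numpy as np

def c_s(a):
    """C_S(P) = -a log a - (1 - a) log(1 - a), in bits, for 0 < a < 1."""
    return float(-a * np.log2(a) - (1 - a) * np.log2(1 - a))

def c_t(a):
    """C_T(P) = -log(a^2 + (1 - a)^2), in bits."""
    return float(-np.log2(a ** 2 + (1 - a) ** 2))

for a in (0.1, 0.3, 0.5, 0.9):
    print(f"a = {a}: C_S = {c_s(a):.4f}, C_T = {c_t(a):.4f}")
```

Both values coincide at a = 1/2, while for the other values of a in the loop the strength of the larger set T is strictly smaller.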
4.4. Examples and paradoxes. Failure of subadditivity: The strength of a set of arrows is
not bounded from above by the sum of strength of the single arrows. It can even happen that
removing one arrow from a set has no impact on the joint distribution while removing all of
them has significant impact, which occurs in communication scenarios that use redundancy:
Example 4 (error correcting code). Let E and D be binary variables that we call “encoder”
and “decoder” (see figure 6) communicating over a channel that consists of the bits
$B_1, \ldots, B_{2k+1}$. Using the simple repetition code, all $B_j$ are just copies of E. Then D is set to
the logical value that is attained by the majority of $B_j$. This way, k errors can be corrected, i.e.,
removing k or fewer of the links $B_j \to D$ has no effect on the joint distribution, i.e., $P_S = P$
for $S := (B_1 \to D, B_2 \to D, \ldots, B_k \to D)$, hence $C_S(P) = 0$. In words: removing k or fewer
arrows has no impact, but removing all of them, of course, does. After all, the arrows jointly
generate the dependence $I(E; D) = I((E, B_1, \ldots, B_k); D) = 1$, provided that E is uniformly
distributed.
Clearly, the outputs of E causally influence the behavior of D. We therefore need to consider
interventions that destroy many arrows at once if we want to capture the fact that their joint
influence is non-zero.
Thus, causal influence of arrows is not subadditive: the strength of each arrow Bj → D is
zero, but the strength of the set of all Bj → D is 1 bit.
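The repetition code is small enough to check by brute force. The following Python sketch is an illustration under our own naming (`majority`, `strength_of_cut` are hypothetical helpers): it assumes a uniform E, replaces every cut input of D by an independent uniform bit as in Definition 4, and uses the local form of causal strength from Lemma 3.

```python
import itertools
import numpy as np

def majority(bits):
    """Majority vote over an odd number of bits."""
    return int(sum(bits) > len(bits) / 2)

def strength_of_cut(n_bits, cut):
    """C_S in bits for S = {B_j -> D : j in cut} in the repetition code.

    E is a uniform bit, every B_j copies E, and D is the majority of the B_j.
    """
    total = 0.0
    for e in (0, 1):                        # P(e) = 1/2 and all B_j equal e
        observed = [e] * n_bits
        d_obs = majority(observed)          # D is deterministic under P
        # P_S(D = d_obs | b): cut inputs are fed with independent uniform bits.
        p_s = 0.0
        for repl in itertools.product((0, 1), repeat=len(cut)):
            bits = list(observed)
            for j, b in zip(cut, repl):
                bits[j] = b
            if majority(bits) == d_obs:
                p_s += 0.5 ** len(cut)
        total += 0.5 * np.log2(1.0 / p_s)
    return float(total)

print(strength_of_cut(3, [0]))          # one of three arrows (k = 1)
print(strength_of_cut(3, [0, 1, 2]))    # all three arrows
```

Cutting an intermediate number of arrows, for instance two out of three, gives a value strictly between the two extremes.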
Failure of superadditivity: The following example reveals an opposing phenomenon, where the
causal strength of a set is smaller than the sum of the single arrows:
Fig 7 (nodes: X, $Y_1, Y_2, \ldots, Y_n$). Broadcasting one bit from one node to multiple nodes.
Example 5 (XOR with uniform input).
Consider the structural equations (21) and (22) with uniformly distributed Z. The causal
influence of each arrow targeting the XOR-gate individually is the same as the causal influence
of both arrows taken together:
\[ C_{X \to Y}(P) = C_{Z \to Y}(P) = C_{\{X \to Y, Z \to Y\}}(P) = 1 \text{ bit} . \]
Strong influence without dependence / failure of converse of P0: Revisiting Example 5 is also
instructive because it demonstrates an extreme case of confounding where I(X; Y |Z) vanishes
but causal influence is strong. Removing X → Y yields
\[ P_{X \to Y}(x, y, z) = P(x, z)\, P(y) , \]
where $P(z) = P(y) = 1/2$ and $P(x|z) = \delta_{x,z}$. It is easy to see that
\[ D(P \,\|\, P_{X \to Y}) = 1 , \]
because P is a uniform distribution over 2 possible triples (x, y, z), whereas PX→Y is a uniform
distribution over a superset of 4 triples.
The impact of cutting the edge X → Y is remarkable: both distributions, the observed one
P as well as the post-cutting distribution PS , factorize PS (X, Y, Z) = PS (X, Z)PS (Y ) and
P (X, Y, Z) = P (X, Z)P (Y ). Cutting the edge keeps this product structure and changes the
joint distributions by only changing the marginal distribution of Y from P (Y ) to PS (Y ).
Note that P satisfies the Markov condition with respect to $G_{X \to Y}$ (i.e. the DAG obtained
from the original one by dropping X → Y ) because Y is a constant. Since $C_{X \to Y} \neq 0$, this
shows that the converse of P0 does not hold.
Strong effect of little information: The following example considers multiple arrows and shows
that their joint strength may even be strong when they carry the same small amount of
information:
Example 6 (broadcasting).
Consider a single source X with many targets $Y_1, \ldots, Y_n$ such that each $Y_i$ copies X, see
Figure 7. Assume $P(X = 0) = P(X = 1) = \tfrac{1}{2}$. If S is the set of all arrows $X \to Y_j$ then
$C_S = n$. Thus, the single node X exerts n bits of causal influence on its dependents.
5. Causal influence between two time series.
5.1. Definition. Since causal analysis of time series is of high practical importance, we
devote a section to this case. For some fixed t, we introduce the short notation $X \to Y_t$ for
the set of all arrows that point to $Y_t$ from some $X_s$ with $s < t$. Then
\[ C_{X \to Y_t} \]
measures the impact of deleting all these arrows. We propose to replace transfer entropy with
this measure since it does not suffer from the drawbacks described in Subsection 3.3.
Subsection 4.2 describes how to estimate causal strength from finite data for one arrow and
briefly mentions how this generalizes to sets of arrows. To keep this section self-contained, we
briefly rephrase the description for the case of time series.
Suppose we have learned the structural equation model
\[ Y_t = f_t(X_{t-1}, X_{t-2}, \ldots, X_{t-p}, E_t) , \tag{23} \]
from observed data $(x_t, y_t)_{t \leq 0}$, where the noise variables $E_t$ are jointly independent and
independent of $X_t, X_{t-1}, \ldots, Y_{t-1}, Y_{t-2}, \ldots$. Assume, moreover, that we have inferred the
corresponding values $(e_t)_{t \leq 0}$ of the noise. If we have multiple copies of the time series, we can apply
the method described in Subsection 4.2 in a straightforward way: Due to the locality property
stated in Theorem 5(b), we only consider the variables $X_{t-p}, \ldots, X_{t-1}, Y_t$ and feed (23) with
i.i.d. copies of $X_{t-p}, \ldots, X_{t-1}$ by applying random permutations to the observations, which
then yields samples from the modified distribution $P_S(X_{t-p}, \ldots, X_{t-1}, Y_t)$.
If we have only one observation for each time instance, we have to assume stationarity
(with constant function ft = f ) and ergodicity and generate an artificial statistical sample by
looking at sufficiently distant windows.
5.2. Comparison of causal influence with transfer entropy. We first recall the example
given by [12] showing a problem with transfer entropy (Subsection 3.3). Assume that the
variables Xt , Yt in figure 3, right, are binary and the transition from Xt−1 to Yt is a perfect
copy and likewise the transition from Yt−1 to Xt . Assume, moreover, that the system has been
initialized such that, with probability 1/2, all variables are 1 and with probability 1/2 all are
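zero. Then the set $X \to Y_t$ is the singleton $S := \{X_{t-1} \to Y_t\}$. Using Lemma 3, we have
\[ C_{X_{t-1} \to Y_t} = D\left[ P(Y_t, X_{t-1}) \,\|\, P_S(Y_t, X_{t-1}) \right] . \]
Since $Y_t$ is a perfect copy of $X_{t-1}$, deleting the arrow turns
\[ P(y_t, x_{t-1}) = \begin{cases} 1/2 & \text{for } x_{t-1} = y_t \\ 0 & \text{otherwise} \end{cases} \]
into
\[ P_S(y_t, x_{t-1}) = 1/4 \quad \text{for } (y_t, x_{t-1}) \in \{0, 1\}^2 . \]
One easily checks $D(P \,\|\, P_S) = 1$.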
Note that the example is somewhat unfair, since it is impossible to distinguish the structural
equations from a model without interaction between X and Y , where Xt+1 is obtained from
Xt by inversion and similarly for Y , no matter how many observations are performed. Thus,
from observing the system it is impossible to tell whether or not X exerts an influence on
Y . However, the following modification shows that transfer entropy still goes quantitatively
wrong if small errors are introduced:
Example 7 (perturbed transfer entropy counterexample).
Perturb Ay and Polani’s example by having $Y_t$ copy $X_{t-1}$ correctly with probability $p = 1 - \epsilon$.
Set node $Y_t$’s transitions as Markov matrix
\[ \begin{array}{c|cc}
 & x_{t-1} = 0 & x_{t-1} = 1 \\ \hline
y_t = 0 & 1 - \epsilon & \epsilon \\
y_t = 1 & \epsilon & 1 - \epsilon
\end{array} , \]
and similarly for the transition from $Y_{t-1}$ to $X_t$.
The transfer entropy from X to Y at time t is
\begin{align*}
TE &= I(X_{(-\infty, t-1]}; Y_t | Y_{(-\infty, t-1]}) = I(X_{t-1}; Y_t | Y_{t-2}) \\
&= H(Y_t | Y_{t-2}) - H(Y_t | Y_{t-2}, X_{t-1}) = H(Y_t | Y_{t-2}) - H(Y_t | X_{t-1}) ,
\end{align*}
where H(.|.) denotes the conditional Shannon entropy. The equalities can be derived from
d-separation in the causal DAG Fig. 3, right [1]. For instance, conditioning on Yt−2 , renders
the pair (Yt , Xt−1 ) independent of all the remaining past of X and Y . We find
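\begin{align*}
-H(Y_t | X_{t-1}) &= \epsilon \log \epsilon + (1 - \epsilon) \log(1 - \epsilon) \\
H(Y_t | Y_{t-2}) &= 2\epsilon(1 - \epsilon) \log \frac{1}{2\epsilon(1 - \epsilon)} + (1 - 2\epsilon + 2\epsilon^2) \log \frac{1}{1 - 2\epsilon + 2\epsilon^2} .
\end{align*}
Hence,
\[ TE = (1 - 2\epsilon + 2\epsilon^2) \log \frac{1}{1 - 2\epsilon + 2\epsilon^2} + 2\epsilon(1 - \epsilon) \log \frac{1}{2\epsilon(1 - \epsilon)} + \epsilon \log \epsilon + (1 - \epsilon) \log(1 - \epsilon) \]
which tends to zero as $\epsilon \to 0$.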
Causal influence, on the other hand, is given by the mutual information $I(Y_t; X_{t-1})$ because
all edges other than $X_{t-1} \to Y_t$ are irrelevant (see Postulate 2). Thus,
\[ C_{X \to Y_t} = H(Y_t) - H(Y_t | X_{t-1}) = 1 + (1 - \epsilon) \log(1 - \epsilon) + \epsilon \log \epsilon , \]
which tends to 1 for $\epsilon \to 0$. Hence, causal influence detects the causal interactions between X
and Y based on empirical data, whereas transfer entropy does not. Thanks to the perturbation,
the joint distribution tells us the kind of causal relations by which it is generated. For large
enough samples, the strong discrepancy between transfer entropy and our causal strength thus
becomes apparent.
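The two limits in Example 7 can be made concrete by evaluating both closed-form expressions for decreasing noise. The Python sketch below is illustrative only (function names are ours, logarithms are to base 2), and it uses that $H(Y_t | Y_{t-2})$ is the binary entropy of the probability $2\epsilon(1 - \epsilon)$ that two successive noisy copies produce a flip.

```python
import numpy as np

def binary_entropy(p):
    """Shannon entropy in bits of a binary variable with P(1) = p, 0 < p < 1."""
    return float(-p * np.log2(p) - (1 - p) * np.log2(1 - p))

def transfer_entropy(eps):
    """TE = H(Y_t | Y_{t-2}) - H(Y_t | X_{t-1}) for the perturbed copy model."""
    flip_twice = 2 * eps * (1 - eps)        # P(Y_t != Y_{t-2})
    return binary_entropy(flip_twice) - binary_entropy(eps)

def causal_strength(eps):
    """C_{X -> Y_t} = H(Y_t) - H(Y_t | X_{t-1}) with uniform Y_t."""
    return 1.0 - binary_entropy(eps)

for eps in (0.1, 0.01, 0.001):
    print(f"eps = {eps}: TE = {transfer_entropy(eps):.4f}, "
          f"C = {causal_strength(eps):.4f}")
```

As the noise level decreases, the first column shrinks towards zero while the second approaches 1 bit.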
6. Causal strength for linear structural equations. For linear structural equations,
we can provide a more explicit expression of causal strength under the assumption of multi-
variate Gaussianity. Let n random variables X1 , . . . , Xn be ordered such that there are only
arrows from Xi to Xj for i < j. Then we have structural equations
\[ X_j = \sum_{i<j} A_{ji} X_i + E_j , \]
where all $E_j$ are jointly independent noise variables. In vector and matrix notation we have
\[ X = AX + E \quad \text{i.e.,} \quad X = (I - A)^{-1} E , \tag{24} \]
where A is lower triangular with zeros in the diagonal.
To compute the strength of S ⊂ {1, . . . , n}, we assume for reasons of convenience that all
variables have zero mean. Then $D(P \,\|\, P_S)$ can be computed from the covariance matrices
alone.
The covariance matrix of X reads
\[ \Sigma = (I - A)^{-1} \Sigma_E (I - A)^{-T} , \]
where $\Sigma_E$ denotes the covariance matrix of the noise (which is diagonal by assumption) and
$(.)^{-T}$ the transpose of the inverse of a matrix.
To compute the covariance matrix $\Sigma_S$ of $P_S$, we first split A into $A_S + A_{\bar S}$, where $A_S$
contains only those entries that correspond to the edges in the set S and $A_{\bar S}$ only those
corresponding to the complement of S. Using this notation, the modified structural equations
read
\[ X = A_{\bar S} X + E + A_S X' , \tag{25} \]
where $X' = (X'_1, \ldots, X'_n)^T$ and each $X'_j$ has the same distribution as $X_j$ and satisfies joint
independence of all $X'_1, \ldots, X'_n, E_1, \ldots, E_n$. It is convenient to define the modified noise
\[ E' := E + A_S X' , \]
with covariance matrix
\[ \Sigma_{E'} = \Sigma_E + A_S \Sigma_X^D A_S^T , \tag{26} \]
where $\Sigma_X^D$ contains only the diagonal entries of $\Sigma_X$ (recall that all $X'_j$ are independent). The
modified variables $X^S$ are now given by the equation
\[ X^S = A_{\bar S} X^S + E' , \]
which formally looks like (24), although the components of $E'$ are dependent while the $E_j$ in
(24) are independent. Thus, we obtain the modified covariance matrix of X by
\[ \Sigma_S = (I - A_{\bar S})^{-1} \Sigma_{E'} (I - A_{\bar S})^{-T} . \]
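The causal strength now reads
\begin{align*}
C_S = D(P \,\|\, P_S) &= \frac{1}{2} \left[ \mathrm{tr}\left( \Sigma_S^{-1} \Sigma \right) - \log \frac{\det \Sigma}{\det \Sigma_S} - n \right] \\
&= \frac{1}{2} \left[ \mathrm{tr}\left( (I - A_{\bar S})^{T} \Sigma_{E'}^{-1} (I - A_{\bar S}) (I - A)^{-1} \Sigma_E (I - A)^{-T} \right) \right. \\
&\qquad \left. - \log \frac{\det\left[ (I - A)^{-1} \Sigma_E (I - A)^{-T} \right]}{\det\left[ (I - A_{\bar S})^{-1} \Sigma_{E'} (I - A_{\bar S})^{-T} \right]} - n \right] ,
\end{align*}
with $\Sigma_{E'}$ given by (26).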
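The covariance-based expression translates directly into a few lines of linear algebra. The following Python sketch is a hedged illustration, not the software used for the experiments: the function name and argument layout are our own, the result is in nats, and the matrix A follows the convention of (24), so that `A[j, i]` is the coefficient of $X_i$ in the equation for $X_j$.

```python
import numpy as np

def gaussian_causal_strength(A, sigma_e, S):
    """C_S = D(P || P_S) in nats for X = A X + E with zero-mean Gaussian noise.

    A       : (n, n) strictly lower-triangular structure matrix.
    sigma_e : (n,) noise variances, i.e. the diagonal of Sigma_E.
    S       : iterable of edges (i, j), meaning X_i -> X_j, to be cut.
    """
    n = A.shape[0]
    eye = np.eye(n)
    A_S = np.zeros_like(A, dtype=float)
    for i, j in S:
        A_S[j, i] = A[j, i]
    A_rest = A - A_S                                   # A restricted to the complement of S
    Sigma_E = np.diag(sigma_e)
    M = np.linalg.inv(eye - A)
    Sigma = M @ Sigma_E @ M.T                          # covariance of X under P
    Sigma_XD = np.diag(np.diag(Sigma))                 # diagonal entries of Sigma only
    Sigma_E1 = Sigma_E + A_S @ Sigma_XD @ A_S.T        # modified noise covariance, eq. (26)
    M_S = np.linalg.inv(eye - A_rest)
    Sigma_S = M_S @ Sigma_E1 @ M_S.T                   # covariance under P_S
    _, logdet = np.linalg.slogdet(Sigma)
    _, logdet_S = np.linalg.slogdet(Sigma_S)
    trace = np.trace(np.linalg.solve(Sigma_S, Sigma))  # tr(Sigma_S^{-1} Sigma)
    return float(0.5 * (trace - (logdet - logdet_S) - n))

# Two-node DAG X_1 -> X_2 with coefficient 2 and unit noise variances.
A = np.array([[0.0, 0.0],
              [2.0, 0.0]])
print(gaussian_causal_strength(A, np.array([1.0, 1.0]), [(0, 1)]))
```

For a single arrow between two nodes this reduces to half the logarithm of the ratio between the variance of the target and its noise variance.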
Example 8 (linear structural equations with independent parents).
It is instructive to look at the following simple case:
\[ X_n := \sum_j \alpha_{nj} X_j + E_n \quad \text{with } E_n, X_1, \ldots, X_{n-1} \text{ jointly independent.} \]
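For the set $S := \{X_1 \to X_n, \ldots, X_k \to X_n\}$ with $k \leq n$ some calculations show
\[ C_S = \frac{1}{2} \log \frac{\mathrm{Var}(X_n) - \sum_{j=k+1}^{n-1} \alpha_{nj}^2 \mathrm{Var}(X_j)}{\mathrm{Var}(X_n) - \sum_{j=1}^{n-1} \alpha_{nj}^2 \mathrm{Var}(X_j)} . \]
For the single arrow $X_1 \to X_n$ we thus obtain
\[ C_{X_1 \to X_n} = \frac{1}{2} \log \frac{\mathrm{Var}(X_n) - \sum_{j=2}^{n-1} \alpha_{nj}^2 \mathrm{Var}(X_j)}{\mathrm{Var}(X_n) - \sum_{j=1}^{n-1} \alpha_{nj}^2 \mathrm{Var}(X_j)} . \]
If $X_1$ is the only parent, i.e., n = 2, we have
\[ C_{X_1 \to X_2} = \frac{1}{2} \log \frac{\mathrm{Var}(X_2)}{\mathrm{Var}(X_2) - \alpha_{21}^2 \mathrm{Var}(X_1)} = -\frac{1}{2} \log(1 - r_{21}) , \]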
with r21 as in eq. (2) introduced in the context of ANOVA. Note that the relation between our
measure and rn1 is less simple for n > 2 because rn1 would then still measure the fraction of
the variance of Xn explained by X1 , while CX1 →Xn is related to the fraction of the conditional
variance of Xn , given its other parents, explained by X1 . This is because our causal strength
reduces to a conditional mutual information for independent parents, see the last sentence of
Theorem 4.
7. Experiments. Code for all experiments can be downloaded at
http://webdav.tuebingen.mpg.de/causality/ (to be completed after acceptance)
7.1. DAGs without time structure. We here restrict attention to linear structural equa-
tions, but interesting generalizations are given by additive noise models [16, 17, 20] and
post-nonlinear models [21].
The first step in estimating the causal strength consists in inferring the structure matrix
A in (24) from the given matrix $\mathbf{X}$ of observations $x^i_j$ with $j = 1, \ldots, n$ and $i = 1, \ldots, 2k$ (the
jth row corresponds to the observed values of $X_j$). We performed this step by ridge regression. We
decompose A into the sum $A_S + A_{\bar S}$ as in Section 6.
Then we divide the columns of $\mathbf{X}$ into two parts $\mathbf{X}_A$ and $\mathbf{X}_B$ of sample size k. While $\mathbf{X}_A$ is
kept as it is, $\mathbf{X}_B$ is used to generate new samples according to the modified structural
equations: First we note that the values of the noise variables corresponding to the observations
$\mathbf{X}_B$ are given by the residuals
\[ \mathbf{E}_B := \mathbf{X}_B - A \cdot \mathbf{X}_B . \]
Then we generate a matrix $\mathbf{X}'_B$ by applying independent random permutations to the columns
of $\mathbf{X}_B$, which simulates samples of the random variables $X'_j$ in (25). Samples from the modified
structural equation are now given by
\[ \mathbf{X}^S_B := (I - A_{\bar S})^{-1} \cdot \left( \mathbf{E}_B + A_S \cdot \mathbf{X}'_B \right) . \]
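The generation of modified samples can be sketched as follows. This Python fragment is an illustration under assumptions of ours: the structure matrix is taken as given (the ridge regression step is omitted), the function name is hypothetical, and the permutation step is read as reordering the observations of each variable independently, so that every row of the data matrix receives its own permutation. The samples are obtained by solving the modified structural equations (25) for the observed variables.

```python
import numpy as np

def modified_samples(X_B, A, A_S, rng):
    """Generate samples from P_S for the linear model X = A X + E.

    X_B : (n, k) observations; row j holds the samples of X_j.
    A   : (n, n) structure matrix; A_S keeps only the entries of cut edges.
    """
    n, k = X_B.shape
    E_B = X_B - A @ X_B                                  # residuals = noise values
    # Permute every row independently to mimic the independent copies X'_j.
    X_prime = np.stack([row[rng.permutation(k)] for row in X_B])
    A_rest = A - A_S                                     # entries of the remaining edges
    # Solve (I - A_rest) X^S = E_B + A_S X', i.e. the modified equations (25).
    return np.linalg.solve(np.eye(n) - A_rest, E_B + A_S @ X_prime)

# Small usage example with a hypothetical two-node model X_1 -> X_2.
rng = np.random.default_rng(0)
A = np.array([[0.0, 0.0],
              [1.5, 0.0]])
E = rng.normal(size=(2, 500))
X_B = np.linalg.solve(np.eye(2) - A, E)
X_S = modified_samples(X_B, A, A.copy(), rng)            # cut the only edge
```

If no edge is cut, the function returns the original observations, which provides a simple sanity check.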
To estimate the relative entropy distance between P and $P_S$ (with samples $\mathbf{X}_A$ and $\mathbf{X}^S_B$),
we use the method described in [19]: Let $d_i$ be the Euclidean distance from the ith column in
$\mathbf{X}_A$ to the rth nearest neighbor among the other columns of $\mathbf{X}_A$ and $d^S_i$ be the distance to
the rth nearest neighbor among all columns of $\mathbf{X}^S_B$, then the estimator reads
\[ \hat{D}(P \,\|\, P_S) := \frac{n}{k} \sum_{i=1}^{k} \log \frac{d^S_i}{d_i} + \log \frac{k}{k-1} . \]
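A brute-force version of this nearest-neighbor estimator fits in a few lines. The Python sketch below is a minimal illustration and not the implementation used for the experiments: the function name is hypothetical, distances are computed for all pairs (which is adequate only for a few thousand samples), the second sample is allowed to have a different size, and the result is in nats.

```python
import numpy as np

def knn_kl_estimate(X_A, X_S, r=1):
    """r-nearest-neighbor estimate of D(P || P_S) in nats.

    X_A : (n, k) samples from P, one sample per column.
    X_S : (n, k_s) samples from P_S, one sample per column.
    """
    n, k = X_A.shape
    k_s = X_S.shape[1]
    a, s = X_A.T, X_S.T
    # Pairwise Euclidean distances within X_A and from X_A to X_S.
    d_aa = np.linalg.norm(a[:, None, :] - a[None, :, :], axis=2)
    d_as = np.linalg.norm(a[:, None, :] - s[None, :, :], axis=2)
    np.fill_diagonal(d_aa, np.inf)                 # exclude the point itself
    d = np.sort(d_aa, axis=1)[:, r - 1]            # r-th neighbor among the other columns of X_A
    d_s = np.sort(d_as, axis=1)[:, r - 1]          # r-th neighbor among the columns of X_S
    return float((n / k) * np.sum(np.log(d_s / d)) + np.log(k_s / (k - 1)))

# Usage with synthetic one-dimensional Gaussian samples.
rng = np.random.default_rng(0)
p_samples = rng.normal(size=(1, 1000))
q_samples = rng.normal(loc=3.0, size=(1, 1000))
print(knn_kl_estimate(p_samples, q_samples))
```

For equal sample sizes the last term equals $\log \frac{k}{k-1}$, as in the displayed estimator.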
Fig 8. Estimated and computed value $C_{1 \to 2}$ for $X_1 \to X_2$, indicated by ∗ and +, respectively. The underlying
linear Gaussian model reads $X_2 = a \cdot X_1 + E$. Left: sample size 1000, which amounts to 500 samples in each
part. Right: sample size 2000, which yields more reliable results.
Figure 8 shows the difference between estimated and computed causal strength for the simplest
DAG X1 → X2 with increasing structure coefficient. For some edges we obtain significant bias.
However, since the bias depends on the distributions [19], it would be challenging to correct
for it.
To provide a more general impression on the estimation error we have considered a complete
DAG on n = 3 and n = 6 nodes and randomly generated structure coefficients. In each of
$\ell = 1, \ldots, 100$ runs, the structure matrix is generated by independently drawing each entry
from a standard normal distribution. For each of the $\binom{n}{2}$ arrows $i \to j$ and each $\ell$ we computed
and estimated $C_{i \to j}$, which yields the x-value and the y-value, respectively, of one of the $100 \cdot \binom{n}{2}$
points in the scatter plots in Figure 9. Remarkably, we do not see a significant degradation
for n = 6 nodes (right) compared to n = 3 (left).
7.2. Time series. The fact that transfer entropy fails to capture causal strength has been
one of our motivations for defining a different measure. We revisit the critical example in
Section 5.2, where the dynamical evolution on two bits was given by noisy copy operations
from $X_{t-1}$ to $Y_t$ and $Y_{t-1}$ to $X_t$. This way, we obtained a causal strength of 1 bit in the limit
where the copy operations become perfect. Our software for estimating causal strength only covers the
case of linear structural equations, with the additional assumption of Gaussianity for the
subroutines that compute the causal strength from the covariance matrices for comparison
with the estimated value.
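A natural linear version of Example 7 is an autoregressive (AR-) model of order 1 given by
\[ \begin{pmatrix} X_t \\ Y_t \end{pmatrix} = \begin{pmatrix} 0 & \sqrt{1 - \epsilon^2} \\ \sqrt{1 - \epsilon^2} & 0 \end{pmatrix} \begin{pmatrix} X_{t-1} \\ Y_{t-1} \end{pmatrix} + \begin{pmatrix} E^X_t \\ E^Y_t \end{pmatrix} , \]
where $E^X_t, E^Y_t$ are independent noise terms. We consider the stationary regime where $X_t$ and
$Y_t$ have unit variance and $E_t$ has variance $\epsilon^2$. For $\epsilon \to 0$ the influence from $X_{t-1}$ on $Y_t$, and
similarly from $Y_{t-1}$ to $X_t$ gets deterministic. We thus obtain infinite causal strength (note that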
two deterministically coupled random variables with probability density have infinite mutual
Fig 9. Relation between computed and estimated single arrow strengths for 100 randomly generated structure
matrices and noise variance 1. The estimation is based on sample size 1000. Left: complete DAG on 3 nodes.
Right: the same for 5 nodes.
Fig 10. Estimated and computed value $C_{X \to Y_t}$ where $\epsilon = 2^{-m}$ and m runs from 1 to 10. Left: for
length T = 5000. Right: T = 50,000.
information). It is easy to see that transfer entropy does not diverge, because the conditional
variance of $Y_t$ is $2\epsilon^2$ if only the past of Y is given and $\epsilon^2$ if the past of X is given in addition.
Reducing the variance by a factor of 2 corresponds to a constant information gain of $\frac{1}{2} \log 2$,
regardless of how small $\epsilon$ is.
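The contrast between the two quantities can be made concrete with a small numerical sketch. It assumes, following the remark on mutual information above, that the strength of the arrow from $X_{t-1}$ to $Y_t$ equals the Gaussian mutual information $I(X_{t-1}; Y_t) = -\frac{1}{2} \log(1 - \rho^2)$ with $\rho^2 = 1 - \epsilon^2$. For the transfer entropy it uses the conditional variance $2\epsilon^2 - \epsilon^4$ obtained by substituting the model equation once, which the text approximates by $2\epsilon^2$. The function names are hypothetical.

```python
import math

def causal_strength_bits(eps):
    """I(X_{t-1}; Y_t) in bits for unit-variance Gaussians with rho^2 = 1 - eps^2."""
    return -math.log2(eps)

def transfer_entropy_bits(eps):
    """Gaussian transfer entropy from X to Y in bits for the AR(1) model."""
    var_given_past_y = 2 * eps**2 - eps**4   # past of Y only (about 2 eps^2)
    var_given_both = eps**2                  # past of X given in addition
    return 0.5 * math.log2(var_given_past_y / var_given_both)

if __name__ == "__main__":
    for m in range(1, 11):
        eps = 2.0 ** -m
        print(m, causal_strength_bits(eps), transfer_entropy_bits(eps))
```

Under these assumptions the first quantity grows linearly in m for $\epsilon = 2^{-m}$, while the second approaches half a bit.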
Figure 10 shows the computed and estimated values of causal strength for decreasing $\epsilon$,
i.e., approaching the deterministic limit. Note that, in this limit, the empirical estimator underestimates
the relative entropy $D(P \| P_S)$. This is because the estimated relative entropy remains finite even in
the case where the true one is infinite because one distribution lives on a lower-dimensional
manifold. This probably explains the large deviations for m = 8, 9, 10, since m = 8 already
corresponds to very small noise, with standard deviation of order $10^{-3}$ when $X_t$ has unit
variance.
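A reader who wants to explore this regime can simulate the AR(1) model directly. The sketch below is not the estimator used for Figure 10. It is a simple Gaussian plug-in based on the sample correlation between $X_{t-1}$ and $Y_t$, under the assumption that the arrow strength equals their mutual information. Function names, the seed, and the initialization from the stationary distribution are illustrative choices.

```python
import numpy as np

def simulate_ar1(eps, T, seed=0):
    """Simulate the cross-coupled AR(1) model, started in the stationary regime."""
    rng = np.random.default_rng(seed)
    a = np.sqrt(1.0 - eps**2)
    x = np.empty(T)
    y = np.empty(T)
    x[0], y[0] = rng.standard_normal(2)          # unit-variance start
    noise = eps * rng.standard_normal((T, 2))    # noise with variance eps^2
    for t in range(1, T):
        x[t] = a * y[t - 1] + noise[t, 0]
        y[t] = a * x[t - 1] + noise[t, 1]
    return x, y

def plugin_strength_bits(x, y):
    """Gaussian plug-in estimate of I(X_{t-1}; Y_t) in bits from the sample correlation."""
    r = np.corrcoef(x[:-1], y[1:])[0, 1]
    return -0.5 * np.log2(1.0 - r**2)

if __name__ == "__main__":
    for m in range(1, 11):
        x, y = simulate_ar1(2.0 ** -m, 5000)
        print(m, plugin_strength_bits(x, y))
```

The printed estimates can be compared with the value m bits that the Gaussian mutual information takes for $\epsilon = 2^{-m}$.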
8. Conclusions. We have defined the strength of an arrow or a set of arrows in a causal
Bayesian network by quantifying the impact of an operation that we called “destruction of
edges”. We have stated a few postulates that we consider natural for a measure of causal
strength and shown that they are satisfied by our measure. We do not claim that our list is
complete, nor do we claim that measures violating our postulates are inappropriate. How to
quantify causal influence may strongly depend on the purpose of the respective measure.
For a brief discussion of an alternative measure of causal strength and some of the difficulties
that arise when quantifying the total influence of one set of nodes on another, see the
supplementary material [18].
The goal of this paper is to encourage discussions on how to define causal strength within
a framework that is general enough to include dependencies between variables of arbitrary
domains, including non-linear interactions, and multi-dimensional and discrete variables at
the same time.
APPENDIX A: FURTHER PROPERTIES OF CAUSAL STRENGTH AND PROOFS
A.1. Proof of Theorem 2. Expand $C_S(P)$ as
$$
\begin{aligned}
D(P \| P_S) &= \sum_{x_1 \dots x_n} P(x_1 \dots x_n) \log \frac{P(x_1 \dots x_n)}{P_S(x_1 \dots x_n)} \qquad (27) \\
&= \sum_{x_1 \dots x_n} P(x_1 \dots x_n) \log \frac{P(x_1 \dots x_n)}{\tilde{P}_S(x_1 \dots x_n)} + \sum_{x_1 \dots x_n} P(x_1 \dots x_n) \log \frac{\tilde{P}_S(x_1 \dots x_n)}{P_S(x_1 \dots x_n)} . \qquad (28)
\end{aligned}
$$
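Note that the second term can be written as
$$
\begin{aligned}
\sum_{x_1 \dots x_n} P(x_1 \dots x_n) \log \prod_{j=1}^n \frac{\tilde{P}_S(x_j | pa_j^{\bar{S}})}{P_S(x_j | pa_j^{\bar{S}})}
&= \sum_{j=1}^n \sum_{x_1 \dots x_n} P(x_1 \dots x_n) \log \frac{\tilde{P}_S(x_j | pa_j^{\bar{S}})}{P_S(x_j | pa_j^{\bar{S}})} \qquad (29) \\
&= \sum_{j=1}^n \sum_{x_j, pa_j} P(x_j, pa_j^{\bar{S}}, pa_j^{S}) \log \frac{\tilde{P}_S(x_j | pa_j^{\bar{S}})}{P_S(x_j | pa_j^{\bar{S}})} \qquad (30) \\
&= \sum_{j=1}^n \sum_{x_j, pa_j^{\bar{S}}} \tilde{P}_S(x_j | pa_j^{\bar{S}}) P(pa_j^{\bar{S}}) \log \frac{\tilde{P}_S(x_j | pa_j^{\bar{S}})}{P_S(x_j | pa_j^{\bar{S}})} \qquad (31) \\
&= \sum_{j=1}^n \sum_{pa_j^{\bar{S}}} P(pa_j^{\bar{S}}) \cdot D\left[ \tilde{P}_S(X_j | pa_j^{\bar{S}}) \,\|\, P_S(X_j | pa_j^{\bar{S}}) \right] . \qquad (32)
\end{aligned}
$$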
Causal influence is thus observed influence plus a correction term that quantifies the diver-
gence between the partially observed and interventional distributions. The correction term is
non-negative since it is a weighted sum of conditional Kullback-Leibler divergences.
A.2. Decomposition into conditional relative entropies. The following result generalizes
Lemma 3 to the case where S contains more than one edge. It shows that the relative
entropy expression defining causal strength decomposes into a sum of conditional relative
entropies, each of them referring to the conditional distribution of one of the target nodes, given
its parents:
Lemma 6 (causal influence decomposes into a sum of expectations).
The causal influence of a set of arrows S can be rewritten as
$$
C_S(P) = \sum_{j \in trg(S)} D\left[ P(X_j | PA_j) \,\Big\|\, \sum_{pa_j^{S}} P(X_j | PA_j^{\bar{S}}, pa_j^{S}) \cdot P_\Pi(pa_j^{S}) \right] ,
$$
where trg(S) denotes the target nodes of arrows in S.
The result is used in the proof of Theorem 5 below.
Proof. Using the chain rule for relative entropy [7] we get
$$
\begin{aligned}
D(P \| P_S) &= \sum_{j=1}^n D\left[ P(X_j | PA_j) \,\|\, P_S(X_j | PA_j) \right] \qquad (33) \\
&= \sum_{j=1}^n \sum_{pa_j} P(pa_j) \, D\left[ P(X_j | pa_j) \,\|\, P_S(X_j | pa_j) \right] = \sum_{j \in trg(S)} D\left[ P(X_j | PA_j) \,\|\, P_S(X_j | PA_j) \right] , \qquad (34)
\end{aligned}
$$
where we have used that $P(X_j | PA_j) = P_S(X_j | PA_j)$ for all $j \notin trg(S)$. Then the statement
follows from the definition of $P_S(X_j | PA_j)$. Note that a similar statement for $D(P_S \| P)$ (i.e.,
swapping the roles of $P$ and $P_S$) would not hold because then the weighting factor $P(pa_j)$ in
(34) would need to be replaced with the factor $P_S(pa_j)$, which is sensitive even to deleting edges
not targeting j.
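The decomposition can be checked numerically on a small discrete example. The sketch below is only an illustration under stated assumptions: a complete DAG on three binary nodes X, Y, Z with randomly drawn conditionals, and S = {X → Z}, where $P_S$ is built by feeding Z with an independent copy of X drawn from the marginal P(x), as in the expression of Lemma 6. Variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_conditional(shape, rng):
    """Random strictly positive table, normalized over its last axis."""
    t = rng.random(shape) + 0.1
    return t / t.sum(axis=-1, keepdims=True)

def kl(p, q):
    """Relative entropy D(p || q) in nats for strictly positive arrays."""
    return float(np.sum(p * np.log(p / q)))

# Complete DAG X -> Y, X -> Z, Y -> Z on binary variables.
p_x = random_conditional((2,), rng)            # P(x)
p_y_x = random_conditional((2, 2), rng)        # P(y | x), indexed [x, y]
p_z_xy = random_conditional((2, 2, 2), rng)    # P(z | x, y), indexed [x, y, z]
p = p_x[:, None, None] * p_y_x[:, :, None] * p_z_xy   # joint P(x, y, z)

# Cut S = {X -> Z}: P_S(z | y) = sum_x' P(z | x', y) P(x').
p_s_z_y = np.einsum("a,ayz->yz", p_x, p_z_xy)
p_s = p_x[:, None, None] * p_y_x[:, :, None] * p_s_z_y[None, :, :]

# Left: relative entropy between the full joints.
joint_divergence = kl(p, p_s)

# Right: conditional relative entropy at the single target node Z.
p_xy = p.sum(axis=2)
lemma_sum = sum(
    p_xy[x, y] * kl(p_z_xy[x, y], p_s_z_y[y])
    for x in range(2)
    for y in range(2)
)
```

Both quantities agree up to floating-point error, since only the conditional of the target node Z differs between $P$ and $P_S$.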
A.3. Proof of Theorem 5. Parts (a) and (b) follow from Lemma 6 since CSi (P ) is the
ith summand in (34), which obviously depends on P (Xi |P Ai ) and P (P Ai ) only.
To prove part (c), we will show that the restrictions of $P$, $P_{S_1}$, $P_{S_2}$ to the variables $Z, PA_Z$
form a so-called Pythagorean triple in the sense of [22], i.e.,
$$
D\left[ P(Z, PA_Z) \,\|\, P_{S_2}(Z, PA_Z) \right] = D\left[ P(Z, PA_Z) \,\|\, P_{S_1}(Z, PA_Z) \right] + D\left[ P_{S_1}(Z, PA_Z) \,\|\, P_{S_2}(Z, PA_Z) \right] . \qquad (35)
$$
This is sufficient because the left-hand side and the first term on the right-hand side of eq. (35)
coincide with $C_{S_2}$ and $C_{S_1}$, respectively, due to part (b). Note, however, that
$$
D\left[ P_{S_1}(Z, PA_Z) \,\|\, P_{S_2}(Z, PA_Z) \right] \neq D\left[ P_{S_1} \,\|\, P_{S_2} \right]
$$
because we have such a locality statement only for terms of the form $D(P \| P_S)$. We therefore
consistently restrict attention to $Z, PA_Z$ and find:
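$$
\begin{aligned}
D\left[ P(Z, PA_Z) \,\|\, P_{S_2}(Z, PA_Z) \right]
&= \sum_{z, pa_Z} P(z, pa_Z) \log \frac{P(z | pa_Z)}{P_{S_2}(z | pa_Z^{\bar{S}_2})} \\
&= \sum_{z, pa_Z} P(z, pa_Z) \log \frac{P(z | pa_Z)}{P_{S_1}(z | pa_Z^{\bar{S}_1})} + \sum_{z, pa_Z} P(z, pa_Z) \log \frac{P_{S_1}(z | pa_Z^{\bar{S}_1})}{P_{S_2}(z | pa_Z^{\bar{S}_2})} \\
&= D\left[ P(Z, PA_Z) \,\|\, P_{S_1}(Z, PA_Z) \right] + \sum_{z, pa_Z} P(z | pa_Z^{\bar{S}_1}, pa_Z^{S_1}) \, P_\Pi(pa_Z^{S_1}) \, P(pa_Z^{\bar{S}_1}) \log \frac{P_{S_1}(z | pa_Z^{\bar{S}_1})}{P_{S_2}(z | pa_Z^{\bar{S}_2})} ,
\end{aligned}
$$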
where we have used that the sources in $S_1$ are jointly independent and independent of the
other parents of Z. By definition of $P_{S_1}$, the second summand reads:
$$
\sum_{z, pa_Z^{\bar{S}_1}} P_{S_1}(z, pa_Z^{\bar{S}_1}) \log \frac{P_{S_1}(z | pa_Z^{\bar{S}_1})}{P_{S_2}(z | pa_Z^{\bar{S}_2})} = D\left[ P_{S_1}(Z, PA_Z) \,\|\, P_{S_2}(Z, PA_Z) \right] ,
$$
which proves (35).
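Identity (35) can also be verified numerically in the simplest setting. The following sketch assumes a node Z with two independent parents X and Y (so the independence condition used in the proof holds), $S_1$ = {X → Z} and $S_2$ = {X → Z, Y → Z}. The distributions are randomly generated and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def normalize(t):
    """Normalize a positive table over its last axis."""
    return t / t.sum(axis=-1, keepdims=True)

def kl(p, q):
    """Relative entropy D(p || q) in nats for strictly positive arrays."""
    return float(np.sum(p * np.log(p / q)))

# Independent sources X and Y (3 states each), binary target Z.
p_x = normalize(rng.random(3) + 0.1)
p_y = normalize(rng.random(3) + 0.1)
p_z_xy = normalize(rng.random((3, 3, 2)) + 0.1)       # P(z | x, y), indexed [x, y, z]
p = p_x[:, None, None] * p_y[None, :, None] * p_z_xy  # joint P(x, y, z)

# S1 = {X -> Z}: Z receives an independent copy of X.
p_s1_z_y = np.einsum("a,ayz->yz", p_x, p_z_xy)
p_s1 = p_x[:, None, None] * p_y[None, :, None] * p_s1_z_y[None, :, :]

# S2 = {X -> Z, Y -> Z}: Z receives independent copies of both parents.
p_s2_z = np.einsum("a,b,abz->z", p_x, p_y, p_z_xy)
p_s2 = p_x[:, None, None] * p_y[None, :, None] * p_s2_z[None, None, :]

lhs = kl(p, p_s2)                      # D(P || P_S2)
rhs = kl(p, p_s1) + kl(p_s1, p_s2)     # D(P || P_S1) + D(P_S1 || P_S2)
```

In this special case the three terms reduce to $I(X, Y; Z)$, $I(X; Z | Y)$ and $I(Y; Z)$, so the identity is an instance of the chain rule for mutual information.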
By Lemma 6, it is only necessary to prove part (d) in the case where both S and T consist of
arrows targeting a single node. To keep the exposition simple, we consider the particular case
of a DAG containing three nodes X, Y, Z where S = {X → Z} and T = {X → Z, Y → Z}.
The more general case follows similarly. Observe that $D(P \| P_T) = 0$ if and only if
$$
P(Z | x, y) = \sum_{\hat{x}, \hat{y}} P(Z | \hat{x}, \hat{y}) P(\hat{x}) P(\hat{y}) , \qquad (36)
$$
for all x, y such that P(x, y) > 0. Evaluating (36) at $x = x'$, multiplying both sides by $P(x')$, and summing over all $x'$
yields
$$
\sum_{x'} P(Z | x', y) P(x') = \sum_{\hat{x}, \hat{y}} P(Z | \hat{x}, \hat{y}) P(\hat{x}) P(\hat{y}) ,
$$
because the right-hand side does not depend on x. Using (36) again we obtain
$$
\sum_{x'} P(Z | x', y) P(x') = P(Z | x, y)
$$
for all x, y with P(x, y) ≠ 0. Hence $P_S = P$ and thus $D(P \| P_S) = 0$.
A.4. Causal influence measures controllability. Causal influence is intimately re-
lated to control. Suppose an experimenter wishes to understand interactions between compo-
nents of a complex system. For the causal DAG in Figure 1 d), she is able to observe nodes
Y and Z, and manipulate node X. To what extent can she control node Y ? The notion of
control has been formalized information-theoretically in [23]:
Definition 7 (perfect control).
Node Y is perfectly controllable by node X at Z = z if, given z,
i) states of Y are a deterministic function of states of X; and
ii) manipulating X gives rise to all states of Y .
Perfect control can be elegantly characterized:
Theorem 7 (information-theoretic characterization of perfect controllability).
A node Y with inputs X and Z is perfectly controllable by X alone for Z = z iff there exists
a Markov transition matrix R(x|z) such that
$$
H(Y | z, do\, X) := \sum_{x} R(x | z) H(Y | z, do\, x) = 0 , \quad \text{and} \qquad (C1)
$$
$$
\sum_{x \in \mathcal{X}} P(y | z, do(x)) R(x | z) \neq 0 \quad \text{for all } y . \qquad (C2)
$$
Here, H(Y|z, do(x)) denotes the conditional Shannon entropy of Y, given that Z = z has
been observed and X has been set to x.
Proof. The theorem restates the criteria in the definition. For a proof, see [23].
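To make conditions (C1) and (C2) tangible, the sketch below evaluates them for one given choice of R(x|z) at a fixed z. It is an illustration only: the theorem asks for the existence of some R, whereas the hypothetical function `satisfies_c1_c2` merely tests a candidate supplied by the caller. The array layout is an assumption.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits, ignoring zero-probability states."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-(nz * np.log2(nz)).sum())

def satisfies_c1_c2(p_y_given_do_x, r, tol=1e-12):
    """Check (C1) and (C2) for a fixed z.

    p_y_given_do_x[x, y] = P(y | z, do(x)); r[x] = R(x | z).
    Returns a pair of booleans (c1_holds, c2_holds).
    """
    p = np.asarray(p_y_given_do_x, dtype=float)
    r = np.asarray(r, dtype=float)
    # (C1): the R-weighted entropy of Y under interventions on X vanishes.
    weighted_entropy = sum(r[x] * entropy_bits(p[x]) for x in range(len(r)))
    c1 = bool(weighted_entropy <= tol)
    # (C2): every state y is reached with nonzero probability.
    c2 = bool(np.all(r @ p > tol))
    return c1, c2

if __name__ == "__main__":
    uniform = [0.5, 0.5]
    print(satisfies_c1_c2(np.eye(2), uniform))                   # noiseless copy
    print(satisfies_c1_c2([[0.9, 0.1], [0.1, 0.9]], uniform))    # noisy copy
    print(satisfies_c1_c2([[1.0, 0.0], [1.0, 0.0]], uniform))    # constant output
```

A noiseless copy satisfies both conditions, a noisy copy violates (C1), and a constant output violates (C2).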
It is instructive to compare Theorem 7 to our measure of causal influence. The theorem
highlights two fundamental properties of perfect control. First, (C1), perfect control requires
there is no variation in Y ’s behavior – aside from that due to the manipulation via X – given
that z is observed. Second, (C2), perfect control requires that all potential outputs of Y can
be induced by manipulating node X. This suggests a measure of the degree of control should
reflect (i) the variability in Y ’s behavior that cannot be eliminated by imposing X values and
(ii) the size of the repertoire of behaviors that can be induced on the target by manipulating
a source.
For the DAG under consideration Theorem 4 states that
$$
C_{X \to Y}(P) = I(X; Y | Z) = H(Y | Z) - H(Y | X, Z) .
$$
The first term, H(Y|Z), quantifies the size of the repertoire of outputs of Y averaged over manipulations
of X. It corresponds to requirement (C2) in the characterization of perfect control:
that P(y|z) > 0 for all y. Specifically, the causal influence, interpreted as a measure of the
degree of controllability, increases with the size of the (weighted) repertoire of outputs that
can be induced by manipulations.
The second term, H(Y|X, Z) (which coincides with H(Y|Z, do(X)) here), quantifies the
variability in Y's behavior that cannot be eliminated by controlling X. It corresponds to
requirement (C1) in the characterization of perfect control: that the remaining variability should
be zero. Causal influence increases as the variability $H(Y | Z, do(X)) = \sum_z P(z) H(Y | z, do(X))$
tends towards zero, provided that the first term remains constant.
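The two terms can be computed for any discrete joint distribution. The sketch below assumes a joint table indexed as `p_xyz[x, y, z]` and uses the hypothetical helper `controllability_terms`. The example takes X and Z to be independent fair bits with Y = X XOR Z, a choice made here purely for illustration.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits of a (possibly multi-dimensional) probability table."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def controllability_terms(p_xyz):
    """Return (H(Y|Z), H(Y|X,Z)) in bits for a joint table indexed [x, y, z]."""
    p_xyz = np.asarray(p_xyz, dtype=float)
    p_z = p_xyz.sum(axis=(0, 1))
    p_yz = p_xyz.sum(axis=0)
    p_xz = p_xyz.sum(axis=1)
    h_y_given_z = entropy_bits(p_yz) - entropy_bits(p_z)       # H(Y,Z) - H(Z)
    h_y_given_xz = entropy_bits(p_xyz) - entropy_bits(p_xz)    # H(X,Y,Z) - H(X,Z)
    return h_y_given_z, h_y_given_xz

# Example: X, Z independent fair bits, Y = X XOR Z.
p_xor = np.zeros((2, 2, 2))
for x in range(2):
    for z in range(2):
        p_xor[x, x ^ z, z] = 0.25

repertoire, residual = controllability_terms(p_xor)
conditional_mi = repertoire - residual   # I(X; Y | Z) in bits
```

In this example the repertoire term is one bit and the residual variability is zero, so the conditional mutual information is one bit: given z, manipulating X fully determines Y and reaches both of its states.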
ACKNOWLEDGEMENTS
We are grateful to Gábor Lugosi for a helpful hint for the proof of Lemma 1 in Supplement
S.1.
SUPPLEMENTARY MATERIAL
Supplement A: Supplement to “Quantifying causal influences”
(doi: COMPLETED BY THE TYPESETTER; .pdf). Three supplementary sections: (1) Gen-
erating an iid copy via random permutations; (2) Another option to define causal strength;
and (3) The problem of defining total influence.
REFERENCES
[1] J. Pearl. Causality. Cambridge University Press, 2000.
[2] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. Lecture Notes in Statistics.
Springer, New York, 1993.
[3] S. L. Lauritzen. Graphical models. Oxford Statistical Science Series. Oxford University Press, Oxford,
1996.
[4] P.W. Holland. Causal inference, path analysis, and recursive structural equations models. In C. Clogg,
editor, Sociological Methodology, pages 449–484. American Sociological Association, Washington DC, 1988.
[5] R. Northcott. Can ANOVA measure causal strength? The Quarterly Review of Biology, 83(1):47–55, 2008.
[6] R.C. Lewontin. Annotation: the analysis of variance and the analysis of causes. American Journal of Human
Genetics, 26(3):400–411, 1974.
[7] T. Cover and J. Thomas. Elements of Information Theory. Wiley Series in Telecommunications, New
York, 1991.
[8] C. W. J. Granger. Investigating causal relations by econometric models and cross-spectral methods.
Econometrica, 37(3):424–38, July 1969.
[9] T. Schreiber. Measuring information transfer. Phys. Rev. Lett., 85:461–464, Jul 2000.
[10] J. Massey. Causality, feedback and directed information. In Proc. 1990 Intl. Symp. on Info. Th. and its
Applications, Waikiki, Hawaii, 1990.
[11] N. Ay and D. Krakauer. Geometric robustness and biological networks. Theory in Biosciences, 125:93–121,
2007.
[12] N. Ay and D. Polani. Information flows in causal networks. Advances in Complex Systems, 11(1):17–41,
2008.
[13] J. Pearl. Direct and indirect effects. In Proceedings of the Seventh Conference on Uncertainty in Artificial
Intelligence (UAI), pages 411–420, San Francisco, CA, 2001. Morgan Kaufmann.
[14] C. Avin, I. Shpitser, and J. Pearl. Identifiability of path-specific effects. In Proceedings of the International
Joint Conference in Artificial Intelligence, pages 357–363, Edinburgh, Scotland, 2005.
[15] J.M. Robins and S. Greenland. Identifiability and exchangeability for direct and indirect effects. Epi-
demiology, 3(2):143–155, 1992.
[16] P. Hoyer, D. Janzing, J. Mooij, J. Peters, and B. Schölkopf. Nonlinear causal discovery with additive noise
models. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Proceedings of the conference
Neural Information Processing Systems (NIPS) 2008, Vancouver, Canada, 2009. MIT Press.
http://books.nips.cc/papers/files/nips21/NIPS2008_0266.pdf.
[17] J. Peters, J. Mooij, D. Janzing, and B. Schölkopf. Identifiability of causal graphs using functional models.
In Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence (UAI 2011).
http://uai.sis.pitt.edu/papers/11/p589-peters.pdf.
[18] D. Janzing, D. Balduzzi, M. Grosse-Wentrup, and B. Schölkopf. Supplementary material of this article.
Annals of Statistics.
[19] F. Pérez-Cruz. Estimation of information theoretic measures for continuous random variables. In Proceedings
of Neural Information Processing Systems (NIPS) 2008, 2009.
[20] J. Peters, D. Janzing, and B. Schölkopf. Causal inference on discrete data using additive noise models.
IEEE Transac. Patt. Analysis and Machine Int., 33(12):2436–2450, 2011.
[21] K. Zhang and A. Hyvärinen. On the identifiability of the post-nonlinear causal model. In Proceedings of
the 25th Conference on Uncertainty in Artificial Intelligence, Montreal, Canada, 2009.
[22] S. Amari and H. Nagaoka. Methods of Information Geometry. Oxford University Press, 1993.
[23] H. Touchette and S. Lloyd. Information-theoretic approach to the study of control systems. Physica A,
331:140–172, 2004.
Dominik Janzing
Spemannstr. 38
72076 Tübingen
Germany
E-mail: [email protected]

David Balduzzi
CAB F 63.1
Universitätstrasse 6
8092 Zurich
Switzerland
E-mail: [email protected]

Moritz Grosse-Wentrup
Spemannstr. 38
72076 Tübingen
Germany
E-mail: [email protected]

Bernhard Schölkopf
Spemannstr. 38
72076 Tübingen
Germany
E-mail: [email protected]