Commit 212a7299 authored by Christina Heinze-Deml

notebook IV+frontdoor

parent 0ae3d8b9
%% Cell type:markdown id: tags:
# Instrumental Variables and Frontdoor Criterion
by Jonas Peters, Niklas Pfister, 04.04.2019
This notebook aims to give you a basic understanding of the instrumental variable approach and the frontdoor criterion and when they can be used to infer causal relations.
%% Cell type:markdown id: tags:
## Instrumental Variable Model
In the following, assume that all variables have
* zero mean,
* finite second moments, and
* a joint distribution that is absolutely continuous with respect to Lebesgue measure.
The goal of the instrumental variable approach is to estimate the causal effect of a predictor variable $X$ on a target variable $Y$ if the effect from $X$ to $Y$ is confounded. The idea is to account for this confounding by considering an additional variable $I$ called an instrument. Although there exist numerous extensions, here, we focus on the classical case. We provide two definitions.
%% Cell type:markdown id: tags:
First, assume the following SEM
\begin{align}
I &:= N_I\\
H &:= N_H\\
X &:= I \gamma + H \delta_X + N_X\\
Y &:= X \beta + H \delta_Y + N_Y.
\end{align}
(All variables except $Y$ could be multi-dimensional, in which case, they should be written as row vectors: $1 \times d$.) If all variables are $1$-dimensional, the corresponding DAG looks as follows.
\begin{align}
&\phantom{0}\\
&\begin{array}{ccccc}
& & &H & \\
& &\phantom{abcdefgh}\overset{\delta_X}{\swarrow} & & \overset{\delta_Y}{\searrow}\phantom{abcdefgh}\\
& & & & \\
I &\overset{\gamma}{\longrightarrow} &X & \overset{\beta}{\longrightarrow} & Y\\
\end{array}\\
&\phantom{0}
\end{align}
Here, $I$ is called an instrumental variable for the causal effect from $X$ to $Y$. It is essential that $I$ affects $Y$ only via $X$ (and not directly), and that $I$ and $H$ are independent.
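As a minimal sketch of this setup, the following simulation draws data from the one-dimensional SEM above and compares the OLS slope with the ratio of sample covariances $\widehat{\operatorname{cov}}(I,Y)/\widehat{\operatorname{cov}}(I,X)$. The coefficient values, the standard Gaussian noise, and the sample size are assumptions chosen for illustration only.
%% Cell type:code id: tags:
``` R
set.seed(1)
n <- 1e5

# Hypothetical coefficients (not taken from any data set)
beta <- 2; gamma <- 1; delta_X <- 1; delta_Y <- 3

# Draw from the SEM: I and H are independent noise variables
I <- rnorm(n)
H <- rnorm(n)
X <- I * gamma + H * delta_X + rnorm(n)
Y <- X * beta + H * delta_Y + rnorm(n)

# OLS of Y on X without intercept (all variables have zero mean).
# Because H is omitted, the population slope here is
# beta + delta_Y * delta_X / var(X) = 2 + 3 / 3 = 3, not beta.
beta_ols <- unname(coef(lm(Y ~ X - 1)))

# Using the instrument: ratio of sample covariances
beta_iv <- cov(I, Y) / cov(I, X)

print(c(ols = beta_ols, iv = beta_iv))
```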
%% Cell type:markdown id: tags:
Second, it is possible to define instrumental variables without SEMs, too. Let us therefore write
\begin{equation}
Y = X \beta + \epsilon_Y
\end{equation}
(this can always be done). Here, $\epsilon_Y$ is allowed to depend on $X$ (if there is a confounder $H$ between $X$ and $Y$, this is usually the case). In this linear setting, we then call a variable $I$ an instrumental variable if it satisfies the following two conditions:
1. $\operatorname{cov}(X,I)$ is of full rank (relevance),
2. $\operatorname{cov}(\epsilon_Y,I)=0$ (exogeneity).
Informally speaking, these conditions again mean that $I$ affects $Y$ "only through its effect on $X$".
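To see why the two conditions pin down $\beta$, one can take covariances with $I$ on both sides of $Y = X\beta + \epsilon_Y$. The following short derivation is a sketch under two assumptions not stated above: $d = m$, so that the matrix involved is square, and the convention $\operatorname{cov}(I, X) := E[I^t X] \in \mathbb{R}^{m \times d}$ (the transpose of the matrix in condition 1; the covariance equals this product moment because all variables have zero mean).
\begin{align}
\operatorname{cov}(I, Y) &= \operatorname{cov}(I, X)\,\beta + \operatorname{cov}(I, \epsilon_Y) = \operatorname{cov}(I, X)\,\beta && \text{(exogeneity)}\\
\beta &= \operatorname{cov}(I, X)^{-1} \operatorname{cov}(I, Y) && \text{(relevance)}
\end{align}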
%% Cell type:markdown id: tags:
## Estimation
We now want to illustrate how the existence of an instrumental variable $I$ can be used to estimate the causal effect $\beta$ in the model above. Let us therefore assume that we have received data in matrix form
* $\mathbf{Y}$ - the target variable $n \times 1$
* $\mathbf{X}$ - the covariates $n \times d$
* $\mathbf{I}$ - the instruments $n \times m$
where $n > \max(m, d)$.
%% Cell type:markdown id: tags:
We now assume that $I$ is a valid instrument (we come back to this question in Exercise 2 below). To estimate the causal effect of $X$ on $Y$, there are several ways of writing down the same estimator.
OPTION 1: The following estimator is sometimes called the generalized method of moments (GMM) estimator
$$
\hat{\beta}^{GMM}_n := (\mathbf{X}^t \mathbf{I} (\mathbf{I}^t \mathbf{I})^{-1} \mathbf{I}^t \mathbf{X})^{-1} \, \mathbf{X}^t \mathbf{I} (\mathbf{I}^t \mathbf{I})^{-1} \mathbf{I}^t \mathbf{Y}
$$
OPTION 2: We can use a so-called two-stage least squares (2SLS) procedure. Step 1: Regress $X$ on $I$ and compute the corresponding fitted values $\hat{X}$. Step 2: Regress $Y$ on $\hat{X}$. The regression coefficients from step 2 are the estimate of $\beta$.
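The two options above can be sketched on simulated data as follows. This is only an illustration under stated assumptions: the dimensions ($d = 2$, $m = 3$), the coefficient matrix `Gamma`, the confounder `H`, and all numeric values are hypothetical, and both regressions are fitted without intercept since the variables have zero mean.
%% Cell type:code id: tags:
``` R
set.seed(1)
n <- 5000; d <- 2; m <- 3

# Hypothetical data-generating process with a hidden confounder H
I <- matrix(rnorm(n * m), n, m)                  # instruments, n x m
H <- rnorm(n)                                    # unobserved confounder
Gamma <- matrix(c(1, 0, 1,
                  0, 1, 1), m, d)                # m x d, full column rank
X <- I %*% Gamma + H %o% c(1, -1) + matrix(rnorm(n * d), n, d)
beta <- c(2, -1)
Y <- X %*% beta + 3 * H + rnorm(n)               # n x 1

# OPTION 1: GMM formula, written literally
A <- t(X) %*% I %*% solve(t(I) %*% I) %*% t(I)   # d x n
beta_gmm <- solve(A %*% X, A %*% Y)

# OPTION 2: 2SLS
# Step 1: regress X on I and compute the fitted values
X_hat <- I %*% solve(crossprod(I), crossprod(I, X))
# Step 2: regress Y on the fitted values
beta_2sls <- solve(crossprod(X_hat), crossprod(X_hat, Y))

print(cbind(gmm = drop(beta_gmm), tsls = drop(beta_2sls)))
print(all.equal(beta_gmm, beta_2sls))
```
%% Cell type:markdown id: tags:
In this sketch the two estimates agree up to numerical precision, because the projection onto the column space of $\mathbf{I}$ is symmetric and idempotent.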
The following four exercises go over some of the details of 2SLS and apply it to a real data set.
%% Cell type:markdown id: tags:
### Exercise 1
Assume that the data are i.i.d. from the following two structural assignments
\begin{align*}
Y &:= X \cdot \beta + \epsilon_Y \\
X &:= I \cdot \gamma + \epsilon_X,
\end{align*}
where $X$ and $I$ are written as $1 \times d$ and $1 \times m$ vectors, respectively. Here, $\epsilon_X$ and $\epsilon_Y$ are not necessarily independent, but the instrument $I$ is assumed to satisfy conditions 1 and 2 above.
a) Write down conditions on $d$ and $m$ that guarantee that $\hat{\beta}^{GMM}_n$ is well-defined (with probability one).
b) Prove that under these conditions, the GMM method is consistent, i.e., $\hat{\beta}^{GMM}_n \rightarrow \beta$ in probability.
c) Assume $d = m$. Prove that the methods 2SLS and GMM provide the same estimate.
%% Cell type:markdown id: tags:
### Solution 1
%% Cell type:code id: tags:
``` R
```
%% Cell type:markdown id: tags:
### End of Solution 1
%% Cell type:markdown id: tags:
For illustration, we use the <tt>CollegeDistance</tt> data set from [1] available in the R package <tt>AER</tt>.
%% Cell type:code id: tags:
``` R
library(AER)
# load CollegeDistance data set
data("CollegeDistance")
# read out relevant variables
Y <- CollegeDistance$score
X <- CollegeDistance$education
I <- CollegeDistance$distance
```
%% Cell type:markdown id: tags:
This data set consists of $4739$ observations on $14$ variables from a high school student survey conducted by the Department of Education in $1980$, with a follow-up in $1986$. In this notebook, we only consider the following variables:
* $Y$ - base year composite test score. These are achievement tests given to high school seniors in the sample.
* $X$ - number of years of education.
* $I$ - distance from closest 4-year college (units are in 10 miles).
%% Cell type:markdown id: tags:
### Exercise 2
Argue whether the variable $I$ can be used as an instrumental variable to infer the causal effect of $X$ on $Y$. Are there arguments why it might not be a valid instrument? Hint: You can perform a regression to test whether there is a significant correlation.
%% Cell type:markdown id: tags:
### Solution 2
%% Cell type:code id: tags:
``` R
```
%% Cell type:markdown id: tags:
### End of Solution 2
%% Cell type:markdown id: tags:
### Exercise 3
Use 2SLS to estimate the causal effect of $X$ on $Y$ based on the instrument $I$. Compare your results with a standard OLS regression of $Y$ on $X$ (that includes an intercept). What happens to the correlation between $X$ and the residuals in both methods? Which method yields the smaller residual variance?
%% Cell type:markdown id: tags:
### Solution 3
%% Cell type:code id: tags:
``` R
```
%% Cell type:markdown id: tags:
### End of Solution 3
%% Cell type:markdown id: tags:
A slightly different approach to 2SLS is to use the formula
OPTION 3:
\begin{equation} \tag{1}
\hat{\beta}_n = (\mathbf{I}^t \mathbf{X})^{-1} \mathbf{I}^t \mathbf{Y}.
\end{equation}
This formula can be shown to be the same as OPTIONS 1 and 2 if $d = m$ (try proving it).
%% Cell type:markdown id: tags:
### Exercise 4
Apply the above estimator (1) to the <tt>CollegeDistance</tt> data and compare your result with the one from Exercise 3. (If you have included intercepts in the 2SLS, you need to replace the product moments by sample covariances.)
%% Cell type:markdown id: tags:
### Solution 4
%% Cell type:code id: tags:
``` R
```
%% Cell type:markdown id: tags:
### End of Solution 4
%% Cell type:markdown id: tags:
## Frontdoor Criterion
Similar to the instrumental variable approach, this method aims to estimate the causal effect of a predictor variable $X$ on a target variable $Y$ if the effect from $X$ to $Y$ is confounded. Instead of an instrumental variable, the frontdoor criterion identifies the true causal effect based on a variable $Z$ that lies causally between $X$ and $Y$, also called a mediator. The frontdoor criterion is due to [2] and is commonly stated in terms of a DAG model.
%% Cell type:markdown id: tags:
More precisely, assume we are given the following DAG
\begin{align}
&\phantom{0}\\
&\begin{array}{ccccc}
& &H & & \\
&\swarrow & & \searrow & \\
& & & & \\
X &\longrightarrow &Z & \longrightarrow & Y\\
\end{array}\\
&\phantom{0}
\end{align}
Here, $Z$ is a mediator for the causal effect from $X$ to $Y$. It is essential that the confounder $H$ does not directly affect $Z$.
More formally, the frontdoor criterion requires that
1. $Z$ blocks all directed paths from $X$ to $Y$,
2. there are no unblocked backdoor paths from $X$ to $Z$, and
3. $X$ blocks all backdoor paths from $Z$ to $Y$.
If $Z$ satisfies the frontdoor criterion, the interventional density $p^{do(X:=x)} (y)$ can be computed based on observed quantities as follows
\begin{equation*}
p^{do(X:=x)} (y)=\int_{z} p(z|x) \int_{\tilde{x}}p(y|\tilde{x}, z) p(\tilde{x}) \, d\tilde{x}\, dz
\end{equation*}
This formula is also referred to as the *frontdoor adjustment* formula.
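For binary variables, the integrals in the frontdoor adjustment formula become sums. The following is a minimal sketch of that discrete version; the function name `frontdoor` and the two probability tables are hypothetical and serve only to illustrate the computation.
%% Cell type:code id: tags:
``` R
# Hypothetical joint p(x, z) and conditional p(y = 1 | x, z) for binary X and Z
# (rows: x = 0, 1; columns: z = 0, 1). Illustrative numbers only.
p_xz <- matrix(c(0.3, 0.2,
                 0.1, 0.4), nrow = 2, byrow = TRUE)
p_y1_xz <- matrix(c(0.2, 0.5,
                    0.3, 0.7), nrow = 2, byrow = TRUE)

# Discrete frontdoor adjustment:
# p^{do(X:=x)}(y = 1) = sum_z p(z | x) * sum_x' p(y = 1 | x', z) * p(x')
frontdoor <- function(x, p_xz, p_y1_xz) {
  p_x <- rowSums(p_xz)                          # marginal p(x')
  p_z_given_x <- p_xz[x + 1, ] / p_x[x + 1]     # p(z | x) for z = 0, 1
  inner <- colSums(p_y1_xz * p_x)               # row x' is weighted by p(x'), then summed over x'
  sum(p_z_given_x * inner)
}

p_do1 <- frontdoor(1, p_xz, p_y1_xz)
p_do0 <- frontdoor(0, p_xz, p_y1_xz)
print(c(do_x1 = p_do1, do_x0 = p_do0))
```
%% Cell type:markdown id: tags:
For these hypothetical tables, the sketch gives $0.53$ for $x=1$ and $0.39$ for $x=0$, whereas the observational conditional $p(y=1|x=1)$ equals $0.62$.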
The following exercise aims to give some intuition on the frontdoor criterion.
%% Cell type:markdown id: tags:
### Exercise 5
We are interested in determining whether dietary cholesterol has a positive causal effect on the risk of atherosclerosis (narrowing of the arteries due to the build-up of plaque). One might argue that there is a genetic factor which affects a person's risk of atherosclerosis while at the same time increasing that person's appetite for fatty food. In order to account for this, we plan to use a person's body fat content as a mediating variable.
Assume we are given data from a large observational study consisting of the following measurements:
* Does the person consume large amounts of dietary cholesterol? (yes: $x=1$, no: $x=0$)
* Did the person get atherosclerosis? (yes: $y=1$, no: $y=0$)
* Does the person have a high body fat content? (yes: $z=1$, no: $z=0$)
The data are summarized in the following table:
$$
\begin{array}{r|c|c}
& p(x=\cdot, z=\cdot) & p(y=1|x=\cdot, z=\cdot)\\\hline
x=0, z=0 & 0.16 & 0.05\\\hline
x=0, z=1 & 0.04 & 0.1\\\hline
x=1, z=0 & 0.45 & 0.4\\\hline
x=1, z=1 & 0.35 & 0.6\\
\end{array}
$$
a) Is the body fat content $Z$ a suitable mediating variable that satisfies the frontdoor criterion? Give reasons for and against.
b) Apply the frontdoor criterion to compute $p^{do(X:=1)} (y=1)$ and $p^{do(X:=0)} (y=1)$.
%% Cell type:markdown id: tags:
### Solution 5
%% Cell type:code id: tags:
``` R
```
%% Cell type:markdown id: tags:
### End of Solution 5
%% Cell type:markdown id: tags:
## References
[1] Kleiber, C., A. Zeileis (2008). Applied Econometrics with R. Springer-Verlag New York.
[2] Pearl, J. (1995). Causal diagrams for empirical research. Biometrika, 82(4):669–710.
%% Cell type:code id: tags:
``` R
```