{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Instrumental Variables and Frontdoor Criterion\n",
    "\n",
    "by Jonas Peters, Niklas Pfister, 04.04.2019\n",
    "\n",
    "This notebook aims to give you a basic understanding of the instrumental variable approach and the frontdoor criterion and when they can be used to infer causal relations."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Instrumental Variable Model\n",
    "\n",
    "In the following, let all variables have \n",
    "* zero mean,  \n",
    "* finite second moments, and\n",
    "* their joint distribution is absolutely continuous with respect to Lebesgue.\n",
    "\n",
    "The goal of the instrumental variable approach is to estimate the causal effect of a predictor variable $X$ on a target variable $Y$ if the effect from $X$ to $Y$ is confounded. The idea is to account for this confounding by considering an additional variable $I$ called an instrument. Although there exist numerous extensions, here, we focus on the classical case. We provide two definitions."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, assume the following SEM\n",
    "\\begin{align}\n",
    "I &:= N_I\\\\\n",
    "H &:= N_H\\\\ \n",
    "X &:= I \\gamma  + H \\delta_X + N_X\\\\\n",
    "Y &:= X \\beta + H \\delta_Y + N_Y.\\\\\n",
    "\\end{align}\n",
    "(All variables except $Y$ could be multi-dimensional, in which case, they should be written as row vectors: $1 \\times d$.) If all variables are $1$-dimensional, the corresponding DAG looks as follows.\n",
    "\\begin{align}\n",
    "    &\\phantom{0}\\\\\n",
    "    &\\begin{array}{ccc}\n",
    "       & & &H                 & \\\\\n",
    "       & &\\phantom{abcdefgh}\\overset{\\delta_X}{\\swarrow} &            & \\overset{\\delta_Y}{\\searrow}\\phantom{abcdefgh}\\\\\n",
    "       & &                    &               & \\\\\n",
    "       I &\\overset{\\gamma}{\\longrightarrow} &X                  & \\overset{\\beta}{\\longrightarrow} & Y\\\\\n",
    "        \\end{array}\\\\\n",
    "      &\\phantom{0}\n",
    "\\end{align}\n",
    "Here, $I$ is called an instrumental variable for the causal effect from $X$ to $Y$. It is essential that $I$ affects $Y$ only via $X$ (and not directly), and that $I$ and $H$ are independent.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Second, it is possible to define instrumental variables without SEMs, too. Let us therefore write\n",
    "\\begin{equation}\n",
    "Y =  X \\beta + \\epsilon_Y\n",
    "\\end{equation}\n",
    "(this can always be done). Here, $\\epsilon_Y$ is allowed to depend on $X$ (if there is a confounder $H$ between $X$ and $Y$, this is usually the case). In this linear setting, we then call a variable $I$ an instrumental variable if it satisfies the following two conditions:\n",
    "1. $\\operatorname{cov}(X,I)$ is of full rank (relevance)\n",
    "2. $\\operatorname{cov}(\\epsilon_Y,I)=0$ (exogenity).\n",
    "\n",
    "Informally speaking, these conditions again mean that $I$ affects $Y$ ''only through its effect on $X$''."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Estimation\n",
    "\n",
    "We now want to illustrate how the existence of an instrumental variable $I$ can be used to estimate the causal effect $\\beta$ in the model above. Let us therefore assume that we have received data in matrix form\n",
    "* $\\mathbf{Y}$ - the target variable $n \\times 1$ \n",
    "* $\\mathbf{X}$ - the covariates $n \\times d$\n",
    "* $\\mathbf{I}$ - the instruments $n \\times m$\n",
    "\n",
    "where $n > \\max(m, d)$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now assume that $I$ is a valid instrument (we come back to this question in Exercise 2 below). To estimate the causal effect of $X$ on $Y$, there are several options of writing down the same estimator. \n",
    "\n",
    "OPTION 1: The following estimator is sometimes called the generalized methods of moments (GMM)\n",
    "$$\n",
    "\\hat{\\beta}^{GMM}_n := (\\mathbf{X}^t \\mathbf{I} (\\mathbf{I}^t \\mathbf{I})^{-1} \\mathbf{I}^t \\mathbf{X})^{-1} \\, \\mathbf{X}^t \\mathbf{I} (\\mathbf{I}^t \\mathbf{I})^{-1} \\mathbf{I}^t \\mathbf{Y}\n",
    "$$\n",
    "\n",
    "OPTION 2: \n",
    "we can use a so-called 2-stage least squares (2SLS) procedure. Step 1: Regress $X$ on $I$ and compute the corresponding fitted values $\\hat{X}$. Step 2: Regress $Y$ on $\\hat{X}$. Use the regression coefficients from step 2.\n",
    "\n",
    "The following four exercises go over some of the details of the 2SLS and apply it to a real data set."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 1\n",
    "Assume that the data are i.i.d. from the following two structural assignments \n",
    "\\begin{align*}\n",
    "Y &:= X \\cdot \\beta + \\epsilon_Y \\\\\n",
    "X &:= I \\cdot \\gamma + \\epsilon_X,\n",
    "\\end{align*}\n",
    "where $X$ and $I$ are written as $1 \\times d$ and $1 \\times m$ vectors, respectively. Here, $\\epsilon_X$ and $\\epsilon_Y$ are not necessarily independent, but the instrument $I$ is assumed to satisfy the assumptions 1. and 2. above. \n",
    "\n",
    "a) Write down conditions on $d$ and $m$ that guarantee that $\\hat{\\beta}^{GMM}_n$ is well-defined (with probability one).\n",
    "\n",
    "b) Prove that under these conditions, the GMM method is consistent, i.e., $\\hat{\\beta}^{GMM}_n \\rightarrow \\beta$ in probability.\n",
    "\n",
    "c) Assume $d = m$. Prove that the methods 2SLS and GMM provide the same estimate. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Solution 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### End of Solution 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For illustration, we use the <tt>CollegeDistance</tt> data set from [1] available in the R package <tt>AER</tt>."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "library(AER)\n",
    "# load CollegeDistance data set\n",
    "data(\"CollegeDistance\")\n",
    "# read out relevant variables\n",
    "Y <- CollegeDistance$score\n",
    "X <- CollegeDistance$education\n",
    "I <- CollegeDistance$distance"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This data set consists of $4739$ observations on $14$ variables from high school student survey conducted by the Department of Education in $1980$, with a follow-up in $1986$. In this notebook, we only consider the following variables:\n",
    "* $Y$ - base year composite test score.  These are achievement tests given to high school seniors in the sample.\n",
    "* $X$ - number of years of education.\n",
    "* $I$ - distance from closest 4-year college (units are in 10 miles).\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 2\n",
    "\n",
    "Argue whether the variable $I$ can be used as an instrumental variable to infer the causal effect of $X$ on $Y$. Are there arguments, why it might not be a valid instrument? Hint: You can perform a regression in order to test if there is significant correlation."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Solution 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### End of Solution 2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 3\n",
    "Use 2SLS to estimate the causal effect of $X$ on $Y$ based on the instrument $I$. Compare your results with a standard OLS regression of $Y$ on $X$ (that includes an intercept). What happens to the correlation between $X$ and the residuals in both methods? Which attempt yields smaller variance of residuals?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Solution 3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### End of Solution 3"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A slightly different approach to 2SLS is to use the formula\n",
    "\n",
    "OPTION 3:\n",
    "\\begin{equation} \\tag{1}\n",
    "\\hat{\\beta}_n = (\\mathbf{I}^t \\mathbf{X})^{-1} \\mathbf{I}^t \\mathbf{Y}.\n",
    "\\end{equation}\n",
    "\n",
    "This formula can be shown to be the same as OPTIONS 1 and 2 if $d = m$ (try proving it). "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 4\n",
    "Apply the above estimator (1) to <tt>CollegeDistance</tt> data and compare your result with the one from Exercise 3. (If you have included intercepts in the 2SLS, you need to replace the product moments by sample covariances.)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Solution 4"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### End of Solution 4"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Frontdoor Criterion\n",
    "\n",
    "Similar to the instrumental variable approach this method aims to estimate the causal effect of a predictor variable $X$ on a target variable $Y$ if the effect from $X$ to $Y$ is confounded. Instead of an instrumental variable, the frontdoor criterion resolves the true causal effect based on a variable $Z$ that lies causally between $X$ and $Y$, also called a mediator. The frontdoor criterion is due to [2] and is commonly stated in terms of a DAG model."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "More precisely, assume we are given a the following DAG\n",
    "\\begin{align}\n",
    "    &\\phantom{0}\\\\\n",
    "    &\\begin{array}{ccc}\n",
    "       & &H &                & \\\\\n",
    "       &\\swarrow & & \\searrow      & \\\\\n",
    "       & &                    &               & \\\\\n",
    "       X &\\longrightarrow &Z                  & \\longrightarrow & Y\\\\\n",
    "        \\end{array}\\\\\n",
    "      &\\phantom{0}\n",
    "\\end{align}\n",
    "Here, $Z$ is a mediator for the causal effect from $X$ to $Y$. It is essential that confounding $H$ does not directly affect $Z$. \n",
    "\n",
    "More formally, the frontdoor criterion requires that \n",
    "1. $Z$ blocks all directed paths from $X$ to $Y$\n",
    "2. There are no unblocked backdoor paths from $X$ to $Z$\n",
    "3. $X$ blocks all backdoor paths from $M$ to $Y$\n",
    "\n",
    "If $Z$ satisfies the frontdoor criterion, the interventional density $p^{do(X:=x)} (y)$ can be computed based on observed quantities as follows\n",
    "\\begin{equation*}\n",
    "    p^{do(X:=x)} (y)=\\int_{z} p(z|x) \\int_{\\tilde{x}}p(y|\\tilde{x}, z) p(\\tilde{x}) \\, d\\tilde{x}\\, dz\n",
    "\\end{equation*}\n",
    "This formula is also referred to as the *frontdoor adjustment* formula.\n",
    "\n",
    "The following exercise aims to give some intution on the frontdoor criterion."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 5\n",
    "\n",
    "We are interested in determining whether dietary cholesterol has a postive causal effect on the risk of atherosclerosis (narrowing of the artery due to the build up of plaque). One might argue that there is a genetic factor which affects a person's risk of atherosclerosis while at the same time increasing that persons appetite for fatty food. In order to account for this, we plan to use a person's body fat content as a mediating variable.\n",
    "\n",
    "Assume we are given data from a large observational study consisting of the following measurements:\n",
    "\n",
    "* Does the person consume large amounts of dietary cholesterol? (yes: $x=1$, no: $x=0$)\n",
    "* Did the person get atherosclerosis? (yes: $y=1$, no: $y=0$)\n",
    "* Does the person have a high body fat content? (yes: $z=1$, no: $z=0$)\n",
    "\n",
    "The data is summarized in the following table\n",
    "$$\n",
    "\\begin{array}{r|c|c}\n",
    " & p(x=\\cdot, z=\\cdot) & p(y=1|x=\\cdot, z=\\cdot)\\\\\\hline\n",
    "x=0, z=0 & 0.16 & 0.05\\\\\\hline\n",
    "x=0, z=1 & 0.04 & 0.1\\\\\\hline\n",
    "x=1, z=0 & 0.45 & 0.4\\\\\\hline\n",
    "x=1, z=1 & 0.35 & 0.6\\\\\n",
    "\\end{array}\n",
    "$$\n",
    "\n",
    "a) Is the body fat content $Z$ a suitable mediating variable that satisfies the frontdoor criterion? Give reasons for and against.\n",
    "\n",
    "b) Apply the frontdoor criterion to compute $p^{do(X:=1)} (y=1)$ and $p^{do(X:=0)} (y=1)$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Solution 5"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### End of Solution 5"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## References\n",
    "\n",
    "[1] Kleiber, C., A. Zeileis (2008). Applied Econometrics with R. Springer-Verlag New York.\n",
    "\n",
    "[2] Pearl, J. (1995). Causal diagrams for empirical research. Biometrika, 82(4):669–710.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "R",
   "language": "R",
   "name": "ir"
  },
  "language_info": {
   "codemirror_mode": "r",
   "file_extension": ".r",
   "mimetype": "text/x-r-source",
   "name": "R",
   "pygments_lexer": "r",
   "version": "4.0.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}