ComputeInterventions.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Compute Interventions\n",
    "\n",
    "by Jonas Peters, Niklas Pfister, 01.12.2018\n",
    "\n",
    "This notebook is intended to give you some insight on how to compute intervention distributions. We will have a look at a few examples.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "library(igraph) #comment out this line if you cannot install igraph \n",
    "library(CondIndTests) \n",
    "library(dHSIC)\n",
    "source(\"../utils.R\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Given the observational distribution and the graph, we can now compute interventional distributions. As an example, let us focus on the intervention \n",
    "$$\n",
    "do(X:= 3).\n",
    "$$\n",
    "This intervention distribution can be computed using the adjustment formula. It says\n",
    "(in case of discrete variables, the integral is replaced by a sum)\n",
    "\\begin{equation} \\tag{1}\n",
    "p^{do(X:=x)} (y) = \\int_{\\mathbf{z}} p(y|x, \\mathbf{z}) p(\\mathbf{z}) \\, d\\mathbf{z},\n",
    "\\end{equation}\n",
    "where $\\mathbf{Z}$ is a \"valid adjustment set\". Usually, there are several valid adjustment sets. One set $\\mathbf{Z}$ that always works, is the set of parents of $X$, i.e., $\\mathbf{Z} = \\mathbf{PA}_X$.\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will apply formula (1) to a (famous) kidney stone data set. Let us assume that the causal structure looks as follows\n",
    "\\begin{align}\n",
    "    &\\phantom{0}\\\\\n",
    "    &\\begin{array}{cc}\n",
    "         &S                 & \\\\\n",
    "        \\phantom{abcdefgh}\\swarrow &            & \\searrow\\phantom{abcdefgh}\\\\\n",
    "                            &               & \\\\\n",
    "         T                  & \\longrightarrow & R\\\\\n",
    "        \\end{array}\\\\\n",
    "      &\\phantom{0}\n",
    "\\end{align}\n",
    "where $T$ is treatment, $S$ is size of the stone, and $R$ is recovery. The following table shows the recovery rates\n",
    "\n",
    "$$\n",
    "\\begin{array}{r|c|c}\n",
    "& \\text{Treatment A} & \\text{Treatment B}\\\\\\hline\n",
    "\\text{Small Stones } { (\\frac{357}{700} = 0.51)}& \\frac{81}{87} = {0.93} & \\frac{234}{270} = 0.87\\\\\\hline\n",
    "\\text{Large Stones } { (\\frac{343}{700} = 0.49)} & \\frac{192}{263} = {0.73} & \\frac{55}{80} = 0.69\\\\\\hline\n",
    "& \\frac{273}{350} = 0.78 & \\frac{289}{350} = {0.83}\n",
    "\\end{array}\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 1: \n",
    "Compute \n",
    "$$\n",
    "P^{do(T:=A)} (R = 1)\n",
    "$$\n",
    "and\n",
    "$$\n",
    "P^{do(T:=B)} (R = 1).\n",
    "$$\n",
    "Which treatment is better?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Solution 1: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### End of Solution 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For the remainder of this notebook we will have a look at an artificial example. We first load some data. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "load(file = \"../data/ComputeInterventionsData1.Rdata\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This data has been generated from an SEM with the following graph structure."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "Adj <- rbind(c(0,0,0,1,0,0,0,0,0), c(0,0,1,1,0,0,0,0,0), c(0,0,0,0,0,0,0,1,0), c(0,0,0,0,1,1,0,0,0), \n",
    "           c(0,0,0,0,0,0,0,0,0), c(0,0,0,0,0,0,1,1,0), c(0,0,0,0,0,0,0,0,0), c(0,0,0,0,0,0,0,0,1), \n",
    "           c(0,0,0,0,0,0,0,0,0))\n",
    "set.seed(1)\n",
    "plotGraphfromAdj(Adj, labels = c(\"C\", \"A\", \"K\", \"X\", \"F\", \"D\", \"G\", \"Y\", \"H\")) \n",
    "#comment out the above line if you cannot install igraph\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us say, we are interested in the causal effect from $X$ to $Y$, i.e., in the intervention distribution $p^{do(X:=x)} (y)$.\n",
    "\n",
    "Computing equation (1) is often difficult, even if we are given full knowledge of the observational distribution, that is, if we are given the density $p$ (especially in the case of continuous variables and high-dimensional $\\mathbf{Z}$). In this example, we are not even given $p$ but only a sample from $p$ and estimating conditional distributions is a hard statistical problem. Fortunately, the joint distribution $p$ is Gaussian and things become easier. Equation (1) then implies\n",
    "\\begin{equation} \\tag{2}\n",
    "\\mathbf{E}^{do(X:=x)} [Y] = \\alpha x,\n",
    "\\end{equation}\n",
    "where \n",
    "$\\alpha$ is sometimes called the causal effect from $X$ to $Y$ and is determined by \n",
    "$$\n",
    "\\mathbf{E} [Y\\,| \\,X=x, \\mathbf{Z} = \\mathbf{z}] = \\alpha x + \\beta^t \\mathbf{z}\n",
    "$$\n",
    "In practice, we can \n",
    "obtain $\\alpha$ as the regression coefficient, when linearly regressing $Y$ on $X$ and $\\mathbf{Z}$.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 2: \n",
    "1. Estimate\n",
    "$\n",
    "\\mathbf{E}^{do(X:=3)} [Y] \n",
    "$\n",
    "from data.loaded.\n",
    "\n",
    "2. \n",
    "We have regressed \n",
    "$Y$ on $X$ *and* $\\mathbf{Z}$ --- thus the name \"adjusting for $\\mathbf{Z}$\".\n",
    "Do you see what goes wrong when we try to estimate the causal effect $\\alpha$ \n",
    "by regressing $Y$ on $X$ without adjusting for $\\mathbf{Z}$?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Solution 2: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### End of Solution 2 "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Whether or not a set is a valid adjustment set can be answered using the notion of d-separation (e.g., the set should block the path $X \\leftarrow A \\rightarrow K \\rightarrow Y$). Since d-separation is an important concept, we would like to revise it here. \n",
    "\n",
    "In brief, given a DAG , the (disjoint) sets $Q$ and $R$ are d-separated by a set $S$ if all paths between $Q$ and $R$ are blocked by $S$. A path $i_1, i_2, \\ldots, i_m$ is blocked by $S$ if there is a node $i_k$ on the path ($1 < k < m$) such that one of the following conditions hold:\n",
    "* ${i_k} \\in {S}$ and\n",
    "\\begin{align*}\n",
    "&{i_{k-1}} \\rightarrow {i_k} \\rightarrow {i_{k+1}}\\\\\n",
    "\\text{or }\\;&{i_{k-1}} \\leftarrow {i_k} \\leftarrow {i_{k+1}}\\\\\n",
    "\\text{or }\\;&{i_{k-1}} \\leftarrow {i_k} \\rightarrow {i_{k+1}}\n",
    "\\end{align*}\n",
    "* neither ${i_k}$ nor any of its descendants is in ${S}$ and\n",
    "$$\n",
    "{i_{k-1}} \\rightarrow {i_k} \\leftarrow {i_{k+1}}.\n",
    "$$ \n",
    "\n",
    "The data that we have loaded comes from an SEM. One can show that the distribution is Markov with respect to the corresponding graph. This means that d-separation implies conditional independence. We will try to verify this with a few examples. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 3: \n",
    "Write down two d-separation statements that hold and two that do not hold. Test for conditional and unconditional independence using the functions ```CondIndTest``` from package ```CondIndTests``` and ```dhsic.test``` from package ```dHSIC```, respectively. Use the data in data.loaded; you can access variables using ```data.loaded[,\"X\"]```, for example. Do you find the correct conditional independences? Note that conditional independence testing is a difficult statistical problem, especially if the conditioning set is large."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Solution 3: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### End of Solution 3"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we are given the full SEM (this will often not be the case), there is a different strategy for computing causal effects. We will use this to validate our estimate from Exercise 2. In fact, the data that we have provided above, was generated by the following SEM:\n",
    "\\begin{align*} \n",
    "C &:= N_C\\\\\n",
    "A &:=  0.8 N_A\\\\\n",
    "K &:=  A + 0.1 N_K\\\\\n",
    "X &:=  C - 2 A + 0.2 N_X\\\\\n",
    "F &:=  3 X + 0.8 N_F\\\\\n",
    "D &:=  -2 X + 0.5 N_D\\\\\n",
    "G &:=  D + 0.5 N_G\\\\\n",
    "Y &:=  2 K - D + 0.2 N_Y\\\\\n",
    "H &:=  0.5 Y + 0.1 N_H\n",
    "\\end{align*}\n",
    "with all $N$'s being jointly independent and having a standard normal distribution.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "set.seed(1); n <- 200\n",
    "C <- rnorm(n)\n",
    "A <- 0.8*rnorm(n)\n",
    "K <- A + 0.1*rnorm(n)\n",
    "X <- C - 2*A + 0.2*rnorm(n)\n",
    "F <- 3*X + 0.8*rnorm(n)\n",
    "D <- -2*X + 0.5*rnorm(n)\n",
    "G <- D + 0.5*rnorm(n)\n",
    "Y <- 2*K - D + 0.2*rnorm(n)\n",
    "H <- 0.5*Y + 0.1*rnorm(n)\n",
    "data.check <- cbind(C, A, K, X, F, D, G, Y, H)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Indeed, this yields the same data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "head(data.loaded)\n",
    "head(data.check)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 4: \n",
    "1. Generate data from the intervened SEM $do(X:=3)$ and compare your findings with the estimate computed above. \n",
    "2. Look at the graph shown above and at the coefficients of the SEM. Compute $\\mathbf{E}^{do(X:=3)}[Y]$ using the path method. \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Solution 4: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "### End of Solution 4 "
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "R",
   "language": "R",
   "name": "ir"
  },
  "language_info": {
   "codemirror_mode": "r",
   "file_extension": ".r",
   "mimetype": "text/x-r-source",
   "name": "R",
   "pygments_lexer": "r",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}