The whole point of the below is to prioritise who gets treatment first. Who do we canvass first? Who gets the first dose of medicine? Average Treatment Effects simply don’t help here, and other methods do very poorly if you have only experimental data at a slice in time (so you can’t do difference-in-differences or some other unit-level estimate of treatment effect).
So here’s my algorithm. The high-level objective is to estimate a continuous treatment effect for out of sample observations that is essentially a function of individual-level characteristics. For an observation in the test set, the estimated treatment effect is a convex combination of the (unobserved, but estimated) unit-level treatment effects in the training set, where the weights are given by similarity to the units in the training set.
Here’s a rough outline.
Start with the true causal effect for each observation. Y is the outcome; the indicator before the pipe is the group (t = treatment, 0 = control), the indicator after the pipe is whether they received the treatment, and i in brackets indexes the unit.
Yt|t(i) – Yt|0(i)* = treatment effect on the treated
Y0|t(i)* – Y0|0(i) = treatment effect on the untreated. Stars indicate that we never observe that outcome.
Let’s say that the conditional ignorability assumption holds, so treatment assignment is independent of the potential outcomes given the observed covariates, as it is in a randomised control trial. Sid Chib’s method is to bypass matching entirely and instead predict the counterfactual. He does this by building two models, F(X(i)) and G(X(i)), where
Yt|t(i) = F(Xt(i)) + e(i) tries to predict the outcome variable for the treatment group
Y0|0(i) = G(X0(i)) + u(i) tries to predict the outcome variable for the control group
and then ‘predict’ the counterfactual by putting the control’s Xs into the treatment model, and vice versa. We could use any predictive model for this step, though better predictive models should give better counterfactual estimates.
Yt|0(i)* = F(X0(i))
Y0|t(i)* = G(Xt(i))
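The two-model counterfactual step can be sketched as follows. This is a minimal illustration with simulated data, not the original implementation; the variable names and the choice of gradient boosting as the predictive model are my assumptions.

```python
# Sketch of the two-model counterfactual step on simulated data.
# All names (X_t, y_t, etc.) are illustrative; gradient boosting is
# just one choice of predictive model for F and G.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Simulated example: 5 covariates, treatment adds 2 * X[:, 0].
n = 500
X_t = rng.normal(size=(n, 5))
X_0 = rng.normal(size=(n, 5))
base = np.array([1.0, 0.5, 0.0, 0.0, 0.0])
y_t = X_t @ base + 2 * X_t[:, 0] + rng.normal(size=n)
y_0 = X_0 @ base + rng.normal(size=n)

# F predicts outcomes for the treated; G for the controls.
F = GradientBoostingRegressor().fit(X_t, y_t)
G = GradientBoostingRegressor().fit(X_0, y_0)

# Counterfactuals: feed each group's Xs into the other group's model.
y_t_counter = G.predict(X_t)   # Y0|t(i)*  (treated units, had they not been treated)
y_0_counter = F.predict(X_0)   # Yt|0(i)*  (control units, had they been treated)

# Unit-level treatment effect estimates, T(i)*, for every unit.
T_star = np.concatenate([y_t - y_t_counter, y_0_counter - y_0])
X = np.vstack([X_t, X_0])
```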
We then substitute these fitted values into the relations above to arrive at a vector of estimated treatment effects, one for each unit, T(i)*.
Here’s where the method gets fun. Define X as the rbind of Xt and X0. Now we can run a randomForest, rF, to predict the treatment effects given the Xs.
T(i)* = rF(X(i)) + v(i).
The lovely thing about random forests is that one of the outputs is the fantastic proximity matrix. For a sample of N, this is a symmetric N*N matrix whose i,j-th (and j,i-th) entry is the proportion of trees in which observations i and j fall into the same terminal leaf. Now, what does it mean to share a leaf? In a given tree, two observations share a leaf when they fall down the same branches, which only happens when they are similar in several ways. So a high proximity score indicates that two observations are similar in all the ways that help predict the treatment effect.
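R's randomForest returns the proximity matrix directly (via `proximity = TRUE`); scikit-learn does not, but it can be recovered from the leaf assignments. A sketch, with simulated stand-ins for the X and T* of the previous step:

```python
# Recovering a proximity matrix from a scikit-learn random forest.
# X and T_star are simulated stand-ins for the covariates and the
# unit-level treatment effect estimates from the earlier step.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
T_star = 2 * X[:, 0] + rng.normal(scale=0.1, size=200)

rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, T_star)

# leaves[i, t] = index of the terminal leaf that unit i reaches in tree t.
leaves = rf.apply(X)

# prox[i, j] = share of trees in which units i and j land in the same leaf.
prox = np.mean(leaves[:, None, :] == leaves[None, :, :], axis=2)
```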
Now, let’s say we do a survey and collect some (N1) more folks’ Xs, called X1. The whole point of this is to work out for each of these guys how much our intervention is expected to work on each of them (that is, not on average). In the problem I’m working on now, we want to know who we should pester first to buy health insurance.
Call X2 the rbind of X1 and X. We then call predict on the saved rF with X2, asking for proximities, and store the resulting proximity matrix. It is segmented into four submatrices:
PROX = [A,B;C,D], where A is the N1*N1 proximity matrix of the new observations, B is the N1*N matrix that maps the proximity of the new observations to the training set, C is the transpose of B, and D is the old N*N proximity matrix of the training set.
Now take B and normalise each row so that it sums to 1; call this B*. Its rows are our weights for the convex combination of in-sample estimated treatment effects. The estimated treatment effect for the new observations is
Tnew* = B* T*
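Putting the scoring step together: compute proximities between the new observations and the training set, row-normalise, and take the convex combination of the in-sample effect estimates. Again a self-contained sketch with simulated data and illustrative names:

```python
# Scoring new observations via proximity-weighted treatment effects.
# X, T_star, and X1 are simulated; names are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))                            # training covariates
T_star = 2 * X[:, 0] + rng.normal(scale=0.1, size=200)   # unit-level effects
X1 = rng.normal(size=(30, 5))                            # new survey respondents

rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, T_star)

leaves_new = rf.apply(X1)    # (30, n_trees)
leaves_old = rf.apply(X)     # (200, n_trees)

# B[i, j] = share of trees where new unit i and training unit j share a leaf.
B = np.mean(leaves_new[:, None, :] == leaves_old[None, :, :], axis=2)

# Normalise rows to sum to one: the convex-combination weights B*.
B_star = B / B.sum(axis=1, keepdims=True)

# Estimated treatment effects for the new observations, Tnew* = B* T*.
T_new = B_star @ T_star
```

Since the true simulated effect is 2 * X1[:, 0], T_new should track it closely for observations inside the training support.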
Validation here can be done by k-fold cross-validation. We get an unbiased estimate of the average treatment effect for the held-out fold the usual way (in an experiment, the simple difference in group means), then estimate it again using the method above. The difference between the observed test-set average treatment effect and the one implied by the method should be close to zero on average across folds.