## Building synthetic control groups using the proximity matrix of a Random Forest

This post is a short summary of my talk at the Melbourne Users of R Network last week, specifically on the use of synthetic control groups formed using the proximity matrix of a random forest. Don’t worry if you don’t know what those things are; a plain-English description is below. The slides are available here.

One of the main problems applied economists work on is estimating the ‘treatment effect’ of some policy variable. For example, we may be interested in whether those who graduate from university (university education being the ‘treatment’) earn more as a result.

A big problem with many of these sorts of questions is that there is selection into treatment; the people who go to university were probably going to make more anyway, and so simply comparing them to those who haven’t graduated is going to be a poor estimate of the true causal effect of university on earnings. What we want to do is compare the treated person to their untreated self.

Randomised controlled trials (RCTs) achieve this, in the statistical sense. Their beauty is that, in large samples, the distributions of personal characteristics (both observed and unobserved) are the same in the treatment and control groups. Those swallowing the sugar pill have the same probability of being 40, being female, or having a pushy mother as those receiving the real drug. Unfortunately for science, we’re not allowed to run RCTs for many interesting policies. Randomly allocating some people to more education and others to less may be seen as unethical.

Due to this constraint, economists often use natural experiments to achieve the same objective: discontinuities in policy, weather, or geography that randomly assign some people to a treatment group and others to the control group, while leaving the two groups otherwise very similar. However, good natural experiments are rare, normally don’t exist in our datasets, and apply to a relatively restricted set of interesting policy questions. While a good one will get you tenure at a top US university, they don’t seem to be the secret ingredient in building a better evidence base for good policy.

So what if you still have an interesting policy question, good quality data, but no natural experiment? Thankfully, some methods exist that allow us to construct synthetic control groups that look a lot more like the treatment group than the old control group.

One method that has been very popular over the last decade or so is propensity score matching, popularised by Rosenbaum and Rubin (1983), and Dehejia and Wahba (2002). This method works in two stages:

1. You set up a predictive model of the ‘treatment’ (in this case, university completion), using personal characteristics as the independent variables. Normally you’d use a logit/probit style model for this.

2. For each ‘treated’ observation, get the untreated observation with the closest predicted probability of having gone to university. You then chuck out the unmatched observations (they don’t look much like graduates anyway) and run your regression on the remaining observations. The ‘treatment effect’ in the regression is, hopefully, now closer to the true value.

This is very easily done in R using the ‘arm’ package. Some code is in the presentation linked at the top of this post.
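For readers without the slides to hand, here is a minimal sketch of step 2. It is plain Python rather than the R from the talk, and it is not the ‘arm’ implementation. The scores are made-up numbers standing in for fitted logit/probit probabilities, the function name `match_on_propensity` is hypothetical, and matching greedily without replacement is my assumption; other matching rules are possible.

```python
def match_on_propensity(scores, treated):
    """Greedy 1:1 nearest-neighbour matching without replacement.

    scores  -- estimated probability of treatment for each observation
    treated -- parallel list of booleans (True = treated)
    Returns a list of (treated_index, control_index) pairs.
    """
    controls = [i for i, t in enumerate(treated) if not t]
    pairs = []
    for i, t in enumerate(treated):
        if not t or not controls:
            continue
        # Closest remaining control by absolute difference in score.
        best = min(controls, key=lambda j: abs(scores[j] - scores[i]))
        pairs.append((i, best))
        controls.remove(best)  # each control is used at most once
    return pairs


# Hypothetical fitted probabilities of having gone to university.
scores = [0.81, 0.35, 0.78, 0.40, 0.10, 0.62]
treated = [True, False, False, True, False, False]

pairs = match_on_propensity(scores, treated)
# Indices that survive matching; everything else is discarded.
matched = sorted(i for pair in pairs for i in pair)
```

The regression would then be run only on the rows listed in `matched`. Because the matching is greedy, the order in which treated observations are processed can change which control each one gets.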

There are still some big problems with propensity score matching. Smith and Todd (2005) found that the results are not very robust to changes in the propensity model in step 1. And while we have constructed a match on the observed variables (those in the data), we still have no idea whether the treated observation is more likely to have a pushy mother or not. Nor do we have any idea about the direction or scale of the remaining bias. The method is not magical.

My improvement on this method is to use a more robust measure of similarity to help get over the Smith and Todd critique, borrowing from the Random Forest, a tool widely used in predictive analytics. For a deeper discussion of how these work, see here.

A Random Forest is basically a collection of models, called trees. In this case, they are models to predict whether someone went to university. Each of these trees is estimated on only a subset of the data, ensuring that no individual survey respondent or survey question makes much of a difference to the outcome. For every respondent, we ask all of the trees (there are sometimes thousands) whether they think the respondent went to university or not, based on their personal characteristics. The winning vote is the random forest’s ‘prediction’ for that survey respondent.

Every one of the trees is constructed so that the branches divide people according to some characteristic: in this case gender, or age, or mother’s education level, etc. A branch will grow off the tree only if dividing people according to one of these characteristics results in a ‘purer’ division of graduates and non-graduates. That is, each tree is constructed with the aim of some ‘leaves’ containing only graduates, and others no graduates.
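To make ‘purer’ concrete, here is a small Python sketch (not the R from the talk). It assumes Gini impurity as the purity measure, which this post does not specify, and the function names and the toy data are hypothetical.

```python
def gini(labels):
    """Gini impurity of a group of 0/1 labels (0.0 means perfectly pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)  # share of graduates in the group
    return 2 * p * (1 - p)


def impurity_decrease(parent, left, right):
    """Drop in size-weighted impurity from splitting parent into left/right."""
    n = len(parent)
    children = (len(left) * gini(left) + len(right) * gini(right)) / n
    return gini(parent) - children


# Hypothetical node: 1 = graduate, 0 = non-graduate.
parent = [1, 1, 1, 0, 0, 0]

# A made-up characteristic that separates the two groups perfectly...
good = impurity_decrease(parent, [1, 1, 1], [0, 0, 0])
# ...versus one that leaves both branches exactly as mixed as the parent.
poor = impurity_decrease(parent, [1, 0], [1, 1, 0, 0])
```

Under this assumption a tree would grow the first branch (a large decrease in impurity) and not the second (no decrease at all).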

When two people wind up in the same leaf, we know they are similar in several ways. Importantly, they are similar in the ways that matter to whether they will have gone to university. They are said to be proximate. The proximity score for a pair of observations is the proportion of trees in which the two end up in the same terminal leaf. My proximity score matching routine works like so:

1. Run a Random Forest with the treatment as the dependent variable, and include the pre-treatment independent variables. As random forests are built on random subsets of the data, you should set the random-number generation seed if you want your results to be replicable. Also, make sure you save the proximity matrix!

2. Match on the proximity scores, and save the proximities of the matched untreated observations. I find that using these as weights improves my estimates (in terms of a decrease in deviation from experimental estimates).

3. Discard unmatched observations. Make sure you don’t duplicate matches!

4. Run your regressions on the remaining data.

Example R code is included in the presentation linked above.
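As a rough illustration of steps 1 to 3, here is a Python sketch (not the R from the talk, and not my actual routine). The leaf ids are invented rather than taken from a fitted forest; in R, `randomForest(..., proximity = TRUE)` returns the finished proximity matrix directly. The function names are hypothetical and the greedy matching order is an assumption.

```python
def proximity_matrix(leaves):
    """leaves[i][t] is the terminal leaf that tree t assigns to observation i.

    Proximity of i and j = share of trees in which they land in the same leaf.
    """
    n, n_trees = len(leaves), len(leaves[0])
    return [[sum(a == b for a, b in zip(leaves[i], leaves[j])) / n_trees
             for j in range(n)] for i in range(n)]


def match_on_proximity(prox, treated):
    """Greedy 1:1 matching without replacement on the proximity score.

    Returns {treated_index: (control_index, proximity)}; the proximity can
    later serve as a regression weight for the matched control.
    """
    controls = [j for j, t in enumerate(treated) if not t]
    matches = {}
    for i, t in enumerate(treated):
        if not t or not controls:
            continue
        best = max(controls, key=lambda j: prox[i][j])  # most proximate control
        matches[i] = (best, prox[i][best])
        controls.remove(best)  # never reuse a control (no duplicate matches)
    return matches


# Hypothetical leaf ids for 5 people across 4 trees.
leaves = [
    [1, 3, 2, 5],  # person 0, treated
    [1, 3, 2, 4],  # person 1, untreated
    [2, 3, 1, 4],  # person 2, untreated
    [2, 1, 1, 4],  # person 3, treated
    [7, 8, 9, 6],  # person 4, untreated, unlike everyone else
]
treated = [True, False, False, True, False]

prox = proximity_matrix(leaves)
matches = match_on_proximity(prox, treated)
# Rows kept for the regression in step 4; person 4 is discarded.
kept = sorted(set(matches) | {c for c, _ in matches.values()})
```

Step 4 would then run the regression on the rows in `kept`, optionally weighting each matched control by the proximity stored alongside it in `matches`.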

I believe that matching on the proximity score, rather than the propensity score, produces a control group that is more similar to the treatment group than any other existing method, and consequently allows us to produce less biased estimates of the causal effects of policy. In my experiments with this method so far, I have found that:

- Matching on the proximity score is more robust to the inclusion of extra independent variables than matching with probit/logit methods; and

- Benchmarking against the famous LaLonde dataset, I find my estimates of the causal effect are closer to the experimental estimate than when using propensity score matching.

However, I should emphasise that if you have a small number of trees, you will have fairly unstable matches. With current memory constraints, it is not feasible to build proximity matrices on large datasets for lots of trees. So the method is not well suited to large datasets without re-writing the Random Forest algorithm to iteratively update the proximity matrix.
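One possible reading of ‘iteratively update’ is sketched below in Python (not R, and not something I have implemented in the Random Forest code itself). It assumes the trees can be visited one at a time, and it keeps only a treated-by-untreated table of counts instead of the full square matrix. The function name and the leaf ids are hypothetical.

```python
def update_counts(counts, tree_leaves, treated_idx, control_idx):
    """Add one tree's result to a treated-by-control co-occurrence table.

    tree_leaves[i] is the terminal leaf this tree assigns to observation i.
    counts[r][c] counts the trees so far in which treated_idx[r] and
    control_idx[c] shared a leaf. The table is updated in place.
    """
    for r, i in enumerate(treated_idx):
        for c, j in enumerate(control_idx):
            if tree_leaves[i] == tree_leaves[j]:
                counts[r][c] += 1


# Hypothetical leaf ids, one list per tree, for 5 observations.
trees = [
    [1, 1, 2, 2, 7],
    [3, 3, 3, 1, 8],
    [2, 2, 1, 1, 9],
    [5, 4, 4, 4, 6],
]
treated_idx = [0, 3]
control_idx = [1, 2, 4]

counts = [[0] * len(control_idx) for _ in treated_idx]
for tree_leaves in trees:  # each tree can be thrown away after this call
    update_counts(counts, tree_leaves, treated_idx, control_idx)

# Convert counts to proximities once all trees have been seen.
block = [[c / len(trees) for c in row] for row in counts]
```

The table has one row per treated observation and one column per untreated observation, so it is smaller than the full matrix whenever only treated-to-untreated proximities are needed for matching.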

If you have any experimental or quasi-experimental data you would like to share with me, I’d love to do some more benchmarking of this routine. As it is, I’m 85% sure it’s an improvement on what we have; I’d like to be more sure!