## Why do government agencies pay so little attention to the tools of contemporary predictive analytics?

Government econometricians, as in the Treasury or Productivity Commission, do most of their work in either a) estimating elasticities or b) preparing forecasts. They estimate elasticities (the expected percentage change in one variable attributable to a 1% change in another) because governments want answers to questions like “if we reduce tax rates on a second job by x, how much more can we expect people to work?” They use forecasts to tell governments how much certain events are likely to affect budget positions, policy viability, etc. This is all very important work when designing policy.

The primary tools government agencies use to make these predictions are reasonably basic statistical models (or CGE models based on statistically deduced elasticities). These are the sorts of models one would learn in an advanced econometrics class: in general they are linear, non-Bayesian methods which estimate the parameters of a pre-specified continuous function. In the Treasury at least, these are generally basic Error Correction Models and VARs/SVECMs (though there are a few people there doing more advanced stuff).

The confusing aspect of this is that outside econometrics, when people need to predict or forecast events AND have a lot of money (and bonus upside) riding on whether their forecasts are correct (as in insurance or credit markets), the tools of econometrics are barely used. Indeed, many models, like the couple I will describe below, completely ignore some of the fundamental building blocks of classical econometrics (like parameters and continuous functions), simply because getting the prediction right is so much more important. After describing a couple of these non-classical-econometric models, I will spend some time hypothesising why they’re not used much in policy.

Decision Tree

An important building block for several of these methods (especially the popular Random Forest algorithm) is the decision tree. I’ll describe how these work here. Bear in mind I am a complete novice at this stuff, and I’m describing it as my intuition understands it. Please let me know if I have anything wrong. For a more technical discussion, please see here: http://www.mayo.edu/hsr/techrpt/61.pdf

Let’s say we are interested in estimating the probability that a parent will send at least one of their children to a private school (1 for yes, 0 for no; while I use a binary classification here, your dependent variable can be categorical, a survival rate, a continuous variable, etc.). My independent variables are {Income, # times attend church/month, # children, proportion of children = girl, immigrant?}. Let’s say the data look like this:

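| Private school? | Joint income | # times church/month | # children | Proportion of children = girl | Immigrant? |
|---|---|---|---|---|---|
| 0 | 72,000 | 0 | 4 | 0.75 | 0 |
| 0 | 90,000 | 0 | 2 | 1 | 1 |
| 1 | 400,000 | 0 | 3 | 0 | 0 |
| 0 | 120,000 | 0 | 1 | 1 | 0 |
| 1 | 95,000 | 1 | 1 | 0 | 0 |
| 1 | 130,000 | 4 | 2 | 0.5 | 0 |
| 0 | 32,000 | 2 | 5 | 0.4 | 0 |
| 0 | 110,000 | 0 | 2 | 0.5 | 1 |
| 0 | 76,000 | 0 | 3 | 0.66 | 0 |
| 1 | 170,000 | 2 | 1 | 1 | 1 |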

Each of the rows relates to a particular family, with the first column being our dependent variable (the one we want to predict) and the other columns being those we use to help us. Such a problem may exist in urban planning; for example, we may have all the independent variables for a suburb, but no information on the proportion of families in the suburb sending their children to private school. Or we may be interested in what is likely to happen to a suburb undergoing demographic change.

The idea of decision tree building is to identify “split points” in the independent variables which neatly separate the observations into groups whose dependent variable is less “impure” than in the un-split group. I should define these two concepts:

1. First, impurity. An econometrician may think of this as “heterogeneity”. There are a couple of ways of measuring this. Let p_{iA} be the proportion of class i in group A. “i” would be {public school, private school} in the case above. “A” would simply be the sample represented by the entire table. We want to work out the “impurity” of the dependent variable column in the sample. If everyone sends their children to public school, we want it to be 0; likewise if everyone sends their children to private school. We want there to be maximum impurity when there is an equal proportion of every class in the sample. There are two simple functions which we use for this. Let’s say there are K classes in the dependent variable column; then the impurity of the sample A is either:

   $I(A) = \sum_{i=1}^{K} -p_{iA}\ln(p_{iA})$ (information index)

   or

   $I(A) = \sum_{i=1}^{K} p_{iA}(1-p_{iA})$ (Gini index).

   These can be thought of intuitively as a simple function which is bigger when the sample is more heterogeneous, and zero when it is perfectly homogeneous.
2. A “split point” can be thought of like this. Let’s say we rank the entire sample by each of the independent variables in turn. So when we rank it in terms of income, it looks like this:

   | Private school? | Joint income | # times church/month | # children | Proportion of children = girl | Immigrant? |
   |---|---|---|---|---|---|
   | 0 | 32,000 | 2 | 5 | 0.4 | 0 |
   | 0 | 72,000 | 0 | 4 | 0.75 | 0 |
   | 0 | 76,000 | 0 | 3 | 0.66 | 0 |
   | 0 | 90,000 | 0 | 2 | 1 | 1 |
   | 1 | 95,000 | 1 | 1 | 0 | 0 |
   | 0 | 110,000 | 0 | 2 | 0.5 | 1 |
   | 0 | 120,000 | 0 | 1 | 1 | 0 |
   | 1 | 130,000 | 4 | 2 | 0.5 | 0 |
   | 1 | 170,000 | 2 | 1 | 1 | 1 |
   | 1 | 400,000 | 0 | 3 | 0 | 0 |
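
To make the two impurity measures from point 1 concrete, here is a minimal Python sketch that applies them to the dependent variable column of the table. The function names (`class_proportions`, `information_index`, `gini_index`) are my own, chosen for illustration; they are not taken from any library.

```python
import math

def class_proportions(labels):
    """Share of each class that is present in `labels`."""
    n = len(labels)
    return [labels.count(c) / n for c in set(labels)]

def information_index(labels):
    """Sum of -p * ln(p) over the classes present in the group."""
    return sum(-p * math.log(p) for p in class_proportions(labels))

def gini_index(labels):
    """Sum of p * (1 - p) over the classes present in the group."""
    return sum(p * (1 - p) for p in class_proportions(labels))

# Dependent variable column from the table (1 = private school, 0 = public).
private_school = [0, 0, 1, 0, 1, 1, 0, 0, 0, 1]

print(round(gini_index(private_school), 2))         # 0.48
print(round(information_index(private_school), 2))  # 0.67
print(gini_index([0, 0, 0]))                        # 0.0 for a pure group
```

Both measures are zero for a group containing a single class and largest when the classes are evenly mixed (for two classes the Gini index peaks at 0.5).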

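The aim is to split the sample into two (or more) groups, based on one of these independent variables, and in doing so reduce the impurity of the total sample. So letting A_{1} be the first sub-group created and A_{2} the second, we want to find the split point (in one of the independent variables) which maximises I(A)-I(A_{1})-I(A_{2}). (Implementations such as the rpart routines described in the report linked above typically weight each sub-group’s impurity by its share of the observations, so that a tiny pure group does not dominate.)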

Working so many calculations out by hand is exhausting, so we let computers do it. In the sample above, we split it into two sub-groups: one with incomes at or above 130,000 (impurity = 0, as all three families in this group have a child in private school), and the other with incomes below 130,000 (impurity = 0.24, as one of the seven families has a child in private school). Thus the Gini impurity falls from 0.48 in the full sample to 0 and 0.24 in the two sub-groups.
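
Here is a rough sketch of the search the computer does for the income column. It assumes the Gini index as the impurity measure and, as a labelled departure from the unweighted expression above, weights each sub-group’s impurity by its share of the sample (the usual convention in tree software). The helper names `gini` and `best_split` are hypothetical.

```python
def gini(labels):
    """Gini index for a list of 0/1 labels; zero for an empty or pure group."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)   # share of 1s
    return 2 * p * (1 - p)          # equals the sum of p_i * (1 - p_i) over both classes

def best_split(values, labels):
    """Try each observed value as a threshold (second group = values >= threshold).
    Returns (threshold, size-weighted impurity after the split)."""
    n = len(labels)
    best = None
    for t in sorted(set(values))[1:]:   # skip the minimum: it would leave group 1 empty
        left = [y for x, y in zip(values, labels) if x < t]
        right = [y for x, y in zip(values, labels) if x >= t]
        after = len(left) / n * gini(left) + len(right) / n * gini(right)
        if best is None or after < best[1]:
            best = (t, after)
    return best

income = [72000, 90000, 400000, 120000, 95000, 130000, 32000, 110000, 76000, 170000]
private_school = [0, 0, 1, 0, 1, 1, 0, 0, 0, 1]

threshold, after = best_split(income, private_school)
below = [y for x, y in zip(income, private_school) if x < threshold]

print(threshold)                                         # 130000
print(round(gini(below), 2))                             # 0.24 (unweighted, as in the text)
print(round(gini(private_school), 2), round(after, 2))   # 0.48 0.17
```

With the weighting, the impurity after the split is 0.7 × 0.24 ≈ 0.17 rather than 0.24; either way, 130,000 comes out as the best split point on income for this sample.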

In building decision trees, this process is repeated for each of the sub-groups created, using all of the independent variables which improve purity (in many cases, for more than one split). At the end of the process we get many terminal nodes, each far less “impure” than the original sample. Consequently, for a new observation, we can assign it to a terminal node and so determine the probability that the observation will be of each class.
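
As a sketch of that iteration (not the rpart algorithm itself), the following Python grows a small tree on the ten families by splitting recursively until a node is pure or no split lowers the size-weighted Gini impurity. The dictionary layout of a node and the names `grow` and `predict` are assumptions made for illustration.

```python
def gini(labels):
    p = sum(labels) / len(labels)   # share of 1s in a non-empty 0/1 label list
    return 2 * p * (1 - p)

def best_split(rows, labels):
    """Return (weighted impurity, column, threshold) of the best split, or None."""
    n, best = len(rows), None
    for col in range(len(rows[0])):
        for t in sorted({r[col] for r in rows})[1:]:
            left = [y for r, y in zip(rows, labels) if r[col] < t]
            right = [y for r, y in zip(rows, labels) if r[col] >= t]
            score = len(left) / n * gini(left) + len(right) / n * gini(right)
            if best is None or score < best[0]:
                best = (score, col, t)
    return best

def grow(rows, labels):
    """Split recursively until a node is pure or no split lowers impurity."""
    node = {"p_private": sum(labels) / len(labels), "n": len(labels)}
    split = best_split(rows, labels) if gini(labels) > 0 else None
    if split is None or split[0] >= gini(labels):
        return node                                  # terminal node
    _, col, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[col] < t]
    right = [(r, y) for r, y in zip(rows, labels) if r[col] >= t]
    node.update(col=col, threshold=t,
                left=grow(*zip(*left)), right=grow(*zip(*right)))
    return node

def predict(node, row):
    """Walk a new observation down to its terminal node; return P(private)."""
    while "col" in node:
        node = node["left"] if row[node["col"]] < node["threshold"] else node["right"]
    return node["p_private"]

# Columns: income, church visits/month, children, proportion girls, immigrant.
rows = [(72000, 0, 4, 0.75, 0), (90000, 0, 2, 1, 1), (400000, 0, 3, 0, 0),
        (120000, 0, 1, 1, 0), (95000, 1, 1, 0, 0), (130000, 4, 2, 0.5, 0),
        (32000, 2, 5, 0.4, 0), (110000, 0, 2, 0.5, 1), (76000, 0, 3, 0.66, 0),
        (170000, 2, 1, 1, 1)]
private_school = [0, 0, 1, 0, 1, 1, 0, 0, 0, 1]

tree = grow(rows, private_school)
print(tree["col"], tree["threshold"])            # 0 130000
print(predict(tree, (150000, 0, 2, 0.5, 0)))     # 1.0
print(predict(tree, (100000, 0, 2, 0.5, 0)))     # 0.0
```

On this sample the second split is on the proportion of girls, which isolates a single family in its own terminal node. With ten observations the terminal-node probabilities are all 0 or 1, which is a small-scale picture of how readily an unpruned tree fits noise.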

Extensions:

There are several extensions to the basic decision tree. First, a decision tree tends to over-fit, which increases the probability of mis-classifying out-of-sample data. This is remedied by attaching a “cost” to each additional node: a cost of infinity will result in the “tree” just being the initial, un-split data set, while a cost of zero will result in the full tree. As the cost increases, the least efficient splits are no longer made, and so the tree is “pruned”.
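
One way to picture the node cost is as a penalty added to each candidate tree’s in-sample error, in the spirit of cost-complexity pruning. The sketch below is my own illustration under that assumption: the error counts for the root and the income split come from the sample above, while the third candidate (a further split that removes the last error) is hypothetical.

```python
def penalised_cost(errors, n_terminal_nodes, alpha):
    """In-sample misclassifications plus a cost `alpha` per terminal node."""
    return errors + alpha * n_terminal_nodes

# candidate tree -> (misclassified families out of 10, number of terminal nodes)
candidates = {
    "root": (4, 1),                      # predict "public" for everyone
    "income_split": (1, 2),              # split at income >= 130,000
    "two_splits_hypothetical": (0, 3),   # assumed further split that fits perfectly
}

def choose(alpha):
    """Pick the candidate tree with the lowest penalised cost."""
    return min(candidates, key=lambda name: penalised_cost(*candidates[name], alpha))

print(choose(0))   # two_splits_hypothetical
print(choose(2))   # income_split
print(choose(5))   # root
```

At zero cost the fullest tree wins, and as the cost per node rises the later splits stop paying for themselves. In practice the cost is usually chosen by cross-validation rather than set by hand.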

We can also create a matrix of “costs” incurred by misclassifying an observation of type j as type k. This is especially useful if a false negative costs more than a false positive, as would be the case in cancer diagnosis. This works by altering the impurity measure of each split. Say misclassifying a {public school} family as a {private school} family is twice as bad as the opposite (it will result in fewer public schools being built). Then the impurity measure should peak where the expected costs of the two possible mistakes are equal, 2 × proportion(public) = 1 × proportion(private), which is at proportion(public) = 1/3, proportion(private) = 2/3, rather than at the 50/50 ratio before.

Random Forests:

The Random Forests algorithm is a popular method in predictive analytics which builds on decision-trees. The idea is that a large number of trees are grown using most (but not all) of the data available, randomly selecting the independent variables included in the trees. The remaining data are used to calculate the robustness of the trees. Each tree then has a “vote” on the membership of each of the observations, and the mode of these votes is the winner.
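
A stripped-down sketch of the idea, under loudly stated simplifications: each “tree” is a single split (a stump), each stump sees a bootstrap sample of the families and two randomly chosen columns, and the left-out families are scored tree by tree. Real Random Forest implementations grow full trees, re-draw candidate variables at every split and aggregate the left-out votes per observation. All names here are hypothetical.

```python
import random
from collections import Counter

def gini(labels):
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def fit_stump(rows, labels, cols):
    """Best single split using only `cols`: (col, threshold, left vote, right vote)."""
    n, best = len(rows), None
    for col in cols:
        for t in sorted({r[col] for r in rows})[1:]:
            left = [y for r, y in zip(rows, labels) if r[col] < t]
            right = [y for r, y in zip(rows, labels) if r[col] >= t]
            score = len(left) / n * gini(left) + len(right) / n * gini(right)
            if best is None or score < best[0]:
                best = (score, col, t, majority(left), majority(right))
    if best is None:                     # no split possible: constant vote
        m = majority(labels)
        return (cols[0], float("inf"), m, m)
    return best[1:]

def stump_predict(stump, row):
    col, t, left_vote, right_vote = stump
    return left_vote if row[col] < t else right_vote

def grow_forest(rows, labels, n_trees=200, n_cols=2, seed=0):
    rng = random.Random(seed)
    idx = range(len(rows))
    forest, hits, total = [], 0, 0
    for _ in range(n_trees):
        bag = [rng.choice(idx) for _ in idx]            # sample with replacement
        out = [i for i in idx if i not in bag]          # families this tree never saw
        cols = rng.sample(range(len(rows[0])), n_cols)  # random subset of variables
        stump = fit_stump([rows[i] for i in bag], [labels[i] for i in bag], cols)
        forest.append(stump)
        hits += sum(stump_predict(stump, rows[i]) == labels[i] for i in out)
        total += len(out)
    return forest, (hits / total if total else None)

def forest_vote(forest, row):
    """Every tree votes; the most common vote (the mode) wins."""
    return majority([stump_predict(s, row) for s in forest])

rows = [(72000, 0, 4, 0.75, 0), (90000, 0, 2, 1, 1), (400000, 0, 3, 0, 0),
        (120000, 0, 1, 1, 0), (95000, 1, 1, 0, 0), (130000, 4, 2, 0.5, 0),
        (32000, 2, 5, 0.4, 0), (110000, 0, 2, 0.5, 1), (76000, 0, 3, 0.66, 0),
        (170000, 2, 1, 1, 1)]
private_school = [0, 0, 1, 0, 1, 1, 0, 0, 0, 1]

forest, left_out_accuracy = grow_forest(rows, private_school)
print(left_out_accuracy)
print(forest_vote(forest, (150000, 0, 2, 0.5, 0)))
```

The accuracy on the left-out families gives a rough robustness check without setting aside a separate test set, which matters when, as here, there is very little data to begin with.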

What’s so great about these models?

These models do not assume any structure of the data or the underlying relationships, incorporate non-linear relationships easily, in some cases incorporate prior information sensibly, and in no way are concerned with the continuity of the underlying model (if there is one). They (and their derivatives) are used by practically all contestants in the data-analysis competitions on Kaggle, and do very well.

So why aren’t they the first port of call for econometricians?

Distributional issues: In economics, many macro variables are dominated by trends, and so their growth rates are not as broadly dispersed as in other types of data. Consequently, split points are likely to fall near the mode of the growth data, where observations are tightly bunched, and this increases the risk of mis-classification.

Time-series issues: With the exception of the auto-regressive tree models of Meek, Chickering and Heckerman (2002) and the distance-between-tuples approach of Yamada and Yokoi (2003), these models don’t seem especially well developed in time-series analysis, especially in a multivariate setting. As much econometrics is done on time series, and as there is a large toolbox of powerful methods already in existence, it may just be that the costs of switching methods are too large.

The most salient issue, though, is that most econometrics has been devised to estimate parameters for theoretical or argumentative constructs, rather than to pursue out-of-sample predictive power. An example of this is the idea of an elasticity: a constant, estimable relationship describing how much one thing changes when another does. Even if we abandon the idea that it must be a constant, and instead try to identify the “deep” parameters of a function describing an elasticity, we have to believe that at some point there must be “deep” parameters which are not functions of other things. Due to the limitations of macroeconomic data (monthly, quarterly or yearly release, underlying data not released, small variance, short samples) these sorts of deep parameters are unlikely to be estimable with any real precision.

In contrast, if you were to try to estimate an elasticity using a random-forest approach, you may well end up with several elasticity estimates, each conditional on the variables in your model. For the theoretical modeller, this isn’t all that acceptable.

Most of all, though, I doubt these modelling methods would become very popular in policy circles simply because five elasticities are far harder to describe to a politician than one.