### To Balance or Not to Balance?

By IVAN DIAZ & JOSEPH KELLY

Determining the causal effects of an action—which we call treatment—on an outcome of interest is at the heart of many data analysis efforts. In an ideal world, experimentation through randomization of the treatment assignment allows the identification and consistent estimation of causal effects. In observational studies treatment is assigned by nature, therefore its mechanism is unknown and needs to be estimated. This can be done through estimation of a quantity known as the propensity score, defined as the probability of receiving treatment within strata of the observed covariates.

There are two types of estimation method for propensity scores. The first tries to predict treatment as accurately as possible. The second tries to balance the distribution of predictors evenly between the treatment and control groups. The two approaches are related, because different predictor values among treated and control units could be used to better predict treatment status. In this post we discuss issues related to these goals, specification of loss functions for the two objectives, and compare both methods via simulation.

We focus on an inverse propensity score weighted estimator for the causal effect. For this estimator, predicting the propensity score accurately is much more important than achieving the balancing property. This is because the estimator is not constructed utilizing the balancing property but wholly relies on reweighting using the propensity score. Other estimators, such as those based on matching and subclassification, may benefit from the balancing property, but the discussion of those estimators is postponed to a later post.

wrong if the data suffer from

To illustrate the concept of selection bias, consider a group of users who have searched for pickup trucks through Google and a group who has not. Individuals in the former group are more likely to be exposed to an ad for pickup trucks. However, they are also more likely to buy a pickup truck regardless of ad exposure, because we know already that they are interested in pick up trucks given that they had done online research on them. A naïve comparison of the exposed and unexposed groups would produce an overly optimistic measurement of the effect of the ad, since the exposed group has a higher baseline likelihood of purchasing a pickup truck.

There are two ways of getting around the selection bias problem: (i)

Because potential outcomes are unobserved, we need to link the distribution of $Y_t$ to the distribution of the observed data $(X, T, Y)$. A formula linking these distributions is known as an

The ATT can be estimated from the observed data if all three of the following

The first assumption makes clear the fundamental problem of causal inference: for any given unit we observe at most one of the potential outcomes $Y=T\times Y_1 +(1-T)\times Y_0$. A violation to the second assumption formalizes the concept of selection bias. In our pickup truck example, $T$ indicates exposure to the ad, and $Y$ indicates purchase of a pick-up truck. Failure to include an indicator of search for the term "pickup truck" in $X$ would be a violation of strong ignorability. Identifiability assumptions are met by design in a randomized trial for every $X$, including $X=\emptyset$.

These assumptions allow the identifiability of $E(Y_1|T=1)$ and $E(Y_0|T=1)$, both quantities necessary in estimating $\delta$. Identifiability of $E(Y_1|T=1)$ is simple: $E(Y_1|T=1) = E(Y|T=1)$ because $Y=Y_1$ in the event $T=1$. Identifiability of $\psi = E(Y_0|T=1)$ is somewhat harder. We start by noting that

$$E(Y_0 \mid T=1, X)=E(Y_0 \mid T=0, X),$$

due to strong ignorability. Then, $E(Y_0 \mid T=0, X) = E(Y\mid T=0, X)$, due to congruence of the potential and observed outcomes. The latter conditional expectation is well defined due to the positivity assumption. We can now use the law of iterated expectation to see that

$\psi = E(Y_0|T=1)$

$= E\{E(Y_0 \mid T=1, X)\mid T=1\}$

$= E\{E(Y_0 \mid T=0, X)\mid T=1\}$

$= E\{E(Y \mid T=0, X)\mid T=1\}$

$= E\{\mu_0(X)\mid T=1\},$

where $\mu_0(x)$ is a function of $x$ denoting the expected value of the outcome under $T=0$ and $X=x$. This formula is the identifiability result that we talked about earlier, and allows us to estimate $\delta$ solely from the observed data $(X, T, Y)$. Note that $\mu_0$ is the outcome expectation

among the control units. This formula uses $\mu_0$ to predict the (unobserved) outcomes of the treated group, had they, contrary to the fact, been in the control group. The outer expectation takes the average of those predicted values for all treated units.

A balancing score is any function $b(x)$ satisfying $X\perp T\mid b(X)$. Using this property, simple algebra shows an equivalent identification result:

$$\psi = E\{E(Y \mid T=0, b(X))\mid T=1\}.$$

This and more features of balancing scores are discussed in a seminal paper by

Rosenbaum & Rubin (1983).

Clearly, $b(x)=x$ is a balancing score. However, $b(x)=x$ is not a very useful balancing score since it does not help alleviate the curse of dimensionality. A more useful balancing score is the propensity score $e(x)$. Since $e(x)$ is univariate, $\psi$ may be easily estimated by matching, subclassification, and other non-parametric estimation methods that adjust for an estimate of $e(x)$. Estimators that use this idea are said to be using the balancing property of the propensity score.

A very important observation arising from this is that, for estimators of $\delta$ using the balancing property, we do not require consistent estimation of the propensity score but only of a balancing score. On the other hand, the reweighted estimator we describe next, does require consistent estimation of the propensity score. This apparently trivial observation is often overlooked and

can be a source of confusion for data analysts working on causal inference analyses.

Note that $\delta=E(Y\mid T=1)-\psi$. The expectation $E(Y\mid T=1)$ can be easily estimated with the sample mean of the outcome among the exposed units. In the simulation section we use this empirical mean estimator. Below we discuss estimators of $\psi$, which can be done via the propensity score with the methods we describe below.

$$\psi = E\{E(Y \mid T=0, X)\mid T=1\} = \frac{E\{W Y\mid T=0\}}{E(W|T=0)},$$

with $W=e(X)/(1-e(X)).$ A natural estimator for $\psi$ is then given by

$$\hat \psi = \frac{\sum_{i:T_i=0} \hat W_i Y_i}{\sum_{i:T_i=0} \hat W_i},$$

where $\hat W_i = \hat e(X_i)/(1-\hat e(X_i))$, and $\hat e(x)$ denotes an estimator of the propensity score.

It can be easily checked that if $\hat e$ is consistent, so is $\hat \psi$. Then, because $E(Y\mid T=1)$ can be estimated consistently, the estimator of $\delta=E(Y\mid T=1)-\psi$ will be consistent as well. It is then very important to place all efforts on obtaining consistency of $\hat e$. In this post we review and compare two types of method for estimating the propensity score: methods based on predictive accuracy and methods based on balancing the covariate distribution between treated and control units. Based on our results and the above discussions we argue that predictive accuracy, rather than balance, should be the criterion guiding the choice of method.

It should be noted that inverse probability weighting is not generally optimal (i.e., efficient) and doubly robust estimators such as the augmented IPW and the TMLE provide an opportunity to achieve the non-parametric efficiency bound (Hahn, 1998). Since our interest is to evaluate estimators for the propensity score as they relate to estimation of the causal effect, we focus on the reweighting method described above.

$$ \arg\min_{f\in \cal F} E(L(f; M)) = e, $$

where $M = (X, T)$. That is, a loss function $L$ is valid if the expected loss is minimized at the true propensity score $e$. Examples of valid loss functions are the $L^2$ and negative log-likelihood loss functions given by

$L(f;m) = (t-f(x))^2,$ and

$L(f;m) = -t\log(f(x))-(1-t)\log(1-f(x)),$

respectively. Though theoretically valid, the $L^2$ loss function is known to underperform compared to the log-likelihood.

The choice of space $\cal F$ (sometimes called the

$$ \mathcal{F}_\textrm{logistic} = \bigg\{f(x) = \frac{1}{1 + \exp(-x'\beta)}:

\beta \in \mathbb{R}^p \bigg\}. $$

A model like $\mathcal{F}_\textrm{logistic}$, indexed by a Euclidean parameter, is often referred to as a

The field of statistical machine learning provides a solution to this problem, allowing exploration of larger spaces. For example, a tree regression algorithm uses

$$\mathcal{F}_\textrm{tree} = \bigg\{\sum_j^J c_j I(x \in D_j): c_j \in

\mathbb{R}, D_j\bigg\},$$

where $c_j$ are constants, $I(A)$ is the indicator function returning one if $A$ is true and zero otherwise, and $D_j$ are disjoint partitions of the covariate space. Another example, given by multivariate adaptive regression splines (MARS), uses

$$\mathcal{F}_\textrm{mars} = \bigg\{f(x) = \sum_j^J\beta_jB_j(x): \beta \in

\mathbb{R}^p; B_j\bigg\},$$

where $B_j$ are basis functions of the form of hinge functions $\max(0, x_j - c_j)$, and $c_j$ are tuning parameters. These spaces are larger than $\cal{F}_\textrm{logistic}$ above. Under lack of domain-specific scientific knowledge supporting the use of a parametric model, these data-adaptive methods have a better chance of consistently estimating the propensity score. Choosing the tuning parameters for data-adaptive methods such as regression trees and MARS is the subject of a large number of research articles and books.

We chose the tree regression and MARS estimators only for illustration purposes, other possible choices of off-the-shelf prediction method include random forests, support vector machines, generalized boosted regression models, neural networks, $k$ nearest neighbors, regularized regression, and many more. An excellent review of statistical learning methods may be found in Friedman et. al. (2001).

$$ E\bigg\{(1-T)\frac{e(X)}{1-e(X)}c(X)\bigg\} =

E\{Tc(X)\}\, \text{ for all functions } c(x).$$

Here $c(x)$ is any function of $x$. Estimation methods that use this property are referred to as

As a result of the balancing property, any estimator $\hat e$ satisfying

$$\sum_{i:T_i=0}\frac{\hat e(X_i)}{1-\hat e(X_i)}c(X_i) = \sum_{i:T_i=1}c(X_i)\,

\text{ for all functions } c(x)$$

can be expected to be a consistent estimator of $e(x)$. Because it has to hold for all functions $c(x)$, this balance condition is analogous to performing a search in the full space $\mathcal F$, and is impossible to achieve in practice. Methods that aim to achieve balance focus on a user-given set of functions $c_1,\ldots, c_J$. The task of choosing the correct functions $c_j$ is akin to the task of specifying the correct functional form in a parametric model. As a result, estimators that focus on covariate balancing are also susceptible to being inconsistent due to model misspecification.

Now let's take a look at two methods that use covariate balancing to estimate the propensity score: entropy balancing and covariate balancing propensity score.

$$ \hat W = \arg\min_W \sum_{i:T_i=0}W_i\log W_i $$

subject to

$$\frac{1}{n_0}\sum_{i:T_i=0}W_ic_j(X_i) = \frac{1}{n_1}\sum_{i:T_i=1}c_j(X_i)\,

\text{ for a set of functions } c_j:j\in\{1,\ldots J\}.$$

The above constrained optimization problem minimizes the entropy of $W_i$ to obtain obtain weights that satisfy the balance conditions for the user-specified covariate functions $c_j$. These functions are chosen by the user, and are generally given by lower order polynomials. Because covariate balance can only be achieved for a finite (generally small) number of functions $c_j$, this method will be inconsistent unless the correct functions $c_j$ are specified. In fact, Hainmueller (2012) show that entropy balancing is equivalent to estimating the weights as a log-linear model of the covariate functions $c_j(X)$.

See Hainmueller (2012), and the work of Zhao & Percival (2015) for more details on how this optimization problem is solved, and for further discussion.

$$ e = \arg\min_{f\in \mathcal{F}_\textrm{logistic}} E(L(f;M)), $$

subject to

$$\sum_{i:T_i=0}\frac{\hat f(X_i)}{1-\hat f(X_i)}c_j(X_i) =

\sum_{i:T_i=1}c_j(X_i)\, \text{ for a set of functions } c_j:j\in\{1,\ldots

J\},$$

where $L$ is the negative log-likelihood loss $L(f;w)=-t\log(f(x))-(1-t)\log(1-f(x))$. The constraint set of this method can be seen to be equivalent to the constraint of EB, only the loss function optimized changes. This problem may be solved using empirical likelihood (Qin & Lawless, 1994) or other methods. For a more detailed discussion the interested reader is referred to the original research article by Imai & Ratkovic (2014).

Covariate balance:

The true set of covariates is generated independently and identically distributed, from the following distribution:

$$(Z_{1}, Z_{2}, Z_{3}, Z_{4}) \sim N_4(0, I_4)$$

where $I_4$ is the four dimensional identity matrix. The outcome is generated as

$$ Y = 210 + 27.4 Z_{1} + 13.7 Z_{2} + 13.7Z_{3} + 13.7 Z_{4} + \epsilon $$

where $\epsilon \sim N(0, 1)$. The propensity score is defined as

$$ P(T = 1 \mid Z = z) = \text{expit}(-z_{1} + 0.5 z_{2} - 0.25z_{3} - 0.1 z_{4}). $$

where expit is the inverse-logit transformation. Note that $\delta = 0$, whereas $E(Y|T=1)=200$ and $E(Y|T=0)=220$, demonstrating the selection bias.

In the simulation we will examine the performance of the above algorithms under the correctly specified propensity model from the data generation process and under a misspecified model. Misspecification occurs when the following transformations are observed in place of the true covariates:

$x_{1} = \exp(z_1/2),$

$x_{2} = z_{2}/(1 + \exp(z_{1})),$

$x_{3} = (z_{1} z_{3}/25 + 0.6)^3,$

$x_{4} = (z_{2} + z_{4} + 20)^2.$

These transformations were proposed in the original paper in order to reflect a real-life problem. It is very unlikely a researcher observing $(X_{1}, X_{2}, X_{3}, X_{4})$ would specify the correct model reflecting these transformations. However, $X$ contains all the relevant information and strong ignorability holds. Without the correct functional form, a researcher using parametric methods is guaranteed to fit a misspecified model. This fact is well documented by Kang & Schafer (2007). The case where only the $X$s are observed is the most interesting since it exemplifies most, if not all, real data analysis problems.

To compare methods we conduct a simulation study by generating $10,000$ datasets according to the above data generation process. Then, for each dataset we estimate the propensity score using the true covariates $Z$ and the transformed covariates $X$ according to the various estimators we are examining. Then we estimate $\delta$ and aggregate across simulations to examine the estimated absolute bias

$$

\frac{1}{10^4} \bigg| \sum_{j = 1}^{10^4} (\hat\delta_j - \delta) \bigg|

$$

and root mean square error (RMSE)

$$

\sqrt{\frac{1}{10^4} \sum_{j = 1}^{10^4} (\hat \delta_j - \delta)^2}

$$

of the different methods. These formulas represent Monte Carlo integrals that aim to approximate the true bias and RMSE of the estimators. The largest estimated error in these integrals was around 0.3 which is small relative to the bias and MSE sizes we see.

The algorithms in the plot are ordered according to the sum of the respective metrics weighted by the square root of the sample size considered. By doing this, we favor algorithms that perform better in larger datasets, in accordance with the statistical notion of consistency. Algorithms aimed at achieving balance in the covariates are shown in boldface.

The similarity between the RMSE and bias plots teaches us that most of the poor performance is driven by bias rather than variance.

As we stated before, the most interesting case for practitioners is when the transformed covariates are observed. In this case, MARS is the best estimator. The two versions of the Super Learner follow closely. This should not be taken as an argument in favor of using MARS in every practical problem. Rather, it is proof of the importance of using a principled model selection tool, since no one algorithm should be expected to outperform all others uniformly across datasets and problems (cf. no free lunch

theorem). Likewise, methods based on random forests and SVM perform quite poorly. This may be surprising to some readers who know that these methods are data-adaptive and have seen them perform well in several applications. This is another demonstration of our claim that no method should be blindly trusted and all methods should be tested against the data for the specific problem at hand. We have seen applications in which the Lasso, Random Forests, or even a simple logistic regression outperforms all competitors in terms of predictive power.

Regarding covariate balance, we see that the two versions of the CBPS perform relatively well. CBPS is a GLM with an extra optimization constraint to satisfy covariate balance. In our simulation, this extra constraint did improve the bias and MSE compared to a standard GLM.

The entropy balanced estimator performs quite poorly. Note that the EB is an estimator that achieves balance in the sample covariate means, and as such would always pass the routine validation tests advocated in the balancing score literature. In the case of EB it appears that the predictive accuracy of the propensity score is sacrificed to ensure covariate balance. As our reweighting estimator for $\psi$ is constructed on the basis of consistent propensity score estimation (rather than the balancing property), it is not surprising to see the poor performance of EB.

Interestingly, covariate balancing methods outperform all other methods—including a correct logistic regression model—when the correct covariates are observed. This is because the response is linear in the correct covariates and thus an estimator that ensures complete balance on the true covariates automatically reduces all of the bias. However, this is merely a theoretical curiosity, as almost every practical situation belongs in the first panel of the plot rather than the second one.

EB and CBPS both require specifying functions of the covariates that need to be balanced. This can be very useful if the researcher has prior domain knowledge and knows what functions of the covariates affect the response and hence need to be balanced. This sort of a-priori knowledge imposes a structure on the propensity score and thus is akin to knowing that the propensity score belongs

to a parametric family of distributions. Unfortunately, objective knowledge of this type is absent in most practical situations, particularly in large dimensional problems.

To conclude, in the absence of subject-matter knowledge supporting the use of parametric functional forms for the propensity score and the balancing conditions, predictive accuracy should be used to select an estimator among a collection of candidates. This collection may include covariate balanced

estimators, and should contain flexible data-adaptive methods capable of unveiling complex patterns in the data. In particular, we advocate for the use of model stacking methods such as the Super Learner algorithm implemented in the SuperLearner R package.

Cortes, Corinna, and Vladimir Vapnik. "Support-vector networks." Machine learning 20.3 (1995): 273-297.

Friedman, Jerome H. "Multivariate adaptive regression splines." The annals of statistics (1991): 1-67.

Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. "The elements of statistical learning." Vol. 1. Springer, Berlin: Springer series in statistics, (2001)

Hahn, Jinyong. "On the role of the propensity score in efficient semiparametric estimation of average treatment effects." Econometrica (1998): 315-331.

Hainmueller, Jens. "Entropy balancing for causal effects: A multivariate reweighting method to produce balanced samples in observational studies." Political Analysis 20.1 (2012): 25-46.

Imai, Kosuke, and Marc Ratkovic. "Covariate balancing propensity score." Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76.1 (2014): 243-263.

Kang, Joseph DY, and Joseph L. Schafer. "Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data." Statistical science (2007): 523-539.

Qin, Jin, and Jerry Lawless. "Empirical likelihood and general estimating equations." The Annals of Statistics (1994): 300-325.

Rosenbaum, Paul R., and Donald B. Rubin. "The central role of the propensityscore in observational studies for causal effects." Biometrika 70.1 (1983): 41-55.

Tibshirani, Robert. "Regression shrinkage and selection via the lasso." Journal of the Royal Statistical Society. Series B (Methodological) (1996): 267-288.

van der Laan, Mark J., and Polley, Eric C., and Hubbard, Alan E. "Super learner." Statistical applications in genetics and molecular biology (2007): Vol 6.1

Eric Polley and Mark van der Laan (2014). SuperLearner: Super Learner Prediction. R package version 2.0-15.

Zhao, Qingyuan, and Daniel Percival. "Primal-dual Covariate Balance and Minimal Double Robustness via Entropy Balancing." arXiv preprint arXiv:1501.03571 (2015).

Determining the causal effects of an action—which we call treatment—on an outcome of interest is at the heart of many data analysis efforts. In an ideal world, experimentation through randomization of the treatment assignment allows the identification and consistent estimation of causal effects. In observational studies treatment is assigned by nature, therefore its mechanism is unknown and needs to be estimated. This can be done through estimation of a quantity known as the propensity score, defined as the probability of receiving treatment within strata of the observed covariates.

There are two types of estimation method for propensity scores. The first tries to predict treatment as accurately as possible. The second tries to balance the distribution of predictors evenly between the treatment and control groups. The two approaches are related, because different predictor values among treated and control units could be used to better predict treatment status. In this post we discuss issues related to these goals, specification of loss functions for the two objectives, and compare both methods via simulation.

We focus on an inverse propensity score weighted estimator for the causal effect. For this estimator, predicting the propensity score accurately is much more important than achieving the balancing property. This is because the estimator is not constructed utilizing the balancing property but wholly relies on reweighting using the propensity score. Other estimators, such as those based on matching and subclassification, may benefit from the balancing property, but the discussion of those estimators is postponed to a later post.

## The Fundamental Problem of Causal Inference

The primary goal of many data analysis efforts is to estimate the effect of an intervention on the distribution of some outcome. For example, imagine that you are working for a car manufacturer. Your company has recently launched a new pickup truck, along with the corresponding online advertisement campaign. You are in charge of assessing whether the campaign had an impact on sales. To do this, you have a data set at the person level containing, among other variables, an indicator of ad exposure, and whether the person bought the truck. A naïve way to solve this problem would be to compare the proportion of buyers between the exposed and unexposed groups, using a simple test for equality of means. Although it may seem sensible at first, this solution can bewrong if the data suffer from

*selection bias*.To illustrate the concept of selection bias, consider a group of users who have searched for pickup trucks through Google and a group who has not. Individuals in the former group are more likely to be exposed to an ad for pickup trucks. However, they are also more likely to buy a pickup truck regardless of ad exposure, because we know already that they are interested in pick up trucks given that they had done online research on them. A naïve comparison of the exposed and unexposed groups would produce an overly optimistic measurement of the effect of the ad, since the exposed group has a higher baseline likelihood of purchasing a pickup truck.

There are two ways of getting around the selection bias problem: (i)

*randomize*ad exposure, and (ii) use an analysis for*observational*data. In a randomized trial, we start with a random sample of the population of target individuals, and then assign them to one of two treatment arms 'randomly' (that is, independent of all other observed or unobserved factors). Randomized studies are considered the gold standard for answering causal inference questions, since they guarantee the two groups only differ in their treatment assignment and have the same distribution in all observed and unobserved pre-treatment variables. Unfortunately, randomization is not possible in many situations. For example, in clinical trials, randomization may be unethical; in economics studies randomization is unfeasible in practice; and in marketing studies the potential cost of a lost business opportunity can make randomization unattractive.### Identification

We now discuss formally the statistical problem of causal inference. We start by describing the problem using standard statistical notation. For a random sample of units, indexed by $i = 1. \ldots n$, we have- $T_i$ as the binary treatment assignment variable,
- $Y_{i1}$ and $Y_{i0}$ as the potential outcomes, and
- $X_i$ as the vector of $p$ confounders

*average effect of treatment on the treated*, or ATT, defined as $\delta = E(Y_{1}-Y_{0}|T=1)$. This quantity is interpreted as the expected change in the outcome caused by treatment, where the expectation is taken over the distribution of the potential outcomes among treated units.Because potential outcomes are unobserved, we need to link the distribution of $Y_t$ to the distribution of the observed data $(X, T, Y)$. A formula linking these distributions is known as an

*identifiability*result. Identifiability allows us to estimate the moments of $Y_t$ as functionals of the distribution of the observed data.The ATT can be estimated from the observed data if all three of the following

*identifiability assumptions*hold:- $Y_t=Y$ in the event $T=t$, stating that the observed outcome is congruent with the potential outcomes $(Y_0,Y_1)$, and
- the assignment mechanism is *strongly ignorable*, that is $Y_0 \perp T \mid X$, which roughly states that all common causes of $T$ and $Y$ are measured and contained in $X,$ and
- $P(T=0|X=x) > 0$ for all $x$ where $P(X=x)>0$,

*positivity*assumption.The first assumption makes clear the fundamental problem of causal inference: for any given unit we observe at most one of the potential outcomes $Y=T\times Y_1 +(1-T)\times Y_0$. A violation to the second assumption formalizes the concept of selection bias. In our pickup truck example, $T$ indicates exposure to the ad, and $Y$ indicates purchase of a pick-up truck. Failure to include an indicator of search for the term "pickup truck" in $X$ would be a violation of strong ignorability. Identifiability assumptions are met by design in a randomized trial for every $X$, including $X=\emptyset$.

These assumptions allow the identifiability of $E(Y_1|T=1)$ and $E(Y_0|T=1)$, both quantities necessary in estimating $\delta$. Identifiability of $E(Y_1|T=1)$ is simple: $E(Y_1|T=1) = E(Y|T=1)$ because $Y=Y_1$ in the event $T=1$. Identifiability of $\psi = E(Y_0|T=1)$ is somewhat harder. We start by noting that

$$E(Y_0 \mid T=1, X)=E(Y_0 \mid T=0, X),$$

due to strong ignorability. Then, $E(Y_0 \mid T=0, X) = E(Y\mid T=0, X)$, due to congruence of the potential and observed outcomes. The latter conditional expectation is well defined due to the positivity assumption. We can now use the law of iterated expectation to see that

$\psi = E(Y_0|T=1)$

$= E\{E(Y_0 \mid T=1, X)\mid T=1\}$

$= E\{E(Y_0 \mid T=0, X)\mid T=1\}$

$= E\{E(Y \mid T=0, X)\mid T=1\}$

$= E\{\mu_0(X)\mid T=1\},$

where $\mu_0(x)$ is a function of $x$ denoting the expected value of the outcome under $T=0$ and $X=x$. This formula is the identifiability result that we talked about earlier, and allows us to estimate $\delta$ solely from the observed data $(X, T, Y)$. Note that $\mu_0$ is the outcome expectation

among the control units. This formula uses $\mu_0$ to predict the (unobserved) outcomes of the treated group, had they, contrary to the fact, been in the control group. The outer expectation takes the average of those predicted values for all treated units.

### Identification Using The Balancing Property

Let us use $e(x)$ to denote the propensity score $P(T=1\mid X=x)$, following the convention in the propensity score literature.A balancing score is any function $b(x)$ satisfying $X\perp T\mid b(X)$. Using this property, simple algebra shows an equivalent identification result:

$$\psi = E\{E(Y \mid T=0, b(X))\mid T=1\}.$$

This and more features of balancing scores are discussed in a seminal paper by

Rosenbaum & Rubin (1983).

Clearly, $b(x)=x$ is a balancing score. However, $b(x)=x$ is not a very useful balancing score since it does not help alleviate the curse of dimensionality. A more useful balancing score is the propensity score $e(x)$. Since $e(x)$ is univariate, $\psi$ may be easily estimated by matching, subclassification, and other non-parametric estimation methods that adjust for an estimate of $e(x)$. Estimators that use this idea are said to be using the balancing property of the propensity score.

A very important observation arising from this is that, for estimators of $\delta$ using the balancing property, we do not require consistent estimation of the propensity score but only of a balancing score. On the other hand, the reweighted estimator we describe next, does require consistent estimation of the propensity score. This apparently trivial observation is often overlooked and

can be a source of confusion for data analysts working on causal inference analyses.

Note that $\delta=E(Y\mid T=1)-\psi$. The expectation $E(Y\mid T=1)$ can be easily estimated with the sample mean of the outcome among the exposed units. In the simulation section we use this empirical mean estimator. Below we discuss estimators of $\psi$, which can be done via the propensity score with the methods we describe below.

## The Propensity Score Weighted Estimator

Writing the expectations as integrals plus some additional algebra shows that$$\psi = E\{E(Y \mid T=0, X)\mid T=1\} = \frac{E\{W Y\mid T=0\}}{E(W|T=0)},$$

with $W=e(X)/(1-e(X)).$ A natural estimator for $\psi$ is then given by

$$\hat \psi = \frac{\sum_{i:T_i=0} \hat W_i Y_i}{\sum_{i:T_i=0} \hat W_i},$$

where $\hat W_i = \hat e(X_i)/(1-\hat e(X_i))$, and $\hat e(x)$ denotes an estimator of the propensity score.

It can be easily checked that if $\hat e$ is consistent, so is $\hat \psi$. Then, because $E(Y\mid T=1)$ can be estimated consistently, the estimator of $\delta=E(Y\mid T=1)-\psi$ will be consistent as well. It is then very important to place all efforts on obtaining consistency of $\hat e$. In this post we review and compare two types of method for estimating the propensity score: methods based on predictive accuracy and methods based on balancing the covariate distribution between treated and control units. Based on our results and the above discussions we argue that predictive accuracy, rather than balance, should be the criterion guiding the choice of method.

It should be noted that inverse probability weighting is not generally optimal (i.e., efficient) and doubly robust estimators such as the augmented IPW and the TMLE provide an opportunity to achieve the non-parametric efficiency bound (Hahn, 1998). Since our interest is to evaluate estimators for the propensity score as they relate to estimation of the causal effect, we focus on the reweighting method described above.

## Estimating the Propensity Score

We now introduce more formally the main dichotomy of this post: predictive accuracy vs covariate balance. We do this by describing the methods in terms of loss functions whose expectation is optimized at the true value of the propensity score.### Methods Based on Predictive Power

Let $\cal F$ be the space of all functions of $x$ bounded between zero and one, and let $f$ be a generic function in that space. We say that a loss function $L$ is valid for the propensity score if the following holds:$$ \arg\min_{f\in \cal F} E(L(f; M)) = e, $$

where $M = (X, T)$. That is, a loss function $L$ is valid if the expected loss is minimized at the true propensity score $e$. Examples of valid loss functions are the $L^2$ and negative log-likelihood loss functions given by

$L(f;m) = (t-f(x))^2,$ and

$L(f;m) = -t\log(f(x))-(1-t)\log(1-f(x)),$

respectively. Though theoretically valid, the $L^2$ loss function is known to underperform compared to the log-likelihood.

The choice of space $\cal F$ (sometimes called the

*model*) and loss function $L$ explicitly defines the estimation problem. For example, logistic regression is based on the negative log-likelihood loss function and the space of logistic functions$$ \mathcal{F}_\textrm{logistic} = \bigg\{f(x) = \frac{1}{1 + \exp(-x'\beta)}:

\beta \in \mathbb{R}^p \bigg\}. $$

A model like $\mathcal{F}_\textrm{logistic}$, indexed by a Euclidean parameter, is often referred to as a

*parametric model*. Although desirable, optimization in the complete space $\cal F$ is not possible in practice when $x$ contains continuous variables, or its dimension is large (cf. the curse of dimensionality). On the other hand, restriction to more tractable spaces such as $\cal{F}_\textrm{logistic}$ is well known to lead to the issue of*model misspecification*, which occurs when the space considered does not contain the propensity score. In the presence of model misspecification, the estimator $\hat\psi$ is inconsistent.The field of statistical machine learning provides a solution to this problem, allowing exploration of larger spaces. For example, a tree regression algorithm uses

$$\mathcal{F}_\textrm{tree} = \bigg\{\sum_j^J c_j I(x \in D_j): c_j \in

\mathbb{R}, D_j\bigg\},$$

where $c_j$ are constants, $I(A)$ is the indicator function returning one if $A$ is true and zero otherwise, and $D_j$ are disjoint partitions of the covariate space. Another example, given by multivariate adaptive regression splines (MARS), uses

$$\mathcal{F}_\textrm{mars} = \bigg\{f(x) = \sum_j^J\beta_jB_j(x): \beta \in

\mathbb{R}^p; B_j\bigg\},$$

where $B_j$ are basis functions of the form of hinge functions $\max(0, x_j - c_j)$, and $c_j$ are tuning parameters. These spaces are larger than $\cal{F}_\textrm{logistic}$ above. Under lack of domain-specific scientific knowledge supporting the use of a parametric model, these data-adaptive methods have a better chance of consistently estimating the propensity score. Choosing the tuning parameters for data-adaptive methods such as regression trees and MARS is the subject of a large number of research articles and books.

We chose the tree regression and MARS estimators only for illustration purposes, other possible choices of off-the-shelf prediction method include random forests, support vector machines, generalized boosted regression models, neural networks, $k$ nearest neighbors, regularized regression, and many more. An excellent review of statistical learning methods may be found in Friedman et. al. (2001).

#### Model Stacking - Super Learner (SL)

When using predictive power as a criterion, the question arises of how to select among the many prediction methods available in the statistical learning literature. We approach this question with a data-adaptive mindset that involves the following steps (see van der Laan et al., 2007):- Propose a finite collection $\mathcal L=\{\hat e_k:k=1,\ldots,K\}$ of estimation algorithms. An estimation algorithm is a procedure that takes a training data set $\mathcal T=\{M_i,i=1,\ldots, n\}$ and outputs a function $\hat e_k(x)$.
- Consider an ensemble learner of the type $\hat e_\alpha(x) = \sum_{k=1}^K \alpha_k \hat e_k(x);\quad\text{for}\quad 0 \leq \alpha_k\leq 1, \quad \sum_{k=1}^K\alpha_k=1.$
- Let $\mathcal V_1, \ldots, \mathcal V_J$ denote a partition of the index set $\{1,\ldots, n\}$ of approximately the same size. In addition, for each $j$, let the associated training set be defined as $\mathcal{T}_j = \{1,\ldots, n\}\backslash \mathcal V_j$. For fixed weights $\alpha$, denote $\hat e_{\alpha, \mathcal T_j}$ the ensemble trained using only data in $\mathcal T_j$. Choose the weights $\alpha$ that minimize the cross-validated risk: $\hat\alpha =\arg\min_{\alpha} \frac{1}{J}\sum_{j=1}^J\frac{1}{|\mathcal V_j|}\sum_{i\in \mathcal V_j} L(M_i, \hat e_{\alpha, \mathcal T_j})$ subject to $\quad 0 \leq \alpha_k\leq 1, \sum_{k=1}^K\alpha_k=1,$ and define the final estimator as $\hat e_{\hat\alpha}(x)$. This algorithm is implemented in the SuperLearner R package (Polley & van der Laan, 2014).

### Methods Based on The Balancing Property

Via Adam's Law it can be shown that$$ E\bigg\{(1-T)\frac{e(X)}{1-e(X)}c(X)\bigg\} =

E\{Tc(X)\}\, \text{ for all functions } c(x).$$

Here $c(x)$ is any function of $x$. Estimation methods that use this property are referred to as

*covariate balancing*, since they assure that properly reweighting leads to the same distribution in the treated and control groups.As a result of the balancing property, any estimator $\hat e$ satisfying

$$\sum_{i:T_i=0}\frac{\hat e(X_i)}{1-\hat e(X_i)}c(X_i) = \sum_{i:T_i=1}c(X_i)\,

\text{ for all functions } c(x)$$

can be expected to be a consistent estimator of $e(x)$. Because it has to hold for all functions $c(x)$, this balance condition is analogous to performing a search in the full space $\mathcal F$, and is impossible to achieve in practice. Methods that aim to achieve balance focus on a user-given set of functions $c_1,\ldots, c_J$. The task of choosing the correct functions $c_j$ is akin to the task of specifying the correct functional form in a parametric model. As a result, estimators that focus on covariate balancing are also susceptible to being inconsistent due to model misspecification.

Now let's take a look at two methods that use covariate balancing to estimate the propensity score: entropy balancing and covariate balancing propensity score.

#### Entropy Balancing (EB)

Entropy balancing (Hainmueller, 2012) is a method that directly estimates the weights $W_i$, rather than the propensity score, by solving the following optimization problem:$$ \hat W = \arg\min_W \sum_{i:T_i=0}W_i\log W_i $$

subject to

$$\frac{1}{n_0}\sum_{i:T_i=0}W_ic_j(X_i) = \frac{1}{n_1}\sum_{i:T_i=1}c_j(X_i)\,

\text{ for a set of functions } c_j:j\in\{1,\ldots J\}.$$

The above constrained optimization problem minimizes the entropy of $W_i$ to obtain obtain weights that satisfy the balance conditions for the user-specified covariate functions $c_j$. These functions are chosen by the user, and are generally given by lower order polynomials. Because covariate balance can only be achieved for a finite (generally small) number of functions $c_j$, this method will be inconsistent unless the correct functions $c_j$ are specified. In fact, Hainmueller (2012) show that entropy balancing is equivalent to estimating the weights as a log-linear model of the covariate functions $c_j(X)$.

See Hainmueller (2012), and the work of Zhao & Percival (2015) for more details on how this optimization problem is solved, and for further discussion.

#### Covariate Balancing Propensity Score (CBPS)

This method proceeds by specifying a parametric form for the propensity score, and then optimizing the negative log-likelihood loss, constrained to covariate balance on a set of functions $c_j(x)$. For example, consider the logistic model $\mathcal{F}_\textrm{logistic}$ defined above. CBPS proceeds by solving the following optimization problem:$$ e = \arg\min_{f\in \mathcal{F}_\textrm{logistic}} E(L(f;M)), $$

subject to

$$\sum_{i:T_i=0}\frac{\hat f(X_i)}{1-\hat f(X_i)}c_j(X_i) =

\sum_{i:T_i=1}c_j(X_i)\, \text{ for a set of functions } c_j:j\in\{1,\ldots

J\},$$

where $L$ is the negative log-likelihood loss $L(f;w)=-t\log(f(x))-(1-t)\log(1-f(x))$. The constraint set of this method can be seen to be equivalent to the constraint of EB, only the loss function optimized changes. This problem may be solved using empirical likelihood (Qin & Lawless, 1994) or other methods. For a more detailed discussion the interested reader is referred to the original research article by Imai & Ratkovic (2014).

## Simulation Study

We compare the various methods for estimating the propensity score in a simulation study. We use performance metrics such as bias and mean squared error for the estimation of $\delta$, our causal estimand of interest, defined as the average effect of treatment on the treated. The methods used are:Covariate balance:

- EB.
- CBPS (exact and over-parameterized).

- Multivariate adaptive regression splines (MARS) (Friedman, 1991).
- Random forest with default R tuning parameters (Breiman, 2001).
- Support vector machines with linear kernel (Cortes & Vapnik, 1995).
- The MLE in a logistic regression model.
- Bayesian logistic regression assuming a diffuse normal prior with mean zero.
- $L^1$ regularized logistic regression (Lasso) (Tibshirani, 1996).
- SL (discrete and full) where the full method is described above and the discrete method does not weight each algorithm but chooses the best one under cross-validation for predictions (van der Laan et al., 2007).

The true set of covariates is generated independently and identically distributed, from the following distribution:

$$(Z_{1}, Z_{2}, Z_{3}, Z_{4}) \sim N_4(0, I_4)$$

where $I_4$ is the four dimensional identity matrix. The outcome is generated as

$$ Y = 210 + 27.4 Z_{1} + 13.7 Z_{2} + 13.7Z_{3} + 13.7 Z_{4} + \epsilon $$

where $\epsilon \sim N(0, 1)$. The propensity score is defined as

$$ P(T = 1 \mid Z = z) = \text{expit}(-z_{1} + 0.5 z_{2} - 0.25z_{3} - 0.1 z_{4}). $$

where expit is the inverse-logit transformation. Note that $\delta = 0$, whereas $E(Y|T=1)=200$ and $E(Y|T=0)=220$, demonstrating the selection bias.

In the simulation we will examine the performance of the above algorithms under the correctly specified propensity model from the data generation process and under a misspecified model. Misspecification occurs when the following transformations are observed in place of the true covariates:

$x_{1} = \exp(z_1/2),$

$x_{2} = z_{2}/(1 + \exp(z_{1})),$

$x_{3} = (z_{1} z_{3}/25 + 0.6)^3,$

$x_{4} = (z_{2} + z_{4} + 20)^2.$

These transformations were proposed in the original paper in order to reflect a real-life problem. It is very unlikely a researcher observing $(X_{1}, X_{2}, X_{3}, X_{4})$ would specify the correct model reflecting these transformations. However, $X$ contains all the relevant information and strong ignorability holds. Without the correct functional form, a researcher using parametric methods is guaranteed to fit a misspecified model. This fact is well documented by Kang & Schafer (2007). The case where only the $X$s are observed is the most interesting since it exemplifies most, if not all, real data analysis problems.

To compare methods we conduct a simulation study by generating $10,000$ datasets according to the above data generation process. Then, for each dataset we estimate the propensity score using the true covariates $Z$ and the transformed covariates $X$ according to the various estimators we are examining. Then we estimate $\delta$ and aggregate across simulations to examine the estimated absolute bias

$$

\frac{1}{10^4} \bigg| \sum_{j = 1}^{10^4} (\hat\delta_j - \delta) \bigg|

$$

and root mean square error (RMSE)

$$

\sqrt{\frac{1}{10^4} \sum_{j = 1}^{10^4} (\hat \delta_j - \delta)^2}

$$

of the different methods. These formulas represent Monte Carlo integrals that aim to approximate the true bias and RMSE of the estimators. The largest estimated error in these integrals was around 0.3 which is small relative to the bias and MSE sizes we see.

The algorithms in the plot are ordered according to the sum of the respective metrics weighted by the square root of the sample size considered. By doing this, we favor algorithms that perform better in larger datasets, in accordance with the statistical notion of consistency. Algorithms aimed at achieving balance in the covariates are shown in boldface.

The similarity between the RMSE and bias plots teaches us that most of the poor performance is driven by bias rather than variance.

As we stated before, the most interesting case for practitioners is when the transformed covariates are observed. In this case, MARS is the best estimator. The two versions of the Super Learner follow closely. This should not be taken as an argument in favor of using MARS in every practical problem. Rather, it is proof of the importance of using a principled model selection tool, since no one algorithm should be expected to outperform all others uniformly across datasets and problems (cf. no free lunch

theorem). Likewise, methods based on random forests and SVM perform quite poorly. This may be surprising to some readers who know that these methods are data-adaptive and have seen them perform well in several applications. This is another demonstration of our claim that no method should be blindly trusted and all methods should be tested against the data for the specific problem at hand. We have seen applications in which the Lasso, Random Forests, or even a simple logistic regression outperforms all competitors in terms of predictive power.

Regarding covariate balance, we see that the two versions of the CBPS perform relatively well. CBPS is a GLM with an extra optimization constraint to satisfy covariate balance. In our simulation, this extra constraint did improve the bias and MSE compared to a standard GLM.

The entropy balanced estimator performs quite poorly. Note that the EB is an estimator that achieves balance in the sample covariate means, and as such would always pass the routine validation tests advocated in the balancing score literature. In the case of EB it appears that the predictive accuracy of the propensity score is sacrificed to ensure covariate balance. As our reweighting estimator for $\psi$ is constructed on the basis of consistent propensity score estimation (rather than the balancing property), it is not surprising to see the poor performance of EB.

Interestingly, covariate balancing methods outperform all other methods—including a correct logistic regression model—when the correct covariates are observed. This is because the response is linear in the correct covariates and thus an estimator that ensures complete balance on the true covariates automatically reduces all of the bias. However, this is merely a theoretical curiosity, as almost every practical situation belongs in the first panel of the plot rather than the second one.

## Conclusion

For our reweighting estimator we see that objective, flexible data-adaptive estimators for the propensity score typically perform best in the case of a misspecified model. This is because these algorithms are able to better explore the covariate space than their parametric counterparts and better recover, or at least approximate, the correct functional form.EB and CBPS both require specifying functions of the covariates that need to be balanced. This can be very useful if the researcher has prior domain knowledge and knows what functions of the covariates affect the response and hence need to be balanced. This sort of a-priori knowledge imposes a structure on the propensity score and thus is akin to knowing that the propensity score belongs

to a parametric family of distributions. Unfortunately, objective knowledge of this type is absent in most practical situations, particularly in large dimensional problems.

To conclude, in the absence of subject-matter knowledge supporting the use of parametric functional forms for the propensity score and the balancing conditions, predictive accuracy should be used to select an estimator among a collection of candidates. This collection may include covariate balanced

estimators, and should contain flexible data-adaptive methods capable of unveiling complex patterns in the data. In particular, we advocate for the use of model stacking methods such as the Super Learner algorithm implemented in the SuperLearner R package.

## Bibliography

Breiman, Leo. "Random forests." Machine learning 45.1 (2001): 5-32.Cortes, Corinna, and Vladimir Vapnik. "Support-vector networks." Machine learning 20.3 (1995): 273-297.

Friedman, Jerome H. "Multivariate adaptive regression splines." The annals of statistics (1991): 1-67.

Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. "The elements of statistical learning." Vol. 1. Springer, Berlin: Springer series in statistics, (2001)

Hahn, Jinyong. "On the role of the propensity score in efficient semiparametric estimation of average treatment effects." Econometrica (1998): 315-331.

Hainmueller, Jens. "Entropy balancing for causal effects: A multivariate reweighting method to produce balanced samples in observational studies." Political Analysis 20.1 (2012): 25-46.

Imai, Kosuke, and Marc Ratkovic. "Covariate balancing propensity score." Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76.1 (2014): 243-263.

Kang, Joseph DY, and Joseph L. Schafer. "Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data." Statistical science (2007): 523-539.

Qin, Jin, and Jerry Lawless. "Empirical likelihood and general estimating equations." The Annals of Statistics (1994): 300-325.

Rosenbaum, Paul R., and Donald B. Rubin. "The central role of the propensityscore in observational studies for causal effects." Biometrika 70.1 (1983): 41-55.

Tibshirani, Robert. "Regression shrinkage and selection via the lasso." Journal of the Royal Statistical Society. Series B (Methodological) (1996): 267-288.

van der Laan, Mark J., and Polley, Eric C., and Hubbard, Alan E. "Super learner." Statistical applications in genetics and molecular biology (2007): Vol 6.1

Eric Polley and Mark van der Laan (2014). SuperLearner: Super Learner Prediction. R package version 2.0-15.

Zhao, Qingyuan, and Daniel Percival. "Primal-dual Covariate Balance and Minimal Double Robustness via Entropy Balancing." arXiv preprint arXiv:1501.03571 (2015).