To Balance or Not to Balance?

by IVAN DIAZ and JOSEPH KELLY Determining the causal effects of an action—which we call treatment—on an outcome of interest is at the heart of many data analysis efforts. In an ideal world, experimentation through randomization of the treatment assignment allows the identification and consistent estimation of causal effects. In observational studies, treatment is assigned by nature; its mechanism is therefore unknown and needs to be estimated. This can be done through estimation of a quantity known as the propensity score, defined as the probability of receiving treatment within strata of the observed covariates. There are two types of estimation methods for propensity scores. The first tries to predict treatment as accurately as possible. The second tries to balance the distribution of predictors evenly between the treatment and control groups. The two approaches are related, because different predictor values among treated and control units could be used to better predict treatment.
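The two approaches can be connected in a small simulation. The following is a minimal sketch, not the authors' method: it assumes simulated covariates, fits a prediction-style propensity score by plain logistic regression (gradient ascent on the log-likelihood), and then checks balance by comparing inverse-propensity-weighted covariate means across groups.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
# Simulated covariates and a treatment whose probability depends on them.
X = rng.normal(size=(n, 3))
p_true = 1 / (1 + np.exp(-(0.8 * X[:, 0] - 0.5 * X[:, 1])))
t = rng.binomial(1, p_true)

# Prediction-based propensity score: logistic regression (intercept +
# covariates) fit by gradient ascent on the average log-likelihood.
Z = np.column_stack([np.ones(n), X])
beta = np.zeros(Z.shape[1])
for _ in range(500):
    p = 1 / (1 + np.exp(-Z @ beta))
    beta += 1.0 * Z.T @ (t - p) / n
e_hat = 1 / (1 + np.exp(-Z @ beta))

# Balance check: after inverse-propensity weighting, the weighted
# covariate means should be nearly equal in the two groups.
w = t / e_hat + (1 - t) / (1 - e_hat)
for j in range(3):
    m1 = np.average(X[t == 1, j], weights=w[t == 1])
    m0 = np.average(X[t == 0, j], weights=w[t == 0])
    print(f"covariate {j}: weighted mean difference = {m1 - m0:+.3f}")
```

When the propensity model is well specified, a score estimated purely for prediction also delivers balance; the balancing approach targets the second property directly.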

Estimating causal effects using geo experiments

by JOUNI KERMAN, JON VAVER, and JIM KOEHLER Randomized experiments represent the gold standard for determining the causal effects of app or website design decisions on user behavior. We might be interested in comparing, for example, different subscription offers, different versions of terms and conditions, or different user interfaces. When it comes to online ads, there is also a fundamental need to estimate the return on investment. Observational data such as paid clicks, website visits, or sales can be stored and analyzed easily. However, it is generally not possible to determine the incremental impact of advertising by merely observing such data across time. One approach that Google has long used to obtain causal estimates of the impact of advertising is geo experiments. What does it take to estimate the impact of online exposure on user behavior? Consider, for example, an A/B experiment, where one or the other version (A or B) of a web page is shown at random to a user.
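The core idea behind a geo experiment is to randomize geographic regions rather than users, and to estimate the counterfactual response of treated geos from the control geos. The sketch below uses hypothetical numbers and a deliberately simple counterfactual model (a linear regression of test-period sales on pre-period sales fit on the control geos), not the actual model from the post:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical weekly sales for 20 geos: an 8-week pre-period and a
# 4-week test period; treated geos get a true lift of +10 per week.
n_geos, pre_weeks, test_weeks = 20, 8, 4
base = rng.uniform(80, 120, size=n_geos)
pre = base[:, None] + rng.normal(0, 5, size=(n_geos, pre_weeks))
test = base[:, None] + rng.normal(0, 5, size=(n_geos, test_weeks))
treated = np.arange(n_geos) < 10
test[treated] += 10  # true incremental effect of the ad campaign

# Counterfactual: regress control geos' test-period sales on their
# pre-period sales, then predict what the treated geos would have
# done without the campaign.
x_c = pre[~treated].mean(axis=1)
y_c = test[~treated].mean(axis=1)
slope, intercept = np.polyfit(x_c, y_c, 1)
counterfactual = intercept + slope * pre[treated].mean(axis=1)
lift = test[treated].mean(axis=1) - counterfactual
print(f"estimated per-geo weekly incremental sales: {lift.mean():.1f}")
```

The pre-period relationship anchors the prediction, so the estimate isolates the incremental impact rather than raw differences between regions.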

Using random effects models in prediction problems

by NICHOLAS A. JOHNSON, ALAN ZHAO, KAI YANG, SHENG WU, FRANK O. KUEHNEL, and ALI NASIRI AMINI In this post, we give a brief introduction to random effects models and discuss some of their uses. Through simulation we illustrate issues with model fitting techniques that depend on matrix factorization. Far from hypothetical, we have encountered these issues in our experiences with "big data" prediction problems. Finally, through a case study of a real-world prediction problem, we also argue that random effects models should be considered alongside penalized GLMs even for pure prediction problems. Random effects models are a useful tool for both exploratory analyses and prediction problems. We often use statistical models to summarize the variation in our data, and random effects models are well suited for this (they are a form of ANOVA, after all). In prediction problems these models can summarize the variation in the response, and in the process produce a form of adaptive regularization.
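The adaptive-regularization behavior can be seen in a one-way random effects model. The sketch below, using simulated data rather than anything from the post, estimates the two variance components by the method of moments and then forms BLUP-style shrinkage estimates of the group means, which are pulled toward the grand mean in proportion to how noisy each raw group mean is:

```python
import numpy as np

rng = np.random.default_rng(2)
# One-way random effects setup: 200 groups whose true means are drawn
# from N(0, tau^2); each group has only a few observations.
n_groups, n_obs, tau, sigma = 200, 5, 1.0, 2.0
mu = rng.normal(0, tau, size=n_groups)
y = mu[:, None] + rng.normal(0, sigma, size=(n_groups, n_obs))

group_mean = y.mean(axis=1)
# Method-of-moments estimates of the variance components.
sigma2_hat = y.var(axis=1, ddof=1).mean()
tau2_hat = max(group_mean.var(ddof=1) - sigma2_hat / n_obs, 0.0)

# BLUP-style shrinkage: noisy group means are pulled toward the grand
# mean; the shrinkage factor is learned from the data itself.
shrink = tau2_hat / (tau2_hat + sigma2_hat / n_obs)
blup = shrink * group_mean + (1 - shrink) * group_mean.mean()

mse_raw = np.mean((group_mean - mu) ** 2)
mse_blup = np.mean((blup - mu) ** 2)
print(f"MSE raw means: {mse_raw:.3f}, shrunken: {mse_blup:.3f}")
```

The shrinkage factor plays the role of a regularization strength, but it is estimated from the data rather than tuned by cross-validation, which is the sense in which the regularization is adaptive.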