Estimating causal effects using geo experiments

by JOUNI KERMAN, JON VAVER, and JIM KOEHLER

Randomized experiments represent the gold standard for determining the causal effects of app or website design decisions on user behavior. We might be interested in comparing, for example, different subscription offers, different versions of terms and conditions, or different user interfaces. When it comes to online ads, there is also a fundamental need to estimate the return on investment. Observational data such as paid clicks, website visits, or sales can be stored and analyzed easily. However, it is generally not possible to determine the incremental impact of advertising by merely observing such data across time. One approach that Google has long used to obtain causal estimates of the impact of advertising is geo experiments.


What does it take to estimate the impact of online exposure on user behavior? Consider, for example, an A/B experiment, where one or the other version (A or B) of a web page is shown at random to a user. The analysis could then proceed with comparing the probabilities of clicking on a certain link on the page shown. The version of the web page that has a significantly higher estimated probability of click (click-through rate, or CTR) would be deemed the more effective one.

Similarly, we could test the effectiveness of a search ad compared to showing only organic search results. Whenever a user types in a specific search query, the system makes a split-second decision of whether or not to show a particular ad next to the organic results. Click-through rates can then be compared to determine the relative effectiveness of the presence of ads.

Traffic experiments like the ones above can randomize queries, but they cannot randomize users: the same user may be shown the ad again when they repeat the same search. Traffic experiments, therefore, don't allow us to determine the longer-term effect of the ad on the behavior of users.

We could approximate users by cookies. However, one user may have several devices (desktop, laptop, tablet, smartphone), each with its own cookie space. Moreover, cookies get deleted and regenerated frequently. Cookie churn increases the chances that a user may end up receiving a mixture of both the active treatment and the control treatment.

Even if we were able to keep a record of all searches and ad clicks and associate them with online conversions, we would still not be able to observe any long-term behavior. A conversion might happen days after the ad was seen, perhaps at a regular brick-and-mortar store. This makes it difficult to attribute offline purchases to the online ads that influenced them.

It is important that we are able to measure the effect on these offline conversions as well. How can we connect a purchase to the earlier event of seeing the ad if the purchase does not happen immediately? And, from the perspective of the experiment set-up, how can we ensure that a user whom we assigned to the control group never sees the ad during the experiment?

Another possibility is to run a panel study, an experiment with a recruited set of users who allow us to analyze their web and app usage, and their purchase behavior. Panel studies make it possible to measure user behavior along with the exposure to ads and other online elements. However, meaningful insights require a representative panel or an analysis that corrects for the sampling bias that may be present. In addition, panel studies are expensive. Wouldn't it be great if we didn't require individual data to estimate an aggregate effect? Let's take a look at larger groups of individuals whose aggregate behavior we can measure.

A geo experiment is an experiment where the experimental units are defined by geographic regions. Such regions are often referred to as Generalized Market Areas (GMAs) or simply geos. They are non-overlapping, geo-targetable regions, which means it is possible to specify exactly in which geos an ad campaign will be served, and to observe the ad spend and the response metric at the geo level. We can then form treatment and control groups by randomizing a set of geos.

Consider, as an example, the partition of the United States into 210 GMAs defined by Nielsen Media. The regions were originally formed based on television viewing behavior of their residents, clustering together “exclusive geographic area of counties in which the home market television stations hold a dominance of total hours viewed.” These geos can be targeted individually in Google AdWords. Here is an example of a randomized assignment:


In contrast to the US, France doesn't currently have an analogous set of geos. So we created a set of geos using our own clustering algorithms. The figure below shows an example of a set of geos partitioning mainland France into 29 GMAs:


Suppose that users in control regions are served ads with a total spend intensity of $C$ dollars per week, while users in treatment regions are served ads with a cost of $T = C + A$ dollars per week ($A > 0$). The key assumption in geo experiments is that users in each region contribute to sales only in their respective region. This assumption allows us to estimate the effect of the ad spend on sales. What makes geo experiments so simple and powerful is that they allow us to capture the full effects of advertising, including offline sales and conversions over longer periods of time (e.g., days or weeks).

Measuring the effectiveness of online ad campaigns

Estimating the causal effects of an advertising campaign, and the value of advertising in general, is what each of Google’s advertising clients would like to do for each of their products. There are many methods for estimating causal effects, and yet getting it right remains a challenging problem in practice.

The quantity that we aim to estimate, in particular, is a specific type of return on investment where the investment is the cost of advertising. We often refer to this as the Return On Ad Spend (ROAS). Even more specifically, we are typically interested in the change in sales (or website visits, conversions, etc.) when we change the ad spend: the incremental ROAS, or iROAS. When response is expressed in terms of the same currency as the investment, iROAS is just a scalar. For example, an iROAS of 3 means that each extra dollar invested in advertising leads to 3 incremental dollars in revenue. Alternatively, when the response is the number of conversions, iROAS can be given in terms of number of conversions per additional unit of currency invested, say “1000 incremental conversions per 10,000 additional dollars spent.” In other words, iROAS is the slope of a curve of the response metric plotted against the underlying advertising spend.

Structure of a geo experiment

A typical geo experiment consists of two distinct time periods: pretest and test. During the pretest period, there are no differences in the ad campaign structure across geos. All geos operate at the same baseline level; the incremental difference between the control and treatment geos is zero in expectation. The pretest period is typically 4 to 8 weeks long.

During the test period (which is typically 3 to 5 weeks long), geos in the treatment group are targeted with modified advertising campaigns. We know that this modification, by design, causes an incremental effect on ad spend. What we don't know is whether it also causes an incremental effect in the response metric. It is worth noting that modified campaigns may cause the ad spend to increase (e.g., by adding keywords or increasing bids in the AdWords auction) or decrease (e.g., by turning campaigns off). Either way, we typically expect the response metric to be affected in the same direction as spend. As a result, the iROAS is typically positive.

After the test period finishes, the campaigns in the treatment group are reset to their original configurations. This doesn't always mean their causal effects will cease instantly. Incremental offline sales, for example, may well be delayed by days or even weeks. When studying delayed metrics, we may therefore want to include in the analysis data from an additional cool-down period to capture delayed effects.

Designing a geo experiment: power analysis

As with any experiment, it is essential that a geo experiment is designed to have a high probability of being successful. This is what we mean when we estimate the power of the experiment: the probability of detecting an effect if an effect of a particular magnitude is truly present.

Statistical power is traditionally given in terms of a probability function, but a more intuitive way of describing power is often to state the expected precision of our estimates: the standard error of the estimate times the multiplier used to obtain the bounds of a confidence interval. This could, for example, be stated as "the iROAS point estimate +/- 1.0 for a 95% confidence interval". This quantity is easily interpretable and summarizes the statistical power of the experiment nicely.

The expected precision of our inferences can be computed by simulating possible experimental outcomes. We also check that the false positive rate (i.e., the probability of obtaining a statistically significant result if the true iROAS is in fact zero) is acceptable, such as 5% or 10%. This power analysis is absolutely essential in the design phase as the amount of proposed ad spend change directly contributes to the precision of the outcome. We can therefore determine whether a suggested ad spend change is sufficient for the experiment to be feasible.
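As an illustration of this kind of power analysis, here is a minimal simulation sketch in Python. All of its inputs are hypothetical stand-ins (lognormal geo sizes, a flat residual noise level, and a per-geo spend change proportional to geo size); in practice these would come from historical data for the geos in question, and the fitted model is the regression described later in this post.

```python
import numpy as np

def expected_ci_halfwidth(n_geos=100, spend_change=1.0, noise_sd=5.0,
                          true_iroas=0.0, n_sims=2000, seed=1):
    """Approximate the expected 95% CI half-width for the iROAS estimate.

    Hypothetical design parameters: geo pretest volumes are lognormal,
    treatment geos get a spend change proportional to their size, and the
    response noise is Gaussian with a common standard deviation.
    """
    rng = np.random.default_rng(seed)
    half_widths = np.empty(n_sims)
    for s in range(n_sims):
        pretest = rng.lognormal(mean=3.0, sigma=1.0, size=n_geos)   # geo sizes
        treated = rng.permutation(n_geos) < n_geos // 2             # random assignment
        delta = np.where(treated, spend_change * pretest, 0.0)      # spend change per geo
        y = pretest + true_iroas * delta + rng.normal(0, noise_sd, n_geos)
        X = np.column_stack([np.ones(n_geos), pretest, delta])
        beta, rss, *_ = np.linalg.lstsq(X, y, rcond=None)
        sigma2 = rss[0] / (n_geos - X.shape[1])
        cov = sigma2 * np.linalg.inv(X.T @ X)
        half_widths[s] = 1.96 * np.sqrt(cov[2, 2])                  # for beta_2 (iROAS)
    return half_widths.mean()
```

Increasing the spend change or the number of geos in a call like this shrinks the returned half-width, which is exactly the trade-off discussed next.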

One of the factors determining the standard error (and therefore, the precision) of our causal estimators is the amount of noise in the response variable. The noisier the data, the higher the standard error. On the other hand, for the models that we use, the standard error of the iROAS estimate is inversely proportional to the ad spend difference in the treatment group. That is, we can buy ourselves shorter confidence intervals (and a “higher precision”) by increasing the ad spend difference.

In practice, however, increasing precision is not always as easy as increasing spend. There might simply not be enough available inventory (such as ad impressions, clicks, or YouTube views) to increase spend. Further, there is the risk that the increased ad spend will be less productive due to diminishing returns (e.g., the first 100 keywords in a campaign will be more efficient than the next 100 keywords.) 

A model for assessing incremental return on ad spend

We are interested in estimating the iROAS, which for each geo is the ratio of the incremental (causal) revenue to the incremental change in ad expenditure in that geo. The incremental effect is defined as the difference between the expected potential outcome under treatment and the expected potential outcome under no treatment.

We estimate the causal treatment effect using linear regression (see for example [3], chapter 9). The model regresses the outcomes $y_{1,i}$ on the incremental change in ad spend $\delta_i$. We can however increase the precision of the estimates (and therefore the statistical power) by also including the pre-test response $y_{0,i}$ as a covariate:

$y_{1,i} = \beta_0 + \beta_1 y_{0,i} + \beta_2 \delta_i + \epsilon_i$  for geo $i$

For all control geos, we have $\delta_i = 0$ by design. For a treatment geo $i$, $\delta_i$ is the observed ad spend minus the counterfactual, that is, the ad spend that would have been observed in the absence of the treatment. This counterfactual is estimated using a linear regression model applied to control geos only (see [1] for details). If there was no previous ad expenditure (for this particular channel), the counterfactual spend would simply be zero, and so $\delta_i$ would be the observed ad spend in treatment geo $i$.

The modelled incremental response caused by the change in ad spend $\delta_i$ in geo $i$ is $\beta_2 \delta_i$. The model parameter $\beta_2$ is our main quantity of interest, the incremental ROAS. For example, $\beta_2 = 3.1$ would indicate that each unit of currency invested caused an extra 3.1 units of currency generated.

The term $\beta_1 y_{0,i}$ controls for seasonality and other factors common to all geos (e.g., a nationwide sales event), as it is plausible that the pretest and test periods experience a trend or a temporary increase that is unrelated to the effect we are measuring. The interpretation of the coefficient $\beta_1$, of course, depends on the length of the pretest and test periods.

The geos are invariably of different sizes, so the data show considerable heteroscedasticity. Since each geo can be thought of as being built up from many small individual contributions, we assume that the variance of a geo's response is proportional to its mean. This is a plausible assumption, since the variance of a sum of independent variables is equal to the sum of their individual variances. A variance-stabilizing transformation (square root) is one option for equalizing the variances, but we prefer to work on the original scale rather than in square-root units, as this is more convenient and flexible; for example, the interpretation of coefficients such as $\beta_2$ (iROAS) remains straightforward. We have seen in practice that fitting the model by weighted regression with weights $1 / y_{0,i}$ (the inverse of the geo's total response in the pretest period) controls the heteroscedasticity; without the weights we would obtain a biased estimate of the variance of the estimator. Some caution is needed when using these weights: when a geo's pretest volume is close to zero, its weight may be very large (this usually reflects an issue with data reporting). A quick remedy is to combine the smallest geos into a new, larger geo.
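A minimal sketch of this weighted regression in Python, using plain numpy rather than any particular regression package; the arrays `y1`, `y0`, and `delta` are hypothetical per-geo test-period response, pretest response, and incremental spend (zero for control geos):

```python
import numpy as np

def fit_iroas(y1, y0, delta):
    """Weighted least squares fit of y1 ~ 1 + y0 + delta with weights 1/y0.

    Returns the iROAS point estimate (beta_2) and its standard error.
    Assumes no geo has a pretest volume near zero (see the caveat above).
    """
    y1, y0, delta = (np.asarray(v, dtype=float) for v in (y1, y0, delta))
    X = np.column_stack([np.ones_like(y0), y0, delta])
    w = 1.0 / y0                                      # regression weights
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y1 * sw, rcond=None)
    resid = y1 - X @ beta
    sigma2 = np.sum(w * resid**2) / (len(y1) - X.shape[1])
    cov = sigma2 * np.linalg.inv((X * w[:, None]).T @ X)
    return beta[2], np.sqrt(cov[2, 2])
```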

The paper [1] gives more details on the model; the follow-up paper [2] describes an extension of the methodology to multi-period geo experiments.

Example

Here is a set of daily time series of sales by geo. As we will see below, this period will form the pre-period of a geo experiment. There are 100 geos; the largest has a volume 342 times that of the smallest. Such differences in scale are not unusual. The time series show strong weekly seasonality, with the lowest volumes occurring during weekends.


It is usually helpful to look at these response variables on a log scale. This makes clear how similarly the geos behave; the only real difference is their size.


The goal of the study was to estimate the incremental return in sales from an additional change in ad spend. The 100 geos were randomly assigned to control and treatment groups, and a geo experiment test period was set up for February 16 to March 15, 2015, with the 6 previous weeks serving as the pre-period. During this test period, advertising spend was increased in the treatment group.



After the experiment finished, the incremental return on ad spend was estimated to be $\beta_2 = 3.1$. The other parameter estimates were $\beta_0 = 74$ and $\beta_1 = 0.85$, with a residual standard deviation of 4.8. Since the pretest period is 6 weeks long and the test period is 4 weeks long, a value of $\beta_1 = 0.85$ corresponds to a ratio of weekly averages of $(6/4) \times 0.85 = 1.27$; that is, in the absence of any intervention, the average weekly response volume of a geo during the test period was about 27% higher than during the pretest period. In practice, however, the focus is on the estimate of $\beta_2$, without forgetting the uncertainty around this estimate: the confidence interval half-width was estimated to be 0.27.

Caveats

Model

As with any statistical model, this one is designed to operate under certain assumptions. It is important that the technical assumptions of a linear regression model are satisfied: linearity, additivity, independence of errors, normality, and equal variance of errors (after taking the weighting into account). If the response data indeed consist of sums of many small independent contributions (say, sales of products) grouped into geos, the differences between geos should be approximately normally distributed. A more important assumption is that the relative volumes of the geos do not change during the geo experiment.

It is critical to apply appropriate diagnostics to the data and to the model fit both before and after the test. In practice, anomalies often indicate issues with data reporting: untidy data and programming errors are a frequent source of headaches that are best addressed by developing tools that verify the integrity of the data and check the model assumptions. For example, we have developed a few diagnostic tools of our own to catch unexpected features in the data, such as outliers.

The model that we presented here is quite simple. It is therefore worth asking whether the model accounts for all of the factors that could influence the response. As usual, the honest answer is: probably not. One obvious omission is that the model ignores other possible predictors such as pricing and promotion information. Such information may not even be available; in the case of consumer packaged goods such as soft drinks or shampoo, for example, even manufacturers may not be aware of all the promotions going on in each of the stores selling their products. Thanks to randomization, this omission will not bias our iROAS estimate, but it will typically increase its variance.

Forming treatment groups

By randomizing, we aim to generate the treatment and control groups with comparable baseline characteristics. By complete randomization, we may well end up with groups that are not as balanced as we would prefer. By default, we use stratified randomization with each stratum containing geos of similar size. It may also be a good idea to consider forming strata by other characteristics such as geographical location.
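One simple way to implement such a size-stratified randomization is sketched below in Python. The inputs and the stratification scheme are hypothetical: geos are sorted by pretest volume and paired off, which is only one of many reasonable ways to form strata.

```python
import numpy as np

def stratified_assignment(geo_volumes, seed=42):
    """Assign geos to treatment/control within size-ordered strata of two.

    geo_volumes is a hypothetical dict {geo_name: pretest volume}. Geos are
    sorted by volume and split into consecutive pairs; within each pair one
    geo is randomly assigned to treatment and the other to control.
    """
    rng = np.random.default_rng(seed)
    ordered = sorted(geo_volumes, key=geo_volumes.get, reverse=True)
    assignment = {}
    for start in range(0, len(ordered), 2):
        pair = ordered[start:start + 2]
        labels = rng.permutation(["treatment", "control"])
        for geo, label in zip(pair, labels):
            assignment[geo] = label
    return assignment
```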

Even stratified randomization may not be enough: the problem of balance may be especially accentuated when we have few geos to work with (for example, in small countries). Highly heterogeneous countries often have their own set of challenges. For example, the metropolitan areas of London and Paris dominate entire countries like the UK or France. This makes it virtually impossible to partition such countries into homogeneous, well-balanced groups.

For this reason, in these particular situations we need an algorithm that matches geos to form control and treatment groups that are predictive of each other, justifying a causal interpretation of the analysis.

Outliers

A particular nuisance is a geo that – for a known or unknown reason – experiences a shock (unexpected drop or surge) in the response variable during the test. For example, if a large geo in the control group suddenly experiences a heat wave, a geo experiment to measure the incremental sales of a particular ice cream brand might be affected, diluting the effect that we are aiming to measure. If the shock is balanced by another geo in the other group, this problem is fortunately mitigated (as it should be in a randomized experiment). Otherwise we will have to deal with the outlier geo(s) after the experiment. Hence we need to be aware of these problems and develop complementary methodologies to deal with them.

It is a good idea to do a sensitivity analysis to check whether there is any cause for concern. For example, one useful diagnostic for detecting a single outlier geo is to look at the effect of each geo on the iROAS point estimate. In this leave-one-out analysis we repeat the analysis as many times as there are geos, each time dropping one geo. A histogram of the resulting estimates may show that dropping a particular geo has a clear effect on the iROAS estimate.
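A sketch of this leave-one-out diagnostic, with the same hypothetical per-geo arrays as in the regression sketch above:

```python
import numpy as np

def loo_iroas(y1, y0, delta):
    """Leave-one-out sensitivity analysis for the iROAS estimate.

    y1, y0, delta are hypothetical arrays (test response, pretest response,
    incremental spend). Each pass drops one geo and refits the weighted
    regression y1 ~ 1 + y0 + delta with weights 1/y0; a histogram of the
    returned estimates highlights geos with outsized influence.
    """
    y1, y0, delta = (np.asarray(v, dtype=float) for v in (y1, y0, delta))
    n = len(y1)
    estimates = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        X = np.column_stack([np.ones(n - 1), y0[keep], delta[keep]])
        sw = np.sqrt(1.0 / y0[keep])                 # sqrt of regression weights
        beta, *_ = np.linalg.lstsq(X * sw[:, None], y1[keep] * sw, rcond=None)
        estimates[i] = beta[2]                       # beta_2 is the iROAS
    return estimates
```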

Small market regions

Since the statistical power of this model depends on the number of geos, it is not obviously suitable for situations in which the number of geos is very small. In the U.S., national-level experiments pose no problem, as there are 210 geos to work with. In Europe, this methodology may not always be advisable; as a rough rule of thumb, we prefer to apply this approach to experiments with 30 or more geos. If this is not possible, alternative approaches relying on time-series models, such as CausalImpact, may be a better choice.

What’s next?

We mentioned several challenges: ensuring the model is appropriate, lack of useful data, forming comparable treatment and control groups, dominant geos, outliers, and running experiments in small countries. For a data scientist, all this means more interesting and applicable research opportunities.

References

  1. Jon Vaver and Jim Koehler, Measuring Ad Effectiveness Using Geo Experiments, 2011.
  2. Jon Vaver and Jim Koehler, Periodic Measurement of Advertising Effectiveness Using Multiple-Test-Period Geo Experiments, 2012.
  3. Andrew Gelman and Jennifer Hill, Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge, 2007.

Using random effects models in prediction problems

by NICHOLAS A. JOHNSON, ALAN ZHAO, KAI YANG, SHENG WU, FRANK O. KUEHNEL, and ALI NASIRI AMINI

In this post, we give a brief introduction to random effects models and discuss some of their uses. Through simulation, we illustrate issues with model fitting techniques that depend on matrix factorization. Far from hypothetical, we have encountered these issues in our experience with "big data" prediction problems. Finally, through a case study of a real-world prediction problem, we also argue that random effects models should be considered alongside penalized GLMs even for pure prediction problems.



Random effects models are a useful tool for both exploratory analyses and prediction problems. We often use statistical models to summarize the variation in our data, and random effects models are well suited for this — they are a form of ANOVA after all. In prediction problems these models can summarize the variation in the response, and in the process produce a form of adaptive regularization through the learned prior distributions. Random effects can be viewed as an application of empirical Bayes, and if we broaden our view to that parent technique, we can find theoretical evidence of their superiority over say fixed effects models. For example, the James-Stein estimator has an empirical Bayes interpretation — it was famously proven to dominate the MLE (an analogue of fixed effects) for the problem of estimating the mean of a multivariate normal distribution (see [3] and [4]).

In the context of prediction problems, another benefit is that the models produce an estimate of the uncertainty in their predictions: the predictive posterior distribution. These predictive posterior distributions have many uses such as in multi-armed bandit problems. Thompson sampling is an "explore/exploit" algorithm which was designed for Bayesian models, and its strategy is to randomly select a choice (called an "arm") with probability proportional to the posterior of that choice being the best option. See [9], [10] for a discussion of this approach.

In the next two sections we give a brief introduction to random effects models and then discuss a case study in click-through-rate modeling. This example shows that the prediction accuracy can be competitive with popular alternatives such as penalized GLM's — in fact, much better in this example. Furthermore the algorithms to fit these models are sufficiently scalable to handle many problems of interest.


Random Effect Models


We will start by describing a Gaussian regression model with known residual variance $\sigma_j^2$ of the $j$th training record's response, $y_j$. Often our data can be stored or visualized as a table like the one shown below. In this example we have three features/columns named "a", "b", and "c". Column "a" is an advertiser id, "b" is a web site, and "c" is the 'interaction' of columns "a" and "b".


| $y$ | feature (a) | feature (b) | feature (c) | $\sigma^2$ |
| --- | --- | --- | --- | --- |
| 1.2 | hi-fly-airlines | 123.com | "hi-fly-airlines__123.com" | 1.10 |
| 0.1 | erudite-bookstore | 123.com | "erudite-bookstore__123.com" | 0.15 |
| 0.3 | hi-fly-airlines | xyz.com | "hi-fly-airlines__xyz.com" | 0.20 |
| 0.3 | lightspeed-brokerage | www.com | "lightspeed-brokerage__www.com" | 0.10 |


To make mathematical notation simpler we suppose that string values in each column have been enumerated, and we deal with their integer index instead:

| $y$ | $a$ | $b$ | $c$ | $\sigma^2$ |
| --- | --- | --- | --- | --- |
| 1.2 | 1 | 1 | 1 | 1.10 |
| 0.1 | 2 | 1 | 2 | 0.15 |
| 0.3 | 1 | 2 | 3 | 0.20 |
| 0.3 | 3 | 3 | 4 | 0.10 |

Now for the $j$th row of data we define the index of the feature "a" as 
$I_a(j) \in \{1, 2, ..., \mbox{#advertisers} \}$. In our example we have \begin{align*}
I_a(1) &= 1\\
I_a(2) &= 2 \\
I_a(3) &= 1 \\
I_a(4) &=3 \\
\cdots
\end{align*}Similarly, $I_b(j) \in \{1, 2, ..., \mbox{#websites}\}$ could be the index of the site on which the ad was shown and $I_c(j)$ could index the (advertiser, site) pair.

Finally we get to our model of the data: \[
\{y_j | I_a(j), I_b(j), I_c(j), \sigma_j^2 \} \sim
N(\beta + \mu_{I_a(j)}^a + \mu_{I_b(j)}^b + \mu_{I_c(j)}^c, \sigma_j^2)
\]The curly-bracketed left-hand side, $\{y_j| \cdots \}$, denotes a conditional distribution; the parameter $\beta$ represents the overall mean of the data (whose fitted value depends on the structure of the model); and the $\mu^a$'s, $\mu^b$'s, and $\mu^c$'s represent random effects i.e. unobserved random variables.

The model is complete once we specify a prior distribution on each of the $\mu$'s: \begin{align*}
\mu_k^a &\sim N(0, \tau_a^{-1}) \\
\mu_k^b &\sim N(0, \tau_b^{-1}) \\
\mu_k^c &\sim N(0, \tau_c^{-1})
\end{align*}In the formulas above the parameter $\tau_x$ is the inverse-variance of the prior, and all coefficients associated with the same column are drawn from a common prior. Rather than just specifying the $\tau$'s, we will try to infer them from the data: they can be fitted by a Monte Carlo Expectation Maximization (MCEM) algorithm or they can be sampled in a Bayesian model.
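To make the model concrete, here is a small Python sketch that simulates data from it. All sizes, precisions ($\tau$'s), and the intercept are made-up values, and column "c" is generated as the (a, b) interaction:

```python
import numpy as np

def simulate_gaussian_re(n_obs=10000, n_a=50, n_b=200,
                         tau_a=4.0, tau_b=2.0, tau_c=1.0,
                         beta=0.5, sigma=0.3, seed=0):
    """Simulate data from the Gaussian random effects model above.

    All sizes and precisions are hypothetical; column "c" is the (a, b)
    interaction, so it has n_a * n_b possible levels.
    """
    rng = np.random.default_rng(seed)
    mu_a = rng.normal(0.0, tau_a ** -0.5, size=n_a)
    mu_b = rng.normal(0.0, tau_b ** -0.5, size=n_b)
    mu_c = rng.normal(0.0, tau_c ** -0.5, size=n_a * n_b)
    ia = rng.integers(0, n_a, size=n_obs)        # I_a(j)
    ib = rng.integers(0, n_b, size=n_obs)        # I_b(j)
    ic = ia * n_b + ib                           # I_c(j): index of the (a, b) pair
    y = rng.normal(beta + mu_a[ia] + mu_b[ib] + mu_c[ic], sigma)
    return y, ia, ib, ic
```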

These inferred prior distributions give a decomposition of the variance of the data\[
 \mathrm{var\ }y|\sigma = \sigma^2 + \tau_a^{-1} + \tau_b^{-1} + \tau_c^{-1}
\]The variance decomposition formula also gives breakdowns such as\begin{align}
\mathrm{var\ }y|\sigma &= E[\mathrm{var}(y\,|\,\sigma, I_b, I_c)\,|\,\sigma]
 + \mathrm{var}[E(y\,|\,\sigma, I_b, I_c)\,|\,\sigma] \\
&= (\sigma^2+ \tau_a^{-1}) + (\tau_b^{-1} + \tau_c^{-1})
\end{align}Their interpretation is most intuitive in hierarchical models. More complex models can involve elements such as "random slopes", and for these it can be informative to calculate a version of $R^2$ (fraction of variance explained) which was developed for random effects models in the textbook by Gelman and Hill [1].

The prior variance decomposition or $R^2$ are useful summaries of the explanatory power of each random effect column, but when making predictions we need the posterior variances of individual random effects. At prediction time suppose we have features (a=3, b=7, c=2) and we need the linear combination $\beta + \mu_3^a + \mu_7^b + \mu_2^c$. By considering several MCMC samples (indexed $t=1,2, \dots, T$) we can compute statistics based on the sample approximation to the posterior distribution of the prediction i.e. $\{\beta_t + \mu_{3,t}^a + \mu_{7,t}^b + \mu_{2,t}^c \}_{t=1}^T$. If we only need the posterior mean then we can simply pre-compute the mean of each $\mu$.
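A sketch of that computation, assuming the MCMC draws have been stored as arrays (the array names and shapes below are hypothetical):

```python
import numpy as np

def prediction_posterior(beta_draws, mu_a_draws, mu_b_draws, mu_c_draws,
                         a=3, b=7, c=2):
    """Posterior summary of one prediction from stored MCMC draws.

    The draw arrays are hypothetical: beta_draws has shape (T,) and each
    mu_*_draws has shape (T, n_levels), one row per MCMC sample.
    """
    draws = beta_draws + mu_a_draws[:, a] + mu_b_draws[:, b] + mu_c_draws[:, c]
    return {"mean": draws.mean(),
            "sd": draws.std(ddof=1),
            "ci_95": np.percentile(draws, [2.5, 97.5])}
```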

Although Gaussian models are a good way to introduce random effects, in practice we must often model discrete data — usually in the form of counts of events. Using similar algorithms, we can fit the following Gamma-Poisson model:\begin{align}
\{ y_j | I_a(j), I_b(j), I_c(j), \lambda_j \}
&\sim \mathrm{Pois}(\lambda_j \mathrm{e}^
{\beta +\mu_{I_a(j)}^a + \mu_{I_b(j)}^b + \mu_{I_c(j)}^c}) \\
\mathrm{e}^{\mu_k^x} &\sim \Gamma(\tau_x, \tau_x)
\end{align}where the prior has a mean of $1.0$ and variance $\tau_x^{-1}$.

Binomial models are also of interest but logistic regression in particular is similar to the Poisson regression when the rate of positive labels is low. In many applications the positive rate is indeed very low, and we then prefer the Poisson model for the computational efficiency of being able to use sums as sufficient statistics [7].

When we only need\[
\log E[y_j|\mu, \beta, I_a, I_b, I_c, \lambda_j] =
    \log \lambda_j + \beta +\mu_{I_a(j)}^a + \mu_{I_b(j)}^b + \mu_{I_c(j)}^c
\]in the Poisson model, then it is simple to pre-calculate the mean of each parameter — just as in the Gaussian model. On the other hand, if we required the posterior mean of $\mathrm{exp}(\beta +\mu_{I_a(j)}^a + \mu_{I_b(j)}^b + \mu_{I_c(j)}^c)$ (i.e. the mean of $y_j$), or if we need a variance estimate, then we must store multiple MCMC samples of every $\mu$. This can represent a challenge since we must now store and manipulate many times more parameters; however, there is recent work to develop more compact summaries of the posterior mean [2], [5]. We look forward to further developments in this area.


A Case Study: Click-Through-Rate Prediction


In this section we evaluate the prediction accuracy of random effects models in click-through-rate modeling on several data sets, each corresponding to 28 days of traffic belonging to a single display ads segment.

We selected 10 segments and chose a subset of training data such that the number of training examples varied between 20 million and 90 million per segment. We compared the output of a random effects model to that of a penalized GLM solver with "Elastic Net" regularization (i.e. both L1 and L2 penalties; see [8]), which was tuned for test set accuracy (log likelihood).

On each of the ten segments the random effects model yielded higher test-set log likelihoods and AUCs, and we display the results in the figure below. In that figure we have taken a single future day of data as the test set, i.e. we trained on data from days 1, ..., 28 and evaluated on day 29. We saw similar results when looking further ahead and evaluating on days 29 through 34.

Figure 1: Comparing Random Effects vs. Penalized GLM on AUC and log likelihood.
Through this case study we support the argument that practitioners should evaluate random effects models when they encounter a new problem. Often one tries several different techniques and then either combines the outputs or selects the most accurate, and we believe random effects models are a valuable addition to the usual suite of approaches which includes penalized GLMs, decision trees, etc.


Scalability studies


First we evaluate the popular lme4 R package and compare against a specialized Gibbs sampler which we describe in [7]. The lme4 software is an excellent choice for many datasets; however, to scale to large, multi-factor datasets we found it necessary to turn to alternative algorithms such as the Gibbs sampler. These algorithms can be distributed and deal with problems which would be RAM-limited for lme4.

This comparison is meant to show that the relative run times greatly depend on the structure of the random effect model matrix. There are designs for which lme4 scales linearly in the number of columns, and there are also non-pathological designs for which it appears to be quadratic or worse.

The figure below shows the run time for a Poisson lme4 model (dashed) and for 1000 iterations of a Gibbs sampler in the case of nested feature columns (left) and non-nested columns (right). By 'nested' we mean that there is a hierarchical relationship, so that the level of a parent column is a function of the level of the child column; for example, consider a domain and the URLs within it.

The left figure shows that the run time of both methods grows linearly with the number of input records and random effects (we simulated such that the number of random effects at the finest level was approximately equal to the number of input records divided by five). The right figure shows that the run time of one scan of the Gibbs sampler has about the same cost per iteration whether the structure is nested or not (as expected); however, for lme4 the cost of fitting the model increases dramatically as we add random effect columns.

Figure 2: Comparing custom Gibbs sampler vs. lmer running times.

We have many routine analyses for which the sparsity pattern is closer to the nested case and lme4 scales very well; however, our prediction models tend to have input data that looks like the simulation on the right. For example, we may have features that describe or identify an advertiser, other features for the web site on which an ad shows, and yet more features describing the device e.g. the type and model of device: computer, phone, tablet, iPhone5, iPhone6, ...; the operating system; the browser.

In the simulations we plotted the run time for 1,000 scans of the Gibbs sampler. In these simulations this number is far more than necessary — based on manual checks of the convergence of prior parameters and log likelihood. We chose 1,000 iterations in order to put the run time on comparable scale with lme4. This makes it easier to contrast how the two approaches handle more complex model structures i.e. more random effect columns.

One of the known challenges of using MCMC methods is deciding how many iterations to run the algorithm for. We recommend monitoring the prior variance parameters and the log likelihood of the data. More sophisticated checks have been proposed, and a good review of convergence diagnostics is found in [6]. Many of these techniques appear infeasible for the large regression models we are interested in, and we are still searching for checks that are easy to implement and reliable for high-dimensional models.

In the previous section we described per-segment models, and we now use the same datasets to illustrate the scalability of the algorithm. We selected two segments as a case study and varied the number of machines used in inference. The chosen segments had between 30 million and 90 million training records and a positive rate of about 1%. The models contained approximately 190 feature columns, and the number of levels per column varied from just two at the smallest to 5 million at the largest (with a median of about 15,000 levels per column). Finally, we should note that a training record typically represents a single ad view, but we can collapse records with identical features (all 190 of them in this case). In these examples aggregation reduced the size of the data set by only about 10%.

In the figure below we show how the run time per iteration (updating all 190 feature columns) varied with the number of machines used. For this timing test we also evaluated the run time of the model on two larger combined datasets: one from all ten segments and another from thirty segments. These had roughly 500 million and 800 million training examples respectively. The combined raw training data for the smaller one was 834 GB in protocol buffer format; after integerization and column formatting (roughly how the data is stored in RAM) the training data was roughly 170 GB on disk. The larger dataset was around 300 GB after integerization and column formatting.

In the right panel of the figure below we can see that the run time is growing sub-linearly in the amount of training data i.e. doubling the number of training instances does not double the time per iteration.

Figure 3: Performance scalability of custom Gibbs sampler.


Conclusion and Discussion


We have found random effects models to be useful tools for both exploratory analyses and prediction problems. They provide an interpretable decomposition of variance, and in prediction problems they can supply predictive posterior distributions that can be used in stochastic optimization when uncertainty estimates are a critical component (e.g. bandit problems).

Our study of per-segment click-through-rate models demonstrated that random effects models can deliver superior prediction accuracy. Although this may not hold in every application, we encourage others to evaluate random effects models when selecting a methodology for a new prediction problem. In our case study we considered models with a couple hundred feature columns and hundreds of millions of inputs. This is for illustration and does not represent an upper bound on the problem size; we have applied these models to data sets with billions of inputs (not shown), and we are confident that the techniques are scalable enough to cover many problems of interest.


References


[1] Andrew Gelman, and Jennifer Hill. "Data analysis using regression and multilevel/hierarchical models." Cambridge University Press, (2006).

[2] Edward Snelson and Zoubin Ghahramani. "Compact approximations to bayesian predictive distributions." ICML, (2005).

[3] Bradley Efron. "Large-scale inference: empirical Bayes methods for estimation, testing, and prediction." Vol. 1. Cambridge University Press, (2012).

[4] Bradley Efron, and Carl Morris. "Stein's estimation rule and its competitors—an empirical Bayes approach." Journal of the American Statistical Association 68.341 (1973): 117-130.

[5] Anoop Korattikara, et al. "Bayesian dark knowledge." arXiv preprint arXiv:1506.04416 (2015).

[6] Mary Kathryn Cowles and Bradley P. Carlin. "Markov Chain Monte Carlo Convergence Diagnostics: A Comparative Review". Journal of the American Statistical Association, Vol. 91, No. 434 (1996): 883-904.

[7] Nicholas A. Johnson, Frank O. Kuehnel, Ali Nasiri Amini. "A Scalable Blocked Gibbs Sampling Algorithm For Gaussian And Poisson Regression Models." arXiv preprint arXiv:1602.00047, (2016).

[8] Hui Zou, and Trevor Hastie. "Regularization and variable selection via the elastic net." Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67.2 (2005): 301-320.

[9] Steven L. Scott. "A modern Bayesian look at the multi-armed bandit." Applied Stochastic Models in Business and Industry, 26 (2010): 639-658.

[10] Steven L. Scott. "Multi-armed bandit experiments in the online service economy." Applied Stochastic Models in Business and Industry, 31 (2015): 37-49.

LSOS experiments: how I learned to stop worrying and love the variability

by AMIR NAJMI

In the previous post we looked at how large scale online services (LSOS) must contend with the high coefficient of variation (CV) of the observations of particular interest to them. In this post we explore why some standard statistical techniques to reduce variance are often ineffective in this “data-rich, information-poor” realm.




Despite a very large number of experimental units, the experiments conducted by LSOS cannot presume statistical significance of all effects they deem practically significant. We previously went into some detail as to why observations in an LSOS have particularly high coefficient of variation (CV). The result is that experimenters can’t afford to be sloppy about quantifying uncertainty. Estimating confidence intervals with precision and at scale was one of the early wins for statisticians at Google. It has remained an important area of investment for us over the years.

Given the role played by the variability of the underlying observations, the first instinct of any statistician would be to apply statistical techniques to reduce variability. “We have just emerged from a terrible recession called New Year's Day where there were very few market transactions,” joked Hal Varian, Google’s Chief Economist, recently. He was making the point that economists are used to separating the predictable effects of seasonality from the actual signals they’re interested in. Doing so makes it easier to study the effects of an intervention, say, a new marketing campaign, on the sales of a product. When analyzing experiments, statisticians can do something similar to go beyond overall comparisons between treatment and control. In this post we discuss a few approaches to removing the effects of known or predictable variation. These typically result in smaller estimation uncertainty and tighter interval estimates. Surprisingly, however, removing predictable variation isn’t always effective in an LSOS world where observations can have a high CV.


Variance reduction through conditioning


Suppose, as an LSOS experimenter, you find that your key metric varies a lot by country and time of day. There are pretty clear structural reasons for why this could be the case. For instance, the product offering of a personal finance website might be very different in different countries and hence conversion rates (acceptances per offer) might differ considerably by country. Or perhaps the kinds of music requested from an online music service vary a lot with hour of day, in which case the average number of songs downloaded per user session might vary greatly too. Thus any randomized experiment will observe variation in metric difference between treatment and control groups due merely to sampling variation in the number of experimental units in each of these segments of traffic. In statistics, such segments are often called “blocks” or “strata”. At Google, we tend to refer to them as slices.

One approach statisticians use to mitigate between-slices variability is to employ stratified sampling when assigning units to the treatment and the control group. Stratified sampling means drawing a completely randomized sample from within each slice, ensuring greater overall balance. Figure 1 illustrates how this would work in two dimensions.

Figure 1: Completely randomized versus stratified randomized sampling.
In LSOS, stratified sampling is doable but less convenient than in, say, a clinical trial. This is because assignment to treatment and control arms is made in real time, and thus the number of experimental units in each slice is not known in advance. An easier approach is to account for imbalances after the fact, in the analysis.

To illustrate how this works, let’s take the music service as our example. Let the outcome variable $Y_i$ represent the number of songs downloaded in a user session $i$, assuming each user session is independent. To simplify the analysis, let’s further assume that the treatment does not affect the variance of $Y_i$. 
The simplest estimator for the average treatment effect (ATE) is given by the difference of sample averages for treatment and control: 
\[
\mathcal{E}_1=\frac{1}{|T|}\sum_{i \in T}Y_i - \frac{1}{|C|}\sum_{i \in C}Y_i
\]
Here, $T$ and $C$ represent sets of indices of user sessions in treatment and control and $| \cdot |$ denotes the size of a set. The variance of this estimator is
\[
\mathrm{var\ } \mathcal{E}_1 = \left(\frac{1}{|T|} + \frac{1}{|C|} \right )\mathrm{var\ }Y
\]
Estimator $\mathcal{E}_1$ accounts for the fact that the number of treatment and control units may vary each time we run the experiment. However, it doesn't take into account the variation in the fraction of treatment and control units assigned to each hour of day. Conditioned on $|T|$ and $|C|$, the number of units assigned to treatment or control in each hour of day will vary according to a multinomial distribution. And since the metric average is different in each hour of day, this is a source of variation in measuring the experimental effect.

We can easily quantify this effect for the case when there is a constant experimental effect and variance in each slice. In other words, assume that in Slice $k$, $Y_i \sim N(\mu_k,\sigma^2)$ under 
control and $Y_i \sim N(\mu_k+\delta,\sigma^2)$ under treatment. To compute $\mathrm{var\ }Y$ we condition on slices using the law of total variance which says that \[
\mathrm{var\ }Y = E_Z[\mathrm{var\ }Y|Z] + \mathrm{var_Z}[EY|Z]
\]Let $Z$ be the random variable representing hour of day for the user session. $Z$ is our slicing variable, and takes on integer values $0$ through $23$. The first component, $E_Z[\mathrm{var\ }Y|Z]$, is the expected within-slice variance, in this case $\sigma^2$. The second component, $\mathrm{var_Z}[EY|Z]$, is the variance of the per-slice means. It's that part of the variance which is due to the slices having different means. We’ll call it $\tau^2$:
\begin{align}
\tau^2 &= \mathrm{var}_Z[EY|Z] \\
&= E_Z[(EY|Z)^2] - (E_Z[EY|Z])^2 \\
&= \sum_k w_k \mu_k^2 - \left( \sum_k w_k\mu_k \right )^2
\end{align}
where $w_k$ is the fraction of units in Slice $k$. In this situation, the variance of estimator $\mathcal{E}_1$ is
\[
\mathrm{var\ } \mathcal{E}_1 = \left(\frac{1}{|T|} + \frac{1}{|C|} \right )(\sigma^2+\tau^2)
\]
Quantity $\tau^2$ is nonzero because the slices have different means. We can remove its effect if we employ an estimator $\mathcal{E}_2$ that takes into account the fact that the data are sliced:
\[
\mathcal{E}_2=\sum_k \frac{|T_k|+|C_k|}{|T|+ |C|}\left( \frac{1}{|T_k|}\sum_{i \in T_k}Y_i - \frac{1}{|C_k|}\sum_{i \in C_k}Y_i \right)
\]
Here, $T_k$ and $C_k$ are the subsets of treatment and control indices in Slice $k$. This estimator improves on $\mathcal{E}_1$ by combining the per-slice average difference between treatment and control. Both estimators are unbiased. But the latter estimator has less variance. To see this, we can compute its variance:
\[
\mathrm{var\ } \mathcal{E}_2 =
\sum_k \left(\frac{|T_k|+|C_k|}{|T|+ |C|} \right )^2\left( \frac{1}{|T_k|} + \frac{1}{|C_k|} \right) \sigma^2
\]
Given that $|T_k| \approx w_k |T|$ and $|C_k| \approx w_k|C|$,
\begin{align}
\mathrm{var\ } \mathcal{E}_2 &\approx
\sum_k w_k^2 \left(\frac{1}{w_k|T|} + \frac{1}{w_k|C|} \right ) \sigma^2 \\
&=
\left(\frac{1}{|T|} + \frac{1}{|C|} \right ) \sigma^2
\end{align}
Intuitively, we have increased the precision of our estimator by using information in $Z$ about which treatment units happened to be in each slice and likewise for control. In other words, this estimator is conditioned on the sums and counts (sufficient statistics) in each slice rather than just overall sums and counts.
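A small simulation sketch makes the comparison concrete. The per-hour means, noise level, and traffic volumes below are made up, and the true treatment effect is zero, so the spread of each estimator across simulations approximates its standard error:

```python
import numpy as np

def compare_estimators(n=20000, n_sims=400, sigma=1.0, seed=7):
    """Standard errors of E1 (overall difference) and E2 (slice-conditioned)."""
    rng = np.random.default_rng(seed)
    mu = np.linspace(1.0, 3.0, 24)                  # hypothetical per-hour means
    e1 = np.empty(n_sims)
    e2 = np.empty(n_sims)
    for s in range(n_sims):
        z = rng.integers(0, 24, size=n)             # hour-of-day slice per session
        t = rng.random(n) < 0.5                     # treatment assignment
        y = rng.normal(mu[z], sigma)
        e1[s] = y[t].mean() - y[~t].mean()
        diffs = np.empty(24)
        weights = np.empty(24)
        for k in range(24):
            sk = z == k
            diffs[k] = y[sk & t].mean() - y[sk & ~t].mean()
            weights[k] = sk.sum()
        e2[s] = np.average(diffs, weights=weights)
    return e1.std(), e2.std()   # E2's spread should be noticeably smaller
```

With these made-up numbers ($\sigma^2 = 1$, $\tau^2 \approx 0.36$), the standard error of $\mathcal{E}_2$ comes out roughly 15% smaller than that of $\mathcal{E}_1$, in line with the formulas above.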

In general, we can expect conditioning to remove only effects due to the second component of the law of total variance. This is because $\mathrm{var}_Z [EY|Z]$ is the part of $\mathrm{var\ }Y$ explained by $Z$ while $E_Z[\mathrm{var\ }Y|Z]$ is the variance remaining even after we know $Z$. Thus, if an estimator is based on the average value of $Y$, we can bring its variance down to at most the fraction
\[
\frac{E_Z[\mathrm{var\ }Y|Z]}{\mathrm{var\ }Y}
\]
But that’s just the theory. As discussed in the last post, LSOS often face the situation that $Y$ has a large CV. But if the CV of $Y$ is large, it is often because the variance within each slice is large, i.e. $E_Z[\mathrm{var\ }Y|Z]$ is large. If we track a single metric across the slices, we do so because we believe more or less the same phenomenon is at work in each slice of the data. Obviously, this doesn’t have to be true. For example, an online music service may care about the fraction of songs listened to from playlists in each user session but this playlist feature may only be available to premium users. In this case, conditioning on user type could make a big difference to the variability. But then you have to wonder why the music service doesn’t just track this metric for premium user sessions as opposed to all user sessions!

Another way to think about this is that the purpose of stratification is to use side information to group observations which are more similar within strata than between strata. We have ways to remove the variability due to differences between strata (between-stratum variance). But if between-stratum variance is small compared to within-stratum variance, then conditioning is not going to be very effective.


Rare binary event example


In the previous post, we discussed how rare binary events can be fundamental to the LSOS business model, so it is worth studying them in more detail. Let's make this concrete with actual values. Say we track the fraction of user sessions which result in a purchase, and we care about the effect of experiments on this important metric.

Let $Y$ be the Bernoulli random variable representing the purchase event in a user session. Suppose that the purchase probability varies by hour of day $Z$. In particular, say $E[Y|Z]=\theta$ follows a sinusoidal function of hour of day varying by a factor of 3. In other words, $\theta$ varies from $\theta_{min}$ to $\theta_{max}=3 \theta_{min}$ with average $\bar{\theta} = (\theta_{min} + \theta_{max})/2$. In the plot below, $\theta_{min}=1\%$ and $\theta_{max}=3\%$.

Figure 2: Fictitious hourly variation in mean rate of sessions with purchase.

For simplicity, assume there is no variation in frequency of user sessions in each hour of day. A variation in the hourly rate by a factor of 3 would seem worth addressing through conditioning. But we shall see that the improvement is small.

As discussed earlier, the portion of $\mathrm{var\ }Y$ we can reduce is $\mathrm{var_Z}[EY|Z]$. Here
\[
\mathrm{var\ }Y = \bar{\theta}(1-\bar{\theta}) \\
\mathrm{var_Z}[EY|Z] = \mathrm{var\ }\theta
\]
We can approximate $\mathrm{var\ }\theta$ by the variance of a sine wave of amplitude $(\theta_{max} - \theta_{min})/2$. This gives us $\mathrm{var\ }\theta \approx (\theta_{max} - \theta_{min})^2/8$ which reduces to $\bar{\theta}^2/8$ when $\theta_{max}=3 \theta_{min}$. Thus the fraction of reducible variance is
\[
\frac{\mathrm{var\ }\theta}{\mathrm{var\ }Y} = \frac{\bar{\theta}}{8(1-\bar{\theta})}
\]
which is less than $0.3\%$ when $\bar{\theta}=2\%$. Even with $\bar{\theta}=20\%$, the reduction in variance is still only about $3\%$. Since the fractional reduction in confidence interval width is about half the reduction in variance, there isn’t much to be gained here. 
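The arithmetic is easy to reproduce; here is a small Python check of the reducible fraction under the sinusoidal approximation, with the values used in the text:

```python
def reducible_fraction(theta_min, theta_max):
    """Fraction of var Y removable by conditioning on hour of day,
    using the approximation var(theta) ~ (theta_max - theta_min)^2 / 8."""
    theta_bar = (theta_min + theta_max) / 2
    var_theta = (theta_max - theta_min) ** 2 / 8
    return var_theta / (theta_bar * (1 - theta_bar))

print(reducible_fraction(0.01, 0.03))   # ~0.0026, i.e. under 0.3%
print(reducible_fraction(0.10, 0.30))   # ~0.031, i.e. about 3%
```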

For intuition as to why variance reduction did not work let’s plot on the same graph the unconditional standard deviation $\sqrt{\bar{\theta}(1-\bar{\theta})}$ of $14\%$ and the hourly standard deviation $\sqrt{\mathrm{var}\ Y|Z}= \sqrt{\theta(1-\theta)}$ which ranges from $10\%$ to $17\%$. We can see clearly that the predictable hourly variation of $\theta$ (the red line in the plot below) is small compared to the purple line representing variation in $Y$.

Figure 3: Mean and standard deviation of hourly rate of sessions with purchase


Variance reduction through prediction


In our previous example, conditioning on hour-of-day slicing did not meaningfully reduce variability. This was a case where our experiment metric was based on a rare binary event, the fraction of user sessions with a purchase. One could argue that we should have included all relevant slicing variables, such as country, day of week, and customer dimensions (e.g. premium or freemium). Perhaps the combined effect of these factors would be worth addressing? If we were simply to posit the Cartesian product of these factors, we would end up with insufficient data in certain cells. A better approach would be to create an explicit prediction of the probability that the user will purchase within a session. We could then define slices corresponding to intervals of the predicted probability. To simplify the discussion, assume that all experimental effects are small compared with the prediction probabilities. Of course, the classifier cannot use any attributes of the user session that are potentially affected by the treatment (e.g. session duration), but that still leaves a rich set of predictors.

We’d like to get a sense for how the accuracy of this classifier translates into variance reduction. For each user session, say the classifier produces a prediction $\theta$, its estimate of the probability of a purchase during that session; $Y$ is the binary event of a purchase. Let’s also assume the prediction is “calibrated” in the sense that $E[Y|\theta]=\theta$, i.e. of the set of user sessions for which the classifier predicts $\theta=0.3$, exactly $30\%$ have a purchase. If $E[Y]=E[\theta]=\bar{\theta}$ is the overall purchase rate per session, the law of total variance used earlier says that the fraction of variance we cannot reduce by conditioning is
\[
\frac{E_{\theta}[\mathrm{var\ }Y|\theta]}{\bar{\theta}(1-\bar{\theta})}
\]
The mean squared error (MSE) of this (calibrated) classifier is
\[
E(Y-\theta)^2=E_\theta [\mathrm{var\ }Y|\theta] + E_\theta(EY|\theta - \theta)^2
= E_\theta[\mathrm{var\ }Y|\theta]
\]
In other words, because the classifier is calibrated the second term vanishes, and the MSE of the classifier is exactly the residual variance after conditioning. Furthermore, an uninformative calibrated classifier (one which always predicts $\bar{\theta}$) has an MSE of $\bar{\theta}(1-\bar{\theta})$. This is equal to $\mathrm{var\ }Y$, so, unsurprisingly, such a classifier gives us no variance reduction (how could it, if it gives us a single slice?). Thus, the better the classifier for $Y$ we can build (i.e. the lower its MSE), the greater the variance reduction possible if we condition on its prediction. But since it is generally difficult to predict the probability of rare events accurately, this is unlikely to be an easy route to variance reduction.

In the previous example, we tried to predict small probabilities. Another way to build a classifier for variance reduction is to address the rare event problem directly — what if we could predict a subset of instances in which the event of interest will definitely not occur? This would make the event more likely in the complementary set and hence mitigate the variance problem. Indeed, I was motivated to take such an approach several years ago when looking for ways to reduce the variability of Google’s search ad engagement metrics. While it is hard to predict in general how a user will react to the ads on any given search query, there are many queries where we can be sure how the user will react. For instance, if we don’t show any ads on a query then there can be no engagement. Since we did not show ads on a very large fraction of queries, I thought this approach appealing and did the following kind of analysis.

Let’s go back to our example of measuring the fraction of user sessions with purchase. Say we build a classifier to classify user sessions into two groups which we will call “dead” and “undead” to emphasize the importance of the rare purchase event to our business model. Suppose the classifier correctly assigns the label “undead” to $100\%$ of the sessions with purchase while correctly predicting “dead” for a fraction $\alpha$ of sessions without purchase. In other words, it has no false negatives and a false positive rate of $1-\alpha$. The idea is to condition on the two classes output by the classifier (again assume we are studying small experimental effects). Let $\theta$ be the fraction of sessions resulting in purchase. The question is how high must $\alpha$ be (as a function of $\theta$) in order to make a significant dent in the variance.

If $w$ is the fraction of sessions for which the classifier outputs “undead”, $w=\theta + (1-\alpha)(1- \theta)$. Meanwhile the conditional variance for the classes “dead” and “undead” are respectively $0$ and $\theta/w \cdot (1-\theta/w)$, leading to an average conditional variance of $w\cdot \theta/w \cdot (1-\theta/w)$. This is the variance remaining of the unconditioned variance $\theta (1-\theta)$. The fractional variance remaining is given by
\[
\frac{w \cdot \theta/w \cdot (1 - \theta/w)}{ \theta \cdot (1-\theta)} = \frac{1-\alpha}{1-\alpha+\alpha \theta}
\]
If the probability of a user session resulting in purchase is 2%, a classifier predicting 95% of non-purchase sessions (and all purchase sessions) would reduce variance by 28% (and hence CI widths by 15%). To get a 50% reduction in variance (29% smaller CIs) requires $\alpha \approx 1-\theta$, in this case 98%. Such accuracy seems difficult to achieve if the event of interest is rare.
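This calculation is also easy to script; a small Python check of the residual variance fraction formula, with the numbers used in the text:

```python
def residual_variance_fraction(alpha, theta):
    """Fraction of var Y remaining after conditioning on the dead/undead label.

    alpha: fraction of non-purchase sessions correctly labeled "dead";
    theta: overall purchase rate per session.
    """
    return (1 - alpha) / (1 - alpha + alpha * theta)

print(1 - residual_variance_fraction(0.95, 0.02))   # ~0.28 reduction in variance
print(1 - residual_variance_fraction(0.98, 0.02))   # ~0.50 reduction in variance
```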

Figure 4: Residual fraction of variance as a function of $\alpha$ when $\theta=2\%$.


Overlapping (a.k.a. factorial) experiments


For one last instance of variance reduction in live experiments, consider the case of overlapping experiments as described in [1]. The idea is that experiments are conducted in “layers” such that a treatment arm and its control will inhabit a single layer. Assignments to treatment and control across layers are independent, leading to what's known as a full factorial experiment. Take the simplest case of two experiments, one in each layer. Every experimental unit (every user session in our example) goes through either Treatment 1 or Control 1 and either Treatment 2 or Control 2. Assuming that the experimental effects are strictly additive, we can get a simple unbiased estimate of the effect of each experiment by ignoring what happens in the other experiment. This relies on the fact that units subject to Treatment 1 and Control 1 on average receive the same treatment in Layer 2.

The effects of Layer 2 on the experiment in Layer 1 (and vice versa) do indeed cancel on average, but not in any specific instance. Thus multiple experiment layers introduce additional sources of variability. A more sophisticated analysis could try to take into account the four possible combinations of treatment each user session actually received. Let $Y_i$ be the response measured on the $i$th user session. A standard statistical way to solve for the effects in a full factorial experiment design is via regression. The data set would have a row for each observation $Y_i$ and binary predictors indicating whether the observation went through the treatment arm of each experiment. When solved with an intercept term, the regression coefficients for the binary predictors are maximum likelihood estimates of the experiment effects under the assumption of additivity. We could do this, but in our big data world we would avoid materializing such an inefficient structure by reducing the regression to its sufficient statistics. Let 

\begin{align}
S_{00} &= \sum_{i \in C_1 \cap C_2} Y_i \\
S_{01} &= \sum_{i \in T_1 \cap C_2} Y_i \\
S_{10} &= \sum_{i \in C_1 \cap T_2} Y_i \\
S_{11} &= \sum_{i \in T_1 \cap T_2} Y_i
\end{align}
Similarly, let $N_{00}=| C_1 \cap C_2 |$ etc. The “regression” estimates of the effects of the two experiments are the solutions for $\beta_1$ and $\beta_2$ in the matrix equation
\[
\begin{bmatrix}
S_{00}\\
S_{01}\\
S_{10}\\
S_{11}
\end{bmatrix}
=\begin{bmatrix}
N_{00} & 0 & 0 \\
N_{01} & N_{01} & 0\\
N_{10} & 0 & N_{10}\\
N_{11} & N_{11} & N_{11}
\end{bmatrix}
\begin{bmatrix}
\beta_0\\
\beta_1\\
\beta_2
\end{bmatrix}
\]
In contrast, the simple estimator which ignores the other layers has estimates for Experiment 1
\[
\frac{S_{01} + S_{11}}{N_{01}+N_{11}} - \frac{S_{00} + S_{10}}{N_{00}+N_{10}}
\]
and Experiment 2
\[
\frac{S_{10} + S_{11}}{N_{10}+N_{11}} - \frac{S_{00} + S_{01}}{N_{00}+N_{01}}
\]
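For concreteness, here is a minimal Python sketch of both estimators computed from the sufficient statistics; the inputs `S` and `N` are hypothetical dicts keyed by the subscripts above, i.e. (layer-2 arm, layer-1 arm). The observation-level regression reduces to weighted least squares on the four cell means with weights equal to the cell counts, which is what the sketch solves:

```python
import numpy as np

def factorial_estimates(S, N):
    """Effect estimates for two overlapping experiments from sufficient stats.

    S[(l2, l1)] is the sum of Y and N[(l2, l1)] the count of sessions in the
    cell with layer-2 arm l2 and layer-1 arm l1 (0 = control, 1 = treatment).
    Returns the regression estimates (beta_1, beta_2) and the simple
    per-experiment estimates that ignore the other layer.
    """
    cells = [(0, 0), (0, 1), (1, 0), (1, 1)]
    X = np.array([[1.0, l1, l2] for (l2, l1) in cells])    # columns: b0, b1, b2
    means = np.array([S[c] / N[c] for c in cells])
    W = np.diag([float(N[c]) for c in cells])
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ means)   # WLS on cell means
    simple_1 = ((S[0, 1] + S[1, 1]) / (N[0, 1] + N[1, 1])
                - (S[0, 0] + S[1, 0]) / (N[0, 0] + N[1, 0]))
    simple_2 = ((S[1, 0] + S[1, 1]) / (N[1, 0] + N[1, 1])
                - (S[0, 0] + S[0, 1]) / (N[0, 0] + N[0, 1]))
    return {"regression": (beta[1], beta[2]), "simple": (simple_1, simple_2)}
```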

In this example, the regression estimator isn’t very hard to compute, but with several layers and tens of different arms in each layer, the combinatorics grow rapidly. All still doable but we’d like to know if this additional complexity is warranted.

At first blush this problem doesn’t resemble the variance reduction problems in the preceding sections. But if we assume for a moment that Experiment 2’s effect is known, we see that in estimating the effect of Experiment 1, Layer 2 is merely adding variance by having different means in each of its arms. Of course, the effect of Experiment 2 is not known and hence we need to solve simultaneous equations. But even if the effects of Experiment 2 were known, the additional variance due to Layer 2 would be very small unless Experiment 2 has a large effect. And by supposition, we are operating in the LSOS world of very small effects, certainly much smaller than the predictable effects within various slices (such as the 3x effect of hour of day we modeled earlier). Another way to say this is that knowing the effect of Experiment 2 doesn’t help us much in improving our estimate for Experiment 1 and vice versa. Thus we can analyze each experiment in isolation without much loss.


Conclusion


In this post, we continued our discussion from the previous post about experiment analysis in large scale online systems (LSOS). It is often the case that LSOS have metrics of interest based on observations with high coefficients of variation (CV). On the one hand, this means very small effect sizes may be of practical significance. On the other hand, this same high CV makes ineffective some statistical techniques to remove variability due to known or predictable effects.

A different way to frame this is that conditioning reduces measurement variability by removing the effects of imbalance in assigning experimental units within individual slices to treatment and control. This happens because sampling noise causes empirical fractions within slices to deviate from their expectation. At the same time, if the experiment metric is based on observations of high CV, we need to run experiments with a very large number of experimental units in order to obtain statistically significant results. The experiment size is therefore large enough to ensure that the empirical fractions within slices are unlikely to be far from their expectation. In other words, the law of large numbers leaves little imbalance to be corrected by conditioning. All this is just a consequence of conducting sufficiently powerful experiments in a data-rich, information-poor setting.

We conclude this post with two important caveats. First, not all LSOS metrics rely on observations with high CV. For instance, an online music streaming site may be interested in tracking the average listening time per session as an important metric of user engagement. There is no reason a priori to think that listening time per session will have high CV. If the CV is small, the very techniques discussed in this post may improve experiment analysis significantly. Second, this post has explored only a certain class of approaches to reduce measurement variability. Even in the high CV regime, there are techniques which can prove effective though they may require changes to the experiment itself. In a future post we hope to cover the powerful idea of what at Google we call “experiment counterfactuals”.

Until then, if we are operating in a regime where clever experiment analysis doesn’t help, we are freed from having to worry about it. A burden has been lifted. In a way, I find that comforting!


References


[1] Diane Tang, Ashish Agarwal, Deirdre O’Brien, Mike Meyer, “Overlapping Experiment Infrastructure: More, Better, Faster Experimentation”, Proceedings 16th Conference on Knowledge Discovery and Data Mining, Washington, DC