By JEAN STEINER

Randomized A/B experiments are the gold standard for estimating causal effects. The analysis can be straightforward, especially when it's safe to assume that individual observations of an outcome measure are independent. However, this is not always the case. When observations are not independent, an analysis that assumes independence can lead us to believe that effects are significant when they actually aren't. This blog post explores how this problem arises in applications at Google, and how we can 'mind our units' through analysis and logging.

## A tale of two units

Imagine you are interested in the effectiveness of a new mathematics curriculum. You might run an experiment in which some students are exposed to the new curriculum while others are taught based on the existing curriculum. At the end of the semester, you administer a proficiency test to the students and compare test performance across the groups. For example, you might compare the average test score among students exposed to the new curriculum to the average score among the controls.

Sounds straightforward, right? So, why is this a tale about 'two units'? The first unit is the unit of observation. The outcome measure we care about is an average of the students' test scores, and so the unit of observation is a student. There is one test score per student, and each student was exposed to either the new or the old curriculum. But, can we treat all of the test scores from all of the students as independent observations?

To figure this out, let's consider an appropriate experimental design. To evaluate a curriculum, we will need many teachers to use it, and we will also want many students to be exposed to each curriculum. Since teachers can only teach from a single curriculum for each semester, we will need to ensure that each teacher is assigned to either the new curriculum or the old curriculum. This means that we'll need to assign treatment status, or randomize, at the level of the teacher. In other words, the teacher is our second kind of unit, the unit of experimentation.

This type of experimental design is known as a group-randomized or cluster-randomized trial. As the name suggests, in a group-randomized trial randomization occurs at the group level. In the example above, randomization occurs at the level of a teacher rather than at the student level. Since each teacher works with a group of students, the entire group of a teacher's students is exposed to either the new or the old curriculum.

With this experimental design in mind, let's revisit this question: can we treat all of the students' test scores as independent observations? The answer is no -- there are at least two features of classroom dynamics that prevent us from treating test scores from students in the same classroom as independent observations.

First, there will generally be a teacher effect: when a teacher is really good, their students will likely do a bit better than when a teacher performs less well, and this should hold regardless of the curriculum. The second source of dependence comes from student interactions. Students within the same classroom will likely talk to each other and work with each other: some students might understand the material better and then teach their classmates, or, in the negative scenario, a classroom with a real trouble-maker might impede the progress of several classmates by distracting them.

In summary, there is likely dependence among observations that come from students with the same teacher. When analyzing the outcome measure (e.g., student test scores) in a group-randomized trial, it is critical to account for the group structure (or the experimental unit). Ignoring the group structure would underestimate the variance and could lead to false positive findings.

One typical modelling approach to account for the group structure is a mixed-level model with random effects. This is a comprehensive approach, and we often employ it at Google. However, we also employ alternative approaches whose conceptual simplicity can be a practical advantage.

Let's begin by considering a typical online experiment with a group-randomized structure. Suppose we want to evaluate a proposed change to the user interface for web search. We want an experimental set-up that is consistent with what would happen if we actually launched the new interface so that our experiment will lead us to make relevant conclusions. Additionally, we want to ensure users have a great search experience even as we're trying out new features, so we don't want users to keep seeing new interfaces every time they do a new search. To satisfy both of these experimental design requirements, the best experimental unit is the user. Randomizing at the level of user (or cookie, which is often used as a proxy for a user) will ensure that each user sees the same interface every time she comes to the search page. For those familiar with causal inference, our design is said to satisfy the stable unit treatment value assumption, or SUTVA. This assumption says that the potential outcomes in one user are independent of the treatment status of all other users and that there are no hidden variations in treatment.

Examples of outcome measures might include the click-through-rate or the time-to-first-action, an indication of how quickly users found the information they wanted. This means the unit of observation is an individual search result page. Some users will do many searches, some will do only a handful. The observations that come from the same user are clearly not independent. For example, some users will be faster readers or 'more clicky' than others.

At this point, the group-randomized framework should be evident in the example of evaluating a new search interface: the 'group' is the user, and the 'unit of observation' is at the level of each search that a user conducts.

This example also hints at a key difference between policy evaluations and online experiments; namely, the size of the study. Typically, in a policy evaluation setting there are relatively few groups, often on the order of hundreds or fewer per trial. By contrast, in an online setting, the number of groups can range from thousands to billions. As we will see, this informs the way we account for the group structure when performing causal inference at Google.

## The perils of incorrect units

Is the idea of 'minding our units' just some esoteric issue, or can this actually hurt us in practice? Let's use a simple synthetic dataset to look at the effects of accounting for (or ignoring!) an existing group structure. We generate observations $Y_{ij}$ that correspond to the $i^\textrm{th}$ observation for the $j^\textrm{th}$ group.

The simulation set-up is as follows: We fix the number of groups to be $N = 50,\!000$, then we generate observations for each of the groups, with group sizes governed by the Poisson parameter $\lambda$:
• Draw the group mean, $m_j$, for the $j^\textrm{th}$ group with $m_j \sim N(0, 1)$.
• For $\lambda$ in (0.0, 0.2, ..., 1.2):
• Determine the size of the $j^\textrm{th}$ group, $n_j$, by drawing from $\textrm{Poiss}(\lambda) + 1$.
• Generate observations for the $j^\textrm{th}$ group: for each group $j$ in 1, ..., $N$, generate $n_j$ individual observations $Y_{ij} \sim N(m_j, 0.25)$, for $i = 1,..., n_j$ (where $n_j$ was the group size, and $m_j$ was the group mean).
• Calculate 95% bootstrap confidence intervals with two alternative resampling schemes:
• Ignoring the group structure by resampling each observation $Y_{ij}$.
• Accounting for the group structure by resampling each $j^\textrm{th}$ group.
It is worth noting that $\lambda$ is the only parameter that governs the overall sample size (which is $\Sigma n_j$ and will grow as $\lambda$ grows). However, the total number of independent groups is fixed at $N = 50,\!000$ throughout. Note also that in this simulation the within-group standard deviation (0.25) is much smaller than the between-group standard deviation (1.0), so we should not expect the overall sample standard deviation to respond directly to changes in the overall sample size.

We've implemented the above simulation in a simple R script included at the bottom of this post. We ran the simulation once to illustrate what's going on (Table 1) and then repeated the process 1,000 times to demonstrate how coverage is affected (Figure 1).

In Table 1, the first column gives the Poisson parameter $\lambda$, which governs the group sizes. The second column gives the percent of groups that have more than one observation per group, and the 3rd column gives the total number of observations across the 50,000 groups. The 4th and 5th columns give the half-width of the confidence interval (CI) by ignoring and accounting for the group structure.

 𝜆 % groups with more than one observation # observations CI half-width (a) CI half-width (b) 0.0 0% 50,000 0.009 0.009 0.2 18% 59,887 0.008 0.010 0.4 33% 70,154 0.007 0.010 0.6 45% 79,920 0.007 0.010 0.8 55% 89,699 0.006 0.010 1.0 63% 100,070 0.006 0.010 1.2 70% 109,838 0.006 0.010

Table 1: Output of the simulation showing CIs that either (a) ignore or (b) account for within-group dependence.

You can see that as you read down the table, $\lambda$ increases and the groups become increasingly heterogeneous (more groups contribute multiple observations per group). Of course, when $\lambda = 0$, each group contains only one observation, so there is no dependence to account for, but it is included for reference. By comparing the 4th and 5th column it is clear that as the groups become more heterogeneous, the discrepancy between the two CIs increases.

For example, in the third to last row, when $\lambda = 0.8$ and 55% of the groups have more than one observation, the CI when we (mistakenly) treat each observation as independent is 40% narrower than the CI when we (correctly) account for the true group structure (compare 0.006 to 0.010). Even when we have a less extreme example such as the second row, with $\lambda = 0.2$ and a mere 18% of groups having multiple observations per group, there is already a noticeable difference in the CI widths.

While Table 1 indicates why ignoring the group structure can give incorrect and excessively narrow CIs, it is even more instructive to look at the actual coverage of the two estimators. Coverage is the probability the CI computed by a particular method contains (or 'covers') the true value of the estimand, which is an effect of 0 in this simulation. The figure below shows empirical coverage rates based on 1,000 simulations. The red bars show how often the CI contained the true value of the estimand when we ignore the group structure. Contrast this with the cyan-colored bars, which show coverage of the CIs in which we properly account for the group structure. Even at values as low as $\lambda = 0.2$, the coverage of the CI that ignores the group structure drops to 90%, while the CI that accounts for the group structure keeps its nominal coverage of 95% (blue dotted line), in line with the desired false-positive rate of 5%. By the time $\lambda$ is 1.2, ignoring the group structure reduces coverage to 77%, showing how ignoring the group structure leads to optimistic inferences (i.e., CIs which are too narrow).

Figure 1: Coverage based on 1,000 simulations in which we either ignore (red) or account for (cyan) within-group dependence. The x-axis shows the average group size minus 1. The plot shows how ignoring the group structure (red) leads to increasingly optimistic inferences (i.e., too narrow CIs) while accounting for the group structure leads to nominal coverage regardless of group size (cyan).

## How do we mind our units in analyses at Google?

The above simulation already hints at one of our approaches to incorporating the group structure in some analyses at Google. Since we often want to compute effects on hundreds of outcome measures and since their underlying distributions can be funky, we frequently resort to non-parametric methods such as the bootstrap. A common approach, then, to minding our units is to use the bootstrap as in the above simulation: we re-sample at the group level in order to account for the group structure.

Another non-parametric method frequently used at Google is the jackknife. In order to account for the group structure, we use a blocked version of the jackknife in order to ensure that all observations from the same group are either left-out or included together.

When we're feeling parametric, we may also use the mixed-level model approach as in the policy evaluation setting. An earlier blogpost described applications and implementation details of mixed-level modeling in the 'big data' setting at Google.

## What if we don't (or can't) know our groups?

The blocked resampling strategies described above, as well as random-effects modelling, require that we have access to the independent unit or experimental unit, that is, that we know the group_id at analysis time. However, we might not be able to store all observations for each group_id. In cases like this, we employ an alternative solution: we use logging and intermediate data storage to capture the necessary information about the group structure.

When possible, we begin by including the group_id in our stored data. We then store the full dataset in a table like the one shown below.

 group_id value 1325 0.1 1325 1.4 2347 0.5 2347 0.4 7825 0.9 ... ...

Table 2: Individual observations for each group_id.

If we have the full data of Table 2, we can go ahead and apply the analysis methods described in the preceding section, e.g., bootstrap re-sampling the group_ids. However, when we can't store the group_id, we store partly aggregated data, which we call bucketed data. To generate bucketed data, all of the observations associated with the same group_id will go into the same bucket. To illustrate a simple bucketing scheme, compute the buckets by:

bucket = (group_id mod 100).

Now store data for each bucket:

 bucket SUM(value) # data points in bucket 25 2.4 3 47 0.9 2 ... ... ...

Table 3: Bucketed observations (after associating each group_id with a bucket).

This bucketed data is no longer amenable to the group_id level bootstrap because we have aggregated observations and can no longer generate a full sampling distribution. However, we can still use methods that work on the bucketed data (including bootstrapping the bucket level data to get an estimate of the variability). One approach used at Google is the streaming bucket approach described in section 2.1 of this report by Nicholas Chamandy and colleagues, in which the variability of each bucket's estimate is used to approximate the overall variability of the estimator.

Bucketed data is also amenable to the jackknife, which we can use in a leave-one-bucket-out fashion. The bucketed implementation of the jackknife is popular at Google because it provides a good trade-off between storage needs and statistical coverage: we can slice and deliver results on the fly across many dimensions. At the same time, compared to the streaming bucket approach and other alternatives considered, we have found that the bucketed jackknife has better coverage properties for ill-behaved metrics such as rare rate metrics.

It is worth taking a small tangent to make a note of caution on how you bucket the data. The simple bucketing scheme illustrated above, (group_id mod 100), is a bad idea in practice. If there is anything systematic about how the group_id's are generated, however subtle, then a simple modulus by itself fails to break this systematic behavior. This means we would retain systematic differences across buckets. For example, if the id's had been generated in a way to account for load balancing of some database, you can end up with simple buckets having strange periodicity to them. In other words, it's important to ensure that the group_id's are randomly distributed to buckets. We also want the bucket assignment to be consistent (say, across time), so that we can aggregate the data in various ways. One approach to having consistent and randomly distributed assignment is to use a salt (such as the experiment name) together with a hash of the group_id prior to taking the modulus. That is, if you wanted 100 buckets you could compute

bucket = (hash(group_id + salt) mod 100).

It is also worth noting that for many of our applications at Google we only use 20 buckets, which we have found to give good coverage while allowing for storage and processing efficiency. It is important to use enough buckets to ensure that your CIs have the right coverage properties, and remember that applications with less data will likely need to use more buckets.

## Not all observations created equally

The basic bucketing method assumes that we care equally about the contribution from each observation. However, from a product perspective we sometimes care about some observations more than others. Suppose for example we have an onboarding flow in which users create new accounts. Most of the time, each user will only create one account, but some power users might open up multiple accounts. From a product standpoint, we may want to focus only on analyzing the first account created by each user because the onboarding is really designed to onboard the new users, and when a user is returning to make a subsequent account, they likely don't need the onboarding. We also often see that power users exhibit different behaviors from the regular one-time users.

In light of such considerations, instead of fully accounting for the group structure of these power users, we might want to separately analyze the first observation per group versus any subsequent observations. One approach might be to only log data for the first observation, but it's dangerous to lose visibility into a sizable chunk of data, and we generally would still want to be able to summarize all of the data and use the totals to sanity check our results and product health. In this type of situation, we would implement a counter that will log whether the group_id is seen for the first time or returning. In other words, we are adding another dimension to the data which we can filter by later.

Once the data is tagged with this additional dimension, we can store a combination of the bucket as well as whether the observations occur on the first instance or subsequent. For example, assuming the data in Table 2 was sorted by time, we would store:

 bucket counter SUM(value) # data points in bucket 25 first 1 2 25 subsequent 1.4 1 47 first 0.5 1 47 subsequent 0.4 1 ... ... ... ...

Table 4: Bucketed data that enables separate analysis for first observations.

While this approach requires more data storage than the straight bucketed data approach, it allows us to compute the metrics that are most relevant for evaluating the product. Again, we can apply the  jackknife, but now to the subset of the data where the counter value is first or subsequent. In other words, we have filtered our data by a dimension value. Similarly, you can see that you can add any filter into the bucketed data approach (as long as the filtering field is available in the raw data logs).

In summary, there are many different ways to account for the group structure when the experimental unit differs from the unit of observation. Regardless of how you do it, do remember to mind your units.

#### Acknowledgments

Many thanks to Dancsi Percival for his work on the impact of group-size heterogeneity on confidence intervals and to the many colleagues at Google who have evolved many of the ideas and tools described in this post.

## Appendix: R code

ComputeBootstrapMeanCI <- function(data, data.grouping, num.replicates = 2000) {
# Function to calculate confidence intervals for the mean via bootstrap.
# Must pass in a grouping variable to do the sampling on the grouping units.
# For bootstrap, we sample with replacement and each replicate is same size
# as original data; 95% CI obtained empirically by taking percentiles.
#
# Args:
#   data:           Numeric vector of data.
#   data.grouping:  Numeric vector of grouping variable for data. Must be same
#                   the same length as data.
#   num.replicates: Number of replicates.
#
# Returns:
#   Half-width of a 95% CI.

# Split the data by the grouping variable.
assertthat::assert_that(length(data) == length(data.grouping))
data.split <- split(data, data.grouping)

# Create replicates.
n.groups <- length(unique(data.grouping))
data.samples <-
replicate(num.replicates,
unlist(sample(data.split, replace = TRUE, size = n.groups)),
simplify = FALSE)

# Get sampling distribution and summarize empirically.
bootstrap.means <- vapply(data.samples, mean, 1)
half.width <- unname(diff(quantile(bootstrap.means, c(0.025, 0.975))) / 2)
return(half.width)
}

set.seed(1237)

kNumGroups <- 50000  # Total number of groups for all simulations.
kGroupSd <- 1  # This is the standard deviation for group level means.
kNumReplicates <- 2000  # Number of replicates for bootstrap.
kWithinGroupShrink <- 0.25  # Shrinks standard dev. of w/in group observations.
kLambdaVector <- seq(0, 1.2, by = 0.2)  # Poisson parameter for group size.

# Generate the mean for each group and use this for all runs of simulation.
group.means <- rnorm(kNumGroups, 0, kGroupSd)

# Loop over simulation parameters and store results in a summary data frame.
summary.df <- data.frame()
for (param in kLambdaVector) {
group.sizes <- rpois(kNumGroups, param) + 1

# Make data frame with group size & group mean for ease of computation.
group.df <- data.frame(id = 1:kNumGroups, group.means, group.sizes)
group.info <- split(group.df, group.df$id) # Helper function to generate per-group observations. SimulateGroupObservations <- function(data) { rnorm(data$group.sizes, data$group.means, kWithinGroupShrink * kGroupSd) } # Generate vector of per-group observations for all groups. full.obs <- unlist(sapply(group.info, SimulateGroupObservations)) # Compute CI pretending that each observation is independent. ignore.group.est.half.width <- ComputeBootstrapMeanCI(full.obs, 1:length(full.obs), kNumReplicates) # Get a vector of group id's to incorporate dependence structure. group.id <- unlist(sapply(group.info, function(x) rep(x$id, x\$group.sizes)))

# Compute CI accounting for the group structure.
include.group.est.half.width <-
ComputeBootstrapMeanCI(full.obs, group.id, kNumReplicates)

# Formulate data frame to summarize results.
my.row <- data.frame(lambda = param,
n.multigroups = sum(group.sizes > 1) / kNumGroups,
n.observations = sum(group.sizes),
ignore.group.est.half.width,
include.group.est.half.width)
summary.df <- rbind(summary.df, my.row)
}