Compliance bias in mobile experiments

by DANIEL PERCIVAL

Randomized experiments are invaluable in making product decisions, including on mobile apps. But what if users don't immediately take up the new experimental version? What if their uptake rate is not uniform? We'd like to be able to make decisions without having to wait for the long tail of users to experience the treatment to which they have been assigned. This blog post details how we can make inferences without waiting for complete uptake.

Background

At Google, experimentation is an invaluable tool for making decisions and inference about new products and features. An experimenter, once their candidate product change is ready for testing, often needs only to write a few lines of configuration code to begin an experiment. Ready-made systems then perform standardized analyses on their work, giving a common and repeatable method of decision making. This process operates well under ideal conditions; in those applications where this process makes optimistic or unrealistic assumptions, data scientists must creep from the shadows and provide new approaches. In the Google Play app store, challenges such as these regularly occur, as the basic framework of mobile technology introduces many wrinkles in experimentation and measurement. The Play Data Science team works to develop appropriate approaches to these cases and develops reusable methodologies to broaden the capabilities of Google Play experimenters at large.

When we set a treatment and a control group, we typically assume that, instantaneously, each unit within the treatment experiences the treatment condition, and each unit within the control instead gets some manner of baseline condition. This assumption of total compliance allows us to make strong inferences about the impact of the treatment condition. However, there are common cases where this assumption could be broken, most notably in mobile. Naturally, this issue is of particular concern to us in the Play Data Science team.

Suppose a developer of a game wants to release a new major version of the game. They have a lot of users already playing their game, and the developer is eager to see what effect the update will have on them. They have a clever thought: run an experiment, offering some users the new version with a popup notification within the game, and not telling the rest about it. Then, by comparing the two groups, they can see exactly what benefits (and problems!) they get with the new version. Once they proceed with this, they immediately run into problems. First, as soon as they offer the version, not everybody agrees to upgrade. Some users immediately rush to upgrade; others are slower because they rarely play the game, or have some technical trouble and cannot install the update until days or weeks after the offer is sent. Some users even refuse to update, being happy with the current version and averse to change. Even worse, some users in the control group manage to get hold of the new version. Clearly, drawing conclusions from this messy experiment will require more sophisticated analysis.

Abstracting this problem a bit, the total compliance assumption breaks down here --- some units assigned the treatment do not receive it immediately. Instead, a non-random subset of units receives the treatment, where the membership of a unit within this set is a function of covariates and time. In our example, users who rarely interact with the game will likely adopt the treatment more slowly, causing them to be underrepresented in the set of treated users in comparison to the population of users. Further, units in the control group may manage to receive the treatment, despite the assignment setup. In our example, this corresponds to sideloading, when a user obtains the new version of the game despite it not being offered.

Given this situation, a natural response is this: why not just wait until all the users in the treatment have upgraded? Even if we ignore the sideloaders (users in the control who take the treatment), we can’t always take this path. Though sometimes we can wait until most users have completed the update to draw conclusions, we usually want to make inference quickly. For example, we might want to stop the process if we measure harmful effects early. We also might want to use early data to make a decision whether or not to release the treatment to more users, giving us more measurement power later on. In these cases, we must address the bias on the early data, when only a fraction of users who can eventually update their game have done so, and those that have are a non-random subset of the population. For example, perhaps users who have been historically more engaged with the game will update first.

To summarize the rest of this post, we first characterize the main issues with these sorts of experiments. We then build our methods from a simple baseline method: intent-to-treat analysis. Finally, we expand our methods to account for which users are actually treated, and to use the control group to match users with similar upgrade probabilities.

Characterizing the main issues

The analysis here focuses on an experiment where a set of users is randomly assigned to one of two experiment arms: treatment or control. For each user $i$ we observe a set of covariates, or user features, $X_i$ and an associated response of interest $Y_i$. In our introductory example, this response might be the number of times a user opens the game each day. User features might be the country, device quality, and a bucketed measure of how often the user has played the game in a predefined period prior to the experiment. We are interested in measuring the effect of the treatment on our response $Y_i$. We make the simplifying assumption that the impact of the treatment is instantaneous and does not further evolve over time. From here, the situation is complicated in a few ways, which we now explore in detail.

Issue 1: the empirical mismatch between treatment assignment and experience

Following our running example, after the game developer begins their experimental release they immediately observe that not all users assigned the treatment (the new game version) actually experience the treatment. Further, some users in the control group manage to experience the treatment. That is, users may have an actual experience that differs from their experiment arm assignment. To index these situations, we introduce two binary variables as follows:

• $Q$ indexes the user’s experiment arm assignment:
    • $Q=0$: the treatment is not assigned to the user; that is, they are in the control group
    • $Q=1$: the treatment is assigned to the user
• $U$ indexes the treatment experience:
    • $U=0$: user does not experience the treatment condition
    • $U=1$: user experiences the treatment

We can track the users in the treatment group $(Q=1)$ to see how many are actually experiencing the treatment ($U=1$) at any given time. We can then produce graphs of this upgrade percentage over time similar to the following:

 Fig 1: Only a fraction of users in the treatment group adhere to the treatment. Some take a significant time to adhere.
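A curve like the one in Fig 1 can be computed directly from a log of upgrade times. The following is a minimal sketch, assuming a hypothetical list `upgrade_day` that holds, for each user assigned to the treatment, the day on which they upgraded (or `None` if no upgrade has been observed). The function name and data layout are illustrative assumptions, not a description of any production pipeline.

```python
def adoption_curve(upgrade_day, horizon):
    """Fraction of treatment-group users (Q=1) with U=1 by the end of each day.

    upgrade_day: one entry per user assigned to treatment; the 0-based day
        on which the user upgraded, or None if no upgrade was observed.
    horizon: number of days to report.
    """
    n = len(upgrade_day)
    if n == 0:
        raise ValueError("need at least one user in the treatment group")
    curve = []
    for day in range(horizon):
        # Count users whose upgrade happened on or before this day.
        upgraded = sum(1 for d in upgrade_day if d is not None and d <= day)
        curve.append(upgraded / n)
    return curve


# Hypothetical data: five users, two of whom never upgrade in the window.
print(adoption_curve([0, 0, 2, None, None], horizon=4))
```

Because the denominator is everyone assigned to the treatment, the curve plateaus below 1 whenever some users never upgrade.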

Issue 2: the users experiencing treatment are not a simple random subsample of the population

A natural question follows from the graph of user treatment adoption: which users are actually receiving the treatment here? Typical factors that could influence update speed might be system properties like connectivity (better means faster upgrade), hardware quality (higher means faster), or core operating system software. More user-based covariates could be factors like country or current frequency of use of the product. From the point of view of making valid statistical inference, we can assess whether the mix of units who have actually received the treatment reflects the overall population. Loosely put, the degree to which the treated units represent a simple random sample of the population has implications for our ability to draw generalizable conclusions from our experiment. One way to assess this is to compare the set of treated units within the treatment group ($Q=1$; $U=1$) to all units assigned to the treatment ($Q=1$). Since units satisfying $Q=1$ are a random subset of the overall population, this comparison will give us hints as to how effectively we can generalize our conclusions.

To take our example, suppose we are able to measure how engaged each user has been with the game over the past month. We then can bucket these data into six groups, giving a spectrum of engagement. We can then compare the distribution over these buckets for users with the new version to the entire treatment group:

 Fig 2: Treatment units experiencing treatment are rather different from the population as a whole. This is shown here for one particular dimension, usage.

It is clear from this plot that the two distributions do not match. Further, we see some expected discrepancies: users who have been more highly engaged in the game are more likely to actually experience the treatment. Perhaps they are more enthusiastic to upgrade, or perhaps this covariate is correlated with other more impactful covariates such as the quality of the user’s hardware. In any case, it is clear that we cannot make inference without some caution in this case.
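The comparison behind Fig 2 amounts to computing bucket shares twice: once over everyone assigned to the treatment, and once over the subset that has actually experienced it. Here is a small sketch of that calculation. The function names and the bucket labels in the usage lines are hypothetical, chosen only for illustration.

```python
from collections import Counter


def bucket_shares(buckets):
    """Return {bucket: share of users} for a list of bucket labels."""
    counts = Counter(buckets)
    total = len(buckets)
    return {b: c / total for b, c in counts.items()}


def compare_treated_to_assigned(engagement_bucket, experienced):
    """Compare bucket shares for (Q=1, U=1) users against all Q=1 users.

    engagement_bucket: bucket label for each treatment-group user.
    experienced: matching list of U values (0 or 1).
    Returns {bucket: (share among all Q=1, share among Q=1 and U=1)}.
    """
    all_shares = bucket_shares(engagement_bucket)
    treated = [b for b, u in zip(engagement_bucket, experienced) if u == 1]
    treated_shares = bucket_shares(treated) if treated else {}
    return {b: (all_shares[b], treated_shares.get(b, 0.0))
            for b in sorted(all_shares)}


# Hypothetical example: only highly engaged users have upgraded so far.
buckets = ["low", "low", "high", "high"]
u = [0, 0, 1, 1]
print(compare_treated_to_assigned(buckets, u))
```

A large gap between the two shares in any bucket is a sign that the treated users are not a simple random sample of the treatment group along that dimension.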

Issue 3: the need to make a timely decision

Following our running example, after the game developer begins their experimental release they want to draw conclusions about its impact as soon as possible. As mentioned previously, we could wait to do the analysis until all the users who may eventually upgrade have actually done so, but this is typically impractical. However, the strategy of waiting until all users who will comply have received the treatment is useful as a ground truth for evaluating the methods that follow. For each method, we can compare the conclusions we can draw at the beginning of the process to those we would get at the end. We can then choose the methods where these two conclusions match, or at least where the early conclusion gives a useful, more nuanced view.

We adopt the following notation and assumptions to make the time component of the problem clear:
• We build our models at two time points: $t=T_{\mathrm{measure}}$, and $t=T_{\mathrm{final}}$.
• $T_{\mathrm{measure}}$ represents a time point at an early stage, where we would do our experiment analysis in a ‘real’ situation.
• $T_{\mathrm{final}}$ represents a later time point, where virtually all of the users who eventually may upgrade have done so, and is used to benchmark the performance of the models and estimates produced at $T_{\mathrm{measure}}$
• We assume that the effect of the treatment does not evolve over time. This allows us to compare the two results directly. This is often an unrealistic assumption in applications. For example, users of a game will probably behave differently across days of the week, and their behavior will evolve as they learn the game through experience. We leave it aside in this post so we can clearly explain and explore the remaining issues.
We can visualize some typical values of $T_{\mathrm{measure}}$ and $T_{\mathrm{final}}$ by annotating the adoption figure given above as follows:

 Fig 3: The above figure illustrates a typical case: we can afford to wait a short amount of time for a reasonable percentage of users to adopt before doing analysis ($T_{\mathrm{measure}}$). In order to get more comprehensive adoption ($T_{\mathrm{final}}$), we would have to wait a significantly longer time.

Intent to Treat (ITT) and Treatment on the Treated (TOT) analysis

With the main issues characterized, we now turn to analysis methods for the application. Before we begin, we should state our overall goal estimand, which is the expected effect of the treatment on the average single unit: $$\theta = E(Y|U=1) - E(Y|U=0)$$ In the language of counterfactuals, this gives the difference between the outcomes under the two different treatment conditions. In a simple experiment, we assume that the random assignment to experiment arms is enough to give us a reasonable estimate of this quantity from standard methods, which rely on each group being a simple random sample from the population over which we wish to make inference.

A simple baseline to analyze this kind of experiment is an Intent to Treat (ITT) approach. The intent to treat estimand measures the effect of assigning the treatment to a user:$$\theta_{\mathrm{ITT}} = E(Y | Q=1) - E(Y|Q=0)$$ That is, we simply compare the two experiment groups on the basis of their treatment status assignment, rather than their actual experience of the treatment. To estimate this quantity, we could compare the mean value of a metric $Y$ between treatment and control in the standard way:$$\hat{\theta}_{\mathrm{ITT}} = \frac{1}{N_{Q=1}} \sum_i Y_i[Q_i = 1] - \frac{1}{N_{Q=0}} \sum_j Y_j[Q_j = 0]$$ where $Y_i$ refers to the measured outcome for a single unit, and $N_{Q=1}$ and $N_{Q=0}$ are the number of units measured in the treatment and control group, respectively. $[ \cdot]$ evaluates to $0$ or $1$ depending on whether the boolean expression within evaluates to false or true (Iverson bracket notation).
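The ITT estimator is just a difference of group means keyed on assignment. As a minimal sketch, assuming outcomes and assignments arrive as two parallel lists `y` and `q` (hypothetical names for this illustration):

```python
def mean(values):
    """Arithmetic mean; raises on an empty group rather than returning NaN."""
    values = list(values)
    if not values:
        raise ValueError("empty group")
    return sum(values) / len(values)


def itt_estimate(y, q):
    """Mean outcome of units assigned treatment (Q=1) minus mean outcome of
    units assigned control (Q=0), ignoring what each unit experienced."""
    treatment = [yi for yi, qi in zip(y, q) if qi == 1]
    control = [yi for yi, qi in zip(y, q) if qi == 0]
    return mean(treatment) - mean(control)


# Hypothetical outcomes: two treatment users, two control users.
print(itt_estimate([3, 5, 1, 1], [1, 1, 0, 0]))
```

Note that $U$ does not appear anywhere in the computation: a treatment-group user who never upgraded contributes their outcome to the treatment mean just the same.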

This method has a few weaknesses we can anticipate from the previous general assessment of the treatment group. A basic issue is that since many units in the treatment group do not actually experience the treatment, we would expect no impact of the treatment on such units. This would effectively shrink our estimates of the effect towards a null point. To refine the analysis, we could focus on estimating the effect of the Treatment on the Treated (TOT). That is, we compare only the units in the treatment group that actually received the treatment to the entire control group. We could adjust for this in a simple way by scaling our estimated effect by the inverse of the fraction of impacted units as follows:$$\hat{\theta}^*_{\mathrm{TOT}} = \left( \frac{1}{N_{Q=1}} \sum_i Y_i [Q_i=1] - \frac{1}{N_{Q=0}} \sum_j Y_j [Q_j=0] \right) \frac{N_{Q=1}}{N_{Q=1, U=1}}$$ Here, we introduce the additional notation $N_{Q=1, U=1}$, which represents the number of units that received the treatment within the treatment group. This estimator adjusts for the gross fraction of users who actually receive the treatment. A more direct estimator of the TOT effect slices the treatment group:$$\hat{\theta}_{\mathrm{TOT}} = \frac{1}{N_{Q=1,U=1}} \sum_i Y_i [Q_i=1 \cap U_i=1] - \frac{1}{N_{Q=0}} \sum_j Y_j [Q_j=0]$$ We can now do our first evaluation of these methods, by comparing ITT and TOT estimates computed at the beginning ($T_{\mathrm{measure}}$) and end ($T_{\mathrm{final}}$) of the observation period. The following figure displays these results:

 Fig 4: How the ITT and TOT estimates evolve over time.

Here, we see that the ITT method performs poorly; it estimates quite a different effect at the beginning ($t=T_{\mathrm{measure}}$) than at the end ($t=T_{\mathrm{final}}$) of the period. This is likely because the number of users experiencing the treatment is increasing over time, and so the earlier estimate is shrunk more strongly towards zero. The TOT method performs somewhat better in terms of stability, but the estimate declines between the two time points. The differences between the distribution of users experiencing the treatment and the population are likely to be a key factor here.
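To make the two TOT estimators concrete, here is a small sketch of both. It assumes parallel lists `y`, `q`, and `u` (hypothetical names) holding each unit's outcome, assignment, and treatment experience at the time of analysis.

```python
def _mean(values):
    values = list(values)
    if not values:
        raise ValueError("empty group")
    return sum(values) / len(values)


def tot_scaled(y, q, u):
    """ITT estimate rescaled by N_{Q=1} / N_{Q=1,U=1}."""
    y_treat = [yi for yi, qi in zip(y, q) if qi == 1]
    y_control = [yi for yi, qi in zip(y, q) if qi == 0]
    n_treated = sum(1 for qi, ui in zip(q, u) if qi == 1 and ui == 1)
    if n_treated == 0:
        raise ValueError("no treatment-group unit has experienced the treatment")
    itt = _mean(y_treat) - _mean(y_control)
    return itt * len(y_treat) / n_treated


def tot_direct(y, q, u):
    """Mean outcome of (Q=1, U=1) units minus mean outcome of all Q=0 units."""
    y_treated = [yi for yi, qi, ui in zip(y, q, u) if qi == 1 and ui == 1]
    y_control = [yi for yi, qi in zip(y, q) if qi == 0]
    return _mean(y_treated) - _mean(y_control)


# Hypothetical data: one of two treatment users has upgraded.
q = [1, 1, 0, 0]
u = [1, 0, 0, 0]
print(tot_scaled([4, 1, 1, 1], q, u), tot_direct([4, 1, 1, 1], q, u))
print(tot_scaled([4, 2, 1, 1], q, u), tot_direct([4, 2, 1, 1], q, u))
```

In the first call the not-yet-upgraded treatment user has the same outcome as the control users, and the two estimators agree. In the second call that user differs from the control mean, and the estimators diverge. Neither version corrects for treated users being a non-random subset.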

Indeed, neither of these estimators estimates the treatment effect for all users well if the set of treated users is not a random subset of the population, and if the covariates that differ between this subset and the population are also correlated with our outcome and the treatment effect. From our earlier analysis, comparing the distribution over just one categorical covariate already shows that the treated users are not a random subset. To proceed, we need to characterize more fully the types of units that are more likely to adopt the treatment at an earlier stage in our experiment.

Compliance Bias

A central issue in this application is that users assigned treatment sometimes do not actually experience the treatment at $T_{\mathrm{measure}}$, and furthermore this set of users is not random. Here, we can draw a direct analogy to Compliance Bias, which is primarily described in literature on the analysis of medical studies. This type of bias can occur when users do not adhere to their assignment to an intervention plan, for example when patients with less acute disease symptoms more often refuse to take a drug they were given.

To make this issue precise, we expand our language of potential outcomes, and introduce a set of four potential outcomes for each unit accounting for both availability of the treatment and the actual application thereof. To index these situations, we combine our two binary variables $Q$ for assignment and $U$ for treatment experience to give the following table of potential outcomes:

| Potential outcome index $(Q, U)$ | Description |
|---|---|
| $(0, 0)$ | Control group user not experiencing treatment |
| $(0, 1)$ | Control group user experiencing the treatment |
| $(1, 0)$ | Treatment user not experiencing treatment |
| $(1, 1)$ | Treatment user experiencing treatment |

In the context of this application, we have already roughly explored these rows. Users in the control are expected to behave as $(0, 0)$ indicates (not offered the upgrade, don’t take the upgrade), and users in the treatment group may be in state $(1, 0)$ or $(1, 1)$. Users realized in the case $(0, 1)$ may seem impossible or surprising, as they represent a sort of leakage of the treatment condition into the control group. These correspond to the sideloading behavior discussed above, where a user obtains the update without an offer, possibly through internet backchannels.

It further helps to map out the full potential outcomes for each unit. That is, for a fixed unit, what are the pair of potential outcomes we might see if we vary the group assignment Q? This gives the following table (see [1]):

| Potential outcomes for unit over groups $(Q, U); (Q', U')$ | User type | Rough description |
|---|---|---|
| $(0, 0); (1, 0)$ | Never-taker | User that will never upgrade |
| $(0, 0); (1, 1)$ | Complier | User that will upgrade if offered |
| $(0, 1); (1, 0)$ | Defier | User that will avoid upgrades if offered, seek them if not offered |
| $(0, 1); (1, 1)$ | Always-taker | User that will seek out the upgrade in all conditions (sideloading) |
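The four user types can be encoded as a lookup from the pair (value of $U$ if assigned control, value of $U$ if assigned treatment). Since each user is only ever observed under one assignment, a single observed $(Q, U)$ pair is compatible with two of the four types. The sketch below makes that explicit; the helper names are hypothetical and introduced only for illustration.

```python
# Key: (U if assigned control, U if assigned treatment).
USER_TYPES = {
    (0, 0): "never-taker",
    (0, 1): "complier",
    (1, 0): "defier",
    (1, 1): "always-taker",
}


def user_type(u_if_control, u_if_treatment):
    """Type of a user whose behavior under both assignments is known."""
    return USER_TYPES[(u_if_control, u_if_treatment)]


def consistent_types(q, u):
    """User types compatible with a single observed (Q, U) pair.

    Only one assignment is realized per user, so the other half of the
    pair is unobserved and two types always remain possible.
    """
    return sorted(name for pair, name in USER_TYPES.items() if pair[q] == u)


print(consistent_types(1, 1))  # treatment user who upgraded
print(consistent_types(0, 1))  # control user who sideloaded
```

A treatment-group user who upgraded could be a complier or an always-taker; a control-group user who obtained the update anyway must be a defier or an always-taker.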

While they may exist in theory, do we have units of each of these types? In our application, never-takers (units who can never execute an update) and compliers (units who will upgrade if given the chance) seem reasonably common. Note here that we evaluate $U$ at $T_{\mathrm{measure}}$, so never-takers here are those who would not experience the treatment at $T_{\mathrm{measure}}$, regardless of assignment $Q$.

We will ignore always-takers and defiers, since in our application users with realized $(0, 1)$ outcomes are extremely rare ($\ll 1\%$). This implies our population consists overwhelmingly of compliers and never-takers. This greatly simplifies our situation, and makes full conditional observable or conditional compliance modeling approaches unnecessary [1]. Conditional compliance models estimate causal effects for each of the four types of users in the table. Conditional observable models try to estimate relationships between all four counterfactual quantities for each user.

Propensity scoring within the treatment

We now explore statistical strategies for estimation that account for the difference in users who experience the treatment. A starting idea is to ignore the control group and analyze the treatment group alone as a self-contained observational study, using only units where $Q=1$. We attack this via propensity modeling, using $U$ as the new ‘treatment’ variable and reducing our set of potential outcomes to $\{Y(1, 1), Y(1, 0)\}$. In this case, we fit the following logistic regression:$$\mathrm{logit}( \Pr(U=1 | X, Q=1) ) = X \beta$$We then estimate the effect using reweighting, matching or stratification methods. Before proceeding to effect estimates, it is useful to examine the output of this model. At both the start and end of the study, we can produce a histogram of the propensity scores for $Q=1$, sliced by $U$. The propensity score is the estimated probability that a user within the treatment group will actually complete the upgrade (receive the treatment).

 Fig 5: Estimated probability of experiencing the treatment in the treatment group. Observe the subset unlikely to uptake.
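For readers who want to see the propensity model spelled out, here is a self-contained sketch that fits the logistic regression by plain gradient ascent on the log-likelihood. This is an illustrative stand-in, not the fitting procedure used for the figures. In practice one would normally reach for an established statistics library. The function names, step count, and learning rate are assumptions made for this example.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def propensity(beta, row):
    """Estimated Pr(U=1 | X=row, Q=1) for coefficients [b0, b1, ..., bd]."""
    return sigmoid(beta[0] + sum(b * x for b, x in zip(beta[1:], row)))


def fit_propensity(X, u, steps=2000, lr=0.5):
    """Fit logit(Pr(U=1 | X, Q=1)) = b0 + X.b by gradient ascent.

    X: feature rows for treatment-group users only (Q=1).
    u: matching 0/1 indicators of treatment experience.
    steps, lr: illustrative defaults; real data may need tuning or a
        proper optimizer, and features should be on comparable scales.
    """
    n, d = len(X), len(X[0])
    beta = [0.0] * (d + 1)
    for _ in range(steps):
        grad = [0.0] * (d + 1)
        for row, ui in zip(X, u):
            err = ui - propensity(beta, row)  # gradient of the log-likelihood
            grad[0] += err
            for j, x in enumerate(row):
                grad[j + 1] += err * x
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta


# Hypothetical single feature: 1 if the user is highly engaged, else 0.
X = [[0], [0], [0], [0], [1], [1], [1], [1]]
u = [0, 0, 0, 1, 1, 1, 1, 0]
beta = fit_propensity(X, u)
print(propensity(beta, [0]), propensity(beta, [1]))
```

With a single binary feature the fitted scores should approach the observed upgrade rates in each group, which gives a quick sanity check on the fit.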

We see that a collection of never-takers immediately stands out with very low estimated scores, a clear conclusion even from the start of the study. These users can be safely discarded from our effect estimation analysis. Further, the existence of never-taker users calls the ITT analysis into question, even at $T_{\mathrm{final}}$. If there are users who will never experience the counterfactual treatment state, then ITT will never estimate the difference between these states if those users are included in the estimation. Otherwise, the model produces a decent range of $\Pr(U=1 | X)$ predictions, and we can see from the plot that users with $U=1$ are more common at higher probabilities. As a classifier, the model gives merely decent performance, which is actually advantageous for propensity methods. If we instead had a clean separation of the two classes for all users, this would imply that certain user factors completely determine adoption. Within the treatment group, many users with $U=0$ would then be without a peer user with $U=1$, so we could not reasonably estimate our target effect, and we would not be able to generalize our conclusions to the entire user population. This problem would be somewhat mitigated if only a subset of users had estimated probability $0$ or $1$, since we would then understand clearly for which users we cannot estimate the effect because no matching users exist.

For estimation, we consider stratification or bucketing methods. That is, we take our range of estimated propensity scores, and partition them into buckets. We then perform analysis within each bucket, and collect the within-bucket estimates to form an overall estimate. Since the scores are estimated from a model depending on many covariates, this bucketing has the effect of partitioning users based on their covariates, reduced to a single univariate measure: $\Pr(U=1 | X)$. We adopt a simple form of stratification where we partition the range $0$ to $1$ into twenty $0.05$-width strata $S_k$ for $1\leq k \leq 20$. Let $\hat{p}_i$ be the estimated propensity score for unit $i$. We can compute a propensity weighted comparison by users whose scores fall into the $k$th stratum $S_k$ as follows: $$\hat{\theta}_{\mathrm{p-weight}, k} = \frac{\sum_i \frac{Y_i}{\hat{p}_i} [\hat{p}_i \in S_k \cap Q_i=1 \cap U_i=1] }{\sum_i \frac{1}{\hat{p}_i}[\hat{p}_i \in S_k \cap Q_i=1 \cap U_i=1]} - \frac{\sum_i \frac{Y_i}{1-\hat{p}_i} [\hat{p}_i \in S_k \cap Q_i=1 \cap U_i=0] }{\sum_i \frac{1}{1-\hat{p}_i}[\hat{p}_i \in S_k \cap Q_i=1 \cap U_i=0]}$$
We can then collect these into a single estimate by taking a weighted combination of these strata comparisons, where $N_k$ is the number of users falling into the $k$th stratum:$$\hat{\theta}_{\mathrm{p-weight}} = \frac{ \sum_k N_k \hat{\theta}_{\mathrm{p-weight}, k}} {\sum_k N_k}$$
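The stratified estimator can be sketched in a few lines. The version below takes outcomes, treatment experience, and estimated scores for treatment-group users only. Two choices here are assumptions of this sketch rather than something specified above: strata that lack either $U=1$ or $U=0$ users are skipped, and $N_k$ counts all users in a retained stratum. Scores must lie strictly between $0$ and $1$ for the weights to be defined.

```python
import math
from collections import defaultdict


def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def p_weight_estimate(y, u, p_hat, n_strata=20):
    """Stratified propensity-weighted estimate within the treatment group.

    y, u, p_hat: outcome, treatment experience (0/1), and estimated
        Pr(U=1 | X) for each user with Q=1.
    Returns (overall estimate, {stratum index: stratum estimate}), where
    stratum k (0-based) covers scores in (k / n_strata, (k + 1) / n_strata].
    """
    strata = defaultdict(lambda: {1: [], 0: []})
    for yi, ui, pi in zip(y, u, p_hat):
        k = max(math.ceil(pi * n_strata) - 1, 0)
        strata[k][ui].append((yi, pi))

    per_stratum, sizes = {}, {}
    for k, groups in strata.items():
        if not groups[1] or not groups[0]:
            continue  # assumption: skip strata with no comparison group
        treated = weighted_mean([yv for yv, _ in groups[1]],
                                [1 / pv for _, pv in groups[1]])
        untreated = weighted_mean([yv for yv, _ in groups[0]],
                                  [1 / (1 - pv) for _, pv in groups[0]])
        per_stratum[k] = treated - untreated
        sizes[k] = len(groups[1]) + len(groups[0])

    if not per_stratum:
        raise ValueError("no stratum contains both U=1 and U=0 users")
    total = sum(sizes.values())
    overall = sum(sizes[k] * per_stratum[k] for k in per_stratum) / total
    return overall, per_stratum


# Hypothetical data: one low-propensity and one high-propensity stratum.
y = [5, 5, 1, 1, 10, 2]
u = [1, 1, 0, 0, 1, 0]
p_hat = [0.22, 0.24, 0.22, 0.24, 0.81, 0.83]
print(p_weight_estimate(y, u, p_hat))
```

Returning the per-stratum estimates alongside the overall figure makes it easy to inspect whether the effect varies with the propensity score.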
It is useful to examine the results per stratum, as the following plot does. Here, we compare the results on users at both $T_\mathrm{measure}$ and $T_\mathrm{final}$, where the strata and $\hat{p}$ are defined by the model fit at $T_\mathrm{measure}$ only.

 Fig 6: Effect estimates for each propensity score stratum. For the most part, the estimates are in good agreement, but differ significantly in the high propensity strata.

We first see that the model performs well for the lower strata (e.g. $(0.45, 0.5]$), in that the estimate at $T_\mathrm{final}$ is close to that made at $T_\mathrm{measure}$. The approach performs worse for higher buckets, which is expected as there are fewer users with $Q=1, U=0$ here.

A more striking overall point is that the effect is not uniform across the buckets. In fact, the impact of the update increases as our estimate of $\Pr(U=1 | X)$ increases. This means that for different users, the new game has a different effect. This is a valuable insight that the ITT and TOT approaches do not provide, as their estimand assumes that the treatment effect is a universal mean shift across all users.

Propensity score matching to the control

Another approach is to leverage the control group along with our propensity scores. With a control group, we have access to many users for whom we (mostly) observe $Y(0, 0)$ for the entire range of estimated propensity scores. In contrast, in our within treatment approach, there are fewer users with realized outcome $Y(1, 0)$ as the estimated propensity score increases. After fitting a propensity model to the treatment group, we can estimate the probability of each member of the control group experiencing treatment by assuming that $\Pr( U =1 | X, Q = 0) = \Pr( U =1| X, Q = 1)$ and $Y(0, 0) = Y(1, 0)$. With an estimate of $\Pr(U=1 | X)$ available for each unit, we can now perform some form of matching, either exact or stratified, between units and take paired differences between the resulting groups. To obtain an estimate of $E(Y(1, 1) - Y(1, 0))$, we select only groups containing units receiving the treatment in the treatment group and compare them as follows:$$\hat{\theta}_{\mathrm{p-match}, k} =\frac{\sum_i \frac{Y_i}{\hat{p}_i} [\hat{p}_i \in S_k \cap Q_i=1 \cap U_i=1] }{\sum_i \frac{1}{\hat{p}_i}[\hat{p}_i \in S_k \cap Q_i=1 \cap U_i=1]} - \frac{\sum_i \frac{Y_i}{1-\hat{p}_i} [\hat{p}_i \in S_k \cap Q_i=0 \cap U_i=0] }{\sum_i \frac{1}{1-\hat{p}_i}[\hat{p}_i \in S_k \cap Q_i=0 \cap U_i=0]}$$
This approach has a distinct advantage over our ‘within treatment’ stratification analysis: it provides a large pool of users without the treatment at all time points for all strata. Again, we can plot the performance at both the start and end of the study:

 Fig 7: Analogous to Fig 6, but with better agreement in the high propensity strata
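The matched-to-control comparison differs from the within-treatment version only in where the untreated users come from. The sketch below assumes the control users have already been scored with the model fit on the treatment group, and that inputs arrive as lists of `(y, p_hat)` pairs. Both the function name and this layout are hypothetical.

```python
import math


def p_match_by_stratum(treated, control, n_strata=20):
    """Per-stratum comparison of treated users against scored control users.

    treated: (y, p_hat) pairs for users with Q=1 and U=1.
    control: (y, p_hat) pairs for users with Q=0 and U=0, scored with the
        propensity model fit on the treatment group.
    Returns {stratum index: estimate} for strata that contain treated
    users and at least one control user. Scores must lie strictly
    between 0 and 1.
    """
    def weighted_sums(pairs, weight):
        sums = {}
        for yv, pv in pairs:
            k = max(math.ceil(pv * n_strata) - 1, 0)
            num, den = sums.get(k, (0.0, 0.0))
            w = weight(pv)
            sums[k] = (num + w * yv, den + w)
        return sums

    t = weighted_sums(treated, lambda p: 1 / p)
    c = weighted_sums(control, lambda p: 1 / (1 - p))
    return {k: t[k][0] / t[k][1] - c[k][0] / c[k][1]
            for k in sorted(t) if k in c}


# Hypothetical data: treated users only appear in a high-propensity stratum.
treated = [(10, 0.81), (10, 0.83)]
control = [(2, 0.82), (2, 0.84), (1, 0.12)]
print(p_match_by_stratum(treated, control))
```

Control users in strata with no treated users (the score of 0.12 above) simply drop out of the comparison.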

The results here are similar to the previous approach in the lower propensity score buckets. The main improvement comes at the higher buckets, where the estimates at $T_\mathrm{final}$ and $T_\mathrm{measure}$ are now as close as in the other buckets. The core reason for this improvement is that in this approach, we have many users available in the control ($Q=0$) with user features similar to those in the treatment ($Q=1$) that would produce higher propensity scores. In contrast, within the treatment group ($Q=1$), the users with high propensity scores are primarily $U=1$ users, leaving few $U=0$ users for comparison and leading to poor estimates in these strata.

 Fig 8: As expected, propensity matching is more consistent over time than propensity weighting

Conclusion

Experiment analysis often cannot rely on the assumption of faithful adoption of the treatment condition. Here, we’ve explored a case where many users assigned the treatment do not actually experience the treatment for a long time period after the beginning of the experiment. Moreover, waiting until a steady state of treatment adoption to draw inference is often impractical, so we have to make do with a biased early subset of users. As we’ve shown, adjustments are possible, but a litany of assumptions and concerns must be dealt with. Several complexities are left unaddressed here, such as effects that evolve over time, or large volumes of users falling into the ‘defier’ or ‘always-taker’ categories, which would require further refinements of these approaches. Nonetheless, propensity-based models often provide insightful refinements to the basic ITT or TOT approaches, and would form the basis for methods that address these complexities.

References

[1] Ten Have, Thomas R., et al. “Causal Models for Randomized Physician Encouragement Trials in Treating Primary Care Depression.” Journal of the American Statistical Association, vol. 99, no. 465, 2004, pp. 16–25. JSTOR, www.jstor.org/stable/27590349.