### Designing A/B tests in a collaboration network

BY SANGHO YOON




*In this article, we discuss an approach to the design of experiments in a network. In particular, we describe a method to prevent potential contamination (or inconsistent treatment exposure) of samples due to network effects. We present data from Google Cloud Platform (GCP) as an example of how we use A/B testing when users are connected. Our methodology can be extended to other areas where the network is observed and when avoiding contamination is of primary concern in experiment design. We first describe the unique challenges in designing experiments on developers working on GCP. We then use simulation to show how proper selection of the randomization unit can avoid estimation bias. This simulation is based on the actual user network of GCP.*

## Experimentation on networks

A/B testing is a standard method of measuring the effect of changes by randomizing samples into different treatment groups. Randomization is essential to A/B testing because it removes selection bias as well as the potential for confounding factors in assessing treatment effects.

At Google, A/B testing plays a key role in better understanding our users and products. With A/B testing, we can validate various hypotheses and measure the impact of our product changes, allowing us to make better decisions. Of course, A/B testing is not something new in our field, as it has been adopted by many tech companies. But due to the large scale and complexity of data, each company tends to develop its own A/B test solution to solve its unique challenges. One particular area involves experiments in marketplaces or social networks where users (or randomized samples) are connected and treatment assignment of one user may influence another user's behavior.

Typical A/B experiments assume that the response or behavior of an individual sample depends only on its own assignment to a treatment group. This is known as the Stable Unit Treatment Value Assumption (SUTVA). However, this assumption no longer holds when samples interact with each other, such as in a network. An example of this is when the effects of exposed users can spill over to their peers. This is the case for experiments on the Google Cloud Platform.

Google Cloud Platform (GCP) offers a suite of products that enable developers to work on their projects in the cloud. GCP also provides great flexibility to developers in sharing their resources and projects, with tools to protect and control their security and privacy. We find that users in GCP naturally form collaboration networks to work on shared projects and this in turn improves efficiency in managing their resources. Our goal is to leverage this network structure in designing and analyzing experiments to improve the GCP product.

One salient requirement we impose on experiment design is that the experience of all users who collaborate with each other be consistent. This is critical to our highly collaborative product. For example, imagine a situation where two users collaborate on a shared project. One user sees a new feature to enable a firewall, but the other user doesn’t see the same option available. This could create confusion. Such an undesired effect is not only bad for user experience, it also hinders us from measuring the **true** average treatment effect of the new firewall feature. Thus, the requirement is to provide a consistent user experience for the following two types of treatment exposure:

- **Direct exposure**: Every user must have a consistent experience across all GCP projects that he or she owns, manages, or collaborates in.
- **Indirect exposure**: Any two users who collaborate on a project must have the same experience.

While the collaboration networks in Google Cloud Platform bear similarities to components of social networks (e.g. Facebook, Twitter, LinkedIn, Google+), there are significant differences. Two fundamental differences are described below:

- **A few large connected networks versus many connected components**: Users in social networks are linked to each other through their common friends. Methodologies for experimenting on users in social networks focus on ways to partition the overall graph into subgraphs, and then to run randomized experiments on these subgraphs. Severing edges is necessary because the largest connected components of the graph are typically very large. In GCP, however, we observe many small connected components because our customers want to manage their own privacy and security in their projects, and do not want to share access with third parties.
- **Spillover effects versus contamination**: Experiments in social networks must care about "spillover" or influence effects from peers. These spillover effects are a fundamental aspect of user behavior in a social network. Thus, the effect comes from both direct exposure of each treated individual, and indirect exposure from his or her peers. Such spillover effects also exist in the GCP user collaboration network but they are of secondary importance. In our case, avoiding confusion is more important than estimating indirect treatment effects. For example, imagine the confusion resulting from two users who work on a shared project but see two different versions. We need to avoid these effects rather than estimate them.

## Structure of the user collaboration network

As mentioned earlier, users in GCP collaborate with other developers via shared projects. Projects are linked to a Google Cloud billing account for proper resource management and billing. Since a project can be linked to at most one billing account, the project-to-billing account relationship is nested. However, a user can work on multiple projects. The user-to-project relationship is not necessarily nested. Rather, users can have membership in multiple projects. Therefore the relationship among billing account, project and user is complex. Figure 1 illustrates these three entities.

Since users can be connected via shared projects, we also need to track another entity, the component of the graph to which any user belongs. A user can be associated with exactly one component. Figure 2 shows that the user collaboration graph has three distinct components.

Figure 2. Hierarchy: component → account → project → user
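Finding the component of each user is a standard connected-components computation. The sketch below is one way to do it with a union-find structure, assuming the input is a mapping from project to its list of users. The function name `find_components` and the input shape are illustrative assumptions, not the pipeline used at Google, and billing accounts are ignored here (they could be added as extra nodes in each project's list).

```python
from collections import defaultdict


def find_components(project_members):
    """Group users into connected components.

    project_members: {project_id: [user, ...]} (hypothetical input shape).
    Two users end up in the same component if a chain of shared projects
    connects them.
    """
    parent = {}

    def find(u):
        parent.setdefault(u, u)
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    def union(u, v):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_v] = root_u

    for users in project_members.values():
        for u in users:
            find(u)  # register every user, including sole project members
        for u in users[1:]:
            union(users[0], u)

    components = defaultdict(set)
    for u in list(parent):
        components[find(u)].add(u)
    return list(components.values())


if __name__ == "__main__":
    projects = {"p1": ["ann", "bob"], "p2": ["bob", "cat"], "p3": ["dan"]}
    print(find_components(projects))
```

Because every user has exactly one root, each user belongs to exactly one component, which matches the property described above.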

## Designing experiments on the collaboration network

The hierarchical structure of the collaboration network makes it clear that we must use component as the unit of randomization in our experiments. This is necessary to provide guarantees on treatment consistency. However, the downside of using a larger unit of randomization is that we lose experimental power. This comes from two factors: fewer experimental units, and greater inherent difference across experimental units.

Figure 3 shows the distribution of project size, measured as the number of users per project, and the distribution of the number of projects per user (axes have been removed for confidentiality). These contribute to the structure and size of components. We see here indications of large differences in the size and structure of components. These differences tend to increase the variance of our estimates and hence cost us statistical power.

Figure 3. Number of users (per project) and number of projects (per user)

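One way to mitigate this loss of power is to cluster samples into more homogeneous strata and sample proportionately from each stratum. We define strata based on two features: number of users and "usage", a measure of the aggregate user activity in the network. These two properties were selected because they correlate strongly with experiment metrics of greatest interest.

By drawing a fixed fraction of units from each stratum, we achieve better balance across treatment groups, and hence reduce variance in our estimates. In addition, stratified sampling helps us obtain representative samples when the sampling rate is low.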

The overall procedure for stratified random sampling is as follows:

1. **Build user graphs**: Find all the components in the current collaboration network.
2. **Stratify graphs by size and usage**: Measure the size of each component by number of users and revenue, and stratify the components by number of users and revenue.
3. **Select samples and random assignment**: Randomly sample a fraction of components in each stratum from Step 2 depending on the size of a study. Then randomly assign them to a treatment arm. For example, if we wish to run an experiment with a 5% arm for treatment and 5% for control, we first select a random 10% of components from each stratum, and subsequently assign them 50-50 to treatment and control groups. Each user, project and account inherits the experiment group from the component to which they belong.
4. **Run experiment**: Steps 2 and 3 are repeated daily after the graph has been updated, and new components are also properly randomized.

Figure 4. Components and random sampling with stratification
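The sampling and assignment step can be sketched as follows. This is a minimal illustration, not the production procedure: `stratified_assign` and `stratum_of` are hypothetical names, and how components are bucketed into strata (for example, by quantiles of user count and usage) is left to the caller.

```python
import random
from collections import defaultdict


def stratified_assign(components, stratum_of, sample_rate=0.10, seed=0):
    """Sample a fixed fraction of components per stratum, then split 50-50.

    components: list of sortable component ids
    stratum_of: function mapping a component id to a sortable stratum label
                (hypothetical; the bucketing rule is an assumption)
    Returns {component_id: "treatment" | "control"} for sampled components.
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for c in components:
        by_stratum[stratum_of(c)].append(c)

    assignment = {}
    for stratum in sorted(by_stratum):
        members = sorted(by_stratum[stratum])
        n_sample = round(len(members) * sample_rate)
        sampled = rng.sample(members, n_sample)  # returned in random order
        half = n_sample // 2  # control gets the extra unit when n_sample is odd
        for c in sampled[:half]:
            assignment[c] = "treatment"
        for c in sampled[half:]:
            assignment[c] = "control"
    return assignment


if __name__ == "__main__":
    arms = stratified_assign(list(range(1000)), lambda c: c % 4)
    print(len(arms), "components sampled")
```

Users, projects, and accounts would then look up their arm through the component they belong to, so everyone in a component sees the same experience.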

## Modeling network effects

In order to quantify the tradeoffs involved in experiment design, we need a model of network effects to be used in subsequent simulation studies. We now describe a generative model for how effects might propagate through the network. The network topology itself is the actual collaboration network we observe for GCP.

Consider the case where experiment metrics are evaluated at the per-user level. Assume we have $K$ users. Let $Z_k$ denote the assignment of the $k^{th}$ user to an arm of the experiment. Here $Z_k = 0$ means the user is assigned to control and $Z_k = 1$ means the user is assigned to treatment. Under the Stable Unit Treatment Value Assumption (SUTVA), one can estimate the treatment effect as follows:

$$
\delta = \frac{1}{N_1}\sum_k Y_k[Z_k = 1] - \frac{1}{N_0}\sum_k Y_k [Z_k=0] \tag{1}
$$

where $N_1$ and $N_0$ are the number of samples assigned to the treatment group and the control group, respectively. This is equivalent to estimating $\tau$ in the following linear model:

$$
y_k \sim \mu + \tau z_k \tag{2}
$$

where $\mu$ is an overall intercept term and $\tau$ is the effect of treatment.
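The equivalence between the difference in means in (1) and the slope of the linear model in (2) can be checked numerically. The sketch below uses made-up data and two illustrative helper functions (`diff_in_means` and `ols_slope` are names introduced here); the slope is the closed-form least-squares estimate for a regression with an intercept and a binary regressor.

```python
def diff_in_means(y, z):
    """Mean response of treated units minus mean response of control units."""
    treated = [yk for yk, zk in zip(y, z) if zk == 1]
    control = [yk for yk, zk in zip(y, z) if zk == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


def ols_slope(y, z):
    """Least-squares slope of y on z in a model with an intercept."""
    n = len(y)
    z_bar = sum(z) / n
    y_bar = sum(y) / n
    cov = sum((zk - z_bar) * (yk - y_bar) for yk, zk in zip(y, z))
    var = sum((zk - z_bar) ** 2 for zk in z)
    return cov / var


if __name__ == "__main__":
    y = [1.0, 2.5, 0.5, 3.0, 2.0, 4.0]  # made-up responses
    z = [0, 1, 0, 1, 0, 1]              # made-up assignments
    print(diff_in_means(y, z), ols_slope(y, z))
```

The two numbers agree up to floating-point error, which is why the rest of the modeling can be phrased in terms of the linear model.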


When users are connected in a network, their treatment assignments can generate network effects through their interactions. Our model considers two aspects of network effects:

- **Homophily** or similarity within network: users collaborating in network tend to behave similarly. For example, developers working on a specific mobile app show similar behavior in usage. We use hierarchical models for this effect.
- **Spillover** or contamination effects: direct treatment effects can spill over through network connections. We conservatively limit the degree of spillover effects to immediate neighbors.

Similarity within the network is modeled hierarchically as

$$
y_{i,j,k} \sim c_i + a_{i,j} + p_{i,j,k} \tag{3}
$$

where $c_i$ refers to the response from Component $i$, $a_{i,j}$ from Account $j$ in Component $i$, and $p_{i,j,k}$ from Project $k$ in Account $j$ in Component $i$. The random effects

\begin{align*}
c_i &\sim N(0, \sigma_c^2)\\
a_{i,j} &\sim N(0, \sigma_a^2)\\
p_{i,j,k} &\sim N(0, \sigma_p^2)
\end{align*}

can model potential correlation among accounts and projects within a component.


Spillover effects are modeled as an additional component added to the linear model in (2):

$$
y_k \sim \mu + \tau z_k + \gamma \tau a_k^T \cdot Z \tag{4}
$$

where $Z$ is a vector representing the treatment group assignment of every user, and $a_k$ is the $k^{th}$ column of the adjacency matrix $A$, i.e., the $m^{th}$ element of $a_k$ is $1$ if the $k^{th}$ user and the $m^{th}$ user are connected. Note that we only model first-order spillover effects in (4). In other words, we do not consider potential effects from neighbors’ neighbors. Thus, our model is conservative with respect to spillover effects (i.e., it limits their impact). Combining spillover effects and similarity within network, we have

$$
y_{i,j,k} \sim \mu + \tau z_k + \gamma \tau a_k^T \cdot Z + (c_i + a_{i,j} + p_{i,j,k}) \tag{5}
$$
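To make the generative model concrete, here is a minimal sketch of how one might draw responses from (5) on a toy network. The function `simulate_responses`, its input shapes, and the simplification that each user has a single home project are assumptions for illustration; the article's simulations use the actual GCP network, which is not reproduced here. No per-user noise term is added because (5) does not include one.

```python
import random


def simulate_responses(users, neighbors, z, mu=0.0, tau=0.125, gamma=2**-4,
                       sigma_c=2.0, sigma_a=1.0, sigma_p=0.5, seed=0):
    """Draw one response per user from model (5).

    users: {user: (component, account, project)}; a toy simplification that
           gives each user a single home project (assumption)
    neighbors: {user: set of directly connected users}
    z: {user: 0 or 1} treatment assignment
    """
    rng = random.Random(seed)
    effects = {}

    def random_effect(key, sigma):
        # One shared draw per component, account, or project.
        if key not in effects:
            effects[key] = rng.gauss(0.0, sigma)
        return effects[key]

    y = {}
    for u, (comp, acct, proj) in users.items():
        # First-order spillover: count treated immediate neighbors only.
        spill = gamma * tau * sum(z[v] for v in neighbors.get(u, ()))
        similarity = (random_effect(("c", comp), sigma_c)
                      + random_effect(("a", comp, acct), sigma_a)
                      + random_effect(("p", comp, acct, proj), sigma_p))
        y[u] = mu + tau * z[u] + spill + similarity
    return y


if __name__ == "__main__":
    users = {"ann": (1, 1, 1), "bob": (1, 1, 2), "cat": (2, 1, 1)}
    neighbors = {"ann": {"bob"}, "bob": {"ann"}}
    print(simulate_responses(users, neighbors, {"ann": 1, "bob": 1, "cat": 0}))
```

Users in the same component share the draw of $c_i$, which is what induces the within-component correlation described above.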


## Experimental power and unit of randomization

We can use the model just defined to simulate the effect of the randomization unit. We consider two randomization units: project and connected component. To further illustrate the effects of stratification on experimental power, we sample components either uniformly or by strata. In other words, we compare three methods of randomization:

- uniform random component
- uniform random project
- stratified random component

Figure 5 shows empirical 95% confidence intervals for each of these sampling methods in A/A tests. Since the true effect is zero in each case, we expect our confidence intervals to include zero 95% of the time. The plot sorts 1,000 empirical confidence intervals by their midpoint (grey dot). The vertical line segment corresponding to each interval is green if it covers zero, red otherwise. Thus, the patch of red on either side consists of about 25 cases (i.e., 2.5%).

The figure shows that random sampling by component has the widest confidence intervals while random sampling by project has the narrowest. Stratified sampling by component is in between. Thus stratification recovers some of the experimental power lost when going from sampling by project to sampling by connected component.

Figure 5. A/A test results. Confidence intervals of three methods: random sampling by project, random sampling by component, and stratified sampling by component.
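The coverage check behind such a plot can be sketched in a few lines. The example below is a simplified stand-in: it simulates A/A tests on independent, normally distributed units rather than on the GCP network, and uses a normal-approximation interval. The function names are illustrative.

```python
import random
import statistics


def aa_interval(n_per_arm, rng):
    """95% normal-approximation CI for a difference in means, true effect zero."""
    a = [rng.gauss(0.0, 1.0) for _ in range(n_per_arm)]
    b = [rng.gauss(0.0, 1.0) for _ in range(n_per_arm)]
    diff = statistics.fmean(a) - statistics.fmean(b)
    se = (statistics.variance(a) / n_per_arm
          + statistics.variance(b) / n_per_arm) ** 0.5
    return diff - 1.96 * se, diff + 1.96 * se


def coverage(n_tests=1000, n_per_arm=200, seed=0):
    """Fraction of simulated A/A intervals that cover zero."""
    rng = random.Random(seed)
    intervals = [aa_interval(n_per_arm, rng) for _ in range(n_tests)]
    # Sorting by midpoint mirrors the ordering used in the plot.
    intervals.sort(key=lambda ci: (ci[0] + ci[1]) / 2)
    covered = sum(lo <= 0.0 <= hi for lo, hi in intervals)
    return covered / n_tests


if __name__ == "__main__":
    print(coverage())
```

A coverage close to 0.95 indicates that the interval width is calibrated; comparing widths across designs then gives a sense of relative power.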

## Estimation bias due to unit of randomization

Of course, running null experiments is hardly the purpose of experiment design. The reason we chose component as the unit of experimentation was that it better captures spillover effects when they are not null. Because randomizing by project does not take network effects into account, we would expect it to incur bias to the extent that there are spillover effects.

We generate simulation data using (5) on the actual GCP user network, with the following parameter values for similarity and spillover effects:

- Similarity effect parameters: $\sigma_c=2$, $\sigma_a=1$ and $\sigma_p=0.5$
- Direct treatment effect size: $\tau= \frac{1}{8}$
- Spillover effect parameter: $\gamma=2^{-m}$, where $m = 1, 2, 3, 4, 5, 6, 7$ or $8$

For each setting of the parameters, we ran three randomized experiments, one for each of the three sampling methods. Each experiment ran with 50% in treatment and 50% in control. We repeated this whole process 1,000 times to estimate a distribution of effect size estimates.

Figure 6 shows the distributions of estimated effect size for the three experiment designs based on 1,000 simulation data sets for fixed values of $\tau=1/8$ and $\gamma= 2^{-4}$. While the random project design has the smallest variance, it incurs significant bias. In contrast, the random component and stratified component designs have higher variance but no observable bias.

Figure 6. Effect size estimates for each of the three experiment designs. Dotted line shows the true effect size. Distributions estimated from 1,000 simulations, $\tau=1/8$ and $\gamma = 2^{-4}$.
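The qualitative pattern can be reproduced on a toy network. The sketch below is an illustration under strong assumptions, not a replication of the article's simulation: the network is a set of equal-sized cliques, responses follow model (4) with $\mu = 0$ and no random effects, and individual-level randomization stands in for any unit finer than the component. The function name `simulate_estimate` is hypothetical.

```python
import random


def simulate_estimate(unit, n_components=200, size=5, tau=0.125,
                      gamma=2**-4, seed=0):
    """Difference in means on a toy network of equal-sized cliques.

    unit: "component" randomizes whole cliques; "user" randomizes individuals.
    Responses follow model (4) with mu = 0 and no random effects (assumption).
    """
    rng = random.Random(seed)
    assignments = []  # one list of 0/1 arms per clique
    for _ in range(n_components):
        if unit == "component":
            arm = rng.randint(0, 1)
            assignments.append([arm] * size)
        else:
            assignments.append([rng.randint(0, 1) for _ in range(size)])

    treated, control = [], []
    for clique in assignments:
        n_treated = sum(clique)
        for zk in clique:
            treated_neighbors = n_treated - zk  # everyone else in the clique
            y = tau * zk + gamma * tau * treated_neighbors
            (treated if zk == 1 else control).append(y)
    return sum(treated) / len(treated) - sum(control) / len(control)


if __name__ == "__main__":
    print("component:", simulate_estimate("component"))
    print("user:", simulate_estimate("user"))
```

In this toy setting, component-level randomization captures the spillover that treated users receive from their treated collaborators, while the finer unit gives an estimate close to $\tau$ alone because treated and control users have similar numbers of treated neighbors.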

The amount of bias in a random project design depends on the level of spillover effect. This is shown for different values of $m$ in Figure 7. The bias of the random project design is such that its 95% confidence intervals, estimated under independence, exclude the true effect for all $m \leq 6$, that is, even for spillover effects as small as $\gamma = 2^{-6}$.

Figure 7. Degree of network effects and effect size estimation. The dotted line with “A” refers to the true average effect.
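A short derivation, under simplifying assumptions that are ours rather than the article's, suggests why the bias scales with $\gamma$. Assume model (4), an adjacency matrix with zero diagonal, and take the "true effect" to be the contrast between treating everyone and treating no one. Let $d_k$ denote the number of neighbors of user $k$, $\bar d$ the average degree, and $q$ the probability of assignment to treatment (all three symbols are introduced here for illustration).

$$
\begin{aligned}
\text{all treated vs. all control:}\quad & \frac{1}{K}\sum_k \left(\tau + \gamma\tau d_k\right) = \tau\,(1 + \gamma \bar d) \\
\text{component randomization:}\quad & E[y_k \mid z_k = 1] - E[y_k \mid z_k = 0] = (\mu + \tau + \gamma\tau d_k) - \mu = \tau + \gamma\tau d_k \\
\text{independent assignment of neighbors:}\quad & E[y_k \mid z_k = 1] - E[y_k \mid z_k = 0] = (\mu + \tau + \gamma\tau q d_k) - (\mu + \gamma\tau q d_k) = \tau
\end{aligned}
$$

The second line uses the fact that all neighbors of a user lie in the same component and therefore share its arm. Under these assumptions, a design that assigns neighbors independently recovers only $\tau$ and misses roughly $\gamma\tau\bar d$, which shrinks as $\gamma = 2^{-m}$ decreases. Project-level randomization would sit between the two extremes, since some but not all of a user's neighbors share the user's arm.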

## Dynamic evolution of user collaboration network

An actual user collaboration network is not static and evolves over time as users start new projects, finish existing ones, or change their project memberships. As a result, the following four changes can happen to components:

- **Create**: a new component is created.
- **Split**: an existing component breaks into sub-components.
- **Remove**: a component no longer exists.
- **Merge**: existing components become connected.

Figure 8. Addition and Merge: new entities (projects and users) added are in RED, and components merged are in BLUE.

When components that were assigned to different experiment arms merge, it is no longer possible to guarantee consistent user treatment as defined earlier. We may discuss nuances of graph evolution in a future post, but for the most part, we are fortunate that this is a relatively rare event in our collaboration network today. Another aspect of concern is that treatment can in theory affect not just experiment metrics but also the graph topology itself. Thus graph evolution events also need to be tracked over the course of the experiment.
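One way to track the problematic case is to compare the previous arm assignment against the components found after each graph update. The sketch below is a hypothetical helper, not a description of the tracking used for GCP; `cross_arm_merges` and its input shapes are assumptions.

```python
def cross_arm_merges(old_arm, new_components):
    """Return new components whose users were previously in different arms.

    old_arm: {user: "treatment" | "control"} from the previous assignment
    new_components: iterable of sets of users after the graph update
    Users without a previous arm (newly added) are ignored.
    """
    conflicts = []
    for comp in new_components:
        arms = {old_arm[u] for u in comp if u in old_arm}
        if len(arms) > 1:
            conflicts.append(comp)
    return conflicts


if __name__ == "__main__":
    old_arm = {"ann": "treatment", "bob": "treatment", "cat": "control"}
    updated = [{"ann", "cat"}, {"bob", "dan"}]
    print(cross_arm_merges(old_arm, updated))
```

Components flagged this way contain users with inconsistent exposure; how to handle them (for example, excluding them from analysis) is a design decision outside the scope of this sketch.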
