# Variance and significance in large-scale online services

by AMIR NAJMI

*Running live experiments on large-scale online services (LSOS) is an important aspect of data science. Unlike experimentation in some other areas, LSOS experiments present a surprising challenge to statisticians: even though we operate in the realm of “big data”, the statistical uncertainty in our experiments can be substantial. Because individual observations have so little information, statistical significance remains important to assess. We must therefore maintain statistical rigor in quantifying experimental uncertainty. In this post we explore how and why we can be “data-rich but information-poor”.*

There are many reasons for the recent explosion of data and the resulting rise of data science. One big factor in putting data science on the map has been what we might call Large Scale Online Services (LSOS). These are sites and services which rely both on ubiquitous user access to the internet and on advances in technology to scale to millions of simultaneous users. There are commercial sites which allow users to search for and purchase goods or book rooms they desire. There are music and video streaming sites where users decide which content to consume, and apps, be they for ride-sharing or dating. In each case, users engage with the service at will and the service makes available a rich set of possible interactions. Which action a user takes depends on many factors: her intent, her needs, her tastes, the perceived quality of choices available to her, the presentation of those choices, the ease of selection, the performance of the website, and so on. Indeed, understanding and facilitating user choices through improvements in the service offering is much of what LSOS data science teams do.

As with any enterprise, the goal of the service provider is to better satisfy its users and further its business objectives. But the fact that a service could have millions of users and billions of interactions gives rise to both big data and methods which are effective with big data. Of particular interest to LSOS data scientists are modeling and prediction techniques which keep improving with more data. If these are a **simple matter of data** (call it SMOD in analogy to SMOP, a “simple matter of programming”), they will improve automatically as the LSOS itself grows.

A particularly attractive approach to understanding user behavior in online services is live experimentation. Randomized experiments are invaluable because they represent the gold standard for drawing causal inferences. And because the service is online and large scale, it may be feasible to experiment with each of many parameters of the service. For example, an LSOS experiment may answer the question of whether a new design for the main page is better for the user. The LSOS may do this by exposing a random group of users to the new design, comparing them to a control group, and then analyzing the effect on important user engagement metrics, such as bounce rate, time to first action, or number of experiences deemed positive. Indeed, such live experiments (so-called “A/B” experiments) have become a staple in the LSOS world [1].

Since an LSOS experiment has a sample size orders of magnitude larger than that of the typical social science experiment, it is tempting to believe that any meaningful experimental effect would automatically be statistically significant. It is certainly true that for any given effect, statistical significance is an SMOD. And an LSOS is awash in data, right? Well, it turns out that depending on what it cares to measure, an LSOS might not have enough data. Surprisingly, outcomes of interest to an LSOS often have a very high coefficient of variation compared, say, to social science experiments. This means that each observation carries little information, and we need a lot of observations to make reliable statements. The practical consequence is that we can’t afford to be sloppy about measuring statistical significance and confidence intervals. At Google, we have invested heavily in making our estimates of uncertainty ever more accurate (see our blog post on Poisson Bootstrap for an example).

## Statistical Significance vs. Practical Significance

Suppose we are running an LSOS with lots of “traffic” (pageviews, user sessions, requests, and the like). Ours is a sophisticated outfit, doing lots of live experiments to determine which features will best serve our users’ needs. No doubt we have metrics which we track to determine which experimental change is worth launching. These metrics embody some important aspect of our business objectives, such as click-through rates on content, watch times on video, or likes on a social networking site. In addition to a suitable metric, we must also choose our experimental unit. This is the unit being treated and whose response we assume to be independent of the treatment administered to other units (also known as the stable unit treatment value assumption, or SUTVA, in the causal inference literature). Each experiment is conducted by treating some randomly sampled units and comparing against other randomly sampled untreated units. Choice of **experimental unit** isn’t trivial either, since we want to define units to be as numerous as possible but still largely independent. For instance, we probably don’t want to posit pageviews to our website as our experimental unit because it is hard to argue that treatment received by a user on one page will not affect her behavior on another page in that session. Perhaps user sessions or even whole users might be necessary as experimental units. In any event, let’s say we have an appropriate choice of experimental unit.

At its simplest, we will run our randomized experiment and compare the average metric effect on treatment against that of control. Typically, we would require the results of the experiment to be both statistically significant and practically significant in order to launch.

**Statistical significance** ensures that the results of the experiment are unlikely to be due to chance. For this purpose, let’s assume we use a t-test for difference between group means. As noted, if there is any effect at all, statistical significance is “SMOD”. On the other hand, **practical significance** is about whether the effect itself is large enough to be worthwhile. That’s more of a business question about the value of the underlying effect than about one’s ability to measure the effect. If the experiment gives us a 95% confidence interval of 0.01% ± 0.002% for the change in our metric, we have enough measurement accuracy but may well not care to launch such a small effect.
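The distinction can be sketched in a few lines of Python. This is only an illustration: the function name `assess` and the 1% practical threshold are assumptions made for the example, and treating an effect as practically significant when the point estimate reaches the threshold is just one possible convention.

```python
def assess(estimate_pct: float, half_width_pct: float, practical_threshold_pct: float):
    """Classify an estimated percent change given its confidence-interval half width.

    All names and the threshold are hypothetical, chosen for illustration.
    Returns (statistically_significant, practically_significant).
    """
    lower = estimate_pct - half_width_pct
    upper = estimate_pct + half_width_pct
    # Statistically significant: the confidence interval excludes zero.
    statistically_significant = lower > 0 or upper < 0
    # Practically significant (one convention): the effect is detectable and the
    # point estimate is at least as large as the change the business cares about.
    practically_significant = (
        statistically_significant and abs(estimate_pct) >= practical_threshold_pct
    )
    return statistically_significant, practically_significant


# The example from the text: a 0.01% +/- 0.002% change, judged against an
# assumed practical threshold of 1%.
result = assess(0.01, 0.002, 1.0)
print(result)
```

With these inputs the interval excludes zero, so the change is measurable, but it falls far short of the assumed threshold.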

All this is old hat to statisticians and experimental social scientists even if they aren’t involved in data science. Indeed, a Google search for [statistical significance vs practical significance] turns up lots of discussion. The surprise is that the effect sizes of practical significance are often extremely small from a traditional statistical perspective. To understand this better we need a few definitions.

## Effect size

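Let our metric be $Y_i$ on the $i$th experimental unit. Further assume $Y_i \sim N(\mu,\sigma^2)$ under control and $Y_i \sim N(\mu+\delta,\sigma^2)$ under treatment (i.e. known, equal variances). The statistical effect size is often defined as \[

e=\frac{\delta}{\sigma}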

\]which is the difference in group means as a fraction of the (pooled) standard deviation (sometimes referred to as “Cohen’s d”). An effect size of 0.2 in this situation is traditionally considered small (say, on Cohen’s scale). For a traditional (i.e. non-LSOS) example, let's say the height of men and women in the US each follows a normal distribution with means 69" and 64" and standard deviation of 3" (close enough to reality). Then the effect size of gender difference on height is 1.67 (a large effect size).

Effect size thus defined is useful because the statistical power of a classical test for $\delta$ being non-zero depends on $e\sqrt{\tilde{n}}$, where $\tilde{n}$ is the harmonic mean of the sample sizes of the two groups being compared. To observe this, let $W$ be the difference between the sample averages of the two groups (our test statistic). Since $W$ is the difference of two independent normal random variables,\[

W \sim N\left(\delta, \left(\frac{1}{n_t}+\frac{1}{n_c}\right)\sigma^2\right)

\]where $n_t$ and $n_c$ are the sample sizes of the treatment and control groups. If we define\[

\frac{2}{\tilde{n}} = \frac{1}{n_t} +\frac{1}{n_c}

\]then $W \sim N(\delta, 2\sigma^2/\tilde{n})$. A two-sided classical hypothesis test with Type I and Type II errors no greater than $\alpha$ and $\beta$ respectively requires that\[

\frac{\delta}{\sqrt{2\sigma^2/\tilde{n}}} > \Phi^{-1}(1-\frac{\alpha}{2}) + \Phi^{-1}(1-\beta)

\\

\Rightarrow e \sqrt{\tilde{n}} > \sqrt{2}\left(\Phi^{-1}(1-\frac{\alpha}{2}) + \Phi^{-1}(1-\beta)\right)

\] where $\Phi$ is the cumulative distribution function of the standard normal distribution. To obtain a sufficiently powered test, we therefore need\[

\tilde{n} >\frac{K(\alpha, \beta)}{e^2}

\] where $K(\alpha, \beta) = 2 \left(\Phi^{-1}(1-\frac{\alpha}{2}) + \Phi^{-1}(1-\beta) \right)^2$.

For typical values of $\alpha=0.05$ and $\beta=0.1$ we have $K(\alpha,\beta)=21.01$. So continuing our traditional example, imagine we wish to test the hypothesis that the average height of men is different from the average height of women in the US. Using the effect size of $1.67$ (standard deviation assumed known), we obtain a minimal required sample size of $\tilde{n}>7.57$. This means we would need at least 16 people (8 men and 8 women) to get the desired statistical power.
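The arithmetic above can be checked with a short Python sketch that uses only the standard library. The helper names `k_factor` and `min_harmonic_n` are made up for this illustration; the formulas are the ones given in the text.

```python
from statistics import NormalDist


def k_factor(alpha: float, beta: float) -> float:
    """K(alpha, beta) = 2 * (Phi^-1(1 - alpha/2) + Phi^-1(1 - beta))^2."""
    inv_cdf = NormalDist().inv_cdf  # quantile function of the standard normal
    return 2.0 * (inv_cdf(1.0 - alpha / 2.0) + inv_cdf(1.0 - beta)) ** 2


def min_harmonic_n(effect_size: float, alpha: float = 0.05, beta: float = 0.1) -> float:
    """Lower bound on the harmonic-mean sample size n-tilde for effect size e."""
    return k_factor(alpha, beta) / effect_size ** 2


# Heights example from the text: means 69" and 64", common standard deviation 3".
effect_size = (69 - 64) / 3
k = k_factor(0.05, 0.1)
n_tilde = min_harmonic_n(effect_size)
print(f"K = {k:.2f}, effect size = {effect_size:.2f}, n_tilde > {n_tilde:.2f}")
```

Rounding the bound up to the next whole number gives the per-group count when both groups are the same size, since the harmonic mean of two equal sample sizes is that common size.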

## Effect fraction

In contrast to traditional analysis, the quantity typically of interest to an LSOS business is what we might call the **effect fraction**, \[

f = \frac{\delta}{\mu}

\]namely, the difference in group means as a fraction of the mean. For instance, when running an experiment, we would want to know the change in downloads per user session on our music site as a fraction of current downloads per session (this is just the percent change in downloads). The business is probably much less interested in the change in downloads per session as a fraction of the standard deviation of downloads per session (effect size). “Effect fraction” isn’t a standard term but I find it useful to distinguish the concept from “effect size”.

Often, a mature LSOS would consider changes of the order of 1% in effect fraction to be practically significant. Several improvements over the year, each of the order of 2% or 3%, would result in substantial annual improvement due to product changes alone. It’s a great recipe for steady product development but requires the ability to run many experiments while being able to measure effect fractions of this size quickly and reliably.

## Coefficient of variation

Now let’s look at the ratio of effect fraction to effect size\[

\frac{f}{e} = \frac{\delta/\mu}{\delta/\sigma} = \frac{\sigma}{\mu}

\]This ratio is just the **coefficient of variation** (CV) of a random variable, defined as its standard deviation over its mean. For our LSOS experiment above, the CV of $Y_i$ in control is $c=\sigma/\mu$, and the CV under treatment is approximately the same. Being dimensionless, it is a simple measure of the variability of a (non-negative) random variable. Furthermore, the fractional standard error when estimating $\mu$ from $n$ samples is $(\sigma/\sqrt{n})/\mu=c/\sqrt{n}$. Thus CV can be seen as a measure of the amount of information each sample from a distribution provides towards estimating its mean. In signal processing, CV is simply the reciprocal of the **signal-to-noise ratio**. We could call observations from a distribution “information-poor” if their distribution has large CV. It shouldn’t surprise, then, that the larger the CV, the more observations it takes to run useful experiments. And because $e=f/c$, the larger the CV for a given effect fraction, the smaller the resulting effect size. If we control for Type I and Type II errors as before, the required sample size is\[

\tilde{n} > K(\alpha, \beta)\frac{c^2}{f^2}

\]
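To get a feel for how quickly this bound grows, here is a minimal Python sketch of the formula. The function names and the CV values in the loop are illustrative assumptions, not measurements from any real service.

```python
from statistics import NormalDist


def k_factor(alpha: float = 0.05, beta: float = 0.1) -> float:
    """K(alpha, beta) = 2 * (Phi^-1(1 - alpha/2) + Phi^-1(1 - beta))^2."""
    inv_cdf = NormalDist().inv_cdf
    return 2.0 * (inv_cdf(1.0 - alpha / 2.0) + inv_cdf(1.0 - beta)) ** 2


def required_n(cv: float, effect_fraction: float,
               alpha: float = 0.05, beta: float = 0.1) -> float:
    """Lower bound on n-tilde: K(alpha, beta) * c^2 / f^2."""
    return k_factor(alpha, beta) * (cv / effect_fraction) ** 2


def fractional_standard_error(cv: float, n: int) -> float:
    """Standard error of the sample mean as a fraction of the mean: c / sqrt(n)."""
    return cv / n ** 0.5


# Hypothetical CVs, all at an effect fraction of 1%.
for cv in (0.05, 1.0, 4.0):
    print(f"CV = {cv:4.2f}: n_tilde > {required_n(cv, 0.01):,.0f}")
```

Because the bound depends on the ratio $c/f$ squared, doubling the CV or halving the effect fraction each quadruples the required sample size.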

## LSOS metrics can have large CV, small effect fraction, hence large sample sizes

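If a metric is based on the average rate of rare occurrences, its underlying observations will have high CV. In the world of online services, this is rather common. For instance, a news site might care about the average number of comments per user session as a measure of user engagement, even though the vast majority of user sessions do not result in a comment. The CV of a binary $\mathrm{Bernoulli}(p)$ random variable is $\sqrt{(1-p)/p}$. As the event becomes rarer, this grows as $1/\sqrt{p}$. Sometimes, the metric of interest is not the average rate of a rare binary event, per se, but is gated by such an event. For instance, the metric could be the price of goods purchased in the average user session. But if only a small fraction of user sessions have any purchase at all, then the coefficient of variation for the metric (sale price per session) will necessarily be even larger than that of the binary event (sessions with a sale). In any case, suppose that on average 5% of user sessions to a news site result in comments. The CV of the binary random variable “session has a comment” is $\sqrt{(1-0.05)/0.05}=4.36$. Compare this to our non-LSOS example of adult heights in the US, where the CV of women’s heights is $0.047$.

While our focus has been on CV, we would be remiss not to point out the surprisingly small effect fractions of interest. As noted earlier, effect fractions of 1% or 2% can have practical significance to an LSOS. These are very small when compared with the kinds of effect fractions of interest in, say, medicine. Medicine uses the term “relative risk” to describe effect fraction when referring to the fractional change in incidence of some (bad) outcome like mortality or disease. To see what effect fractions are interesting in medicine, I looked at a recent Lancet paper which claims to prove that happiness doesn’t directly affect mortality. The paper gained much attention because, having conducted the largest study of its kind, it was understood to debunk the idea **definitively**. However, their abstract presents relative risk of death comparing the unhappy group to the happy group with CIs we’d consider quite large: death from all causes -6% to +1%, from ischemic heart disease -13% to +10%, from cancer -7% to +2%. A typical LSOS experiment with effect fraction CIs of several percent would be considered too underpowered to establish absence of meaningful effect.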
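Putting the two ingredients together, the following Python sketch compares the rare-event metric with the heights example. The 5% comment rate and the height figures come from the text. Applying a 1% effect fraction to heights is purely an assumption made for the sake of comparison, and the helper names are hypothetical.

```python
from statistics import NormalDist


def k_factor(alpha: float = 0.05, beta: float = 0.1) -> float:
    """K(alpha, beta) = 2 * (Phi^-1(1 - alpha/2) + Phi^-1(1 - beta))^2."""
    inv_cdf = NormalDist().inv_cdf
    return 2.0 * (inv_cdf(1.0 - alpha / 2.0) + inv_cdf(1.0 - beta)) ** 2


def bernoulli_cv(p: float) -> float:
    """CV of a Bernoulli(p) variable: sqrt((1 - p) / p)."""
    return ((1.0 - p) / p) ** 0.5


def required_n(cv: float, effect_fraction: float) -> float:
    """Lower bound on n-tilde: K(alpha, beta) * c^2 / f^2, with default alpha and beta."""
    return k_factor() * (cv / effect_fraction) ** 2


effect_fraction = 0.01           # a 1% change, as discussed for a mature LSOS
cv_comments = bernoulli_cv(0.05) # "session has a comment" at a 5% rate
cv_heights = 3 / 64              # women's heights: sd 3", mean 64"

n_comments = required_n(cv_comments, effect_fraction)
n_heights = required_n(cv_heights, effect_fraction)

print(f"comments: CV = {cv_comments:.2f}, n_tilde > {n_comments:,.0f}")
print(f"heights:  CV = {cv_heights:.3f}, n_tilde > {n_heights:,.0f}")
```

Under these assumptions the rare-event metric needs on the order of millions of units per group, while a height-like metric would need only a few hundred to detect the same fractional change.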

## A consequence of the LSOS business model?

Large CV and small effect fraction of practical significance mean that an LSOS **requires** a very large sample size compared to traditional experiments. It seems worth wondering why an LSOS should end up with large CV and also care about small effect fractions. One line of explanation I find plausible has to do with the low variable costs enabled by scalable web architectures: if the LSOS doesn’t involve physical goods and services (think Facebook, Spotify, Tinder as opposed to Amazon, Shutterfly, Apple), the marginal cost to support one more user request is almost zero. This is very different from bricks-and-mortar companies, where there are many marginal costs, such as sales, manufacturing, and transportation.

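Very low variable costs have two implications for the business model of these online services. First, low marginal cost of serving users allows the LSOS to pursue a business model in which it only monetizes through rare events while making the bulk of user interactions free. For instance, a free personal finance service may make its living through the rare sale of lucrative financial products, while a free photo storage site may monetize through rare orders for prints and photobooks which the LSOS refers to its physical business partners. Thus, important metrics for an LSOS often involve large CV precisely because they are based on aggregating these rare and vital events. Second, very low variable costs mean any growth essentially adds to the bottom line. In this light, a 1% effect fraction in activity metrics might be important because it could represent a much larger percentage of operating profit.

## Conclusion

LSOS experiments often measure metrics involving observations with a high coefficient of variation. They also tend to care about small effect fractions. I speculated that both of these may be due to a common LSOS business model of “making it up on volume” (an old joke). Whatever the reason, this means effect sizes of interest are orders of magnitude smaller than what a traditional statistical experiment would find practically significant. To detect such tiny effect sizes, an LSOS needs to run experiments with a very large number of experimental units. Perhaps it is fitting that if scalable online services create a difficult estimation problem (small effect size), they also possess the means (many experimental units) to solve it.

[1] Diane Tang, Ashish Agarwal, Deirdre O'Brien, Mike Meyer, “Overlapping Experiment Infrastructure: More, Better, Faster Experimentation”, Proceedings of the 16th Conference on Knowledge Discovery and Data Mining, Washington, DC.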