Experiment design and modeling for long-term studies in ads

by HENNING HOHNHOLD, DEIRDRE O'BRIEN, and DIANE TANG

In this post we discuss the challenges in measuring and modeling the long-term effect of ads on user behavior. We describe experiment designs which have proven effective for us and discuss the subtleties of trying to generalize the results via modeling.


A/B testing is used widely in information technology companies to guide product development and improvements. For questions as disparate as website design and UI, prediction algorithms, or user flows within apps, live traffic tests help developers understand what works well for users and the business, and what doesn’t. 

Nevertheless, A/B testing has challenges and blind spots, such as:
  1. identifying suitable metrics that give "works well" a measurable meaning. This is essentially the same as finding a truly useful objective to optimize.
  2. capturing long-term user behavior changes that develop over time periods exceeding the typical duration of A/B tests, say, over several months rather than a few days.
  3. accounting for effects "orthogonal" to the randomization used in experimentation. For example, in ads, experiments using cookies (users) as experimental units are not suited to capturing the impact of a treatment on advertisers or publishers, or their reaction to it.

A small but persistent team of data scientists within Google’s Search Ads has been pursuing item #2 since about 2008, leading to a much improved understanding of the long-term user effects we miss when running typical short A/B tests. This work has also resulted in advances for item #1, as it helped us define more useful objectives for A/B tests in Search Ads that include the long-term impact of experimental treatments.

Recently, we presented some basic insights from our effort to measure and predict long-term effects at KDD 2015 [1]. In this blog post, we summarize that paper and refer you to it for details. Since we work in Google’s Search Ads group, the long-term effects our studies focus on are ads blindness and sightedness, that is, changes in users’ propensity to interact with the ads on Google’s search results page. However, much of the methodology is not ads-specific and should help investigate other questions such as long-term changes in website visit patterns or UI feature usage.


A/A tests and long-term effects


In our quest to measure long-term user effects, we found a neat and surprising use case for A/A tests, i.e., experiments where treatment and control units receive identical treatments. Typically, these are used at Google to diagnose problems in the experiment infrastructure or undesired biases between treatment and control cohorts for A/B tests. Amazingly, in the context of long-term studies such "undesirable biases" turn into the main object of interest!

To see this, imagine you want to study long-term effects in an A/B test. The first thing you’ll want to do is to run your test for a long time with fixed experimental units, in our case cookies. Doing this affords time for long-term effects to develop and manifest themselves. The principal challenge is now to isolate long-term effects from the primary impact of applying the A/B treatment. Unfortunately, this is difficult since long-term effects are often much more subtle than the primary A/B effects, and even small changes in the latter (due to seasonality, other serving system updates, etc.) can overshadow the long-term effects we are trying to measure. We have found it basically impossible to adjust for such changes in the primary A/B effects through modeling. To see why this is so difficult, note, first, that there is a large number of factors that could potentially affect the primary A/B effects. Second, even if we could account for all of them, it would still be difficult to predict which ones interact with the A/B treatment (most of them don’t), and what the effect would be. Last, even if we could pull a model together, it might lack sufficient credibility to justify business-critical decisions since conclusions would depend strongly on model assumptions (given the relatively small size of the long-term effects).

Better experimental design, rather than fancy modeling, turned out to be the key to progress on this question. An elegant way to circumvent the issue of changes in the primary A/B treatment over time is to include a "post-period" in the experimental setup. By this we mean an A/A test with the same experimental units as in our A/B test, run immediately after the A/B treatment. Figure 1 shows the post-period and also includes an A/A test pre-period to verify that our experiment setup and randomization work as intended.

Figure 1: Pre-periods and post-periods

The simple but powerful rationale behind post-periods is that during an A/A comparison there are no “primary effects” and hence any differences between the test and control cohorts during the post-period are due to the preceding extended application of the A/B treatment to the experimental units. For Google search ads, we have found that this method gives reliable and repeatable measurements of user behavior changes caused by a wide variety of treatments.
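As a minimal sketch of how the pre-period and post-period comparisons might be summarized, the snippet below computes the relative click-through-rate difference between the two cohorts for each period. The function names, the data layout, and the choice of CTR as the metric are illustrative assumptions, not the metrics or tooling used in the actual studies.

```python
def relative_ctr_difference(treated, control):
    """Return CTR_treated / CTR_control - 1 for (clicks, impressions) tuples."""
    treated_ctr = treated[0] / treated[1]
    control_ctr = control[0] / control[1]
    return treated_ctr / control_ctr - 1.0


def summarize_periods(counts):
    """Map each period name to the relative CTR difference between cohorts.

    counts: dict of period -> {"treated": (clicks, impressions),
                               "control": (clicks, impressions)}
    (hypothetical layout, chosen for illustration only).
    """
    return {
        period: relative_ctr_difference(c["treated"], c["control"])
        for period, c in counts.items()
    }


# Made-up numbers: both cohorts receive identical serving in "pre" and "post".
example_counts = {
    "pre": {"treated": (500, 10000), "control": (500, 10000)},
    "post": {"treated": (480, 10000), "control": (500, 10000)},
}
summary = summarize_periods(example_counts)
```

In this toy example the pre-period difference is zero, as expected for a clean A/A comparison, while a nonzero post-period difference would be attributed to the earlier exposure to the treatment.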

An obvious downside of post-periods is that the measurement of long-term effects happens only after the end of the treatment period, which can last several months. In particular, no intermediate results become available, which is a real downer in practice.

This can be remedied by another addition to the experimental design, namely a second experiment serving the treatment B. The twist is that in this second experiment we re-randomize study participants daily. Since at Google the typical experimental unit is a cookie, we call this construct with daily re-randomization a cookie-day experiment (as opposed to the cookie experiments we’ve considered up to now, where the experimental units stay fixed across time). We take the cookies for our cookie-day experiment from a big pool of cookies that receive the control treatment whenever they are not randomized into our cookie-day experiment — which is almost always. Consequently, the longer-term aspects of their behavior are shaped by having experienced the control treatment.
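The difference between fixed and daily re-randomized units can be illustrated with hash-based assignment. This is only a sketch under our own assumptions: the salts, the bucket count, and the function names are hypothetical and do not describe Google's experiment infrastructure. It also ignores the requirement that cookie-day units be drawn from the pool of control cookies.

```python
import hashlib

NUM_BUCKETS = 1000  # assumed granularity, for illustration only


def bucket(key: str) -> int:
    """Map a string key to a stable bucket in [0, NUM_BUCKETS)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_BUCKETS


def in_cookie_experiment(cookie_id: str, salt: str, fraction: float) -> bool:
    """Fixed units: membership depends only on the cookie, so it never changes."""
    return bucket(f"{salt}|{cookie_id}") < fraction * NUM_BUCKETS


def in_cookie_day_experiment(cookie_id: str, day: str, salt: str,
                             fraction: float) -> bool:
    """Daily re-randomization: the date is part of the hash key, so a fresh
    set of cookies is selected every day."""
    return bucket(f"{salt}|{day}|{cookie_id}") < fraction * NUM_BUCKETS
```

Because the date enters the hash key, a given cookie lands in the cookie-day experiment only on a small fraction of days and receives the control treatment otherwise.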

On any given day of the treatment period, the cookie-day and cookie experiments serving B define a B/B test, which we call the cookie cookie-day (CCD) comparison. As in the post-period case, this allows us to attribute metric differences between the two groups to their previous differential exposure (A for the cookie-day experiment, and B for the cookie experiment).
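A CCD comparison could be computed per day roughly as follows. The row layout, the arm labels "cookie" and "cookie_day", and the use of CTR as the metric are assumptions made for this sketch.

```python
from collections import defaultdict


def ccd_daily_differences(rows):
    """Return {day: CTR_cookie / CTR_cookie_day - 1}.

    rows: iterable of (day, arm, clicks, impressions) with arm in
    {"cookie", "cookie_day"}; both arms are served treatment B.
    """
    totals = defaultdict(lambda: {"cookie": [0, 0], "cookie_day": [0, 0]})
    for day, arm, clicks, impressions in rows:
        totals[day][arm][0] += clicks
        totals[day][arm][1] += impressions

    differences = {}
    for day in sorted(totals):
        arms = totals[day]
        ctr_cookie = arms["cookie"][0] / arms["cookie"][1]
        ctr_cookie_day = arms["cookie_day"][0] / arms["cookie_day"][1]
        differences[day] = ctr_cookie / ctr_cookie_day - 1.0
    return differences


# Made-up rows for two days of the treatment period.
example_rows = [
    (1, "cookie", 50, 1000), (1, "cookie_day", 50, 1000),
    (2, "cookie", 49, 1000), (2, "cookie", 49, 1000),
    (2, "cookie_day", 50, 1000),
]
daily = ccd_daily_differences(example_rows)
```

Since both arms see the same treatment on the day of measurement, each daily value reflects only the difference in prior exposure.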

A neat aspect of CCD is that it allows us to follow user behavior changes while they are happening. For example, Figure 2 shows the change in users’ propensity to click on Google search ads for 10 different system changes that vary ad load and ranking algorithms. Depending on the average ad quality the different cohorts are exposed to, their willingness to interact with our ads changes over time. We learn that the average user “attitude towards ads” in each of the 10 cohorts approaches a new equilibrium, and that the process can be approximated reasonably well by exponential curves with a common learning rate (dashed lines).

Figure 2: User learning as measured by CCD experiments

Note that all learning measurements here are taken at an aggregate (population) level, not on a per-user basis.
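To make the idea of exponential curves with a common learning rate concrete, here is a minimal fitting sketch. It assumes the form change_i(t) = a_i * (1 - exp(-r * t)), with one equilibrium level a_i per cohort and a shared rate r. This parameterization, the grid search over r, and all names are our assumptions for illustration; the post states only that the curves are exponential with a common learning rate.

```python
import math


def fit_common_rate(series, candidate_rates):
    """Fit a_i * (1 - exp(-r * t)) to every cohort with a shared rate r.

    series: dict of cohort -> list of (day, observed_change).
    candidate_rates: iterable of positive rates to try (grid search).
    Returns (best_rate, {cohort: fitted equilibrium level a_i}).
    """
    best = None
    for rate in candidate_rates:
        total_sse = 0.0
        levels = {}
        for cohort, points in series.items():
            shape = [1.0 - math.exp(-rate * day) for day, _ in points]
            # For a fixed rate, the least-squares a_i has a closed form.
            numerator = sum(s * y for s, (_, y) in zip(shape, points))
            denominator = sum(s * s for s in shape)
            level = numerator / denominator if denominator > 0 else 0.0
            levels[cohort] = level
            total_sse += sum((y - level * s) ** 2
                             for s, (_, y) in zip(shape, points))
        if best is None or total_sse < best[0]:
            best = (total_sse, rate, levels)
    return best[1], best[2]
```

The input would be the daily aggregate differences from CCD comparisons, one series per cohort. A grid search keeps the sketch dependency-free; a nonlinear least-squares routine would be a natural alternative.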


Modeling long-term effects in ads


In addition to measuring long-term effects we’ve also made efforts to model them. This is attractive for many reasons, the most prominent being:
  • running long-term studies is a lot of work, expensive, and takes a long time. Having reliable models predicting these effects enables us to take long-term effects into account without slowing down development.
  • interpretable models help us understand what drives user behavior. This knowledge has influenced our decision-making way beyond the concrete cases we studied in detail.

The most important insight from our modeling efforts is that users’ attitude towards Google’s search ads is, above all, shaped by the average quality of the ads they experienced previously. More precisely, we learned that both the relevance of the ads shown and the experience after users click substantially influence their future propensity to interact with ads. For more details see [1] Section 4. A scatterplot of observations vs. predictions for a model of this type is given in Figure 3:

Figure 3: Predicted vs. measured user learning.

The plot makes clear that our quality-based models can predict how users will react to a relatively large class of changes to Google’s ads system. (Note that UI manipulations are absent here; we have found these to be much harder to understand from a modeling perspective.) We use this knowledge to define objective functions to optimize our ads system with a view towards the long-term. In other words, we have created a long-term focused OEC (Overall Evaluation Criterion [2]) for online experiment evaluation.

You’ve probably noticed just how few data points the scatterplot contains. That’s because each observation here is a long-term study, usually with a treatment duration of about three months. Hence generating suitable training data is challenging, and as a consequence we ran into the curious situation of dealing with extremely small data at Google. Over the years, our modeling efforts have taught us that in such a situation:
  • cross-validation may not be sufficient to prevent overfitting when the data is sparse and the set of available covariates is large. Moreover, not all training data is created equal — in our case several observations come from conceptually similar treatments. This additional structure must be taken into account, at the very least in creating cross-validation folds. Otherwise cross-validation RMSEs might seriously overstate prediction accuracy on new test data.
  • choosing interpretable models both appeals to humans and reduces the model space, which improves prediction accuracy on test data.
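One way to respect the structure among similar treatments when building cross-validation folds is to keep each group of related studies together in a single fold. The sketch below is a simple greedy version of this idea. The group labels and the balancing heuristic are illustrative assumptions, not the procedure used for the models described here.

```python
from collections import defaultdict


def grouped_folds(groups, n_folds):
    """Assign observation indices to folds without splitting any group.

    groups: list with one group label per observation, e.g. the family of
    treatments a long-term study belongs to (hypothetical labels).
    Returns a list of n_folds lists of observation indices.
    """
    members = defaultdict(list)
    for index, label in enumerate(groups):
        members[label].append(index)

    folds = [[] for _ in range(n_folds)]
    # Place the largest groups first, each into the currently smallest fold.
    ordered = sorted(members, key=lambda g: (-len(members[g]), str(g)))
    for label in ordered:
        smallest = min(range(n_folds), key=lambda k: len(folds[k]))
        folds[smallest].extend(members[label])
    return folds


study_groups = ["ad_load", "ad_load", "ad_load", "ranking", "ranking",
                "quality", "quality", "other"]
folds = grouped_folds(study_groups, 3)
```

Holding out whole groups means each fold is evaluated on treatments that are conceptually new to the model, which is closer to the intended use. Libraries such as scikit-learn offer a comparable splitter (`GroupKFold`) for the same purpose.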

Finally, the fact that our interpretable models say “quality makes users more engaged” also helps validate the overall measurement methodology. We get asked whether the long-term effects we measure may just be biases on long-lived cookies or a similar unnoticed failure of our experiment setup. This seems unlikely: why should quality reliably predict cookie biases? We’ve certainly performed many other checks, such as negative controls and tests for meaningful dose-response relationships, but it is nice to have this simple result as validation.

Conclusion


We’ve described methods to measure and predict changes in long-term user behavior. These methods have had lasting impact on ad serving at Google. For instance, in 2011 we altered the AdWords auction ranking function to account for the long-term impact of showing a given ad. This function determines which ads can show on the SERP and in which order, and the adjustment places greater emphasis on user satisfaction after ad clicks. Long-term studies played a crucial role both in the motivation and evaluation of this change (see [1] Section 5).

Another application was to the ad load on smartphones. We saw strong long-term user effects here, and as a consequence reduced the ad load on Google’s mobile interface by about 50% over the course of two years, with a positive impact on both user experience and the business. Similar adjustments have been made on some websites for which Google serves as ads provider.

While “more ads equals more money” might possibly be supported by short-term A/B tests, the clear message of our research is that this is misguided when taking the long view. At a time when many internet publishers struggle to strike a balance between ad load and user experience on their site, we hope that the long-term focus described here will be of broader interest.

References


[1] Henning Hohnhold, Deirdre O'Brien, Diane Tang, Focus on the Long-Term: It's better for Users and Business, Proceedings 21st Conference on Knowledge Discovery and Data Mining, 2015.

[2] Ron Kohavi, Randal M. Henne, Dan Sommerfield, Practical Guide to Controlled Experiments on the Web: Listen to Your Customers not to the HiPPO, Proceedings 13th Conference on Knowledge Discovery and Data Mining, 2007.