Using random effects models in prediction problems

by NICHOLAS A. JOHNSON, ALAN ZHAO, KAI YANG, SHENG WU, FRANK O. KUEHNEL, and ALI NASIRI AMINI

In this post, we give a brief introduction to random effects models, and discuss some of their uses. Through simulation we illustrate issues with model fitting techniques that depend on matrix factorization. Far from hypothetical, we have encountered these issues in our experiences with "big data" prediction problems. Finally, through a case study of a real-world prediction problem, we also argue that Random Effect models should be considered alongside penalized GLM's even for pure prediction problems.

Random effects models are a useful tool for both exploratory analyses and prediction problems. We often use statistical models to summarize the variation in our data, and random effects models are well suited for this — they are a form of ANOVA after all. In prediction problems these models can summarize the variation in the response, and in the process produce a form of adaptive regularization through the learned prior distributions. Random effects can be viewed as an application of empirical Bayes, and if we broaden our view to that parent technique, we can find theoretical evidence of their superiority over say fixed effects models. For example, the James-Stein estimator has an empirical Bayes interpretation — it was famously proven to dominate the MLE (an analogue of fixed effects) for the problem of estimating the mean of a multivariate normal distribution (see [3] and [4]).

In the context of prediction problems, another benefit is that the models produce an estimate of the uncertainty in their predictions: the predictive posterior distribution. These predictive posterior distributions have many uses such as in multi-armed bandit problems. Thompson sampling is an "explore/exploit" algorithm which was designed for Bayesian models, and its strategy is to randomly select a choice (called an "arm") with probability proportional to the posterior of that choice being the best option. See [9], [10] for a discussion of this approach.

In the next two sections we give a brief introduction to random effects models and then discuss a case study in click-through-rate modeling. This example shows that the prediction accuracy can be competitive with popular alternatives such as penalized GLM's — in fact, much better in this example. Furthermore the algorithms to fit these models are sufficiently scalable to handle many problems of interest.

Random Effect Models

We will start by describing a Gaussian regression model with known residual variance $\sigma_j^2$ of the $j$th training record's response, $y_j$. Often our data can be stored or visualized as a table like the one shown below. In this example we have three features/columns named "a", "b", and "c". Column "a" is an advertiser id, "b" is a web site, and "c" is the 'interaction' of columns "a" and "b".

$y$	feature (a)	feature (b)	feature (c)	$\sigma^2$
1.2	hi-fly-airlines	123.com	"hi-fly-airlines__123.com"	1.10
0.1	erudite-bookstore	123.com	"erudite-bookstore__123.com"	0.15
0.3	hi-fly-airlines	xyz.com	"hi-fly-airlines__xyz.com"	0.20
0.3	lightspeed-brokerage	www.com	"lightspeed-brokerage__www.com"	0.10

To make mathematical notation simpler we suppose that string values in each column have been enumerated, and we deal with their integer index instead:

$y$	$a$	$b$	$c$	$\sigma^2$
1.2	1	1	1	1.10
0.1	2	1	2	0.15
0.3	1	2	3	0.20
0.3	3	3	4	0.10

Now for the $j$th row of data we define the index of the feature "a" as $I_a(j) \in \{1, 2, ..., \mbox{#advertisers} \}$. In our example we have \begin{align*}
I_a(1) &= 1\\
I_a(2) &= 2 \\
I_a(3) &= 1 \\
I_a(4) &=3 \\
\cdots
\end{align*}Similarly $I_b(j) = \{1, 2, ..., \mbox{#websites}\}$ could be the index of the site on which the ad was shown and $I_c(j)$ could index the (advertiser, site) pair.

Finally we get to our model of the data: \[
\{y_j | I_a(j), I_b(j), I_c(j), \sigma_j^2 \} \sim
N(\beta + \mu_{I_a(j)}^a + \mu_{I_b(j)}^b + \mu_{I_c(j)}^c, \sigma_j^2)
\]The curly-bracketed left-hand side, $\{y_j| \cdots \}$, denotes a conditional distribution; the parameter $\beta$ represents the overall mean of the data (whose fitted value depends on the structure of the model); and the $\mu^a$'s, $\mu^b$'s, and $\mu^c$'s represent random effects i.e. unobserved random variables.

The model is complete once we specify a prior distribution on each of the $\mu$'s: \begin{align*}
\mu_k^a &\sim N(0, \tau_a^{-1}) \\
\mu_k^b &\sim N(0, \tau_b^{-1}) \\
\mu_k^c &\sim N(0, \tau_c^{-1})
\end{align*}In the formulas above the parameter $\tau_x$ is the inverse-variance of the prior, and all coefficients associated with the same column are drawn from a common prior. Rather than just specifying the $\tau$'s, we will try to infer them from the data: they can be fitted by a Monte Carlo Expectation Maximization (MCEM) algorithm or they can be sampled in a Bayesian model.

These inferred prior distributions give a decomposition of the variance of the data\[
\mathrm{var\ }y|\sigma = \sigma^2 + \tau_a^{-1} + \tau_b^{-1} + \tau_c^{-1}
\]The variance decomposition formula also gives breakdowns such as\begin{align}
\mathrm{var\ }y|\sigma &= E_\sigma[[\mathrm{var\ } y|\sigma, I_b, I_c]| \sigma]
+ \mathrm{var_\sigma}[E[y|\sigma, I_b, I_c]| \sigma] \\
&= (\sigma^2+ \tau_a^{-1}) + (\tau_b^{-1} + \tau_c^{-1})
\end{align}Their interpretation is most intuitive in hierarchical models. More complex models can involve elements such as "random slopes", and for these it can be informative to calculate a version of $R^2$ (fraction of variance explained) which was developed for random effects models in the textbook by Gelman and Hill [1].

The prior variance decomposition or $R^2$ are useful summaries of the explanatory power of each random effect column, but when making predictions we need the posterior variances of individual random effects. At prediction time suppose we have features (a=3, b=7, c=2) and we need the linear combination $\beta + \mu_3^a + \mu_7^b + \mu_2^c$. By considering several MCMC samples (indexed $t=1,2, \dots, T$) we can compute statistics based on the sample approximation to the posterior distribution of the prediction i.e. $\{\beta_t + \mu_{3,t}^a + \mu_{7,t}^b + \mu_{2,t}^c \}_{t=1}^T$. If we only need the posterior mean then we can simply pre-compute the mean of each $\mu$.

Although Gaussian models are a good way to introduce random effects, in practice we must often model discrete data — usually in the form of counts of events. Using similar algorithms, we can fit the following Gamma-Poisson model:\begin{align}
\{ y_j | I_a(j), I_b(j), I_c(j), \lambda_j \}
&\sim \mathrm{Pois}(\lambda_j \mathrm{e}^
{\beta +\mu_{I_a(j)}^a + \mu_{I_b(j)}^b + \mu_{I_c(j)}^c}) \\
\mathrm{e}^{\mu_k^x} &\sim \Gamma(\tau_x, \tau_x)
\end{align}where the prior has mean of $1.0$ and variance $\tau_x^{-1}$.

Binomial models are also of interest but logistic regression in particular is similar to the Poisson regression when the rate of positive labels is low. In many applications the positive rate is indeed very low, and we then prefer the Poisson model for the computational efficiency of being able to use sums as sufficient statistics [7].

When we only need\[
\log E[y|\mu, \beta, I_a, I_b, I_c, \sigma] =
\beta +\mu_{I_a(j)}^a + \mu_{I_b(j)}^b + \mu_{I_c(j)}^c
\]in the Poisson model, then it is simple to pre-calculate the mean of each parameter — just as in the Gaussian model. On the other hand, if we required the posterior mean of $\mathrm{exp}(\beta +\mu_{I_a(j)}^a + \mu_{I_b(j)}^b + \mu_{I_c(j)}^c)$ (i.e. the mean of $y_j$), or if we need a variance estimate, then we must store multiple MCMC samples of every $\mu$. This can represent a challenge since we must now store and manipulate many times more parameters; however, there is recent work to develop more compact summaries of the posterior mean [2], [5]. We look forward to further developments in this area.

A Case Study Click-Through-Rate Prediction

In this section we evaluate the prediction accuracy of random effects models in click-through-rate modeling on several data sets — each corresponding to 28 days of traffic belonging to single display ads segment.

We selected 10 segments and chose a subset of training data such that the number of training examples varied from between 20 million to 90 million for each segment. We compared the output of a random effects model to a penalized GLM solver with "Elastic Net" regularization (i.e. both L1 and L2 penalties; see [8]) which were tuned for test set accuracy (log likelihood).

On each of the ten segments the random effects model yielded higher test-set log likelihoods and AUCs, and we display the results in the figure below. In that figure we have taken single future day of data as the test set i.e. we trained on data from days 1, ..., 28 and evaluated on day 29. We saw similar results when looking further ahead and evaluating on days 29 through 34.

Figure 1: Comparing Random Effects vs. Penalized GLM on AUC and log likelihood.

Through this case study we support the argument that practitioners should evaluate random effects models when they encounter a new problem. Often one tries several different techniques and then either combines the outputs or selects the most accurate, and we believe random effects models are a valuable addition to the usual suite of approaches which includes penalized GLMs, decision trees, etc.

Scalability studies

First we evaluate the popular lme4 R package and compare against a specialized Gibbs sampler which we describe in [7]. The lme4 software is an excellent choice for many datasets; however, to scale to large, multi-factor datasets we found it necessary to turn to alternative algorithms such as the Gibbs sampler. These algorithms can be distributed and deal with problems which would be RAM-limited for lme4.

This comparison is meant to show that the relative run times greatly depend on the structure of the random effect model matrix. There are designs for which lme4 scales linearly in the number of columns, and there are also non-pathological designs for which it appears to be quadratic or worse.

The figure below shows the run time for a Poisson lme4 model (dashed) and for 1000 iterations of a Gibbs sampler in the case of nested feature columns (left) and non-nested (right). By 'nested' we mean that there is hierarchical relationship so the level of a parent column is a function of the level of the child column. For example, consider a domain and the url's within it.

The left figure shows that the run time of both methods grows linearly with the number of input records and random effects (we simulated such that the number of random effects at the finest level was approximately equal to the number of input records divided by five). The right figure shows that the run time of one scan of the Gibbs sampler has about the same cost per iteration whether the structure is nested or not (as expected); however, for lme4 the cost of fitting the model increases dramatically as we add random effect columns.

Figure 2: Comparing custom Gibbs sampler vs. lmer running times.

We have many routine analyses for which the sparsity pattern is closer to the nested case and lme4 scales very well; however, our prediction models tend to have input data that looks like the simulation on the right. For example, we may have features that describe or identify an advertiser, other features for the web site on which an ad shows, and yet more features describing the device e.g. the type and model of device: computer, phone, tablet, iPhone5, iPhone6, ...; the operating system; the browser.

In the simulations we plotted the run time for 1,000 scans of the Gibbs sampler. In these simulations this number is far more than necessary — based on manual checks of the convergence of prior parameters and log likelihood. We chose 1,000 iterations in order to put the run time on comparable scale with lme4. This makes it easier to contrast how the two approaches handle more complex model structures i.e. more random effect columns.

One of the known challenges of using MCMC methods is deciding how many iterations to run the algorithm for. We recommend monitoring the prior variance parameters and the log likelihood of the data. More sophisticated checks have been proposed, and a good review of convergence diagnostics is found in [6]. Many of these techniques appear infeasible for the large regression models we are interested in, and we are still searching for checks that are easy to implement and reliable for high-dimensional models.

In the previous section we described per-segment models, and we now use the same datasets to illustrate the scalability of the algorithm. We selected two segments as a case study and varied the number of machines used in inference. The chosen segments had between 30 million and 90 million training records and a positive rate of about 1%. The models contained approximately 190 feature columns and the number of levels per column varied from between just two at the smallest to 5 million at the largest (with a median of about 15,000 levels per column). Finally, we should note that a training record typically represents a single ad-view but we can collapse those with identical features (all 190 of them in this case). In these examples aggregation only reduced the size of the data set by about 10%.

In the figure below we show how the run time per iteration (updating all 190 feature columns) varied with the number of machines used. For this timing test we also evaluated the run time of the model when applied to two combined datasets from all ten segments and another from thirty segments. These large timing tests had roughly 500 million and 800 million training examples respectively. The combined raw training data for the smaller one was 834 GB in protocol buffer format and after integerization and column formatting (roughly how the data is stored in RAM) the training data was roughly 170 GB in size on disk. The larger dataset was around 300 GB after integerization and column-formatting.

In the right panel of the figure below we can see that the run time is growing sub-linearly in the amount of training data i.e. doubling the number of training instances does not double the time per iteration.

Figure 3: Performance scalability of custom Gibbs sampler.

Conclusion and Discussion

We have found random effects models to be useful tools for both exploratory analyses and prediction problems. They provide an interpretable decomposition of variance, and in prediction problems they can supply predictive posterior distributions that can be used in stochastic optimization when uncertainty estimates are a critical component (e.g. bandit problems).

Our study of per-segment click-through-rate models demonstrated that random effects models can deliver superior prediction accuracy. Although this may not hold in every application, we encourage others to evaluate random effects models when selecting a methodology for a new prediction problem. In our case study we considered models with a couple hundred feature columns and hundreds of millions of inputs. This is for illustration and does not represent an upper bound on the problem size; we have applied these models to data sets with billions of inputs (not shown), and we are confident that the techniques are scalable enough to cover many problems of interest.

References

[1] Andrew Gelman, and Jennifer Hill. "Data analysis using regression and multilevel/hierarchical models." Cambridge University Press, (2006).

[2] Edward Snelson and Zoubin Ghahramani. "Compact approximations to bayesian predictive distributions." ICML, (2005).

[3] Bradley Efron. "Large-scale inference: empirical Bayes methods for estimation, testing, and prediction." Vol. 1. Cambridge University Press, (2012).

[4] Bradley Efron, and Carl Morris. "Stein's estimation rule and its competitors—an empirical Bayes approach." Journal of the American Statistical Association 68.341 (1973): 117-130.

[5] Anoop Korattikara, et al. "Bayesian dark knowledge." arXiv preprint arXiv:1506.04416 (2015).

[6] Mary Kathryn Cowles and Bradley P. Carlin. "Markov Chain Monte Carlo Convergence Diagnostics: A Comparative Review". Journal of the American Statistical Association, Vol. 91, No. 434 (1996): 883-904.

[7] Nicholas A. Johnson, Frank O. Kuehnel, Ali Nasiri Amini. "A Scalable Blocked Gibbs Sampling Algorithm For Gaussian And Poisson Regression Models." arXiv preprint arXiv:1602.00047, (2016).

[8] Hui Zou, and Trevor Hastie. "Regularization and variable selection via the elastic net." Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67.2 (2005): 301-320.

[9] Steven L. Scott. "A modern Bayesian look at the multi-armed bandit." Applied Stochastic Models in Business and Industry, 26 (2010): 639-658.

[10] Steven L. Scott. "Multi-armed bandit experiments in the online service economy." Applied Stochastic Models in Business and Industry, 31 (2015): 37-49.

Comments

UnknownApril 5, 2016 at 4:56 AM
Did you use an R installation with optimized BLAS to run lme4? For details see here: https://matloff.wordpress.com/2015/11/20/back-to-the-blas-issue/
UnknownApril 7, 2016 at 4:33 AM
Coming from psychology, I'm used to mixed models because we deal with nestedness, unequal time between observations, meaningful missingness, etc. I'm curious what you think of these approaches :
https://methodology.psu.edu/ra/inten-longit-data/research/tech
Discrete VortexApril 10, 2016 at 6:53 AM
This reminds me of the work that Eric Schwartz did with my past employer. The dissertation is here http://repository.upenn.edu/cgi/viewcontent.cgi?article=1852&context=edissertations.
The use of poisson is interesting, I need to read up on that. We stuck with logistic and had plans to try out probit. Data was similar (summarized extracts from Doubleclick), but with way fewer attributes that made lme4 happy. The work was in context of multi armed bandit optimization of banner ads across different ad publishing sites (hence Doubleclick).
Brendan O'ConnorApril 10, 2016 at 7:41 PM
For scaling mixed models, this might be relevant:

Fast Moment-Based Estimation for Hierarchical Models
Patrick O. Perry
http://arxiv.org/abs/1504.04941
UnknownAugust 30, 2016 at 7:30 PM
Hiya - did you test to see what the cost would be wrt to MSE/Loss if you just used the Gaussian (I am assuming p*n>10). Even more telling, since it sounds like you want to use this in for control (since you mention Thompson/Posterior selection), did you check for marginal efficacy wrt regret? IOW, did you test two controllers, one with partial pooling, and one without?
Cheers!
Matt Gershoff (matt (at) conductrics.com)
UnknownJune 4, 2018 at 10:34 AM
could someone define for me "segment" and "input" in this paper? thanks

Search This Blog

The Unofficial Google Data Science Blog