Using Random Effects Models in Prediction Problems

by NICHOLAS A. JOHNSON, ALAN ZHAO, KAI YANG, SHENG WU, FRANK O. KUEHNEL, and ALI NASIRI AMINI

In this post, we give a brief introduction to random effects models and discuss some of their uses. Through simulation we illustrate issues with model fitting techniques that depend on matrix factorization. Far from hypothetical, we have encountered these issues in our experiences with "big data" prediction problems. Finally, through a case study of a real-world prediction problem, we also argue that random effects models should be considered alongside penalized GLMs even for pure prediction problems. Random effects models are a useful tool for both exploratory analyses and prediction problems. We often use statistical models to summarize the variation in our data, and random effects models are well suited for this; they are a form of ANOVA after all. In prediction problems these models can summarize the variation in the response, and in the process produce a form of adaptive regularization.
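As a rough sketch of the idea rather than the post's own implementation (the simulated data, group structure, and library choices of statsmodels and scikit-learn are all assumptions made for this illustration), the following fits a random effects model next to a ridge-penalized regression on the same grouped data:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf
    from sklearn.linear_model import Ridge

    # Simulated grouped data: 50 groups, each with its own random intercept.
    rng = np.random.default_rng(0)
    n_groups, n_per_group = 50, 20
    group = np.repeat(np.arange(n_groups), n_per_group)
    group_effect = rng.normal(0.0, 1.0, size=n_groups)
    x = rng.normal(size=group.size)
    y = 2.0 * x + group_effect[group] + rng.normal(size=group.size)
    df = pd.DataFrame({"y": y, "x": x, "group": group})

    # Random effects (mixed) model: group intercepts are shrunk toward zero
    # according to the estimated between-group variance.
    mixed = smf.mixedlm("y ~ x", df, groups=df["group"]).fit()
    print(mixed.summary())

    # Penalized GLM baseline: one-hot group indicators with a fixed ridge
    # penalty, where the amount of shrinkage is chosen by the user.
    X = pd.get_dummies(df["group"], prefix="g", dtype=float)
    X["x"] = df["x"]
    ridge = Ridge(alpha=1.0).fit(X, df["y"])
    print("ridge coefficient on x:", ridge.coef_[-1])

The point of the contrast is that the mixed model estimates the between-group variance and shrinks the group effects accordingly, a form of adaptive regularization, whereas the ridge penalty applies a fixed amount of shrinkage chosen by the user.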

LSOS experiments: how I learned to stop worrying and love the variability

by AMIR NAJMI

In the previous post we looked at how large scale online services (LSOS) must contend with the high coefficient of variation (CV) of the observations of particular interest to them. In this post we explore why some standard statistical techniques to reduce variance are often ineffective in this “data-rich, information-poor” realm. Despite a very large number of experimental units, the experiments conducted by LSOS cannot presume statistical significance of all effects they deem practically significant. We previously went into some detail as to why observations in an LSOS have particularly high coefficient of variation (CV). The result is that experimenters can’t afford to be sloppy about quantifying uncertainty. Estimating confidence intervals with precision and at scale was one of the early wins for statisticians at Google. It has remained an important area of investment for us over the years. Given the role played by the variability of the underlying observations, …
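To make “often ineffective” concrete, here is a small simulation, with invented click rate, covariate strength, and sample size, showing how little precision regression adjustment on a covariate buys when the outcome is a rare binary event with a high CV:

    import numpy as np

    # Invented numbers: a 0.5% click rate and a weakly predictive covariate.
    rng = np.random.default_rng(1)
    n = 1_000_000
    p_click = 0.005                       # rare event: CV = sqrt((1-p)/p) ~ 14
    covariate = rng.normal(size=n)        # e.g. a standardized historical rate
    logit = np.log(p_click / (1 - p_click)) + 0.3 * covariate
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

    # Standard error of the plain mean.
    se_raw = y.std(ddof=1) / np.sqrt(n)

    # Regression-adjusted estimator: remove the part of y explained by the
    # covariate before averaging.
    beta = np.cov(y, covariate)[0, 1] / covariate.var(ddof=1)
    residual = y - beta * (covariate - covariate.mean())
    se_adj = residual.std(ddof=1) / np.sqrt(n)

    print(f"CV of y: {y.std(ddof=1) / y.mean():.1f}")
    print(f"SE reduction from adjustment: {100 * (1 - se_adj / se_raw):.2f}%")

With these invented numbers the adjusted standard error is only a tiny fraction of a percent smaller than the raw one, because the covariate explains almost none of the Bernoulli noise in the outcome.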

Variance and significance in large-scale online services

by AMIR NAJMI

Running live experiments on large-scale online services (LSOS) is an important aspect of data science. Unlike experimentation in some other areas, LSOS experiments present a surprising challenge to statisticians: even though we operate in the realm of “big data”, the statistical uncertainty in our experiments can be substantial. Because individual observations have so little information, statistical significance remains important to assess. We must therefore maintain statistical rigor in quantifying experimental uncertainty. In this post we explore how and why we can be “data-rich but information-poor”. There are many reasons for the recent explosion of data and the resulting rise of data science. One big factor in putting data science on the map has been what we might call Large Scale Online Services (LSOS). These are sites and services which rely both on ubiquitous user access to the internet and on advances in technology to scale to millions of simultaneous users…
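As a back-of-the-envelope illustration of what “data-rich but information-poor” means in practice (this uses the standard normal-approximation two-sample sample-size formula with invented rates and effect sizes, not a calculation from the post), the required number of observations per experiment arm grows with the square of the CV:

    def n_per_arm(cv, relative_effect, z_alpha=1.96, z_beta=0.84):
        """Approximate observations per arm to detect a given relative change
        in the mean (normal-approximation two-sample formula, 5% two-sided
        significance and 80% power by default)."""
        return 2 * ((z_alpha + z_beta) * cv / relative_effect) ** 2

    # A purchase indicator with a 1% rate has CV = sqrt(0.99 / 0.01) ~ 10.
    for cv in (1, 10, 30):
        n = n_per_arm(cv, relative_effect=0.02)
        print(f"CV = {cv:>2}: ~{n:,.0f} observations per arm "
              f"to detect a 2% relative change")

With a CV around 10, roughly that of a binary outcome occurring 1% of the time, detecting a 2% relative change at conventional significance and power already requires on the order of millions of observations per arm.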

Replacing Sawzall — a case study in domain-specific language migration

by AARON BECKER

In a previous post, we described how data scientists at Google used Sawzall to perform powerful, scalable analysis. However, over the last three years we’ve eliminated almost all our Sawzall code, and now the niche that Sawzall occupied in our software ecosystem is mostly filled by Go. In this post, we’ll describe Sawzall’s role in Google’s analysis ecosystem, explain some of the problems we encountered as Sawzall use increased which motivated our migration, and detail the techniques we applied to achieve language-agnostic analysis while maintaining strong access controls and the ability to write fast, scalable analyses. Any successful programming language has its own evolutionary niche, a set of problems that it solves unusually well. Sometimes this niche is created by language features. For example, Erlang has strong tools for constructing distributed systems built into the language. In other cases, features such as standard libraries and a language’s community…