Showing posts from August, 2015

An introduction to the Poisson bootstrap

by AMIR NAJMI

The bootstrap is a powerful resampling procedure that makes it easy to compute the distribution of any statistical estimator. However, running the standard bootstrap on big data (i.e., data that won’t fit in the memory of a single computer) can be computationally prohibitive. In this post I describe a simple “statistical fix” to the standard bootstrap procedure that allows us to compute bootstrap estimates of standard error in a single pass or in parallel.

At Google, data scientists are just too much in demand. Thus, anytime we can replace data scientist thinking with machine thinking, we consider it a win. Anticipating the ubiquity of cheap computing, Efron introduced the bootstrap back in 1979 [1]. What makes the bootstrap so attractive is that it doesn’t require any parametric assumptions about the data, or any math at all, and can be applied generically to a wide variety of statistical estimators. As simple as the bootstrap procedure is, its theory is far from trivial and …
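To make the single-pass idea concrete, here is a minimal sketch of a Poisson bootstrap for the standard error of a sample mean. It is an illustration under stated assumptions, not the author's implementation: the function names (`poisson1`, `poisson_bootstrap_se_of_mean`), the choice of the mean as the estimator, and the default of 200 replicates are all hypothetical. Instead of resampling the data with replacement, each observation receives an independent Poisson(1) count in every replicate, so the data only needs to be read once.

```python
import math
import random
import statistics


def poisson1(rng):
    """Draw one Poisson(1) variate using Knuth's multiplication method."""
    threshold = math.exp(-1.0)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def poisson_bootstrap_se_of_mean(stream, n_replicates=200, seed=0):
    """Estimate the standard error of the mean in one pass over `stream`.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    rng = random.Random(seed)
    weighted_sums = [0.0] * n_replicates  # per-replicate sum of w * x
    weight_totals = [0] * n_replicates    # per-replicate sum of w
    for x in stream:  # each observation is visited exactly once
        for b in range(n_replicates):
            w = poisson1(rng)  # how many times x appears in replicate b
            weighted_sums[b] += w * x
            weight_totals[b] += w
    # One weighted mean per replicate; skip a replicate with zero total weight.
    means = [s / t for s, t in zip(weighted_sums, weight_totals) if t > 0]
    return statistics.stdev(means)


if __name__ == "__main__":
    data_rng = random.Random(42)
    data = (data_rng.gauss(10.0, 2.0) for _ in range(1000))  # a one-shot stream
    print(poisson_bootstrap_se_of_mean(data, n_replicates=100))
```

Because each replicate is summarized by two running totals, totals computed on separate shards of the data can simply be added together before the final division, which is what makes this variant straightforward to run in parallel.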

Welcome to the unofficial Google data science blog

Despite Google’s technical achievements with big data, it may come as a surprise that there is no official Google blog for data science. True, Google Research puts out many academic papers and has a blog describing matters of interest to researchers. But what has been missing to date is a conversation about the nuts and bolts, the day-to-day of the large-scale analytical systems Google builds to serve its users. We’d like to change that. We are a group of individuals from several engineering teams at Google whose job is to design and build the analytics used in Google’s products and services. While most of us have PhDs in statistics, machine learning or a related field, ours is not a blog aimed at academia. We’ll provide academic references if necessary, but we mean for this to be a practitioners’ blog. At the same time, the problems we face are often complex enough to require highly technical solutions in statistics and computation. Thus many of our posts might not be …