An introduction to the Poisson bootstrap
by AMIR NAJMI
The bootstrap is a powerful resampling procedure which makes it easy to compute the distribution of any statistical estimator. However, doing the standard bootstrap on big data (i.e. data which won’t fit in the memory of a single computer) can be computationally prohibitive. In this post I describe a simple “statistical fix” to the standard bootstrap procedure which allows us to compute bootstrap estimates of standard error in a single pass or in parallel.
At Google, data scientists are very much in demand. Thus, anytime we can replace data scientist thinking with machine thinking, we consider it a win. Anticipating the ubiquity of cheap computing, Efron introduced the bootstrap back in 1979 [1]. What makes the bootstrap so attractive is that it doesn’t require any parametric assumptions about the data, or indeed any math at all, and can be applied generically to a wide variety of statistical estimators. As simple as the bootstrap procedure is, its theory is far from trivial and has attracted the attention of many mathematicians; for conditions under which the standard bootstrap procedure applies, see the definitive theory of Giné and Zinn [2]. But here we are focused on its practical application and assume the reader knows how to apply the basic bootstrap.
Suppose we are given a procedure for estimating $\theta$ from the IID data set \(x\), which we denote \(\hat{\theta}(x)\). For example, if we want to estimate the population mean, our estimator might be the sample mean
\[
\hat{\theta}(x) = \frac{1}{n} \sum_{i=1}^n x_i
\]
We’d like to estimate the distribution of our estimator without collecting additional samples. The bootstrap procedure does this by generating $B$ “simulated” data sets, each of size $n$, denoted \(x^*_1, \ldots, x^*_B\), and simply using the empirical distribution of \(\hat{\theta}(x^*_i)\) across these data sets as an estimate of the distribution of \(\hat{\theta}(X)\). It may take computation, but it’s easy to carry out. That the bootstrap blows away the problem by brute-force computation is the reason Tukey suggested that Efron name it the “shotgun” [1]. How big should $B$ be? It depends on what you want: $B=1000$ is typical for percentile intervals, while standard errors can usefully be estimated with as few as $B=20$.
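For reference, once the $B$ bootstrap estimates \(\hat{\theta}(x^*_1), \ldots, \hat{\theta}(x^*_B)\) are in hand, the usual bootstrap estimate of standard error (see e.g. [4]) is just their sample standard deviation:
\[
\widehat{se}_B = \sqrt{\frac{1}{B-1} \sum_{b=1}^B \left( \hat{\theta}(x^*_b) - \bar{\theta}^* \right)^2 }, \qquad \bar{\theta}^* = \frac{1}{B} \sum_{b=1}^B \hat{\theta}(x^*_b).
\]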
The standard bootstrap procedure creates each simulated data set (aka “resample”) by drawing observations from $x$ with replacement, i.e. from $n$ observations we draw \(n\) with replacement. In any given resample, each observation may occur 0, 1 or more times according to $Binomial(n, \frac{1}{n})$. And since the total number of observations is constrained to be \(n\), the counts are jointly $Multinomial(n, \frac{1}{n}, \cdots, \frac{1}{n})$.
As a tiny example, suppose we wish to estimate the population mean using the mean of a given random sample $\{ 2.3, 1.1, 2.7, 4.0 \}$. To construct the first resample (i.e. a simulated data set of size $n$) we select one of the four observations at random and do this four times, resulting in, say, the sample $\{ 4.0, 2.7, 2.3, 4.0 \}$. We repeat this whole process $B=3$ times to create resamples as follows:
{ 4.0, 2.7, 2.3, 4.0 } # Resample 0
{ 2.3, 2.3, 1.1, 2.3 } # Resample 1
{ 2.7, 4.0, 2.3, 4.0 } # Resample 2
We would simply apply our estimator (in this case the sample mean) to each of these $B$ simulated data sets and, according to bootstrap theory, obtain $B$ realizations as if from the distribution of the estimator.
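As an illustrative sketch (plain Python with numpy, seeded arbitrarily; the variable names are ours and nothing here is specific to any particular library), the whole procedure for the tiny example is just a few lines:
import numpy as np

rng = np.random.default_rng(0)
x = np.array([2.3, 1.1, 2.7, 4.0])   # the observed sample
n, B = len(x), 3

# Draw B resamples of size n with replacement and apply the estimator (here, the mean) to each.
boot_means = np.array([rng.choice(x, size=n, replace=True).mean() for _ in range(B)])

# The empirical distribution of boot_means estimates the distribution of the sample mean.
print(boot_means)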
For IID data, we can actually describe any resample by the number of occurrences of each observation in canonical order. For the resamples above, this would be
{ 0, 1, 1, 2 } # Counts for Resample 0
{ 1, 3, 0, 0 } # Counts for Resample 1
{ 0, 1, 1, 2 } # Counts for Resample 2
where the values in each tuple are the number of occurrences of the observations 1.1, 2.3, 2.7 and 4.0 respectively. These counts follow the multinomial distribution $Multinomial(4, \frac{1}{4}, \frac{1}{4}, \frac{1}{4}, \frac{1}{4})$. Thus the problem of bootstrap resampling reduces to generating these counts.
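Equivalently, one can generate the counts directly and apply the estimator as a count-weighted average; a small illustrative sketch in Python, with numpy’s multinomial generator standing in for the resampling:
import numpy as np

rng = np.random.default_rng(0)
x = np.array([1.1, 2.3, 2.7, 4.0])   # observations in canonical order
n, B = len(x), 3

# Each row of counts is one resample, jointly Multinomial(n, 1/n, ..., 1/n).
counts = rng.multinomial(n, [1.0 / n] * n, size=B)

# Since each row sums to n, the resampled mean is just a count-weighted mean.
boot_means = counts @ x / n
print(counts)
print(boot_means)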
The standard (multinomial) bootstrap is a lot better at estimating standard errors for small samples than parametric methods which make asymptotic assumptions. Even for larger samples where procedures like the Delta Method work well, the bootstrap is attractive because it replaces tedious derivatives with more computation.
But when it comes to big data, the multinomial bootstrap is computationally problematic. For parallel computation, we need to make independent decisions about how many times to include each observation. For instance, in the examples above, our observations $\{ 1.1, 2.3, 2.7, 4.0 \}$ might be spread over a number of different machines (in a big data world, we cannot assume the data fits on one machine). Thus, in Resample 0, we would want to compute each of the counts $\{ 0, 1, 1, 2 \}$ independently. But the sum of the counts in the multinomial distribution is fixed at $n$, so the counts are negatively correlated with one another. In many cases, we may not even know the total number of observations in advance.
Our approach is to go ahead with independent sampling and just deal with the consequences mathematically. It turns out that for estimators involving ratios of sums, there really aren’t any practical consequences when $n \geq 100$. For instance, we could draw each observation’s count independently from $Binomial(n, \frac{1}{n})$ and end up with a variable total number of observations. But noting that
\[
\lim_{n \to \infty} Binomial(n, \frac{1}{n}) = Poisson(1)
\]
we prefer to just use Poisson sampling. This way there is no need to know $n$, the total count of observations, ahead of time. When applied to our tiny example, we generate $B$ realizations from $Poisson(1)$ for each observation independently. This results in something like the following:
{ 0, 0, 0 } # For item value 1.1
{ 0, 1, 1 } # For item value 2.3
{ 2, 1, 0 } # For item value 2.7
{ 0, 2, 3 } # For item value 4.0
where there is a $B$-tuple corresponding to each observation. To make them look like the multinomial counts we showed earlier, we could transpose these counts as follows
{ 0, 0, 2, 0 } # Counts for Resample 0
{ 0, 1, 1, 2 } # Counts for Resample 1
{ 0, 1, 0, 3 } # Counts for Resample 2
but in reality, we use the first form to stream over the data while maintaining $B$ running aggregates, one for each resample. In this way, we look at each observation only once in the map phase of a MapReduce and never have to store raw data in memory.
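Here is a minimal sketch of that streaming computation in Python (with numpy); it stands in for the map phase only, and the variable names are ours rather than anything in the post:
import numpy as np

rng = np.random.default_rng(0)
B = 3   # in practice B might be 1000

# One running (sum, count) aggregate per resample; enough to recover the sample mean.
sums = np.zeros(B)
counts = np.zeros(B)

# Stream over the data once, as a mapper would; no raw data is retained.
for value in [1.1, 2.3, 2.7, 4.0]:
    w = rng.poisson(1.0, size=B)   # B independent Poisson(1) weights for this observation
    sums += w * value
    counts += w

# Resampled means; the guard matters only in the (for large n, very unlikely) event of an empty resample.
boot_means = sums / np.maximum(counts, 1)
print(boot_means)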
Notice that the first Poisson resample has a total count of only two. As long as our estimator accounts for sample size (as does the sample mean), this won’t be a practical problem when $n \geq 100$ because the count in each resample will likely be close to $n$ (or rather, within $n \pm 2 \sqrt{n}$).
A realistic example
Say we wish to estimate average CPC (cost per click paid by an advertiser for her ad) for all the ads by country from a day’s worth of traffic. Assume the raw data is organized as a massive CSV file such that each row is a single ad impression with columns for number of clicks and cost (in US dollars) on that impression:
# Raw data with header
query_id, impression_id, country_code, clicks, cost_usd
0x23be2341, 0x701a6301, JP, 2, 1.2201
0x57fcb234, 0x48af7063, AE, 1, 0.8234
...
The sample estimate is simply the sum of cost for all advertisers divided by the number of ad clicks on that day, and can be computed efficiently and in parallel within a MapReduce framework. Each row of aggregated data is the sum of clicks and cost of the raw data keyed by country:
# Aggregated data with header
country_code, clicks, cost_usd
AE, 1345, 2368.621
AR, 13894, 7197.968
...
To do the bootstrap, we introduce an additional key called resample_id. If we want 1000 resamples, this id ranges from 0 to 999. Each resample is an aggregate in which every row of raw data is included with a weight drawn independently from $Poisson(1)$:
# Bootstrapped aggregate data with header
resample_id, country_code, clicks, cost_usd
0, AE, 1345, 2368.621
1, AE, 1411, 2039.838
2, AE, 1359, 1916.304
...
0, AR, 13894, 7197.968
1, AR, 14076, 8858.110
2, AR, 14066, 12131.114
...
By convention, the 0th resample is the unperturbed sum, i.e. each observation appears exactly once. We can use this data to compute confidence intervals for the average CPC by country using any of the standard methods for bootstrap intervals (e.g. normal, t, quantile).
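To make this concrete, here is a rough single-machine sketch in Python of both steps: forming the bootstrapped aggregates from impression-level rows, then reading off a percentile interval for average CPC. Column names follow the tables above, but everything else (the variable names, the cpc_interval helper) is purely illustrative rather than Google’s actual pipeline:
import collections
import numpy as np

rng = np.random.default_rng(0)
B = 1000

# (resample_id, country_code) -> running [clicks, cost_usd] totals
agg = collections.defaultdict(lambda: np.zeros(2))

# Stand-in for the impression-level stream; in reality this runs sharded across mappers.
raw_rows = [
    ("0x23be2341", "0x701a6301", "JP", 2, 1.2201),
    ("0x57fcb234", "0x48af7063", "AE", 1, 0.8234),
]

for query_id, impression_id, country, clicks, cost in raw_rows:
    weights = rng.poisson(1.0, size=B)
    weights[0] = 1   # by convention, resample 0 is the unperturbed aggregate
    for b in np.nonzero(weights)[0]:
        agg[(b, country)] += weights[b] * np.array([clicks, cost])

# Percentile interval for average CPC in one country, from the per-resample aggregates.
def cpc_interval(country, alpha=0.05):
    cpcs = []
    for b in range(B):
        totals = agg.get((b, country))
        if totals is not None and totals[0] > 0:
            cpcs.append(totals[1] / totals[0])
    return np.quantile(cpcs, [alpha / 2, 1 - alpha / 2])
In a real MapReduce the weighted sums would be emitted in the map phase and combined per (resample_id, country_code) key in the reducers; only the small aggregated table ever needs to be held in memory.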
More sophisticated resampling
In the example above, we applied the resampling procedure assuming each row of the raw data to be an independent observation. In other words, we assumed that ad impressions were the exchangeable unit in the statistical sense. But there are reasons to doubt this assumption. For instance, we could easily imagine that ads on the same query have positively correlated CPCs (prices for ads on the query [structured settlements] are generally higher than those on the query [magic markers]). If this correlation is not negligible then a bootstrap procedure assuming exchangeable ad impressions will underestimate the variability of our CPC estimator.
Thus, we cannot always assume that each row of the raw data is exchangeable. For the sake of this discussion, let us assume that queries, rather than ad impressions, are the exchangeable unit. We could aggregate the data at the query level and then run a Poisson bootstrap procedure as before, but materializing query-level aggregates could be an expensive operation just to do this bootstrap. Instead, we can use a Poisson bootstrap in which we generate a pseudo-random \(Poisson(1)\) sequence of length $B$ seeded by a hash of the exchangeable unit, in this case the unique query_id. The effect is that all the data for a single query enters any given resample the same integer number of times, where that integer is drawn from \(Poisson(1)\).
Side note: hashing, i.e. reproducible, unit-based pseudo-randomization, is a fundamental building block of statistical procedures for big data. We will cover this in more detail in a future blog post.
Thus, going back to our tiny example, imagine that our four observations were not independent, but came in independent pairs. In other words, the raw data is
{ 1703, 1.1 }
{ 4306, 2.3 }
{ 4306, 2.7 }
{ 1703, 4.0 }
where the first element of the tuple is the id for the pair. We want to bootstrap on the pairs but don’t want to incur the cost of rekeying the data by pair id. Instead, we simply generate a pseudorandom sequence of $Poisson(1)$ using the pair id as seed. This results in
{ 1, 0, 2 } # For item value 1.1
{ 0, 1, 0 } # For item value 2.3
{ 0, 1, 0 } # For item value 2.7
{ 1, 0, 2 } # For item value 4.0
representing resample counts for the pairs 1703 and 4306 respectively as follows:
{ 1, 0 } # Pair counts for Resample 0
{ 0, 1 } # Pair counts for Resample 1
{ 2, 0 } # Pair counts for Resample 2
By using seeded pseudorandom generation, we were able to keep pieces of the exchangeable unit in sync without additional communication.
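A minimal sketch of how such hash-seeded generation might look in Python; the hashing scheme and the helper name poisson_weights are illustrative assumptions, not the actual implementation:
import hashlib
import numpy as np

def poisson_weights(unit_id, B=3):
    # Derive a reproducible seed from the id of the exchangeable unit (pair id, query_id, ...).
    seed = int.from_bytes(hashlib.sha256(str(unit_id).encode()).digest()[:8], "little")
    # The same unit id always yields the same length-B sequence of Poisson(1) weights.
    return np.random.default_rng(seed).poisson(1.0, size=B)

# Rows carrying pair id 1703 all get identical weights, as do rows carrying 4306,
# even if they are processed on different machines with no communication.
print(poisson_weights(1703))
print(poisson_weights(4306))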
The mathematical fine print
Thus far, we have justified the Poisson bootstrap on the intuitive basis that when $n$ is large, the anti-correlation between sample counts for different observations is small, and further, that $Binomial(n, \frac{1}{n})$ approaches $Poisson(1)$. But independent sampling also means that the number of observations in each bootstrap resample is a random $Poisson(n)$ quantity. There is a small chance that a resample will contain too few observations for the estimator (e.g. a sample variance with fewer than 2 observations, or even no observations at all). While this is not of practical concern when $n \geq 100$, the variable size of our resamples is clearly an additional source of variability, and so far we have merely asserted that it isn’t a big deal.
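To put rough numbers on that assertion: under independent $Poisson(1)$ sampling, the size of a resample, $N^* = \sum_i w_i$, is distributed $Poisson(n)$, so
\[
E[N^*] = n, \qquad \mathrm{sd}(N^*) = \sqrt{n}, \qquad P(N^* = 0) = e^{-n},
\]
and already for $n = 100$ the chance of an empty resample is about $10^{-44}$, while $N^*$ falls within $n \pm 2\sqrt{n}$ roughly 95% of the time. The per-observation count limit quoted above is just the classical Poisson limit of the binomial pmf:
\[
\binom{n}{k} \left(\frac{1}{n}\right)^k \left(1 - \frac{1}{n}\right)^{n-k} \;\longrightarrow\; \frac{e^{-1}}{k!} \quad \text{as } n \to \infty.
\]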
Like any bootstrap theory, the theory behind the Poisson bootstrap is a bit involved. For a proper treatment of the Poisson bootstrap and other streaming resampling methods we refer the reader to Chamandy et al. [3]. But the basic idea is that we formally define the Poisson bootstrap procedure to include rejecting and redoing any resample whose size is “too far” from $n$. The resultant resampling distribution is no longer strictly Poisson, but as $n$ grows, rejection becomes increasingly unlikely. It can then be shown that the asymptotic distribution of the estimator under this Poisson bootstrap is much closer to that of the standard bootstrap than either is to the true distribution of the estimator. Thus there really isn’t any price to pay in estimation for using independent Poisson sampling, and much to be gained computationally by way of parallelism.
Conclusion
The Poisson bootstrap is widely used within Google, having been built into the primitives of our analysis infrastructure. As we’ve shown, a clever implementation can be efficient even when the unit of resampling is not a single observation.
Statistical theory works at all scales, but statistical methods created for small data might not work as-is in a big data world. The Poisson bootstrap is an example of how a little flexibility in statistical theory can lead to methods which work efficiently with big data. One might say that Mathematical Statistics helped us scale!
References
[1] Efron, B. (1979). "Bootstrap methods: Another look at the jackknife". The Annals of Statistics 7 (1): 1–26.
[2] Giné, E. and Zinn, J. (1990). "Bootstrapping general empirical measures". The Annals of Probability 18: 851–869.
[3] Chamandy, N., Muralidharan, O., Najmi, A. and Naidu, S. (2012). "Estimating Uncertainty for Massive Data Streams". Google Tech Report.
[4] Efron, B. and Tibshirani, R. J. (1994). An Introduction to the Bootstrap. CRC Press.