Unintentional data


A large part of the data we data scientists are asked to analyze was not collected with any specific analysis in mind, or perhaps any particular purpose at all. This post describes the analytical issues which arise in such a setting, and what the data scientist can do about them.

A landscape of promise and peril

The data scientist working today lives in what Brad Efron has termed the "era of scientific mass production," of which he remarks, "But now the flood of data is accompanied by a deluge of questions, perhaps thousands of estimates or hypothesis tests that the statistician is charged with answering together; not at all what the classical masters had in mind." [1]

Statistics, as a discipline, was largely developed in a small data world. Data was expensive to gather, and therefore decisions to collect data were generally well-considered. Implicitly, there was a prior belief about some interesting causal mechanism or an underlying hypothesis motivating the collection of the data. As computing and storage have made data collection cheaper and easier, we now gather data without this underlying motivation. There is no longer always intentionality behind the act of data collection — data are not collected in response to a hypothesis about the world, but for the same reason George Mallory climbed Everest: because it’s there.

Fig. 1: The Himalaya mountain range

Much of the data available to us in big data settings is large not only in the number of observations, but also in the number of features. With more features come more potential post hoc hypotheses about what is driving metrics of interest, and more opportunity for exploratory analysis. Operating successfully as a data scientist in industry in such an environment is not only a matter of mathematics, but also one of intuition and discretion. Understanding the goals of the organization as well as guiding principles for extracting value from data are both critical for success in this environment.

Thankfully, not only have modern tools made data collection cheap and easy, they have made exploratory data analysis cheaper and easier as well. Yet when we use these tools to explore data and look for anomalies or interesting features, we are implicitly formulating and testing hypotheses after we have observed the outcomes. The ease with which we can now collect and explore data makes it very difficult to put into practice even the basic concepts of data analysis we have all learned — things such as:
  • Correlation does not imply causation.
  • When we segment our data into subpopulations by characteristics of interest, members are not randomly assigned (rather, they are chosen deliberately) and suffer from selection bias.
  • We must correct for multiple hypothesis tests.
  • We ought not dredge our data.
All of those principles are well known to statisticians, and have been so for many decades. What is newer is just how cheap it is to posit hypotheses. For better and for worse, technology has led to a democratization of data within organizations. More people than ever are using statistical analysis packages and dashboards, explicitly or more often implicitly, to develop and test hypotheses.

Although these difficulties are more pronounced when we deal with observational data, the proliferation of hypotheses and the lack of intentionality in data collection can affect even designed experiments. We data scientists now have access to tools that allow us to run a large number of experiments, and then to slice experimental populations by any combination of dimensions collected. This leads to a proliferation of post hoc hypotheses. That is, having observed that Variant A is better than Variant B, our tools induce us to slice the data by covariates to try to understand why, even if we had no a priori hypothesis or proposed causal mechanism.

Looking at metrics of interest computed over subpopulations of large data sets, then trying to make sense of those differences, is an often recommended practice (even on this very blog). And for good reason! Every data scientist surely has a story of identifying important issues by monitoring metrics on dashboards without having any particular hypothesis about what they are looking for. As data scientists working in big data environments, the question before us is how to explore efficiently and draw inference from data collected without clear intent — where there was no prior belief that the data would be relevant or bear on any particular question. Our challenges are no longer purely analytical in nature: questions of human psychology and organizational dynamics arise in addition to the mathematical challenges in statistical inference.

John Tukey writes in the introduction to The Future of Data Analysis that although he had once believed himself to be a statistician, he realized that his real interest was in data analysis — which includes “procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier… in addition to all the machinery and results of (mathematical) statistics…” Mathematics can inform us how to do something, but it alone cannot tell us what should be done. Although this post prescribes no formula for being an effective data scientist in the world of unintentional data, it does offer guidance in confronting both the organizational and operational issues the data scientist may encounter.

Avalanche of questions: the role of the data scientist amid unintentional data

The internet is awash in guides and tutorials for the aspiring data scientist that focus on various facets of statistics, mathematics, programming, or other methodological concerns. Yet it is just as important to have a handle on how to reason about the mountains of observational data that can overwhelm an organization.

When presented with a “finding” from the world of unintentional data, the data scientist must answer three broad questions:
  • Is it relevant to our goals? Is the effect we’ve discovered related to a topic that is of interest to the organization? Is the effect size large enough to be meaningful to the organization? In the world of big, unintentional data there are many discoveries to be had which have no bearing on the organization’s goals. Understanding what is relevant to the organization is critical to managing the deluge of questions that could be asked of all the data we now collect. 
  • Is it actionable? If the question is material to the organization, are there any decisions that can be made given what has been discovered? Understanding how the organization operates and what avenues are available to respond is critical in choosing how to investigate a pile of unintentional data. The data scientist in industry needs not only a way to attack the analysis problem, but also a way to attack the business problem on which their analysis may shed light.
  • Is it real? Is the effect being observed the result of some causal process as opposed to the kind of random variation in user or system behavior expected in steady state? This question is statistical or methodological in nature.
What can the data scientist do to answer these questions?

Know what matters. If hypothesis generation is cheap, the data scientist will soon be inundated with hypotheses from across the organization: theories to evaluate, urgent emails in the middle of the night to explain every wiggle on a time series dashboard. Only by knowing the broader goals of the organization, what data the organization has that can speak to those goals, and what analyses have been impactful in the past can the data scientist guide the organization to focus on meaningful questions.

Know your data. Without understanding how and why the data are generated and collected, it is impossible to have any reliable intuition about whether the result of an analysis makes sense, or whether a given hypothesis is any good. Be sure to have a deep, thorough understanding of how the data under consideration were collected and what they actually mean. Check any assumptions up front — nothing is worse than completing an analysis based on a faulty assumption about what a piece of data means.

Make experimentation cheap and understand the cost of bad decisions. The gold standard of evidence is the randomized controlled experiment. If we are to succeed in a world where hypothesis generation is cheap, we must develop or acquire infrastructure and processes to ensure that it is also cheap to test them. It is part of the data scientist's role to advocate for rapid experimentation and to educate those who use it. Still, if there are too many hypotheses to test with experiments, we should only be testing those that lead to consequential decisions. We need to know whether the decision we are optimizing is significant enough to justify the time and experiment resources spent optimizing it.

Be skeptical and intellectually honest. When an appealing conclusion presents itself, be skeptical and thorough in considering any issues with the data that may lead to incorrect conclusions. In addition to the issues above, does the conclusion pass the smell test? Are the data presented consistent with other data that you have seen?

Because the issues with observational data are subtle and easily missed or ignored, the temptation to make up just-so stories to explain the observed world is sometimes overwhelming. The engineer who carelessly fails to write tests and breaks the build is caught immediately, but the data scientist who carelessly fails to conduct a thorough analysis and comes to false conclusions may not be discovered quickly, and may by then have caused significant damage to the organization and to their own reputation.

Democratization of analysis: quantity has a quality all its own

Just as dealing with unintentional data shapes the role of the data scientist in their organization, it also shapes the day-to-day practice of data analysis. We now describe some of the chief issues encountered in the analysis (and especially exploratory analysis) of unintentional data.

Selection bias and survivorship bias

When we slice a population by some factor or combination of factors and compute a metric over the new sub-populations, we are implicitly specifying a model (the metric is a function of the slicing factors) and testing a hypothesis. But this test is subject to "selection bias", which occurs when we analyze a group that was not randomly selected. Selection bias played a notable role in the discussion of the avian influenza outbreak of 2011, during which the reported case fatality rate was as high as 80% [2]. The World Health Organization criteria for defining a "confirmed case" of avian influenza were very strict, meaning that only very sick individuals were counted. Because of this, the estimated case fatality rate was quite high: individuals who were less sick and more likely to recover were never counted as cases at all. Nevertheless, these estimates caused considerable fear about the ramifications of the outbreak.
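A toy simulation makes the mechanism concrete. The severity model, fatality rule, and "confirmed case" cutoff below are all invented for illustration; the point is only that conditioning on severity inflates the observed rate.

```python
# Simulate infections where fatality risk rises with severity, then
# apply a strict case definition that only counts the sickest patients.
import random

random.seed(0)

n = 100_000
severity = [random.random() for _ in range(n)]           # 0 = mild, 1 = severe
died = [random.random() < 0.5 * s for s in severity]     # risk rises with severity

true_cfr = sum(died) / n                                 # about 25% over all cases

# Strict case definition: only the sickest patients are "confirmed".
confirmed_deaths = [d for s, d in zip(severity, died) if s > 0.9]
observed_cfr = sum(confirmed_deaths) / len(confirmed_deaths)

print(f"true case fatality rate:       {true_cfr:.2f}")
print(f"rate among 'confirmed' cases:  {observed_cfr:.2f}")
```

The "confirmed" rate is nearly double the true rate, purely because of how cases were selected into the data set.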

A related issue can occur when looking at time series data. Suppose a data scientist works at The Hill Climber, a climbing shop in the Himalayas. She wants to evaluate the performance of the Ice Axe 2000. To do this, she evaluates the condition of Ice Axes brought in for sharpening. She finds these axes to be in very good shape after controlling for age, and concludes that the Ice Axe 2000 can be recommended on the basis of its durability. She is dismayed next season at a wave of customer complaints over axes that cracked on their first use. The cause is survivorship bias, which has happened because only the most durable Ice Axes survive into old age and return to the shop for maintenance. In general, survivorship bias happens in longitudinal studies when we cannot track all members of the group through the entire period of interest.

More generally, when we look at slices of data longitudinally, the individuals comprising those groups may vary over time and the distribution of characteristics of those individuals may also change. Formally, we may say that the joint distribution of the slicing variable and other variables that are correlated with the metric of interest is non-stationary.

Multiple hypothesis testing and invalid inferences

In an effort to improve sales, suppose our data scientist slices purchase rates at the Himalayan climbing store by all manner of data available about the customers: what other mountains they have climbed, how large their climbing teams are, whether they have already made a purchase at the store, and so on, before discovering that climbers from several countries look especially interesting because of their unusually high or low conversion rates. Assuming customers from 15 different countries are represented in the data set, how large is the family of hypotheses under test? Correspondingly, how much should the p-values of each individual test be corrected? The natural answer is 15, but what about all of those other hypotheses that were rejected during the hypothesis generation phase? What about all of those hypotheses that will be tested at the end of the next quarter? What is the likelihood of a given rank ordering for arbitrary comparisons between countries?
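A quick simulation shows the scale of the problem: even when every country has exactly the same true conversion rate, slicing by 15 countries routinely produces at least one "significant" country. The rate (10%), sample sizes, and country count below are invented.

```python
# How often does slicing by country flag a spurious "discovery"
# when there is no real difference at all?
import random
from statistics import NormalDist

random.seed(2)
norm = NormalDist()

def prop_test_pvalue(successes, n, p0):
    """Two-sided one-sample z-test of a proportion against p0."""
    p_hat = successes / n
    se = (p0 * (1 - p0) / n) ** 0.5
    z = (p_hat - p0) / se
    return 2 * (1 - norm.cdf(abs(z)))

p0, n_per_country, n_countries, n_runs = 0.10, 500, 15, 200
false_alarm_runs = 0
for _ in range(n_runs):
    pvals = []
    for _ in range(n_countries):
        conversions = sum(random.random() < p0 for _ in range(n_per_country))
        pvals.append(prop_test_pvalue(conversions, n_per_country, p0))
    if min(pvals) < 0.05:        # "at least one significant country!"
        false_alarm_runs += 1

rate = false_alarm_runs / n_runs
# With 15 independent tests at alpha = 0.05 we expect roughly
# 1 - 0.95**15, i.e. about half of runs, to flag something.
print(f"runs with a spuriously 'significant' country: {rate:.0%}")
```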

As Andrew Gelman and Eric Loken point out in their essay "The garden of forking paths", invalid inferences can occur when the opportunity to test multiple hypotheses merely exists, even if multiple tests are not actually carried out. Exercising these “researcher degrees of freedom” by choosing not to carry out a hypothesis test upon seeing the data may not feel like a fishing expedition, but will lead to invalid inferences all the same. Of course, exploratory analysis of big unintentional data puts us squarely at risk for these types of mistakes.

Regression to the mean

If enough slices are examined, there will certainly be some slices with extreme values on metrics of interest. But this does not mean that the slice will continue to exhibit an extreme value on this measurement in the future. This is closely related to the issues with multiple hypothesis testing — given enough observations, we expect to find some extreme values.
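This is easy to demonstrate by simulation: the slice that looks best in one period is usually close to average in the next. The values below are pure noise around a common mean of 50, so there is no real difference between slices at all.

```python
# The winning slice's period-2 value regresses to the common mean.
import random

random.seed(3)

def best_slice_then_and_later(n_slices=100, mean=50, sd=10):
    period1 = [mean + random.gauss(0, sd) for _ in range(n_slices)]
    period2 = [mean + random.gauss(0, sd) for _ in range(n_slices)]
    best = max(range(n_slices), key=lambda i: period1[i])
    return period1[best], period2[best]

rounds = [best_slice_then_and_later() for _ in range(200)]
avg_winner_p1 = sum(p1 for p1, _ in rounds) / len(rounds)
avg_winner_p2 = sum(p2 for _, p2 in rounds) / len(rounds)

print(f"winning slice, period 1 (avg): {avg_winner_p1:.1f}")  # well above 50
print(f"same slice, period 2 (avg):    {avg_winner_p2:.1f}")  # back near 50
```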

What is to be done?

Natural experiments, counterfactuals, synthetic controls.

Back at The Hill Climber, our data scientist wants to understand the impact of a more stringent permitting process on the number of climbers attempting Everest, in order to better understand potential future sales.

The simplest approach is to compare data from the month preceding to the month following the change. Unfortunately this approach does nothing to control for trends or seasonality in the number of climbers nor typical year-to-year variation, and attributes all differences to the intervention (the new permitting process).

A slightly more sophisticated approach is to compare the year-over-year change in Everest climbing data with the change in the number of K2 climbers over the same period. This may give the data scientist confidence that the change is attributable to the policy, rather than some global trend in interest in mountain climbing. This is the differences-in-differences approach, where we estimate the impact of an intervention by comparing the change in a metric before and after the intervention in a group receiving that intervention with the change in a group that did not receive it. This requires that both groups satisfy the parallel trends assumption, which states that the groups must have similar additive trends prior to the intervention. The parallel trends assumption is most plausible when we have a natural experiment — that is, when we believe the intervention (in this case, the permitting process) happened essentially at random to some subjects (Everest) and not others (K2).
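The arithmetic of differences-in-differences is simple enough to show directly. The climber counts below are made up: Everest receives the new permitting process, K2 does not, and K2's change stands in for the shared trend.

```python
# Differences-in-differences with invented climber counts.
everest_before, everest_after = 800, 600
k2_before, k2_after = 400, 380

everest_change = everest_after - everest_before   # -200
k2_change = k2_after - k2_before                  # -20, the shared trend

# Subtracting the control group's change removes the common trend.
did_estimate = everest_change - k2_change
print(f"estimated effect of the permitting process: {did_estimate} climbers")
```

Under the parallel trends assumption, the estimate of -180 climbers is attributable to the permitting process rather than to a Himalayas-wide decline.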

Alternatively, we can sometimes estimate what the climbing data for Everest would have been under the counterfactual scenario in which no intervention occurred. To do this, our data scientist may combine climbing time series from several other mountains into a synthetic control, choosing the weights of the combination so that the characteristics of the synthetic control match those of Everest before the intervention.
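A bare-bones sketch of the idea: pick weights on donor mountains so the weighted series tracks Everest before the intervention, then read off the weighted series afterwards as the counterfactual. All series are invented, and the single-weight grid search is only an illustration of the weighting idea, not how a real synthetic-control or CausalImpact analysis is fit.

```python
# Fit a convex weight between two donor series on the pre-period,
# then use the weighted series as the post-period counterfactual.
everest_pre = [100, 110, 120, 130]   # four seasons before the change
everest_post = 90                    # first season after the change
k2 = [60, 66, 72, 78, 84]            # donor series (5th value is post)
annapurna = [140, 154, 168, 182, 196]

def pre_period_error(w):
    """Squared error of the synthetic series vs. Everest, pre-intervention."""
    return sum((w * a + (1 - w) * b - e) ** 2
               for a, b, e in zip(k2, annapurna, everest_pre))

best_w = min((i / 100 for i in range(101)), key=pre_period_error)
synthetic_post = best_w * k2[4] + (1 - best_w) * annapurna[4]
effect = everest_post - synthetic_post

print(f"weight on K2: {best_w:.2f}, counterfactual: {synthetic_post:.0f}")
print(f"estimated effect of the intervention: {effect:.0f} climbers")
```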

The process of constructing synthetic controls is made easy by the CausalImpact package in R, which uses Bayesian structural time series to forecast what the treated series would have looked like in the absence of the intervention, based on control series that did not receive it. In this way, it can be thought of as a more rigorous differences-in-differences approach.

Develop a retrospective (better yet, prospective) cohort study, or a case-control study

Tracking an identified population of individuals longitudinally can help avoid bias caused by changes in the composition of the group, and careful selection of those users can help avoid selection bias inherent in observational data. Although there are subtle differences in the design and interpretation of case-control and prospective or retrospective cohort studies, the general concept is the same. That is, by selecting groups of substantially similar users and tracking their outcomes, we can have greater confidence in our causal conclusion, even if we haven’t perfectly controlled for bias via proper randomization.

Fig 2: Features (smile color, hat wearing) are distributed differently in these two groups which could confound analysis. We can correct for this with techniques such as propensity matching, or observational experimental designs.
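The mismatch sketched in Fig. 2 can be reduced by matching each treated user to a similar control user. Below is a toy nearest-neighbor match on a single score, in the spirit of propensity matching; the users and scores are invented, and real propensity scores would come from a model of treatment assignment.

```python
# Greedy nearest-neighbor matching on a single (hypothetical) score.
treated = {"alice": 0.80, "bob": 0.30}          # user -> propensity score
control = {"carol": 0.75, "dave": 0.31, "eve": 0.50}

matches = {}
available = dict(control)
for user in sorted(treated):
    score = treated[user]
    # Take the closest remaining control user, without replacement.
    match = min(available, key=lambda c: abs(available[c] - score))
    matches[user] = match
    del available[match]

print(matches)  # {'alice': 'carol', 'bob': 'dave'}
```

Comparing outcomes across matched pairs, rather than across the raw groups, removes much of the imbalance in the matched-on features.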

Multiple Hypothesis Testing

The standard advice in multiple hypothesis testing settings (such as using a Bonferroni correction) typically involves using some method to correct (that is, inflate) p-values to account for the additional hypothesis tests that were performed. But these approaches are typically focused on controlling the family-wise Type I error rate. Such procedures ensure that, for a set of hypotheses being tested, the probability of getting any statistically significant result under the null is no greater than some α.
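The Bonferroni correction itself is a one-liner: with m tests, compare each p-value to α/m (equivalently, multiply each p-value by m). The p-values below are invented for illustration.

```python
# Bonferroni: reject only p-values below alpha / m.
pvals = [0.001, 0.004, 0.019, 0.030, 0.200]
alpha = 0.05
m = len(pvals)

rejected = [p for p in pvals if p < alpha / m]  # threshold = 0.01
print(rejected)  # [0.001, 0.004]
```

Note how 0.019, comfortably "significant" on its own, no longer survives once the family of five tests is accounted for.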

In the unintentional data setting it is nearly impossible to define the ‘family’ for which we are controlling the family-wise error rate. How do we know when two studies are sufficiently similar that they require correction for multiple testing? And given all the exploratory data analysis that has gone before, it may be hard to say how many hypotheses have been implicitly tested — how far down Gelman’s forking path have we already walked before we started the mathematically formal portion of the analysis?

Controlling the Type I error necessarily comes at the expense of increasing the risk of a Type II error. In some settings this is prudent, but without understanding the loss function for a specific context, it is not clear that falsely rejecting the null is more costly than failing to reject it when it is indeed false. (Again, understanding the organization and the decisions to be made is critical to producing a useful analysis.)

Consider your loss function.
How expensive is it to confirm experimentally that the null hypothesis can be rejected, and how expensive would it be if the path that rejection leads us down turned out to be wrong? By contrast, how valuable are the opportunities that will be passed up if we fail to reject a truly false null?

Control the false discovery rate.
When we have many hypotheses to evaluate, it may be more important that we identify a subset that are mostly true, rather than insisting on high certainty that each hypothesis is true. This is the essence of false discovery rate procedures. Controlling the false discovery rate means that when we identify a set of "discoveries" (rejected nulls), no more than a specified fraction of them will be false.
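The classic procedure here is Benjamini-Hochberg: sort the p-values, find the largest k with p_(k) ≤ (k/m)·q, and reject the k smallest nulls. The example p-values below are invented.

```python
# The Benjamini-Hochberg step-up procedure for FDR control.
def benjamini_hochberg(pvals, q=0.05):
    """Return the indices of hypotheses rejected at false discovery rate q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * q:   # compare p_(k) to (k/m) * q
            k_max = rank
    return {order[r] for r in range(k_max)}

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.200, 0.350, 0.440, 0.620]
print(sorted(benjamini_hochberg(pvals, q=0.05)))  # [0, 1]
```

A Bonferroni threshold of 0.05/10 = 0.005 would reject only the first p-value here; controlling the false discovery rate instead admits the second discovery as well, trading a little certainty about each individual rejection for more discoveries overall.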

Regression to the mean

Start with a hypothesis that follows from a causal mechanism. Regression to the mean occurs when we find a subpopulation that has an unusually high or low value for some metric of interest due to pure chance. We can avoid being fooled into bad assumptions about causality if we begin by exploring slices for which we have a credible causal mechanism that could be driving the differences between groups.

A Bradford Hill to die on

Fig 3: Pictured: Goats on a hill. (Not Bradford.)

Sir Austin Bradford Hill was an English epidemiologist and statistician credited both with running the first randomized clinical trial and with landmark research into the association between lung cancer and smoking (beginning with a case-control study conducted with Richard Doll and published in 1950). However, he is perhaps best remembered today for the “Bradford Hill criteria” [3], nine “aids to thought” [4] published in 1965 that can help an epidemiologist determine whether some factor is a cause (not the cause) for a disease. The modern data scientist and the epidemiologist have plenty in common, as they are both often tasked with explaining the world around them without the benefit of experimental evidence. We data scientists could do much worse than to learn from Bradford Hill.

First, one of the defining pieces of work in Hill’s storied career began with a case-control study. We have already emphasized cohort and case-control studies as a technique for facilitating inferences from unintentional data. The conceptual motivation behind it is that although we only have observational data, if we take care in what we observe and compare we may still be able to make causal claims. By keeping in mind the need to control for bias in observational data and the strategies employed in the design of observational trials to do so, we can check our own thinking during an exploratory analysis of observational data.

Next, consider the Bradford Hill criteria used to evaluate whether observational data may serve as evidence of causality. In the world of the data scientist, this amounts to understanding whether variation in the slices of data that we are looking at is caused by the factor we used to create the slices, or merely correlated with it.

Of the nine points Hill gives, four are especially relevant to unintentional data:

Strength of effect. A larger association is more likely causal. This principle is particularly important for data exploration in the world of unintentional data, where we may have no proposed causal mechanism and consequently should hold a strong prior belief that there is no effect. The standard of evidence required to overcome this prior is high indeed. Any attempt at causal inference assumes that we have properly accounted for confounders, which can never be done perfectly. Small effects could be explained by flaws in this assumption, but large effects are much harder to explain away.

Consistency. Can we observe the effect in different places, under different circumstances or at different times? We have previously discussed the risk of confounding factors obscuring important real effects. By making sure that an effect is consistent across various sub-populations and time we can increase our confidence that we have found a real causal relationship.

Biological gradient (dose response). Subjects in slices that see greater exposure to the proposed causal mechanism should (probably) exhibit a greater response. If we posit that discounts on climbing gear lead to more purchases, our data scientist should expect that climbers with coupons for 50% off will purchase more gear than those with coupons for 15% off.

Temporality. The effect must follow the cause. As Richard Doll notes, this is the only one of the nine that is a sine qua non for identifying causality. Are we sure those customers with 50% off coupons hadn’t started purchasing more climbing gear before our data scientist mailed the coupons?

The nine guideposts of causality are the best-remembered part of Hill’s 1965 paper, but equally important is its final section, in which Hill lays out ”The Case For Action” [5]. He argues that determinations of causality should not be made independently of the decisions that will be based on those determinations. Although there is no purely scientific justification for this stance, Hill does not forget that as an epidemiologist his goal is to take action, and that the quality of evidence we require before acting must be judged in the context of the potential costs and harms of the decision to be made. In this regard, Hill’s situation was remarkably similar to that of the modern data scientist working in industry.


As we try to summit the challenges of living in a world with a mountain of observational data and a blizzard of hypotheses, the concepts above can help us turn out more like Edmund Hillary than George Mallory. When data are collected without intentionality, we must approach exploratory analysis with a skeptical eye. This state of affairs may not correspond to the world as it existed in the early days of statistics, but as Bradford Hill showed, there is a path forward if we maintain focus on the nature of our data and the purpose of our analysis. Modern technologies that facilitate rapid experimentation and easy exploratory analysis offer tremendous opportunity for the data scientist who can focus on the key concerns of their organization and find meaningful information in the data.


[1] Efron, B. (2010). Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction. New York: Cambridge University Press.

[2] Palese, P., & Wang, T. T. (2012). H5N1 influenza viruses: facts, not fear. Proceedings of the National Academy of Sciences, 109, 2211–2213.

[3] Hill, A. B. (1965). The Environment and Disease: Association or Causation? Proceedings of the Royal Society of Medicine, 58(5), 295–300.

[4] Doll, R. (2002). Proof of Causality: Deduction from Epidemiological Observation. Perspectives in Biology and Medicine, 45, 499–515.

[5] Phillips, C. V., & Goodman, K. J. (2004). The missed lessons of Sir Austin Bradford Hill. Epidemiologic Perspectives & Innovations, 1, 3.