# Uncertainties: Statistical, Representational, Interventional

by AMIR NAJMI & MUKUND SUNDARARAJAN

Data science is about decision making under uncertainty. Some of that uncertainty is the result of statistical inference, i.e., using a finite sample of observations for estimation. But there are other kinds of uncertainty, at least as important, that are not statistical in nature. This blog post introduces the notions of representational uncertainty and interventional uncertainty to paint a fuller picture of what the practicing data scientist is up against.

## Data science and uncertainty

Data Science (DS) deals with data-driven decision making under uncertainty. The decisions themselves may range from "how much data center capacity should we build for two years hence?" or "does this product change benefit users?" to the very granular "what content should we recommend to this user at this moment?"

This kind of decision making must address particular kinds of uncertainty. Wrestling with uncertainty characterizes the unique experience of doing DS. It explains why answering an innocuous question ("Which content verticals have low quality?") takes longer than what the asking executive thinks it should take ("... and can we have the numbers by next Monday?").

In this post, we discuss three types of uncertainty:
1. Statistical uncertainty: the gap between the estimand, the unobserved property of the population we wish to measure, and an estimate of it from observed data.
2. Representational uncertainty: the gap between the desired meaning of some measure and its actual meaning.
3. Interventional uncertainty: the gap between the true benefit of an intervention arising from a decision, and an evaluation of that benefit.

Among these, only statistical uncertainty has formal recognition. Thus, there is a tendency for data scientists and executives to fixate on it at the expense of the other two. We introduce these other uncertainties to express aspects of the job that go beyond the abstractions of mathematics or the certainties of computation. They involve the judgment to interpret data analysis in the context of the real world. Not being mathematizable, these uncertainties are harder to articulate. And yet it is only in connecting data to meaning and to purpose that we create value. Without new vocabulary, we cannot talk about this crucial element of DS, teach it, reward it, or hire those who aspire to it.

This blog post does not discuss objectives. What is the purpose of the organization? What information is meaningful towards that purpose? We assume that answers to these questions are given. Nevertheless, it is our hope that wider use of the terms "representational uncertainty" and "interventional uncertainty" will help you better understand, exercise and express how DS creates value, and recognize when a problem is not a DS problem.

## Vignette: Data Science at fluff.ai

The purpose of this post is to dissect and examine uncertainty in order to address it. But it might help the reader see first how the different types of uncertainty we just named may play out in everyday experience. To that end, we present a vignette that is lighthearted but sufficiently realistic as to allow for the requisite reasoning. Thus, we imagine the activities of data scientists at fluff.ai (pronounced "fluff-ay"). The product's mission is to deliver 20 second videos of the cuddliest critters of all creation™. When users watch a video, they can express their approval by clicking the Like button. There is no "Dislike" button, but users may abandon videos without watching all the way (a harsh judgment on those little tykes). Finally, videos are categorized for easy browsing, such as "Animal Apparel" (animals attired), "Bon Appetit" (animals at mealtime) and "Sweet Dreams" (animals asleep). Recently, the product added the hot new "Just Arrived" category for videos of newborn animals.

 Figure 1: A video from fluff.ai

Uncertainty makes data science different from the mere reporting of facts. For instance, the Like Rate of a video is the probability that a viewer will click the Like button for that video. If three users saw a video and all three liked it, then it is valid to report an observed Like Rate of 100%. But any student of statistical uncertainty would consider a Like Rate of 100% to be a poor estimate of the true Like Rate.

### A measure of video quality

Early on in the life of the product, the data scientists at fluff.ai are asked to assess the content quality of their videos. Even the least-experienced among them can intuit that, properly estimated, Like Rate might be a proxy for quality. But they also know that the (true) Like Rate for a video does not equate to the quality of that video. This gap between the quality of a video and its measurement, Like Rate, is representational uncertainty. In the past, the data scientists had debated this very issue. They went through the many reasons why the Like Rate for a video may be an incomplete measure of the video's quality. For instance, users who proactively express their appreciation of videos may not be representative of all users. Or perhaps cognitive dissonance prevents users from "liking" a high quality video if its content is sad.

The DS team needed an explicit definition of video quality. They agreed to base their definition on the evaluation of trained raters. Since Content Quality cannot be defined by formula, raters were given carefully chosen guidelines designed to evoke their individual judgment. For each video, raters were asked to rate how much they agreed with the statement "This video made me feel warm and fuzzy all over" on a five-point Likert scale. As per the guidelines, this question is meant to capture both the cuteness of the video's subject and its production quality. Instead of trying to convert the categorical rating into a score, they defined a video's Content Quality as the probability that a trained rater would respond "Agree" or "Strongly Agree" to the rating question. They collected ratings on a broad set of videos and set these up as their gold standard.
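To make the definition concrete, here is a minimal sketch of how the observed share of favorable ratings for one video might be computed. The function name, the exact label strings, and the middle "Neutral" label are our assumptions for illustration; the post only specifies a five-point scale with "Agree" and "Strongly Agree" counted as favorable.

```python
# Five-point Likert scale for the rating question.
# The exact label strings (including "Neutral") are assumed for illustration.
LIKERT = ("Strongly Disagree", "Disagree", "Neutral", "Agree", "Strongly Agree")
FAVORABLE = {"Agree", "Strongly Agree"}


def content_quality(ratings):
    """Fraction of ratings that are 'Agree' or 'Strongly Agree'.

    This is the observed share for one video. It estimates the probability
    that a trained rater responds favorably, and is itself subject to
    statistical uncertainty when there are few ratings.
    """
    if not ratings:
        raise ValueError("need at least one rating")
    unknown = set(ratings) - set(LIKERT)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    favorable = sum(1 for r in ratings if r in FAVORABLE)
    return favorable / len(ratings)


# Hypothetical ratings for one video, made up for illustration.
example_ratings = ["Agree", "Strongly Agree", "Neutral", "Agree", "Disagree"]
example_quality = content_quality(example_ratings)  # 3 favorable out of 5
```

Note that this collapses the five categories into two on purpose, mirroring the team's choice not to convert the categorical rating into a score.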

Using human raters is a valid solution to the problem, but there may still be a gap between the full meaning of Content Quality and how each rater interprets the rating question. Moreover, an individual rater's interpretation may drift over time. But given the simplicity of the rating task and strong inter-rater reliability, the DS team felt they could assume the raters had the right interpretation, more or less. Perhaps they would check for rater drift at some future date.

Next, they determined the relationship between Like Rate and Content Quality through extensive analysis of different videos in varying contexts. They knew this effort would be worthwhile because Content Quality was likely to be a factor in many important decisions. The legitimacy of reading the estimated Like Rate as Content Quality (the desired meaning) was the result of debate and consensus. Most agreed that the Like Rate of a video on the site was in fact a reasonable proxy for this definition of video content quality. Their chain of reasoning can be written out as follows:
1. We wish to measure Content Quality.
2. We start proxying this measurement via a rating task.
3. We satisfy ourselves that each rater follows the guidelines and uses her judgment appropriately.
4. We correlate rater judgments with Like Rate, allowing us to use Like Rate at scale.

One data scientist made a final argument against the need for Like Rate: one could instead use the complement of Abandonment Rate as a measure of quality. Abandonment Rate is the number of users who abandon a video divided by the number of users who started to watch the video. And indeed, there is significant (negative) correlation between estimated Like Rate and estimated Abandonment Rate over the population of videos. However, the correlation is weak when restricted to the subset of videos that human raters consider high quality.

 Figure 2: The relationship between Abandonment Rate and Like Rate

It seems Abandonment Rate is sensitive to low quality, while Like Rate is sensitive to high quality. With these caveats in mind, Like Rate became well-established as a measure of Content Quality. Indeed, they had been using Like Rate in fluff.ai's recommendation system so that it only recommended videos having an estimated Like Rate of at least $x$%.
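The comparison between the two measures can be sketched in a few lines. This is only an illustration of the kind of sliced check described above, not the team's actual analysis. The record fields (`like_rate`, `abandoned`, `started`, `rated_high_quality`) and the function names are hypothetical, and we use the plain Pearson correlation as one reasonable choice.

```python
from math import sqrt


def abandonment_rate(abandoned, started):
    """Users who abandoned a video divided by users who started watching it."""
    if started <= 0:
        raise ValueError("started must be positive")
    return abandoned / started


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def like_vs_abandonment(videos, keep=lambda v: True):
    """Correlation between Like Rate and Abandonment Rate over a slice.

    videos: list of dicts with hypothetical keys 'like_rate', 'abandoned',
    'started'. keep: predicate selecting the slice of videos to include.
    """
    kept = [v for v in videos if keep(v)]
    like_rates = [v["like_rate"] for v in kept]
    abandon_rates = [abandonment_rate(v["abandoned"], v["started"]) for v in kept]
    return pearson(like_rates, abandon_rates)
```

Calling `like_vs_abandonment(videos)` gives the correlation over all videos, while `like_vs_abandonment(videos, keep=lambda v: v["rated_high_quality"])` restricts it to the videos that raters consider high quality. Comparing the two numbers is the check the vignette describes.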

### Contemplating a change to the product

Then one day, the Chief Product Officer tells the DS team that she has an intuition that market reception might benefit from drastically fewer video recommendations (not your typical CPO). She asks the DS team for help to make the decision. After some discussion and initial analysis, they determine that this calls for raising the threshold on Like Rate to $y$%. Their job is now to determine the impact of launching this change. They know they cannot do this perfectly because there will always be something approximate, something speculative about the answer. This gap between the 'true' impact of the intervention and its estimated impact is interventional uncertainty.

There are several ways to measure interventional uncertainty but data scientists don't always consider their choices carefully. Tunnel vision may stem from inertia ("We only believe in live experiments"), tool preferences ("I am deeply in love with machine learning"), indolence ("Running a survey is a lot of work. Let's just use what's in the logs."), impatience ("We should have already made this decision last quarter."), least resistance ("Easier to do the analysis than explain why it is irrelevant.") or just plain lack of imagination ("                   ").

Fortunately, the data scientists at fluff.ai are a diligent lot, critical thinkers even. It turns out, they have some mathematical models for how this might play out, so they could do this in silico. But it is easy enough to run a live experiment for two weeks, so they do that instead. They see an increase in recommendation Uptake Rate (defined as the fraction of recommendations that are accepted; as many as four videos may be recommended at a time), along with an immediate drop in engagement (defined as the number of videos watched per person per day). The increase in Uptake Rate makes sense because they were now recommending fewer videos with higher Like Rate. The drop in engagement also makes sense because there was no relevant video to recommend in many situations.

Now comes the hard part — interpreting what the results of a brief experiment mean for a decision focused on the long term. They have a model based on past experiments which suggests that an improvement in recommendation Uptake Rate leads to greater engagement in the long term. This is caused by increased user confidence in recommendations, but manifests over a period of several months. According to the model, a fractional change of $\epsilon$ in recommendation Uptake Rate translates into a long-term fractional change in engagement of $\lambda \epsilon$. This is great.

But wait. The model is based on data from experiments that change the Uptake Rate by small amounts. The change in Uptake Rate of this experiment is larger than any they have tried before. The model may not hold well for such extrapolation.

However, suppose they were to claim that the product change should not be made because the short-term loss in engagement cannot be made up by the long-term gain. Perhaps they have reason to believe that the effect (i.e., change in long-term engagement due to change in Uptake Rate) is sublinear. In this case the model can be interpreted as an upper bound on long-term engagement. This might be good enough to justify the decision. They take the inference back to the CPO that the long-term gains in engagement don't offset the short-term hit to engagement.
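The arithmetic behind this argument is simple enough to sketch. Assume, as the model does, that a fractional change $\epsilon$ in Uptake Rate yields at most $\lambda \epsilon$ of long-term fractional change in engagement when the true effect is sublinear and $\epsilon$ is positive. The function names and the numbers below are hypothetical, chosen only to show the shape of the reasoning.

```python
def projected_net_change(short_term_change, uptake_change, lam):
    """Short-term fractional change in engagement plus lam * uptake_change.

    If the true long-term effect is sublinear and uptake_change is positive,
    lam * uptake_change overstates the long-term gain, so the returned value
    is an upper bound on the net change.
    """
    return short_term_change + lam * uptake_change


def gain_cannot_offset_loss(short_term_change, uptake_change, lam):
    """True if even the optimistic (linear) projection is still negative."""
    return projected_net_change(short_term_change, uptake_change, lam) < 0


# Hypothetical experiment readout: engagement down 5%, Uptake Rate up 10%,
# and a made-up model coefficient lam = 0.3.
verdict = gain_cannot_offset_loss(short_term_change=-0.05,
                                  uptake_change=0.10,
                                  lam=0.3)  # True: the bound stays below zero
```

The point of the upper-bound reading is asymmetry: a negative bound supports rejecting the change, but a positive bound would not by itself support launching it.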

Moreover, they believe that the quality of recommended videos went down for the Just Arrived category. They explain to her that Uptake Rate in that category has dropped because the system now recommends newborn pandas more broadly, even though only panda enthusiasts (a.k.a. "the Pandanista") find them cute. These videos are almost exclusively watched by the Pandanista, who always "like" them. Thus, newborn pandas dominate the high Like Rate subset of this relatively small video category.

The CPO accepts the analysis but notes that they haven't taken into account the long-term impact on the brand. Her hypothesis was not about Uptake Rate, but about UI. With very few recommendations, fluff.ai would be a radical departure from the clutter of other similar sites. fluff.ai would attract users from competitors by separating from the pack. But all this may take several months to play out. The data scientists acknowledge that they have no good way to assess this hypothesis. Whether to pursue this radical strategy given the projected hit to metrics was now a business decision.

### Caveat: a more complex reality

We have painted a picture of the fictitious fluff.ai only in enough detail to motivate the types of uncertainty that are central to this article. For the sake of brevity and focus, there is much that we omit. But the state of the online services industry is such that we feel we must at least acknowledge what our upbeat description of fluff.ai leaves out.

Online services, even those that deliver short animal videos, may entail unforeseen effects. As a service scales, so too its potential problems. Its very success may attract bad actors who exploit the platform for profit or mischief. For instance, how should fluff.ai handle a cute animal video that prominently displays a commercial brand, making it something of an ad? What if a video for the Bon Appetit category appears to show dogs being fed chocolate or sugar-free gum (i.e., foods harmful to dogs), misinforming the naive? Even without bad actors, a widely-used service may give rise to bad behavior. Thus, sleepy animals may be the cutest (hence fluff.ai's Sweet Dreams category) but are some pets being chemically sedated or having their symptoms of disease ignored? Finally, runaway success may itself become a societal problem. Apropos, the team at fluff.ai may work hard to deliver the cutest videos, but what if some kids watch them way past their bedtime?

This is the difficult reality of online services. Anyone involved in building a service needs to pay attention to its effects, both expected and unexpected. Concerned with assessing impact, DS has a particularly important role to play here in defining success and choosing metrics. While this blog post does not address these complexities, no responsible data scientist can afford to ignore them.

## Types of uncertainty

We now discuss the uncertainties named in the vignette more formally. But first, what is uncertainty? By "uncertainty" we mean absence of a knowable truth, and hence absence of a perfect answer. Even so, there are better and worse answers, and criteria for evaluating their quality. Judiciously, we narrow and manage the gap between the truth and our estimate but we never entirely eliminate it.

The uncertainties we identify below differ in the nature of the gap, the criteria for evaluating answers, and the strategies a data scientist uses to mitigate the gap.

### Statistical uncertainty (or the problem of measurement)

Our definition of uncertainty quite obviously fits with statistical uncertainty where the estimand (population parameter) is unknown and unknowable, and must be estimated from random observation $X$ by an estimator $\hat{\theta}(X)$. The task is to choose from one of many estimators, none of which can guarantee that $\hat{\theta}(X) = \theta$. We want $\theta$ but must settle for $\hat{\theta}(X)$ at the cost of some accuracy.

Example: we want the "Like Rate" for any video:
• Estimand: Like Rate, i.e., probability that a viewer will "like" the video
• Estimator: (# of likes + $\alpha$)/(# of viewers + $\alpha$ + $\beta$), the Bayesian posterior mean under a Beta($\alpha$, $\beta$) prior

Other estimators are also possible, e.g., Bayesian with a non-parametric prior, and even the Maximum Likelihood estimate, disfavored for its poor small-sample behavior. What is special about statistical uncertainty is that this choice and many criteria for comparison (e.g. mean squared error, consistency, efficiency, admissibility) have long been formalized and mathematized, much of it by Fisher himself (e.g. Royal Society 1922). Should we use a t-test or a sign test? How to compute confidence intervals? The data scientist may choose from the literature, or occasionally extend theory to fashion a custom estimation procedure. The best way to address statistical uncertainty is to study the body of statistical literature. There is plenty out there, so we need say no more.
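As a small illustration, the sketch below contrasts the Maximum Likelihood estimate with a Bayesian posterior mean for the three-viewers, three-likes video from the vignette. For a Beta($\alpha$, $\beta$) prior and binomial likes, the posterior mean is (likes + $\alpha$)/(viewers + $\alpha$ + $\beta$). The function names and the prior Beta(2, 8) are hypothetical; in practice the prior would be fit to the population of videos.

```python
def mle_like_rate(likes, viewers):
    """Maximum Likelihood estimate: the observed fraction of viewers who liked."""
    if viewers <= 0 or not 0 <= likes <= viewers:
        raise ValueError("need viewers > 0 and 0 <= likes <= viewers")
    return likes / viewers


def posterior_mean_like_rate(likes, viewers, alpha, beta):
    """Posterior mean of Like Rate under a Beta(alpha, beta) prior.

    With a binomial likelihood the posterior is
    Beta(alpha + likes, beta + viewers - likes), whose mean is
    (likes + alpha) / (viewers + alpha + beta).
    """
    if viewers < 0 or not 0 <= likes <= viewers:
        raise ValueError("need viewers >= 0 and 0 <= likes <= viewers")
    if alpha <= 0 or beta <= 0:
        raise ValueError("alpha and beta must be positive")
    return (likes + alpha) / (viewers + alpha + beta)


# Three viewers, three likes, as in the vignette.
raw = mle_like_rate(3, 3)  # 1.0, i.e. an observed Like Rate of 100%
# Hypothetical prior Beta(2, 8), i.e. a prior mean Like Rate of 20%.
shrunk = posterior_mean_like_rate(3, 3, alpha=2.0, beta=8.0)  # 5/13, about 0.385
```

With so few viewers the prior dominates, which is exactly the small-sample behavior that makes the raw 100% a poor estimate. As viewers accumulate, the two estimates converge.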

Our purpose is to contrast statistical uncertainty with the other forms of uncertainty we introduce. Statistical uncertainty may involve matters of bias and variance in estimating a quantity. But representational uncertainty asks whether we measured the right thing. To what extent does the thing measured mean what we take it to mean? Finally, interventional uncertainty is about evaluating the consequences of a hypothetical intervention. In assessing impact, what cost and benefits do we measure? And how do we conjure the counterfactual scenarios we need for these measurements?

### Representational uncertainty (or the problem of representation)

Representational uncertainty permeates any effort to represent the state of the world in terms of concepts supported by data. Recall the vignette: we want a measure of the content quality for any video:
• Concept: Content Quality of video
• Estimand: video's true Like Rate
• Estimator: Bayes estimate of Like Rate, using prior Beta($\alpha$, $\beta$)

Another example: we want to represent the country of the user:
• Concept: country of residence of the user
• Estimand: country from which the user is most likely to use fluff.ai
• Estimator: country with the most distinct days of activity from the user in the past 28 days

We choose a concept because it possesses meanings we desire, and we choose a measure (i.e., estimator for an estimand) because we believe it represents something of those desired meanings. But the concept itself is in our head, not in the data. There is a gap between the messy reality of the data beneath the measure and the idealized meanings we may read into it. (DS is usually involved in identifying the relevant concept, not just how to make it measurable, but here we take the concept as given.)
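The country estimator in the second example is easy to write down, which makes the gap between measure and concept easier to see. The following is a minimal sketch under our own assumptions: events arrive as (date, country code) pairs for a single user, the 28-day window includes today, and ties are broken alphabetically. None of these details come from fluff.ai.

```python
from collections import defaultdict
from datetime import date, timedelta


def estimate_country(events, today, window_days=28):
    """Country with the most distinct days of activity in the trailing window.

    events: iterable of (activity_date, country_code) pairs for one user.
    Returns None if the user has no activity in the window.
    Ties are broken alphabetically by country code (an arbitrary choice).
    """
    start = today - timedelta(days=window_days)  # exclusive lower bound
    days_by_country = defaultdict(set)
    for activity_date, country in events:
        if start < activity_date <= today:
            # A set counts each day once, however many events it has.
            days_by_country[country].add(activity_date)
    if not days_by_country:
        return None
    # Most distinct days first, then alphabetical order for ties.
    return min(days_by_country, key=lambda c: (-len(days_by_country[c]), c))
```

For a user who is active from two countries on the same days, the answer is decided by the tie-break rather than by anything about where the user lives, which is the kind of case where the measure and the concept part ways.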

The relationship between concept and estimand is analogous to the relationship between estimand and estimator — in either case, we seek the former but must settle for the latter. It is convention in statistics to use a Latin symbol for an observed quantity and a Greek symbol for an unobserved estimand, perhaps because the Greek alphabet was an ancestor of the Latin. Given that the Greek alphabet itself descended from Egyptian hieroglyphics, we extend this notion by using an Egyptian hieroglyph to represent the concept. We can write the relationship between all three as follows:

$$\text{𓂀} \rightarrow \theta \rightarrow \hat{\theta}(X)$$

where 𓂀 (Unicode U+13080, Eye of Horus) represents the concept. In going from 𓂀 to $\theta$ we lose meaning, while loss in accuracy is incurred in going from $\theta$ to $\hat{\theta}(X)$.

Perhaps more fundamental than the idea of representational uncertainty is the notion of separating the estimand, the quantity we seek to estimate from the data, from the concept, the thing in our heads that we would like the estimand to represent.

As a precedent, consider statistical uncertainty. It is only possible to discuss statistical uncertainty once there is a separation between estimator and estimand. Making this distinction is one of Fisher's great contributions. Now, we can argue about the accuracy of the estimator as the (probabilistic) gap between estimate and estimand. Non-statisticians often fail to see the distinction between estimator and estimand and therefore unknowingly fall prey to "noise".

With representational uncertainty we make a further distinction between concept and measure (the estimator for an estimand). Now, we can argue about the meaningfulness of a measure as the (semantic) gap between measure and concept. If we fail to make this distinction, we will unknowingly fall prey to "semantic noise". Going back to fluff.ai, it would not be possible to discuss the flaws of Like Rate as a measure without articulating what we would want Like Rate to represent (Content Quality).

Aside from statistical uncertainty, the primary reason why a measure may not have all the meanings we desire from the concept is lack of direct measurement. That is, the relationship between the measure and the concept is indirect and not entirely causal. All else being equal, one might be convinced that true Like Rate is tantamount to Content Quality, but all will rarely be equal for two different videos.

The observed relationship between measure and concept may only hold in particular contexts, and may not generalize — if we measure the relationship in one data slice it may not extrapolate to others. For instance, videos with higher content quality tend to garner a higher Like Rate, but this isn't always so. It may also depend on the topic of the video, the subpopulation of its viewers, even differences in user interface between computer and phone (e.g., how prominent is the Like button). In our second example, there is no direct mechanism to determine a user's residence country. One resorts to a reasonable heuristic, even though it will be wrong some of the time (e.g., for those who reside in France but commute daily to Switzerland).

An added challenge is that the world in which the DS team operates is ever in flux. For fluff.ai, the product is changing, how users interact with the product is changing, the needs of the organization are changing, new users are being added, tastes are changing (RIP Grumpy Cat). Many relationships we depend on are not entirely causal, and such changes can disrupt them. For instance, fluff.ai data scientists had found Like Rate to be a good proxy for Content Quality. But that was before the hot new category of Just Arrived videos was added, where this relationship is not as strong (thank you Pandanistas).

To identify the scope of validity of any concept-measure relationship, we bring all our knowledge of the domain to bear on the analysis. In practice, this means cross-checking against other relevant measures for consistency and monotonicity. For example, fluff.ai primarily compared Like Rate to content quality ratings but they also compared it to Abandonment Rate, a related measure. The relationship between Like Rate and Abandonment Rate was not entirely expected, but it made sense given what these measures are supposed to mean. If Like Rate were not monotone with respect to Abandonment Rate, the team would have been skeptical that Like Rate could represent the concept of Content Quality. We require concepts to be mutually reinforcing. Even so, we can never fully corroborate the relationship between concept and measure. Therefore we can narrow the gap between them, but never eliminate it.

The semantics of any concept we have worked hard to establish are tentative in a world of correlations and change. Thus our approach needs to be defensive — each new inference is partial reconfirmation of what (we think) we already know. It's a lot to ask, and can only thrive in an environment that values skepticism, in a culture of integrity that doesn't brush away inconvenient findings. At every step, the data must be able to surprise, to alter beliefs. For example, while evaluating the experiment on raising the Like Rate threshold for recommendations, fluff.ai's data scientists found the anomalous behavior in the Just Arrived category. This led them to identify the problematic subcategory of newborn panda videos. They could go further and try to define a new concept for the breadth of appeal of a video. This concept might be based on the number of unique video categories visited by those who "liked" the video, and employed in recommendation decisions.

Thus far, for reasons of clarity, we have limited our treatment of representational uncertainty to a single concept. But in reality, a DS team works towards the creation of a constellation of concepts that relate to one another. This ontology is a quantified representation of the world within which the organization exists and which it seeks to influence. This is why we call it representational uncertainty. How the ontology is managed is beyond the scope of this post, but a possible subject for a future one.

### Interventional uncertainty (or the problem of evaluating an intervention)

Organizations exist to intervene in the world, and well-run organizations make informed decisions on what they do. Interventions and decisions come in several flavors: they may be bespoke — the typical strategic decision made by executives, or they may be one-of-many — for instance, when we use an experiment framework to enable a class of decisions. In all cases, in deciding whether or not to make a given intervention, the organization's data scientists evaluate and compare two alternative futures.

Example from the vignette: Our recommender system currently only recommends videos with a Like Rate higher than $x$%. Should we raise this threshold from $x$% to $y$%?
• Actual net value:
  • An immediate improvement in overall recommendation quality, though worse for the Just Arrived category
  • Long-term increase in user confidence of our recommendations
  • Long-term change in engagement
  • Long-term change in brand perception
• Possible estimate:
  • Determine impact from a randomized experiment by the objective: $$\Delta \mathrm{Short\ term\ Engagement} + \lambda \Delta \mathrm{recommendation\ Uptake\ Rate}$$

There are alternate futures depending on whether or not we intervene. Interventional uncertainty is not the standard error of an experiment used to estimate the impact of intervening. Rather, it derives from incompleteness of the evaluation criteria and infidelity of the proxies we use for the alternate futures. Since only one future will be realized, the difference between them is unknowable, its ramifications never to be measured directly. Yet we may still estimate this difference based on the requirements of decision makers and reasonable modeling assumptions. We can think of this as a two-step process.

The first step is to identify measures that largely capture the value to the organization from the intervention. We must narrow down the all-inclusive notion of organizational value for the specific decision at hand in order to make it measurable. This is how we frame the decision, i.e., identify the logical hypothesis behind it in order to evaluate the decision. For instance, fluff.ai data scientists framed the decision to raise the Like Rate recommendation threshold as a hypothesized long-term increase in engagement, which they can infer in the near term by a change in engagement and a change in recommendation Uptake Rate. Also note that they did not include long-term brand perception, which was the CPO's motivation.

The second step is to choose two counterfactual scenarios that are good proxies for launch versus status quo, wherein we can evaluate the difference in the measures we identified in the first step. The choice of measures used to quantify the value of the decision drives the choice of comparison points wherein these can be evaluated. For instance, only after they have a hypothesis for increased long-term engagement based on near-term Uptake Rate does it make sense for fluff.ai to run a brief experiment to test the hypothesis. The points of comparison go by various names: "potential outcomes" in statistics, "counterfactuals" in philosophy, "benchmarks" in marketing. DS teams invest in building out methodology to evaluate a variety of counterfactuals. Their tools include observational causality, mathematical modeling, simulation, and ever more sophisticated experimentation.

Recall how fluff.ai's DS team was unable to measure the effects of an intervention on long-term brand perception. It might behoove them to build out their capabilities in this area. For instance, they might use experiments on user panels to develop models for long-term behavior based on short-term metrics. Or they might develop the ability to extrapolate from long running experiments in certain small countries (e.g., Finland). Every method for evaluating counterfactuals has limitations, so it is important to develop a range of methods for evaluating different types of interventions. It is the mark of a mature DS team.

The launch scenario is never perfectly modeled because it is never fully realized in the data. Even randomized user experiments are imperfect. They make simplifying assumptions: e.g., the assignment of one user does not affect the behavior of another (SUTVA); or that the treatment being tested is not affected by seasonality (no one at fluff.ai would expect treatment effects in the Animal Apparel category to generalize if the experiment were run near Halloween). There is always residual interventional uncertainty, i.e., some gap between the true net value of the launch and our estimate; we can narrow this gap, but not eliminate it. For instance, if fluff.ai launches their product change, it will be harder for quirky videos to find their audience, thus limiting product growth beyond users with mainstream tastes. The framing ignores this since it is not captured in experiments by near-term engagement and recommendation Uptake Rate.

There is an interaction between the selection of measures, and the selection of counterfactuals. Our choice of counterfactuals can identify effects that were not initially hypothesized. For instance, we can slice the results of a live experiment to discover unanticipated issues (e.g., lower Uptake Rate of the Just Arrived category). Equally, our choice of counterfactuals could constrain the type of effects we can measure. As you saw, fluff.ai was unable to assess the value of long-term brand perception within a two-week randomized experiment. For the same intervention (reducing the number of recommendations by raising the quality bar) it may also be hard to identify the reaction of content creators — do they respond by creating higher quality videos or do they churn? In general, the net value of all ramifications is hard to estimate.

Finally, organizational values and ethics address hard-to-estimate ramifications. Often ramifications have far reaching effects beyond the organization, to society at large. Because this kind of impact takes place beyond the product itself, it is harder to represent and to estimate in a controlled manner. Data scientists must not let the difficulty of definitive measurements stand in the way of sensible actions. Yes, we work for the organization, but we also hold values as members of society.

## Summary

Good DS only happens when we approach analyses with the right sort of skepticism, and this goes beyond just the statistical variety.

Representational uncertainty is about the gap between the desired meaning (e.g. video quality) of a measure (e.g. Like Rate) and its actual meaning. This gap may arise for a variety of reasons such as unstable correlations or feedback loops. Identifying the source of the gap isn't a mathematical problem — it is about wrestling with the mess of the real world. Because these effects vary over time, taming representational uncertainty is an ongoing process.

Interventional uncertainty is about the gap between the measured value to the organization of an intervention and its actual value in the real world. This gap arises because no set of measures fully captures the change in organizational value, and because in any measurement of value, one of the two outcomes (the world with and without the intervention) is never realized fully.

Representation and intervention are of a pair in science. Perhaps they are more so in DS because the organization that employs DS usually seeks to both understand and change its environment. Thus, its ability to represent and to intervene are coevolving, an aspect of laying down tracks of representation while deciding where to drive the train of intervention. What we believe to be the state of the world depends on what we can measure, and what we measure depends on what we believe to be relevant to the state of the world. The purpose of self-skepticism is to mitigate this circularity. Business cliches such as the draconian "You are what you measure" and the misattributed "If you can't measure it, you can't improve it" don't give enough credit to the two-way dynamic between representation and intervention in DS. Good representations don't just enable desired interventions — they often suggest how one might intervene. Conversely, interventions may themselves affect the quality of representations by altering relationships on which the representations depend. In the case of objectives, this general phenomenon may show up as Goodhart-Strathern's Law ("When a measure becomes a target, it ceases to be a good measure").

Of course, the work of the data scientist involves more than addressing uncertainties. Nor are we making an exhaustive claim as to the types of uncertainties in DS. Nevertheless, we hope that our description of representational and interventional uncertainty rings true to your DS experience. We also hope that our formalization helps foster professional awareness, and stops data scientists from unduly focusing on statistical uncertainty.

Did these new concepts resonate with you? We would love to hear from you in the comment section.