Why model calibration matters and how to achieve it
Calibrated models make probabilistic predictions that match real-world probabilities. This post explains why calibration matters and how to achieve it. It discusses practical problems that calibrated predictions solve and presents a flexible framework for calibrating any classifier. Calibration is useful in many applications, so it's a tool every practicing data scientist should understand.
What is calibration?
At Google we make predictions for a large number of binary events such as "will a user click this ad" or "is this email spam". In addition to the raw classification of $Y = 0$/'NotSpam' or $Y = 1$/'Spam' we are also interested in predicting the probability of the binary event $\Pr(Y = 1 | X)$ for some covariates $X$. One useful property of these predictions is calibration. To explain, let's borrow a quote from Nate Silver's The Signal and the Noise:

"One of the most important tests of a forecast — I would argue that it is the single most important one — is called calibration. Out of all the times you said there was a 40 percent chance of rain, how often did rain actually occur? If, over the long run, it really did rain about 40 percent of the time, that means your forecasts were well calibrated. If it wound up raining just 20 percent of the time instead, or 60 percent of the time, they weren't."
Mathematically we can define this as $\hat{p} = \Pr(Y | \hat{p})$, where $\hat{p}$ is the predicted probability given by our classifier.
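To make this concrete, here is a minimal sketch of how one might check calibration empirically: bin held-out predictions and compare each bin's average prediction with its observed frequency of $Y = 1$. The binning scheme and the helper name are our own illustrative choices, not a prescribed recipe.

```python
import numpy as np

def reliability_table(p_hat, y, n_bins=10):
    """Compare average predictions with observed frequencies per bin.

    For a calibrated model, the mean of p_hat within each bin should be
    close to the empirical rate of Y = 1 in that bin.
    """
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(p_hat, edges) - 1, 0, n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            rows.append((p_hat[mask].mean(), y[mask].mean(), int(mask.sum())))
    return rows  # (mean prediction, observed frequency, count) per bin
```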
While calibration seems like a straightforward and perhaps trivial property, miscalibrated models are actually quite common. For example, some complex models are miscalibrated out-of-the-box, such as Random Forests, SVMs, Naive Bayes, and (modern) neural networks. Simpler models, such as logistic regression, will be miscalibrated if the conditional probability doesn’t follow the specified functional form (e.g. a sigmoid). And even if your simpler model is correctly specified, you still might suffer from the curse of dimensionality (see figure 3 in Candes and Sur). Things get even more complicated when you bring in regularization, complex reweighting schemes, or boosting. Given all of the potential causes of miscalibration, you shouldn’t assume your model will be calibrated.
Why calibration matters
What are the consequences of miscalibrated models? Intuitively, you want to have calibration so that you can interpret your estimated probabilities as long-run frequencies. In that sense, the question could be "why wouldn't you want to be calibrated?". But let's go further and point out some practical reasons why we want our model to be calibrated:

Practical Reason #1: Estimated probabilities allow flexibility
If you are predicting with a classifier whether a user will click on an ad, it's useful to be able to rank ads by their probability of being clicked. This does not require calibration. But if you want to calculate the expected number of clicks, you need calibrated probabilities. This expected value can be helpful for simulating the impact of an experiment (does this increase expected clicks enough to merit running an actual experiment?) or can be used directly, for example by not serving ads whose expected revenue doesn't exceed their cost.
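As a toy illustration (with entirely made-up numbers), calibrated click probabilities let you compute expected clicks directly and make a simple expected-revenue-versus-cost serving decision:

```python
import numpy as np

# Hypothetical calibrated click probabilities and economics for three ads.
p_click = np.array([0.02, 0.10, 0.005])
revenue_per_click = np.array([1.50, 0.40, 3.00])
cost_to_serve = np.array([0.01, 0.05, 0.02])

expected_clicks = p_click.sum()                      # only meaningful if calibrated
serve = p_click * revenue_per_click > cost_to_serve  # expected revenue vs. cost
print(expected_clicks, serve)
```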
Practical Reason #2: Model Modularity
In complex machine learning systems, models depend on each other. Single classifiers are often inputs into larger systems that make the final decisions. For these ML systems, calibration simplifies the interactions between components. Calibration allows each model to focus on estimating its particular probabilities as well as possible. And since the interpretation is stable, other system components don't need to shift whenever models change.
For example, let's say you quantify the importance of an email using a $\Pr(\mbox{Important})$ model. This is then an input to a $\Pr(\mbox{Spam})$ model, and the $\Pr(\mbox{Spam})$ model decides which emails get flagged as spam. Now suppose the $\Pr(\mbox{Important})$ model becomes miscalibrated and starts assigning probabilities that are too high to emails being important. In that case, you can just change the threshold for $\Pr(\mbox{Important})$ and the system seems to be back to normal. However, downstream your $\Pr(\mbox{Spam})$ model just sees the shift and starts under-predicting spam, because the upstream importance signal is telling it that these emails are likely to be important. The numerical value of the signal became decoupled from the event it was measuring even as the ordinal value remained unchanged. And users may start receiving a lot more spam!
Calibration and other considerations
Calibration is a desirable property, but it is not the only important metric. Indeed, merely being calibrated may not even be helpful for the task at hand. Consider predicting whether an email is spam. Suppose 10% of messages are spam and we predict $\Pr(\mathrm{Spam}) = 0.1$ for every individual email. This is well calibrated, but it doesn't do anything to help your inbox.

The problem with the previous example is that when we predicted $\Pr(\mathrm{Spam}) = 0.1$ we didn't condition on any covariates. A model that considers whether the email came from a known address and predicts $\Pr(\mathrm{Spam}|\mathrm{Known\ Sender}) = 0.01$ and $\Pr(\mathrm{Spam}|\mathrm{Unknown\ Sender}) = 0.4$ could be perfectly calibrated and also more useful. We can make the tension between calibration and usefulness precise with a standard decomposition of the quadratic loss (Brier score):
$$
E[(\hat{p}-Y)^2] = E[(\hat{p} - \pi(\hat{p}))^2] - E[(\pi(\hat{p}) - \bar{\pi})^2] + \bar{\pi} (1 - \bar{\pi})
$$
where $\pi(\hat{p})$ is $\Pr(Y | \hat{p})$ and $\bar{\pi}$ is $\Pr(Y)$.
Let’s examine each of these terms in the decomposition:
- $E[(\hat{p} - \pi(\hat{p}))^2]$: This term is calibration. If you have a perfectly calibrated classifier this will be zero and deviations from calibration will hurt the loss.
- $- E[(\pi(\hat{p}) - \bar{\pi})^2]$: This term is sharpness. The further your predictions are from the global average, the more you improve the loss.
- $\bar{\pi} (1 - \bar{\pi})$: This is the irreducible loss due to uncertainty.
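To make the decomposition concrete, here is a rough sketch of how the three terms can be estimated from held-out predictions and labels, approximating $\pi(\hat{p})$ by within-bin label frequencies (the equal-width binning is our own simplification):

```python
import numpy as np

def brier_decomposition(p_hat, y, n_bins=10):
    """Binned estimates of the calibration, sharpness, and irreducible terms.

    Groups predictions into equal-width bins and uses within-bin label
    frequencies to approximate pi(p_hat). The quadratic loss is then
    roughly calibration - sharpness + irreducible.
    """
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(p_hat, edges) - 1, 0, n_bins - 1)
    pi_bar = y.mean()
    calibration, sharpness = 0.0, 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            weight = mask.mean()          # fraction of examples in this bin
            pi_b = y[mask].mean()         # empirical Pr(Y | p_hat in bin)
            calibration += weight * (p_hat[mask].mean() - pi_b) ** 2
            sharpness += weight * (pi_b - pi_bar) ** 2
    return calibration, sharpness, pi_bar * (1 - pi_bar)
```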
The relationship between calibration and sharpness is complicated. For instance, we can always coarsen predictions to improve calibration: indeed, we can coarsen all the way to the global average and achieve perfect calibration. But is there some intrinsic tradeoff between the two? Will calibrating always decrease sharpness?
This depends on the nature of the miscalibration, i.e., whether the model is over- or underconfident. Overconfidence occurs when the model is too close to the extremes: it predicts something with 99% probability when it actually happens 80% of the time. This is symmetric: the model is also overconfident when it predicts 1% for something that actually happens 20% of the time. The ultimate overconfident model would just predict 0s and 1s as probabilities. The opposite problem occurs when the model is underconfident: the ultimate underconfident model might just predict 0.5 (or the global average) for each observation.
If the model is overconfident and too far into the tails, we lose sharpness to improve calibration. If the model is underconfident and not far enough into the tails, we can improve both calibration and sharpness. In principle, this means that for finite samples you can end up with either a lower or a higher quadratic loss (or other losses) after applying the calibration methods we discuss below. In practice, we haven't observed worse performance in either the quadratic loss or the log loss.
Other important metrics we consider are accuracy (the proportion of correct classifications) and discrimination-based metrics like AUC. These are less affected by calibration because they are only functions of the ordered probabilities and their labels (assuming you adjust your classification threshold for accuracy appropriately). As we discuss below, we can choose calibration functions that leave accuracy and AUC unchanged.
This implies that if we care about AUC, but calibration also matters for our application, we can take the shortcut of picking the best model according to AUC and applying a calibration fix on top of it. In fact, this is exactly our situation in the notifications case study described in a later section.
How should practitioners integrate calibration into their workflow? First, you should decide if calibration matters for your application. If calibration matters, our recommendation is to follow the paradigm proposed by Gneiting (2007): pick the best performing model amongst models that are approximately calibrated, where "approximately calibrated" is discussed in the next section.
How to be calibrated
If you discover that your classifier is miscalibrated, your first instinct might be to fix the classifier itself. We suggest a different approach: view your classifier as a black box and learn a calibration function which transforms its output to be calibrated. The reason you don't want to adjust your classifier directly is that doing so requires adapting the fix to your specific method. For instance, a random forest classifier will have different problems than a neural network.
The model-as-black-box perspective assumes that fixing the model is intractable analytically. Instead, we just ignore the model’s internal structure and fix things with a method-agnostic approach. This is the same fruitful perspective taken by the jackknife, conformal prediction, and second-order calibration. This gives us the advantage that we can rapidly iterate on the model's structure and features and not have to worry about calibration every time. Of course it comes at the cost of maintaining a separate step, but we’ll show you that calibration functions are not particularly complicated to add/maintain.
Finally, a frequent discussion topic is the relationship between calibration and slices. So far we have only been talking about global calibration: $\hat{p} = \Pr(Y | \hat{p})$. You can also have calibration for a particular slice: $\hat{p} = \Pr(Y | \hat{p}, Z)$ for some covariates $Z$. Calibration across a rich set of covariates $Z$ is unlikely, as it's an even bigger ask than global calibration. In fact, the only model that's perfectly calibrated across all slices is the true model. However, you might have good reasons for wanting calibration on a few select slices.
Like with global calibration, you can calibrate your model on slices/subsets of data. But if you calibrate across too many slices, things can become as complicated as the original model. To keep things manageable, our recommendation is to calibrate globally, and to calibrate a small number of slices that affect important decisions as needed.
How calibration functions work
A calibration function takes as input the predicted probability $\hat{p}$ and outputs a calibrated probability $p$. In this way, you can view it as a single-input probabilistic classifier: given $\hat{p}$ as the sole covariate, can you predict the true probability $p$?

Viewed this way, we can start imposing some conditions that determine how our calibration function is constructed:
- The calibration function should minimize a strictly proper scoring rule. Strictly proper scoring rules are loss functions such that the unique minimizer is the true probability distribution. Log-loss and quadratic loss are two such examples. This ensures that with enough data our calibration function converges to the true probabilities: $\hat{p} = \Pr(Y | \hat{p})$.
- The calibration function should be strictly monotonic. It would be counterintuitive to flip the order of two predictions when the original model says one is more likely than the other. Additionally, a monotonic calibration function preserves the ranking of predictions: this means that AUC isn't affected (indeed, you can estimate AUC and train your calibration function on the same data: more on that later).
- The calibration function should be flexible. Miscalibration may not fit a specific parametric form so we need a non-parametric model.
- The calibration function needs to be trained on independent data. Otherwise you might be vulnerable to extreme overfitting: your model could be too confident predicting close to zero and one and then your calibration function makes it even more overconfident.
- Calibration functions should be implemented in Tensorflow. This is an extra Google requirement because many of our important models are implemented in Tensorflow. Keeping tooling the same allows easy collaboration and coordination. Also, implementing the calibration function in the same graph as the original model simplifies the training process: you can use stop-gradients to only train the calibration function in stage two (see the sketch after this list). Similarly, it simplifies serving, as the calibration function is just another layer in the original graph.
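To illustrate that last point, here is a minimal sketch of the two-stage setup (not production code; `base_model` and `calibration_layer` stand in for whatever Keras model and monotonic calibration layer you actually use):

```python
import tensorflow as tf

class CalibratedModel(tf.keras.Model):
    """Wraps a trained base model with a trainable calibration layer."""

    def __init__(self, base_model, calibration_layer):
        super().__init__()
        self.base_model = base_model
        self.calibration_layer = calibration_layer

    def call(self, inputs):
        p_hat = self.base_model(inputs)
        # stop_gradient freezes the base model in stage two: training this
        # wrapper only updates the calibration layer's weights.
        return self.calibration_layer(tf.stop_gradient(p_hat))
```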
Calibration Methods
With these requirements in mind, let's consider some traditional calibration methods.

The first method is Platt's scaling, which uses a logistic regression as the calibration function. This is easy to fit, but it violates our requirement for flexibility. As a parametric function, it doesn't flexibly adapt to more complicated calibration curves.
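For reference, here is a minimal Platt scaling sketch, written with scikit-learn purely for brevity; `p_hat` and `y` are assumed to be held-out predictions and labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def platt_scale(p_hat, y):
    """Fit a logistic regression on the log-odds of the raw predictions."""
    def to_logit(p):
        p = np.clip(p, 1e-6, 1 - 1e-6)               # guard against exact 0/1
        return np.log(p / (1 - p)).reshape(-1, 1)

    lr = LogisticRegression(C=1e6).fit(to_logit(p_hat), y)  # ~unregularized
    return lambda q: lr.predict_proba(to_logit(q))[:, 1]
```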
Isotonic regression solves this problem by switching from logistic regression to a fully nonparametric regression. This is almost everything we would want from a calibration method. However, it has two downsides. First, it's not strictly monotonic: the model outputs piecewise constant functions, which produces ties and thus affects AUC. Second, isotonic regression is hard to fit into Tensorflow.
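A corresponding isotonic regression sketch, again with scikit-learn and the same held-out `p_hat` and `y` as above:

```python
from sklearn.isotonic import IsotonicRegression

# Fits a non-decreasing step function from raw to calibrated predictions.
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(p_hat, y)
p_calibrated = iso.predict(p_hat)   # piecewise constant, hence the possible ties
```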
With both Platt's scaling and isotonic regression failing to satisfy all our requirements, we need another method. Taking a step back, it's clear we simply need a strictly monotonic regression function that is easy to fit in Tensorflow. This gives us two potential candidates: Tensorflow Lattice and I-Splines (monotonic neural nets are another option, but they have not worked as well in our experience). Tensorflow Lattice is described in the previous blog post, so we will focus on I-Splines here.
I-Spline Calibration
Splines are piecewise polynomial functions which, amongst other applications, are used to learn nonparametric regression functions. Splines can be constructed as a linear combination of basis functions:
$$\Pr(y | x) = \sum_{i = 1}^R \beta_i \phi_i(x)$$
where the basis functions $\phi_i(x)$ are pre-specified. To fit splines, we learn the weights $\beta_i$. A standard choice of basis is the B-Spline basis, which is defined recursively:
$$
B_{i,1}(x) =
\begin{cases}
1 & t_i \leq x < t_{i+1} \\
0 & \mbox{otherwise}
\end{cases}
\\
B_{i,k+1}(x) =
\frac{x-t_i}{t_{i+k} - t_i} B_{i,k}(x) +
\frac{t_{i+1+k}-x}{t_{i+1+k} - t_{i+1}} B_{i+1,k}(x)
$$
where $k$ represents the order of the polynomial function, and $T = [t_1, t_2, \ldots, t_k]$ is a set of non-decreasing knots where the piecewise polynomials connect.
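For the curious, here is a direct (and deliberately unoptimized) NumPy transcription of the recursion above, using the usual convention that terms with a zero denominator are dropped; the function name and interface are our own:

```python
import numpy as np

def b_spline_basis(x, knots, k):
    """Evaluate all order-k B-Spline basis functions at the points x.

    Order 1 is an indicator of each knot interval; higher orders follow
    the recursion in the text. Returns an array of shape (len(x), n_bases).
    """
    x, t = np.asarray(x, float), np.asarray(knots, float)
    # B_{i,1}: indicator of [t_i, t_{i+1}).
    B = np.array([(t[i] <= x) & (x < t[i + 1])
                  for i in range(len(t) - 1)], dtype=float).T
    for order in range(1, k):
        nb = B.shape[1] - 1
        B_next = np.zeros((len(x), nb))
        for i in range(nb):
            d_left = t[i + order] - t[i]
            d_right = t[i + 1 + order] - t[i + 1]
            if d_left > 0:
                B_next[:, i] += (x - t[i]) / d_left * B[:, i]
            if d_right > 0:
                B_next[:, i] += (t[i + 1 + order] - x) / d_right * B[:, i + 1]
        B = B_next
    return B
```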
B-Splines don't quite work for calibration functions, since they're not monotonic. Fortunately, you can make splines monotonic with Integrated Splines, or I-Splines. The idea behind I-Splines is that, since properly normalized B-Splines only take non-negative values, the integral of a B-Spline is always non-decreasing. And if we pair non-decreasing basis functions with positive weights, we achieve monotonicity. Concretely, the I-Spline basis functions can be computed from B-Splines of one higher order:
$$
I_{j,m}(x) = \sum_{l=j}^R B_{l,m+1}(x)
$$
where $R$ is the number of spline functions. This formula shows that we can compute I-Splines using reversed cumulative sums of B-Splines.
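Continuing the sketch and reusing `b_spline_basis` from above, the reversed cumulative sum is a one-liner:

```python
import numpy as np

def i_spline_basis(x, knots, order):
    """I-Spline basis of the given order via order+1 B-Splines.

    Each I-Spline is a reversed cumulative sum over the higher-order
    B-Spline basis, so every column is a non-decreasing function of x.
    """
    B = b_spline_basis(x, knots, order + 1)        # shape (len(x), R)
    return np.cumsum(B[:, ::-1], axis=1)[:, ::-1]  # I_j = sum_{l >= j} B_l
```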
Putting it all together:
- We can evaluate the basis functions using the closed form expressions for I-splines given in this section.
- We achieve monotonicity by restricting our weights to be positive. In practice, we enforce this by optimizing over $\log(\beta_i)$.
- We can optimize the weights using a proper scoring rule, such as log loss or mean squared error (a minimal fitting sketch follows this list).
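Here is a minimal fitting sketch that puts those three pieces together (reusing `i_spline_basis` from above; the knot placement, clipping constants, and optimizer are arbitrary choices on our part, and `p_hat`/`y` are held-out predictions and labels):

```python
import numpy as np
from scipy.optimize import minimize

def fit_ispline_calibrator(p_hat, y, order=2, n_knots=10):
    """Fit positive I-Spline weights by minimizing log loss."""
    knots = np.linspace(0.0, 1.0, n_knots)
    clip = lambda p: np.clip(p, 1e-6, 1 - 1e-6)
    Phi = i_spline_basis(clip(p_hat), knots, order)   # each column non-decreasing

    def log_loss(log_beta):
        beta = np.exp(log_beta)                       # positivity => monotone map
        p = clip(Phi @ beta)
        return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

    result = minimize(log_loss, x0=np.zeros(Phi.shape[1]), method="L-BFGS-B")
    beta = np.exp(result.x)
    return lambda q: clip(i_spline_basis(clip(q), knots, order) @ beta)
```

Per the independence requirement above, you would fit this on a held-out calibration set rather than the data used to train the original model.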
Now that we understand how calibration functions work, and a few of our options, let’s look at a practical example.
Case Study: Notifications Opt Out Model
At Google, one way we interact with users is through notifications. We want to send users notifications they find valuable, but we don't want to annoy them with too many notifications.

To determine which notifications to send, we use models to predict whether or not a user will dismiss, opt out, or otherwise react to a notification in a negative way. Then we use the predictions to decide whether or not to send the notification, by ensuring that the probability of a negative reaction is small. This is sometimes done using a combination of ML model outputs, with coefficients that have been tuned through offline simulations.
One model that’s particularly challenging is the $\Pr(\mbox{Opt Out})$ model. The $\Pr(\mbox{Opt Out})$ model predicts whether a user will opt out of receiving notifications if a particular notification is sent. Since opt outs can be a clear negative signal from a user, large $\Pr(\mbox{Opt Out})$ predictions prevent us from sending annoying notifications.
A difficulty with predicting opt outs is that they’re rare, so the model suffers from class imbalance issues. To improve model performance in class imbalanced problems, a standard trick is re-weighting the data to emphasize examples from the minority class. But re-weighting based on the class changes the distribution of the predicted probabilities, which leads to miscalibration. In this case, calibration functions automatically compensate for scaling issues. An added bonus is that calibration functions work for any re-weighting method, so engineers can quickly iterate on new methods.
Let’s start by looking at the reliability diagram for various calibration methods. The calibration methods we’ll try here are Platt Scaling, Isotonic Regression, and I-Splines. Also, since this model iteration re-weights the Opt Outs based on a set of features, our baseline method is to invert the weights computed from the features, which we include here as well.
Let's take a closer look at the calibration functions themselves, which reveals more about the differences between Isotonic Regression and I-Splines:
Finally, we mentioned earlier that when calibration functions are strictly monotonic, applying them leaves AUC unchanged. We can confirm this in the following table, which shows that the AUCs for Platt Scaling and I-Splines are identical up to 8 decimal places. In this case, Isotonic Regression isn't quite the same due to ties:
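As a quick sanity check of the AUC claim (using simulated data rather than the notifications data), any strictly increasing transform of the scores leaves AUC untouched:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)                        # toy labels
p_hat = np.clip(rng.normal(0.3 + 0.3 * y, 0.2), 0.01, 0.99)

def monotone(p):
    # Some arbitrary strictly increasing transform of the scores.
    return 1.0 / (1.0 + np.exp(-(3.0 * np.log(p / (1 - p)) + 0.5)))

print(roc_auc_score(y, p_hat), roc_auc_score(y, monotone(p_hat)))  # identical
```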
We hope this example has convinced you that when your model gets too complicated, and you want to iterate on methods quickly, it’s ok to stop worrying and use a calibration function.