Why model calibration matters and how to achieve it

by LEE RICHARDSON & TAYLOR POSPISIL

Calibrated models make probabilistic predictions that match real-world probabilities. This post explains why calibration matters and how to achieve it. It discusses practical issues that calibrated predictions solve and presents a flexible framework to calibrate any classifier. Calibration is useful in many applications, so the practicing data scientist should understand this tool.

What is calibration?

At Google we make predictions for a large number of binary events such as “will a user click this ad” or “is this email spam”. In addition to the raw classification of $Y = 0$/'NotSpam' or $Y = 1$/'Spam' we are also interested in predicting the probability of the binary event $\Pr(Y = 1 | X)$ for some covariates $X$. One useful property of these predictions is calibration. To explain, let’s borrow a quote from Nate Silver’s The Signal and the Noise:

One of the most important tests of a forecast — I would argue that it is the single most important one — is called calibration. Out of all the times you said there was a 40 percent chance of rain, how often did rain actually occur? If, over the long run, it really did rain about 40 percent of the time, that means your forecasts were well calibrated. If it wound up raining just 20 percent of the time instead, or 60 percent of the time, they weren’t.

Mathematically we can define this as $\hat{p} = \Pr(Y | \hat{p})$, where $\hat{p}$ is the predicted probability given by our classifier and $\Pr(Y | \hat{p})$ is shorthand for $\Pr(Y = 1 | \hat{p})$.

While calibration seems like a straightforward and perhaps trivial property, miscalibrated models are actually quite common. For example, some complex models are miscalibrated out-of-the-box, such as Random Forests, SVMs, Naive Bayes, and (modern) neural networks. Simpler models, such as logistic regression, will be miscalibrated if the conditional probability doesn’t follow the specified functional form (e.g. a sigmoid). And even if your simpler model is correctly specified, you still might suffer from the curse of dimensionality (see figure 3 in Candes and Sur). Things get even more complicated when you bring in regularization, complex reweighting schemes, or boosting. Given all of the potential causes of miscalibration, you shouldn’t assume your model will be calibrated.

Why calibration matters

What are the consequences of miscalibrated models? Intuitively, you want to have calibration so that you can interpret your estimated probabilities as long-run frequencies. In that sense, the question could be “why wouldn’t you want to be calibrated?”. But let’s go further and point out some practical reasons why we want our model to be calibrated:

Practical Reason #1: Estimated probabilities allow flexibility

If you are predicting whether a user will click on an ad with a classifier, it’s useful to be able to rank ads by their probability of being clicked. This does not require calibration. But if you want to calculate the expected number of clicks then you need calibrated probabilities. This expected value can be helpful for simulating the impact of an experiment (does this increase expected clicks enough to merit running an actual experiment?) or may be used directly, by serving only ads whose expected revenue exceeds their cost.
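A small sketch of the distinction between ranking and expected value. The click probabilities below are made-up illustrative numbers, and the square-root distortion is just one arbitrary example of a miscalibrated but order-preserving score.

```python
import numpy as np

# Hypothetical true click probabilities for five ads (illustrative values).
p_true = np.array([0.02, 0.05, 0.10, 0.20, 0.40])

# A miscalibrated score: a strictly increasing distortion of the truth.
p_miscal = np.sqrt(p_true)

# The ranking is identical, so ranking alone does not need calibration.
same_ranking = np.array_equal(np.argsort(p_true), np.argsort(p_miscal))

# Expected clicks when each ad gets 1000 impressions is a sum of
# probabilities, so it is only right when the probabilities are calibrated.
impressions = 1000
expected_true = impressions * p_true.sum()
expected_miscal = impressions * p_miscal.sum()

print(same_ranking, expected_true, expected_miscal)
```

The two scores order the ads the same way, but the miscalibrated one overstates the expected click count.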

Practical Reason #2: Model Modularity

In complex machine learning systems, models depend on each other. Single classifiers are often inputs into larger systems that make the final decisions.

For these ML systems, calibration simplifies interaction. Calibration allows each model to focus on estimating its particular probabilities as well as possible. And since the interpretation is stable, other system components don’t need to shift whenever models change.

For example, let’s say you quantify the importance of an email using a $\Pr(\mbox{Important})$ model. This is then an input to a $\Pr(\mbox{Spam})$ model, and the $\Pr(\mbox{Spam})$ model decides which emails get flagged as spam. Now suppose the $\Pr(\mbox{Important})$ model becomes miscalibrated and starts assigning probabilities of importance that are too high. In this case, you can just change the threshold for $\Pr(\mbox{Important})$ and the system seems to be back to normal. However, your downstream $\Pr(\mbox{Spam})$ model just sees the shift and starts under-predicting spam, because the upstream importance signal is telling it that emails are likely to be important. The numerical value of the signal became decoupled from the event it was measuring even as the ordinal value remained unchanged. And users may start receiving a lot more spam!

Calibration and other considerations

Calibration is a desirable property, but it is not the only important metric. Indeed, merely having calibration may not even be helpful for the task at hand. Consider predicting whether an email is spam. Assuming 10% of messages are spam, we predict $\Pr(\mathrm{Spam}) = 0.1$ for each individual email. This is well-calibrated but it doesn’t do anything to help your inbox.

The problem with the previous example is that when we predicted $\Pr(\mathrm{Spam}) = 0.1$ we didn’t condition on any covariates. A model that considers whether the email came from a known address and predicts $\Pr(\mathrm{Spam}|\mathrm{Known\ Sender}) = 0.01$ and $\Pr(\mathrm{Spam}|\mathrm{Unknown\ Sender}) = 0.4$ could be perfectly calibrated and also more useful.

To examine the difference between these two models, let’s consider the expected quadratic loss function, which we can decompose as $$E[(\hat{p}-Y)^2] = E[(\hat{p} - \pi(\hat{p}))^2] - E[(\pi(\hat{p}) - \bar{\pi})^2] + \bar{\pi} (1 - \bar{\pi})$$ where $\pi(\hat{p})$ is $\Pr(Y | \hat{p})$ and $\bar{\pi}$ is $\Pr(Y)$.

Let’s examine each of these terms in the decomposition:
• $E[(\hat{p} - \pi(\hat{p}))^2]$: This term is calibration. If you have a perfectly calibrated classifier this will be zero and deviations from calibration will hurt the loss.
• $- E[(\pi(\hat{p}) - \bar{\pi})^2]$: This term is sharpness. The further your predictions are from the global average the more you improve the loss.
• $\bar{\pi} (1 - \bar{\pi})$: This is the irreducible loss due to uncertainty.
This shows why $\Pr(\mathrm{Spam}) = 0.1$ isn’t good enough: it optimizes the calibration term, but pays the price in sharpness. And if our model has useful features (known senders are less likely to send spam), the global average model ($\Pr(\mathrm{Spam}) = 0.1$) should have a worse quadratic loss than our model.
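The decomposition can be checked numerically on the two spam models. This is a minimal sketch: the helper name `brier_terms` is hypothetical, and the share of known senders is an assumption chosen so that the overall spam rate is 10%.

```python
import numpy as np

def brier_terms(weights, p_hat, pi):
    """Population terms of the quadratic-loss decomposition.

    weights: probability of each prediction group (sums to 1)
    p_hat:   predicted probability in each group
    pi:      true event rate in each group, i.e. Pr(Y = 1 | p_hat)
    """
    weights, p_hat, pi = (np.asarray(a, dtype=float) for a in (weights, p_hat, pi))
    pi_bar = np.sum(weights * pi)
    calibration = np.sum(weights * (p_hat - pi) ** 2)
    sharpness = np.sum(weights * (pi - pi_bar) ** 2)
    uncertainty = pi_bar * (1 - pi_bar)
    # Direct expected loss E[(p_hat - Y)^2] with Y ~ Bernoulli(pi).
    loss = np.sum(weights * (pi * (p_hat - 1) ** 2 + (1 - pi) * p_hat ** 2))
    return loss, calibration, sharpness, uncertainty

# Assumed share q of known senders, solving 0.01 * q + 0.4 * (1 - q) = 0.1.
q = 0.3 / 0.39

constant = brier_terms([1.0], [0.1], [0.1])
by_sender = brier_terms([q, 1 - q], [0.01, 0.4], [0.01, 0.4])

for name, (loss, cal, sharp, unc) in [("constant", constant), ("by sender", by_sender)]:
    print(f"{name}: loss={loss:.4f} calibration={cal:.4f} "
          f"sharpness={sharp:.4f} uncertainty={unc:.4f}")
```

Both models have a calibration term of zero and the same uncertainty term, so the entire difference in loss comes from sharpness.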

The relationship between calibration and sharpness is complicated. For instance, we can always coarsen predictions to improve calibration: indeed we can coarsen all the way to the global average and achieve perfect calibration. But is there some intrinsic tradeoff between the two? Will calibration always decrease sharpness?

This depends on the nature of the miscalibration, i.e., whether the model is over- or underconfident. Overconfidence occurs when the model is too close to the extremes: it predicts something with 99% when it actually happens at 80%. This is symmetric: it is also overconfident when it predicts something at 1% when it actually happens at 20%. The ultimate overconfident model would just predict 0s or 1s as probabilities. The opposite problem occurs when the model is underconfident: the ultimate underconfident model might just predict 0.5 (or the global average) for each observation.

If the model is overconfident and too far into the tails, we lose sharpness to improve calibration. If the model is underconfident and not far enough into the tails, we can improve both calibration and sharpness. In principle, this means you can end up with either a lower or higher quadratic loss (or other loss functions) for finite samples after implementing the calibration methods we discuss below. In practice, we haven’t observed worse performance, in either the quadratic loss or log loss.

Other important losses we consider are accuracy (the proportion of correct classifications) and discrimination based metrics like AUC. These are less affected by calibration because they are only functions of the ordered probabilities and their labels (assuming you change your threshold for accuracy appropriately). We discuss below that we can choose calibration functions which keep accuracy and AUC unchanged.
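To illustrate why AUC depends only on the ordering, here is a sketch on simulated data. The `auc` helper is a hypothetical pairwise implementation written for this example, not a library function, and the transforms are arbitrary choices.

```python
import numpy as np

def auc(scores, labels):
    """AUC as the probability that a random positive outranks a random
    negative, counting ties as one half."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (len(pos) * len(neg))

rng = np.random.default_rng(0)
scores = rng.uniform(size=500)
labels = rng.binomial(1, scores ** 2)   # true rate differs from the score

strict = scores ** 2                    # strictly increasing on [0, 1]
stepwise = np.round(scores, 1)          # monotone but not strict: creates ties

auc_raw = auc(scores, labels)
auc_strict = auc(strict, labels)
auc_step = auc(stepwise, labels)
print(auc_raw, auc_strict, auc_step)
```

The strictly increasing transform preserves every pairwise comparison, so its AUC matches the raw scores. The stepwise transform introduces ties, which is how a non-strict calibration function can shift AUC.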

This implies that if we care about AUC, but calibration also matters for our application, we can take the shortcut of just picking the best model according to AUC and applying a calibration fix on top of it. In fact, this is exactly our situation in the notifications case-study described in a later section.

How should practitioners integrate calibration into their workflow? First, you should decide if calibration matters for your application. If calibration matters, our recommendation is to follow the paradigm proposed by Gneiting (2007): pick the best performing model amongst models that are approximately calibrated, where "approximately calibrated" is discussed in the next section.

How to be calibrated

At this point, you may be convinced that you want your model to be calibrated. The question then becomes, how can you achieve calibration? The natural first step is checking whether you have already achieved calibration. Practically speaking, we are interested in whether your model is calibrated enough. We can check this by plotting your predicted probability against your empirical probability for some quantile buckets of your data.

Miscalibration will be recognizable as a deviation from the diagonal line that represents perfect calibration. Usually an eye-test suffices to diagnose problems (see above) although you could check more formally with hypothesis testing or thresholding calibration specific metrics.
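A minimal sketch of the quantile-bucket check, on simulated data. The helper name `reliability_table`, the bucket count, and the simulated overconfident model are all assumptions made for illustration.

```python
import numpy as np

def reliability_table(p_hat, y, n_buckets=10):
    """Mean prediction and empirical rate within quantile buckets of p_hat."""
    p_hat = np.asarray(p_hat, dtype=float)
    y = np.asarray(y, dtype=float)
    order = np.argsort(p_hat)
    # array_split gives near-equal-sized buckets of the sorted predictions.
    buckets = np.array_split(order, n_buckets)
    mean_pred = np.array([p_hat[b].mean() for b in buckets])
    emp_rate = np.array([y[b].mean() for b in buckets])
    counts = np.array([len(b) for b in buckets])
    return mean_pred, emp_rate, counts

rng = np.random.default_rng(0)
p_true = rng.uniform(0.01, 0.99, size=20000)
y = rng.binomial(1, p_true)

# Simulated overconfident model: doubles the logit, pushing toward 0 and 1.
logit = np.log(p_true / (1 - p_true))
p_over = 1 / (1 + np.exp(-2.0 * logit))

gaps = {}
for name, p in [("calibrated", p_true), ("overconfident", p_over)]:
    mean_pred, emp_rate, counts = reliability_table(p, y)
    # Count-weighted average distance from the diagonal.
    gaps[name] = np.sum(counts * np.abs(mean_pred - emp_rate)) / counts.sum()
    print(f"{name}: weighted |predicted - empirical| = {gaps[name]:.3f}")
```

Plotting `mean_pred` against `emp_rate` gives the reliability diagram. The weighted gap is one simple calibration-specific summary that could be thresholded.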

If you discover that your classifier is miscalibrated, you might want to start fixing the classifier itself. We suggest a different approach: view your classifier as a black box and learn a calibration function which transforms your output to be calibrated. The reason you don’t want to adjust your classifier directly is it requires adaptation to your specific method. For instance, a random forest classifier will have different problems than a neural network.

The model-as-black-box perspective assumes that fixing the model is intractable analytically. Instead, we just ignore the model’s internal structure and fix things with a method-agnostic approach. This is the same fruitful perspective taken by the jackknife, conformal prediction, and second-order calibration. This gives us the advantage that we can rapidly iterate on the model's structure and features and not have to worry about calibration every time. Of course it comes at the cost of maintaining a separate step, but we’ll show you that calibration functions are not particularly complicated to add/maintain.

Finally, a frequent discussion topic is the relationship between calibration and slices. Currently we’re only talking about global calibration: $\hat{p} = \Pr(Y | \hat{p})$. You can also have calibration for a particular slice: $\hat{p} = \Pr(Y | \hat{p}, Z)$ for some covariates $Z$. Calibration for large $Z$ is unlikely, as it’s an even bigger ask than global calibration. In fact, the only model that’s perfectly calibrated across all slices is the true model. However, you might have good reasons for wanting calibration on a few select slices.

Like with global calibration, you can calibrate your model on slices/subsets of data. But if you calibrate across too many slices, things can become as complicated as the original model. To keep things manageable, our recommendation is to calibrate globally, and to calibrate a small number of slices that affect important decisions as needed.

How calibration functions work

A calibration function takes as input the predicted probability $\hat{p}$ and outputs a calibrated probability $p$. In this way, you can view it as a single input probabilistic classifier: given $\hat{p}$ as the sole covariate, can you predict the true probability $p$?

Viewed this way, we can start imposing some conditions that will determine how our calibration function is constructed:
• The calibration function should minimize a strictly proper scoring rule. Strictly proper scoring rules are loss functions such that the unique minimizer is the true probability distribution. Log-loss and quadratic loss are two such examples. This ensures that with enough data our calibration function converges to the true probabilities: $\hat{p} = \Pr(Y | \hat{p})$.
• The calibration function should be strictly monotonic. It doesn’t feel intuitive to flip predictions if the model suggests one is more likely. Additionally, a monotonic calibration function preserves the ranking of predictions: this means that AUC isn’t affected (indeed you can estimate AUC and train your calibration function on the same data: more on that later).
• The calibration function should be flexible. Miscalibration may not fit a specific parametric form so we need a non-parametric model.
• The calibration function needs to be trained on independent data. Otherwise you might be vulnerable to extreme overfitting: your model could be too confident predicting close to zero and one and then your calibration function makes it even more overconfident.
• Calibration functions should be implemented in Tensorflow. This is an extra Google requirement because many of our important models are implemented in Tensorflow. Keeping tooling the same allows easy collaboration and coordination. Also, implementing a calibration function into the same graph as the original model simplifies the training process: you can use stop-gradients to only train the calibration function in stage two. Similarly it simplifies the serving process as it’s only another layer in the original graph.

Calibration Methods

With these requirements in mind, let’s consider some traditional calibration methods.

The first method is Platt’s scaling which uses a logistic regression as the calibration function. This is easy to fit, but it violates our requirement for flexibility. As a parametric function it doesn’t flexibly adapt to more complicated calibration curves.
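As a rough sketch, Platt's scaling can be fit with a few Newton steps on the log loss. Here the input to the logistic regression is the logit of the model's prediction, which is an assumption of this example, and the simulated model is overconfident by a factor of two on the logit scale.

```python
import numpy as np

def fit_platt(score, y, n_iter=25):
    """Fit Pr(Y = 1 | score) = sigmoid(a * score + b) by Newton's method."""
    score = np.asarray(score, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([score, np.ones_like(score)])
    w = np.zeros(2)
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-X @ w))
        grad = X.T @ (p - y)                        # gradient of the log loss
        hess = X.T @ (X * (p * (1 - p))[:, None])   # Hessian of the log loss
        w = w - np.linalg.solve(hess, grad)
    return w  # (a, b)

rng = np.random.default_rng(0)
p_true = rng.uniform(0.01, 0.99, size=20000)
y = rng.binomial(1, p_true)

# Simulated overconfident score: twice the true logit.
score = 2.0 * np.log(p_true / (1 - p_true))

a, b = fit_platt(score, y)
p_calibrated = 1 / (1 + np.exp(-(a * score + b)))
print(a, b)
```

Because the simulated miscalibration is exactly a rescaled logit, two parameters are enough here, and the fitted slope should land near 0.5. A calibration curve that is not sigmoid-shaped cannot be captured this way.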

Isotonic regression solves this problem by switching from logistic regression to fully nonparametric regression. This is almost everything we would want from a calibration method. However, it does have two downsides. First, it’s not strictly monotonic: the model outputs piecewise constant functions, which lead to ties and thus affect AUC. Second, isotonic regression is hard to fit into Tensorflow.
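For intuition, isotonic regression can be fit with the pool-adjacent-violators algorithm. The sketch below is a plain NumPy version on simulated data. The function name is hypothetical, and it assumes there are no tied inputs.

```python
import numpy as np

def isotonic_fit(x, y):
    """Pool-adjacent-violators: least-squares non-decreasing fit of y on x.

    Returns the fitted value for each observation, in the original order.
    Assumes the x values are distinct.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    order = np.argsort(x, kind="stable")
    # Each block stores [sum of y, count]; its fitted value is the mean.
    blocks = []
    for value in y[order]:
        blocks.append([value, 1])
        # Merge while the previous block's mean exceeds the current block's.
        while len(blocks) > 1 and (
            blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]
        ):
            s, n = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
    fitted_sorted = np.concatenate([np.full(n, s / n) for s, n in blocks])
    fitted = np.empty_like(fitted_sorted)
    fitted[order] = fitted_sorted
    return fitted

rng = np.random.default_rng(0)
p_hat = rng.uniform(size=2000)
y = rng.binomial(1, p_hat ** 2)

fitted = isotonic_fit(p_hat, y)
n_levels = len(np.unique(fitted))
print(f"{len(p_hat)} distinct inputs map to {n_levels} distinct outputs")
```

Many distinct predictions collapse onto the same fitted level, which is the source of the ties mentioned above.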

With both Platt’s scaling and isotonic regression failing to satisfy all our requirements, we need another method. Taking a step back, it’s clear we simply need a strictly monotonic regression function that is easy to fit in Tensorflow. This gives us two potential candidates: Tensorflow Lattice and I-Splines (monotonic neural nets are another option, but they have not worked as well in our experience). Tensorflow Lattice is described in the previous blog post so we will focus on I-Splines here.

I-Spline Calibration

Splines are piecewise polynomial functions which, amongst other applications, are used to learn nonparametric regression functions. Splines can be constructed as a linear combination of basis functions: $$\Pr(y | x) = \sum_{i = 1}^R \beta_i \phi_i(x)$$ where the basis functions $\phi_i(x)$ are pre-specified. To fit splines, we learn the weights $\beta_i$.

The most common spline functions are B-Splines. B-splines are popular for computational purposes, since they can be defined recursively as: $$B_{i,1}(x) = \begin{cases} 1 & t_i \leq x < t_{i+1} \\ 0 & \mbox{otherwise} \end{cases} \\ B_{i,k+1}(x) = \frac{x-t_i}{t_{i+k} - t_i} B_{i,k}(x) + \frac{t_{i+1+k}-x}{t_{i+1+k} - t_{i+1}} B_{i+1,k}(x)$$ where $k$ represents the order of the polynomial function, and $T = [t_1, t_2, \ldots]$ is a non-decreasing sequence of knots where the piecewise polynomials connect.
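The recursion translates fairly directly into code. The following is a sketch rather than an optimized implementation. The function name, the clamped knot vector on $[0, 1]$, and the convention of treating terms with a zero-width knot span as zero are all assumptions of this example.

```python
import numpy as np

def bspline_basis(x, knots, order):
    """Evaluate all B-spline basis functions of a given order at points x.

    Returns an array of shape (len(x), len(knots) - order).
    Terms whose knot span has zero width are treated as zero.
    """
    x = np.asarray(x, dtype=float)
    t = np.asarray(knots, dtype=float)
    # Order 1: indicator of the half-open interval [t_i, t_{i+1}).
    B = ((t[:-1] <= x[:, None]) & (x[:, None] < t[1:])).astype(float)
    for k in range(1, order):
        n_new = len(t) - (k + 1)
        new = np.zeros((len(x), n_new))
        for i in range(n_new):
            left_den = t[i + k] - t[i]
            right_den = t[i + k + 1] - t[i + 1]
            if left_den > 0:
                new[:, i] += (x - t[i]) / left_den * B[:, i]
            if right_den > 0:
                new[:, i] += (t[i + k + 1] - x) / right_den * B[:, i + 1]
        B = new
    return B

# Assumed clamped knots on [0, 1]: boundary knots repeated `order` times.
order = 4
knots = np.concatenate([np.zeros(order), [0.25, 0.5, 0.75], np.ones(order)])

# Evaluate strictly inside [0, 1) because the intervals are half-open.
x = np.linspace(0.0, 0.999, 200)
B = bspline_basis(x, knots, order)
print(B.shape, B.sum(axis=1).min(), B.sum(axis=1).max())
```

With this knot choice the basis functions are non-negative and sum to one at every point in $[0, 1)$, which is the normalization that the I-Spline construction relies on.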

B-Splines don't quite work for calibration functions, since they’re not monotonic. Fortunately, you can make splines monotonic with Integrated Splines, or I-Splines. The idea behind I-Splines is that, since properly normalized B-Splines never take negative values, the integral of a B-Spline never decreases. And if we pair non-decreasing functions with positive weights, we achieve monotonicity.

Less well known, though pointed out by Jan de Leeuw in an excellent article, is that I-Splines are also easy to compute. De Leeuw shows (section 4.1.2) that you can compute I-Splines directly from B-Splines: $$I_{j,m}(x) = \sum_{l=j}^R B_{l,m+1}(x)$$ where $R$ is the number of spline functions. This formula shows that we can compute I-Splines using reversed cumulative sums of B-Splines.

Putting it all together:
1. We can evaluate the basis functions using the closed form expressions for I-splines given in this section.
2. We achieve monotonicity by restricting our weights to be positive. In practice, we enforce this by optimizing over $\log(\beta_i)$.
3. We can optimize the weights using a proper scoring rule, such as log loss, or MSE.
All of this can be achieved in Tensorflow. And since I-Splines are flexible non-parametric functions, I-Spline calibration meets all of our requirements for a calibration function.
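The three steps above can be sketched end to end. This example uses plain NumPy gradient descent on MSE as a stand-in for a Tensorflow optimizer. The knot placement, learning rate, iteration count, and simulated data are all assumptions, and the helper names are hypothetical.

```python
import numpy as np

def bspline_basis(x, knots, order):
    """All B-spline basis functions of a given order (zero-width spans -> 0)."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(knots, dtype=float)
    B = ((t[:-1] <= x[:, None]) & (x[:, None] < t[1:])).astype(float)
    for k in range(1, order):
        new = np.zeros((len(x), len(t) - (k + 1)))
        for i in range(new.shape[1]):
            left_den = t[i + k] - t[i]
            right_den = t[i + k + 1] - t[i + 1]
            if left_den > 0:
                new[:, i] += (x - t[i]) / left_den * B[:, i]
            if right_den > 0:
                new[:, i] += (t[i + k + 1] - x) / right_den * B[:, i + 1]
        B = new
    return B

def ispline_basis(x, knots, order):
    """Step 1: I-splines of order `order` as reversed cumulative sums of
    B-splines of order `order + 1`."""
    B = bspline_basis(x, knots, order + 1)
    return np.cumsum(B[:, ::-1], axis=1)[:, ::-1]

# Assumed setup: clamped knots on [0, 1] with three interior knots.
order = 3
knots = np.concatenate([np.zeros(order + 1), [0.25, 0.5, 0.75], np.ones(order + 1)])

# Simulated miscalibrated model: the true rate is p_hat squared.
rng = np.random.default_rng(0)
p_hat = rng.uniform(0.0, 0.999, size=5000)
y = rng.binomial(1, p_hat ** 2).astype(float)

Phi = ispline_basis(p_hat, knots, order)  # first column is constant (intercept)

# Step 2: positive weights via beta = exp(theta), optimizing over theta.
theta = np.full(Phi.shape[1], np.log(1.0 / Phi.shape[1]))

def mse(theta):
    return np.mean((Phi @ np.exp(theta) - y) ** 2)

# Step 3: minimize a proper scoring rule (here MSE) by gradient descent.
initial_loss = mse(theta)
for _ in range(2000):
    beta = np.exp(theta)
    resid = Phi @ beta - y
    grad = 2.0 * (Phi.T @ resid) / len(y) * beta  # chain rule through exp
    theta -= 0.1 * grad
final_loss = mse(theta)

grid = np.linspace(0.0, 0.999, 200)
curve = ispline_basis(grid, knots, order) @ np.exp(theta)
print(initial_loss, final_loss)
```

Because every basis function is non-decreasing and every weight is positive, the learned `curve` is monotone by construction, however far the optimizer has or has not converged.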

Now that we understand how calibration functions work, and a few of our options, let’s look at a practical example.

Case Study: Notifications Opt Out Model

At Google, one way we interact with users is through notifications. We want to send users notifications they find valuable, but we don’t want to annoy users with too many notifications.

To determine which notifications to send, we use models to predict whether or not a user will dismiss, opt out, or otherwise react to a notification in a negative way. Then, we use the predictions to decide whether or not to send the notification by ensuring that the probability of negative reactions is minimal. This is sometimes done using a combination of ML model outputs, with coefficients that have been tuned through offline simulations.

One model that’s particularly challenging is the $\Pr(\mbox{Opt Out})$ model. The $\Pr(\mbox{Opt Out})$ model predicts whether a user will opt out of receiving notifications if a particular notification is sent. Since opt outs can be a clear negative signal from a user, large $\Pr(\mbox{Opt Out})$ predictions prevent us from sending annoying notifications.

A difficulty with predicting opt outs is that they’re rare, so the model suffers from class imbalance issues. To improve model performance in class imbalanced problems, a standard trick is re-weighting the data to emphasize examples from the minority class. But re-weighting based on the class changes the distribution of the predicted probabilities, which leads to miscalibration. In this case, calibration functions automatically compensate for scaling issues. An added bonus is that calibration functions work for any re-weighting method, so engineers can quickly iterate on new methods.
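To see why re-weighting shifts predicted probabilities, consider the simplest idealized case: every positive example gets a single constant weight $w$ and the model fits the weighted data perfectly. Under those assumptions, which are this example's and not a description of the production model, the odds are multiplied by $w$, and dividing the odds by $w$ undoes the shift.

```python
import numpy as np

def undo_positive_weight(p_weighted, w):
    """Map a probability learned with positives up-weighted by w back to
    the unweighted scale by dividing the odds by w."""
    p_weighted = np.asarray(p_weighted, dtype=float)
    return p_weighted / (p_weighted + w * (1 - p_weighted))

p = 0.01   # illustrative true opt-out rate
w = 20.0   # illustrative weight on positive examples

# Rate that a perfectly fit model would report on the weighted data.
p_weighted = w * p / (w * p + (1 - p))
p_recovered = undo_positive_weight(p_weighted, w)
print(p_weighted, p_recovered)
```

In practice the weights may depend on features and the model will not fit the weighted distribution exactly, so an analytic inversion like this one can still leave predictions miscalibrated. A learned calibration function does not rely on those assumptions.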

Let’s start by looking at the reliability diagram for various calibration methods. The calibration methods we’ll try here are Platt Scaling, Isotonic Regression, and I-Splines. Also, since this model iteration re-weights the Opt Outs based on a set of features, our baseline method is to invert the weights computed from the features, which we include here as well.

This plot shows the flaw in Platt Scaling. In this case, the empirical calibration curve was more complex than a sigmoid. Platt Scaling wasn’t flexible enough to adapt and converged to essentially a constant function. We also observe that simply inverting the weights leads to miscalibrated predictions: the smaller predictions under-predict and the larger predictions over-predict. Fortunately, both I-Splines and Isotonic Regression are flexible enough to learn the more complex calibration curve, which leads to calibrated predictions.

Let’s take a closer look at the calibration functions, which reveals more about the differences between Isotonic Regression and I-Splines:

You can see that I-Splines and Isotonic Regression learn essentially the same calibration function. The main difference is that I-Splines are smooth, and Isotonic Regression is piecewise constant.

Finally, we mentioned earlier that when calibration functions are strictly monotonic, applying them leaves AUC unchanged. We can confirm this observation from the following table, which shows that the AUCs for Platt Scaling and I-Splines are identical up to 8 decimal places. In this case, the AUC for Isotonic Regression isn’t quite the same due to ties:

In sum, this case study shows how you can use a calibration function to calibrate a tricky real-world problem. It also shows how flexible calibration functions are, since they work for arbitrary re-weighting schemes, extreme scales, and complicated shapes.

We hope this example has convinced you that when your model gets too complicated, and you want to iterate on methods quickly, it’s ok to stop worrying and use a calibration function.

Conclusion

Calibration is an intuitively appealing property, and also has many practical benefits in complex ML systems and real-world environments. Despite its importance, there are many ways in which a model can end up not being calibrated. Instead of focusing on fixing the model, we can treat the model as a black box and achieve calibration with a calibration function. We can use any 1D monotonic regression function, but two that we’ve found to work well are I-Splines and piecewise linear functions from TF-Lattice.