### Adding Common Sense to Machine Learning with TensorFlow Lattice

by TAMAN NARAYAN & SEN ZHAO


One may try to apply regularization to smooth the curve, but the questionable performance on far away coffee shops still persists.

With TF Lattice, we can learn GAMs in which the feature transformations are guaranteed to behave sensibly in ways that align with domain knowledge and common sense. The most prominent of these is monotonic GAMs, in which we constrain the feature transformations to be monotonically increasing or decreasing without otherwise limiting their flexibility.

The figure below shows the learned trend of a monotonic GAM. This model fits well in regions with sufficient training data. In regions with less training data, the model still performs sensibly, which ensures robust online performance.

The secret to achieving this guarantee comes in how we pose the problem. TF Lattice uses piecewise-linear functions (PLFs) in its feature transformations. PLFs have two useful properties that we take advantage of. The first is that they are straightforward to optimize using traditional gradient-based optimizers as long as we pre-specify the placement of the knots. If we parameterize the PLF by the values $\theta$ it takes at each of its knots $\kappa$, then$$

c(x) = \sum_{j=1}^K \left( (1-\alpha)\theta_j + \alpha\theta_{j+1} \right) \left[ x \in (\kappa_j, \kappa_{j+1}] \right], \quad \alpha = \frac{x-\kappa_j}{\kappa_{j+1} - \kappa_j}

$$where square brackets represent the truth value 0 or 1 of the expression contained, and $\alpha$ here is the interpolation weight within a segment (not the intercept of the GAM), so that $c$ takes the value $\theta_{j+1}$ at knot $\kappa_{j+1}$.
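To make the parameterization concrete, here is a minimal sketch of evaluating a PLF from its knot positions and knot values by linear interpolation between neighboring knots. The function name `plf` and the choice to clamp inputs outside the knot range are assumptions of this sketch, not the TF Lattice implementation.

```python
from bisect import bisect_left


def plf(x, knots, theta):
    """Evaluate a piecewise-linear function at x.

    knots: strictly increasing knot positions (kappa).
    theta: value of the function at each knot, same length as knots.
    Inputs outside the knot range are clamped to the end values
    (an assumption of this sketch).
    """
    if x <= knots[0]:
        return theta[0]
    if x >= knots[-1]:
        return theta[-1]
    # Index j of the segment (knots[j], knots[j + 1]] that contains x.
    j = bisect_left(knots, x) - 1
    weight = (x - knots[j]) / (knots[j + 1] - knots[j])
    return (1.0 - weight) * theta[j] + weight * theta[j + 1]


# Hypothetical distance transformation with knots at 0, 5, 20 and 30 km.
knots = [0.0, 5.0, 20.0, 30.0]
theta = [1.0, 0.8, 0.2, 0.1]
score_at_12_5_km = plf(12.5, knots, theta)
```

Because the output is linear in `theta` for any fixed `x`, gradients with respect to the knot values are simple, which is what makes gradient-based training straightforward once the knots are fixed.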

If we observe a label vector $y$ and feature vectors $x_1, \ldots, x_d$, we can write the differentiable empirical risk minimization problem with a squared loss as$$

\min_\theta \left\| y - \sum_{j=1}^d c_j(x_j) \right\|^2

$$Note that we use squared loss for the simplicity of presentation; one can use any differentiable loss in their application.

Secondly, PLFs have the property that many types of constraints can be written as simple linear inequalities on their parameters. For example, a monotonicity constraint can be succinctly expressed by the constraint set $\{ \theta_i < \theta_{i+1} \}$ while diminishing returns are captured by $\{ \theta_{i+1} - \theta_i > \theta_{i+2} - \theta_{i+1} \}$. There is no convenient way to do this with other popular function classes such as polynomials, particularly in a way that does not overconstrain and rule out viable monotonic functions. For categorical features, unique categories are assigned their own knots and then we learn a “PLF” (really, a 1-d embedding since every value is exactly at a knot). This formulation allows the user to specify any monotonic orderings they want between the categories, or even a total ordering (as with, say, a Likert-scale).

There is a robust set of tools for working with these kinds of constrained optimization problems. TF Lattice uses a projected stochastic gradient descent (SGD) algorithm in which each gradient step on the objective function is followed by a projection onto the constraints. The projection is performed with Dykstra’s algorithm, which involves sequential exact projections on subsets of the constraints. This is where the linearity of the constraints is important, as these exact projections are straightforward.
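The projected-gradient idea can be sketched in a few lines. The example below is a simplified illustration, not the TF Lattice implementation: it assumes one noisy target per knot, so the squared loss is just the sum of $(\theta_j - t_j)^2$, and it uses plain full-batch gradient steps rather than SGD. The function names are hypothetical. The projection uses Dykstra's algorithm over the constraints $\theta_i \le \theta_{i+1}$, where each exact projection simply averages a violating pair.

```python
def project_monotonic(theta, num_iters=200):
    """Project theta onto {theta[i] <= theta[i+1]} with Dykstra's algorithm."""
    x = list(theta)
    n = len(x)
    # One correction term per constraint; it only touches coordinates i, i+1.
    corrections = [(0.0, 0.0)] * (n - 1)
    for _ in range(num_iters):
        for i in range(n - 1):
            # Add back the correction from the previous visit to constraint i.
            a = x[i] + corrections[i][0]
            b = x[i + 1] + corrections[i][1]
            # Exact projection onto the half-space a <= b.
            if a > b:
                pa = pb = (a + b) / 2.0
            else:
                pa, pb = a, b
            corrections[i] = (a - pa, b - pb)
            x[i], x[i + 1] = pa, pb
    return x


def fit_monotonic_knot_values(targets, learning_rate=0.1, num_steps=100):
    """Minimize sum((theta - targets)^2) subject to theta being increasing."""
    theta = [0.0] * len(targets)
    for _ in range(num_steps):
        # Gradient step on the squared loss.
        theta = [t - learning_rate * 2.0 * (t - y) for t, y in zip(theta, targets)]
        # Projection step back onto the monotonicity constraints.
        theta = project_monotonic(theta)
    return theta


fitted = fit_monotonic_knot_values([1.0, 0.2, 0.6, 0.9])
```

In this toy setting the noisy, non-monotonic targets are pulled to the closest increasing sequence. The same two-step loop applies when the loss is computed through the PLF on real examples.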

It is true that PLFs are sometimes viewed as rigid or choppy, with the additional undesirable property of having to preselect good knots. In our case, it turns out that the monotonicity regularizer allows us to increase the number of knots without incurring much risk of overfitting — there are fewer ways for the model to go wrong. More knots make the learned feature transformation smoother and more capable of approximating any monotonic function. As a result, selecting knots according to the quantiles of the input data (or even linearly across the domain), and then steadily increasing their number as long as the metrics improve works well in practice.
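As a rough illustration of quantile-based knot placement, the helper below picks knots at evenly spaced quantiles of the observed feature values. The name `quantile_knots`, the linear interpolation between order statistics, and the de-duplication step are all choices made for this sketch. It assumes at least two knots are requested.

```python
def quantile_knots(values, num_knots):
    """Place num_knots (>= 2) knots at evenly spaced quantiles of values."""
    ordered = sorted(values)
    last = len(ordered) - 1
    knots = []
    for k in range(num_knots):
        # Fractional rank of the k-th quantile, interpolated linearly.
        rank = last * k / (num_knots - 1)
        lower = int(rank)
        upper = min(lower + 1, last)
        frac = rank - lower
        knots.append(ordered[lower] * (1 - frac) + ordered[upper] * frac)
    # Drop duplicates so the knots stay strictly increasing.
    return sorted(set(knots))


# Hypothetical distances in km; most shops are close by, a few are far away.
distances = [0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0, 9.0, 15.0, 30.0]
knots = quantile_knots(distances, 4)
```

Quantile placement puts more knots where the data is dense. Increasing `num_knots` step by step while watching validation metrics follows the recipe described above.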

With PLFs, we can even create unique and interesting types of regularization. We can make a learned transformation flatter, making a feature less influential in the final model output, by adding a regularizer on the magnitude of the first-order differences $ \theta_{i+1} - \theta_i $. We can make the transformation more linear, by regularizing the magnitude of the second-order differences $ (\theta_{i+2} - \theta_{i+1}) - (\theta_{i+1} - \theta_i)$. We can even make the learned function smoother, by regularizing the magnitude of the third-order differences! (Can you work out how to express that in terms of $\theta$?)
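These regularizers all penalize finite differences of the knot values, so one small helper covers the three cases. The sketch below uses a squared (L2) penalty and hypothetical function names; it is meant to show the arithmetic, not the library's regularizer API.

```python
def differences(theta, order):
    """Return the order-th finite differences of the knot values theta."""
    diffs = list(theta)
    for _ in range(order):
        diffs = [b - a for a, b in zip(diffs, diffs[1:])]
    return diffs


def difference_penalty(theta, order):
    """Sum of squared order-th differences of theta."""
    return sum(d * d for d in differences(theta, order))


linear = [0.0, 1.0, 2.0, 3.0, 4.0]
quadratic = [0.0, 1.0, 4.0, 9.0, 16.0]

flatness = difference_penalty(linear, 1)       # non-zero: the curve is not flat
linearity = difference_penalty(linear, 2)      # zero: the curve is linear
smoothness = difference_penalty(quadratic, 3)  # zero: constant curvature
```

Applying `differences` three times to four consecutive values gives $\theta_{i+3} - 3\theta_{i+2} + 3\theta_{i+1} - \theta_i$, which is the third-order difference.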

The drawback of GAMs is that they do not allow feature interactions. In the next section, we describe lattice models, which allow feature interactions that are guaranteed to align with common sense.

A lattice model passes the transformed features through a lattice function $l$ rather than summing them:$$

\hat{y} = \alpha + l(c_1(x_1), c_2(x_2), \cdots, c_k(x_k))

$$The following image shows the response surface of an example lattice function $l$ with two inputs, RATER CONFIDENCE and RATING. This lattice function is parametrized by four parameters $\theta_1, \cdots, \theta_4$, which define the lattice function output at extreme input values; at other input values, we linearly interpolate from the vertices. As can be seen from the image, the lattice response surface is not a plane, which indicates that unlike GAMs the lattice function can model interactions between features.

We can therefore write a very similar empirical risk minimization problem as in the case of 1-dimensional PLFs; indeed, an astute reader will note that lattices are in some sense a multi-dimensional extension of 1-d PLFs. For simplicity, we stick to the two-feature model described above, where $x_1$ is the rating feature and $x_2$ is the rater confidence feature and we have two knots per dimension set at their min and max values. Assuming that both features are scaled to lie in $[0,1]$, this setup gives rise to the lattice function$$

l(x_1, x_2) = (1-x_1)(1-x_2)\theta_1 + (1-x_1)x_2\theta_2 + x_1(1-x_2)\theta_3 + x_1 x_2 \theta_4

$$We similarly write out the lattice function for more features. The empirical risk minimization problem with a squared loss hence becomes$$

\min_\theta \| y - l(x_1, x_2, \cdots, x_d) \|^2

$$Note that as we have mentioned in the previous section, one can use any differentiable loss in their application.
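The two-feature lattice function above is ordinary bilinear interpolation between four corner values, which the following sketch spells out. The function name and the example parameter values are hypothetical.

```python
def lattice_2d(x1, x2, theta):
    """Bilinear interpolation on the unit square.

    theta = (theta1, theta2, theta3, theta4) holds the outputs at the corners
    (x1, x2) = (0, 0), (0, 1), (1, 0), (1, 1), in that order.
    """
    t1, t2, t3, t4 = theta
    return ((1 - x1) * (1 - x2) * t1 + (1 - x1) * x2 * t2
            + x1 * (1 - x2) * t3 + x1 * x2 * t4)


# Hypothetical corner values: x1 is RATING, x2 is RATER CONFIDENCE.
theta = (0.4, 0.1, 0.6, 1.0)
center = lattice_2d(0.5, 0.5, theta)  # average of the four corner values
```

The `x1 * x2` term is what lets the surface bend: the effect of the rating depends on the rater confidence, which an additive model cannot express.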

As with PLFs, the parameterization of lattices allows us to easily constrain their shape to make them align with common sense. For example, to make the RATING feature monotonic, we just need to constrain the model parameters such that$$

\theta_4 > \theta_2, \theta_3 > \theta_1

$$We should pause here to note just how powerful these two constraints are. They guarantee that no matter what value RATER CONFIDENCE takes, we remain monotonic as we move along the RATING dimension. This is a substantially more complex task than maintaining monotonicity in the absence of feature interactions, but one that is made possible thanks to our parameterization.
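To see why two inequalities suffice, differentiate the two-feature lattice function with respect to the RATING input $x_1$:

$$
\frac{\partial l}{\partial x_1} = (1-x_2)(\theta_3-\theta_1) + x_2(\theta_4-\theta_2)
$$

For any $x_2 \in [0,1]$ this is a weighted average of $\theta_3-\theta_1$ and $\theta_4-\theta_2$ with non-negative weights, so it is positive whenever both differences are positive.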

In addition to the monotonicity constraints, lattice models also allow us to constrain how two features interact. Consider the intuition described above: We want good ratings to “help” more (and bad ratings to “hurt” more) when backed by high rater confidence. We don’t want our model, in contrast, to reward or penalize ratings as strongly when our confidence in them is low. We call this idea trust constraints: we trust RATING more if RATER CONFIDENCE is higher. Mathematically, it implies that$$

\theta_4 > \theta_3 > \theta_1 > \theta_2

$$There are various other pairwise constraints, from Edgeworth complementarity to feature dominance to joint unimodality, that we have explored and that can be imposed on lattice models — there simply is not enough room in this post to describe them all. As with monotonicity, all these cases are implemented in the TF Lattice package in such a way that the pairwise interaction is constrained for all possible values of the other features. And as in the case of PLFs, we can impose custom forms of regularization on the model parameters that can make the learned lattice function less reliant on specific features, more linear, or more flat overall.
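A quick numerical check can make the trust constraint on the two-feature lattice tangible. In the sketch below, the helper names and the corner values are hypothetical. The slope along RATING is computed at several levels of RATER CONFIDENCE, and under the ordering $\theta_4 > \theta_3 > \theta_1 > \theta_2$ it grows as confidence increases.

```python
def rating_slope(x2, theta):
    """Slope of the 2x2 lattice along RATING (x1) at rater confidence x2."""
    t1, t2, t3, t4 = theta
    return (1 - x2) * (t3 - t1) + x2 * (t4 - t2)


def satisfies_trust_ordering(theta):
    """Check theta4 > theta3 > theta1 > theta2."""
    t1, t2, t3, t4 = theta
    return t4 > t3 > t1 > t2


theta = (0.4, 0.1, 0.6, 1.0)  # hypothetical corner values
slopes = [rating_slope(c / 4, theta) for c in range(5)]
```

Here the slope rises from `theta[2] - theta[0]` at zero confidence to `theta[3] - theta[1]` at full confidence, so a one-step change in rating moves the score more when the rating is well supported.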

The smoothness of the lattice surface may make it seem inflexible — it is not. As with piecewise linear feature transformations, we can add additional knots to increase expressibility. Indeed, lattices are as expressive as deep neural networks (both are universal function approximators). We can also compose PLFs with lattices by first feeding features through one-dimensional feature transformations. In practice, even a low-dimensional lattice model combined with learned feature transformations can achieve high performance. The figure below shows the output surface of a real feature-transformed lattice model used within Google. One of the features is constrained to be monotonic (can you identify it?), and the other unconstrained. The model is smooth, yet flexible. It also aligns with our common sense and hence is more robust.

At the same time, the richness of the lattice’s expressibility means that we have an exponential number of parameters relative to the number of features. In practice, we adjust for that by fitting ensembles of smaller lattices when working with higher numbers of features. TF Lattice has various mechanisms to help with that process, such as intelligently grouping features with high interactions into a single lattice.
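A bit of arithmetic shows how quickly the parameter count grows and how much an ensemble saves. The helper names and the example sizes (20 features, 10 lattices of 4 features each) are illustrative assumptions.

```python
def lattice_params(num_features, knots_per_feature=2):
    """One parameter per lattice vertex: knots_per_feature ** num_features."""
    return knots_per_feature ** num_features


def ensemble_params(num_lattices, features_per_lattice, knots_per_feature=2):
    """Total vertices across an ensemble of equally sized small lattices."""
    return num_lattices * lattice_params(features_per_lattice, knots_per_feature)


single = lattice_params(20)        # one lattice over all 20 features
ensemble = ensemble_params(10, 4)  # ten lattices of four features each
```

The single lattice needs $2^{20}$ parameters while the ensemble needs $10 \cdot 2^4$. The price is that only features sharing a lattice can interact directly.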

In this section, we extend the ideas of building monotonic GAMs and lattice models to construct monotonic deep learning models. Mathematically, the key idea is that PLFs and lattices can be viewed as generic transformations capable of propagating monotonicity inside deeper modeling structures. A monotonic deep learning model is well-regularized, easy to understand, and aligns with common sense. It takes advantage of your domain knowledge to keep the model on track, while maintaining the advantages of modern deep learning, namely, scalability and performance.

Deep learning models are usually composed of several functions, stacked together. For example, a two layer fully-connected deep learning model is of the form$$

\hat{y} = f(g(h(x)))

$$where $f$ and $h$ are fully connected layers, and $g$ is the activation function. Other deep learning models can also be written in this form.

A sufficient (but not necessary) condition for a deep learning model to be monotonic is that all of its layers are monotonic. One can achieve this by, say, using monotonic activations and constraining all the coefficients in all fully-connected layers to be non-negative. The catch is that even if we only want to constrain a subset of features to be monotonic, a monotonic deep learning model built from fully-connected layers still needs to constrain all hidden layer weights to be non-negative. Such deep learning models are therefore inflexible, losing much of the benefit of using a deep learning model.
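The sufficient condition can be illustrated with a tiny two-layer network of the form $f(g(h(x)))$ in plain Python. The weights below are hypothetical. All of them are non-negative and the activation (`tanh`) is increasing, so the output cannot decrease when any input increases.

```python
import math


def dense(x, weights, bias):
    """Fully connected layer: one output per row of weights."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def monotonic_mlp(x, w1, b1, w2, b2):
    """Two dense layers with a tanh activation in between."""
    hidden = [math.tanh(h) for h in dense(x, w1, b1)]
    return dense(hidden, w2, b2)[0]


# Hypothetical non-negative weights; biases may take any sign.
w1 = [[0.5, 1.0], [0.2, 0.0]]
b1 = [-0.3, 0.1]
w2 = [[1.0, 2.0]]
b2 = [0.0]

low = monotonic_mlp([0.0, 0.0], w1, b1, w2, b2)
high = monotonic_mlp([1.0, 1.0], w1, b1, w2, b2)
```

The restriction is visible in `w1` and `w2`: no hidden unit can respond negatively to any input, which is the loss of flexibility described above.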

To address this issue, we propose using monotonic piecewise linear and lattice functions as core components to build monotonic deep learning models. (We will also use “monotonic linear layers”, or linear layers with some coefficients constrained to be non-negative, as a supporting tool.) As we’ve said, both monotonic piecewise linear and lattice functions with enough knots (also called keypoints) are universal approximators of monotonic functions of their respective dimensions, and hence composing them together will produce highly flexible monotonic deep learning models. For lattice functions, we can constrain a subset of their inputs to be monotonic, with the others left unconstrained; we call such a lattice a “partially monotonic lattice”.

Piecewise-linear and lattice layers can coexist with traditional deep learning structures, as we will show. While lattices are universal function approximators, they are not always the best modeling choice to use with arbitrary data. As mentioned above, lattice layers have an exponential number of parameters in their input dimension, meaning that we will generally use ensembles of lattices in applications. But these can be cumbersome in the presence of high-dimensional sparse features or images for which effective modeling structures already exist and where granular element-wise shape constraints are not appropriate. The key is thinking through composability. When monotonicity is required, we can enforce and propagate it through a model with our layers. When it isn’t, piecewise-linear and lattice layers may still be effective tools, but other layers and approaches can be used as well.

Viewed through the lens of composability, we can express the GAMs and Lattice models described in the earlier sections as special cases of monotonic modeling. GAMs feed element-wise piecewise-linear layers into a monotonic linear layer, while Lattice models combine piecewise-linear layers with lattice layers. Lattice ensembles, meanwhile, start with piecewise-linear layers and end with lattice layers, but have an intermediate layer to assign inputs to their respective lattices (or optionally, do a “soft” assignment with learned monotonic linear layers).

In the following sections, we describe several deep learning applications that can be improved by using lattice layers. The main benefit is the ability to employ monotonic and partially-monotonic structures. We also present some examples that take advantage of multi-dimensional shape constraints. We recommend caution when using multi-dimensional constraints in arbitrary deep models, as it can be difficult to think through how such constraints compose end-to-end — on the other hand, monotonicity is much more straightforward. How best to use lattices in deep learning applications is still an open question and an active area of research. The following examples are by no means the only way to utilize TF Lattice layers.

These sparse features are often handled through high-dimensional embeddings and specialized processing blocks such as fully connected layers, residual networks, or transformers. These offer a huge amount of flexibility and processing power, but also make it easy for dense features to get “lost” in the mix, learning strange correlations with rare sparse features. Recall that lattice layers can be set to make only certain inputs monotonic. These offer a way to keep the dense features monotonic, no matter what values the sparse features take, without having to constrain the relationships learned within the embedding block.

In addition, the embeddings themselves can be prone to excessive memorization as they are not robust to distribution changes in input features. This can make the resulting models hard to understand and, when they make unreasonable predictions, hard to debug. As a fix, we can impose monotonicity on the final outputs of an embedding block as an additional regularization technique: if the resulting outputs of the embedding block are unusually high, we can point to the sparse features as the “reason” for our model output being high.

The following figure shows an example model structure that we may use for the movie box-office performance problem. It is an example of a model with sparse features, non-monotonic dense features, and monotonic dense features.

We use an embedding structure to embed the names of the director and leading actors into a small number of real-valued outputs. We also use non-monotonic structures (e.g., fully connected layers) to fuse non-monotonic features (such as length of the movie, season of the premiere) into a few outputs. For those monotonic features (such as the budget of the movie), we fuse them with non-monotonic features using a lattice structure. Finally, the outputs from embedding, non-monotonic and monotonic blocks are monotonically fused in the final monotonic block.

Note that there is flexibility in how you create those monotonic blocks and structures. Our monotonic linear, piecewise-linear, lattice, and lattice-ensemble layers can be combined with standard monotonic functions like sigmoids in as simple or complex of a manner as you might want. We often build around the lattice-ensemble layer, which can flexibly accept large numbers of inputs. These inputs can be processed with individual PLFs or even just a monotonic linear layer and sigmoid (recall that lattice inputs must be bounded). These blocks can then be stacked. Alternatively, there might be times when a monotonic linear layer, despite its lack of expressiveness, is enough for one part of your model.

The green paths in the diagram denote monotonic functions, and as a result, the output (yellow circle) is monotonic with respect to those monotonic dense features (green circle), but is not monotonic with respect to other features. By constraining some of the inputs to be monotonic, such a model balances flexibility with regularization. Furthermore, this model is easy to debug. For example, when the prediction of an example is unexpectedly large we can trace back the prediction through visualization of the final block and identify its problematic, unexpectedly large input. This can help us pinpoint the issue in the model or feature.

A similar idea can be used when combining dense features with other types of input data, such as images or videos. Existing convolutional or residual structures can be used to process such data into a manageable number of real outputs, while lattice layers can be used towards the end of the model to combine them with dense features in an interpretable way.

Incorporating lattice layers into deep learning models offers a way to build models like this while embedding in them relevant domain knowledge. For example, dominance constraints (which we have not discussed in detail; see Multidimensional Shape Constraints, Gupta et al., 2020) can be used to ensure that no matter where we are in the feature space, changing the value of a dominant feature will influence the model more than changing the value of a non-dominant feature. Using these, we can require more recent data to be more influential in our forecast, matching the behavior of common univariate techniques such as exponential smoothing.

We depict below a simple model that takes in historical revenue data alongside a sparse embedding feature to predict future revenue. Eschewing potential complications such as seasonality or bounce-back effects, we assume that each historical data point is monotonic with respect to our forecast and that recent data should be more important to our model relative to older data. Our resulting model may be easier to explain, and may generalize better, than one that simply throws all of the inputs into a neural network.

A simple approach might be to take a weighted mean of some sort. We could average reviewer scores while giving greater weight to experienced reviewers (who may be more trustworthy) and adjust for reviewers who are prone to giving high or low scores across the board. We could tweak these weights and adjustments to better correlate with some “golden” quality label.

But what if we want a machine learning model that directly takes in these variable length features? What if we still wish to maintain key properties of a weighted mean, such as monotonicity in individual review scores, and to accord greater trust to the scores of experienced reviewers?

As introduced in Zaheer et al. (2017), we could use the following model to learn a “set function”, i.e., a function that acts on an unordered set of fixed-dimensional inputs in a permutation invariant way$$

\hat{y} = \rho\left( \sum_{m=1}^M \phi(x_m) \right)

$$where $x_m$ are the features for the $m$-th review, while $\phi$ and $\rho$ are generic deep learning models ($\phi$ may have multiple outputs). We extend this model with the following 6-layer structure with monotonicity constraints (Cotter et al., 2019). Absent monotonicity, this structure is proven to be flexible enough to learn any function on the variable length features. Here, our structure additionally constrains the user rating feature to be monotonic: if one user increases their rating of the restaurant, our predicted restaurant quality score can only increase. We can optionally add trust constraints to make sure that user ratings from experienced reviewers are more influential in the final model output. For more details, see Shape Constraints for Set Functions (Cotter et al., 2019).
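The set-function form $\rho(\sum_m \phi(x_m))$ is easiest to see on a hand-written special case. The sketch below is not the 6-layer structure from the paper. It expresses an experience-weighted mean of ratings in that form, with a hypothetical `phi` and `rho` and made-up review data, and it assumes a non-empty list of reviews.

```python
def phi(review):
    """Per-review transform: (weighted rating, weight)."""
    rating, experience = review
    # Hypothetical trust weight that grows with reviewer experience.
    weight = 1.0 + experience
    return (weight * rating, weight)


def rho(pooled):
    """Turn the pooled sums into a final score."""
    total_weighted_rating, total_weight = pooled
    return total_weighted_rating / total_weight


def set_score(reviews):
    """rho(sum of phi over the reviews); the order of reviews is irrelevant."""
    pooled = [sum(values) for values in zip(*(phi(r) for r in reviews))]
    return rho(pooled)


# Each review is (rating, reviewer experience); values are made up.
reviews = [(4.0, 0.0), (2.0, 3.0), (5.0, 1.0)]
score = set_score(reviews)
```

Because the reviews are only combined through a sum, the score does not depend on their order or number, and raising any single rating can only raise the score. A learned model keeps the same outer shape but replaces `phi` and `rho` with constrained trainable layers.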

This blog post is only a primer of the problems we encountered and the solutions at which we have arrived. TF Lattice includes many more constraints, regularizers, and techniques to solve other problems that a practitioner may face, and we continue to improve it. We encourage you to take a look at our tutorials and colabs on TF Lattice Keras layers, Keras premade models, canned estimators, custom estimators, shape constraints, ethical constraints and set functions. We hope TF Lattice will prove useful to you in your real-world applications.

Wightman, L. (1998). LSAC National Longitudinal Bar Passage Study. Law School Admission Council.

Garcia, E. and Gupta, M. (2009). Lattice Regression. In Advances in Neural Information Processing Systems 22, pages 594-602.

Gupta, M., Cotter, A., Pfeifer, J., Voevodski, K., Canini, K., Mangylov, A., Moczydlowski, W. and van Esbroeck, A. (2016). Monotonic Calibrated Interpolated Look-Up Tables. Journal of Machine Learning Research, 17: 1-47.

Milani Fard, M., Canini, K., Cotter, A., Pfeifer, J. and Gupta, M. (2016). Fast and Flexible Monotonic Functions with Ensembles of Lattices. In Advances in Neural Information Processing Systems 29, pages 2919-2927.

You, S., Ding, D., Canini, K., Pfeifer, J. and Gupta, M. (2017). Deep Lattice Networks and Partial Monotonic Functions. In Advances in Neural Information Processing Systems 30, pages 2981-2989.

Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R. R. and Smola, A. J. (2017). Deep Sets. In Advances in Neural Information Processing Systems 30, pages 3391-3401.

Cotter, A., Gupta, M., Jiang, H., Louidor, E., Muller, J., Narayan, T., Wang, S. and Zhu, T. (2019). Shape Constraints for Set Functions. In Proceedings of the 36th International Conference on Machine Learning, pages 1388-1396.

Gupta, M. R., Louidor, E., Mangylov, O., Morioka, N., Narayan, T. and Zhao, S. (2020). Multidimensional Shape Constraints. In Proceedings of the 37th International Conference on Machine Learning, to appear.

Wang, S. and Gupta, M. (2020). Deontological Ethics by Monotonicity Shape Constraints. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics, pages 2043-2054.

*A data scientist is often in possession of domain knowledge which she cannot easily apply to the structure of the model. On the one hand, basic statistical models (e.g. linear regression, trees) can be too rigid in their functional forms. On the other hand, sophisticated machine learning models are flexible in their form but not easy to control. This blog post motivates this problem more fully, and discusses monotonic splines and lattices as a solution.*

*While the discussion is about methods and applications, the blog also contains pointers to research papers and to the TensorFlow Lattice package that provides an implementation of these solutions.*

*Authors of this post are part of the team at Google that builds TensorFlow Lattice.*

## Introduction

Machine learning models often behave unpredictably, as data scientists would be the first to tell you. Consider the following simple example: fitting a two-dimensional function to predict whether someone will pass the bar exam based just on their GPA (grades) and LSAT (a standardized test) score, using the public dataset (Wightman, 1998). Here is what happened after training a neural network and gradient boosted trees model (Wang and Gupta, 2020):

Both models end up penalizing higher grades and higher LSAT in some parts of the input space! That seems wrong, and a quick look at the data distribution explains why we are seeing these erratic results — there just aren’t many data samples outside of the high-LSAT high-GPA quadrant.

There are a number of reasons why a model like this can, and does, get us in trouble:


- **Training-serving skew**: The offline numbers may look great, but what if your model will be evaluated on a different or broader set of examples than those found in the training set? This phenomenon, more generally referred to as “dataset shift” or “distribution shift”, happens all the time in real-world situations. Models are trained on a curated set of examples, or clicks on top-ranked recommendations, or a specific geographical region, and then applied to every user or use case. Curiosities and anomalies in your training and testing data become genuine and sustained loss patterns.
- **Bad individual errors**: Models are often judged by their worst behavior: a single egregious outcome can damage the faith that important stakeholders have in the model and even cause serious reputational harm to your business or institution. Such errors can also defy explanation in that there may be no feature value to blame, and therefore no obvious way to fix the problem.
- **Violating policy goals**: It is often important for deployed models to uphold certain policy goals in addition to overall performance, such as fairness, ethics, and safety. Policy may require that certain inputs may only ever positively influence the output score. Any model that is unable to make such a guarantee may be rejected for such policy reasons, regardless of its overall accuracy.

We were unsatisfied with this state of affairs and wanted to build on the following insight:

For instance, the problem with the law school example above can be reduced to the requirement that the model output be monotonic with respect to both features. Monotonicity is a very powerful idea and can apply to a huge number of ML contexts and all types of features, be they boolean, categorical (especially ordered categoricals, such as Likert-style responses), or continuous.

We also realized that many other types of domain knowledge and common sense could be expressed mathematically. For example, consider trying to rate coffee shops (in order to recommend some of them to users) based on how far away they are, how many reviews they have received, and what their average review score is. The strength of the recommendation should decrease monotonically with distance, and increase monotonically with average review score (and maybe also with number of reviews). But we also know more. For example, distance should probably obey a diminishing returns relationship, where each additional kilometer hurts less than the last. We should also trust, or rely on, the average review score more if it is backed by a high number of reviews.

In this blog post, we describe how we impose common-sense “shape constraints” on complex models. We call these “semantic regularizers”, because they serve the same purpose as traditional regularization strategies — guiding your model towards learning real generalizable correlations instead of noise — but do so in a way that directly takes advantage of your domain knowledge. They give you the right kind of control over your model, leaving it flexible enough to learn while ensuring it behaves in accordance with known facts about the phenomenon.

Adding such constraints to a model will regularize the model and produce guaranteed model behavior for reliability or policy needs. These constraints also make it easier to summarize and explain the model. For the law school example, before we even train the model with shape constraints, we can already tell end-users, “The learned function never penalizes higher GPA and higher LSAT scores.”

We have packaged the solutions we describe in this blog post in the TensorFlow Lattice library. TF Lattice offers semantic regularizers that can be applied to models of varying complexity, from simple Generalized Additive Models, to flexible fully interacting models called lattices, to deep models that mix in arbitrary TF and Keras layers.

In the remainder of this blog post, we’ll walk you through how we do shape constrained machine learning, starting from linear functions and moving through more flexible modeling. We end with links to code tutorials and papers with more technical detail.

## Generalized Additive Models

Linear models are a common first step when building models, and have the nice property that they will automatically obey some of the common-sense properties discussed above, such as monotonicity. The assumption of linearity is often too strong, though, leading practitioners to explore transforming their features in various ways. For example, one could apply logarithms or add polynomial terms for certain features.

A more general approach is to learn a Generalized Additive Model (GAM). GAMs are popular in data science and machine learning applications for their simplicity and interpretability. Formally, GAMs are generalized linear models in which the model output depends linearly on learned transformations of features, denoted by $c$:

$$
\hat{y} = \alpha + c_1(x_1) + c_2(x_2) + c_3(x_3) + \cdots + c_k(x_k)
$$

The transformations of features usually take the form of smoothing splines or local smoothers, which are fit using back-fitting algorithms. When the data is noisy or sparse, however, there is no guarantee that the learned transformations will align with domain knowledge and common sense.

For example, after we train a one-feature GAM to recommend coffee shops based on how far away they are, we observe the following trend. The model fits training data well, and in feature regions with many training examples (e.g., coffee shops $5$ to $20$ km away), the model performs sensibly. However, the model is questionable when rating coffee shops that are more than $20$ km away, as coffee shops $30$ km away are predicted to be preferable to ones only $20$ km away. This phenomenon is likely due to noise in the training data in the corresponding feature region. If we deploy this model online, it will likely direct users to farther away coffee shops, hurting user experience.

One may try to apply regularization to smooth the curve, but the questionable performance on far away coffee shops still persists.

With TF Lattice, we can learn GAMs in which the feature transformations are guaranteed to behave sensibly in ways that align with domain knowledge and common sense. The most prominent of these is monotonic GAMs, in which we constrain the feature transformations to be monotonically increasing or decreasing without otherwise limiting their flexibility.

The figure below shows the learned trend of a monotonic GAM. This model fits well in regions with sufficient training data. In regions with less training data, the model still performs sensibly, which ensures robust online performance.

The secret to achieving this guarantee comes in how we pose the problem. TF Lattice uses piecewise-linear functions (PLFs) in its feature transformations. PLFs have two useful properties that we take advantage of. The first is that they are straightforward to optimize using traditional gradient-based optimizers as long as we pre-specify the placement of the knots. If we parameterize the PLF by the values $\theta_j$ it takes at each of its knots $\kappa_j$, then

$$
c(x) = \sum_{j=1}^{K} \left( (1-\alpha_j)\,\theta_j + \alpha_j\,\theta_{j+1} \right)
\left[ x \in (\kappa_j, \kappa_{j+1}] \right],
\quad \alpha_j = \frac{x-\kappa_j}{\kappa_{j+1} - \kappa_j}
$$

where the sum runs over the $K$ intervals between adjacent knots, $\alpha_j$ is the interpolation weight within interval $j$ (unrelated to the intercept $\alpha$ of the GAM), and square brackets represent the truth value 0 or 1 of the expression contained.
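To make the parameterization concrete, here is a minimal sketch in plain Python (not TF Lattice code) that evaluates a PLF by linear interpolation between the values at adjacent knots. The function name `plf`, the knot positions, and the knot values are illustrative assumptions, as is the choice to clamp inputs that fall outside the knot range.

```python
from bisect import bisect_right


def plf(x, knots, theta):
    """Evaluate a piecewise-linear function at x.

    knots: strictly increasing knot positions (the kappa values).
    theta: function values at the knots, same length as knots.
    Inputs outside the knot range are clamped to the end values
    (an assumption of this sketch).
    """
    if len(knots) != len(theta):
        raise ValueError("knots and theta must have the same length")
    if x <= knots[0]:
        return theta[0]
    if x >= knots[-1]:
        return theta[-1]
    j = bisect_right(knots, x) - 1  # knots[j] <= x < knots[j + 1]
    t = (x - knots[j]) / (knots[j + 1] - knots[j])  # position inside the interval
    return (1.0 - t) * theta[j] + t * theta[j + 1]


# Illustrative numbers only: distance in km and a made-up learned score.
knots = [0.0, 5.0, 10.0, 20.0, 30.0]
theta = [1.0, 0.6, 0.4, 0.25, 0.2]
score_at_7_5_km = plf(7.5, knots, theta)  # halfway between the 5 km and 10 km knots
```

Because the output is linear in `theta` for a fixed input, the gradient of any differentiable loss with respect to `theta` is easy to compute, which is what makes fixed-knot PLFs convenient for gradient-based training.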

If we observe label vector $y$ and feature vectors $x_1, \cdots, x_d$, we can write the differentiable empirical risk minimization problem with a squared loss as

$$
\min_\theta \left\| y - \sum_{j=1}^d c_j(x_j) \right\|^2
$$

Note that we use squared loss for simplicity of presentation; one can use any differentiable loss in their application.

Secondly, PLFs have the property that many types of constraints can be written as simple linear inequalities on their parameters. For example, a monotonicity constraint can be succinctly expressed by the constraint set $\{ \theta_i < \theta_{i+1} \}$ while diminishing returns are captured by $\{ \theta_{i+1} - \theta_i > \theta_{i+2} - \theta_{i+1} \}$. There is no convenient way to do this with other popular function classes such as polynomials, particularly in a way that does not overconstrain and rule out viable monotonic functions. For categorical features, unique categories are assigned their own knots and then we learn a “PLF” (really, a 1-d embedding since every value is exactly at a knot). This formulation allows the user to specify any monotonic orderings they want between the categories, or even a total ordering (as with, say, a Likert-scale).
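As a small illustration of how simple these constraint sets are, the sketch below checks both inequalities directly on a list of knot values. The helper names and example values are hypothetical. Following the text, the diminishing-returns check compares raw differences between adjacent knot values, which corresponds to comparing slopes when the knots are evenly spaced.

```python
def is_increasing(theta):
    """True if theta[i] < theta[i + 1] for every adjacent pair of knot values."""
    return all(a < b for a, b in zip(theta, theta[1:]))


def has_diminishing_returns(theta):
    """True if each step between adjacent knot values is smaller than the previous one."""
    diffs = [b - a for a, b in zip(theta, theta[1:])]
    return all(d1 > d2 for d1, d2 in zip(diffs, diffs[1:]))


# Illustrative knot values: the first satisfies both constraints, the second neither.
theta_ok = [0.0, 0.5, 0.8, 0.9]
theta_bad = [0.0, 0.2, 0.7, 0.6]

checks = {
    "ok": (is_increasing(theta_ok), has_diminishing_returns(theta_ok)),
    "bad": (is_increasing(theta_bad), has_diminishing_returns(theta_bad)),
}
```

Each check is a conjunction of linear inequalities in `theta`, which is the property the optimization procedure described next relies on.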

There is a robust set of tools for working with these kinds of constrained optimization problems. TF Lattice uses a projected stochastic gradient descent (SGD) algorithm in which each gradient step on the objective function is followed by a projection onto the constraints. The projection is performed with Dykstra’s algorithm, which involves sequential exact projections on subsets of the constraints. This is where the linearity of the constraints is important, as these exact projections are straightforward.
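The following is a toy sketch of this idea, not TF Lattice's actual implementation. It uses full-batch gradient steps rather than stochastic ones, a simple squared loss directly on the knot values, and only the non-strict monotonicity constraints $\theta_i \le \theta_{i+1}$. All function names, the learning rate, and the sweep counts are assumptions made for illustration.

```python
def project_pair(theta, i):
    """Exact Euclidean projection onto the half-space theta[i] <= theta[i + 1]."""
    out = list(theta)
    if out[i] > out[i + 1]:
        # The closest point that satisfies the constraint sets both to their mean.
        out[i] = out[i + 1] = 0.5 * (out[i] + out[i + 1])
    return out


def project_monotone(theta, num_sweeps=100):
    """Dykstra's algorithm over the constraints theta[i] <= theta[i + 1]."""
    n = len(theta)
    x = list(theta)
    # Dykstra's algorithm keeps one correction vector per constraint.
    corrections = [[0.0] * n for _ in range(n - 1)]
    for _ in range(num_sweeps):
        for i in range(n - 1):
            y = [xv + cv for xv, cv in zip(x, corrections[i])]
            x = project_pair(y, i)
            corrections[i] = [yv - xv for yv, xv in zip(y, x)]
    return x


def projected_gradient_fit(targets, learning_rate=0.1, num_steps=100):
    """Minimize sum((theta - targets)^2) subject to theta being non-decreasing."""
    theta = [0.0] * len(targets)
    for _ in range(num_steps):
        grad = [2.0 * (t - y) for t, y in zip(theta, targets)]
        theta = [t - learning_rate * g for t, g in zip(theta, grad)]
        theta = project_monotone(theta)  # projection after every gradient step
    return theta


noisy_targets = [0.1, 0.5, 0.3, 0.9, 0.7]  # illustrative, not real data
fitted = projected_gradient_fit(noisy_targets)
```

In this toy problem the constrained solution pools each violating pair to its average, so `fitted` ends up close to `[0.1, 0.4, 0.4, 0.8, 0.8]`.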

It is true that PLFs are sometimes viewed as rigid or choppy, with the additional undesirable property of having to preselect good knots. In our case, it turns out that the monotonicity regularizer allows us to increase the number of knots without incurring much risk of overfitting — there are fewer ways for the model to go wrong. More knots make the learned feature transformation smoother and more capable of approximating any monotonic function. As a result, selecting knots according to the quantiles of the input data (or even linearly across the domain), and then steadily increasing their number as long as the metrics improve works well in practice.
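Both knot placement strategies mentioned above are easy to sketch in plain Python. The helper names below are hypothetical, and the quantile version uses a simple nearest-rank rule, which is just one of several reasonable ways to pick quantiles.

```python
def quantile_knots(values, num_knots):
    """Pick knots at evenly spaced ranks of the sorted data (nearest-rank rule)."""
    if num_knots < 2:
        raise ValueError("need at least two knots")
    ordered = sorted(values)
    last = len(ordered) - 1
    picked = [ordered[round(i * last / (num_knots - 1))] for i in range(num_knots)]
    # Drop duplicates so the knots are strictly increasing.
    return sorted(set(picked))


def linear_knots(lower, upper, num_knots):
    """Place knots at equal spacing across the domain [lower, upper]."""
    if num_knots < 2:
        raise ValueError("need at least two knots")
    step = (upper - lower) / (num_knots - 1)
    return [lower + i * step for i in range(num_knots)]


distances_km = list(range(101))  # illustrative feature values
knots_by_quantile = quantile_knots(distances_km, 5)
knots_by_spacing = linear_knots(0.0, 30.0, 4)
```

Quantile placement puts more knots where the data is dense, while linear placement covers the whole domain evenly. In either case the number of knots can be raised step by step while validation metrics keep improving.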

With PLFs, we can even create unique and interesting types of regularization. We can make a learned transformation flatter, making a feature less influential in the final model output, by adding a regularizer on the magnitude of the first-order differences $ \theta_{i+1} - \theta_i $. We can make the transformation more linear, by regularizing the magnitude of the second-order differences $ (\theta_{i+2} - \theta_{i+1}) - (\theta_{i+1} - \theta_i)$. We can even make the learned function smoother, by regularizing the magnitude of the third-order differences! (Can you work out how to express that in terms of $\theta$?)
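The first two of these regularizers can be sketched as penalty terms on the knot values. The function names are hypothetical, and the use of squared magnitudes (rather than, say, absolute values) is an assumption made for this illustration.

```python
def first_order_penalty(theta):
    """Sum of squared first-order differences; zero for a flat transformation."""
    return sum((b - a) ** 2 for a, b in zip(theta, theta[1:]))


def second_order_penalty(theta):
    """Sum of squared second-order differences; zero for a linear transformation."""
    diffs = [b - a for a, b in zip(theta, theta[1:])]
    return sum((d2 - d1) ** 2 for d1, d2 in zip(diffs, diffs[1:]))


flat = [1, 1, 1, 1]
linear = [0, 1, 2, 3]
kinked = [0, 1, 4]

penalties = {
    "flat": (first_order_penalty(flat), second_order_penalty(flat)),
    "linear": (first_order_penalty(linear), second_order_penalty(linear)),
    "kinked": (first_order_penalty(kinked), second_order_penalty(kinked)),
}
```

Adding a weighted multiple of either penalty to the training loss pulls the learned transformation toward a flat or a linear shape, respectively, without forcing it to be exactly flat or linear.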

The drawback of GAMs is that they do not allow feature interactions. In the next section, we describe lattice models, which allow feature interactions that are guaranteed to align with common sense.

## Flexible Lattice

To build shape-constrained models which allow interactions among features, we utilize lattice models. Similar to PLF-based GAMs, a lattice model also learns feature transformations that can be constrained to align with common sense (e.g., monotonicity, diminishing returns). Unlike GAMs, which sum up the transformed feature values, a lattice model uses a learned lattice function $l$ to fuse the features, which may themselves be transformed by PLFs $c_k$:

$$
\hat{y} = \alpha + l(c_1(x_1), c_2(x_2), \cdots, c_k(x_k))
$$

The following image shows the response surface of an example lattice function $l$ with two inputs, RATER CONFIDENCE and RATING. This lattice function is parametrized by four parameters $\theta_1, \cdots, \theta_4$, which define the lattice function output at extreme input values; at other input values, we linearly interpolate from the vertices. As can be seen from the image, the lattice response surface is not a plane, which indicates that unlike GAMs the lattice function can model interactions between features.

We can therefore write a very similar empirical risk minimization problem as in the case of 1-dimensional PLFs; indeed, an astute reader will note that lattices are in some sense a multi-dimensional extension of 1-d PLFs. For simplicity, we stick to the two-feature model described above, where $x_1$ is the rating feature and $x_2$ is the rater confidence feature and we have two knots per dimension set at their min and max values. Assuming that both features are scaled to lie in $[0,1]$, this setup gives rise to the lattice function

$$
l(x_1, x_2) = (1-x_1)(1-x_2)\theta_1 + (1-x_1)x_2\theta_2 + x_1(1-x_2)\theta_3 + x_1 x_2 \theta_4
$$

We similarly write out the lattice function for more features. The empirical risk minimization problem with a squared loss hence becomes

$$
\min_\theta \| y - l(x_1, x_2, \cdots, x_d) \|^2
$$

As mentioned in the previous section, one can use any differentiable loss in their application.
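The two-feature lattice function above is short enough to write out directly. The sketch below is plain Python rather than TF Lattice code, and the parameter values are made up for illustration. The corner ordering follows the formula: $\theta_1$ at $(0,0)$, $\theta_2$ at $(0,1)$, $\theta_3$ at $(1,0)$, and $\theta_4$ at $(1,1)$, with $x_1$ the rating and $x_2$ the rater confidence.

```python
def lattice_2d(x1, x2, theta):
    """Bilinear interpolation over the unit square.

    theta = (theta1, theta2, theta3, theta4) holds the outputs at the corners
    (x1, x2) = (0, 0), (0, 1), (1, 0), (1, 1), in that order.
    """
    t1, t2, t3, t4 = theta
    return (
        (1 - x1) * (1 - x2) * t1
        + (1 - x1) * x2 * t2
        + x1 * (1 - x2) * t3
        + x1 * x2 * t4
    )


# Illustrative parameters only.
theta = (0.4, 0.1, 0.6, 1.0)
corner_values = [lattice_2d(a, b, theta) for a in (0, 1) for b in (0, 1)]
center_value = lattice_2d(0.5, 0.5, theta)  # the average of the four corners
```

The surface is a plane only when $\theta_1 + \theta_4 = \theta_2 + \theta_3$. For the values above the two sums differ, so the effect of the rating depends on the rater confidence, which is exactly the kind of interaction a GAM cannot represent.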

As with PLFs, the parameterization of lattices allows us to easily constrain their shape to make them align with common sense. For example, to make the RATING feature monotonic, we just need to constrain the model parameters such that

$$
\theta_4 > \theta_2, \quad \theta_3 > \theta_1
$$

We should pause here to note just how powerful these two constraints are. They guarantee that no matter what value RATER CONFIDENCE takes, we remain monotonic as we move along the RATING dimension. This is a substantially more complex task than maintaining monotonicity in the absence of feature interactions, but one that is made possible thanks to our parameterization.
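To see why two inequalities are enough, one can differentiate the two-feature lattice function with respect to the rating input $x_1$. This short derivation is added for illustration and uses the same $\theta$ indexing as the lattice function above:

$$
\frac{\partial l}{\partial x_1} = (1-x_2)(\theta_3 - \theta_1) + x_2(\theta_4 - \theta_2)
$$

For any $x_2 \in [0,1]$, the right-hand side is a weighted average of $\theta_3 - \theta_1$ and $\theta_4 - \theta_2$, so it is positive whenever both differences are positive, regardless of the value of RATER CONFIDENCE.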

In addition to the monotonicity constraints, lattice models also allow us to constrain how two features interact. Consider the intuition described above: We want good ratings to “help” more (and bad ratings to “hurt” more) when backed by high rater confidence. We don’t want our model, in contrast, to reward or penalize ratings as strongly when our confidence in them is low. We call this idea a trust constraint: we trust RATING more if RATER CONFIDENCE is higher. Mathematically, it implies that

$$
\theta_4 > \theta_3 > \theta_1 > \theta_2
$$

There are various other pairwise constraints, from Edgeworth complementarity to feature dominance to joint unimodality, that we have explored and that can be imposed on lattice models; there simply is not enough room in this post to describe them all. As with monotonicity, all these cases are implemented in the TF Lattice package in such a way that the pairwise interaction is constrained for all possible values of the other features. And as in the case of PLFs, we can impose custom forms of regularization on the model parameters that can make the learned lattice function less reliant on specific features, more linear, or more flat overall.
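Returning to the trust constraint, a quick numeric check can make its effect visible. The sketch below uses a hand-written two-feature lattice with made-up parameters that satisfy $\theta_4 > \theta_3 > \theta_1 > \theta_2$, and measures how much moving the rating from its minimum to its maximum changes the output at several levels of rater confidence. The function names and values are hypothetical.

```python
def lattice_2d(x1, x2, theta):
    """Bilinear interpolation; theta holds outputs at (0,0), (0,1), (1,0), (1,1)."""
    t1, t2, t3, t4 = theta
    return (
        (1 - x1) * (1 - x2) * t1
        + (1 - x1) * x2 * t2
        + x1 * (1 - x2) * t3
        + x1 * x2 * t4
    )


def rating_effect(confidence, theta):
    """Change in output when the rating moves from 0 to 1 at a fixed confidence."""
    return lattice_2d(1.0, confidence, theta) - lattice_2d(0.0, confidence, theta)


# Illustrative parameters with theta4 > theta3 > theta1 > theta2.
theta = (0.4, 0.1, 0.6, 1.0)
confidences = [i / 10 for i in range(11)]
effects = [rating_effect(c, theta) for c in confidences]
```

With these parameters, the effect of the rating grows steadily from about 0.2 at the lowest confidence to about 0.9 at the highest, so the model relies on the rating more when it is backed by higher confidence.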

The smoothness of the lattice surface may make it seem inflexible — it is not. As with piecewise linear feature transformations, we can add additional knots to increase expressibility. Indeed, lattices are as expressive as deep neural networks (both are universal function approximators). We can also compose PLFs with lattices by first feeding features through one-dimensional feature transformations. In practice, even a low-dimensional lattice model combined with learned feature transformations can achieve high performance. The figure below shows the output surface of a real feature-transformed lattice model used within Google. One of the features is constrained to be monotonic (can you identify it?), and the other unconstrained. The model is smooth, yet flexible. It also aligns with our common sense and hence is more robust.

At the same time, the richness of the lattice’s expressibility means that we have an exponential number of parameters relative to the number of features. In practice, we adjust for that by fitting ensembles of smaller lattices when working with higher numbers of features. TF Lattice has various mechanisms to help with that process, such as intelligently grouping features with high interactions into a single lattice.
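A little arithmetic shows why ensembles help. A lattice stores one parameter per vertex, so a lattice over $d$ features with the same number of knots in every dimension has (knots per feature)$^d$ parameters. The helper names and the ensemble sizes below are illustrative assumptions.

```python
def lattice_param_count(num_features, knots_per_feature=2):
    """One parameter per lattice vertex."""
    return knots_per_feature ** num_features


def ensemble_param_count(num_lattices, features_per_lattice, knots_per_feature=2):
    """Total parameters across an ensemble of equally sized small lattices."""
    return num_lattices * lattice_param_count(features_per_lattice, knots_per_feature)


two_feature_lattice = lattice_param_count(2)   # the four-parameter example above
single_big_lattice = lattice_param_count(20)   # one lattice over 20 features
small_ensemble = ensemble_param_count(10, 4)   # 10 lattices of 4 features each
```

With two knots per feature, a single lattice over 20 features needs 1,048,576 parameters, while a hypothetical ensemble of 10 four-feature lattices needs only 160. The trade-off is that each small lattice can model interactions only among the features assigned to it.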

## Monotonic Deep Lattice Networks

Deep learning is a powerful tool when we have an abundance of data to learn from. However, it can suffer from the same problems highlighted at the beginning of this post as any other unconstrained model: brittleness in the presence of training-serving skew, the potential of strange predictions in some parts of the feature space, and a general lack of guardrails and guarantees about its behavior.

In this section, we extend the ideas of building monotonic GAMs and lattice models to construct monotonic deep learning models. Mathematically, the key idea is that PLFs and lattices can be viewed as generic transformations capable of propagating monotonicity inside deeper modeling structures. A monotonic deep learning model is well-regularized, easy to understand, and aligns with common sense. It takes advantage of your domain knowledge to keep the model on track, while maintaining the advantages of modern deep learning, namely, scalability and performance.

Deep learning models are usually composed of several functions, stacked together. For example, a two-layer fully-connected deep learning model is of the form

$$
\hat{y} = f(g(h(x)))
$$

where $f$ and $h$ are fully connected layers, and $g$ is the activation function. Other deep learning models can also be written in this form.

A sufficient (but not necessary) condition for a deep learning model to be monotonic is that all of its layers are monotonic. One can achieve this by, say, using monotonic activations and constraining all the coefficients in all fully-connected layers to be non-negative. As a result, even if we only want to constrain a subset of features to be monotonic, a monotonic deep learning model with fully-connected layers still needs to constrain all hidden layer weights to be non-negative. Such deep learning models are therefore inflexible, losing much of the benefit of using a deep learning model.
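The sufficient condition can be illustrated with a tiny hand-rolled network of the form $f(g(h(x)))$, written in plain Python rather than with any deep learning library. The weights are random and clipped to be non-negative, and `tanh` serves as the monotonic activation. All names, sizes, and the random seed are illustrative assumptions.

```python
import math
import random


def dense(x, weights, bias):
    """Fully connected layer: one output per row of weights."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def non_negative(weights):
    """Clip every coefficient to be >= 0, which makes the layer monotonic."""
    return [[max(0.0, w) for w in row] for row in weights]


def monotonic_mlp(x, w1, b1, w2, b2):
    hidden = [math.tanh(v) for v in dense(x, w1, b1)]  # g(h(x)), tanh is increasing
    return dense(hidden, w2, b2)[0]                    # f(...)


rng = random.Random(0)
w1 = non_negative([[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(4)])
b1 = [rng.uniform(-1.0, 1.0) for _ in range(4)]  # biases need no constraint
w2 = non_negative([[rng.uniform(-1.0, 1.0) for _ in range(4)]])
b2 = [0.0]

base = [0.2, -0.5, 1.0]
bumped = [0.7, -0.5, 1.0]  # only the first input is increased
output_change = monotonic_mlp(bumped, w1, b1, w2, b2) - monotonic_mlp(base, w1, b1, w2, b2)
```

Increasing any single input can never decrease the output of this network, so `output_change` is non-negative. The cost is visible in `non_negative`: every coefficient is restricted, including those attached to inputs we never wanted to constrain.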

To address this issue, we propose using monotonic piecewise linear and lattice functions as core components to build monotonic deep learning models. (We will also use “monotonic linear layers”, or linear layers with some coefficients constrained to be non-negative, as a supporting tool.) As we’ve said, both monotonic piecewise linear and lattice functions with enough keypoints are universal approximators of monotonic functions of their respective dimensions, and hence composing them together will produce highly flexible monotonic deep learning models. For a lattice function, we can constrain a subset of its inputs to be monotonic, with the others left unconstrained; we call such a lattice a “partially monotonic lattice”.

Piecewise-linear and lattice layers can coexist with traditional deep learning structures, as we will show. While lattices are universal function approximators, they are not always the best modeling choice to use with arbitrary data. As mentioned above, lattice layers have an exponential number of parameters in their input dimension, meaning that we will generally use ensembles of lattices in applications. But these can be cumbersome in the presence of high-dimensional sparse features or images for which effective modeling structures already exist and where granular element-wise shape constraints are not appropriate. The key is thinking through composability. When monotonicity is required, we can enforce and propagate it through a model with our layers. When it isn’t, piecewise-linear and lattice layers may still be effective tools, but other layers and approaches can be used as well.

Viewed through the lens of composability, we can express the GAMs and Lattice models described in the earlier sections as special cases of monotonic modeling. GAMs feed element-wise piecewise-linear layers into a monotonic linear layer, while Lattice models combine piecewise-linear layers with lattice layers. Lattice ensembles, meanwhile, start with piecewise-linear layers and end with lattice layers, but have an intermediate layer to assign inputs to their respective lattices (or optionally, do a “soft” assignment with learned monotonic linear layers).

In the following sections, we describe several deep learning applications that can be improved by using lattice layers. The main benefit is the ability to employ monotonic and partially-monotonic structures. We also present some examples that take advantage of multi-dimensional shape constraints. We recommend caution when using multi-dimensional constraints in arbitrary deep models, as it can be difficult to think through how such constraints compose end-to-end — on the other hand, monotonicity is much more straightforward. How best to use lattices in deep learning applications is still an open question and an active area of research. The following examples are by no means the only way to utilize TF Lattice layers.

### Combining Embeddings with TF Lattice Layers

Let us dig into the case of embedding layers. Sparse categorical features, such as keywords or location ids, are often used in modern machine learning applications, and have proven to be powerful improvements to model performance. For example, if we want to predict the box-office performance of a movie, in addition to dense numeric features, such as the budget of the movie (which likely is a positive signal to the box-office revenue), sparse features such as the names of the director and leading actors can also be useful.

These sparse features are often handled through high-dimensional embeddings and specialized processing blocks such as fully connected layers, residual networks, or transformers. These offer a huge amount of flexibility and processing power, but also make it easy for dense features to get “lost” in the mix, learning strange correlations with rare sparse features. Recall that lattice layers can be set to make only certain inputs monotonic. These offer a way to keep the dense features monotonic, no matter what values the sparse features take, without having to constrain the relationships learned within the embedding block.

In addition, the embeddings themselves can be prone to excessive memorization as they are not robust to distribution changes in input features. This can make the resulting models hard to understand and, when they make unreasonable predictions, make it difficult to pinpoint the cause of the bug. As a fix, we can impose monotonicity on the final outputs of an embedding block as an additional regularization technique: if the resulting outputs of the embedding block are unusually high, we can point to the sparse features as the “reason” for our model output being high.

The following figure shows an example model structure that we may use for the movie box-office performance problem. It is an example of a model with sparse features, non-monotonic dense features, and monotonic dense features.

We use an embedding structure to embed the names of the director and leading actors into a small number of real-valued outputs. We also use non-monotonic structures (e.g., fully connected layers) to fuse non-monotonic features (such as length of the movie, season of the premiere) into a few outputs. For those monotonic features (such as the budget of the movie), we fuse them with non-monotonic features using a lattice structure. Finally, the outputs from embedding, non-monotonic and monotonic blocks are monotonically fused in the final monotonic block.

Note that there is flexibility in how you create those monotonic blocks and structures. Our monotonic linear, piecewise-linear, lattice, and lattice-ensemble layers can be combined with standard monotonic functions like sigmoids in as simple or complex of a manner as you might want. We often build around the lattice-ensemble layer, which can flexibly accept large numbers of inputs. These inputs can be processed with individual PLFs or even just a monotonic linear layer and sigmoid (recall that lattice inputs must be bounded). These blocks can then be stacked. Alternatively, there might be times when a monotonic linear layer, despite its lack of expressiveness, is enough for one part of your model.

The green paths in the diagram denote monotonic functions, and as a result, the output (yellow circle) is monotonic with respect to those monotonic dense features (green circle), but is not monotonic with respect to other features. By constraining some of the inputs to be monotonic, such a model balances flexibility with regularization. Furthermore, this model is easy to debug. For example, when the prediction of an example is unexpectedly large we can trace back the prediction through visualization of the final block and identify its problematic, unexpectedly large input. This can help us pinpoint the issue in the model or feature.

A similar idea can be used when combining dense features with other types of input data, such as images or videos. Existing convolutional or residual structures can be used to process this data into a manageable number of real outputs, while lattice layers can be used towards the end of the model to combine them with dense features in an interpretable way.

### Controllable Deep Learning with Spatiotemporal Data

Spatiotemporal data are often used in forecasting models. We are provided with historical data from the past several time periods, for example, and want to predict outcomes in the following time period. A common approach is to fit a univariate time series model, but there has recently been a focus on building larger machine-learned models that can generalize across time series and take in other high-dimensional inputs.

Incorporating lattice layers into deep learning models offers a way to build models like this while embedding in them relevant domain knowledge. For example, dominance constraints (which we have not discussed in detail; see the paper Multidimensional Shape Constraints, Gupta et al., 2020) can be used to ensure that no matter where we are in the feature space, changing the value of a dominant feature will influence the model more than changing the value of a non-dominant feature. Using these, we can require more recent data to be more influential in our forecast, matching the behavior of common univariate techniques such as exponential smoothing.

We depict below a simple model that takes in historical revenue data alongside a sparse embedding feature to predict future revenue. Eschewing potential complications such as seasonality or bounce-back effects, we assume that our forecast is monotonic with respect to each historical data point and that recent data should be more important to our model relative to older data. Our resulting model may be easier to explain, and may generalize better, than one that simply throws all of the inputs into a neural network.

### Controllable Deep Learning with a Set of Features

Consider the task of predicting the quality of a restaurant using user review ratings for that given restaurant. For each user review rating, we also have information regarding the corresponding reviewer, such as the number of reviews given and average rating. This is a non-classical machine learning task in the sense that the model features (user review ratings, number of reviews by each reviewer, average rating by reviewer) are of variable length: for different restaurants, we will have a different number of reviews.

A simple approach might be to take a weighted mean of some sort. We could average reviewer scores while giving greater weight to experienced reviewers (who may be more trustworthy) and adjust for reviewers who are prone to giving high or low scores across the board. We could tweak these weights and adjustments to better correlate with some “golden” quality label.

But what if we want a machine learning model that directly takes in these variable length features? What if we still wish to maintain key properties of a weighted mean such as monotonicity on individual review scores, and to accord greater trust to the scores of experienced reviewers?

As introduced in Zaheer et al. (2017), we could use the following model to learn a “set function”, i.e., a function that acts on an unordered set of fixed-dimensional inputs in a permutation-invariant way:

$$
\hat{y} = \rho\left( \sum_{m=1}^M \phi(x_m) \right)
$$

where $x_m$ are the features for the $m$-th review, while $\phi$ and $\rho$ are generic deep learning models ($\phi$ may have multiple outputs). We extend this model with the following 6-layer structure with monotonicity constraints (Cotter et al., 2019). Absent monotonicity, this structure is proven to be flexible enough to learn any function on the variable length features. Here, our structure additionally constrains the user rating feature to be monotonic: if one user increases their rating of the restaurant, our predicted restaurant quality score will only increase. We can optionally add trust constraints to make sure that user ratings from experienced reviewers are more influential in the final model output. For more details, see Shape Constraints for Set Functions (Cotter et al., 2019).
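The structure $\rho(\sum_m \phi(x_m))$ can be sketched with hand-written stand-ins for the learned functions. In the toy example below, `phi` and `rho` are simple closed-form functions chosen for illustration (in the real model both would be learned), each review is a hypothetical `(rating, reviewer_review_count)` pair, and the constants are arbitrary.

```python
import math


def phi(review):
    """Per-review embedding: a trust-weighted rating and the trust weight itself."""
    rating, reviewer_review_count = review
    trust = reviewer_review_count / (reviewer_review_count + 10.0)  # grows with experience
    return (trust * rating, trust)


def rho(pooled):
    """Map the pooled embedding to a score (a smoothed weighted mean in this sketch)."""
    weighted_sum, total_trust = pooled
    return weighted_sum / (total_trust + 1.0)


def set_function(reviews):
    """Sum the per-review embeddings, then apply rho; the order of reviews is irrelevant."""
    pooled = [0.0, 0.0]
    for review in reviews:
        for k, value in enumerate(phi(review)):
            pooled[k] += value
    return rho(pooled)


reviews = [(4.0, 50), (2.0, 3), (5.0, 200)]        # illustrative data
shuffled = [reviews[2], reviews[0], reviews[1]]
same_score = math.isclose(set_function(reviews), set_function(shuffled))

base_score = set_function(reviews)
gain_experienced = set_function([(5.0, 50), (2.0, 3), (5.0, 200)]) - base_score
gain_novice = set_function([(4.0, 50), (3.0, 3), (5.0, 200)]) - base_score
```

Because the embeddings are summed before `rho` is applied, the function accepts any number of reviews and ignores their order. In this toy version, raising any single rating raises the score, and a one-point increase from the experienced reviewer moves the score more than the same increase from the novice, mirroring the monotonicity and trust properties described above.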

## Discussion

As data scientists, we’ve all been there: scratching our head at unexpected model outputs, or witnessing a seemingly promising model struggle when deployed in the real world. We believe TensorFlow Lattice takes a big step forward in building high-performance models that align with common sense, satisfy policy constraints, and ultimately deliver peace-of-mind. We are particularly excited about the broad applicability of shape-constrained models because principles such as monotonicity are intuitive and pop up repeatedly across domains and use cases.

This blog post is only a primer on the problems we encountered and the solutions at which we have arrived. TF Lattice includes many more constraints, regularizers, and techniques to solve other problems that a practitioner may face, and we continue to improve it. We encourage you to take a look at our tutorials and colabs on TF Lattice Keras layers, Keras premade models, canned estimators, custom estimators, shape constraints, ethical constraints and set functions. We hope TF Lattice will prove useful to you in your real-world applications.

## References

Wightman, L. (1998). LSAC National Longitudinal Bar Passage Study. Law School Admission Council.

Garcia, E. and Gupta, M. (2009). Lattice Regression. In Advances in Neural Information Processing Systems 22, pages 594-602.

Gupta, M., Cotter, A., Pfeifer, J., Voevodski, K., Canini, K., Mangylov, A., Moczydlowski, W. and van Esbroeck, A. (2016). Monotonic Calibrated Interpolated Look-Up Tables. Journal of Machine Learning Research, 17: 1-47.

Milani Fard, M., Canini, K., Cotter, A., Pfeifer, J. and Gupta, M. (2016). Fast and Flexible Monotonic Functions with Ensembles of Lattices. In Advances in Neural Information Processing Systems 29, pages 2919-2927.

You, S., Ding, D., Canini, K., Pfeifer, J. and Gupta, M. (2017). Deep Lattice Networks and Partial Monotonic Functions. In Advances in Neural Information Processing Systems 30, pages 2981-2989.

Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R. R. and Smola, A. J. (2017). Deep Sets. In Advances in Neural Information Processing Systems 30, pages 3391-3401.

Cotter, A., Gupta, M., Jiang, H., Louidor, E., Muller, J., Narayan, T., Wang, S. and Zhu, T. (2019). Shape Constraints for Set Functions. In Proceedings of the 36th International Conference on Machine Learning, pages 1388-1396.

Gupta, M. R., Louidor, E., Mangylov, O., Morioka, N., Narayan, T. and Zhao, S. (2020). Multidimensional Shape Constraints. In Proceedings of the 37th International Conference on Machine Learning, to appear.

Wang, S. and Gupta, M. (2020). Deontological Ethics by Monotonicity Shape Constraints. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics, pages 2043-2054.
