Replacing Sawzall — a case study in domain-specific language migration

by AARON BECKER

In a previous post, we described how data scientists at Google used Sawzall to perform powerful, scalable analysis. However, over the last three years we’ve eliminated almost all of our Sawzall code, and the niche that Sawzall occupied in our software ecosystem is now mostly filled by Go. In this post, we’ll describe Sawzall’s role in Google’s analysis ecosystem, explain the problems that emerged as Sawzall use grew and that motivated our migration, and detail the techniques we applied to achieve language-agnostic analysis while maintaining strong access controls and the ability to write fast, scalable analyses.



Any successful programming language has its own evolutionary niche, a set of problems that it solves unusually well. Sometimes this niche is created by language features. For example, Erlang has strong tools for constructing distributed systems built into the language. In other cases, features such as standard libraries and a language’s community of users are more important — the main reason that R is a great language for statistics is that it’s widely used by statisticians and has a huge variety of useful statistics libraries. In order to understand the reason for Sawzall’s decline, we have to first understand the niche that it occupied in Google’s software ecosystem.

Our previous discussion of Sawzall focused on one of Sawzall’s biggest strengths — it makes it easy to write powerful analysis scripts quickly for tasks like computing statistical aggregates or running a Poisson bootstrap. As such, it’s great for writing quick one-off analysis code and iterating on it as we come to a better understanding of the data. The name of the language is suggestive — the actual physical Sawzall® (trademark Milwaukee Tool) that the language is named after is a versatile hand tool that can make quick work of logs.

Figure 1: A physical Sawzall sawing physical logs.

Sawzall also has important strengths in another critical area — access control and auditing. The input to analysis jobs often includes personally identifiable information like IP addresses, and there are strict rules that limit what analysts can do with this data. We need to be able to answer several questions about any analysis before it runs:
  • Should this analyst have access to this data at all? 
  • If they should have access, which fields should they be able to read? Our input records are protocol buffers, and we’ve annotated the fields of our logged protos to indicate which ones may contain sensitive data (e.g. a user’s IP address) and which ones are innocuous (e.g. the amount of time it took to process a request). Reading sensitive fields requires a strong justification.
  • If they’re reading sensitive fields, what code are they actually running? We want to be able to audit the actual code that’s being used to do any sensitive analysis.
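To make the annotation idea concrete, here is a minimal sketch in Go of tag-driven field filtering. This is not Google’s actual implementation: the `RequestLog` struct, the `logs:"sensitive"` tag, and the `justified` flag are all hypothetical stand-ins for the real proto annotations and justification workflow.

```go
package main

import (
  "fmt"
  "reflect"
)

// RequestLog is a stand-in for a logged protocol buffer. The
// `logs:"sensitive"` tag marks fields that may identify a user.
type RequestLog struct {
  ClientIP  string `logs:"sensitive"`
  LatencyMs int    // innocuous: how long the request took
}

// filterSensitive returns a copy of rec with every field tagged
// `logs:"sensitive"` reset to its zero value, unless the analyst
// has a justification on file for reading sensitive fields.
func filterSensitive(rec RequestLog, justified bool) RequestLog {
  if justified {
    return rec
  }
  v := reflect.ValueOf(&rec).Elem()
  t := v.Type()
  for i := 0; i < t.NumField(); i++ {
    if t.Field(i).Tag.Get("logs") == "sensitive" {
      v.Field(i).Set(reflect.Zero(v.Field(i).Type()))
    }
  }
  return rec
}

func main() {
  rec := RequestLog{ClientIP: "192.0.2.1", LatencyMs: 37}
  // ClientIP is cleared; LatencyMs passes through untouched.
  fmt.Printf("%+v\n", filterSensitive(rec, false))
}
```

The key property is that the filtering decision lives in the data’s annotations, not in the analysis code, so the same policy applies no matter what the analyst writes.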

In short, we want fine-grained control over who has access to data, and visibility into what they’re doing with it. Sawzall provided a good solution to all these issues. We ran a centralized service called Sawmill that managed all Sawzall analysis on our logs.

Figure 2: In the Sawmill execution environment, users send their Sawzall analysis scripts to Sawmill Server, which performs authorization, applies access filters, and launches a MapReduce job on the user’s behalf in a restricted execution zone where the user isn’t allowed to run arbitrary binaries.

You could send your Sawzall code to Sawmill, and it would check that you had access to the data you wanted to analyze. If you did, it would prepend code to your script to filter out any fields you didn’t have access to, and it would record your script for auditing purposes. Then it would start a MapReduce that ran your Sawzall code on each worker. Since your Sawzall code ran inside a sandbox, it could not access the raw, unfiltered logs data; it saw only filtered input.
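A much-simplified sketch of that submission path, in Go rather than Sawmill’s actual implementation, may help: all the names here (`Job`, `authorize`, `submit`, the `readable` map) are hypothetical, and the real service did this across RPCs and a restricted MapReduce zone.

```go
package main

import (
  "errors"
  "fmt"
)

// Job is a hypothetical stand-in for an analysis submission.
type Job struct {
  Analyst string
  Dataset string
  Script  string // the user's Sawzall source
}

// readable maps each dataset to the analysts allowed to read it.
var readable = map[string][]string{
  "weblogs": {"ada"},
}

func authorize(j Job) error {
  for _, a := range readable[j.Dataset] {
    if a == j.Analyst {
      return nil
    }
  }
  return errors.New("access denied: " + j.Analyst + " cannot read " + j.Dataset)
}

// submit authorizes the job, prepends field-filtering code to the
// script, records the exact code for auditing, and returns the
// script that would actually run on each worker.
func submit(j Job, auditLog *[]string) (string, error) {
  if err := authorize(j); err != nil {
    return "", err
  }
  filtered := "# strip fields " + j.Analyst + " may not read\n" + j.Script
  *auditLog = append(*auditLog, filtered)
  return filtered, nil
}

func main() {
  var audit []string
  script, err := submit(Job{Analyst: "ada", Dataset: "weblogs", Script: "emit ..."}, &audit)
  fmt.Println(err == nil, len(audit) == 1, script != "")
}
```

The point of the sketch is the ordering: authorization and filtering happen before any user code runs, and the audit log captures the code that actually executed.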


Problems with Sawzall


This setup is great for access control and auditing, but it also creates some problems. Since we’re relying on the Sawzall sandbox to enforce our access policies, we have to make sure that un-sandboxed code doesn’t run alongside our Sawzall analysis. If the analysis could call unsafe code (e.g. user-controlled C++ functions), it could bypass our sandbox and read sensitive fields before they’re filtered. Sawzall does provide a way of calling functions written in other languages as though they were Sawzall functions. These functions are called intrinsics, and they provide a bridge between Sawzall and the rest of the world.

At Google, intrinsics were commonly used to provide an interface to large, complex C++ libraries and to interact with external services via RPC. However, since intrinsics provide a way to break out of the Sawzall sandbox, each one needed to be carefully vetted for safety before it could be whitelisted for use. As more and more people started using Sawzall, the demand for new intrinsics grew quickly and became a common point of friction for interoperability with services or libraries from other teams within Google.

The need to prevent arbitrary un-sandboxed code from interacting with Sawzall analysis also put strong constraints on the execution environment where analysis runs. For example, if a user could run arbitrary programs alongside their sandboxed analysis, they would be able to inspect the memory of their Sawzall program and extract unfiltered data that they shouldn’t have access to. To avoid this scenario, we had to reserve compute resources for logs analysis with restrictions on what kinds of programs could be run and who could launch them, making our analysis infrastructure much less flexible.

These problems were manageable when Sawzall occupied a small, well-contained niche. But as the community using Sawzall became larger and more diverse, the problems became more acute and the limitations of a domain-specific language became more important.

Sawzall may be an excellent hand tool, but many teams at Google came to need something more akin to heavy industrial machinery. Sawzall is at its best for small, focused analyses. While Sawmill itself is large, sophisticated infrastructure that allows Sawzall analysis to scale up and process vast amounts of data efficiently, Sawzall is not well-suited for building large integrated pipelines with sophisticated testing and release management. Teams built their core business logic in Sawzall, but without an object system or any support for user-defined interfaces it became very hard to manage a large codebase. These problems aren’t unique to Google — other companies that have adopted Sawzall for their analytics needs have reported similar difficulties.

Sawzall likely could have continued as a small, niche language, but it was sufficiently useful that people wanted much more out of it, and those needs grew beyond what the language and its associated access control and execution model could provide.


Language-Agnostic Analysis


The first step toward solving these problems was removing the tight link between access controls on logs data and the Sawzall execution model. By placing these controls outside of the Sawzall sandbox, we can open the door for analysis written in any language without weakening our ability to control access to sensitive data.

If we allow users to run arbitrary un-sandboxed code on the data, we have to change the model for how we filter out sensitive fields. Once the data gets to the user’s binary, it’s too late for filtering. We therefore need a separate service that proxies access to the raw data and enforces our access control policies before the data ever makes its way to analysts.

We’ve built just such a system, called the logs proxy. It provides a language-agnostic interface for reading logs data, and it applies all the necessary filtering logic before sending the data along to clients. There are a few interesting wrinkles to this process (for example, what if I want to do a join that’s keyed by a field that will be filtered out?), and we’ve had to solve some tough performance optimization problems to handle the scale of analysis at Google, but the fundamental idea is very simple.
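A minimal sketch of the proxy boundary, under the assumption that it can be modeled as a channel between proxy and analysis code (the `Record` type and `serveFiltered` function are hypothetical; the real logs proxy is a networked service with far more machinery):

```go
package main

import "fmt"

// Record is a stand-in for a logged protocol buffer.
type Record struct {
  UserIP string // sensitive
  Path   string // innocuous
}

// serveFiltered plays the role of the logs proxy: it owns the raw
// records and streams redacted copies to the analysis code. The
// analysis side never holds a reference to the raw data.
func serveFiltered(raw []Record) <-chan Record {
  out := make(chan Record)
  go func() {
    defer close(out)
    for _, r := range raw {
      r.UserIP = "" // redact before the record leaves the proxy
      out <- r
    }
  }()
  return out
}

func main() {
  raw := []Record{{UserIP: "192.0.2.1", Path: "/search"}}
  for rec := range serveFiltered(raw) {
    // Analysis code sees only the filtered copy.
    fmt.Println(rec.Path, rec.UserIP == "")
  }
}
```

Because the redaction happens on the proxy’s side of the boundary, the analysis binary can be written in any language and run anywhere without weakening the access policy.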


Figure 3: In the logs proxy execution environment, user analysis code never has direct access to logs data. No restricted zone is necessary, because the logs proxy filters out sensitive fields before they’re available to analysis code.

Since the logs proxy decouples our data access policy from the programming language used for analysis, individual teams now have more freedom to choose the language that best fits their needs. However, since analysis libraries can often get very complicated, and multiple teams often share common data sources, there is an economy of scale in choosing a common language for most analysis.

At Google, most Sawzall analysis has been replaced by Go. Go has the advantage of being a relatively small language which is easy to learn and integrates well with Google’s production infrastructure. Fast compile times and garbage collection make Go a natural fit for iterative development. To ease the process of migrating from Sawzall, we’ve developed a set of Go libraries that we call Lingo (for Logs in Go). Lingo includes a table aggregation library that brings the powerful features of Sawzall aggregation tables to Go, using reflection to support user-defined types for table keys and values. It also provides default behavior for setting up and running a MapReduce that reads data from the logs proxy. The result is that Lingo analysis code is often as concise and simple as (and sometimes simpler than) the Sawzall equivalent.

As an example, consider the spam classification task from an earlier post about Sawzall on this site, where the goal is to measure the impact of two versions of a spam classifier on different websites. Here’s how that code looks in Lingo:

package spamcount

import (
  "google/sites"
  "google/spam"
  "google/table"
  "google/webpage"
)

// For each site, track whether or not it’s spam according to
// the old and new spam scores.
type SpamCount struct {
  Old int
  New int
  URLs int
}

func spamCount(score float64) int {
  // A record with a spam score above 0.5 counts as spam.
  if score > 0.5 {
    return 1
  }
  return 0
}

// stats is a sum table with string keys (site name), and
// SpamCount values (the old and new spam counts and total
// count of URLs).
var stats = table.Sum("my_stats", "site", SpamCount{})

func Mapper(w *webpage.WebPage) {
  // Each record is a protocol buffer of type WebPage, which
  // has a url field which the spam package can classify.
  stats.Emit(sites.SiteFromURL(w.GetUrl()), SpamCount{
    Old:  spamCount(spam.SpamScore(w.GetUrl())),
    New:  spamCount(spam.NewSpamScore(w.GetUrl())),
    URLs: 1,
  })
}


The structure of this Lingo program is very similar to its Sawzall equivalent, thanks to the table library. It outputs a table of summed spam counts, keyed by site name. The table library uses the same output encoding as Sawzall, so the output of this program is byte-for-byte identical to its Sawzall equivalent. This greatly simplifies the process of migrating away from Sawzall for interested teams.

The benefit of this work is that logs analysis is now much more flexible and better integrated into Google’s broader software ecosystem. The logs proxy has decoupled the choice of language from the execution and access control model for analysis, which gives teams the freedom to make their own determination about what language best suits their needs.


Conclusion


Moving away from Sawzall has been a huge job. In part that’s because Sawzall was quite successful at its original goal of making it easy for analysts to write quick, powerful analysis programs. As a result, there was a lot of Sawzall code to be migrated. However, Sawzall was in some ways a victim of its own success. There’s a natural tension for any domain-specific language between staying highly focused on its problem domain and growing to accommodate the needs of users who want to stretch the language in new directions. Sawzall’s development was shaped by this tension from the very beginning — early designs didn’t even include the ability to define functions, but functions were quickly added when it became apparent that the language couldn’t meet users’ needs without them. Over time, many more features were added. But as the language grows, the rationale for using a domain-specific language rather than a general-purpose language becomes more and more diluted.

Fortunately, we’ve found that with carefully designed libraries we can get most of the benefits of Sawzall in Go while gaining the advantages of a powerful general-purpose language. The overall response of analysts to these changes has been extremely positive. Today, logs analysis is one of the most intensive users of Go at Google, and Go is the most-used language for reading logs through the logs proxy. And for users who prefer a different language, the logs proxy provides a language-agnostic way to read logs data while complying with our access policies. Looking forward, we can’t predict exactly what direction logs analysis at Google will go next, but we do know that its path won’t be constrained by our choice of programming language.
