On procedural and declarative programming in MapReduce

by SEAN GERRISH and AMIR NAJMI

To deliver the services our users have come to rely upon, Googlers have to process a lot of data — often at web scale. For doing analyses quickly, it helps to abstract away as much of the repeated work as possible. In this post, we’ll describe some things we have learned about mixing declarative and procedural programming paradigms to simplify MapReduce as used by data scientists.



One goal of a data scientist’s software stack is to eliminate as much routine work as possible so she can spend more time on her comparative advantage: reasoning about data. Because a data scientist’s work is more abstract than a typical software engineer’s, the languages she uses often include declarative patterns — constructs by which the analyst specifies what she wants rather than how to get it, with the framework doing magic under the hood to get her the results.

There are many examples of declarative programming constructs out there for data gathering, SQL being one of the most obvious. As with all design, the challenge is to provide the right level of abstraction for the tool. Here we shall use the venerable Google workhorse Sawzall to illustrate a couple of lessons we have learned about using declarative abstractions on top of MapReduce. Sawzall is a programming language developed at Google for performing aggregation over the result of complex operations on structured data. While use of Sawzall at Google is in decline today, we believe the lessons discussed here have survived the test of time and are employed by descendant systems used throughout Google.

Figure 1: Sawzall is less expressive than MapReduce but cuts down on development time for many data-collection exercises

The problem Sawzall solves is that MapReduce is often too general: one must write a custom reducer to perform even basic aggregation (it’s called “bean counting” for a reason). At the other end of the spectrum, SQL-like languages are insufficiently expressive for the needs of a data scientist.

A key observation justifying Sawzall’s design is that, whereas mappers are often highly specialized to the input data, reducers for data analysis almost always implement a small handful of types of aggregation. Sawzall takes advantage of this insight by letting the mapper stay procedural while handling the aggregation step declaratively.
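
To make this division of labor concrete, here is a minimal sketch in Python (ours, purely for illustration; it is neither Sawzall nor how Sawzall is actually implemented). The analyst writes an arbitrary procedural function that runs on each record, while the aggregation is simply chosen by name from a small library of pre-built reducers:

  # A toy Python illustration of the split Sawzall encourages: a custom,
  # procedural per-record mapper plus a small library of generic reducers.
  # (Names and structure here are ours, purely for illustration.)

  # The "declarative" part: a handful of pre-implemented aggregations.
  AGGREGATORS = {
      "sum": lambda acc, value: acc + value,
      "max": lambda acc, value: max(acc, value),
  }

  def run(records, map_record, table_type):
      """Apply the user's mapper to each record, then aggregate values by key."""
      aggregate = AGGREGATORS[table_type]
      table = {}
      for record in records:                     # the map phase: fully procedural
          for key, value in map_record(record):  # the mapper emits (key, value) pairs
              table[key] = value if key not in table else aggregate(table[key], value)
      return table

  # The "procedural" part: the analyst writes only this per-record function.
  def count_urls_per_site(record):
      yield (record["site"], 1)

  records = [{"site": "a.com"}, {"site": "b.com"}, {"site": "a.com"}]
  print(run(records, count_urls_per_site, "sum"))  # {'a.com': 2, 'b.com': 1}

In Sawzall the loop and the aggregator library are, of course, supplied by the framework and run as a MapReduce; the analyst only ever writes the per-record piece.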


Record-level program scope


As a data scientist, you write a Sawzall script to operate at the level of a single record. The scope of each record is determined by the source of the data; it might be a web page, metadata about an app, or logs from a web server. Typically the record arrives in the form of a protocol buffer and is exposed to each program in a special Sawzall variable simply called “input”. The goal of your (record-level) Sawzall program is usually to extract data from each record and then to aggregate it, with the aggregation method determined by the type of output table you employ.

Apart from these output tables (which we describe below), there is no way to maintain state across multiple records. Modulo initialization code, the lifespan of all variables is just a single record. This may seem like a pretty big constraint because you cannot do analyses which require iterating over the data with state, such as fitting a logistic regression with stochastic optimization or finding the inverse of a giant matrix. However, it turns out to be quite useful for data science applications. In fact, we have found that the record-level scoping forces those who are new to the MapReduce paradigm to formulate their programs in a MapReduce-parallelizable way. By design, “(t)here is nothing in the language to enable examining multiple input records simultaneously, or even to have the contents of one input record influence the processing of another” [1]. Sawzall has thus been a useful learning tool for many of us.


Output tables


The most valuable innovation in Sawzall is probably the output table types. Despite Sawzall's decline, these are still actively used today. They are a simple abstraction for declaring a custom reducer. You declare one or more output tables in each program to accumulate statistics. As part of the declaration, the table is assigned an aggregation type, such as sum, sample(20) or quantile(100). Each table type corresponds to a pre-implemented reduce function which aggregates data in the specified way. In addition, tables allow an arbitrary tuple to key the aggregations.

The only output primitive in Sawzall is the emit statement. In the Sawzall script you "emit" output records with key-value pairs, and all values for a particular key are aggregated according to the table type. For example, let’s say you want to measure the impact of two versions of a spam classifier on different websites. You might write a program like this:

  import "sites" as "Sites";
  import "spam" as "Spam";

  proto "webpage.proto"
  # Define the sum table for counting statistics.
  my_stats: table sum[site: string]
                  of { old: int, new: int, urls: int };


  # Each record is a protocol buffer of type WebPage, defined
  # in webpage.proto.
  web_page: WebPage = input;
  site := Sites.SiteFromURL(web_page.url);

  spam_old_count := 0;
  if (Spam.SpamScore(web_page.url) > 0.5)
    spam_old_count = 1;

  spam_new_count := 0;
  if (Spam.NewSpamScore(web_page.url) > 0.5)
    spam_new_count = 1;

  # Count this site along with whether the url was spam
  # according to the old and new spam scores.
  emit my_stats[site] <- { spam_old_count, spam_new_count, 1 };


Note that the program above is very much procedural — it is a lot easier to read than an SQL-style user-defined function (and this would be even more true if we had added for loops to the script), and the imported libraries sites and spam do what could otherwise be heavy lifting within this script. They could themselves be implemented in Sawzall. If you run your program over all web pages in Google’s index, the table will contain key-value pairs like these made-up examples:

  { "www.datascience.com" } -> { old: 0, new: 1, urls: 40 }
  { "www.quora.com" } -> { old: 4, new: 2, urls: 970223 }
  { "www.nytimes.com” } -> { old: 1, new: 5, urls: 602010 }
  ...


The sum table is just one type of table aggregator. The only constraint on an aggregator is that it be implementable within the Reduce phase of a MapReduce. This means the aggregation function must be associative and commutative, i.e. its result cannot depend on the order in which records arrive. This is still rather flexible, and we have an extensive set of aggregation types, the most common of which are sum, maximum, sample, unique, top and quantile. This set of table types has grown over the years in keeping with our needs (it now even includes a type called bootstrapsum, which implements the Poisson bootstrap procedure we described in an earlier blog post). Not everything can be implemented as an output table, nor can everything be done in a single MapReduce. But for data analysis, the abstraction provided by output tables has proven extremely valuable to us over the years.
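
To see why these are exactly the right requirements, consider the following Python sketch (again ours, not Sawzall internals). Because a valid aggregator is associative and commutative, each shard of the data can be reduced independently and the partial results merged in any order without changing the final table:

  # A toy Python illustration (not Sawzall internals) of why table aggregators
  # must be associative and commutative: partial aggregates computed on
  # separate shards can be merged in any order with the same result.
  import random
  from functools import reduce

  def merge_sum(a, b):   # associative and commutative, so a valid aggregator
      return a + b

  values = list(range(100))
  random.shuffle(values)                          # record order is arbitrary

  # Split the emitted values across three "shards", as MapReduce might.
  shards = [values[0::3], values[1::3], values[2::3]]
  partials = [reduce(merge_sum, shard) for shard in shards]

  # Merging the partial sums in any order gives the same table entry.
  assert reduce(merge_sum, partials) == sum(range(100))
  assert reduce(merge_sum, reversed(partials)) == sum(range(100))

  # By contrast, something like "the last value seen" depends on record order,
  # so it could not be offered as a table type.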


The value of a strong procedural language for the mapper


Because of the implicit reducer, most of a Sawzall script describes the mapper. Thus, Sawzall’s C-like syntax, structure and strong static typing all facilitate writing custom procedural mappers. But if you are convinced of the value of a declarative reducer, you might wonder: why not have a declarative mapper as well? Indeed, SQL is a declarative query language used today in several systems on top of MapReduce (e.g. Apache Hive). We could even translate our Sawzall example code to pseudo SQL as follows:

  SELECT
    sites:FromURL(url) AS site,
    SUM(IF(spam:SpamScore(url) > 0.5, 1, 0)) AS spam_old_count,
    SUM(IF(spam:NewSpamScore(url) > 0.5, 1, 0)) AS spam_new_count,
    COUNT(*) AS pages
  FROM web_pages
  GROUP BY 1;

where sites:FromURL(), spam:SpamScore(), and spam:NewSpamScore() are some kind of user-defined functions (UDFs). For this example, the SQL doesn’t look much less readable than the Sawzall, but one could imagine more complex conditioning quickly growing unwieldy in SQL. Another potential issue, specific to SQL, is that the rich structure of the protobuf might not lend itself to an intuitive relational model. But most important to a data science team is how the UDFs are expressed, because the team is committing itself to building and maintaining much of its codebase in this language.

What we mean is that while SQL is great for high-level data gathering, complex flow control is best expressed in the language of UDFs. Often this language is meant to be a limited “escape hatch” from SQL. Initially tolerable, its limitations can become a huge problem for a data science team whose vast institutional knowledge about its data sources is best expressed in a reusable codebase. For instance, imagine writing a function to produce the canonical singular form of a word in a query (e.g. “cat” -> “cat”, “cats” -> “cat”, “canaries” -> “canary” in English) — this can be as simple as a few rules, or it could be arbitrarily complex, depending on language and context. Or imagine a function which wraps changes in the protocol buffer definition at some date in the past, say, to account for a business merger — before that date, customer_id is read directly from a given field, whereas after that date, customer_id must be prepended with the field division_id. Such functions and their variants would be developed, tested and maintained over time as part of a comprehensive codebase, which the UDF language might not be designed to support.
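
As a purely hypothetical illustration (ours, written in Python rather than any particular UDF dialect), here is the kind of small helper such a codebase accumulates. The names, rules and dates below are invented; the point is that functions like these need to be shared, tested and maintained for years:

  # Hypothetical helpers of the kind a data science team accumulates over time.
  # All names, rules and dates below are made up for illustration.
  import datetime

  def canonical_singular(word):
      """A deliberately crude English singularizer: a few rules, easily grown."""
      irregular = {"geese": "goose", "feet": "foot"}
      if word in irregular:
          return irregular[word]
      if word.endswith("ies"):
          return word[:-3] + "y"
      if word.endswith("s") and not word.endswith("ss"):
          return word[:-1]
      return word

  MERGER_DATE = datetime.date(2010, 6, 1)  # made-up date of the schema change

  def customer_id(record, record_date):
      """Wrap a historical change in how customer_id is recorded."""
      if record_date < MERGER_DATE:
          return record["customer_id"]
      return record["division_id"] + ":" + record["customer_id"]

  assert canonical_singular("cats") == "cat"
  assert canonical_singular("canaries") == "canary"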

We mentioned earlier that Sawzall is on its way out at Google. This decline itself reinforces the point that data science teams must plan on eventually building a strong codebase to express their domain knowledge. While Sawzall was a complete high-level language, it was designed for writing short analyses. Its authors state this assumption while explaining why it was originally implemented as an interpreter:
“(a)n interpreted language is fast enough: most of the programs are small and on large data sets the calculation tends to be I/O bound” [1].
By design, therefore, it does not provide support for writing large-scale reusable libraries — for example, it has no objects or abstract interfaces. Nevertheless, we ended up with large, unwieldy libraries in the language that were hard to maintain. This was an important factor in data science teams at Google moving away from Sawzall. In a future post we hope to describe the systems which replaced it.

Conclusion


There is a lot at stake in deciding how to mix declarative paradigms for data analysis with procedural programming within MapReduce. A decade of Google data scientists using the Sawzall programming language provides a useful lesson in the tradeoffs.


References


[1] Rob Pike, Sean Dorward, Robert Griesemer and Sean Quinlan. “Interpreting the Data: Parallel Analysis with Sawzall.” Scientific Programming, 13(4), 2005.
