Practical advice for analysis of large, complex data sets

By PATRICK RILEY

For a number of years, I led the data science team for Google Search logs. We were often asked to make sense of confusing results, measure new phenomena from logged behavior, validate analyses done by others, and interpret metrics of user behavior. Some people seemed to be naturally good at doing this kind of high quality data analysis. These engineers and analysts were often described as “careful” and “methodical”. But what do those adjectives actually mean? What actions earn you these labels?

To answer those questions, I put together a document shared Google-wide which I optimistically and simply titled “Good Data Analysis.” To my surprise, this document has been read more than anything else I’ve done at Google over the last eleven years. Even four years after the last major update, I find that there are multiple Googlers with the document open any time I check.

Why has this document resonated with so many people over time? I think the main reason is that it

Statistics for Google Sheets


Big data is new and exciting, but there are still lots of small data problems in the world. Many people who are just becoming aware that they need to work with data are finding that they lack the tools to do so. The statistics app for Google Sheets hopes to change that.

Editor's note: We've mostly portrayed data science as statistical methods and analysis approaches based on big data. But some of our readers have perfectly validly pointed out that this may be too narrow a view. While big data remains a focus of this blog, there are exciting innovations happening in other areas as well. Steve's post is an excellent example of this, and we are thrilled to see him contribute this month's article.

Introduction

Statistics for Google Sheets is an add-on for Google Sheets that brings elementary statistical analysis tools to spreadsheet users. The app focuses on material commonly taught in introductory statistics and regression courses, with the intent that st…

Next generation tools for data science


Since inception, this blog has defined “data science” as inference derived from data too big to fit on a single computer. Thus the ability to manipulate big data is essential to our notion of data science. While MapReduce remains a fundamental tool, many interesting analyses require more than it can offer. For instance, the well-known Mantel-Haenszel estimator cannot be implemented in a single MapReduce. Apache Spark and Google Cloud Dataflow represent two alternatives as “next generation” data processing frameworks. This post compares the two, relying on the author’s first-hand experience and subsequent background research.
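For readers unfamiliar with the Mantel-Haenszel estimator mentioned above, here is a minimal single-machine sketch. It assumes the textbook common odds ratio form over stratified 2x2 tables, which may not be the exact variant the post has in mind. The function name and the `(a, b, c, d)` cell layout are illustrative choices, not taken from the post, and the sketch does not attempt a distributed implementation.

```python
from typing import Iterable, Tuple

# One stratum is a 2x2 table given as (a, b, c, d):
#   a = exposed with event,    b = exposed without event,
#   c = unexposed with event,  d = unexposed without event.
# This layout is an assumption made for illustration.
Stratum = Tuple[float, float, float, float]


def mantel_haenszel_odds_ratio(strata: Iterable[Stratum]) -> float:
    """Common odds ratio: sum(a*d/n) / sum(b*c/n) over all strata."""
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d  # total count within this stratum
        if n == 0:
            continue  # an empty stratum contributes nothing
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0:
        raise ValueError("estimator undefined: sum of b*c/n is zero")
    return numerator / denominator


# Usage with two made-up strata.
example_strata = [(10, 20, 30, 40), (5, 5, 5, 5)]
estimate = mantel_haenszel_odds_ratio(example_strata)
```

Each term needs the total of its own stratum before the sums across strata can be formed, so the calculation has two levels of aggregation.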

The massive uptake of MapReduce is evidence that it was the solution for writing data processing pipelines that scale to hundreds of terabytes (or more). This was true within Google as well as outside of Google, in the form of Hadoop/MapReduce (for some, “Hadoop” and “data science” are synonyms). However, it didn’t take long for the pain of writi…