Statistics for Google Sheets

By STEVEN L. SCOTT

Big data is new and exciting, but there are still lots of small data problems in the world. Many people who are just becoming aware that they need to work with data are finding that they lack the tools to do so. The statistics app for Google Sheets hopes to change that.

Editor's note: We've mostly portrayed data science as statistical methods and analysis approaches based on big data. But some of our readers have rightly pointed out that this may be too narrow a view. While big data remains a focus of this blog, there are exciting innovations happening in other areas as well. Steve's post is an excellent example of this, and we are thrilled to see him contribute this month's article.

Introduction

Statistics for Google Sheets is an add-on for Google Sheets that brings elementary statistical analysis tools to spreadsheet users. The app focuses on material commonly taught in introductory statistics and regression courses, with the

Next generation tools for data science

By DAVID ADAMS

Since its inception, this blog has defined “data science” as inference derived from data too big to fit on a single computer. Thus the ability to manipulate big data is essential to our notion of data science. While MapReduce remains a fundamental tool, many interesting analyses require more than it can offer. For instance, the well-known Mantel-Haenszel estimator cannot be implemented in a single MapReduce. Apache Spark and Google Cloud Dataflow are two alternative “next generation” data processing frameworks. This post compares the two, relying on the author’s first-hand experience and subsequent background research.

Introduction

The massive uptake of MapReduce is evidence that it was the solution for writing data processing pipelines that scale to hundreds of terabytes (or more). This was true within Google as well as outside of Google in the form of Hadoop/MapReduce (for some, “Hadoop” and “data science” are synonyms). However, it didn’t take long for the