Unintentional data


A large part of the data we data scientists are asked to analyze was not collected with any specific analysis in mind, or perhaps any particular purpose at all. This post describes the analytical issues which arise in such a setting, and what the data scientist can do about them.

A landscape of promise and peril

The data scientist working today lives in what Brad Efron has termed the "era of scientific mass production," of which he remarks, "But now the flood of data is accompanied by a deluge of questions, perhaps thousands of estimates or hypothesis tests that the statistician is charged with answering together; not at all what the classical masters had in mind. [1]"

Statistics, as a discipline, was largely developed in a small data world. Data was expensive to gather, and therefore decisions to collect data were generally well-considered. Implicitly, there was a prior belief about some interesting causal mechanism or an underlying hypothesis…

Fitting Bayesian structural time series with the bsts R package


Time series data are everywhere, but time series modeling is a fairly specialized area within statistics and data science. This post describes the bsts software package, which makes it easy to fit some fairly sophisticated time series models with just a few lines of R code.

Introduction Time series data appear in a surprising number of applications, ranging from business, to the physical and social sciences, to health, medicine, and engineering. Forecasting (e.g. next month's sales) is common in problems involving time series data, but explanatory models (e.g. finding drivers of sales) are also important. Time series data are having something of a moment in the tech blogs right now, with Facebook announcing their "Prophet" system for time series forecasting (Taylor and Letham 2017), and Google posting about its forecasting system in this blog (Tassone and Rohani 2017).

This post summarizes the bsts R package, a tool for fitting Bayesian structural time…