Measuring Validity and Reliability of Human Ratings
by MICHAEL QUINN, JEREMY MILES, KA WONG

As data scientists, we often encounter situations in which human judgment provides the ground truth. But humans often disagree, and groups of humans may disagree with each other systematically (say, experts versus laypeople). Even after we account for disagreement, human ratings may not measure exactly what we want to measure. How we think about the quality of human ratings, and how we quantify that understanding, is the subject of this post.

Overview

Human-labeled data is ubiquitous in business and science, and platforms for obtaining data from people have become increasingly common. It is therefore important for data scientists to be able to assess the quality of the data generated by these systems: human judgments are noisy and are often applied to questions whose answers may be subjective or rely on contextual knowledge. This post describes a generic framework for understanding the quality of human-labeled data, based around the concepts of validity and reliability.
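To make the idea of quantifying rater quality concrete, here is a minimal sketch, not taken from the framework itself, that computes two simple reliability statistics for a pair of raters: raw percent agreement and Cohen's kappa, a standard chance-corrected agreement measure. The toy ratings and the choice of statistic are illustrative assumptions.

# Illustrative sketch: quantifying agreement between two human raters.
# The ratings below are made-up toy data; Cohen's kappa is one common
# chance-corrected reliability statistic, not necessarily the measure
# used in the framework described in this post.
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Labels assigned by two raters to the same ten items
# (e.g., "is this search result relevant?": 1 = yes, 0 = no).
rater_a = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])
rater_b = np.array([1, 0, 0, 1, 0, 1, 1, 1, 1, 1])

# Raw percent agreement: easy to read, but ignores agreement expected by chance.
percent_agreement = np.mean(rater_a == rater_b)

# Cohen's kappa: 1 = perfect agreement, 0 = agreement at chance level.
kappa = cohen_kappa_score(rater_a, rater_b)

print(f"Percent agreement: {percent_agreement:.2f}")  # 0.80 for this toy data
print(f"Cohen's kappa:     {kappa:.2f}")

Percent agreement overstates quality when one label dominates, which is why chance-corrected statistics such as kappa are the usual starting point for reliability; validity, by contrast, asks whether the ratings capture the construct we actually care about.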