Quantifying the statistical skills needed to be a Google Data Scientist
by DAVID MEASE and AMIR NAJMI
What does someone need to know in order to be a successful data scientist at Google? This blog post shares a set of questions that Google data scientists answered, along with how they did. See how much you agree with the authors’ view of the importance of these questions in assessing practical data science ability.
Defining "Data Scientist"
If you look through job listings at Google for data scientists, you will find a role called “Data Scientist - Research” (DS-R for short). This role has several explicit requirements including statistical expertise, programming/ML, communication, and data analysis/intuition. Focusing narrowly on the first of these, the description currently states that candidates “will bring scientific rigor and statistical methods to the challenges of product creation”. Internally, more detailed text descriptions exist to parallel these external descriptions. For this DS-R role at Google, part of the internal text description indicates that employees in this role should possess “knowledge of statistics and optimization models and methods” and that this knowledge increases from “proficient” for newly hired recent PhD graduates to “advanced” and eventually “expert” as the employee is promoted to higher levels. How much knowledge of statistics and optimization is required? What kind of knowledge would take a person from "proficient" to "advanced"? Without additional context, these descriptions are ambiguous.

Of course, addressing ambiguity is a key aspect of data science. As past posts on this blog have discussed, there are statistical as well as semantic aspects to uncertainty. This problem of characterizing and quantifying uncertainty takes on a particular form when the data is that of human judgments. This post describes an attempt to apply data science principles towards mitigating the ambiguity in job descriptions for data scientists. Specifically, we want to tell you about a project to provide greater clarity on what a candidate needs to know in order to be a successful data scientist at Google. This exercise turned out to be challenging but also rather illuminating. Even though our findings refer to one data science role at Google at a single point in time, we nevertheless believe it will be of broad interest to data scientists to learn what we found and what we might do differently.
A word about the authors
At the time of writing, the two authors of this post have together conducted over 600 interviews during their time at Google. Both work in the DS-R role that is the main focus of this post. In addition to interviewing, we have served on the hiring and promotion committees for that role. We have also been involved in many other aspects of the interview process including developing interviewer training, reviewing questions for the interview question bank, editing external job descriptions, and working with leadership to create internal job descriptions to clarify how the DS-R role differs from other roles within the company. This blog post draws on much of that experience.

Strengths and limitations of job descriptions
The text of the job description plays a critical role in whom we hire into the data scientist role. We try our best to specify a clear set of job requirements so that staffing personnel can recruit promising candidates, interviewers can ask appropriate questions, and hiring committees can make consistent hiring decisions. Crucially, potential candidates can use these job requirements to self-select into the hiring process.

However, there are a number of shortcomings with the use of text descriptions to define job roles. Despite our best efforts, text descriptions inevitably remain somewhat ambiguous and open to interpretation. Of course, ambiguity is a fundamental aspect of human communication that must be mitigated through shared context. To that end, we have extensive practices inside Google to ensure that relevant individuals have a shared understanding of what the text descriptions mean: these include interviewer training, alignment discussions as part of performance reviews and hiring committees. We also get to see the impact of individuals inside the company and relate that to their level of technical skill. Thus, in an environment of ongoing information exchange, any group of data scientists has enough shared context to put flesh on the bones of the text descriptions.
All this context is not visible to those outside the company. To the outside, text descriptions can only convey the skills needed in an imprecise and non-quantifiable manner and remain subject to interpretation. Thus it can be difficult for a candidate considering applying to a position to determine whether they possess the level of statistical skills needed. Likewise, it is difficult for a recruiter hiring for these roles to determine which candidates are likely to possess the requisite level of skills.
To a lesser extent, analogous problems arise internally within the company for employees looking to move from one role to another. At Google, in addition to the “Data Scientist - Research” (DS-R) role mentioned above, there are other data science roles including “Business Data Scientist”. This latter role uses the exact same text in its internal description for the statistical skills (“knowledge of statistics and optimization models and methods” that ranges from “proficient” to “expert”). This text would suggest that the level of statistical skills needed is equivalent or at least quite similar across these two roles, which is generally not believed to be the case in practice. At least internally, we can agree upon and communicate divergent interpretations of the different data science roles.
In fact, we see text descriptions being reused with external job postings as well. For example, the text “will bring scientific rigor and statistical methods to the challenges of product creation” in the Google job posting can be found word for word in the job postings of other companies. Given how easy and common it is to reuse text descriptions, and given that these descriptions are neither quantifiable nor even falsifiable, one wonders how informative they are.
This problem of course is not unique to statistics skills for a data science role. Similar challenges exist in describing any technical skills for a number of roles, e.g., programming skills for a software engineering role. For statistics, however, the problem is arguably more severe — and increasingly so — with the expansion of the field of data science in recent years. Due to its rising popularity, more and more professionals self-identify as data scientists. As a result, the range of statistical skills that a data scientist may possess has become quite wide. Someone who has earned an advanced degree in statistics or has acquired specialized statistical skills by other means will likely describe themselves as a data scientist today; so too will someone who enjoys working with data but has an extremely limited statistics skill set. Given that the application of data science is itself evolving along with technology, we would not expect the skills required to be a successful data scientist to stay constant over time. But even so, we want to be thoughtful and precise about those requirements, not leave them subject to drift in meaning.
Communicating through calibrated examples
When asking humans to make judgments, say in their capacity as raters or moderators, it is common to accompany text guidelines of the rating standards with actual examples. Likewise, our main idea here was to create examples that illustrate what the text descriptions are aiming to capture. These examples take a form much like interview questions that a candidate would be asked in terms of subject matter and depth. But unlike questions that would be asked in an actual interview, we used multiple choice questions with a single correct answer. This allows for objectivity and makes the level of expertise quantifiable in terms of the accuracy over the set of questions. The downside of multiple choice is the additional noise that comes from luck and ignoring nuance. Most importantly, we cannot see how a person thinks. This makes multiple choice questions unsuitable for actual hiring decisions. But here we were evaluating the questions themselves rather than the data scientists.

As noted earlier, there are several different jobs at Google involving data science. Given our own experience in the DS-R role, we chose questions that we felt were appropriate for that job. We therefore calibrated these questions on current Googlers employed as DS-R as a way to explicitly quantify the level of statistical skills for that specific role. We did not calibrate these questions against the responses from data scientists in other roles, since these were specifically chosen for DS-R. As detailed in the discussion of individual questions, many of the incorrect answer choices were motivated by incorrect statements given over the years by actual candidates interviewing for the DS-R job. In fact, many of the questions themselves are variations or components of actual interview questions that have been asked in the past for this role.
Calibration results
To calibrate the 10 questions, we gave them to a simple random sample of $n=30$ current Google employees in the DS-R role. From these $n=30$, there were 4 non-responses (it was voluntary) and the remaining $n^*=26$ had a mean score of 7.00 (standard deviation = 1.98, standard error = 0.39) correct out of a possible perfect 10.

You can see the questions in Appendix #1 in the same format as the $n^*=26$ respondents received them. To get the most out of this post, we suggest you try to answer them yourself and score yourself at the end of this section. Doing so will make the in-depth discussions that follow in Appendix #2 more meaningful. There is also some discussion of specific questions in the next section. Thus for readers interested in scoring themselves on these questions, this is the best time to do so.

(Go ahead and do the test now. We’ll be here when you return!)
Debriefing our respondents
Our first observation from the results is that the questions vary substantially in difficulty. As you can see in the graph below, some questions have an accuracy of over 80%, while others were much lower, with question #5 actually being less than 50%.

[Graph: per-question accuracy among the $n^*=26$ respondents]

Beyond the data itself, we learned a lot from discussing the questions with the respondents, once all $n^*=26$ responses were collected.
A general theme was that many of the respondents were pleasantly surprised with their performance. Many described having initially been apprehensive about participating in the exercise because it had been a long time since they had interviewed for any position or taken any sort of a similar quiz. Hence they were quite happy to learn that they had high accuracy in their answers. We viewed this somewhat as a validation in that the questions were meant to represent skills that are not easily forgotten over time, but rather skills internalized through application in a practical setting. We didn’t want textbook questions on the one hand, nor questions that could only be answered by industry veterans.
Another theme was that respondents generally felt that all of the questions were fair, but thought that some were “tricky” in that it was easy to make a mistake if they didn’t read every word carefully. An example is choice C in Question #6 which mentions a hypothesis test being overpowered. As discussed in Appendix #2, respondents mentioned this was tricky because choice C in many situations is a correct statement, but an incorrect choice in this particular case only because the question implies that the null hypothesis was not rejected. They said it could be easy to overlook or misread that important aspect.
There were very few cases of any respondents disagreeing with the correct answer. There was one fairly lengthy discussion with a respondent who felt choice A should be a correct answer for question #1. But in the end, the respondent admitted they were indeed somewhat confusing the concepts of “interaction” and “independence” (see discussion in Appendix #2). In this way, we believe the questions are indeed fair, but not necessarily easy (by design) in that surface knowledge of statistical concepts may be insufficient. Rather, the questions are meant to test for a deeper understanding of the concepts and thus don’t permit confusing one concept with another.
What we think went well
An obvious limitation of this work is that it lacks formal validation. Given the voluntary participation, we are not able to directly measure the relationship between test performance and long term success in the DS-R role. Nevertheless, there are several reasons why we are pleased with this set of questions:
- Face validity: as described earlier, the volunteers who took the test generally felt that the questions were fair, regardless of how well they themselves did.
- Informal correlation: Individuals who scored highest on the questions are, in our judgment, very strong performers in the DS-R role. A two-way split by job level was not statistically significant, though the test had little power.
- Construction: Many of the wrong answer choices were motivated by incorrect statements made by candidates during actual interviews. Candidates making these incorrect (“red flag”) statements generally went on to perform poorly overall in the full interview. Thus, to the extent the interview process itself succeeds in being predictive of performance in the DS-R role at Google, it lends credibility to the questions.
- Internal consistency: Correctly answering any given question was strongly predictive of accuracy on other questions. To assess predictiveness, we conducted the following analysis: given whether or not an individual correctly answered question x, what is their accuracy on the other nine questions? With a single exception, answering any given question correctly was associated with higher accuracy on the other questions (for 7 questions, the increase was at least 10%). By this measure, Question #7 seems suspect, and we discuss it in detail. A minimal sketch of this computation appears after this list.
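To make the internal consistency check concrete, here is a minimal sketch of that computation in Python. The response matrix below is randomly generated purely for illustration (the actual respondent-level answers were kept confidential), so only the shape of the calculation, not the printed numbers, reflects the analysis described above.

```python
import numpy as np

# Hypothetical 26 x 10 matrix of graded answers (1 = correct, 0 = incorrect).
# Randomly generated stand-in for the confidential respondent-level data.
rng = np.random.default_rng(0)
responses = (rng.random((26, 10)) < 0.7).astype(int)

def internal_consistency(responses):
    """For each question q, compare mean accuracy on the other nine questions
    between respondents who answered q correctly and those who did not."""
    n_respondents, n_questions = responses.shape
    results = []
    for q in range(n_questions):
        others = np.delete(responses, q, axis=1)      # drop question q
        other_accuracy = others.mean(axis=1)          # per-respondent accuracy on the rest
        correct_q = responses[:, q] == 1
        acc_if_correct = other_accuracy[correct_q].mean()
        acc_if_incorrect = other_accuracy[~correct_q].mean()
        results.append((q + 1, acc_if_correct, acc_if_incorrect))
    return results

for q, acc_right, acc_wrong in internal_consistency(responses):
    print(f"Q{q}: accuracy on the other nine = {acc_right:.2f} if correct, {acc_wrong:.2f} if incorrect")
```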
Limitations
While we do feel the conversations with the respondents gave us increased confidence in the quality of these questions, this was our first attempt at such an exercise and we believe that none of the ten questions we developed are perfect. During question development, we made some modifications and improvements from our initial drafts to the “final” versions that were used for the $n^*=26$. But as you will see in the discussion of the questions in Appendix #2, there are further learnings from the $n^*=26$ responses which could be used to improve the questions even more. In this way (as with actual interview questions) the process of question creation is arguably inherently iterative.

We also fully expect to evolve the set of questions itself over time for a number of possible reasons. For example, the current coverage of the questions chosen may not be an accurate representation of the most important focus areas and subtopics within the discipline of statistics for the DS-R role. Of the ten questions we created, there are four that deal largely with modeling (questions #1, #3, #4 and #8). While modeling is indeed a very important skill for the DS-R role, it is quite possible that being the focus of four of the ten questions makes it over-represented. In future iterations we will likely shift the balance to match more closely the balance of skills needed (which may itself shift over time).
Discussion
Despite these limitations, we want to emphasize that we view the overall exercise as successful in terms of solving the problem we set out to solve, namely, how to bring greater precision to the text descriptions. In particular, we believe that a data scientist who wants to better understand the statistical skills needed for the “Data Scientist - Research” role at Google will learn more deeply and precisely from reading these ten questions and the performance of our sample than by reading the text description of the role alone. To reiterate, statistical expertise is one of several requirements for the DS-R role, so these are best seen as necessary but not sufficient. Furthermore, DS-R is just one of several data science roles at Google (e.g., Data Scientist - Product, and Data Scientist - Marketing). We took a narrow focus in order to be concrete.

We believe the people sampled are a good cross-sectional representation of the current members of the DS-R role at Google, and the questions capture well the depth of statistical skills possessed by these members. We can and will improve the representativeness of the questions in future iterations by way of the subtopics covered, and we will endeavor to write the questions to be less “tricky”. But in terms of the questions hitting the right depth of understanding, we do feel we have already landed on the target with these ten questions. They do not test for facts which are memorized and then forgotten, but rather test for working statistical knowledge that is core to the DS-R role at Google.
We hope that publishing these questions (and answers) will have a number of benefits:
- We hope that instructors of statistics and data science courses will gain value from seeing these example questions to guide them in how they might best prepare their students for positions in industry that require a strong applied knowledge of statistics.
- We plan to include these questions in Google interview preparation materials sent to candidates to help them self-assess their fit for the DS-R role. For this use case, candidates will also likely appreciate that we have measured the mean (=7.0/10) on a simple random sample of employees currently in the role so that they can compare their own score against it.
- Finally, we hope this may serve as a template for other employers and other job families to help them more concretely communicate their expectations to candidates with respect to certain skills needed. This would not replace the text descriptions currently used to convey these skills, but would likely prove to be a valuable and informative supplement.
For these reasons, when doing such an internal calibration exercise, it is important to keep the data confidential and emphasize that the goal is to measure the current distribution and not the skills of any one individual. In our internal exercise using the simple random sample of $n^*=26$ Google employees, we only shared aggregate measures. Employees were told their own score confidentially if they asked for it, but we did not even share a histogram of the scores, as doing so might unduly draw attention to the lowest scores, and those employees may feel alienated (even while being anonymous).
Despite the challenges, we feel pleased with the results of this first attempt. We look forward to a broader discussion followed by iterative improvement. In the meantime, we hope you now have a clearer understanding of the statistical skills required for the Data Scientist - Research role at Google. We hope that some of you will be motivated to come work with us!
Acknowledgements
The authors are extremely grateful to David Goldberg and Chong Zhang for all of their help developing the ten questions, and also to the 26 randomly selected anonymous respondents for taking time to provide their answers to the questions.

Appendix #1 Questions
Question #1: Someone is fitting a linear regression model with a predictor (y) regressed on two variables (x1 and x2). They are trying to decide if they should also include an interaction between x1 and x2 in their model or not. What would be the most reasonable consideration in making this decision:
A. Whether or not x1 and x2 are independent.
B. Whether or not x1 and x2 are highly correlated.
C. Whether or not the interaction improves the fit of the predicted y values vs the actual y values on test data.
D. Whether or not the intercept is statistically significant in the model.
E. Whether or not the Kolmogorov-Smirnov test for normality is statistically significant for the residuals from the model.
Question #2: Someone is concerned that the p-values in their A/B experiment platform are not correct. In order to investigate they run 100 (unrelated, non-overlapping) experiments using that platform in which the test and control conditions are set to be the same. (These are sometimes called "A/A tests".) They use a significance level of alpha=.05. What should be true of the resulting 100 p-values?
A. As the experiments run longer and longer, the p-values should get closer and closer to zero.
B. The p-values should all be near zero.
C. The p-values should all be near one.
D. The p-values should all be near 0.05.
E. Roughly 10% of the p-values should be below 0.10.
F. More than 5% of the p-values should be below 0.05.
G. Less than 5% of the p-values should be below 0.05.
H. The p-values will have a fairly symmetric and unimodal distribution with a peak near .50.
Question #3: Someone is fitting a linear regression model with two predictors x1 and x2. The x2 predictor is ordinal in nature taking the three values small, medium and large. They decide to encode this as small=1, medium=2 and large=3 and simply include it in the model as a linear term. This could be problematic for a number of reasons. Which of the following concerns would represent the strongest argument for NOT doing this?
A. This model allows for predictions for x2 at values that are not 1,2 or 3. For example, it allows a prediction for x2=1.5 which is not meaningful.
B. This model allows for predictions for x2 at values outside of the range of 1 to 3. For example, it allows a prediction for x2=4 which is not meaningful.
C. If x1 is held constant, this model assumes that the expected response for x2=3 is three times the expected response for x2=1 which may not be true.
D. If x1 is held constant, this model assumes that the expected difference in the response for x2=3 vs x2=1 is twice that of x2=2 vs x2=1 which may not be true.
E. This model assumes that x2 has a roughly equal number of observations for x2=1 and x2=2 and x2=3 which may not be true.
Question #4: Two data scientists are doing analysis of two categorical variables (country = USA/Canada/Mexico and phone type = iPhone/Android/other) as it relates to a numeric response variable (life expectancy in years). One data scientist simply analyzes the 9 means and the other fits a linear regression model. They arrive at very different conclusions regarding which combination of country and phone type has the highest life expectancy. What is the most likely reason?
A. If the life expectancy is right skewed, the model assumptions may be violated.
B. If the model does not include an interaction term, it may give quite different results from the analysis of the 9 means.
C. The model will properly control for the right-censoring in the data, while the analysis of the 9 means ignores this.
D. The 'other' category may have a small sample size and may be removed from the model due to lack of statistical significance.
E. The proper model is a logistic regression model which would have been equivalent to the analysis of the 9 means.
Question #5: A data scientist is trying to predict the future sale price of houses. For their predictions, they are considering using either the average sale price of the three (k=3) geographically closest houses that most recently sold or the average sale price of the ten (k=10) geographically closest houses that most recently sold. Which of the following statements is most correct?
A. k=3 will always work best because k=3 is more similar to k=1 for which the median and the mean are the same and the median is more robust to outliers than the mean.
B. k=3 may work best because the other 7 houses may be quite different.
C. k=10 will always work best because it uses more of the data.
D. k=10 may work best because it will include a more diverse set of houses.
E. k=3 will work best because 3 is an odd number and 10 is even.
Question #6: A data scientist has counts of people broken down by country = USA / Canada / Mexico and phone type = iPhone / Android / other. They want to do some modeling and analysis with this data but first want to determine if country and phone type are independent. They carry out a chi-squared hypothesis test for independence using this and conclude that they are indeed independent based on this. Going forward in the rest of the analysis they assume independence. What could be a problem with using the chi-squared hypothesis test for independence to conclude independence here?
A. There may be missing data which needs to be accounted for or imputed.
B. The analysis described does not control for multiple testing.
C. If the data set is large, the hypothesis test may be overpowered.
D. The data may not be normally distributed so the hypothesis test is invalid.
E. A chi-squared hypothesis test for independence will almost always show independence if the sample size is sufficiently small.
Question #7: A data scientist wants to remove outliers in their dataset. They decide to remove anything more than 3 standard deviations above or below the mean. What could be problematic about this approach?
A. They have not confirmed that the data is normally distributed. Many distributions have a very large fraction of data outside this range and they may be removing a majority (more than 50%) of the data if it isn't normal.
B. The standard deviation itself is impacted by outliers. It is better to use the interquartile range multiplied by a constant.
C. For large datasets, the standard deviation will be close to zero since it scales with the square root of the sample size.
D. The usual number to use is two standard deviations. It is very unusual to use three standard deviations without a strong reason.
Question #8: The variable y is numeric, the variable x1 is numeric and the variable x2 is categorical taking on 5 unique values representing 5 colors (the five values of x2 are red, blue, yellow, green and purple.) A data scientist fits a linear regression model with y as the response and x1 and x2 as predictors (an intercept was fit as well but no interactions). A second data scientist decides instead to fit 5 separate regression models. In other words, they fit a regression model with y as the response and x1 as the predictor for each of the 5 colors separately (and each regression model includes an intercept). Both data scientists are fitting the models using the same (training) dataset. Which of the following is true regarding the predictions for the response variable y from these two approaches?
A. These two approaches will give exactly the same predictions for all values of x1 and x2.
B. These two approaches will give exactly the same predictions for all values of x1 in the training data, but may give different predictions for test data with new values for x1.
C. These two approaches will give exactly the same predictions for all values of x1 and x2 in the training data, but may give different predictions for test data with new values for x1 and x2.
D. These two approaches may give different predictions. They would give exactly the same predictions if the first data scientist had included the interaction between x1 and x2.
E. These two approaches may give different predictions. They would give exactly the same predictions if x1 and x2 are statistically independent.
F. These two approaches may give different predictions. They would give exactly the same predictions if there are equal numbers of observations for the five colors red, blue, yellow, green and purple.
Question #9: In a certain time period, an economist noted that average sale prices for cars had increased 10%. They split the data into gasoline cars and non-gasoline cars (two mutually exclusive categories) to analyze further. They observed that average sale prices for each were decreasing over this time period. Which of the following is true?
A. This can not be possible. At least one of the two must have increased if the overall average sale price increased.
B. We know total gasoline car sales decreased over this time period.
C. We know total non-gasoline car sales increased over this time period.
D. We know gasoline cars and non-gasoline cars have different average sale prices.
Question #10: A statistical model for market share showed that in June market share was not statistically significantly different from the target value of 50% market share. The same model using the same data showed that in July of the same year, the market share was indeed now statistically significantly different from the target value of 50% market share. Which of the following must be true?
A. The difference between June and July market share is statistically significant.
B. The difference between June and July market share is not statistically significant.
C. The difference between June and July market share is statistically significant if using a significance level of 2*alpha where alpha is the original significance level.
D. The difference between June and July market share is statistically significant if using a significance level of alpha/2 where alpha is the original significance level.
E. None of these.
Appendix #2 Answers and Discussion of Questions
The correct option for each question is marked below.

Question #1: Someone is fitting a linear regression model with a predictor (y) regressed on two variables (x1 and x2). They are trying to decide if they should also include an interaction between x1 and x2 in their model or not. What would be the most reasonable consideration in making this decision:
A. Whether or not x1 and x2 are independent.
B. Whether or not x1 and x2 are highly correlated.
C. Whether or not the interaction improves the fit of the predicted y values vs the actual y values on test data. (correct answer)
D. Whether or not the intercept is statistically significant in the model.
E. Whether or not the Kolmogorov-Smirnov test for normality is statistically significant for the residuals from the model.
(Minor point: the text above is exactly as the respondents saw it, but the first sentence should have said "response (y)" rather than "predictor (y)". We assume no one was confused by this since none of the respondents mentioned it).
Overall, 20/26 (=77%) of the respondents answered this question correctly.
We note that one potential drawback of this question is that it does rely heavily on knowledge of various terms, most importantly the term “interaction”. It is possible that a respondent knows the concept well but does not remember the terminology. It may have been feasible to work around this by spelling things out in code or mathematical notation (such as x1*x2), but at the same time that could have been confusing for the majority of respondents who do remember the terms and may be less comfortable with such alternative representations.
Of the six respondents who answered this question incorrectly, five of the six chose choice A (“A. Whether or not x1 and x2 are independent.”). This (incorrect) choice was intentionally included since in open ended interview questions involving interactions, we had observed candidates confusing the concepts of interaction and independence. The results here seem consistent with that observation. Possibly these candidates confuse the concepts since if there is no interaction then x1 and x2 independently impact the response. But of course the concepts are actually quite different since independence between x1 and x2 is defined outside of any notion of a model or the variable y. Further, independence between x1 and x2 involves considering them as random variables, while in a traditional regression setting x1 and x2 would not be considered random. Another possibility is respondents simply confuse the concepts since both words (interaction and independence) begin with the letter “i”, which relates somewhat to the previous concern that the question relies heavily on knowledge of terms.
The one remaining respondent chose choice D (“D. Whether or not the intercept is statistically significant in the model.”) The main motivation for including this incorrect choice was that it was yet another term beginning with the letter “i”.
We intentionally did not include any choices about the statistical significance of the interaction term. There could be room for healthy debate whether it is better practice to make the decision to include an interaction term or not based on its statistical significance in the model, or based on how much it improves fit on test data (our correct choice). We did not want to require respondents to make a call along these lines since very strong statistical thinkers could easily disagree on this point.
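The following is a small simulation sketch of our own (not part of the calibration exercise) that illustrates both points above: x1 and x2 are generated independently, yet including their interaction still clearly improves the fit of predicted vs actual y values on held-out test data, which is exactly the consideration in the correct choice C.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data: x1 and x2 are independent by construction,
# but y genuinely depends on their interaction.
n = 2000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 0.5 * x1 - 0.3 * x2 + 0.8 * x1 * x2 + rng.normal(size=n)

train, test = slice(0, 1000), slice(1000, None)

def test_mse(columns, y):
    """Fit OLS (with intercept) on the training half, return MSE on the test half."""
    X = np.column_stack([np.ones(n)] + columns)
    beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    return np.mean((y[test] - X[test] @ beta) ** 2)

mse_no_interaction = test_mse([x1, x2], y)
mse_interaction = test_mse([x1, x2, x1 * x2], y)
print(f"Test MSE without interaction: {mse_no_interaction:.2f}")
print(f"Test MSE with interaction:    {mse_interaction:.2f}")  # noticeably lower here
```

In real data the improvement would of course be judged against noise (for example via cross-validation), but the principle is the same.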
Question #2: Someone is concerned that the p-values in their A/B experiment platform are not correct. In order to investigate they run 100 (unrelated, non-overlapping) experiments using that platform in which the test and control conditions are set to be the same. (These are sometimes called "A/A tests".) They use a significance level of alpha=.05. What should be true of the resulting 100 p-values?
A. As the experiments run longer and longer, the p-values should get closer and closer to zero.
B. The p-values should all be near zero.
C. The p-values should all be near one.
D. The p-values should all be near 0.05.
E. Roughly 10% of the p-values should be below 0.10.
F. More than 5% of the p-values should be below 0.05.
G. Less than 5% of the p-values should be below 0.05.
H. The p-values will have a fairly symmetric and unimodal distribution with a peak near .50.
Overall, 17/26 (=65%) of the respondents answered this question correctly.
The goal of this question was to assess knowledge about p-values being uniform under the null hypothesis but in the specific context of A/B testing. We wanted to avoid too much dependence on statistical terminology (recall the previous question could be faulted for requiring respondents to know the term “interaction”), so we intentionally avoided using “null hypothesis” or “uniform” in the question or answer choices.
Our hope was that respondents who knew p-values were uniform under the null hypothesis and could apply that knowledge to the context of A/B testing would do well on this question, and also respondents who had general practical experience with A/B testing would do well. Conversely, respondents who knew p-values were uniform under the null hypothesis but weren’t able to apply that to this practical context might struggle.
We often list p-values and A/B testing as topics with which candidates should be familiar for the “Data Scientist - Research” role at Google. This question is nice in that it gives a concrete example of both of these topics simultaneously.
Among the incorrect responses, choice H (“H. The p-values will have a fairly symmetric and unimodal distribution with a peak near .50.”) was the most common with 4 respondents out of the 26 choosing it. This choice is likely attractive since so many empirical distributions do tend to be unimodal and symmetric especially for large samples. Other than choice H, all of the other incorrect responses were chosen by two or fewer respondents.
Overall this question seemed to work quite well as designed. Specifically it is able to test a theoretical fact in an applied context with little dependence on terminology.
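As a small illustration of what a correctly behaving platform looks like in this setting, here is a simulation sketch of the A/A setup on made-up data, with a plain two-sample t-test standing in for the experiment platform: the null hypothesis holds in every experiment, so the resulting p-values come out roughly uniform on [0, 1].

```python
# 100 simulated A/A tests: test and control are drawn from the same distribution,
# so the p-values should be approximately uniform.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
p_values = np.array([
    stats.ttest_ind(rng.normal(size=500), rng.normal(size=500)).pvalue
    for _ in range(100)
])

print("fraction below 0.05:", (p_values < 0.05).mean())   # roughly 5%
print("fraction below 0.10:", (p_values < 0.10).mean())   # roughly 10%
```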
Question #3: Someone is fitting a linear regression model with two predictors x1 and x2. The x2 predictor is ordinal in nature taking the three values small, medium and large. They decide to encode this as small=1, medium=2 and large=3 and simply include it in the model as a linear term. This could be problematic for a number of reasons. Which of the following concerns would represent the strongest argument for NOT doing this.
A. This model allows for predictions for x2 at values that are not 1,2 or 3. For example, it allows a prediction for x2=1.5 which is not meaningful.
B. This model allows for predictions for x2 at values outside of the range of 1 to 3. For example, it allows a prediction for x2=4 which is not meaningful.
C. If x1 is held constant, this model assumes that the expected response for x2=3 is three times the expected response for x2=1 which may not be true.
D. If x1 is held constant, this model assumes that the expected difference in the response for x2=3 vs x2=1 is twice that of x2=2 vs x2=1 which may not be true.
E. This model assumes that x2 has a roughly equal number of observations for x2=1 and x2=2 and x2=3 which may not be true.
Overall, 22/26 (=85%) of the respondents answered this question correctly.
The four respondents who answered this question incorrectly all chose choice C (“C. If x1 is held constant, this model assumes that the expected response for x2=3 is three times the expected response for x2=1 which may not be true.”). This choice is very close to the correct answer, but considering the presence of x1 in the model and also the presence of a potentially non-zero intercept, this in general is mathematically incorrect. This choice was included because interview candidates often provide this incorrect explanation. These candidates have the correct general idea but are not being precise in their explanation.
Choices A and B are other explanations commonly given by interview candidates. These do represent concerns, but certainly do not represent strong arguments against the specific encoding as asked in the question.
Unlike the other incorrect choices included, choice E is actually not a common explanation given by interview candidates and was fabricated simply to provide an additional incorrect choice.
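To make the contrast concrete, here is a minimal sketch on made-up data (statsmodels used for convenience): the linear small=1/medium=2/large=3 encoding forces the medium-vs-small step to be exactly half of the large-vs-small step, while ordinary dummy coding leaves both steps free.

```python
# Made-up data where the true gaps are NOT equally spaced (small=0, medium=+3, large=+4).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 300
x1 = rng.normal(size=n)
x2 = rng.choice(["small", "medium", "large"], size=n)
true_effect = pd.Series(x2).map({"small": 0.0, "medium": 3.0, "large": 4.0})
y = 1.0 + 2.0 * x1 + true_effect.to_numpy() + rng.normal(size=n)

df = pd.DataFrame({
    "y": y,
    "x1": x1,
    "x2": x2,
    "x2_linear": pd.Series(x2).map({"small": 1, "medium": 2, "large": 3}),
})

constrained = smf.ols("y ~ x1 + x2_linear", data=df).fit()                    # equal steps forced
flexible = smf.ols("y ~ x1 + C(x2, Treatment(reference='small'))", data=df).fit()
print(constrained.params)   # a single slope for x2_linear: steps forced into a 2:1 ratio
print(flexible.params)      # separate medium and large effects, roughly 3 and 4
```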
Question #4: Two data scientists are doing analysis of two categorical variables (country = USA/Canada/Mexico and phone type = iPhone/Android/other) as it relates to a numeric response variable (life expectancy in years). One data scientist simply analyzes the 9 means and the other fits a linear regression model. They arrive at very different conclusions regarding which combination of country and phone type has the highest life expectancy. What is the most likely reason?
A. If the life expectancy is right skewed, the model assumptions may be violated.
B. If the model does not include an interaction term, it may give quite different results from the analysis of the 9 means.
C. The model will properly control for the right-censoring in the data, while the analysis of the 9 means ignores this.
D. The 'other' category may have a small sample size and may be removed from the model due to lack of statistical significance.
E. The proper model is a logistic regression model which would have been equivalent to the analysis of the 9 means.
As with the previous question, 22/26 (=85%) of the respondents answered this question correctly.
Among the four incorrect responses, there were two for choice A and one each for choices C and E (and none for choice D).
Often interview candidates struggle to understand what models are “doing under the hood” and instead view even simple models as a black box. With this question we wanted to test whether respondents understood that the predictions from a model that includes all interactions are simply the corresponding means in each cell. The incorrect choices were written such that they would sound somewhat appealing to any respondents who (incorrectly) eliminated that correct choice.
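For the curious, here is a minimal demonstration of that “under the hood” fact on made-up data (statsmodels for convenience): with both categorical predictors and their interaction, the regression's fitted values are exactly the 9 cell means, so the two analyses cannot disagree.

```python
# Made-up data: compare the 9 cell means to the fitted values of a regression
# that includes both main effects and their interaction.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 900
df = pd.DataFrame({
    "country": rng.choice(["USA", "Canada", "Mexico"], size=n),
    "phone": rng.choice(["iPhone", "Android", "other"], size=n),
})
df["life_exp"] = 75 + rng.normal(size=n) + 2 * (df["country"] == "Canada")

cell_means = df.groupby(["country", "phone"])["life_exp"].mean()

saturated = smf.ols("life_exp ~ C(country) * C(phone)", data=df).fit()
df["pred"] = saturated.fittedvalues
pred_by_cell = df.groupby(["country", "phone"])["pred"].mean()

print(np.allclose(cell_means.to_numpy(), pred_by_cell.to_numpy()))   # True
```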
Question #5: A data scientist is trying to predict the future sale price of houses. For their predictions, they are considering using either the average sale price of the three (k=3) geographically closest houses that most recently sold or the average sale price of the ten (k=10) geographically closest houses that most recently sold. Which of the following statements is most correct?
A. k=3 will always work best because k=3 is more similar to k=1 for which the median and the mean are the same and the median is more robust to outliers than the mean.
B. k=3 may work best because the other 7 houses may be quite different.
C. k=10 will always work best because it uses more of the data.
D. k=10 may work best because it will include a more diverse set of houses.
E. k=3 will work best because 3 is an odd number and 10 is even.
Overall, only 10/26 (=38%) of the respondents answered this question correctly. This makes this the only question of the ten with less than half of the respondents choosing the correct answer. In fact, the majority of the respondents (14/26=54%) chose choice D.
The goal of this question was to ask a bias/variance trade-off question without using the words “bias” or “variance” to avoid too much dependence on terminology. The phrases “quite different” in choice B and “more diverse” in choice D were both meant to mean “more bias”. In retrospect, and based on follow-up discussions with respondents, this wording was confusing, and if we were to rewrite this question now we would express this differently.
Nearest neighbor prediction was used as the setting for this question because it is a standard example of the bias/variance trade-off: as the number of neighbors increases, the variance goes down (more data is averaged) but the bias goes up, since the neighbors used become farther from the target of interest. Using the replacement “quite different” for “more bias” as mentioned in the previous paragraph, choice B correctly says that a smaller number of neighbors (k=3) may work best because a larger number of neighbors (k=10) may introduce more bias (“quite different”). In conversations with respondents, most agreed that this choice sounded correct to them, but many incorrectly thought choice D also sounded correct. Choice D is meant to say a larger number of neighbors (k=10) may work better because it has more bias (“more diverse”). This was meant to be an incorrect answer since, while a larger number of neighbors (k=10) would indeed have more bias, that is not the reason it might perform better. Instead, it is the larger amount of data offsetting this increase in bias that might lead to better performance (if the offset is enough). But again this wording was confusing to many of the respondents, leading them to choose choice D.
We certainly regret not writing better phrasing for these choices B and D. However, we will note that despite the low success rate, this question still seems to be empirically predictive of overall success across the ten questions. By this we mean that the 10 respondents who answered this question correctly did in fact score better on the other 9 questions than the 16 who answered it incorrectly. The 10 correct respondents averaged 7.20 correct on the other 9 questions compared to an average of just 6.25 correct out of 9 for the 16 who answered it incorrectly.
Other than choices B and D, the remaining three choices did not use the word “may” and most respondents seemed to be able to correctly eliminate those. Only two respondents chose choice A and none chose C or E.
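For readers who want to experiment with the trade-off themselves, here is a minimal simulation sketch on entirely made-up housing data (scikit-learn's KNeighborsRegressor for convenience). The point is not that one value of k always wins, but that the winner depends on how noisy prices are relative to how quickly they change with location, which is exactly why “may work best” rather than “will always work best” is the operative phrase.

```python
# Made-up housing data: price follows a smooth trend in location plus noise.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n = 2000
location = rng.uniform(0, 10, size=(n, 2))              # stand-in for coordinates
trend = 100 + 20 * location[:, 0] + 10 * np.sin(location[:, 1])

train, test = slice(0, 1500), slice(1500, n)
for noise_sd in [2.0, 40.0]:                             # low-noise vs high-noise prices
    price = trend + rng.normal(scale=noise_sd, size=n)
    for k in (3, 10):
        knn = KNeighborsRegressor(n_neighbors=k).fit(location[train], price[train])
        mse = mean_squared_error(price[test], knn.predict(location[test]))
        print(f"noise sd={noise_sd:>5}, k={k:>2}: test MSE = {mse:.1f}")
```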
Question #6: A data scientist has counts of people broken down by country = USA / Canada / Mexico and phone type = iPhone / Android / other. They want to do some modeling and analysis with this data but first want to determine if country and phone type are independent. They carry out a chi-squared hypothesis test for independence using this and conclude that they are indeed independent based on this. Going forward in the rest of the analysis they assume independence. What could be a problem with using the chi-squared hypothesis test for independence to conclude independence here?
A. There may be missing data which needs to be accounted for or imputed.
B. The analysis described does not control for multiple testing.
C. If the data set is large, the hypothesis test may be overpowered.
D. The data may not be normally distributed so the hypothesis test is invalid.
E. A chi-squared hypothesis test for independence will almost always show independence if the sample size is sufficiently small.
Overall, 14/26 (=54%) of the respondents answered this question correctly.
Hypothesis testing can be problematic when it is underpowered and fails to detect important effects due to small sample size, and also when it is overpowered and detects statistically significant effects that are too small to be of practical significance. This question was written to test whether respondents understand both of these concerns and can properly distinguish between them.
The most common incorrect answer was choice C which mentions being overpowered. This was chosen by 7 of the 26 respondents. As discussed earlier in the blog post, some respondents thought this was quite tricky and needed very careful reading. Indeed, choice C is correct in many situations, but in this particular situation because the question says “conclude that they are indeed independent” we know the null hypothesis was not rejected. Thus, being underpowered is a potential concern (choice E) but being overpowered (choice C) is irrelevant.
The question is perhaps made even more difficult by the language “conclude that they are indeed independent”, which reads as if the null hypothesis has been shown to be true. That is commonly said but not technically correct; strictly speaking, we can only say the null hypothesis was not rejected. Thus this question really forces the respondents to think and read carefully to differentiate being over/underpowered and rejecting/not rejecting the null hypothesis.
Finally, anyone who has actually worked with this kind of data might make an implicit assumption of big data ("overpowered"). In that context, the null hypothesis is almost sure to be rejected (hard to imagine that phone types would have exactly the same prevalence across countries). Thus, the situation described is strange and possibly confusing for someone with experience. In contrast, if the question had mentioned survey data, the experienced data scientist would not have found the situation strange.
Among the 5 respondents who chose neither choice E nor choice C, there were four who chose D and one who chose B (no one chose A).
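To make the underpowered/overpowered distinction concrete, here is a minimal simulation sketch using made-up cell probabilities that encode a genuine but modest dependence between country and phone type.

```python
# Made-up joint probabilities (rows = countries, columns = phone types) that do
# NOT factor into the product of their margins, i.e. the variables are dependent.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
joint = np.array([[0.16, 0.14, 0.10],
                  [0.12, 0.13, 0.05],
                  [0.07, 0.13, 0.10]])

n_sims = 500
for n in [100, 100_000]:
    rejections = 0
    for _ in range(n_sims):
        counts = rng.multinomial(n, joint.ravel()).reshape(3, 3)
        p_value = stats.chi2_contingency(counts)[1]
        rejections += p_value < 0.05
    # Small n: the test usually fails to reject (underpowered), which is easily
    # misread as evidence of independence. Huge n: it essentially always rejects.
    print(f"n={n:>7}: rejected independence in {rejections / n_sims:.0%} of simulations")
```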
Question #7: A data scientist wants to remove outliers in their dataset. They decide to remove anything more than 3 standard deviations above or below the mean. What could be problematic about this approach?
A. They have not confirmed that the data is normally distributed. Many distributions have a very large fraction of data outside this range and they may be removing a majority (more than 50%) of the data if it isn't normal.
B. The standard deviation itself is impacted by outliers. It is better to use the interquartile range multiplied by a constant.
C. For large datasets, the standard deviation will be close to zero since it scales with the square root of the sample size.
D. The usual number to use is two standard deviations. It is very unusual to use three standard deviations without a strong reason.
Overall, 14/26 (=54%) of the respondents answered this question correctly, which is the same accuracy rate as the previous question. None of the respondents chose choices C or D, which means the remaining 12/26 (=46%) chose choice A.
The main goal of this question was to assess some practical knowledge about outlier removal, and a secondary goal was to assess applied knowledge of Chebyshev's inequality. On that second point, we didn’t want to assess whether someone has memorized Chebyshev's inequality and can recall it by name and apply it exactly, but rather more generally an understanding or intuition that having more than 50% of the data outside of three standard deviations from the mean is impossible, thus ruling out choice A. Indeed, Chebyshev's inequality states that this fraction can be at most 1/9 ≈ 11.1% (much less than the 50% mentioned in choice A), but we wanted to assess this at a higher level rather than getting into the exact bound given by Chebyshev's inequality.
In discussions with some respondents who chose the (incorrect) choice A, they mentioned that they somewhat glossed over the “more than 50%” in the choice, and were attracted to the choice since other than the “more than 50%” (and the word “majority”) the choice was correct and gave good practical advice in general. In retrospect, if we were to write this question again, we would more strongly emphasize the 50% value in the choice (not just putting it parenthetically) and perhaps make it even larger than 50% - maybe 90% or 95% to make it more obvious.
In these discussions with respondents who chose the (incorrect) choice A they also mentioned that choice B (the correct choice) was worded a bit too vaguely and casually in the use of “multiplied by a constant”. Obviously the value of the constant here must be a reasonable choice, but as worded it reads somewhat as if any constant will suffice. It would have been better to use something like "multiplied by an appropriately-chosen constant".
Overall despite these difficulties this question did seem to work as intended, and some of the incorrect respondents mentioned that with a more careful reading they likely would have selected the correct answer. However, we do feel we went a bit too far in trying to make choice A sound attractive and perhaps also too far in trying to undersell (the correct) choice B. And objectively, this is the only question whose correct answer does not increase the chances of correctly answering other questions.
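As a small numerical illustration of the Chebyshev point, here is a sketch on made-up, strongly skewed data: even far from normality, the fraction of observations beyond three standard deviations of the mean cannot exceed 1/9, and in this example it is far smaller. A quartile-based rule like the one suggested in choice B also relies on statistics that the extreme values themselves barely move.

```python
# Strongly right-skewed made-up data (lognormal), i.e. nowhere near normal.
import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.5, size=1_000_000)

mean, sd = x.mean(), x.std()
beyond_3sd = np.mean(np.abs(x - mean) > 3 * sd)
print(f"fraction beyond 3 SD: {beyond_3sd:.4f} (Chebyshev bound: {1/9:.4f})")

# A rule based on quartiles uses statistics that outliers barely affect.
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
flagged = np.mean((x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr))
print(f"fraction flagged by the 1.5*IQR rule: {flagged:.4f}")
```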
Question #8: The variable y is numeric, the variable x1 is numeric and the variable x2 is categorical taking on 5 unique values representing 5 colors (the five values of x2 are red, blue, yellow, green and purple.) A data scientist fits a linear regression model with y as the response and x1 and x2 as predictors (an intercept was fit as well but no interactions). A second data scientist decides instead to fit 5 separate regression models. In other words, they fit a regression model with y as the response and x1 as the predictor for each of the 5 colors separately (and each regression model includes an intercept). Both data scientists are fitting the models using the same (training) dataset. Which of the following is true regarding the predictions for the response variable y from these two approaches?
A. These two approaches will give exactly the same predictions for all values of x1 and x2.
B. These two approaches will give exactly the same predictions for all values of x1 in the training data, but may give different predictions for test data with new values for x1.
C. These two approaches will give exactly the same predictions for all values of x1 and x2 in the training data, but may give different predictions for test data with new values for x1 and x2.
D. These two approaches may give different predictions. They would give exactly the same predictions if the first data scientist had included the interaction between x1 and x2.
E. These two approaches may give different predictions. They would give exactly the same predictions if x1 and x2 are statistically independent.
F. These two approaches may give different predictions. They would give exactly the same predictions if there are equal numbers of observations for the five colors red, blue, yellow, green and purple.
Overall, 18/26 (=69%) of the respondents answered this question correctly. Of the 8 incorrect responses, 5 were choice E and one each went to choices A, B and F (and no one chose C).
This question is somewhat similar to an actual interview question for the “Data Scientist - Research” role that was retired in 2019 but asked many times previous to that by one of the authors of this blog post. The key is to understand that without the interaction, the model is fitting 5 parallel lines (the parameterization constrains the slopes to be the same), but once an interaction is included then the slopes are unconstrained and the fitted values are equivalent to fitting 5 separate models. From experience asking this question in interviews, many candidates struggle with this. About half of candidates (even those with PhDs in statistics) would not be certain about the equivalence of the fitted values for the interaction model and the five separate models. They could usually be convinced by walking through a discussion of factoring the likelihood into a product of five functions (one for each color), or even a more intuitive argument explaining that the best overall fit is simply to fit the line for each of the five colors as best as possible.
But before walking through any such argument, about half of the candidates initially believed that there was no equivalence between the two approaches even once an interaction is included. For this reason the answer choices were written carefully so as not to offer any option stating there is no equivalence once an interaction is included. All six answer choices discuss some type of equivalence, so that a respondent who initially thinks there is no equivalence when an interaction is included may still be able to correct that initial belief by eliminating incorrect choices.
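For the skeptical reader, here is a minimal demonstration on made-up data (statsmodels for convenience) that the fitted values from the single model with the x1-by-color interaction match those from the five separate per-color regressions.

```python
# Made-up data with a different true slope per color.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500
colors = ["red", "blue", "yellow", "green", "purple"]
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.choice(colors, size=n)})
slopes = {c: 1 + i for i, c in enumerate(colors)}
df["y"] = df["x2"].map(slopes) * df["x1"] + rng.normal(size=n)

# One regression with main effects and the x1-by-x2 interaction.
joint = smf.ols("y ~ x1 * C(x2)", data=df).fit()

# Five separate regressions, one per color.
separate = pd.Series(index=df.index, dtype=float)
for color, group in df.groupby("x2"):
    separate.loc[group.index] = smf.ols("y ~ x1", data=group).fit().fittedvalues

print(np.allclose(joint.fittedvalues, separate))   # True: identical fitted values
```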
Note that the most common wrong answer choice was choice E which mentions statistical independence. As discussed with question #1, this choice may be attractive to people who have somewhat confused the concepts of interaction and independence, and it was written for that reason.
Question #9: In a certain time period, an economist noted that average sale prices for cars had increased 10%. They split the data into gasoline cars and non-gasoline cars (two mutually exclusive categories) to analyze further. They observed that average sale prices for each were decreasing over this time period. Which of the following is true?
A. This can not be possible. At least one of the two must have increased if the overall average sale price increased.
B. We know total gasoline car sales decreased over this time period.
C. We know total non-gasoline car sales increased over this time period.
D. We know gasoline cars and non-gasoline cars have different average sale prices.
Overall, 23/26 (=88%) of the respondents answered this question correctly, making it the question with the highest accuracy rate among the ten questions. Of the 3 incorrect responses, 2 were choice A and one was choice B (and no one chose C).
The goal of this question was to test knowledge of Simpson's paradox without explicitly mentioning Simpson's paradox by name (in order to avoid too much dependence on terminology much like not mentioning “null hypothesis” or “uniform” in question #2 or “bias” or “variance” in question #5). The question is asked in the context of a time series in which the trend for the aggregate is different from the trend of the (two) separate components. The intention is that even someone who may not know Simpson's paradox by name or recall its definition can still answer the question correctly through recalling their practical experience with similar analysis and/or through using analytic reasoning skills.
In particular, one can reason that if choice D were not true and gasoline cars and non-gasoline cars have exactly the same average sale prices, then splitting the data into these two categories will not change the trend. In this regard, perhaps the question can be faulted for being too easy since even someone who is fairly weak in this particular skill area would still likely recognize that choice D must be correct.
Choice A was included to attract people who could potentially be fooled by Simpson's paradox in practice. It was only selected by two of the respondents, but again perhaps choice D was too attractive as a correct answer so that even respondents who might have initially believed choice A was correct were able to eliminate it after reading choice D.
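Here is a tiny made-up numerical example of how this can happen: each group's average price falls, but the mix shifts toward the pricier group, so the overall average rises.

```python
# (count, average price) for each group in two periods; all figures invented.
period1 = {"gasoline": (90, 20_000), "non_gasoline": (10, 60_000)}
period2 = {"gasoline": (50, 19_000), "non_gasoline": (50, 55_000)}   # both averages lower

def overall_avg(period):
    total = sum(n * price for n, price in period.values())
    return total / sum(n for n, _ in period.values())

print(overall_avg(period1))  # 24000.0
print(overall_avg(period2))  # 37000.0 -- each group fell, yet the overall average rose
```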
Question #10: A statistical model for market share showed that in June market share was not statistically significantly different from the target value of 50% market share. The same model using the same data showed that in July of the same year, the market share was indeed now statistically significantly different from the target value of 50% market share. Which of the following must be true?
A. The difference between June and July market share is statistically significant.
B. The difference between June and July market share is not statistically significant.
C. The difference between June and July market share is statistically significant if using a significance level of 2*alpha where alpha is the original significance level.
D. The difference between June and July market share is statistically significant if using a significance level of alpha/2 where alpha is the original significance level.
E. None of these.
Overall, 22/26 (=85%) of the respondents answered this question correctly. Of the 4 incorrect responses, one was choice A, two were choice C, and one was choice D (no one chose B).
This question evaluates a respondent’s understanding of the concept of statistical significance. The question is closely related to the problem discussed in the paper with the very descriptive title “The Difference Between “Significant” and “Not Significant” is not Itself Statistically Significant” by Andrew Gelman and Hal Stern (The American Statistician, Nov 2006, Vol. 60, No. 4). One of the authors of this blog post also encountered a high profile instance of this situation (and a large misunderstanding surrounding it) in their work at Google, which was part of the motivation for including the question.
One effective strategy for correctly answering this question is to imagine situations that are consistent with the information in the question prompt but provide a counterexample to each incorrect answer choice. For instance, to rule out choice B, one could imagine June at 50% +/- 5% and July at 90% +/- 5%; this large June-to-July difference would itself be statistically significant, disproving choice B.
We will note that the motivation for making “none of these” the correct answer was not to be tricky; rather, we found it challenging to write this question as a multiple choice question with a single correct answer. Simply knowing that exactly one of the answers must be correct was enough to logically eliminate many of the answer choices in earlier drafts of this question. By the nature of the problem, it is truly a question that lends itself to “none of these” as the correct choice: one cannot infer anything definite about the statistical significance of the June-vs-July difference from the information provided about June and July separately.
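For completeness, here is a small sketch of the counterexample strategy described above, using made-up estimates and standard errors with simple z-tests standing in for the actual market share model: June is not significantly different from 50%, July is, and yet the June-vs-July difference is not itself significant, which rules out choice A.

```python
# All numbers invented; simple two-sided z-tests for illustration only.
from scipy import stats

def two_sided_p(estimate, reference, se):
    z = (estimate - reference) / se
    return 2 * stats.norm.sf(abs(z))

se = 0.025          # assumed standard error of each monthly estimate
june, july = 0.53, 0.56

print("June vs 50%: p =", round(two_sided_p(june, 0.50, se), 3))   # ~0.23, not significant
print("July vs 50%: p =", round(two_sided_p(july, 0.50, se), 3))   # ~0.02, significant
# Assuming the monthly estimates are independent, their difference has a larger
# standard error, and here the June-vs-July difference is not significant.
print("June vs July: p =", round(two_sided_p(july, june, 2**0.5 * se), 3))  # ~0.40
```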
Finally we note that in many of the previous nine questions we had to be careful to avoid too much dependence on terminology or definitions (“null hypothesis”, “uniform”, “bias”, “variance”, “Simpson's paradox”, etc.); however, in this question a respondent must know what “statistical significance” means. This is intentional as we feel “statistical significance” is terminology used commonly enough in industry that it is reasonable to expect a respondent (or a candidate interviewing for Google’s “Data Scientist - Research” role) to be very comfortable with this terminology.