I recently came across this article from Interviewing.io, an interview practice platform, which paints a bleak picture of how well technical interview performance reflects true skill. It makes the case that, because an interviewee's performance can differ so much from one interview to the next, a single interview can't be considered a reliable predictor of ability. This conclusion echoes a lot of negative sentiment about how the technical interview and hiring processes are fundamentally broken.
While I agree with the conclusions of the article, I think the methodology is a little problematic. Before we look at where the article might’ve fallen short of an unbiased experiment, I recommend you read the article for greater context.
Here’s a lightly paraphrased excerpt from the article:
When an interviewer and an interviewee match, they meet in a collaborative coding environment with voice, text chat, and a whiteboard and jump into technical questions. Interview questions on the platform tend to fall into the category of what you’d encounter at a phone screen for a back-end software engineering role, and interviewers typically come from large companies, as well as engineering-focused startups.
After every interview, interviewers rate interviewees on a few different dimensions, including technical ability. Technical ability gets rated on a scale of 1 to 4, where 1 is “meh” and 4 is “amazing!”.
The same interviewee can do multiple interviews, each of which is with a different interviewer and/or different company.
The article analyzes 299 such interviews with 67 interviewees.
Interview Performance is Volatile
The main argument of the article rests on the graph below, which plots the standard deviation against the mean of each interviewee’s technical scores to show that performance is volatile across interviews.
We see a pretty wide range of technical scores in different interviews for any given mean score. What might be some reasons for this wide spread?
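One source of spread worth keeping in mind is simple measurement noise. As a rough sketch (the skill levels, noise level, and interview counts here are all made-up assumptions, not figures from the article), we can give each interviewee a fixed underlying skill, add interview-to-interview noise, and round to the 1–4 scale. Even with no change in true ability, the per-interviewee standard deviation comes out nonzero:

```python
import random
import statistics

random.seed(0)

def simulate_ratings(true_skill, n_interviews, noise_sd=0.7):
    """Hypothetical model: a fixed true skill plus per-interview noise,
    rounded and clamped to the 1-4 rating scale."""
    ratings = []
    for _ in range(n_interviews):
        raw = random.gauss(true_skill, noise_sd)
        ratings.append(min(4, max(1, round(raw))))
    return ratings

for skill in (1.5, 2.5, 3.5):
    r = simulate_ratings(skill, n_interviews=5)
    print(f"true skill {skill}: ratings {r}, "
          f"mean {statistics.mean(r):.2f}, sd {statistics.pstdev(r):.2f}")
```

The point of the sketch is only that a std-vs-mean plot like the article’s can show spread even when underlying skill is perfectly stable, so the spread alone doesn’t tell us where the volatility comes from.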
The Rating System
If we go back to the scoring system, an interviewer is asked to rate interviewee performance on a scale of 1 to 4, but this may not reflect the granularity of an interviewer’s opinion. If an interviewee is more than “good enough” but not the best candidate they’ve seen, should they be rated a 3 or a 4? This lack of fine control seems likely to inflate the standard deviation, whether because interviewers struggle to choose between two adjacent options or because they draw the threshold for a given rating differently.
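The threshold problem can be made concrete with a toy example (the “impression” scale and the cut points below are invented for illustration, not taken from the platform): two interviewers who form the same continuous impression of a candidate but bucket it into the 1–4 scale differently will disagree on the recorded score.

```python
import bisect

def rate(impression, thresholds):
    """Bucket a continuous 0-100 impression into a 1-4 rating.
    `thresholds` are the cut points between ratings 1|2, 2|3, and 3|4."""
    return 1 + bisect.bisect_right(thresholds, impression)

lenient = [20, 45, 70]   # assumed cut points for a lenient interviewer
strict  = [30, 55, 80]   # assumed cut points for a stricter interviewer

impression = 72          # the same performance, seen by both interviewers
print(rate(impression, lenient))  # 4
print(rate(impression, strict))   # 3
```

The identical performance lands a 4 with one interviewer and a 3 with the other, which inflates an interviewee’s measured spread without any change in how they actually performed.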
Variability Across Time
Another issue is that the article doesn’t account for changes in interviewer and interviewee behaviour over time. Remember that Interviewing.io is, after all, an interview practice platform. Interviewees are likely to change their performance based on interviewer feedback, and interviewers could tweak their methodology over time. These changes could have affected technical scores across interviews!
When we look at performance across multiple interviews, we aren’t looking at statistically independent events: each interview can be influenced by the ones before it. We thus can’t directly conclude that the spread in technical scores is due to the failure of the interview as a measure of ability.
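A toy example (the ratings below are made up for illustration) shows why a practice effect matters for the spread: two interviewees with the same mean technical score, one improving with practice and one not, have very different standard deviations even if every individual rating is perfectly accurate.

```python
import statistics

# Hypothetical ratings across six interviews.
improving = [2, 2, 3, 3, 4, 4]   # underlying skill drifts up with practice
static    = [3, 3, 3, 3, 3, 3]   # same mean score, no practice effect

print(statistics.mean(improving), statistics.pstdev(improving))  # 3 0.816...
print(statistics.mean(static), statistics.pstdev(static))        # 3 0.0
```

The improving interviewee’s spread reflects genuine growth, not an unreliable measurement, yet a std-vs-mean analysis would count both kinds of spread the same way.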
The article questions the reliability of a technical interview as an indicator of skill – a sentiment that I, and I’m sure many others, can sympathize with. It is my hope that empirical evidence of the problems plaguing the hiring pipeline will force companies to rethink their strategy. Extending the scope of the article to address the issues I’ve raised here would be a step in the right direction.