Data Science Interview Experience: Top 5 Statistics Questions Asked
Statistics is one of the topics that becomes very daunting in data science interviews. If you have regression-related projects on your resume, you are bound to be led into the arena of statistics. In countless interviews, statistics questions became the reason for my non-selection.
The trickiest thing about statistics is that its use in projects is mostly implicit, so it is difficult to develop a clear intuition. Still, if you can cover 5 main questions/concepts related to statistics, you will be able to sail through this section of the interview.
Before going into the details of the questions, there is a fact about statistics that is important to understand. Statistics is largely about using probability to make the best guesses. Why the need to guess, you may ask? Because it is practically impossible to collect data from the entire population. The most intuitive example is opinion polls during elections. Aren’t you amazed that a survey of a few thousand people gives results so close to those of millions of voters? The power of statistics!
Let’s get into it –
Q: What is the central limit theorem? What is a normal distribution? What is the difference between a normal distribution and a standard normal distribution?
Ans: The central limit theorem is the foundation of inferential statistics (making guesses about a population using a sample). It states that if we repeatedly draw random samples of a sufficiently large size from a population, the means of those samples will approximately follow a normal distribution, whatever the shape of the population distribution (as long as its variance is finite). (Some people make the mistake of assuming that if the samples come from the same distribution, their means should be equal. Keep in mind that we are picking the samples randomly, so that will not happen.)
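If the interviewer asks you to demonstrate the idea, a quick simulation helps. The sketch below is only an illustration under my own assumptions: the population is exponential (heavily skewed, so clearly not normal), and the sample size, number of samples, and seed are arbitrary choices.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable (arbitrary choice)

SAMPLE_SIZE = 50     # observations per sample (assumed for illustration)
NUM_SAMPLES = 2000   # how many samples we draw (assumed for illustration)

def draw_sample(n):
    """Draw n values from an exponential population with mean 1 and SD 1."""
    return [random.expovariate(1.0) for _ in range(n)]

# One mean per sample: this list is the "sampling distribution of the mean".
sample_means = [statistics.mean(draw_sample(SAMPLE_SIZE)) for _ in range(NUM_SAMPLES)]

mean_of_means = statistics.mean(sample_means)
sd_of_means = statistics.stdev(sample_means)

# The CLT predicts a spread of (population SD) / sqrt(sample size).
predicted_sd = 1.0 / SAMPLE_SIZE ** 0.5

# For a normal distribution, roughly 68% of values lie within one SD of the mean.
within_one_sd = sum(
    abs(m - mean_of_means) <= sd_of_means for m in sample_means
) / NUM_SAMPLES

print(f"mean of sample means: {mean_of_means:.3f} (population mean is 1)")
print(f"SD of sample means:   {sd_of_means:.3f} (CLT predicts {predicted_sd:.3f})")
print(f"share within one SD:  {within_one_sd:.2f} (normal curve gives about 0.68)")
```

Even though the individual values are skewed, the sample means cluster symmetrically around the population mean.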
A normal distribution is a bell-shaped distribution, meaning the data is distributed symmetrically around the measures of central tendency (mean, median, mode), which all coincide.
The standard normal distribution is a special case where mean = 0 and standard deviation = 1.
Bonus Tip — I used to mention that the normal distribution concept is also used when we standardize variables. An intuitive way to think is that the variable is now transformed in such a way that each row represents how many standard deviations the data point is from the mean.
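A tiny example makes the standardization point concrete. The values below are made up for illustration, and I use the population standard deviation, which is one common convention.

```python
import statistics

values = [10, 20, 30, 40, 50]  # hypothetical variable, made up for illustration

mean = statistics.mean(values)
sd = statistics.pstdev(values)  # population standard deviation

# Each z-score says how many standard deviations the point is from the mean.
z_scores = [(v - mean) / sd for v in values]

for v, z in zip(values, z_scores):
    print(f"value {v:>2} -> z-score {z:+.2f}")

# After standardizing, the variable has mean 0 and standard deviation 1.
print(statistics.mean(z_scores), statistics.pstdev(z_scores))
```

Note that standardizing changes the scale of the variable, not the shape of its distribution.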
Q: What are the t-test and z-test? When do we use z-test vs t-test?
Ans: Both the t-test and z-test are hypothesis tests. There is a null hypothesis (the default assumption, taken as true until the evidence says otherwise) and an alternative hypothesis that we are trying to support using the sample data. Specifically, using a two-sample t-test we can compare the means of two samples.
Explain it using a practical example in the interview –
For example, a company has recently implemented a recommendation system (RS) on its website and wants to know if it has had any real positive impact on sales. We define the null hypothesis as "RS has no impact on sales" and the alternative hypothesis as "RS has a positive impact on sales". We get the sales data for 20 random customers who use the old popularity-based system and 20 random customers who use the new recommendation system.
Popularity-based sample => x1 (mean) = 800, SD1 = 100, n1 = 20
Recommendation engine sample => x2 (mean) = 900, SD2 = 200, n2 = 20
t = (x2 - x1)/sqrt(SD2²/n2 + SD1²/n1) = (900 - 800)/sqrt(200²/20 + 100²/20) = 100/50 = 2
degrees of freedom = n1 + n2 - 2 = 38
Significance level: α = 0.05
Using (α, degrees of freedom), we get the critical t-value from the tables (in this case, about 1.68 for a one-tailed test, since the alternative hypothesis is that sales went up). If the computed t-value is greater than the critical t-value, as it is here (2 > 1.68), we reject the null hypothesis and say the RS has a positive impact. If the t-value is less than the critical value, we say that the current sample does not give conclusive evidence to reject the null hypothesis.
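The calculation above can be sketched in a few lines of Python. The function name is my own, and the critical value is hard-coded from the t-table as quoted in the example rather than computed.

```python
import math

def two_sample_t(mean1, sd1, n1, mean2, sd2, n2):
    """t-statistic for the difference mean2 - mean1 from summary statistics."""
    standard_error = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean2 - mean1) / standard_error

# Numbers from the recommendation system example.
t_value = two_sample_t(mean1=800, sd1=100, n1=20, mean2=900, sd2=200, n2=20)
degrees_of_freedom = 20 + 20 - 2

# Critical value taken from the t-table (alpha = 0.05, one-tailed), as quoted in the example.
T_CRITICAL = 1.68

reject_null = t_value > T_CRITICAL
print(f"t = {t_value:.2f}, df = {degrees_of_freedom}, reject null: {reject_null}")
```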
The z-test is used instead of the t-test when two conditions are met:
- We have the population standard deviation
- The sample size is greater than 30
Bonus Tip —
What will happen to the computed t-value if we lower the number of data points in the sample? In the formula, reduce the value of n1 and n2 and recompute the t-value.
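To see the answer for yourself, here is a small sketch that keeps the means and standard deviations from the example fixed and only changes the sample size. The sample sizes 5 and 80 are my own picks for illustration.

```python
import math

def t_for_sample_size(n, mean1=800, sd1=100, mean2=900, sd2=200):
    """t-statistic when both samples have n data points (values from the example)."""
    return (mean2 - mean1) / math.sqrt(sd1 ** 2 / n + sd2 ** 2 / n)

# 20 is the size used in the example; 5 and 80 are arbitrary comparison points.
t_by_n = {n: t_for_sample_size(n) for n in (5, 20, 80)}

for n, t in t_by_n.items():
    print(f"n1 = n2 = {n:>2} -> t = {t:.2f}")
```

With fewer data points the standard error grows, so the same difference in means gives a smaller t-value. The critical t-value also rises as the degrees of freedom fall, so rejecting the null hypothesis gets harder on both counts.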
Q: Explain the p-value and how it is relevant in the context of regression.
The p-value for an observed event is the sum of the probabilities of that event, of all equally rare events, and of all rarer events, computed assuming the null hypothesis is true.
Another definition states that it is the probability of getting a sample as extreme as or more extreme than the observed one, assuming the null hypothesis is true. I know this might sound confusing. Check out this article for more details.
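A coin-toss example shows the "given event + equally rare + rarer events" definition directly. This is a sketch with made-up numbers: I assume we saw 9 heads in 10 tosses and the null hypothesis is that the coin is fair.

```python
from math import comb

def binomial_p_value(heads, tosses, p_heads=0.5):
    """Two-sided exact p-value: observed outcome plus all equally rare or rarer ones."""
    probs = [
        comb(tosses, k) * p_heads ** k * (1 - p_heads) ** (tosses - k)
        for k in range(tosses + 1)
    ]
    observed = probs[heads]
    # Small tolerance so outcomes with the same probability are not lost to rounding.
    return sum(p for p in probs if p <= observed + 1e-12)

# Hypothetical observation: 9 heads in 10 tosses of a supposedly fair coin.
p_value = binomial_p_value(heads=9, tosses=10)
print(f"p-value = {p_value:.4f}")
```

Here the outcomes that count are 9 heads (observed), 1 head (equally rare), and 0 or 10 heads (rarer), which add up to 22/1024.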
In regression, the p-value of a coefficient tells us whether that coefficient is statistically significant. We test a hypothesis on each coefficient, where the null hypothesis is that the coefficient is 0. If the p-value of the coefficient is below the chosen significance level (for example, 0.05), we reject the null hypothesis and say that the coefficient for the variable is significant.
Q: What is ANOVA? Where do you use ANOVA?
ANOVA is used to compare means when we have more than 2 samples. It is part of hypothesis testing.
Now, the mathematical details of the ANOVA process are somewhat long and complex, so you are unlikely to get a question on them, but it is good to know the process. If you get time, read about it.
Fundamentally, as with the t-test, in ANOVA we try to reject the null hypothesis that all the group means are equal by computing the F-value, which in simple terms is:
F-value = Variance between the samples / Variance within the samples
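For intuition, here is a minimal one-way ANOVA sketch in plain Python. The three groups are made-up numbers, and the function name is my own. It only computes the F-value; comparing it against a critical value from an F-table is left out.

```python
import statistics

def one_way_f_value(groups):
    """F = (variance between groups) / (variance within groups)."""
    all_values = [x for group in groups for x in group]
    grand_mean = statistics.mean(all_values)
    k = len(groups)            # number of groups
    n_total = len(all_values)  # total number of observations

    # Between-group sum of squares: how far each group mean is from the grand mean.
    ss_between = sum(
        len(group) * (statistics.mean(group) - grand_mean) ** 2 for group in groups
    )
    # Within-group sum of squares: how far each value is from its own group mean.
    ss_within = sum(
        (x - statistics.mean(group)) ** 2 for group in groups for x in group
    )

    ms_between = ss_between / (k - 1)        # between-group variance
    ms_within = ss_within / (n_total - k)    # within-group variance
    return ms_between / ms_within

# Hypothetical data: three small samples.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_value = one_way_f_value(groups)
print(f"F-value = {f_value:.2f}")
```

A large F-value means the group means differ by much more than the noise inside each group would explain.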
Check out the interesting version of an ANOVA-related question that was asked in one of my interviews. It can be found here.
Q: Explain the chi-square test and what it is used for.
The chi-square test is another hypothesis test, and it is used with categorical variables.
There are two main use cases of the chi-square test:
- To determine goodness of fit (Let’s say someone gives you data and asks you to check if it comes from the normal distribution)
- To test the independence of two variables
Try to explain it with a practical example like—
Suppose research claims that sending cards to customers on their birthday increases the retention rate. Can you test this claim?
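You design an experiment in which, for a certain week, you send cards to one group of customers and not to another group. The next week you check how many customers were retained in each group. Let’s say the data is as follows:
Received cards: 60 retained, 40 not retained, 100 total
Did not receive cards: 50 retained, 50 not retained, 100 total
Total: 110 retained, 90 not retained, 200 customers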
Null Hypothesis — Both the variables are independent meaning sending cards has no impact on the retention rate of the customer
Alternate Hypothesis — Both the variables are not independent
Expected data is based on the total frequency of rows and columns. For example, Expected(Received Cards, Retained) = (Total Retained × Total Received Cards)/Total Customers
= (110 × 100)/200 = 55
Calculate the chi-square value as the sum over all cells of (Observed - Expected)²/Expected:
X² = (60 - 55)²/55 + (50 - 55)²/55 + (40 - 45)²/45 + (50 - 45)²/45 = 2.02
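degrees of freedom = (r - 1)(c - 1) = (2 - 1)(2 - 1) = 1
Using (α, degrees of freedom), we look up the critical X² value in the table and compare it to the calculated X² value. If the calculated X² value is higher, we can reject the null hypothesis. Here, with α = 0.05 and 1 degree of freedom, the critical value is 3.84; since 2.02 is lower, this sample does not give enough evidence to reject the null hypothesis.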
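The whole chi-square calculation can be sketched as below. The observed counts are the ones implied by the X² calculation above (60 and 40 for the card group, 50 and 50 for the other group), and the critical value 3.841 is hard-coded from the chi-square table for α = 0.05 and 1 degree of freedom. The p-value shortcut via `math.erfc` is only valid for 1 degree of freedom.

```python
import math

# Rows: received cards / did not receive cards. Columns: retained / not retained.
observed = [[60, 40], [50, 50]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected count for each cell = row total * column total / grand total.
expected = [[r * c / total for c in col_totals] for r in row_totals]

chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
dof = (len(observed) - 1) * (len(observed[0]) - 1)

# Table value for alpha = 0.05 and 1 degree of freedom.
CHI2_CRITICAL = 3.841
reject_null = chi_square > CHI2_CRITICAL

# For 1 degree of freedom only: P(X² > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi_square / 2))

print(f"X² = {chi_square:.2f}, dof = {dof}, p-value = {p_value:.3f}")
print(f"reject null: {reject_null}")
```

With this sample, the claim that birthday cards improve retention is not supported at the 5% level.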
I hope this article helps you in your interviews!