P-value Explained Clearly — Regression, PDF, Discrete
The p-value is a very powerful concept, yet there is a lack of clear understanding of what exactly it represents.
I was shown the image below in my first introduction to the p-value.
The p-value was described as the shaded area under the probability distribution curve, and it came with many interpretations. If the shaded area is 5% of the total area, the p-value might be described as:
- There is a 95% probability that the sample value will lie between the two red lines.
- 95% of the time the sample value will lie between two red lines
- There is a 5% chance that the random sample will produce an extreme value that falls in the shaded region
- If the p-value is low we can reject the null hypothesis
- etc.
Now, this may or may not make sense but for better understanding, let’s define the p-value in different contexts —
- Discrete Random Events
- Hypothesis Testing
- Regression Coefficients
First the definition —
P-value: the probability, computed under the assumed (null) distribution, of observing an outcome that is
- the given (observed) one,
- an equally rare one,
- or a rarer (more extreme) one.
Discrete random events
Let’s say a company sends a cashback coupon on every transaction. Each coupon has a 50% chance of carrying a 5% discount and a 50% chance of carrying a 10% discount. Each customer makes 4 transactions. What is the p-value of getting three 10% discount coupons?
The 4 transactions give 2^4 = 16 equally likely coupon sequences, and 4 of them contain exactly three 10% coupons, so Probability(three 10%, one 5%) = 4/16.
Is the p-value the same as the probability of observing the event? No.
P-value = P(given event) + P(equally rare events) + P(rarer events)
Now, what are the equally rare and the rarer events? Look at the probabilities of the 5 possible outcomes:
- four 10% coupons: 1/16
- three 10% coupons, one 5% coupon: 4/16
- two 10% coupons, two 5% coupons: 6/16
- one 10% coupon, three 5% coupons: 4/16
- four 5% coupons: 1/16
P(equally rare event) corresponds to P(three 5%, one 10%) = 4/16.
P(rarer events) corresponds to the outcomes with a lower probability than the given event: four 10% coupons (1/16) and four 5% coupons (1/16).
P-value = (4/16) + (4/16) + (1/16 + 1/16) = 10/16
P-value = 0.625
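The coupon calculation above can be checked with a few lines of Python. This is a minimal sketch, assuming each coupon is an independent draw with a 50% chance of being a 10% coupon; the function name `discrete_p_value` and its parameters are illustrative, not taken from any library.

```python
from math import comb

def discrete_p_value(n, k, p=0.5):
    """Sum the probabilities of every outcome that is as likely as,
    or less likely than, observing k successes in n trials."""
    # Binomial probability of each possible count of 10% coupons (0..n)
    probs = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    observed = probs[k]
    # Small tolerance so equally likely outcomes are not lost to float rounding
    return sum(q for q in probs if q <= observed + 1e-12)

# Three 10% coupons out of 4 transactions
print(discrete_p_value(n=4, k=3))  # 0.625
```

Only the outcome with two coupons of each kind (6/16) is more likely than the observed one, so it is the only outcome left out of the sum.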
Hypothesis Testing
In simple terms, the whole foundation of hypothesis testing is using a sample to make some conclusion about the population. If you recall in hypothesis testing, we try to reject the null hypothesis based on the sample and if the p-value is low we reject the null hypothesis. But why?
Before that, let’s understand what the p-value represents in this scenario. From the definition, it is the probability of the given event, an equally rare event, or a rarer event. For a continuous distribution, the area under the probability density function over a range gives the probability of an event, namely that the sample statistic (in this case the mean) lies in that range. Let’s look at the example below:
Below is a normal distribution.
What are different events —
P(value lies between -2SD & 2SD) = unshaded region
P(value is ≤-2SD or ≥2SD) = shaded region
According to the normal distribution (±2SD is a rounding of the exact 95% cut-off, ±1.96SD),
P(unshaded region) ≈ 95%
P(shaded region) ≈ 5%
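The size of the shaded region can be computed directly from the normal CDF. The sketch below uses Python's standard-library `statistics.NormalDist`; the helper name `two_tail_area` is an illustrative choice. It shows that the two tails beyond ±2SD hold about 4.55% of the area, while ±1.96SD gives almost exactly 5%.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

def two_tail_area(z):
    """Area under the standard normal curve below -|z| plus above +|z|."""
    return 2 * (1 - std_normal.cdf(abs(z)))

shaded_2sd = two_tail_area(2.0)     # roughly 0.0455
shaded_196sd = two_tail_area(1.96)  # roughly 0.0500
print(round(shaded_2sd, 4), round(shaded_196sd, 4))
```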
Now let’s see what an event is in hypothesis testing. Suppose we know that the population mean = 10 and the standard deviation = 2. We take a sample and get a sample mean = 18. The standard question in a z-test is: based on this sample, can we reject the null hypothesis that the population mean is 10?
The event corresponds to sample mean = 18.
Now, we want to compute Probability(mean ≥ 18 or mean ≤ 2). If this value is very small, a sample this extreme would rarely be drawn from the original distribution. Purely for illustration, let’s say this value comes out to be 0.02 (the actual value depends on the sample size, and for these numbers it would be far smaller).
Now recall that in hypothesis testing, we compare the p-value to a significance level (threshold), commonly 0.005, 0.01, 0.05, or 0.1.
But you may object that we have only computed a probability, not a p-value. Surprise! We have already determined the p-value for this test: it is exactly Probability(mean ≥ 18 or mean ≤ 2). How?
p-value = Probability(given event (value = 18) or equally rare event (value = 2) or rarer event (value > 18 or value < 2)).
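The two-sided probability above can be sketched as a small z-test helper. The sample size is not given in the example, so `n` is an assumed parameter here (defaulting to 1, i.e. a single observation), and `z_test_p_value` is an illustrative name. With the numbers from the example the result is far below the illustrative 0.02, and it only shrinks as `n` grows.

```python
from math import sqrt
from statistics import NormalDist

def z_test_p_value(sample_mean, pop_mean, pop_sd, n=1):
    """Two-sided p-value: probability of a sample mean at least as far
    from pop_mean as the observed one, in either direction."""
    standard_error = pop_sd / sqrt(n)  # n is an assumption, not given in the text
    z = (sample_mean - pop_mean) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

p = z_test_p_value(sample_mean=18, pop_mean=10, pop_sd=2, n=1)
print(p < 0.05)  # True, so the null hypothesis (mean = 10) is rejected at 5%
```

Because the function works with the absolute value of z, a sample mean of 2 gives the same p-value as a sample mean of 18, which is exactly the "equally rare event" in the definition.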
Regression Coefficients
One of the main applications of the p-value is with regression coefficients. When the p-value for a regression coefficient is high, keep 2 main properties in mind:
- The regression coefficient estimate is unstable
- The coefficient can plausibly change its sign from positive to negative or vice versa (this property is very useful in marketing mix modeling, MMM)
So, the p-value in the regression coefficients is based on hypothesis testing.
Null Hypothesis — The regression coefficient for a variable is 0.
When the p-value for a regression coefficient is high, we cannot reject the null hypothesis. This does not prove that the coefficient is 0; it means the data are consistent with a true coefficient of 0, so the non-zero estimate could be due to random chance alone. In that situation, even a small change in the input data can change the estimated coefficient significantly. This is why regression coefficients are very unstable when the p-value is high.
Every estimate comes with a confidence interval of the form [β − z·SE, β + z·SE], where SE is the standard error (the standard deviation of the coefficient estimate). When we fail to reject the null hypothesis, this interval contains 0, so both positive and negative values are plausible and the estimate can easily swing between them.
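The link between a high p-value and a confidence interval that straddles 0 can be illustrated with a short sketch. It assumes a normal approximation for the coefficient estimate (regression software typically uses a t-distribution instead), and both the function name `coefficient_summary` and the numbers are hypothetical, not output from a real model.

```python
from statistics import NormalDist

def coefficient_summary(beta, std_error, confidence=0.95):
    """Two-sided p-value for H0: coefficient = 0, plus a confidence
    interval, using a normal approximation (an assumption)."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    z_stat = beta / std_error
    p_value = 2 * (1 - norm.cdf(abs(z_stat)))
    interval = (beta - z_crit * std_error, beta + z_crit * std_error)
    return p_value, interval

# Hypothetical estimates: a noisy coefficient and a well-determined one
p_noisy, ci_noisy = coefficient_summary(beta=-0.8, std_error=1.0)
p_stable, ci_stable = coefficient_summary(beta=2.5, std_error=0.5)

print(p_noisy, ci_noisy)    # high p-value, interval spans negative and positive
print(p_stable, ci_stable)  # low p-value, interval entirely above 0
```

In the noisy case the estimate is negative, but the interval reaches well into positive territory, which is why the sign of such a coefficient should not be trusted.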
In MMM, the impact of marketing on sales is usually assumed to be non-negative. So when a marketing variable gets a negative coefficient with a high p-value, the negative sign is not statistically reliable, and the coefficient may well turn positive after applying some transformations or collecting more data. Understanding this is very useful in practice.
Hope you liked the interpretation of the p-value in different contexts. Ultimately, everything boils down to the same meaning.
Check out my blog Analytixtime!