Imagine that your site/app is a bag full of balls of two colors - red & black - in unequal proportions.
Because you can't look inside the bag to count the number of balls of each color, you ask a couple of your friends to pick one ball each.
The reason for doing this activity is that, by checking a sample of balls, you want to estimate the color proportions of the entire bag.
This is the basis of A/B and Multivariate Testing/Experiments as well:
Your site/app has a button whose existing color = red.
While the color you think the button should be = black.
Around 100k people visit your site daily.
On average, 3% of daily visitors have clicked the red button over the last 30 days.
Technically speaking, the red button's CTR = 3%.
Note that "3%" is an average, which means that it is the MEAN of daily-CTRs of the last 30 days (i.e. there would have been days when the CTR would have been 9% and other days when the CTR would have been 0% as well).
Now, you want to test how the black color button will perform.
Though you are confident that the black button will get a better CTR, you cannot just replace the red button with the black one and show it to all users, as there is a chance that users might not like it at all, and hence not click it at all.
So, you let a small sample of your visitors view the black color button.
The reason for doing this activity, just like in the ball-picking story at the start of this post, is that, by checking a sample, you want to estimate the overall click-count.
Now, let's say you showed the black button to 1k (1% of total) visitors for 30 days, and the following are the results that you get:
[Image: table of the black button's 30-day test data - daily impressions, CTR, and squared deviations]
We can see from the above table that on the 1st & 2nd day, the CTR of the black button was 0 (no one clicked), while on the 26th, 27th, 28th, 29th & 30th day it was 0.09 (9 out of every 100 visitors who saw the button clicked it).
Also, at the bottom of the 4th column, we have calculated the Mean of the CTR data = 5% (0.05).
Also, in the 5th column, we have calculated the Square of (CTR data - CTR Mean) for each day.
And, at the bottom of the 5th column, we have calculated the Sum of Square of (CTR data - CTR Mean) = 0.026, which we will now use to calculate the Standard Deviation of our CTR estimate (because it measures the spread of an estimated mean, it is also called the Standard Error), whose formula is:
Standard Deviation
= Square-Root of [{Sum of Square of (CTR data - CTR Mean)} / {Impressions total count}]
= Square-Root of [{0.026} / {30000}]     (30,000 impressions = 1k visitors/day * 30 days)
= 0.0009 or 0.09%
*** So, the Mean of CTR data is 5% with a Standard Deviation of 0.09% ***
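If you'd prefer to let code do this arithmetic, here is a minimal Python sketch of the same calculation. The daily CTR values below are made up for illustration (the real ones are in the table above); everything else follows the formula used here:

    import math

    # Illustrative daily CTRs for the 30-day black-button test
    # (made-up numbers; the real ones come from the table above).
    daily_ctrs = [0.00, 0.00, 0.02, 0.03, 0.04, 0.05, 0.05, 0.04, 0.06, 0.05,
                  0.05, 0.06, 0.04, 0.05, 0.05, 0.06, 0.05, 0.04, 0.05, 0.06,
                  0.07, 0.06, 0.05, 0.06, 0.07, 0.09, 0.09, 0.09, 0.09, 0.09]

    impressions_total = 1_000 * 30  # 1k visitors/day for 30 days

    mean_ctr = sum(daily_ctrs) / len(daily_ctrs)
    sum_sq_dev = sum((ctr - mean_ctr) ** 2 for ctr in daily_ctrs)

    # The post's formula: sqrt(sum of squared deviations / total impressions)
    std_dev = math.sqrt(sum_sq_dev / impressions_total)

    print(f"Mean CTR = {mean_ctr:.2%}, Standard Deviation = {std_dev:.2%}")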
Now, we will apply the 68-95-99.7 rule of Statistics, assuming the data is normally distributed (the rule is depicted in the image below):
Rule 1
68% of the data falls within ONE standard deviation (= 0.09% in this case) of the mean.
So, 68% of our CTR-data would be between 4.91% (=5%-0.09%*1) and 5.09% (=5%+0.09%*1)
Rule 2
95% of the data falls within TWO standard deviations of the mean.
To be precise, it is not TWO, but 1.96
So, 95% of our CTR-data would be between (5%-[0.09%*1.96]) and (5%+[0.09%*1.96])
So, 95% of our CTR-data would be between 4.82% and 5.18%
Rule 3
99.7% of the data falls within THREE standard deviations of the mean.
So, 99.7% of our CTR-data would be between 4.73% (=5%-0.09%*3) and 5.27% (=5%+0.09%*3)
So, the final result of the experiment will be:
1. We are 68% confident that the CTR of the black button is between 4.91% and 5.09%
2. We are 95% confident that the CTR of the black button is between 4.82% and 5.18%
3. We are 99.7% confident that the CTR of the black button is between 4.73% and 5.27%
[Image: the '68-95-99.7' rule]
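To reproduce these intervals in code, here is a tiny Python sketch using the mean (5%) and standard deviation (0.09%) derived above; the multipliers come straight from the rule:

    mean_ctr = 0.05   # 5%, the mean CTR from the table
    std_dev = 0.0009  # 0.09%, from the standard-deviation formula above

    # Multipliers from the 68-95-99.7 rule (1.96 is the precise value for 95%).
    for confidence, z in [("68%", 1.0), ("95%", 1.96), ("99.7%", 3.0)]:
        low, high = mean_ctr - z * std_dev, mean_ctr + z * std_dev
        print(f"{confidence} confidence interval: {low:.2%} to {high:.2%}")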
============================
Notes/Tips:
1.
These exact coefficients, for example 1.96, are called Z-scores (critical values), denoted by 'Z' - a Z-score can be calculated using the NORM.S.INV function in an Excel sheet.
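Outside Excel, the same number can be looked up in Python; scipy's norm.ppf is the counterpart of NORM.S.INV (a sketch, assuming scipy is installed):

    from scipy.stats import norm

    # NORM.S.INV equivalent: inverse of the standard normal CDF.
    # For a two-sided 95% confidence level, 2.5% sits in each tail,
    # so we ask for the 97.5th percentile.
    z_95 = norm.ppf(0.975)
    print(round(z_95, 2))  # 1.96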
2.
The +0.18% & -0.18% (= ±0.09%*1.96, rounded) are called the Margins of Error
3.
The 68%, 95%, etc. are called Confidence Levels.
The Significance Level is calculated by subtracting the Confidence Level from 1.
So, if we choose to use the 95% confidence level, the Significance Level = 1 - 95% = 0.05
4.
The ranges - 4.91% to 5.09%, and 4.82% to 5.18% - are called Confidence Intervals
5.
The p-value is a measure of the Statistical Significance of any given experiment's result.
It can also be read as a measure of the Uncertainty in that result.
6.
The 95% confidence level is the most commonly used one for declaring the results of A/B tests.
The 99.7% level is used when you are testing something that is extremely sensitive for the business.
Similarly, 0.05 is the most commonly used p-value threshold for checking whether a result is Statistically Significant or not.
7.
The existing red color button is called the Control.
The black color button, which we want to test, is called the Variant.
8.
In A/B testing, we start with 2 hypotheses:
Null Hypothesis (H0) - This says that the Control & Variant have the same impact on the KPI (CTR, in our case), i.e. there is no real difference between them.
Alternate Hypothesis (Ha) - This says that the Control & Variant have different impacts on the KPI.
AB testing, hence, is used to check which hypothesis is correct.
If the p-value is less than the Significance Level, the result is Statistically Significant: we reject H0 in favor of Ha, i.e. the experiment has detected a real difference.
Similarly, if the p-value is greater than or equal to the Significance Level, the result is not Statistically Significant: we fail to reject H0, i.e. the experiment could not establish a difference (which is not the same as proving Ha false).
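To make this decision rule concrete, here is a Python sketch of a pooled two-proportion z-test - one common way to get the p-value; the post itself does not prescribe a specific test, and the counts below are hypothetical:

    import math

    # Hypothetical counts: 30k impressions each, red button at its
    # historical 3% CTR and black button at the observed 5% CTR.
    clicks_control, n_control = 900, 30_000
    clicks_variant, n_variant = 1_500, 30_000

    p_control = clicks_control / n_control
    p_variant = clicks_variant / n_variant

    # Pooled standard error for the difference of two proportions.
    p_pool = (clicks_control + clicks_variant) / (n_control + n_variant)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_control + 1 / n_variant))
    z = (p_variant - p_control) / se

    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

    alpha = 0.05  # Significance Level for the 95% confidence level
    print(f"z = {z:.2f}, p-value = {p_value:.4g}")
    print("Reject H0 (significant)" if p_value < alpha else "Fail to reject H0")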
9.
A low standard deviation tells us that the data is closely clustered around the mean (or average), and hence produces a narrow/peaked graph, while a high standard deviation indicates that the data is dispersed over a wider range of values, and hence produces a flattened/spread-out graph.
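A tiny illustration of this, with two made-up data sets sharing the same mean:

    import statistics

    tight = [4.9, 5.0, 5.1, 5.0, 5.0]  # clustered around the mean
    wide = [1.0, 3.0, 5.0, 7.0, 9.0]   # dispersed over a wider range

    print(statistics.mean(tight), statistics.pstdev(tight))  # 5.0, ~0.06
    print(statistics.mean(wide), statistics.pstdev(wide))    # 5.0, ~2.83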
10.
If the confidence intervals of your Control and your Variant overlap, you need to keep testing, even if your testing tool says that one is a statistically significant winner.
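A simple overlap check, sketched in Python with hypothetical intervals (only the Variant's numbers come from the experiment above):

    def intervals_overlap(low_a, high_a, low_b, high_b):
        # True if the two intervals share any common range.
        return low_a <= high_b and low_b <= high_a

    control = (0.028, 0.032)    # hypothetical 95% interval for the red button
    variant = (0.0482, 0.0518)  # 4.82% to 5.18%, from the experiment above

    if intervals_overlap(*control, *variant):
        print("Intervals overlap: keep testing.")
    else:
        print("No overlap: the winner is more trustworthy.")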
Further reading:
KhanAcademy.org/math/ap-statistics/tests-significance-ap/idea-significance-tests/v/p-values-and-significance-tests
VWO.com/blog/what-you-really-need-to-know-about-mathematics-of-ab-split-testing/
ConversionSciences.com/ab-testing-statistics/
YouTube.com/watch?v=cgxPcdPbujI
YouTube.com/watch?v=hlM7zdf7zwU
YouTube.com/watch?v=-MKT3yLDkqk