A/B Testing on Questionnaire Ads
Background
SmartAd is a relatively well-known digital marketing agency that helps implement various ad concepts on its clients' websites. On this occasion, it wants to improve the performance of a questionnaire ad shown on a client's website.
Objective
SmartAd wants to know whether the new questionnaire ad concept makes users more likely to engage with it.
Experimental Design
Metrics
Because SmartAd wants to know how well the ad performs, we’re using this metric:
- Ad response rate: of those who saw the ad, the percentage who responded to it
Variants
- Control: Users who have been shown a static ad
- Treatment: Users who have been shown an interactive ad with the SmartAd brand written on it
Hypothesis
Based on the metrics and variants, we propose these hypotheses.
- H0: Ad response rate of static ad = Ad response rate of interactive ad
- H1: Ad response rate of static ad < Ad response rate of interactive ad
We use a one-sided two-proportion hypothesis test with a confidence level of 95% (α = 0.05) and a power of 80% (1 - β = 0.8, so β = 0.2).
Sample Size
We now calculate the sample size needed for the A/B test using this formula:
n = (Zα + Zβ)^2 * (p1(1-p1) + p2(1-p2)) / (p1-p2)^2,
Zα = 1.6449 (α = 0.05, one-sided)
Zβ = 0.8416 (β = 0.2)
p1 = 0.14 (assumed based on previously available data)
p2 = 0.17 (assuming a 0.03 increase in response rate)
Based on these test parameters, we find that we need a sample size of 1796 for each group (treatment and control).
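As a quick check, the calculation can be reproduced in Python with scipy. This is a sketch of the formula above; the 1796 figure quoted comes from rounding to the nearest integer.

```python
from scipy.stats import norm

alpha, beta = 0.05, 0.20
p1, p2 = 0.14, 0.17  # assumed baseline and expected response rates

z_alpha = norm.ppf(1 - alpha)  # 1.6449 (one-sided)
z_beta = norm.ppf(1 - beta)    # 0.8416

# Two-proportion sample size per group
n = (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2
print(round(n))  # ~1796 per group; a conservative ceil() would give 1797
```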
Experiment Duration
We’ll run the experiment for 7 days (3–9 Jul 2020).
Data Cleaning and Data Enrichment
We need to make sure there’s no missing or duplicate data.
We also add new variables to enrich the data, such as the day of the week and ‘responded’ and ‘accepted’ flags, to simplify our analysis.
The full process can be accessed on my GitHub.
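A minimal sketch of these steps in pandas, assuming columns named 'date', 'yes', and 'no' (illustrative names; the actual schema lives in the repo linked above):

```python
import pandas as pd

df = pd.read_csv("smartad_data.csv", parse_dates=["date"])  # hypothetical file name

# Data cleaning: drop exact duplicates and verify there are no missing values.
df = df.drop_duplicates()
assert df.isna().sum().sum() == 0, "unexpected missing values"

# Data enrichment: day of the week, plus flags for whether a user
# responded to the questionnaire at all and whether they accepted it.
df["day_of_week"] = df["date"].dt.day_name()
df["responded"] = (df["yes"] == 1) | (df["no"] == 1)
df["accepted"] = df["yes"] == 1
```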
Data Analysis and Interpretation
Response Rate Performance: The Whole Week
Based on the sample collected, the exposed (treatment) group performs slightly better than the control group.
- Response rate of the static ad: 14.39%
- Response rate of the interactive ad: 16.40%
- P-value: 0.0209
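For reference, here’s a sketch of the test behind that p-value, using statsmodels with illustrative counts (the real numbers come from the collected sample):

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical counts only, chosen to mirror the observed rates above.
responses = [328, 288]    # respondents: treatment, control
exposures = [2000, 2000]  # users shown each variant

# alternative='larger' tests H1: p_treatment > p_control (one-sided).
z_stat, p_value = proportions_ztest(responses, exposures, alternative="larger")
print(f"z = {z_stat:.3f}, p = {p_value:.4f}")
```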
Response Rate Performance: Weekdays vs Weekends
To account for the potential difference in user behavior between weekdays and weekends, the sample is divided into two groups (weekdays and weekends).
Unfortunately, only the interactions recorded on weekdays have a sufficient sample size (min. 1796 for each of the exposed and control groups).
Performance on the Weekdays
There’s an increase in the treatment group compared to the control one. The increase is also statistically significant, with a p-value of 0.042.
Performance on the Weekends
There’s also an increase in the treatment group compared to the control one. However, we don’t have enough samples to run the statistical test.
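The weekday/weekend comparison (and the per-OS breakdown in the next section) applies the same one-sided z-test to subsets of the sample. A minimal sketch, assuming the enriched DataFrame from the cleaning step and an 'experiment' column whose values are 'exposed' and 'control' (column and label names are assumptions):

```python
import pandas as pd
from statsmodels.stats.proportion import proportions_ztest

MIN_SAMPLE = 1796  # required per-group size from the calculation above

def test_segment(seg: pd.DataFrame):
    """One-sided two-proportion z-test on a single segment of the sample."""
    treat = seg[seg["experiment"] == "exposed"]
    control = seg[seg["experiment"] == "control"]
    if min(len(treat), len(control)) < MIN_SAMPLE:
        return None  # too few samples to run the test, as on weekends
    _, p_value = proportions_ztest(
        [treat["responded"].sum(), control["responded"].sum()],
        [len(treat), len(control)],
        alternative="larger",  # H1: treatment response rate > control
    )
    return p_value

# Split on the day_of_week variable added during enrichment
# (df is the enriched DataFrame from the cleaning sketch above).
df["is_weekend"] = df["day_of_week"].isin(["Saturday", "Sunday"])
for is_weekend, seg in df.groupby("is_weekend"):
    print("weekend" if is_weekend else "weekday", test_segment(seg))
```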
Response Rate Performance by Device OS Type
Here are the data points collected, broken down by device OS type. As it turns out, we only have enough data to run a statistical test on device OS 6.
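The same hypothetical test_segment helper sketched above can be reused per OS type (the column name 'os_type' is an assumption):

```python
# Returns None for OS segments where the sample is too small to test.
for os_type, seg in df.groupby("os_type"):
    print(os_type, test_segment(seg))
```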
Performance on Device OS 5
Based on the data points collected, there’s a decrease in the response rate of the treatment group compared to the control group.
However, there are not enough samples to conclude that this difference in performance is statistically significant.
Performance on Device OS 6
There’s an increase in the response rate, but after running the statistical test, it turned out not to be statistically significant, with a p-value of 0.177.
Conclusion
After running several two-sample proportion z-tests, we find that we can reject the null hypothesis and conclude that the response rate of the interactive ad is higher than that of the static ad for the whole week and on weekdays.
Recommendation
Even though the interactive ad (treatment) performs statistically significantly better than the static ad (control), it’s still inconclusive whether the increase in performance will have any practical impact on the business.
That said, incorporating additional business context into the improved performance would give a more contextual outlook on the result.
Moreover, due to the lack of sample size, we couldn’t conduct statistical tests on the response rate performance on weekends or on device OS 5. Increasing the amount of data collected would allow a more comprehensive analysis.
Resources
References
- https://stats.libretexts.org/Bookshelves/Introductory_Statistics/Mostly_Harmless_Statistics_(Webb)/09%3A_Hypothesis_Tests_and_Confidence_Intervals_for_Two_Populations/9.03%3A_Two_Proportion_Z-Test_and_Confidence_Interval
- https://select-statistics.co.uk/calculators/sample-size-calculator-two-proportions/
- https://www.stat.ubc.ca/~rollin/stats/ssize/b2.html
- https://www.statology.org/z-critical-value-python/