A/B Testing on Questionnaire Ads
Background
SmartAd is a relatively well-known digital marketing agency that helps implement various ad concepts on its clients' websites. On this occasion, it wants to improve the performance of a questionnaire ad shown on a client's website.
Objective
SmartAd wants to know whether the new questionnaire ad concept makes users more likely to engage with it.
Experimental Design
Metrics
Because SmartAd wants to know how well the ad performs, we’re using this metric:
- Ad response rate: of those who saw the ad, the percentage who responded to it
Variants
- Control: Users who have been shown a static ad
- Treatment: Users who have been shown an interactive ad with the SmartAd brand written on it
Hypothesis
Based on the metrics and variants, we propose these hypotheses.
- H0: Ad response rate of static ad = Ad response rate of interactive ad
- H1: Ad response rate of static ad < Ad response rate of interactive ad
We use a one-sided two-proportion hypothesis test with a confidence level of 95% (α = 0.05) and a power of 80% (1 - β = 0.8, so β = 0.2).
Sample Size
We now calculate the sample size needed for the A/B test using this formula:
n = (Zα + Zβ)^2 * (p1(1-p1) + p2(1-p2)) / (p1-p2)^2,
Zα = 1.6449 (α = 0.05, one-sided)
Zβ = 0.8416 (β = 0.2)
p1 = 0.14 (assumed based on previously available data)
p2 = 0.17 (assuming a 0.03 increase in response rate)
Based on these test parameters, we find that we need a sample size of 1796 for each group (treatment and control).
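As a quick check, the calculation can be reproduced in Python with scipy. This is a sketch of the formula above; the 1796 figure quoted comes from rounding to the nearest integer.

```python
from scipy.stats import norm

alpha, beta = 0.05, 0.20
p1, p2 = 0.14, 0.17  # assumed baseline and expected response rates

z_alpha = norm.ppf(1 - alpha)  # 1.6449 (one-sided)
z_beta = norm.ppf(1 - beta)    # 0.8416

# Two-proportion sample size per group
n = (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2
print(round(n))  # ~1796 per group; a conservative ceil() would give 1797
```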
Experiment Duration
We’ll run the experiment for 7 days (3–9 Jul 2020).
Data Cleaning and Data Enrichment
We need to make sure there’s no missing or duplicate data.
We also add new variables to enrich the data, such as the day of the week and ‘responded’ and ‘accepted’ flags, to simplify our analysis.
The full process can be accessed on my GitHub.
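A minimal sketch of these steps in pandas, assuming columns named 'date', 'yes', and 'no' (illustrative names; the actual schema lives in the repo linked above):

```python
import pandas as pd

df = pd.read_csv("smartad_data.csv", parse_dates=["date"])  # hypothetical file name

# Data cleaning: drop exact duplicates and verify there are no missing values.
df = df.drop_duplicates()
assert df.isna().sum().sum() == 0, "unexpected missing values"

# Data enrichment: day of the week, plus flags for whether a user
# responded to the questionnaire at all and whether they accepted it.
df["day_of_week"] = df["date"].dt.day_name()
df["responded"] = (df["yes"] == 1) | (df["no"] == 1)
df["accepted"] = df["yes"] == 1
```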
Data Analysis and Interpretation
Response Rate Performance: The Whole Week
Based on the sample collected, the exposed (treatment) group performs slightly better than the control group.
- Response rate of the static ad: 14.39%
- Response rate of the interactive ad: 16.40%
- P-value: 0.0209
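For reference, here’s a sketch of the test behind that p-value, using statsmodels with illustrative counts (the real numbers come from the collected sample):

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical counts only, chosen to mirror the observed rates above.
responses = [328, 288]    # respondents: treatment, control
exposures = [2000, 2000]  # users shown each variant

# alternative='larger' tests H1: p_treatment > p_control (one-sided).
z_stat, p_value = proportions_ztest(responses, exposures, alternative="larger")
print(f"z = {z_stat:.3f}, p = {p_value:.4f}")
```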
Response Rate Performance: Weekdays vs Weekends
To account for the potential difference in user behavior between weekdays and weekends, the sample is divided into two groups (weekdays and weekends).
Unfortunately, only the interactions recorded on weekdays have a sufficient sample size (min. 1796 for each of the exposed and control groups).
Performance on the Weekdays
There’s an increase in the treatment group compared to the control one. The increase is also statistically significant, with a p-value of 0.042.
Performance on the Weekends
There’s also an increase in the treatment group compared to the control one. However, we don’t have enough samples to run the statistical test.
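The weekday/weekend comparison (and the per-OS breakdown in the next section) applies the same one-sided z-test to subsets of the sample. A minimal sketch, assuming the enriched DataFrame from the cleaning step and an 'experiment' column whose values are 'exposed' and 'control' (column and label names are assumptions):

```python
import pandas as pd
from statsmodels.stats.proportion import proportions_ztest

MIN_SAMPLE = 1796  # required per-group size from the calculation above

def test_segment(seg: pd.DataFrame):
    """One-sided two-proportion z-test on a single segment of the sample."""
    treat = seg[seg["experiment"] == "exposed"]
    control = seg[seg["experiment"] == "control"]
    if min(len(treat), len(control)) < MIN_SAMPLE:
        return None  # too few samples to run the test, as on weekends
    _, p_value = proportions_ztest(
        [treat["responded"].sum(), control["responded"].sum()],
        [len(treat), len(control)],
        alternative="larger",  # H1: treatment response rate > control
    )
    return p_value

# Split on the day_of_week variable added during enrichment
# (df is the enriched DataFrame from the cleaning sketch above).
df["is_weekend"] = df["day_of_week"].isin(["Saturday", "Sunday"])
for is_weekend, seg in df.groupby("is_weekend"):
    print("weekend" if is_weekend else "weekday", test_segment(seg))
```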
Response Rate Performance by Device OS Type
Here are the data points collected, broken down by device OS type. As it turns out, we only have enough data to run a statistical test on device OS 6.
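The same hypothetical test_segment helper sketched above can be reused per OS type (the column name 'os_type' is an assumption):

```python
# Returns None for OS segments where the sample is too small to test.
for os_type, seg in df.groupby("os_type"):
    print(os_type, test_segment(seg))
```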
Performance on Device OS 5
Based on the data points collected, there’s a decrease in the response rate of the treatment group compared to the control group.
However, there are not enough samples to conclude that this difference in performance is statistically significant.
Performance on Device OS 6
There’s an increase in the response rate, but after running the statistical test, it turned out not to be statistically significant, with a p-value of 0.177.
Conclusion
After running several two-sample proportion z-tests, we find that we can reject the null hypothesis and conclude that the response rate of the interactive ad is higher than that of the static ad for the whole week and on weekdays.
Recommendation
Even though the interactive ad (treatment) performs statistically significantly better than the static ad (control), it’s still inconclusive whether the increase in performance will have any practical impact on the business.
That said, incorporating additional business context into the improved performance would give a more contextual outlook on the result.
Moreover, due to the lack of sample size, we couldn’t conduct statistical tests on the response rate performance on weekends or on device OS 5. Increasing the amount of data collected would allow a more comprehensive analysis.
Resources
References
- https://stats.libretexts.org/Bookshelves/Introductory_Statistics/Mostly_Harmless_Statistics_(Webb)/09%3A_Hypothesis_Tests_and_Confidence_Intervals_for_Two_Populations/9.03%3A_Two_Proportion_Z-Test_and_Confidence_Interval
- https://select-statistics.co.uk/calculators/sample-size-calculator-two-proportions/
- https://www.stat.ubc.ca/~rollin/stats/ssize/b2.html
- https://www.statology.org/z-critical-value-python/