A/B Testing — Statistical Significance & Experiment Design Guide
A/B testing is a controlled experiment where you compare two versions of a feature to determine which performs better using statistical methods. In this guide, you will learn how to design statistically valid A/B tests, calculate required sample sizes, interpret p-values and confidence intervals, and avoid the common pitfalls that invalidate experiment results. The Doda Browser team runs A/B tests on every UI change — from button colors to search result layouts — using a homegrown experimentation platform that processes millions of user events daily.
Learning Path
Usability Testing -> Metrics & Analytics -> A/B Testing (you are here) -> Statistical Methods -> Conversion Optimization
A/B Test Structure
Every A/B test has the same components:
| Component | Description |
|---|---|
| Control (A) | Current version |
| Treatment (B) | New version with one change |
| Metric | What you measure (conversion rate, click rate, revenue) |
| Hypothesis | Prediction of the outcome |
| Sample size | Number of users needed for statistical power |
Sample Size Calculation
Calculate the minimum sample size needed for a valid test:
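import math
from statistics import NormalDist

def minimum_sample_size(baseline_rate, minimum_effect, significance=0.05, power=0.8):
    """Users needed per variant to detect an absolute lift of minimum_effect (two-sided test)."""
    # Derive the critical values from the arguments instead of hardcoding 1.96 and 0.84.
    z_alpha = NormalDist().inv_cdf(1 - significance / 2)
    z_beta = NormalDist().inv_cdf(power)
    treatment_rate = baseline_rate + minimum_effect
    p_avg = (baseline_rate + treatment_rate) / 2
    numerator = (z_alpha * math.sqrt(2 * p_avg * (1 - p_avg)) +
                 z_beta * math.sqrt(baseline_rate * (1 - baseline_rate) +
                                    treatment_rate * (1 - treatment_rate))) ** 2
    denominator = minimum_effect ** 2
    return math.ceil(numerator / denominator)

baseline = 0.05  # current conversion rate
effect = 0.01    # absolute lift to detect (5% -> 6%)
n = minimum_sample_size(baseline, effect)
print(f"Sample size needed: {n} users per variant")

Expected output:
Sample size needed: 8158 users per variant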
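The required sample grows quickly as the effect you want to detect shrinks, because the effect is squared in the denominator. The sketch below evaluates the same two-proportion formula for a few absolute effects at a 5% baseline. The function name `sample_size_per_variant` and the chosen effects are illustrative assumptions, not part of the Doda Browser platform.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, minimum_effect, significance=0.05, power=0.8):
    """Two-sided, two-proportion sample size; minimum_effect is an absolute difference."""
    z_alpha = NormalDist().inv_cdf(1 - significance / 2)
    z_beta = NormalDist().inv_cdf(power)
    p1 = baseline_rate
    p2 = baseline_rate + minimum_effect
    p_avg = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_avg * (1 - p_avg))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / minimum_effect ** 2)


# Hypothetical effects chosen for illustration only.
sizes = {effect: sample_size_per_variant(0.05, effect) for effect in (0.005, 0.01, 0.02)}
for effect, n in sizes.items():
    print(f"effect {effect:.3f}: {n} users per variant")
```

Halving the detectable effect roughly quadruples the sample, which is why the minimum detectable effect is worth agreeing on before the test starts.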
Running the A/B Test
Simulate an A/B test and analyze results:
import math
import random

def run_ab_test(control_rate, treatment_rate, sample_size):
    """Simulate one conversion outcome (1 or 0) per user in each variant."""
    control = [1 if random.random() < control_rate else 0 for _ in range(sample_size)]
    treatment = [1 if random.random() < treatment_rate else 0 for _ in range(sample_size)]
    return control, treatment

def calculate_results(control, treatment):
    """Two-sided z-test for the difference between two conversion rates."""
    n_c, n_t = len(control), len(treatment)
    p_c = sum(control) / n_c
    p_t = sum(treatment) / n_t
    # Unpooled standard error of the difference in proportions
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z = (p_t - p_c) / se
    # Two-sided p-value from the standard normal CDF, written with erf
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    uplift = (p_t - p_c) / p_c * 100  # relative uplift in percent
    return {
        "control_rate": p_c,
        "treatment_rate": p_t,
        "uplift_pct": uplift,
        "z_score": z,
        "p_value": p_value,
        "significant": p_value < 0.05
    }

random.seed(42)  # fixed seed so the simulation is reproducible
control, treatment = run_ab_test(0.05, 0.06, 10000)
results = calculate_results(control, treatment)
for k, v in results.items():
    print(f"{k}: {v:.4f}" if isinstance(v, float) else f"{k}: {v}")
Expected output:
control_rate: 0.0496
treatment_rate: 0.0592
uplift_pct: 19.3548
z_score: 2.9936
p_value: 0.0028
significant: True
Interpreting Results
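from statistics import NormalDist

def confidence_interval(p, n, confidence=0.95):
    """Wald (normal approximation) interval for a single conversion rate."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    se = math.sqrt(p * (1 - p) / n)
    margin = z * se
    return (p - margin, p + margin)

# Reuses `results` and the `math` import from the previous block.
ci_control = confidence_interval(results["control_rate"], 10000)
ci_treatment = confidence_interval(results["treatment_rate"], 10000)
print(f"Control 95% CI: {ci_control[0]:.4f} - {ci_control[1]:.4f}")
print(f"Treatment 95% CI: {ci_treatment[0]:.4f} - {ci_treatment[1]:.4f}")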
Expected output:
Control 95% CI: 0.0453 - 0.0539
Treatment 95% CI: 0.0546 - 0.0638
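The confidence intervals do not overlap, which is sufficient to conclude that the difference is statistically significant at the 5% level. The reverse does not hold: two intervals can overlap while the difference is still significant, so the z-test p-value from the previous section, or a confidence interval for the difference itself, is the more reliable check.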
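A more direct check is a confidence interval for the difference between the two rates: if it excludes zero, the result is significant at the matching level. The sketch below uses the same normal approximation as the rest of this guide and plugs in the rates from the simulated results above. The function name `difference_confidence_interval` is an assumption introduced for illustration.

```python
import math
from statistics import NormalDist


def difference_confidence_interval(p_c, n_c, p_t, n_t, confidence=0.95):
    """Wald interval for the difference in conversion rates (treatment - control)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    diff = p_t - p_c
    return diff - z * se, diff + z * se


# Rates taken from the simulated results shown earlier in this guide.
low, high = difference_confidence_interval(0.0496, 10000, 0.0592, 10000)
print(f"Difference 95% CI: {low:.4f} - {high:.4f}")  # Difference 95% CI: 0.0033 - 0.0159
print("Excludes zero:", low > 0 or high < 0)          # Excludes zero: True
```

The interval also shows the plausible size of the lift, roughly 0.3 to 1.6 percentage points, which a bare p-value does not.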
Common A/B Testing Mistakes
1. Peeking at Results
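Checking results daily and stopping as soon as p < 0.05 gives chance many opportunities to cross the threshold, so the false positive rate climbs well above the nominal 5%. Fix the sample size in advance and analyze once, or use a sequential testing method designed for repeated looks.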
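The effect of peeking can be seen by simulating A/A tests, where both variants share the same true rate and every "significant" result is a false positive. This is a minimal sketch: the number of looks, the users per look, the number of simulations, and the helper names are all assumptions chosen to keep the run short.

```python
import math
import random


def z_test_p_value(conv_c, n_c, conv_t, n_t):
    """Two-sided p-value for a difference in proportions (unpooled z-test)."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    if se == 0:
        return 1.0  # no conversions in either arm yet, nothing to test
    z = (p_t - p_c) / se
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_aa_tests(rate=0.05, looks=10, users_per_look=200, n_sims=500, seed=1):
    """Return (false positive rate when stopping at any look, rate at the final look only)."""
    rng = random.Random(seed)
    stopped_early = 0
    significant_at_end = 0
    for _ in range(n_sims):
        conv_c = conv_t = n = 0
        flagged = False
        for _ in range(looks):
            # Both arms draw from the same true rate, so any difference is noise.
            conv_c += sum(rng.random() < rate for _ in range(users_per_look))
            conv_t += sum(rng.random() < rate for _ in range(users_per_look))
            n += users_per_look
            p = z_test_p_value(conv_c, n, conv_t, n)
            if p < 0.05:
                flagged = True  # a peeking analyst would have stopped here
        stopped_early += flagged
        significant_at_end += p < 0.05
    return stopped_early / n_sims, significant_at_end / n_sims


peeking_rate, final_rate = simulate_aa_tests()
print(f"False positive rate with peeking at every look: {peeking_rate:.1%}")
print(f"False positive rate with one final analysis: {final_rate:.1%}")
```

Because the final analysis is also one of the looks, the peeking rate can never be lower than the single-analysis rate. The gap between the two is the cost of stopping early.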
2. Multiple Metrics
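With ten independent metrics each tested at α = 0.05, the chance of at least one false positive is about 40% (1 - 0.95^10 ≈ 0.40). Use a Bonferroni correction (test each metric at α divided by the number of metrics) or pick one primary metric.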
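The arithmetic behind that figure, and the effect of a Bonferroni correction, can be sketched in a few lines. The calculation assumes the metrics are independent, which real metrics rarely are, and the function names are illustrative.

```python
def familywise_error_rate(num_metrics, alpha=0.05):
    """Chance of at least one false positive across independent tests at level alpha."""
    return 1 - (1 - alpha) ** num_metrics


def bonferroni_threshold(num_metrics, alpha=0.05):
    """Per-metric significance level that keeps the family-wise rate at or below alpha."""
    return alpha / num_metrics


uncorrected = familywise_error_rate(10)
corrected = familywise_error_rate(10, alpha=bonferroni_threshold(10))
print(f"Uncorrected, 10 metrics: {uncorrected:.3f}")    # Uncorrected, 10 metrics: 0.401
print(f"Per-metric threshold: {bonferroni_threshold(10):.4f}")  # Per-metric threshold: 0.0050
print(f"With Bonferroni: {corrected:.3f}")              # With Bonferroni: 0.049
```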
3. Sample Ratio Mismatch
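If the observed traffic split deviates from 50/50 (or your intended ratio) by more than chance can explain, the randomization or the event logging is likely broken, and the results should not be trusted until the cause is found.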
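Whether a deviation is "more than chance can explain" can be checked with a chi-square goodness-of-fit test on the user counts. The sketch below uses only the standard library. The function name `srm_p_value`, the example counts, and the strict 0.001 alert threshold are assumptions made for illustration.

```python
import math


def srm_p_value(n_control, n_treatment, expected_control_share=0.5):
    """P-value of a chi-square test (1 degree of freedom) for the observed split."""
    total = n_control + n_treatment
    expected_c = total * expected_control_share
    expected_t = total - expected_c
    chi2 = ((n_control - expected_c) ** 2 / expected_c
            + (n_treatment - expected_t) ** 2 / expected_t)
    # For 1 degree of freedom, P(chi-square > x) equals erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


balanced = srm_p_value(10000, 10000)
skewed = srm_p_value(10000, 10500)  # hypothetical counts with a 2.4% imbalance
print(f"Balanced split p-value: {balanced:.4f}")
print(f"Skewed split p-value: {skewed:.4f}")
print("Sample ratio mismatch suspected:", skewed < 0.001)
```

A small p-value here says nothing about which variant is better. It only signals that the assignment or logging needs investigation before the metric results are read.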
4. Novelty Effect
Users behave differently with new features at first. Run the test long enough for the novelty to wear off.
5. Segmentation Without Planning
Finding significance in a subgroup after the fact is data dredging. Define segments before the experiment.
Practice Questions
1. What is statistical significance in A/B testing?
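It means the observed difference between variants is unlikely to have occurred by chance alone. Formally, the p-value is below the chosen significance level (conventionally 0.05): if there were no real difference, a result at least this extreme would appear less than 5% of the time.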
2. How do you calculate the required sample size for an A/B test?
Use the baseline conversion rate, minimum detectable effect, desired significance level (usually 0.05), and statistical power (usually 0.80).
3. What is the difference between statistical significance and practical significance?
Statistical significance means the result is unlikely due to chance. Practical significance means the effect size is large enough to matter for business decisions.
4. What is the novelty effect and how do you control for it?
Users initially engage more with new features. Run the experiment long enough (2+ weeks) for behavior to normalize.
Challenge: Design an A/B test for a checkout flow change. Calculate the sample size for detecting a 0.5 percentage point conversion uplift from a 3% baseline (3% to 3.5%). Run a simulated experiment, interpret the p-value and confidence intervals, and write a decision recommendation.
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.