
A/B Testing — Statistical Significance & Experiment Design Guide

DodaTech · Updated 2026-06-24 · 4 min read

A/B testing is a controlled experiment where you compare two versions of a feature to determine which performs better using statistical methods. In this guide, you will learn how to design statistically valid A/B tests, calculate required sample sizes, interpret p-values and confidence intervals, and avoid the common pitfalls that invalidate experiment results. The Doda Browser team runs A/B tests on every UI change — from button colors to search result layouts — using a homegrown experimentation platform that processes millions of user events daily.

Learning Path

flowchart LR
  A[Usability Testing] --> B[Metrics & Analytics]
  B --> C[A/B Testing<br/>You are here]
  C --> D[Statistical Methods]
  D --> E[Conversion Optimization]
  style C fill:#f90,color:#fff

A/B Test Structure

Every A/B test has the same components:

Component | Description
--- | ---
Control (A) | Current version
Treatment (B) | New version with one change
Metric | What you measure (conversion rate, click rate, revenue)
Hypothesis | Prediction of the outcome
Sample size | Number of users needed for statistical power

Sample Size Calculation

Calculate the minimum sample size needed for a valid test:

import math
from statistics import NormalDist

def minimum_sample_size(baseline_rate, minimum_effect, significance=0.05, power=0.8):
    """Users needed per variant to detect an absolute lift of minimum_effect."""
    # Critical values for a two-sided test, derived from the arguments
    z_alpha = NormalDist().inv_cdf(1 - significance / 2)  # about 1.96 for significance=0.05
    z_beta = NormalDist().inv_cdf(power)                  # about 0.84 for power=0.8
    p1 = baseline_rate
    p2 = baseline_rate + minimum_effect
    p_avg = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_avg * (1 - p_avg)) +
                 z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    denominator = minimum_effect ** 2
    return math.ceil(numerator / denominator)

baseline = 0.05  # current conversion rate
effect = 0.01    # absolute lift to detect (5% to 6%)
n = minimum_sample_size(baseline, effect)
print(f"Sample size needed: {n} users per variant")

Expected output:

Sample size needed: 8158 users per variant
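
For reference, the calculation above appears to follow the standard normal-approximation formula for comparing two proportions. Written out, with p_1 as the baseline rate, p_2 as the baseline rate plus the minimum effect, and p-bar as their average (these symbol names are notation chosen for this sketch, not taken from the code):

```latex
n = \frac{\left( z_{1-\alpha/2}\,\sqrt{2\,\bar{p}\,(1-\bar{p})}
        + z_{1-\beta}\,\sqrt{p_1(1-p_1) + p_2(1-p_2)} \right)^2}{(p_2 - p_1)^2},
\qquad \bar{p} = \frac{p_1 + p_2}{2}
```

Here z_{1-α/2} and z_{1-β} are the standard normal quantiles for the significance level and the power (about 1.96 and 0.84 for the defaults). Because the denominator is the squared effect, halving the minimum detectable effect roughly quadruples the required sample size.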

Running the A/B Test

Simulate an A/B test and analyze results:

import random, math

def run_ab_test(control_rate, treatment_rate, sample_size):
    control = [1 if random.random() < control_rate else 0 for _ in range(sample_size)]
    treatment = [1 if random.random() < treatment_rate else 0 for _ in range(sample_size)]
    return control, treatment

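def calculate_results(control, treatment):
    n_c, n_t = len(control), len(treatment)
    p_c = sum(control) / n_c  # observed conversion rates
    p_t = sum(treatment) / n_t
    # Unpooled standard error of the difference in proportions
    se = math.sqrt(p_c * (1-p_c)/n_c + p_t * (1-p_t)/n_t)
    z = (p_t - p_c) / se
    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    uplift = (p_t - p_c) / p_c * 100  # relative uplift in percent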
    return {
        "control_rate": p_c,
        "treatment_rate": p_t,
        "uplift_pct": uplift,
        "z_score": z,
        "p_value": p_value,
        "significant": p_value < 0.05
    }

random.seed(42)
control, treatment = run_ab_test(0.05, 0.06, 10000)
results = calculate_results(control, treatment)
for k, v in results.items():
    print(f"{k}: {v:.4f}" if isinstance(v, float) else f"{k}: {v}")

Expected output:

control_rate: 0.0496
treatment_rate: 0.0592
uplift_pct: 19.3548
z_score: 2.9936
p_value: 0.0028
significant: True

Interpreting Results

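from statistics import NormalDist

def confidence_interval(p, n, confidence=0.95):
    # Two-sided critical value for the requested level (about 1.96 for 95%)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    se = math.sqrt(p * (1-p) / n)  # standard error of a single proportion
    margin = z * se
    return (p - margin, p + margin)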

ci_control = confidence_interval(results["control_rate"], 10000)
ci_treatment = confidence_interval(results["treatment_rate"], 10000)
print(f"Control 95% CI: {ci_control[0]:.4f} - {ci_control[1]:.4f}")
print(f"Treatment 95% CI: {ci_treatment[0]:.4f} - {ci_treatment[1]:.4f}")

Expected output:

Control 95% CI: 0.0453 - 0.0539
Treatment 95% CI: 0.0546 - 0.0638

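The confidence intervals do not overlap, which is consistent with the significant p-value. Treat this only as a rough check: two intervals can overlap slightly while the difference is still significant, so the p-value or a confidence interval on the difference between the rates is the more reliable test.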
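
Comparing two separate intervals is an indirect check. A more direct view is a single interval for the difference between the two rates. The sketch below is one way to do that, assuming the same unpooled normal approximation used in the z-test earlier; the function name `difference_confidence_interval` is hypothetical and the input rates are the ones printed in the example output.

```python
import math
from statistics import NormalDist

def difference_confidence_interval(p_c, n_c, p_t, n_t, confidence=0.95):
    """Normal-approximation interval for (treatment rate - control rate)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Unpooled standard error of the difference in proportions
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    diff = p_t - p_c
    return diff - z * se, diff + z * se

low, high = difference_confidence_interval(0.0496, 10000, 0.0592, 10000)
print(f"Difference 95% CI: {low:.4f} - {high:.4f}")  # about 0.0033 - 0.0159
```

If the interval excludes zero, the difference is significant at the matching level, and its width shows how precisely the uplift has been measured.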

Common A/B Testing Mistakes

1. Peeking at Results

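Checking results daily and stopping as soon as p < 0.05 pushes the false positive rate well above the nominal 5%, because every extra look gives random noise another chance to cross the threshold. Fix the sample size in advance and evaluate once, or use a sequential testing method designed for repeated looks.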
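
The effect of peeking can be illustrated with a small simulation of A/A tests, where both arms share the same true rate and any "significant" result is a false positive. This is a rough sketch under stated assumptions: the function names, the 10% conversion rate, the ten evenly spaced looks, and the seed are all illustrative choices, not part of the original guide.

```python
import math
import random

def z_test_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a difference in proportions (unpooled z-test)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    if se == 0:
        return 1.0  # no variation observed, so no evidence of a difference
    z = (p_b - p_a) / se
    return math.erfc(abs(z) / math.sqrt(2))

def false_positive_rates(rate=0.10, n_per_arm=1000, looks=10, simulations=400, seed=1):
    """Simulate A/A tests and compare 'stop at any look' with 'final look only'."""
    rng = random.Random(seed)
    step = n_per_arm // looks
    peeking_hits = fixed_hits = 0
    for _ in range(simulations):
        conv_a = conv_b = 0
        any_significant = False
        p = 1.0
        for look in range(1, looks + 1):
            # Add the next batch of users to each arm, then test again
            conv_a += sum(rng.random() < rate for _ in range(step))
            conv_b += sum(rng.random() < rate for _ in range(step))
            p = z_test_p_value(conv_a, look * step, conv_b, look * step)
            if p < 0.05:
                any_significant = True  # a peeker would have stopped here
        peeking_hits += any_significant
        fixed_hits += p < 0.05  # only the final look counts
    return peeking_hits / simulations, fixed_hits / simulations

peeking, fixed = false_positive_rates()
print(f"False positives with peeking: {peeking:.1%}, final look only: {fixed:.1%}")
```

Since there is no real effect in these simulations, the gap between the two rates is the cost of stopping at the first significant look.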

2. Multiple Metrics

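Each metric tested at the 0.05 level is another chance for a false positive. With ten independent metrics, the chance of at least one false positive is about 40% (1 - 0.95^10). Use a Bonferroni correction or pick one primary metric.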
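
The arithmetic behind this can be sketched in a few lines. The example assumes the metrics are independent, and the function names and sample p-values are made up for illustration.

```python
def familywise_error_rate(num_metrics, alpha=0.05):
    """Chance of at least one false positive across independent metrics."""
    return 1 - (1 - alpha) ** num_metrics

def bonferroni_significant(p_values, alpha=0.05):
    """Flag each p-value against the Bonferroni threshold alpha / m."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]

print(f"{familywise_error_rate(10):.3f}")            # 0.401
print(bonferroni_significant([0.004, 0.03, 0.2]))    # [True, False, False]
```

With three metrics the threshold drops to 0.05 / 3, so a p-value of 0.03 that would pass on its own no longer counts as significant.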

3. Sample Ratio Mismatch

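If the observed traffic split deviates from the intended ratio (50/50 or otherwise) by more than chance would explain, assignment or logging is likely broken, and the results should not be trusted until the cause is found.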
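
One common way to check for a mismatch is a chi-square goodness-of-fit test on the observed group sizes. The sketch below assumes two groups, which gives one degree of freedom and lets the p-value be computed with `math.erfc` alone; the function name `srm_p_value` and the user counts are hypothetical.

```python
import math

def srm_p_value(n_control, n_treatment, expected_control_share=0.5):
    """Chi-square (1 degree of freedom) p-value for the observed split."""
    total = n_control + n_treatment
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = ((n_control - expected_control) ** 2 / expected_control +
            (n_treatment - expected_treatment) ** 2 / expected_treatment)
    # For one degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2))
    return math.erfc(math.sqrt(chi2 / 2))

print(f"{srm_p_value(10000, 10400):.4f}")  # about 0.005, worth investigating
print(f"{srm_p_value(10000, 10050):.4f}")  # large p-value, split looks fine
```

Teams often use a strict threshold for this check (for example p < 0.001) so that the alarm fires only on clear mismatches; the right cutoff is a judgment call.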

4. Novelty Effect

Users behave differently with new features at first. Run the test long enough for the novelty to wear off.

5. Segmentation Without Planning

Finding significance in a subgroup after the fact is data dredging. Define segments before the experiment.

Practice Questions

1. What is statistical significance in A/B testing?

It means that, if there were no real difference between the variants, a difference at least as large as the one observed would be unlikely to occur by chance alone (conventionally, p-value < 0.05).

2. How do you calculate the required sample size for an A/B test?

Use the baseline conversion rate, minimum detectable effect, desired significance level (usually 0.05), and statistical power (usually 0.80).

3. What is the difference between statistical significance and practical significance?

Statistical significance means the result is unlikely due to chance. Practical significance means the effect size is large enough to matter for business decisions.

4. What is the novelty effect and how do you control for it?

Users initially engage more with new features. Run the experiment long enough (2+ weeks) for behavior to normalize.

Challenge: Design an A/B test for a checkout flow change. Calculate the sample size for detecting a 0.5 percentage point conversion uplift from a 3% baseline (3% to 3.5%). Run a simulated experiment, interpret the p-value and confidence intervals, and write a decision recommendation.

FAQ

What is A/B testing?

A/B testing compares two versions (A and B) by randomly assigning users to each group and measuring which performs better on a predefined metric.

How long should an A/B test run?

At minimum, until the required sample size is reached. For most web experiments, 1-2 weeks is recommended to capture weekly behavior cycles.
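
As a rough planning aid, the duration can be estimated from the required sample size and expected traffic, then rounded up to whole weeks so every weekday is covered equally. This is a sketch only: the function name `experiment_duration_days` and the traffic figures are hypothetical, and it assumes all daily users enter the experiment and are split evenly across variants.

```python
import math

def experiment_duration_days(sample_size_per_variant, daily_users, variants=2):
    """Days needed to reach the sample size, rounded up to full weeks."""
    total_needed = sample_size_per_variant * variants
    days = math.ceil(total_needed / daily_users)
    return math.ceil(days / 7) * 7

# 8000 users per variant at 1500 eligible users per day:
# 16000 / 1500 rounds up to 11 days, which rounds up to 14
print(experiment_duration_days(8000, 1500))  # 14
```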

What is the minimum sample size for A/B testing?

It depends on the baseline rate and the expected effect size. Use a sample size calculator: for a typical 5% baseline detecting a 1% absolute effect at 5% significance and 80% power, you need approximately 8,200 users per variant.

Can I run more than two variants?

Yes, but each additional variant increases the required total sample size and complicates the statistical analysis (use Bonferroni correction).

What's Next

Canary Testing — Gradual Rollout Guide
Usability Testing Guide
QA Metrics Guide

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
