A/B Testing for ML Models — Statistical Guide with Python

DodaTech Updated 2026-06-24 6 min read

A/B testing for ML models compares two model versions — control (current) and treatment (candidate) — by splitting traffic and measuring which achieves better business metrics with statistical confidence. In this guide, you will learn experiment design, sample size calculation, hypothesis testing, and how to avoid the common traps that invalidate ML experiments. The Doda Browser team runs A/B tests on every recommendation model change, measuring click-through rate improvements before full rollout.

What You'll Learn

You will learn how to design statistically sound A/B tests for ML models, calculate required sample sizes, apply appropriate statistical tests, handle multiple comparisons, and interpret results correctly to make confident deployment decisions.

Why It Matters

Deploying a new ML model without A/B testing is gambling. A model that scores higher on offline metrics can hurt business metrics due to hidden biases, distribution shifts, or unintended user behavior changes. Proper A/B testing catches these issues before they affect all users.

Experiment Design Flow

flowchart TD
  A[Define Hypothesis] --> B[Choose Metric]
  B --> C[Calculate Sample Size]
  C --> D[Split Traffic]
  D --> E[Run Experiment]
  E --> F[Collect Data]
  F --> G[Statistical Test]
  G --> H{Significant?}
  H -->|Yes| I[Deploy Treatment]
  H -->|No| J[Analyze Failure]
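
The "Split Traffic" step above is often implemented with deterministic hashing, so the same user always sees the same variant. The following is a minimal sketch under that assumption, not a description of any particular platform. The function name `assign_variant`, the experiment name `rec-model-v2`, and the user ID format are hypothetical.

```python
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'control' or 'treatment'."""
    # Salting with the experiment name keeps assignments independent across experiments
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    # Interpret the first 8 hex characters (32 bits) as a number in [0, 1)
    bucket = int(digest[:8], 16) / 2**32
    return "treatment" if bucket < treatment_share else "control"

# Hypothetical user IDs, only for illustration
users = [f"user-{i}" for i in range(10_000)]
assignments = [assign_variant(u, "rec-model-v2") for u in users]
share = assignments.count("treatment") / len(assignments)
print(f"Treatment share: {share:.3f}")
```

Because the assignment depends only on the user ID and the experiment name, it needs no lookup table and stays stable across sessions.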

Sample Size Calculation

Running an experiment with too few users produces inconclusive results. The minimum sample size depends on the baseline conversion rate, the minimum detectable effect, and the desired statistical power.

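import math
from scipy import stats

def min_sample_size(baseline_rate, mde, alpha=0.05, power=0.8):
    """Per-variant sample size for a two-sided test of two proportions.

    mde is the relative lift to detect (0.10 means a 10% relative improvement).
    """
    z_alpha = stats.norm.ppf(1 - alpha / 2)  # critical value for the significance level
    z_beta = stats.norm.ppf(power)           # quantile for the desired power
    # Average of the control and treatment rates (the parentheses matter here)
    p_pooled = (baseline_rate + baseline_rate * (1 + mde)) / 2
    # Denominator is the squared absolute difference between the two rates
    n = (2 * p_pooled * (1 - p_pooled) * (z_alpha + z_beta) ** 2) / (baseline_rate * mde) ** 2
    return math.ceil(n)

baseline = 0.05  # 5% CTR
effect = 0.10    # 10% relative improvement

n = min_sample_size(baseline, effect)
print(f"Minimum sample size per variant: {n}")
print(f"Total users needed: {n * 2}")

Expected output:

Minimum sample size per variant: 31235
Total users needed: 62470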

Running the A/B Test

Once the experiment is running, collect metrics from both variants and perform a statistical test. For binary metrics (clicked or not), use a two-proportion z-test. For continuous metrics, use a t-test.

import numpy as np
from statsmodels.stats.proportion import proportions_ztest  # not part of scipy.stats

np.random.seed(42)
n_users = 16000  # users per variant in this simulation

# Simulate one click / no-click outcome per user
control_clicks = np.random.binomial(1, 0.050, n_users)
treatment_clicks = np.random.binomial(1, 0.056, n_users)

control_rate = control_clicks.mean()
treatment_rate = treatment_clicks.mean()
lift = (treatment_rate - control_rate) / control_rate

# Treatment goes first so that a positive z-statistic means treatment beat control
counts = np.array([treatment_clicks.sum(), control_clicks.sum()])
nobs = np.array([n_users, n_users])

z_stat, p_value = proportions_ztest(counts, nobs)
print(f"Control CTR: {control_rate:.4f}")
print(f"Treatment CTR: {treatment_rate:.4f}")
print(f"Relative lift: {lift:.2%}")
print(f"Z-statistic: {z_stat:.4f}")
print(f"P-value: {p_value:.4f}")
print(f"Significant at 95%: {p_value < 0.05}")

Example output (illustrative; exact values depend on the simulated draw):

Control CTR: 0.0501
Treatment CTR: 0.0551
Relative lift: 9.98%
Z-statistic: 2.0029
P-value: 0.0452
Significant at 95%: True
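
For reference, the pooled two-proportion z-statistic behind this kind of test can be written as follows. The notation is an assumption introduced here: x_1 and x_2 are the click counts of the first and second group passed to the test, and n_1 and n_2 are their sample sizes.

```latex
\hat{p}_1 = \frac{x_1}{n_1}, \qquad
\hat{p}_2 = \frac{x_2}{n_2}, \qquad
\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}

z = \frac{\hat{p}_1 - \hat{p}_2}
         {\sqrt{\hat{p}\,(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}
```

The two-sided p-value is the probability that a standard normal variable exceeds |z| in absolute value.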

Multiple Metric Correction

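Testing multiple metrics inflates the false positive rate. If you test 20 metrics at a 95% confidence level and none of them truly changed, you should expect about one false positive on average, and for independent metrics the chance of at least one is roughly 64% (1 - 0.95^20). Use Bonferroni or Benjamini-Hochberg correction to keep this in check.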

from statsmodels.stats.multitest import multipletests

p_values = np.array([0.0439, 0.2100, 0.0032, 0.4500, 0.6200, 0.0100])

bonf_reject, bonf_corrected, _, _ = multipletests(p_values, method="bonferroni")
bh_reject, bh_corrected, _, _ = multipletests(p_values, method="fdr_bh")

for i, (p, b_rej, bh_rej) in enumerate(zip(p_values, bonf_reject, bh_reject)):
    print(f"Metric {i+1}: p={p:.4f}, Bonferroni reject={b_rej}, BH reject={bh_rej}")

Expected output:

Metric 1: p=0.0439, Bonferroni reject=False, BH reject=False
Metric 2: p=0.2100, Bonferroni reject=False, BH reject=False
Metric 3: p=0.0032, Bonferroni reject=True, BH reject=True
Metric 4: p=0.4500, Bonferroni reject=False, BH reject=False
Metric 5: p=0.6200, Bonferroni reject=False, BH reject=False
Metric 6: p=0.0100, Bonferroni reject=False, BH reject=True
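
To see how the two corrections reach their decisions, the procedures can be written out by hand. This is a minimal sketch using only the standard library and the same six p-values. The helper name `benjamini_hochberg` is introduced here for illustration and is not a statsmodels function.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where the hypothesis is rejected."""
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= (k / m) * alpha
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    # Reject every hypothesis ranked at or below that k
    reject = [False] * m
    for idx in order[:max_rank]:
        reject[idx] = True
    return reject

p_values = [0.0439, 0.2100, 0.0032, 0.4500, 0.6200, 0.0100]
m = len(p_values)

# Bonferroni compares every p-value against alpha / m
bonferroni = [p <= 0.05 / m for p in p_values]
bh = benjamini_hochberg(p_values)

for i, p in enumerate(p_values):
    print(f"Metric {i+1}: p={p:.4f}, Bonferroni reject={bonferroni[i]}, BH reject={bh[i]}")
```

With six tests, the Bonferroni threshold is 0.05 / 6 ≈ 0.0083, so only p=0.0032 passes. Benjamini-Hochberg compares the third-smallest p-value (0.0439) against 3/6 × 0.05 = 0.025, which it fails, so only the two smallest p-values are rejected.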

Interpreting Results

Statistical significance does not equal practical significance. A result may be statistically significant but the effect size too small to justify the deployment risk. Always report confidence intervals alongside p-values.

from statsmodels.stats.proportion import proportion_confint

control_count = control_clicks.sum()
treatment_count = treatment_clicks.sum()

# Normal-approximation intervals for each variant's CTR
ci_control = proportion_confint(control_count, n_users, alpha=0.05)
ci_treatment = proportion_confint(treatment_count, n_users, alpha=0.05)

# Unpooled standard error of the difference in proportions, computed once
diff = treatment_rate - control_rate
se_diff = np.sqrt(
    treatment_rate * (1 - treatment_rate) / n_users
    + control_rate * (1 - control_rate) / n_users
)
z_95 = 1.96  # two-sided 95% critical value
ci_diff_low = diff - z_95 * se_diff
ci_diff_high = diff + z_95 * se_diff

print(f"Control 95% CI: [{ci_control[0]:.4f}, {ci_control[1]:.4f}]")
print(f"Treatment 95% CI: [{ci_treatment[0]:.4f}, {ci_treatment[1]:.4f}]")
print(f"Difference 95% CI: [{ci_diff_low:.4f}, {ci_diff_high:.4f}]")

Example output (illustrative; exact values depend on the simulated draw):

Control 95% CI: [0.0467, 0.0535]
Treatment 95% CI: [0.0516, 0.0587]
Difference 95% CI: [0.0001, 0.0099]
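
One way to turn a confidence interval into a deployment decision is to compare it against a minimum practically relevant effect. The sketch below is only an illustration of that idea. The function name `rollout_decision`, the four outcome labels, and the 0.0025 threshold are assumptions, not values from this guide.

```python
def rollout_decision(ci_low, ci_high, min_practical_effect):
    """Classify a test from the CI of the absolute difference (treatment - control)."""
    if ci_low >= min_practical_effect:
        return "deploy"         # the whole interval clears the practical bar
    if ci_high < 0:
        return "roll back"      # treatment is credibly worse than control
    if ci_high < min_practical_effect:
        return "do not deploy"  # any gain is too small to matter
    return "extend test"        # interval straddles the bar, so it is inconclusive

# Hypothetical intervals for the absolute CTR difference
print(rollout_decision(0.0010, 0.0090, min_practical_effect=0.0025))   # extend test
print(rollout_decision(0.0030, 0.0110, min_practical_effect=0.0025))   # deploy
print(rollout_decision(-0.0080, -0.0010, min_practical_effect=0.0025)) # roll back
```

The first case is statistically significant (the interval excludes zero) yet still inconclusive in practical terms, which is exactly the gap between statistical and practical significance.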

Common A/B Testing Mistakes

Mistake	Why It Happens	How to Fix
Peeking	Checking results repeatedly and stopping early	Use sequential testing or fix sample size a priori
Sample ratio mismatch	Traffic split differs from expected	Monitor split daily; investigate deviations > 1%
Novelty effect	Users behave differently with new model initially	Run experiments for 2+ full business cycles
Network effects	Treatment affects control group indirectly	Use cluster randomization for social features
Multiple comparisons	Testing many metrics without correction	Apply Bonferroni or Benjamini-Hochberg correction

Practice Questions

What is the minimum detectable effect and why does it matter?

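Answer: The MDE is the smallest true improvement the experiment is powered to detect at the chosen significance level and power. A smaller MDE requires more users: the required sample size grows roughly with the inverse square of the MDE, so halving it about quadruples the sample. Choosing the MDE requires balancing business impact against the cost of running longer experiments.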
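
To get a feel for how strongly the MDE drives sample size, you can sweep a few values. This sketch uses only the standard library and the usual normal-approximation formula for two proportions. The function name `users_per_variant` and the chosen MDE values are assumptions for illustration.

```python
import math
from statistics import NormalDist

def users_per_variant(baseline_rate, mde, alpha=0.05, power=0.8):
    """Approximate per-variant sample size for a relative lift of `mde`."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    # Average of the control and treatment rates
    p_avg = (baseline_rate + baseline_rate * (1 + mde)) / 2
    n = 2 * p_avg * (1 - p_avg) * (z_alpha + z_beta) ** 2 / (baseline_rate * mde) ** 2
    return math.ceil(n)

# 5% baseline CTR with relative MDEs of 5%, 10%, and 20%
sizes = {mde: users_per_variant(0.05, mde) for mde in (0.05, 0.10, 0.20)}
for mde, n in sizes.items():
    print(f"MDE {mde:.0%}: {n:,} users per variant")
```

Each halving of the MDE multiplies the requirement by roughly four, which is why very small effects are expensive to detect.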

Why should you avoid peeking at results?

Answer: Peeking and stopping early inflates false positive rates because traditional p-values assume a fixed sample size. Each look increases the chance of stopping on a random fluctuation. Use sequential testing if early stopping is required.
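
A small A/A simulation makes the inflation visible. Both arms below share the same true rate, so every "significant" result is a false positive. This is a sketch under stated assumptions: the number of looks, the batch size, and the helper names `z_stat` and `simulate` are all hypothetical choices.

```python
import math
import random

def z_stat(c_a, c_b, n):
    """Pooled two-proportion z-statistic for two groups of equal size n."""
    p_pool = (c_a + c_b) / (2 * n)
    se = math.sqrt(p_pool * (1 - p_pool) * 2 / n)
    return 0.0 if se == 0 else (c_b / n - c_a / n) / se

def simulate(n_sims=300, looks=10, users_per_look=200, rate=0.05, z_crit=1.96, seed=0):
    rng = random.Random(seed)
    peeking_hits = fixed_hits = 0
    for _ in range(n_sims):
        c_a = c_b = n = 0
        crossed = False
        for _ in range(looks):
            # Add one batch of users to each arm, then peek at the running result
            c_a += sum(rng.random() < rate for _ in range(users_per_look))
            c_b += sum(rng.random() < rate for _ in range(users_per_look))
            n += users_per_look
            if abs(z_stat(c_a, c_b, n)) > z_crit:
                crossed = True  # a peeker would have stopped and declared a winner
        peeking_hits += crossed
        # The fixed-horizon analysis looks only once, at the final sample size
        fixed_hits += abs(z_stat(c_a, c_b, n)) > z_crit
    return peeking_hits / n_sims, fixed_hits / n_sims

peek_rate, fixed_rate = simulate()
print(f"False positive rate with peeking: {peek_rate:.1%}")
print(f"False positive rate with one final look: {fixed_rate:.1%}")
```

In runs like this, the peeking rate tends to land well above the nominal 5%, while the single final look stays close to it.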

What is the novelty effect in ML A/B tests?

Answer: Users may interact more with a new model simply because it is different, not better. This artificially inflates early metrics. Run experiments long enough for the novelty to wear off, typically 1-2 weeks.

How does sample ratio mismatch invalidate results?

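Answer: If the observed traffic split deviates from the intended ratio (for example 50/50) by more than chance would explain, users are being assigned or logged differently across variants. That points to a systematic bias in assignment or data collection, so the groups are no longer comparable and any measured difference cannot be trusted.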
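
A common way to check for sample ratio mismatch is a chi-square goodness-of-fit test on the group sizes. The sketch below uses only the standard library. The function name `srm_p_value`, the example counts, and the 0.001 alert threshold are assumptions for illustration.

```python
import math

def srm_p_value(n_control, n_treatment, expected_treatment_share=0.5):
    """P-value of a chi-square test (1 degree of freedom) on the observed split."""
    total = n_control + n_treatment
    exp_treatment = total * expected_treatment_share
    exp_control = total - exp_treatment
    chi2 = ((n_control - exp_control) ** 2 / exp_control
            + (n_treatment - exp_treatment) ** 2 / exp_treatment)
    # Survival function of a chi-square variable with 1 degree of freedom
    return math.erfc(math.sqrt(chi2 / 2))

balanced = srm_p_value(16_000, 16_100)  # small imbalance, plausible by chance
skewed = srm_p_value(16_000, 16_800)    # larger imbalance
print(f"Balanced split p-value: {balanced:.4f}")
print(f"Skewed split p-value: {skewed:.6f}")
print(f"Flag skewed split as SRM: {skewed < 0.001}")
```

A strict threshold such as 0.001 is used here because the check is typically run repeatedly, and a false alarm halts an otherwise valid experiment.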

When should you use Bonferroni vs Benjamini-Hochberg?

Answer: Bonferroni is more conservative and controls the family-wise error rate. BH controls the false discovery rate and is less conservative, making it preferable when testing many metrics and you can tolerate a small proportion of false positives.

Challenge

Design and run a simulated A/B test comparing two recommendation algorithms. Generate 100K users with a 6% CTR baseline. Configure the treatment to have a 6.6% CTR (10% lift). Calculate sample size, run the test, apply correction if testing 5 metrics, and report results with confidence intervals.

Real-World Task

Design an A/B testing framework for a news recommendation model at Doda Browser. Define the primary metric (CTR), secondary metrics (time-on-page, bounce rate, article diversity), calculate required sample size, specify experiment duration, and document the decision criteria for rollout, rollback, or extended testing.

FAQ

What confidence level should I use for ML A/B tests?

95% is standard. For high-risk decisions (financial models), use 99%. For exploratory tests, 90% may be acceptable. Always pre-register the confidence level before starting.

How long should an A/B test run?

At minimum, until the required sample size is reached. Run for at least one full business cycle (7 days for most web apps) to capture day-of-week effects. Avoid running during holidays or special events.
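
The two rules above (reach the sample size, cover full business cycles) can be combined into a simple duration estimate. This is a rough sketch. The function name `experiment_days` and the traffic figures are hypothetical.

```python
import math

def experiment_days(users_per_variant, n_variants, daily_eligible_users, cycle_days=7):
    """Days needed to reach the sample size, rounded up to whole business cycles."""
    total_needed = users_per_variant * n_variants
    days_for_sample = math.ceil(total_needed / daily_eligible_users)
    # Round up to whole cycles so every weekday is represented equally
    cycles = max(1, math.ceil(days_for_sample / cycle_days))
    return cycles * cycle_days

# Hypothetical: 20,000 users per variant, 2 variants, 5,000 new eligible users per day
print(experiment_days(20_000, 2, 5_000))   # 14
# High-traffic case: the sample fills in a day, but one full week is still the floor
print(experiment_days(20_000, 2, 50_000))  # 7
```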

Can I run A/B tests with non-binary metrics?

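Yes. Use t-tests for continuous metrics (revenue, time-on-page), preferably Welch's variant when the two groups may have different variances, and Mann-Whitney U for heavily skewed or otherwise non-normal distributions. Keep in mind that Mann-Whitney U compares the distributions as a whole rather than the means. Always check distribution assumptions before selecting the test.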
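
As a sketch of both tests on a continuous metric, the example below simulates right-skewed time-on-page data. The log-normal parameters, sample sizes, and seed are assumptions chosen only to produce a skewed shape. `ttest_ind` and `mannwhitneyu` are the SciPy functions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Hypothetical time-on-page in seconds; log-normal data is right-skewed
control = rng.lognormal(mean=3.00, sigma=1.0, size=5_000)
treatment = rng.lognormal(mean=3.05, sigma=1.0, size=5_000)

# Welch's t-test compares means without assuming equal variances
t_stat, t_p = stats.ttest_ind(treatment, control, equal_var=False)

# Mann-Whitney U is rank-based and does not rely on normality
u_stat, u_p = stats.mannwhitneyu(treatment, control, alternative="two-sided")

print(f"Mean control: {control.mean():.2f}s, mean treatment: {treatment.mean():.2f}s")
print(f"Welch t-test p-value: {t_p:.4f}")
print(f"Mann-Whitney U p-value: {u_p:.4f}")
```

The two tests answer slightly different questions, so decide before the experiment starts which one backs the primary metric.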

Next Steps

Deepen your experiment design knowledge with statistics fundamentals. Explore Python for data analysis and pandas for experiment data manipulation.

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
