
A/B Testing & Experimentation for Developers -- Statistical Methods & Implementation

DodaTech Updated 2026-06-22 5 min read

This tutorial covers A/B testing and experimentation for developers: the key statistical concepts, practical code examples, and best practices for applying them.

A/B testing compares a variant of a web page or feature against a control version, using statistical hypothesis testing to determine which one performs better on a predefined metric.

What You'll Learn

In this tutorial, you will learn how to design, implement, and analyze A/B tests for web applications, including sample size calculation, randomization, statistical significance testing, and avoiding common pitfalls like peeking and multiple comparison bias.

Why It Matters

Without experiments, product decisions are opinions. A/B testing replaces guesswork with data. A change that seems obviously better often performs worse, and a change that seems trivial can move conversion by double digits. Companies that run continuous experiments innovate faster and waste less engineering effort.

Real-World Use

Doda Browser runs A/B tests on every UI change. A test on the download button color increased click-through rates by 12% with p < 0.001. Another test on the onboarding flow revealed that removing a single step increased activation by 23%, a result that contradicted the design team's intuition.

Experiment Design Flow

flowchart LR
    A[Form Hypothesis] --> B[Define Metric]
    B --> C[Calculate Sample Size]
    C --> D[Randomize Users]
    D --> E[Run Experiment]
    E --> F{Significance Reached?}
    F -->|Yes| G[Implement Winner]
    F -->|No| H[Analyze Why]
    H --> A

Sample Size Calculation

Determine the minimum sample size before starting your experiment:

from scipy import stats
import math

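def minimum_sample_size(baseline_rate, minimum_detectable_effect, alpha=0.05, power=0.80):
    """Per-variant sample size for a two-sided test of two proportions.

    minimum_detectable_effect is relative (0.10 means a 10% relative lift).
    """
    z_alpha = stats.norm.ppf(1 - alpha / 2)  # critical value for a two-sided alpha
    z_beta = stats.norm.ppf(power)           # quantile for the desired power
    p1 = baseline_rate
    p2 = baseline_rate * (1 + minimum_detectable_effect)  # expected variant rate
    p_pooled = (p1 + p2) / 2

    # Numerator: (null-hypothesis term + alternative-hypothesis term) squared.
    # Denominator: squared absolute difference between the two rates.
    n = (
        (z_alpha * math.sqrt(2 * p_pooled * (1 - p_pooled)) +
         z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
        ** 2
    ) / ((p2 - p1) ** 2)

    return math.ceil(n)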

# Example: current signup rate 5%, want to detect 10% relative improvement
n = minimum_sample_size(0.05, 0.10)
print(f"Minimum sample size per variant: {n}")

Expected output:

Minimum sample size per variant: 31234
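For reference, the calculation in `minimum_sample_size` can be written as a single formula. This is a transcription of the code above rather than an independent derivation; the symbols $\bar{p}$ and $\text{MDE}$ are notation introduced here for `p_pooled` and `minimum_detectable_effect`, and $z_q$ denotes the standard normal quantile at probability $q$.

```latex
n = \frac{\left( z_{1-\alpha/2}\,\sqrt{2\,\bar{p}\,(1-\bar{p})}
      + z_{\text{power}}\,\sqrt{p_1(1-p_1) + p_2(1-p_2)} \right)^2}
     {(p_2 - p_1)^2},
\qquad \bar{p} = \frac{p_1 + p_2}{2},
\quad p_2 = p_1\,(1 + \text{MDE})
```

Because the denominator is the squared absolute difference between the two rates, halving the minimum detectable effect roughly quadruples the required sample size.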

Server-Side Assignment

Randomize users consistently with deterministic assignment:

// Consistent assignment using user ID hash
function assignVariant(userId, experimentName, variants = ["control", "variant"]) {
  const hash = hashCode(`${experimentName}:${userId}`);
  const index = Math.abs(hash) % variants.length;
  return variants[index];
}

function hashCode(str) {
  let hash = 0;
  for (let i = 0; i < str.length; i++) {
    const char = str.charCodeAt(i);
    hash = ((hash << 5) - hash) + char;
    hash |= 0;
  }
  return hash;
}

// Assign user to experiment
const userId = "user_abc123";
const variant = assignVariant(userId, "signup-button-test");
console.log(`User ${userId} assigned to: ${variant}`);

Expected output:

User user_abc123 assigned to: control
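One caveat with the 31-multiplier string hash above: because 31 is odd, the parity of the final hash equals the parity of the sum of the character codes. With two variants, changing the experiment name therefore either flips every user's assignment or leaves all of them unchanged, so two concurrent two-variant experiments end up perfectly correlated. A minimal sketch of one alternative, assuming a cryptographic hash such as SHA-256 is acceptable for your workload; the function name `assign_variant` and the user IDs are illustrative, not part of the original code:

```python
import hashlib

def assign_variant(user_id: str, experiment_name: str,
                   variants=("control", "variant")) -> str:
    """Deterministically map a user to a variant using SHA-256."""
    key = f"{experiment_name}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Interpret the first 8 bytes as an unsigned integer, then bucket it.
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]

# Hypothetical user IDs, used only to illustrate the split.
users = [f"user_{i}" for i in range(10_000)]
counts = {"control": 0, "variant": 0}
for user in users:
    counts[assign_variant(user, "signup-button-test")] += 1
print(counts)
```

The same user always receives the same variant for a given experiment name, and the split across many users should land close to 50/50.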

Statistical Significance Testing

Analyze experiment results with a chi-squared test:

from scipy.stats import chi2_contingency
import numpy as np

# Experiment results
# control: 3150 conversions out of 62943 visitors
# variant: 3462 conversions out of 62943 visitors
observations = np.array([
    [3150, 62943 - 3150],  # control: converted, not converted
    [3462, 62943 - 3462],  # variant: converted, not converted
])

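# chi2_contingency applies Yates' continuity correction to 2x2 tables by default
# (correction=True); pass correction=False for the uncorrected Pearson statistic.
chi2, p_value, dof, expected = chi2_contingency(observations)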
control_rate = 3150 / 62943
variant_rate = 3462 / 62943
relative_improvement = (variant_rate - control_rate) / control_rate * 100

print(f"Control conversion rate: {control_rate:.4f} ({control_rate*100:.2f}%)")
print(f"Variant conversion rate: {variant_rate:.4f} ({variant_rate*100:.2f}%)")
print(f"Relative improvement: {relative_improvement:.2f}%")
print(f"Chi-squared: {chi2:.4f}")
print(f"P-value: {p_value:.6f}")
print(f"Statistically significant: {p_value < 0.05}")

Expected output:

Control conversion rate: 0.0500 (5.00%)
Variant conversion rate: 0.0550 (5.50%)
Relative improvement: 9.90%
Chi-squared: 15.4390
P-value: 0.000085
Statistically significant: True
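A p-value says whether a difference is detectable, not how large it is likely to be. As a complement, the sketch below computes a Wald confidence interval for the absolute difference in conversion rates, using the same counts as the experiment above. It assumes the normal approximation is adequate at this sample size; the function name `diff_confidence_interval` is illustrative.

```python
import math
from statistics import NormalDist

def diff_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Wald confidence interval for the difference in conversion rates (b - a)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    diff = p_b - p_a
    # Unpooled standard error of the difference between two proportions.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return diff, diff - z * se, diff + z * se

diff, lower, upper = diff_confidence_interval(3150, 62943, 3462, 62943)
print(f"Absolute lift: {diff:.4%} (95% CI: {lower:.4%} to {upper:.4%})")
```

If the interval excludes zero, the result agrees with a significant two-sided test at the matching alpha, and the interval's width shows how precisely the lift has been measured.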

Tool Comparison

Feature | Google Optimize (discontinued 2023) | PostHog | VWO | LaunchDarkly
Experiment type | A/B, redirect | Feature flags + A/B | A/B, multivariate | Feature flags + A/B
Statistical engine | Bayesian | Frequentist | Bayesian | Frequentist
Client-side SDK | Yes | Yes | Yes | Yes (client- and server-side)
Self-hostable | No | Yes | No | No
Free tier | 5 experiments free | 1M events/mo | Limited trial | 25K flags/mo
Audience targeting | URL, device, custom | Properties, cohorts | URL, device, custom | User segments

Common Errors

1. Peeking at Results Before the Sample Size Is Reached

Checking significance every day and stopping when p < 0.05 inflates the false positive rate dramatically. Pre-register your sample size and do not peek.
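The inflation can be seen in a small A/A simulation, where both arms share the same true conversion rate and every "significant" result is therefore a false positive. This is a sketch under stated assumptions: the conversion rate, number of looks, users per look, and run count are arbitrary illustrative values, and the test used is a pooled two-proportion z-test rather than the chi-squared test from the earlier section.

```python
import math
import random
from statistics import NormalDist

def z_test_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def simulate_aa(rng, rate=0.05, users_per_look=500, looks=10, alpha=0.05):
    """Run one A/A test; report significance at any look and at the final look."""
    conv_a = conv_b = n = 0
    significant_at_any_look = False
    for _ in range(looks):
        conv_a += sum(rng.random() < rate for _ in range(users_per_look))
        conv_b += sum(rng.random() < rate for _ in range(users_per_look))
        n += users_per_look
        p = z_test_p_value(conv_a, n, conv_b, n)
        significant_at_any_look = significant_at_any_look or p < alpha
    return significant_at_any_look, p < alpha

rng = random.Random(42)
runs = 400
results = [simulate_aa(rng) for _ in range(runs)]
peeking_rate = sum(any_look for any_look, _ in results) / runs
fixed_rate = sum(final for _, final in results) / runs
print(f"False positive rate with peeking: {peeking_rate:.3f}")
print(f"False positive rate at fixed sample size: {fixed_rate:.3f}")
```

Stopping at the first significant look can only add false positives on top of the fixed-sample rate, since the final look is one of the looks being checked.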

2. Multiple Comparison Bias

Testing 10 independent metrics with alpha = 0.05 gives roughly a 40% chance (1 - 0.95^10 ≈ 0.40) of at least one false positive. Apply a Bonferroni correction or use a single composite primary metric.
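The arithmetic behind that figure, and the effect of a Bonferroni correction, can be checked in a few lines. The sketch assumes the metrics are independent, which real metrics rarely are; the function names are illustrative.

```python
def family_wise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** num_tests

def bonferroni_alpha(num_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / num_tests

uncorrected = family_wise_error_rate(10)
per_test = bonferroni_alpha(10)
corrected = family_wise_error_rate(10, per_test)
print(f"Uncorrected family-wise error rate: {uncorrected:.3f}")  # 0.401
print(f"Bonferroni per-test alpha: {per_test:.4f}")              # 0.0050
print(f"Corrected family-wise error rate: {corrected:.3f}")      # 0.049
```

Bonferroni keeps the family-wise rate at or below the original alpha, at the cost of requiring stronger evidence for each individual metric.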

3. Novelty Effect

Users behave differently with new features at first. Run experiments for at least one full business cycle (1-2 weeks) to let the novelty effect wear off.

4. Interaction Between Concurrent Experiments

If users are in two experiments simultaneously, the variants interact and confound results. Use mutually exclusive experiment layers or segment users per experiment.
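One way to sketch mutually exclusive layers is to hash each user into a fixed number of buckets and give each experiment a disjoint range of them. Everything named here (`LAYER`, `layer-1`, `checkout-test`, `pricing-test`, the 100-bucket split) is a hypothetical illustration, not a description of any particular tool.

```python
import hashlib

def bucket(user_id: str, salt: str, buckets: int = 100) -> int:
    """Stable bucket number in [0, buckets) for a user within a layer."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets

# Hypothetical layer: each experiment owns a disjoint range of buckets.
LAYER = {
    "checkout-test": range(0, 50),
    "pricing-test": range(50, 100),
}

def experiment_for(user_id: str, layer_name: str = "layer-1"):
    """Return the single experiment in this layer that the user belongs to."""
    user_bucket = bucket(user_id, layer_name)
    for experiment, owned_buckets in LAYER.items():
        if user_bucket in owned_buckets:
            return experiment
    return None  # bucket not claimed by any experiment

for user in ("user_1", "user_2", "user_3"):
    print(user, "->", experiment_for(user))
```

Within the chosen experiment, the variant can then be assigned with a separate hash salted by the experiment name, so that layer membership and variant assignment stay independent.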

5. Simpson's Paradox in Aggregate Metrics

A variant can win within both the mobile and desktop segments yet lose in the aggregate when the traffic mix differs between the variants, for example if the variant receives a larger share of low-converting mobile traffic. Always segment by device, traffic source, and user type.
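A small numeric example makes the reversal concrete. The counts below are hypothetical, chosen only so that the variant leads in each segment while trailing overall because most of its traffic is mobile.

```python
# Hypothetical counts chosen to illustrate the paradox: (conversions, visitors).
data = {
    "control": {"mobile": (20, 1000), "desktop": (720, 9000)},
    "variant": {"mobile": (225, 9000), "desktop": (90, 1000)},
}

def rate(conversions, visitors):
    """Conversion rate for one segment."""
    return conversions / visitors

def aggregate_rate(segments):
    """Conversion rate across all segments combined."""
    total_conversions = sum(c for c, _ in segments.values())
    total_visitors = sum(v for _, v in segments.values())
    return total_conversions / total_visitors

for segment in ("mobile", "desktop"):
    control = rate(*data["control"][segment])
    variant = rate(*data["variant"][segment])
    print(f"{segment}: control {control:.2%}, variant {variant:.2%}")

print(f"aggregate: control {aggregate_rate(data['control']):.2%}, "
      f"variant {aggregate_rate(data['variant']):.2%}")
# mobile: control 2.00%, variant 2.50%
# desktop: control 8.00%, variant 9.00%
# aggregate: control 7.40%, variant 3.15%
```

In a properly randomized experiment the segment mix should be nearly identical across variants, so an imbalance this large usually points to an assignment or tracking bug worth investigating first.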

Practice Questions

1. What is the minimum sample size and why does it matter? The minimum sample size is the number of users needed per variant to detect the minimum detectable effect at the chosen significance level and power. Running with fewer users leaves the test underpowered, so real effects are likely to be missed.

2. How does a chi-squared test work for A/B testing? The chi-squared test compares observed conversion counts against the counts expected under the null hypothesis (no difference between variants). A high chi-squared value with a low p-value indicates the observed difference is unlikely to be due to chance alone.

3. What is the novelty effect in experiments? The novelty effect is a temporary behavior change caused by the newness of a feature, not its actual utility. Users may click a new button just because it is new, not because it is better.

4. Why should you pre-register experiment parameters? Pre-registration prevents p-hacking and researcher degrees of freedom. Declaring the sample size, primary metric, and analysis plan in advance ensures the results are trustworthy.

5. Challenge: Design an A/B test for a checkout flow improvement that increases purchase completion rates. Calculate the required sample size assuming a 3% baseline conversion and 15% minimum detectable effect. Implement the server-side assignment, run a simulated experiment with synthetic data, and analyze the results with a chi-squared test. Write a decision memo recommending whether to ship the change.

Mini Project

Build a complete experimentation framework for a SaaS signup flow. Include deterministic user assignment, event tracking for signup completion, automated statistical analysis with chi-squared testing, and a results dashboard that shows sample size progress, current conversion rates by variant, and a stop signal once the pre-registered sample size is reached (stopping at the first significant result would reintroduce the peeking problem). Test the framework by simulating 100K users with a 10% relative conversion lift in the variant and verify that the framework correctly identifies the winner.

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
