Statistical Hypothesis Testing Guide with Python

DodaTech 3 min read

In this tutorial, you will learn statistical hypothesis testing in Python: how to run t-tests, ANOVA, and chi-square tests with SciPy and statsmodels, how to interpret p-values, and how to avoid common statistical pitfalls.

What You'll Learn

Formulate null and alternative hypotheses, choose the correct statistical test, verify assumptions, compute test statistics and p-values, and draw data-driven conclusions.

Why It Matters

Hypothesis testing provides a rigorous framework for making decisions with data. Instead of guessing whether a difference is real, you use statistical evidence to judge whether an observed effect is statistically significant or plausibly due to random chance.

Real-World Use

A product team at a SaaS company runs an A/B test on a new signup flow. They use a two-sample t-test to determine whether the conversion rate difference between control and variant groups is statistically significant before rolling out the change.
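Because each conversion is a yes/no outcome, conversion rates are also often compared with a two-proportion z-test. The sketch below is a minimal illustration of that alternative, computed by hand with scipy.stats.norm. The visitor and conversion counts are hypothetical and not taken from any real experiment.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical A/B test counts (illustrative only)
control_conversions, control_visitors = 100, 1000
variant_conversions, variant_visitors = 150, 1000

p_control = control_conversions / control_visitors
p_variant = variant_conversions / variant_visitors

# Pooled proportion under H0: both groups share one conversion rate
p_pooled = (control_conversions + variant_conversions) / (control_visitors + variant_visitors)
std_error = np.sqrt(p_pooled * (1 - p_pooled) * (1 / control_visitors + 1 / variant_visitors))

z_stat = (p_variant - p_control) / std_error
p_value = 2 * norm.sf(abs(z_stat))  # two-sided p-value

print(f"z-statistic: {z_stat:.4f}")
print(f"p-value: {p_value:.4f}")
```

With large samples this z-test and a t-test on the per-user 0/1 outcomes usually lead to the same conclusion.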

Hypothesis Testing Workflow

flowchart TD
  A[Define H0 and H1] --> B[Choose Significance Level]
  B --> C[Select Test]
  C --> D{Assumptions Met?}
  D -->|Yes| E[Compute Test Statistic]
  D -->|No| F[Use Non-Parametric Alternative]
  E --> G[Calculate p-value]
  G --> H{p < alpha?}
  H -->|Yes| I[Reject H0]
  H -->|No| J[Fail to Reject H0]
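The workflow above can be sketched in code for the two-group case. This is a minimal sketch, not a definitive procedure: compare_two_groups is a hypothetical helper, the simulated samples are made up, and choosing a test from preliminary Shapiro-Wilk and Levene checks is only one of several reasonable ways to handle the "Assumptions Met?" step.

```python
import numpy as np
from scipy import stats

def compare_two_groups(a, b, alpha=0.05):
    """Hypothetical helper: pick a two-sample test based on assumption checks."""
    # Shapiro-Wilk: H0 is that the sample comes from a normal distribution
    normal = all(stats.shapiro(x).pvalue > alpha for x in (a, b))
    # Levene: H0 is that the groups have equal variances
    equal_var = bool(stats.levene(a, b).pvalue > alpha)

    if normal:
        # Student's t-test if variances look equal, Welch's t-test otherwise
        test_name = "Student's t-test" if equal_var else "Welch's t-test"
        result = stats.ttest_ind(a, b, equal_var=equal_var)
    else:
        # Non-parametric alternative when normality is doubtful
        test_name = "Mann-Whitney U"
        result = stats.mannwhitneyu(a, b, alternative="two-sided")

    decision = "Reject H0" if result.pvalue < alpha else "Fail to reject H0"
    return test_name, float(result.statistic), float(result.pvalue), decision

rng = np.random.default_rng(0)
result_normal = compare_two_groups(rng.normal(50, 10, 80), rng.normal(55, 10, 80))
result_skewed = compare_two_groups(rng.exponential(1.0, 80), rng.exponential(1.5, 80))
print(result_normal)
print(result_skewed)
```

The strongly skewed exponential samples should fail the normality check and be routed to the non-parametric branch.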

Two-Sample T-Test

import numpy as np
from scipy import stats

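# Simulate two groups whose true means differ by 4 (same standard deviation)
np.random.seed(42)
control = np.random.normal(loc=50, scale=10, size=100)
variant = np.random.normal(loc=54, scale=10, size=100)

# Two-sided Student's t-test; ttest_ind assumes equal variances by default
t_stat, p_value = stats.ttest_ind(control, variant)
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.4f}")

alpha = 0.05  # significance level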
if p_value < alpha:
    print("Reject H0: significant difference between groups")
else:
    print("Fail to reject H0: no significant difference")

Output (illustrative; the exact values you get may differ):

t-statistic: -2.8431
p-value: 0.0049
Reject H0: significant difference between groups
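By default, stats.ttest_ind pools the two sample variances (Student's t-test). When the groups have clearly different spreads or sizes, passing equal_var=False runs Welch's t-test instead. The following sketch compares the two on simulated data; the group names, sizes, and parameters are assumptions chosen only for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Hypothetical groups with different spreads and sizes
small_noisy = rng.normal(loc=50, scale=20, size=30)
large_tight = rng.normal(loc=54, scale=5, size=120)

# Pooled-variance Student's t-test (the default)
student = stats.ttest_ind(small_noisy, large_tight)
# Welch's t-test: does not assume equal variances
welch = stats.ttest_ind(small_noisy, large_tight, equal_var=False)

print(f"Student: t={student.statistic:.4f}, p={student.pvalue:.4f}")
print(f"Welch:   t={welch.statistic:.4f}, p={welch.pvalue:.4f}")
```

The two results can differ noticeably in this setting, which is why Welch's version is often preferred when equal variances cannot be justified.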

One-Way ANOVA

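import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Three groups of 30 with different true means and a common standard deviation
group_a = np.random.normal(60, 8, 30)
group_b = np.random.normal(65, 8, 30)
group_c = np.random.normal(55, 8, 30)

# One-way ANOVA: H0 is that all group means are equal
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F-statistic: {f_stat:.4f}")
print(f"p-value: {p_value:.4f}")

# Tukey HSD post-hoc test shows which pairs of groups differ
data = np.concatenate([group_a, group_b, group_c])
groups = ["A"] * 30 + ["B"] * 30 + ["C"] * 30
tukey = pairwise_tukeyhsd(data, groups, alpha=0.05)
print(tukey)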

Output:

F-statistic: 8.2341
p-value: 0.0005

Multiple Comparison of Means - Tukey HSD
=============================================
group1 group2 meandiff  p-adj   lower   upper
     A      B     5.12  0.034    0.34    9.90
     A      C    -4.87  0.042   -9.65   -0.09
     B      C    -9.99  0.001  -14.77   -5.21
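To see where the F-statistic comes from, it can help to compute it by hand as the ratio of between-group to within-group mean squares. The sketch below does this on freshly simulated groups (so its numbers will not match the output above) and checks the result against stats.f_oneway.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
groups = [rng.normal(60, 8, 30), rng.normal(65, 8, 30), rng.normal(55, 8, 30)]

k = len(groups)                          # number of groups
n_total = sum(len(g) for g in groups)    # total number of observations
grand_mean = np.concatenate(groups).mean()

# Between-group and within-group sums of squares
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

f_scipy, p_scipy = stats.f_oneway(*groups)
print(f"manual F: {f_manual:.4f}, SciPy F: {f_scipy:.4f}")
print(f"manual p: {p_manual:.6f}, SciPy p: {p_scipy:.6f}")
```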

Chi-Square Test for Independence

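import numpy as np
from scipy.stats import chi2_contingency

# 2x2 contingency table: rows are groups, columns are outcomes
observed = np.array([
    [45, 35],
    [30, 50],
])

# For 2x2 tables (1 degree of freedom), chi2_contingency applies
# Yates' continuity correction by default
chi2, p_value, dof, expected = chi2_contingency(observed)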
print(f"Chi-square: {chi2:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"Degrees of freedom: {dof}")
print("Expected frequencies:")
print(expected)

Output:

Chi-square: 4.9192
p-value: 0.0266
Degrees of freedom: 1
Expected frequencies:
[[37.5 42.5]
 [37.5 42.5]]
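The expected frequencies come from the row and column totals: each cell is (row total × column total) / grand total. As a minimal sketch, the same table can be worked by hand and compared with chi2_contingency. Here correction=False turns off Yates' continuity correction so that SciPy's statistic matches the plain chi-square formula.

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[45, 35], [30, 50]])

# Expected count per cell = row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected_manual = row_totals * col_totals / observed.sum()

# Plain (uncorrected) chi-square statistic
chi2_uncorrected = ((observed - expected_manual) ** 2 / expected_manual).sum()

chi2_scipy, p_scipy, dof, expected_scipy = chi2_contingency(observed, correction=False)

print(expected_manual)
print(f"manual chi-square: {chi2_uncorrected:.4f}, SciPy: {chi2_scipy:.4f}")
```

For this table both rows have the same total (80), so both rows of expected counts are identical: 37.5 and 42.5.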

Practice Questions

  1. What is the difference between a one-tailed and two-tailed test, and when would you use each?
  2. Why must you check normality and equal variance assumptions before running a t-test?
  3. What does a p-value of 0.03 mean, and how should it be interpreted?

Answers:

  1. A one-tailed test checks for an effect in one direction (greater or less). A two-tailed test checks for any difference regardless of direction. Use a one-tailed test only when a directional hypothesis was specified before looking at the data; otherwise use a two-tailed test as the default.
  2. Student's t-test assumes that the data in each group are approximately normally distributed and that the groups have equal variances. If these assumptions are violated, the test statistic and p-value can become unreliable. Welch's t-test handles unequal variances, and a non-parametric alternative such as Mann-Whitney U can be used when normality fails.
  3. A p-value of 0.03 means that, if the null hypothesis were true, there would be a 3 percent probability of observing data at least as extreme as the data actually observed. It does not mean there is a 3 percent chance that the null hypothesis is true.
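The one-tailed versus two-tailed distinction from answer 1 can be sketched with the alternative parameter of stats.ttest_ind (available in SciPy 1.6 and later). The simulated groups below are assumptions for illustration only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
control = rng.normal(50, 10, 100)
variant = rng.normal(53, 10, 100)

# Two-tailed: H1 is that the means differ in either direction
two_sided = stats.ttest_ind(variant, control, alternative="two-sided")
# One-tailed: H1 is that the variant mean is greater than the control mean
one_sided = stats.ttest_ind(variant, control, alternative="greater")

print(f"two-tailed p-value: {two_sided.pvalue:.4f}")
print(f"one-tailed p-value: {one_sided.pvalue:.4f}")
```

When the observed difference lies in the hypothesized direction, the one-tailed p-value is half the two-tailed one, which is why the direction has to be fixed before the data are seen.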

Challenge

Load the Iris dataset. Test whether sepal length differs significantly between setosa and versicolor species (t-test). Then test whether all three species differ in petal width (ANOVA with Tukey post-hoc). Report your conclusions with test statistics and p-values.

FAQs

What is the difference between statistical significance and practical significance?

Statistical significance means the observed effect is unlikely to be due to chance alone. Practical significance means the effect is large enough to matter in the real world. A very small effect can be statistically significant with a large sample but have no practical value.
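One way to see this gap is to report an effect size such as Cohen's d next to the p-value. The sketch below uses simulated data with a deliberately tiny true difference and a very large sample; the sample size and parameters are assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
n = 500_000
a = rng.normal(100.0, 10.0, n)
b = rng.normal(100.2, 10.0, n)   # true difference is only 0.02 standard deviations

t_stat, p_value = stats.ttest_ind(a, b)

# Cohen's d: mean difference divided by the pooled standard deviation
# (this simple average of variances is valid because both groups have the same size)
pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
cohens_d = (b.mean() - a.mean()) / pooled_sd

print(f"p-value: {p_value:.2e}")
print(f"Cohen's d: {cohens_d:.4f}")
```

The p-value here should be far below 0.05 while Cohen's d stays around 0.02, well under the 0.2 that is conventionally labeled a small effect.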

How do I choose between parametric and non-parametric tests?

Use parametric tests (t-test, ANOVA) when the data are approximately normally distributed and the other test assumptions are met. Use non-parametric alternatives (Mann-Whitney U, Kruskal-Wallis) when assumptions are violated, sample sizes are very small, or data is ordinal.
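As a minimal sketch of the non-parametric route, the example below applies Kruskal-Wallis and Mann-Whitney U to ordinal ratings. The 1-5 satisfaction scores and the plan names are hypothetical data made up for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical 1-5 satisfaction ratings (ordinal) for three plans
basic = np.array([2, 3, 3, 1, 2, 4, 3, 2, 3, 2])
plus = np.array([3, 4, 3, 4, 5, 3, 4, 4, 3, 4])
pro = np.array([4, 5, 5, 4, 5, 3, 5, 4, 5, 5])

# Kruskal-Wallis: rank-based counterpart of one-way ANOVA
h_stat, p_kw = stats.kruskal(basic, plus, pro)
# Mann-Whitney U: rank-based counterpart of the two-sample t-test
u_stat, p_mw = stats.mannwhitneyu(basic, pro, alternative="two-sided")

print(f"Kruskal-Wallis H: {h_stat:.4f}, p-value: {p_kw:.4f}")
print(f"Mann-Whitney U:   {u_stat:.1f}, p-value: {p_mw:.4f}")
```

Both tests work on ranks rather than raw values, so they do not require normally distributed data.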

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
