Root Cause Analysis in Testing — Systematic Defect Investigation

DodaTech Updated 2026-06-24 8 min read

This tutorial covers root cause analysis in testing: the core techniques, worked code examples for each, and the mistakes that most often derail an investigation.

Root cause analysis (RCA) is a systematic process for identifying the fundamental source of a defect or failure, distinguishing the true cause from surface-level symptoms, and implementing corrective actions that prevent recurrence.

What You'll Learn

You'll learn three systematic RCA techniques (the 5 Whys, fishbone (Ishikawa) diagrams, and fault tree analysis), see how they compare with failure mode and effects analysis (FMEA) and barrier analysis, and learn how to integrate them into your testing workflow for lasting quality improvements.

Why It Matters

Fixing symptoms without finding root causes leaves the same defect free to reappear. A production outage that took six hours to fix is likely to happen again unless the underlying cause is addressed. DodaTech's browser team reduced recurring bugs by 70% after adopting a formal RCA process: every critical defect now requires a documented root cause and corrective action before closure.

Real-World Use

A payment processing system suffered intermittent transaction failures affecting 2% of users. Initial investigation blamed "network issues." A 5 Whys analysis revealed the true root cause: a race condition in the database connection pool that only triggered under high concurrency with specific transaction types. The fix was a single configuration change. Without RCA, the team would have added more network monitoring instead of solving the real problem.

RCA Process Flow

flowchart TD
  A[Defect Reported] --> B[Collect Data]
  B --> C[Describe Problem]
  C --> D[Identify Causal Factors]
  D --> E{Apply RCA Technique}
  E --> F[5 Whys]
  E --> G[Fishbone]
  E --> H[Fault Tree]
  F --> I[Root Cause Identified]
  G --> I
  H --> I
  I --> J[Define Corrective Actions]
  J --> K[Implement Fix]
  K --> L[Verify Fix]
  L --> M[Document & Share]
  style I fill:#e74c3c,color:#fff
  style K fill:#2ecc71,color:#fff

The 5 Whys Technique

Ask "why" repeatedly until the root cause emerges. The first answers are symptoms; the final answer is the root cause.

// rca-5whys.js
class FiveWhys {
  constructor(problem, maxDepth = 5) {
    this.problem = problem;
    this.maxDepth = maxDepth;
    // Level 0 holds the problem statement itself; it does not count as a "why".
    this.chain = [{ level: 0, question: 'Why?', answer: problem }];
  }

  // Number of "why" answers recorded so far (excludes the level-0 problem entry).
  get depth() {
    return this.chain.length - 1;
  }

  addAnswer(why, answer) {
    if (this.depth >= this.maxDepth) {
      console.log('Max depth reached. Root cause may need investigation.');
      return;
    }
    this.chain.push({
      level: this.chain.length,
      question: `Why? Because ${why}`,
      answer,
    });
  }

  // The deepest answer in the chain is treated as the root cause.
  getRootCause() {
    return this.chain[this.chain.length - 1].answer;
  }

  printAnalysis() {
    console.log('=== 5 Whys Analysis ===\n');
    console.log(`Problem: ${this.problem}\n`);
    this.chain.forEach((item, i) => {
      console.log(`${i}. ${item.question}`);
      console.log(`   Answer: ${item.answer}\n`);
    });
    console.log(`Root Cause: ${this.getRootCause()}`);
    console.log(`Depth: ${this.depth} whys`);
  }
}

const rca = new FiveWhys('Build pipeline fails intermittently on PR merges');
rca.addAnswer('Test step times out', 'Integration tests take longer than 10 minutes');
rca.addAnswer('New tests were added last sprint', 'Team added 50 E2E tests without reviewing execution time');
rca.addAnswer('No test timeout budget', 'The test strategy doesn\'t define execution time limits');
rca.addAnswer('No test review process', 'PRs don\'t require test performance review');
rca.addAnswer('Team prioritizes feature velocity over test quality', 'Management measures features shipped, not test stability');
rca.printAnalysis();

Expected output:

=== 5 Whys Analysis ===

Problem: Build pipeline fails intermittently on PR merges

0. Why?
   Answer: Build pipeline fails intermittently on PR merges

1. Why? Because Test step times out
   Answer: Integration tests take longer than 10 minutes

2. Why? Because New tests were added last sprint
   Answer: Team added 50 E2E tests without reviewing execution time

3. Why? Because No test timeout budget
   Answer: The test strategy doesn't define execution time limits

4. Why? Because No test review process
   Answer: PRs don't require test performance review

5. Why? Because Team prioritizes feature velocity over test quality
   Answer: Management measures features shipped, not test stability

Root Cause: Management measures features shipped, not test stability
Depth: 5 whys

Fishbone (Ishikawa) Diagram

The fishbone diagram organizes potential causes into categories (People, Process, Technology, Environment, Measurements, and Data in the example below), helping teams brainstorm systematically rather than jumping to conclusions.

// fishbone-analyzer.js
class FishboneDiagram {
  constructor(problem) {
    this.problem = problem;
    this.categories = {
      People: [],
      Process: [],
      Technology: [],
      Environment: [],
      Measurements: [],
      Data: [],
    };
  }

  addCause(category, cause, subcauses = []) {
    if (!this.categories[category]) {
      console.log(`Unknown category: ${category}`);
      return;
    }
    this.categories[category].push({ cause, subcauses });
  }

  analyze() {
    console.log(`=== Fishbone: ${this.problem} ===\n`);
    let totalCauses = 0;
    Object.entries(this.categories).forEach(([category, causes]) => {
      if (causes.length > 0) {
        console.log(`[${category}]`);
        causes.forEach(({ cause, subcauses }) => {
          console.log(`  > ${cause}`);
          subcauses.forEach(sc => console.log(`    - ${sc}`));
          totalCauses++;
        });
        console.log();
      }
    });
    console.log(`Total causal factors: ${totalCauses}`);
    console.log('Categories with causes: ' +
      Object.values(this.categories).filter(c => c.length > 0).length);
  }
}

const fishbone = new FishboneDiagram('Mobile app crashes on checkout');

fishbone.addCause('People', 'Developer unfamiliar with payment SDK', [
  'No onboarding for third-party SDKs',
  'Payment module owned by single developer',
]);
fishbone.addCause('Process', 'No integration tests for payment flow', [
  'QA relies on manual testing only',
  'Payment sandbox not available in CI',
]);
fishbone.addCause('Technology', 'Kotlin version mismatch', [
  'SDK requires Kotlin 1.9, project uses 1.8',
  'Build tool does not detect version incompatibility',
]);
fishbone.addCause('Environment', 'Test device runs Android 12, crash on Android 13', [
  'Device lab has limited Android versions',
  'Emulator performance is unreliable',
]);
fishbone.addCause('Measurements', 'Crash rate not tracked per screen', [
  'Crashlytics aggregated by app version only',
  'No alert threshold for checkout crashes',
]);

fishbone.analyze();

Expected output:

=== Fishbone: Mobile app crashes on checkout ===

[People]
  > Developer unfamiliar with payment SDK
    - No onboarding for third-party SDKs
    - Payment module owned by single developer

[Process]
  > No integration tests for payment flow
    - QA relies on manual testing only
    - Payment sandbox not available in CI

[Technology]
  > Kotlin version mismatch
    - SDK requires Kotlin 1.9, project uses 1.8
    - Build tool does not detect version incompatibility

[Environment]
  > Test device runs Android 12, crash on Android 13
    - Device lab has limited Android versions
    - Emulator performance is unreliable

[Measurements]
  > Crash rate not tracked per screen
    - Crashlytics aggregated by app version only
    - No alert threshold for checkout crashes

Total causal factors: 5
Categories with causes: 5

Fault Tree Analysis

Fault tree analysis uses Boolean logic (AND/OR gates) to model combinations of events that lead to a failure. It's especially useful for safety-critical and complex multi-factor failures.

import math


class FaultTreeNode:
    def __init__(self, name, gate=None):
        self.name = name
        self.gate = gate  # "AND", "OR", or None for a basic (leaf) event
        self.children = []

    def add_child(self, child):
        self.children.append(child)
        return self  # return the parent so calls can be chained

    def probability(self, base_probs):
        # Basic events take their probability from the lookup table.
        if not self.children:
            return base_probs.get(self.name, 0.0)

        # Both gate formulas assume the child events are independent.
        child_probs = [c.probability(base_probs) for c in self.children]
        if self.gate == "OR":
            return 1 - math.prod(1 - p for p in child_probs)
        if self.gate == "AND":
            return math.prod(child_probs)
        raise ValueError(f"Unknown gate {self.gate!r} on node {self.name!r}")

    def print_structure(self, indent=0):
        prefix = "  " * indent
        gate_str = f" [{self.gate}]" if self.gate else ""
        print(f"{prefix}{self.name}{gate_str}")
        for child in self.children:
            child.print_structure(indent + 1)

top = FaultTreeNode("System Outage", "OR")
infra = FaultTreeNode("Infrastructure Failure", "OR")
app = FaultTreeNode("Application Failure", "OR")
db = FaultTreeNode("Database Failure", "AND")

# All three subsystems feed the top event, including the database branch.
top.add_child(infra).add_child(app).add_child(db)
infra.add_child(FaultTreeNode("Network Down")).add_child(FaultTreeNode("Power Failure"))
app.add_child(FaultTreeNode("Memory Leak")).add_child(FaultTreeNode("Unhandled Exception"))
db.add_child(FaultTreeNode("Connection Pool Exhausted")).add_child(FaultTreeNode("Primary Replica Lag"))

base_probs = {
    "Network Down": 0.001,
    "Power Failure": 0.0005,
    "Memory Leak": 0.01,
    "Unhandled Exception": 0.005,
    "Connection Pool Exhausted": 0.02,
    "Primary Replica Lag": 0.015,
}

print("Fault Tree Structure:")
top.print_structure()
print(f"\nTop Event Probability: {top.probability(base_probs):.5f}")
print(f"Infrastructure failure: {infra.probability(base_probs):.5f}")
print(f"Application failure: {app.probability(base_probs):.5f}")
print(f"Database failure (co-occurrence): {db.probability(base_probs):.5f}")

Expected output:

Fault Tree Structure:
System Outage [OR]
  Infrastructure Failure [OR]
    Network Down
    Power Failure
  Application Failure [OR]
    Memory Leak
    Unhandled Exception
  Database Failure [AND]
    Connection Pool Exhausted
    Primary Replica Lag

Top Event Probability: 0.01672
Infrastructure failure: 0.00150
Application failure: 0.01495
Database failure (co-occurrence): 0.00030
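
A fault tree can also hint at where to investigate first. The sketch below is one possible approach, not part of the listing above: it recomputes the top-event probability with each basic event switched off in turn and ranks the events by how much the probability drops. It assumes all basic events are independent, encodes the tree as nested tuples instead of the FaultTreeNode class, and attaches Database Failure under System Outage as in the structure listing. The names TREE, probability, and rank_contributors are hypothetical.

```python
import math

# Hypothetical compact encoding: a basic event is a string,
# a gate is a tuple (name, "AND" | "OR", [children]).
TREE = ("System Outage", "OR", [
    ("Infrastructure Failure", "OR", ["Network Down", "Power Failure"]),
    ("Application Failure", "OR", ["Memory Leak", "Unhandled Exception"]),
    ("Database Failure", "AND", ["Connection Pool Exhausted", "Primary Replica Lag"]),
])

# Same base probabilities as the example in this section.
BASE_PROBS = {
    "Network Down": 0.001,
    "Power Failure": 0.0005,
    "Memory Leak": 0.01,
    "Unhandled Exception": 0.005,
    "Connection Pool Exhausted": 0.02,
    "Primary Replica Lag": 0.015,
}


def probability(node, probs):
    """Probability of a node, assuming independent basic events."""
    if isinstance(node, str):
        return probs[node]
    _name, gate, children = node
    child_probs = [probability(child, probs) for child in children]
    if gate == "OR":
        return 1 - math.prod(1 - p for p in child_probs)
    if gate == "AND":
        return math.prod(child_probs)
    raise ValueError(f"unknown gate: {gate}")


def rank_contributors(tree, probs):
    """Rank basic events by the drop in top-event probability
    when that single event is made impossible (probability 0)."""
    baseline = probability(tree, probs)
    drops = {
        event: baseline - probability(tree, {**probs, event: 0.0})
        for event in probs
    }
    return sorted(drops.items(), key=lambda item: item[1], reverse=True)


baseline = probability(TREE, BASE_PROBS)
ranking = rank_contributors(TREE, BASE_PROBS)

print(f"Top event probability: {baseline:.5f}")
for event, drop in ranking:
    print(f"{event:<28} reduces top event by {drop:.6f}")
```

With these inputs, Memory Leak ranks first and the two database events rank last, because the AND gate only fires when both of them occur together. The ranking is only as good as the base probabilities, which in practice are estimates.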

RCA Technique Comparison

Technique	Best For	Effort	Output
5 Whys	Simple, linear causal chains	Low	Single root cause
Fishbone	Brainstorming, multiple categories	Medium	Cause categories
Fault Tree	Complex, multi-factor failures	High	Boolean failure model
FMEA	Preventive, risk-prioritized	High	Risk priority numbers
Barrier Analysis	Safety incidents, human factors	Medium	Missing controls
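
The "risk priority numbers" in the FMEA row are conventionally computed by multiplying three ratings for each failure mode: severity, occurrence, and detection, each commonly scored from 1 to 10. The sketch below shows that arithmetic under those assumptions. The FailureMode class, the failure modes, and every rating in it are invented for illustration and do not come from a real assessment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    severity: int    # 1 (negligible impact) .. 10 (catastrophic)
    occurrence: int  # 1 (rare) .. 10 (almost certain)
    detection: int   # 1 (almost always caught) .. 10 (almost never caught)

    def __post_init__(self):
        # Reject ratings outside the assumed 1..10 scale.
        for field_name in ("severity", "occurrence", "detection"):
            value = getattr(self, field_name)
            if not 1 <= value <= 10:
                raise ValueError(f"{field_name} must be 1..10, got {value}")

    @property
    def rpn(self) -> int:
        """Risk priority number: severity x occurrence x detection."""
        return self.severity * self.occurrence * self.detection


# Hypothetical ratings, for illustration only.
modes = [
    FailureMode("Payment SDK version mismatch", severity=9, occurrence=3, detection=7),
    FailureMode("Checkout crash not tracked per screen", severity=6, occurrence=5, detection=8),
    FailureMode("Slow E2E suite causes pipeline timeout", severity=5, occurrence=6, detection=4),
]

# Highest RPN first: these are the failure modes to address before the others.
ranked = sorted(modes, key=lambda mode: mode.rpn, reverse=True)
for mode in ranked:
    print(f"{mode.rpn:>4}  {mode.name}")
```

Because the three ratings are subjective, an RPN is best read as a relative ordering within one assessment rather than as an absolute measure of risk.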

Common Errors and Mistakes

Mistake	Why It Happens	How to Fix
Stopping at symptoms	First answer seems sufficient	Keep asking "why" until the answer is a cause you can act on; five rounds is a guideline, not a quota
Blaming individuals	Easier than finding systemic cause	Focus on process, not people
Too narrow scope	Only considering technical causes	Use fishbone to explore all categories
Confirmation bias	Finding evidence for initial theory	Involve team members with different views
No corrective action	Analysis without follow-up	Require action item for every root cause

Practice Questions

What is the difference between a symptom and a root cause?

Answer: A symptom is an observable effect of the problem (build fails, app crashes). A root cause is the fundamental reason that, when fixed, prevents recurrence.

When would you use a fishbone diagram instead of 5 Whys?

Answer: When the problem could have multiple contributing causes across different categories (people, process, technology, environment), requiring systematic brainstorming.

What is an AND gate in fault tree analysis?

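Answer: An AND gate means the parent event occurs only if ALL child events occur together. If the child events are independent, the parent's probability is the product of the child probabilities, so it is never higher than the least likely child.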

How do you ensure RCA findings lead to action?

Answer: Document corrective actions with owners and deadlines, track them in the same system as the original defect, and verify effectiveness after implementation.

What is the most common RCA mistake?

Answer: Stopping at the first plausible cause instead of digging deeper through successive "why" questions.

Challenge

Analyze a real production incident from your team using all three RCA techniques. Start with 5 Whys to find a causal chain, then use fishbone to identify contributing factors across categories, then model the incident as a fault tree. Compare the results — do all three techniques converge on the same root cause?

Real-World Task

Implement an RCA program for your QA team. Define which defects require formal RCA (e.g., all Sev-1 and Sev-2, plus recurring Sev-3). Create an RCA template with sections for problem description, technique used, causal chain, root cause, corrective actions, and verification. Train the team on all three techniques. Track the recurrence rate of analyzed defects over three months.
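
As a starting point for the template, the sketch below models an RCA record with the sections listed in the task, a check that blocks closure until the record is complete, the example policy for which defects need formal RCA, and a recurrence-rate calculation. All class names, function names, field names, and sample values (such as the defect ID and owner) are hypothetical; adapt them to whatever your defect tracker actually stores.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class CorrectiveAction:
    description: str
    owner: str
    due: date
    done: bool = False


@dataclass
class RcaRecord:
    defect_id: str
    severity: int                 # 1 = Sev-1, 2 = Sev-2, ...
    problem_description: str
    technique: str = ""           # e.g. "5 Whys", "Fishbone", "Fault Tree"
    causal_chain: list = field(default_factory=list)
    root_cause: str = ""
    corrective_actions: list = field(default_factory=list)
    verification: str = ""
    recurred: bool = False        # set later if the same defect reappears

    def closure_blockers(self):
        """Return the reasons this record is not yet ready to close."""
        blockers = []
        if not self.root_cause.strip():
            blockers.append("root cause missing")
        if not self.corrective_actions:
            blockers.append("no corrective action defined")
        elif any(not action.done for action in self.corrective_actions):
            blockers.append("corrective actions still open")
        if not self.verification.strip():
            blockers.append("verification missing")
        return blockers


def requires_formal_rca(severity, is_recurring):
    """Example policy from the task: all Sev-1 and Sev-2, plus recurring Sev-3."""
    return severity <= 2 or (severity == 3 and is_recurring)


def recurrence_rate(records):
    """Fraction of analyzed defects that later reappeared."""
    if not records:
        return 0.0
    return sum(1 for record in records if record.recurred) / len(records)


# Hypothetical sample record.
record = RcaRecord(
    defect_id="DEF-001",
    severity=1,
    problem_description="Build pipeline fails intermittently on PR merges",
    technique="5 Whys",
)
print(record.closure_blockers())

record.root_cause = "Management measures features shipped, not test stability"
record.corrective_actions.append(
    CorrectiveAction("Add an execution time budget to the test strategy",
                     owner="qa-lead", due=date(2026, 7, 31), done=True)
)
record.verification = "No pipeline timeouts observed after the change"
print(record.closure_blockers())
print(f"Recurrence rate: {recurrence_rate([record]):.0%}")
```

Keeping the closure check in code, rather than in a checklist, makes it possible to enforce it automatically when a defect is moved to a closed state.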

Next Steps

Now that you can perform root cause analysis, apply these techniques in Shift-Left Testing to catch defects even earlier, and integrate RCA into your Test Strategy documentation.

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
