Root Cause Analysis in Testing — Systematic Defect Investigation
In this tutorial, you'll learn about Root Cause Analysis in Testing. We cover key concepts, practical examples, and best practices.
Root cause analysis (RCA) is a systematic process for identifying the fundamental source of a defect or failure, distinguishing the true cause from surface-level symptoms, and implementing corrective actions that prevent recurrence.
What You'll Learn
You'll learn three systematic RCA techniques: the 5 Whys, fishbone (Ishikawa) diagrams, and fault tree analysis. You'll also see how they compare with failure mode and effects analysis (FMEA) and how to integrate them into your testing workflow for lasting quality improvements.
Why It Matters
Fixing symptoms without finding root causes leaves the underlying defect in place, so the same failure tends to reappear. A production outage that took six hours to fix can happen again unless the underlying cause is addressed. DodaTech's browser team reduced recurring bugs by 70% after adopting a formal RCA process: every critical defect now requires a documented root cause and corrective action before closure.
Real-World Use
A payment processing system suffered intermittent transaction failures affecting 2% of users. Initial investigation blamed "network issues." A 5 Whys analysis revealed the true root cause: a race condition in the database connection pool that only triggered under high concurrency with specific transaction types. The fix was a single configuration change. Without RCA, the team would have added more network monitoring instead of solving the real problem.
RCA Process Flow
flowchart TD
A[Defect Reported] --> B[Collect Data]
B --> C[Describe Problem]
C --> D[Identify Causal Factors]
D --> E{Apply RCA Technique}
E --> F[5 Whys]
E --> G[Fishbone]
E --> H[Fault Tree]
F --> I[Root Cause Identified]
G --> I
H --> I
I --> J[Define Corrective Actions]
J --> K[Implement Fix]
K --> L[Verify Fix]
L --> M[Document & Share]
style I fill:#e74c3c,color:#fff
style K fill:#2ecc71,color:#fff
The 5 Whys Technique
Ask "why" repeatedly until the root cause emerges. The first answers are symptoms; the final answer is the root cause.
// rca-5whys.js
class FiveWhys {
  constructor(problem, maxDepth = 5) {
    this.problem = problem;
    this.maxDepth = maxDepth;
    // Level 0 holds the problem statement; levels 1..maxDepth hold the whys.
    this.chain = [{ level: 0, question: 'Why?', answer: problem }];
  }

  addAnswer(why, answer) {
    // The chain includes the level-0 entry, so it may hold maxDepth + 1
    // items before the limit is reached.
    if (this.chain.length > this.maxDepth) {
      console.log('Max depth reached. Root cause may need investigation.');
      return;
    }
    this.chain.push({
      level: this.chain.length,
      question: `Why? Because ${why}`,
      answer,
    });
  }

  getRootCause() {
    // The deepest answer recorded so far is treated as the root cause.
    return this.chain[this.chain.length - 1].answer;
  }

  printAnalysis() {
    console.log('=== 5 Whys Analysis ===\n');
    console.log(`Problem: ${this.problem}\n`);
    this.chain.forEach((item, i) => {
      console.log(`${i}. ${item.question}`);
      console.log(`   Answer: ${item.answer}\n`);
    });
    console.log(`Root Cause: ${this.getRootCause()}`);
    // Exclude the level-0 problem entry from the count.
    console.log(`Depth: ${this.chain.length - 1} whys`);
  }
}

const rca = new FiveWhys('Build pipeline fails intermittently on PR merges');
rca.addAnswer('Test step times out', 'Integration tests take longer than 10 minutes');
rca.addAnswer('New tests were added last sprint', 'Team added 50 E2E tests without reviewing execution time');
rca.addAnswer('No test timeout budget', 'The test strategy doesn\'t define execution time limits');
rca.addAnswer('No test review process', 'PRs don\'t require test performance review');
rca.addAnswer('Team prioritizes feature velocity over test quality', 'Management measures features shipped, not test stability');
rca.printAnalysis();
Expected output:
=== 5 Whys Analysis ===

Problem: Build pipeline fails intermittently on PR merges

0. Why?
   Answer: Build pipeline fails intermittently on PR merges

1. Why? Because Test step times out
   Answer: Integration tests take longer than 10 minutes

2. Why? Because New tests were added last sprint
   Answer: Team added 50 E2E tests without reviewing execution time

3. Why? Because No test timeout budget
   Answer: The test strategy doesn't define execution time limits

4. Why? Because No test review process
   Answer: PRs don't require test performance review

5. Why? Because Team prioritizes feature velocity over test quality
   Answer: Management measures features shipped, not test stability

Root Cause: Management measures features shipped, not test stability
Depth: 5 whys
Fishbone (Ishikawa) Diagram
The fishbone diagram organizes potential causes into categories, such as People, Process, Technology, Environment, Measurements, and Data, so that teams brainstorm systematically rather than jumping to conclusions.
// fishbone-analyzer.js
class FishboneDiagram {
  constructor(problem) {
    this.problem = problem;
    this.categories = {
      People: [],
      Process: [],
      Technology: [],
      Environment: [],
      Measurements: [],
      Data: [],
    };
  }

  addCause(category, cause, subcauses = []) {
    if (!this.categories[category]) {
      console.log(`Unknown category: ${category}`);
      return;
    }
    this.categories[category].push({ cause, subcauses });
  }

  analyze() {
    console.log(`=== Fishbone: ${this.problem} ===\n`);
    let totalCauses = 0;
    Object.entries(this.categories).forEach(([category, causes]) => {
      // Skip categories with no recorded causes.
      if (causes.length > 0) {
        console.log(`[${category}]`);
        causes.forEach(({ cause, subcauses }) => {
          console.log(`  > ${cause}`);
          subcauses.forEach(sc => console.log(`    - ${sc}`));
          totalCauses++;
        });
        console.log();
      }
    });
    console.log(`Total causal factors: ${totalCauses}`);
    console.log('Categories with causes: ' +
      Object.values(this.categories).filter(c => c.length > 0).length);
  }
}

const fishbone = new FishboneDiagram('Mobile app crashes on checkout');
fishbone.addCause('People', 'Developer unfamiliar with payment SDK', [
  'No onboarding for third-party SDKs',
  'Payment module owned by single developer',
]);
fishbone.addCause('Process', 'No integration tests for payment flow', [
  'QA relies on manual testing only',
  'Payment sandbox not available in CI',
]);
fishbone.addCause('Technology', 'Kotlin version mismatch', [
  'SDK requires Kotlin 1.9, project uses 1.8',
  'Build tool does not detect version incompatibility',
]);
fishbone.addCause('Environment', 'Test device runs Android 12, crash on Android 13', [
  'Device lab has limited Android versions',
  'Emulator performance is unreliable',
]);
fishbone.addCause('Measurements', 'Crash rate not tracked per screen', [
  'Crashlytics aggregated by app version only',
  'No alert threshold for checkout crashes',
]);
fishbone.analyze();
Expected output:
=== Fishbone: Mobile app crashes on checkout ===

[People]
  > Developer unfamiliar with payment SDK
    - No onboarding for third-party SDKs
    - Payment module owned by single developer

[Process]
  > No integration tests for payment flow
    - QA relies on manual testing only
    - Payment sandbox not available in CI

[Technology]
  > Kotlin version mismatch
    - SDK requires Kotlin 1.9, project uses 1.8
    - Build tool does not detect version incompatibility

[Environment]
  > Test device runs Android 12, crash on Android 13
    - Device lab has limited Android versions
    - Emulator performance is unreliable

[Measurements]
  > Crash rate not tracked per screen
    - Crashlytics aggregated by app version only
    - No alert threshold for checkout crashes

Total causal factors: 5
Categories with causes: 5
Fault Tree Analysis
Fault tree analysis uses Boolean logic (AND/OR gates) to model combinations of events that lead to a failure. It's especially useful for safety-critical and complex multi-factor failures.
from math import prod


class FaultTreeNode:
    def __init__(self, name, gate=None):
        self.name = name
        self.gate = gate  # "AND", "OR", or None for a basic event
        self.children = []

    def add_child(self, child):
        self.children.append(child)
        return self  # returning self allows chained add_child calls

    def probability(self, base_probs):
        # Both gate formulas assume the child events are independent.
        if not self.children:
            return base_probs.get(self.name, 0.0)
        child_probs = [c.probability(base_probs) for c in self.children]
        if self.gate == "OR":
            # At least one child occurs: 1 - P(no child occurs)
            return 1 - prod(1 - p for p in child_probs)
        elif self.gate == "AND":
            # All children occur together
            return prod(child_probs)
        return 0.0

    def print_structure(self, indent=0):
        prefix = "  " * indent
        gate_str = f" [{self.gate}]" if self.gate else ""
        print(f"{prefix}{self.name}{gate_str}")
        for child in self.children:
            child.print_structure(indent + 1)


top = FaultTreeNode("System Outage", "OR")
infra = FaultTreeNode("Infrastructure Failure", "OR")
app = FaultTreeNode("Application Failure", "OR")
db = FaultTreeNode("Database Failure", "AND")

# All three branches feed the top event, including the database branch.
top.add_child(infra).add_child(app).add_child(db)
infra.add_child(FaultTreeNode("Network Down")).add_child(FaultTreeNode("Power Failure"))
app.add_child(FaultTreeNode("Memory Leak")).add_child(FaultTreeNode("Unhandled Exception"))
db.add_child(FaultTreeNode("Connection Pool Exhausted")).add_child(FaultTreeNode("Primary Replica Lag"))

base_probs = {
    "Network Down": 0.001,
    "Power Failure": 0.0005,
    "Memory Leak": 0.01,
    "Unhandled Exception": 0.005,
    "Connection Pool Exhausted": 0.02,
    "Primary Replica Lag": 0.015,
}

print("Fault Tree Structure:")
top.print_structure()
print(f"\nTop Event Probability: {top.probability(base_probs):.5f}")
print(f"Infrastructure failure: {infra.probability(base_probs):.5f}")
print(f"Application failure: {app.probability(base_probs):.5f}")
print(f"Database failure (co-occurrence): {db.probability(base_probs):.5f}")
Expected output:
Fault Tree Structure:
System Outage [OR]
  Infrastructure Failure [OR]
    Network Down
    Power Failure
  Application Failure [OR]
    Memory Leak
    Unhandled Exception
  Database Failure [AND]
    Connection Pool Exhausted
    Primary Replica Lag

Top Event Probability: 0.01672
Infrastructure failure: 0.00150
Application failure: 0.01495
Database failure (co-occurrence): 0.00030
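The gate arithmetic behind two of these figures can be written out explicitly. This is a sketch that assumes the basic events are statistically independent, which the example itself does not verify; here p_i denotes the probability of the i-th child event of a gate with n children.

```latex
\begin{align*}
P(\text{OR gate})  &= 1 - \prod_{i=1}^{n} (1 - p_i) \\
P(\text{AND gate}) &= \prod_{i=1}^{n} p_i \\[6pt]
P(\text{Infrastructure Failure}) &= 1 - (1 - 0.001)(1 - 0.0005) = 0.0014995 \\
P(\text{Database Failure})       &= 0.02 \times 0.015 = 0.0003
\end{align*}
```

If the child events are positively correlated (for example, pool exhaustion and replica lag both driven by the same load spike), the plain product underestimates the AND-gate probability, so treat these numbers as a lower-bound style estimate for such branches.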
RCA Technique Comparison
| Technique | Best For | Effort | Output |
|---|---|---|---|
| 5 Whys | Simple, linear causal chains | Low | Single root cause |
| Fishbone | Brainstorming, multiple categories | Medium | Cause categories |
| Fault Tree | Complex, multi-factor failures | High | Boolean failure model |
| FMEA | Preventive, risk-prioritized | High | Risk priority numbers |
| Barrier Analysis | Safety incidents, human factors | Medium | Missing controls |
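The "risk priority numbers" in the FMEA row are conventionally computed as severity × occurrence × detection, with each factor rated on a 1 to 10 scale. The Python sketch below illustrates that calculation; the `FailureMode` class, the `rank_by_risk` helper, and the example ratings are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    """One FMEA row; each rating uses the conventional 1-10 scale."""
    name: str
    severity: int    # 10 = most severe effect
    occurrence: int  # 10 = most frequent
    detection: int   # 10 = hardest to detect before release

    def __post_init__(self):
        for field_name in ("severity", "occurrence", "detection"):
            value = getattr(self, field_name)
            if not 1 <= value <= 10:
                raise ValueError(f"{field_name} must be between 1 and 10, got {value}")

    @property
    def rpn(self) -> int:
        """Risk priority number: severity x occurrence x detection."""
        return self.severity * self.occurrence * self.detection


def rank_by_risk(modes):
    """Return failure modes sorted from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)


# Hypothetical ratings for a checkout flow.
modes = [
    FailureMode("Payment SDK version mismatch", severity=9, occurrence=3, detection=7),
    FailureMode("Checkout button misaligned", severity=2, occurrence=6, detection=2),
    FailureMode("Session expires mid-payment", severity=7, occurrence=4, detection=5),
]

for mode in rank_by_risk(modes):
    print(f"{mode.rpn:4d}  {mode.name}")
```

With these ratings the SDK mismatch ranks first (RPN 189), ahead of the session expiry (140) and the layout issue (24). Unlike the other techniques in the table, this ranking is done before a failure occurs, which is why the table lists FMEA as preventive.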
Common Errors and Mistakes
| Mistake | Why It Happens | How to Fix |
|---|---|---|
| Stopping at symptoms | First answer seems sufficient | Keep asking "why" until you reach a cause you can fix at the process level; five rounds is a guideline, not a quota |
| Blaming individuals | Easier than finding systemic cause | Focus on process, not people |
| Too narrow scope | Only considering technical causes | Use fishbone to explore all categories |
| Confirmation bias | Finding evidence for initial theory | Involve team members with different views |
| No corrective action | Analysis without follow-up | Require action item for every root cause |
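The last row, requiring an action item for every root cause, can be enforced mechanically at defect closure. The sketch below shows one possible closure check in Python; the `RcaRecord` and `CorrectiveAction` classes, the defect ID, the owner, and the date are all hypothetical, and a real tracker would expose its own fields.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class CorrectiveAction:
    description: str
    owner: str
    due: date


@dataclass
class RcaRecord:
    defect_id: str
    root_cause: str = ""
    actions: list = field(default_factory=list)

    def closure_blockers(self):
        """Return the reasons this defect cannot be closed yet (empty if none)."""
        blockers = []
        if not self.root_cause.strip():
            blockers.append("root cause not documented")
        if not self.actions:
            blockers.append("no corrective action defined")
        for action in self.actions:
            if not action.owner.strip():
                blockers.append(f"action without owner: {action.description}")
        return blockers


record = RcaRecord(defect_id="DEF-101")  # hypothetical defect ID
print(record.closure_blockers())
# ['root cause not documented', 'no corrective action defined']

record.root_cause = "Test strategy defines no execution time limits"
record.actions.append(
    CorrectiveAction("Add a time budget to the test strategy", "QA lead", date(2025, 1, 31))
)
print(record.closure_blockers())
# []
```

Returning a list of blockers rather than a boolean lets the tracker tell the reporter exactly what is missing.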
Practice Questions
- What is the difference between a symptom and a root cause?
Answer: A symptom is an observable effect of the problem (build fails, app crashes). A root cause is the fundamental reason that, when fixed, prevents recurrence.
- When would you use a fishbone diagram instead of 5 Whys?
Answer: When the problem could have multiple contributing causes across different categories (people, process, technology, environment), requiring systematic brainstorming.
- What is an AND gate in fault tree analysis?
Answer: An AND gate means the parent event occurs only if ALL child events occur together. When the child events are independent, the parent's probability is the product of the child probabilities.
- How do you ensure RCA findings lead to action?
Answer: Document corrective actions with owners and deadlines, track them in the same system as the original defect, and verify effectiveness after implementation.
- What is the most common RCA mistake?
Answer: Stopping at the first plausible cause instead of digging deeper through successive "why" questions.
Challenge
Analyze a real production incident from your team using all three RCA techniques. Start with 5 Whys to find a causal chain, then use fishbone to identify contributing factors across categories, then model the incident as a fault tree. Compare the results — do all three techniques converge on the same root cause?
Real-World Task
Implement an RCA program for your QA team. Define which defects require formal RCA (e.g., all Sev-1 and Sev-2, plus recurring Sev-3). Create an RCA template with sections for problem description, technique used, causal chain, root cause, corrective actions, and verification. Train the team on all three techniques. Track the recurrence rate of analyzed defects over three months.
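For the tracking step, one possible definition of recurrence rate is the share of analyzed root causes that show up again in a later defect within the observation window. The Python sketch below assumes that definition and a hypothetical record format of (defect ID, root-cause label, report date); adapt both to whatever your defect tracker exports.

```python
from collections import defaultdict
from datetime import date, timedelta


def recurrence_rate(defects, window_days=90):
    """Fraction of distinct root causes seen again within window_days of first report.

    defects: iterable of (defect_id, root_cause_label, reported_on) tuples.
    """
    dates_by_cause = defaultdict(list)
    for _defect_id, cause, reported_on in defects:
        dates_by_cause[cause].append(reported_on)

    if not dates_by_cause:
        return 0.0

    window = timedelta(days=window_days)
    recurring = 0
    for dates in dates_by_cause.values():
        dates.sort()
        first = dates[0]
        # A cause recurs if any later defect falls inside the window.
        if any(d - first <= window for d in dates[1:]):
            recurring += 1
    return recurring / len(dates_by_cause)


# Hypothetical data: three root causes, one of which recurs.
defects = [
    ("DEF-101", "no test time budget", date(2025, 1, 10)),
    ("DEF-117", "no test time budget", date(2025, 2, 20)),
    ("DEF-120", "sdk version mismatch", date(2025, 2, 25)),
    ("DEF-133", "missing crash alert", date(2025, 3, 5)),
]

print(f"Recurrence rate: {recurrence_rate(defects):.0%}")  # Recurrence rate: 33%
```

The measure only works if root-cause labels are assigned consistently, which is one more reason to use a shared RCA template.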
Next Steps
Now that you can perform root cause analysis, apply these techniques in Shift-Left Testing to catch defects even earlier, and integrate RCA into your Test Strategy documentation.
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.