LLM Evaluation and Benchmarking — Metrics, Datasets and Automated Testing
LLM evaluation is the process of measuring model performance across accuracy, safety, latency, and cost. This guide covers standard benchmarks, automated evaluation frameworks, and custom metric design.
What You'll Learn
You'll learn to evaluate LLMs using standard benchmarks (MMLU, HumanEval, GSM8K), build automated evaluation pipelines with Python, design task-specific metrics, and run human alignment tests.
Why It Matters
A model that scores 90% on benchmarks may still fail in production. Systematic evaluation across multiple dimensions — accuracy, safety, latency, cost, and robustness — is the only way to select and monitor LLMs reliably.
Real-World Use
Doda Browser evaluates every new LLM release against a curated test suite of 500 browser-specific queries before updating its AI assistant, ensuring improvements in one area do not regress performance in another.
Evaluation Framework
flowchart TD
A[Model] --> B[Benchmark Suite]
B --> C[Accuracy Tests]
B --> D[Safety Tests]
B --> E[Performance Tests]
B --> F[Cost Analysis]
C --> G[Scoring]
D --> G
E --> G
F --> G
G --> H[Decision]
H --> I[Deploy]
H --> J[Reject]
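The Scoring and Decision steps in the flowchart can be sketched as a simple gate. This is only an illustration: the dimension names, the minimum scores, and the `decide` helper are assumptions, not part of any particular framework.

```python
from typing import Dict, List, Tuple

# Hypothetical minimum scores per dimension, each normalized to 0..1
# (higher is better). Tune these for your own product.
MIN_SCORES: Dict[str, float] = {
    "accuracy": 0.80,
    "safety": 0.95,
    "performance": 0.70,
    "cost": 0.50,
}


def decide(
    scores: Dict[str, float],
    min_scores: Dict[str, float] = MIN_SCORES,
) -> Tuple[str, List[str]]:
    """Return ("Deploy" or "Reject", list of failing dimensions).

    A missing dimension counts as a failure, so an incomplete
    evaluation run can never result in "Deploy".
    """
    failures = [
        name
        for name, minimum in min_scores.items()
        if scores.get(name, 0.0) < minimum
    ]
    return ("Reject" if failures else "Deploy", failures)


decision, failed = decide(
    {"accuracy": 0.86, "safety": 0.97, "performance": 0.75, "cost": 0.40}
)
print(decision, failed)  # Reject ['cost']
```

Requiring every dimension to clear its own minimum, rather than averaging them, keeps a strong accuracy score from masking a safety or cost failure.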
Running Standard Benchmarks
MMLU measures multitask knowledge across 57 subjects. HumanEval tests code generation.
from datasets import load_dataset
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def evaluate_mmlu_subset(subject="machine_learning", num_samples=20):
    # Assumes the Hugging Face Hub copy of MMLU published as "cais/mmlu".
    dataset = load_dataset(
        "cais/mmlu", subject, split="test"
    ).select(range(num_samples))
    correct = 0
    for item in dataset:
        # Render the question with lettered choices (A, B, C, D).
        prompt = f"{item['question']}\n"
        for i, choice in enumerate(item["choices"]):
            prompt += f"{chr(65+i)}. {choice}\n"
        prompt += "\nAnswer with the letter only."
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=5,
            temperature=0
        )
        answer = (response.choices[0].message.content or "").strip().upper()
        # item["answer"] is the index of the correct choice (0-3),
        # so convert it to a letter before comparing.
        expected = chr(65 + item["answer"])
        if answer.startswith(expected):
            correct += 1
    accuracy = correct / num_samples
    print(f"Subject: {subject}")
    print(f"Samples: {num_samples}")
    print(f"Correct: {correct}/{num_samples}")
    print(f"Accuracy: {accuracy:.2%}")
    return accuracy

result = evaluate_mmlu_subset()
Example output (exact scores vary by model version):
Subject: machine_learning
Samples: 20
Correct: 17/20
Accuracy: 85.00%
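With only 20 samples, an accuracy figure carries a lot of uncertainty. As a rough illustration, the sketch below computes a 95% Wilson score interval for a result such as 17/20. The helper name `wilson_interval` is our own and is not part of the script above.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    )
    return center - half_width, center + half_width


low, high = wilson_interval(17, 20)
print(f"17/20 -> 95% CI: {low:.1%} to {high:.1%}")
```

For 17/20 the interval runs from roughly 64% to 95%, so two models that differ by a few points on a 20-question subset cannot be ranked reliably. Increasing `num_samples` narrows the range.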
Automated Evaluation with DeepEval
DeepEval provides pre-built metrics for LLM output quality.
from deepeval import evaluate
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    HallucinationMetric,
)
from deepeval.test_case import LLMTestCase

# Define metrics. For hallucination, lower scores are better, so its
# threshold acts as a maximum rather than a minimum.
relevancy = AnswerRelevancyMetric(threshold=0.7)
faithfulness = FaithfulnessMetric(threshold=0.7)
hallucination = HallucinationMetric(threshold=0.5)

docs = [
    "Each API key is limited to 100 requests per minute.",
    "Rate limits reset every 60 seconds from first request.",
]
test_case = LLMTestCase(
    input="What are the rate limits for the API?",
    actual_output="The API allows 100 requests per minute per key.",
    expected_output="API rate limit is 100 requests per minute.",
    retrieval_context=docs,  # read by FaithfulnessMetric
    context=docs,            # read by HallucinationMetric
)
results = evaluate(
    test_cases=[test_case],
    metrics=[relevancy, faithfulness, hallucination]
)
# In recent DeepEval releases, metrics_data is a list of per-metric results;
# check your installed version. `success` already accounts for whether a
# metric is higher-is-better or lower-is-better.
for metric in results.test_results[0].metrics_data:
    status = "PASS" if metric.success else "FAIL"
    print(f"{metric.name}: {metric.score:.3f} {status}")
Example output (scores vary with the judge model and between runs):
Answer Relevancy: 0.892 PASS
Faithfulness: 0.934 PASS
Hallucination: 0.123 PASS
Custom Task-Specific Metrics
Design metrics aligned with your specific use case.
import re
from typing import List

def code_correctness_score(
    prompt: str,
    generated_code: str,
    test_cases: List[tuple]
) -> dict:
    """Evaluate generated code against (inputs, expected) test cases.

    Warning: exec() runs the generated code with full privileges.
    Use it only on trusted code, or run it inside a sandbox.
    """
    passed = 0
    errors = []
    func = None
    try:
        # Execute once, in a single namespace, so that recursive calls
        # inside the generated function can resolve its own name.
        namespace = {}
        exec(generated_code, namespace)
        # Extract the first function name from the code
        func_match = re.search(r"def\s+(\w+)\s*\(", generated_code)
        if not func_match:
            errors.append("No function found")
        else:
            func_name = func_match.group(1)
            func = namespace.get(func_name)
            if not callable(func):
                errors.append(f"{func_name} is not callable")
                func = None
    except Exception as e:
        errors.append(str(e))
    # Skip the test loop entirely if no usable function was found.
    for inputs, expected in (test_cases if func else []):
        try:
            result = func(*inputs) if isinstance(inputs, tuple) else func(inputs)
            if result == expected:
                passed += 1
            else:
                errors.append(f"Expected {expected}, got {result}")
        except Exception as e:
            errors.append(str(e))
    return {
        "passed": passed,
        "total": len(test_cases),
        "score": passed / len(test_cases) if test_cases else 0,
        "errors": errors[:3]
    }

# Test
code = """
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n-1) + fibonacci(n-2)
"""
tests = [
    (1, 1), (5, 5), (10, 55), (0, 0), (20, 6765)
]
result = code_correctness_score("fibonacci", code, tests)
print(f"Correctness: {result['passed']}/{result['total']}")
print(f"Score: {result['score']:.0%}")
Expected output:
Correctness: 5/5
Score: 100%
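When several completions are sampled per problem, code benchmarks such as HumanEval are usually reported as pass@k rather than a single pass rate. Below is a minimal sketch of the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples drawn for a problem and c is the number that passed every test. The function name `pass_at_k` is our own.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct.

    n: completions sampled for the problem
    c: completions that passed all tests
    k: budget of attempts being evaluated (1 <= k <= n)
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# 10 samples, 3 of which passed
print(f"pass@1 = {pass_at_k(10, 3, 1):.3f}")  # 0.300
print(f"pass@5 = {pass_at_k(10, 3, 5):.3f}")  # 0.917
```

To score a whole benchmark, compute this per problem (for example with `c` taken from the number of samples that reach a score of 1.0 in `code_correctness_score`) and average the results.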
Latency and Cost Benchmarking
Compare models on real-world cost and speed metrics.
import json
import statistics
import time
from typing import List

# Reuses the OpenAI `client` created in the MMLU example.

def benchmark_model(
    model_name: str,
    prompts: List[str],
    max_tokens: int = 200
) -> dict:
    latencies = []
    total_input_tokens = 0
    total_output_tokens = 0
    for prompt in prompts:
        start = time.perf_counter()  # monotonic, high-resolution timer
        response = client.chat.completions.create(
            model=model_name,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_tokens,
        )
        elapsed = time.perf_counter() - start
        latencies.append(elapsed)
        total_input_tokens += response.usage.prompt_tokens
        total_output_tokens += response.usage.completion_tokens
    # USD per 1M (input, output) tokens. Prices change, so verify them
    # against the provider's current pricing page.
    pricing = {
        "gpt-4o": (2.50, 10.00),
        "gpt-4o-mini": (0.15, 0.60),
        "gpt-3.5-turbo": (0.50, 1.50)
    }
    # Unknown models fall back to (0, 0), which reports a cost of 0.
    input_price, output_price = pricing.get(model_name, (0, 0))
    cost = (
        total_input_tokens / 1_000_000 * input_price +
        total_output_tokens / 1_000_000 * output_price
    )
    return {
        "model": model_name,
        "avg_latency": statistics.mean(latencies),
        "p50_latency": statistics.median(latencies),
        "total_cost": round(cost, 6),
        "total_tokens": total_input_tokens + total_output_tokens
    }

test_prompts = [
    "Explain quantum computing in 3 sentences." for _ in range(5)
]
# results = benchmark_model("gpt-4o-mini", test_prompts)
# print(json.dumps(results, indent=2))
print("Benchmark function ready. Uncomment to run (costs ~$0.001)")
Expected output:
Benchmark function ready. Uncomment to run (costs ~$0.001)
Common Errors
| Error | Cause | Fix |
|---|---|---|
| Benchmark score does not match production performance | Benchmark distribution differs from real queries | Build a domain-specific evaluation set from production logs |
| LLM-as-judge metric is biased toward its own outputs | Evaluator model favors its own style | Use a different model as evaluator or human raters |
| Hallucination metric flags correct answers | Ground truth context is incomplete | Expand retrieval context to cover all relevant information |
| Latency benchmark varies wildly between runs | Network jitter and server load | Run each prompt 5 times, report median, warm up with dummy calls |
| Human evaluation is inconsistent across raters | No rubrics or calibration | Define clear scoring criteria and run rater calibration sessions |
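The latency fix in the table (warm up, repeat, report the median) can be sketched as a small harness that times any callable. `measure_latency` and its parameters are illustrative names; in practice you would pass a function that wraps your real model call.

```python
import statistics
import time
from typing import Callable, Dict


def measure_latency(
    call: Callable[[], object],
    runs: int = 5,
    warmup: int = 1,
) -> Dict[str, float]:
    """Time `call` `runs` times after `warmup` untimed calls (seconds)."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    for _ in range(warmup):
        call()  # untimed: absorbs cold starts and connection setup
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append(time.perf_counter() - start)
    return {
        "median": statistics.median(samples),
        "mean": statistics.fmean(samples),
        "min": min(samples),
        "max": max(samples),
    }


# Stand-in workload; replace the lambda with a call to your model.
stats = measure_latency(lambda: time.sleep(0.01), runs=5, warmup=1)
print(f"median={stats['median'] * 1000:.1f} ms  max={stats['max'] * 1000:.1f} ms")
```

Reporting `max` next to `median` makes outliers visible without letting them distort the headline number.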
Practice Questions
Why is MMLU considered a general knowledge benchmark rather than a task-specific one? MMLU tests across 57 subjects from STEM to humanities, measuring broad knowledge rather than proficiency in any single domain.
What is the trade-off between using LLM-as-judge and human evaluation? LLM-as-judge is fast and cheap but can exhibit biases; human evaluation is more reliable but slow and expensive.
How does the faithfulness metric differ from the hallucination metric? Faithfulness measures whether the output's claims are supported by the retrieved context, so a higher score is better; hallucination measures how much of the output is fabricated or contradicts the supplied context, so a lower score is better.
Why should latency benchmarks use median rather than mean? Mean is skewed by outliers (cold starts, network spikes); median (p50) better represents typical user experience.
Challenge: Build an evaluation dashboard that runs MMLU, HumanEval, and a custom domain-specific benchmark weekly across 5 different models, stores results in SQLite, and generates a regression report comparing scores against the previous week.
Mini Project
Build a regression testing suite for your AI chatbot. Create 50 test cases covering common user intents (FAQ, troubleshooting, feature requests), run them against every new model version, compute relevancy, faithfulness, and latency scores, and fail the CI/CD pipeline if any metric crosses its configurable threshold.
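The last step of the mini project, failing the pipeline when a metric crosses its threshold, could look like the sketch below. The metric names, the limits, and the `find_regressions` helper are assumptions for illustration. Latency is a lower-is-better metric, so each threshold carries a direction.

```python
from typing import Dict, List, Tuple

# (limit, higher_is_better) per metric; the values are placeholders.
THRESHOLDS: Dict[str, Tuple[float, bool]] = {
    "relevancy": (0.70, True),
    "faithfulness": (0.70, True),
    "p50_latency_s": (2.00, False),
}


def find_regressions(
    scores: Dict[str, float],
    thresholds: Dict[str, Tuple[float, bool]] = THRESHOLDS,
) -> List[str]:
    """Return one message per metric that is missing or violates its limit."""
    problems = []
    for name, (limit, higher_is_better) in thresholds.items():
        value = scores.get(name)
        if value is None:
            problems.append(f"{name}: missing")
        elif higher_is_better and value < limit:
            problems.append(f"{name}: {value:.3f} < {limit:.3f}")
        elif not higher_is_better and value > limit:
            problems.append(f"{name}: {value:.3f} > {limit:.3f}")
    return problems


def main(scores: Dict[str, float]) -> int:
    """Print failures and return a process exit code (0 = pass, 1 = fail)."""
    problems = find_regressions(scores)
    for problem in problems:
        print("FAIL", problem)
    return 1 if problems else 0


# Made-up scores for one model version.
example = {"relevancy": 0.82, "faithfulness": 0.65, "p50_latency_s": 2.40}
exit_code = main(example)
# In a CI job, finish with: import sys; sys.exit(exit_code)
```

A non-zero exit code is what most CI systems use to mark a job as failed, so the same script works unchanged across runners.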
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.