LLM Evaluation and Benchmarking — Metrics, Datasets and Automated Testing

DodaTech Updated 2026-06-22 5 min read

LLM evaluation is the process of measuring model performance across accuracy, safety, latency, and cost. This guide covers standard benchmarks, automated evaluation frameworks, and custom metric design.

What You'll Learn

You'll learn to evaluate LLMs using standard benchmarks (MMLU, HumanEval, GSM8K), build automated evaluation pipelines with Python, design task-specific metrics, and run human alignment tests.

Why It Matters

A model that scores 90% on benchmarks may still fail in production. Systematic evaluation across multiple dimensions — accuracy, safety, latency, cost, and robustness — is the only way to select and monitor LLMs reliably.

Real-World Use

Doda Browser evaluates every new LLM release against a curated test suite of 500 browser-specific queries before updating its AI assistant, ensuring improvements in one area do not regress performance in another.

Evaluation Framework

flowchart TD
    A[Model] --> B[Benchmark Suite]
    B --> C[Accuracy Tests]
    B --> D[Safety Tests]
    B --> E[Performance Tests]
    B --> F[Cost Analysis]
    C --> G[Scoring]
    D --> G
    E --> G
    F --> G
    G --> H[Decision]
    H --> I[Deploy]
    H --> J[Reject]
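
The scoring and decision steps in the flowchart can be sketched as a simple threshold gate. This is a minimal illustration, not a description of any particular production pipeline: the `decide` function, the metric names, and the threshold values are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical thresholds: (limit, direction).
# "min": the score must be at least the limit.
# "max": the score must not exceed the limit.
THRESHOLDS: Dict[str, Tuple[float, str]] = {
    "accuracy": (0.80, "min"),
    "safety_pass_rate": (0.99, "min"),
    "p50_latency_s": (2.0, "max"),
    "cost_per_1k_queries_usd": (1.50, "max"),
}


def decide(
    scores: Dict[str, float],
    thresholds: Dict[str, Tuple[float, str]] = THRESHOLDS,
) -> Tuple[str, List[str]]:
    """Return ("deploy" | "reject", list of reasons for rejection)."""
    failures: List[str] = []
    for name, (limit, direction) in thresholds.items():
        value = scores.get(name)
        if value is None:
            failures.append(f"{name}: missing score")
        elif direction == "min" and value < limit:
            failures.append(f"{name}: {value} is below minimum {limit}")
        elif direction == "max" and value > limit:
            failures.append(f"{name}: {value} is above maximum {limit}")
    return ("deploy" if not failures else "reject"), failures


# Made-up candidate scores for illustration only
candidate = {
    "accuracy": 0.86,
    "safety_pass_rate": 0.995,
    "p50_latency_s": 2.4,
    "cost_per_1k_queries_usd": 0.90,
}
decision, reasons = decide(candidate)
print(decision, reasons)
```

A missing score counts as a failure here, so a model cannot be deployed just because one of the test stages did not run.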

Running Standard Benchmarks

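MMLU measures multitask knowledge across 57 subjects. HumanEval tests code generation. GSM8K tests grade-school math word problems. The following function scores a model on a small MMLU subset.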

from datasets import load_dataset
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def evaluate_mmlu_subset(subject="machine_learning", num_samples=20):
    # "cais/mmlu" is the Hugging Face Hub ID for MMLU; each subject is a config
    dataset = load_dataset("cais/mmlu", subject, split="test")
    num_samples = min(num_samples, len(dataset))
    dataset = dataset.select(range(num_samples))

    correct = 0
    for item in dataset:
        # Render the question and its choices as "A. ...", "B. ...", and so on
        prompt = f"{item['question']}\n"
        for i, choice in enumerate(item["choices"]):
            prompt += f"{chr(65+i)}. {choice}\n"
        prompt += "\nAnswer with the letter only."

        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=5,
            temperature=0
        )
        answer = (response.choices[0].message.content or "").strip().upper()

        # item["answer"] is an integer index (0-3), so convert it to a letter
        expected = chr(65 + item["answer"])
        if answer.startswith(expected):
            correct += 1

    accuracy = correct / num_samples
    print(f"Subject: {subject}")
    print(f"Samples: {num_samples}")
    print(f"Correct: {correct}/{num_samples}")
    print(f"Accuracy: {accuracy:.2%}")
    return accuracy

result = evaluate_mmlu_subset()

Example output (the exact score depends on the model and will vary):

Subject: machine_learning
Samples: 20
Correct: 17/20
Accuracy: 85.00%
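
A 20-question sample gives a noisy estimate, so the accuracy figure is best read with an uncertainty range. As a rough illustration, the sketch below computes a 95% Wilson score interval for the 17/20 result above. The helper name `wilson_interval` is made up for this example and is not part of any library used in this guide.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need total > 0 and 0 <= correct <= total")
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    margin = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - margin, center + margin


low, high = wilson_interval(17, 20)
print(f"Accuracy: 85.0% (95% interval: {low:.1%} to {high:.1%})")
```

For 17/20 the interval runs from roughly 64% to 95%, which is too wide to rank two models that differ by a few points. Increasing `num_samples` narrows it.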

Automated Evaluation with DeepEval

DeepEval provides pre-built metrics for LLM output quality.

# Note: DeepEval's API changes between releases; check the version you install.
from deepeval import evaluate
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    HallucinationMetric,
)
from deepeval.test_case import LLMTestCase

# Define metrics. For relevancy and faithfulness, higher is better and the
# threshold is a minimum. For hallucination, lower is better and the
# threshold is a maximum.
relevancy = AnswerRelevancyMetric(threshold=0.7)
faithfulness = FaithfulnessMetric(threshold=0.7)
hallucination = HallucinationMetric(threshold=0.5)

documents = [
    "Each API key is limited to 100 requests per minute.",
    "Rate limits reset every 60 seconds from first request.",
]

test_case = LLMTestCase(
    input="What are the rate limits for the API?",
    actual_output="The API allows 100 requests per minute per key.",
    expected_output="API rate limit is 100 requests per minute.",
    retrieval_context=documents,  # used by FaithfulnessMetric
    context=documents,            # required by HallucinationMetric
)

results = evaluate(
    test_cases=[test_case],
    metrics=[relevancy, faithfulness, hallucination]
)

# metrics_data is a list of per-metric results. Use `success` rather than
# comparing score to threshold, because the direction differs per metric.
for metric in results.test_results[0].metrics_data:
    print(f"{metric.name}: {metric.score:.3f} "
          f"({'PASS' if metric.success else 'FAIL'})")

Example output (scores come from an LLM judge and will vary between runs):

Answer Relevancy: 0.892 (PASS)
Faithfulness: 0.934 (PASS)
Hallucination: 0.123 (PASS)

Custom Task-Specific Metrics

Design metrics aligned with your specific use case.

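import re
from typing import List

def code_correctness_score(
    prompt: str,
    generated_code: str,
    test_cases: List[tuple]
) -> dict:
    """Evaluate generated code against (inputs, expected) test cases.

    Warning: exec() runs arbitrary code. Run untrusted model output only
    inside a sandbox (for example a container or a restricted subprocess).
    """
    def summary(passed: int, errors: list) -> dict:
        total = len(test_cases)
        return {
            "passed": passed,
            "total": total,
            "score": passed / total if total else 0,
            "errors": errors[:3]
        }

    # Extract the name of the first function defined in the code
    func_match = re.search(r"def\s+(\w+)\s*\(", generated_code)
    if not func_match:
        return summary(0, ["No function found"])
    func_name = func_match.group(1)

    # Execute once, in a single namespace. With separate globals and locals,
    # a recursive function cannot find its own name and raises NameError.
    namespace = {"__builtins__": __builtins__}
    try:
        exec(generated_code, namespace)
    except Exception as e:
        return summary(0, [f"Code failed to load: {e}"])

    func = namespace.get(func_name)
    if not callable(func):
        return summary(0, [f"{func_name} is not callable"])

    passed = 0
    errors = []
    for inputs, expected in test_cases:
        try:
            # A tuple is unpacked as positional arguments; anything else
            # is passed as a single argument
            result = func(*inputs) if isinstance(inputs, tuple) else func(inputs)
            if result == expected:
                passed += 1
            else:
                errors.append(
                    f"Expected {expected}, got {result}"
                )
        except Exception as e:
            errors.append(str(e))

    return summary(passed, errors)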

# Test
code = """
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n-1) + fibonacci(n-2)
"""

tests = [
    (1, 1), (5, 5), (10, 55), (0, 0), (20, 6765)
]

result = code_correctness_score("fibonacci", code, tests)
print(f"Correctness: {result['passed']}/{result['total']}")
print(f"Score: {result['score']:.0%}")

Expected output:

Correctness: 5/5
Score: 100%
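
HumanEval-style code benchmarks usually report pass@k: the probability that at least one of k sampled completions passes every test. The sketch below uses the standard unbiased estimator 1 - C(n-c, k) / C(n, k), where n completions were sampled per problem and c of them passed. The function name and the sample numbers are illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled completions, c of which passed."""
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a pass
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Made-up example: 10 completions sampled for one problem, 3 passed
print(f"pass@1: {pass_at_k(10, 3, 1):.3f}")
print(f"pass@5: {pass_at_k(10, 3, 5):.3f}")
```

A benchmark score is the mean of this value over all problems. Each completion's pass or fail result can come from a checker like `code_correctness_score` above, counting a completion as passed only when its score is 1.0.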

Latency and Cost Benchmarking

Compare models on real-world cost and speed metrics.

import json
import statistics
import time
from typing import List

# Reuses the `client` created in the MMLU example above

def benchmark_model(
    model_name: str,
    prompts: List[str],
    max_tokens: int = 200
) -> dict:
    if not prompts:
        raise ValueError("prompts must not be empty")

    latencies = []
    total_input_tokens = 0
    total_output_tokens = 0

    for prompt in prompts:
        # perf_counter is a monotonic clock, better suited to timing than time.time
        start = time.perf_counter()
        response = client.chat.completions.create(
            model=model_name,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_tokens,
        )
        elapsed = time.perf_counter() - start

        latencies.append(elapsed)
        total_input_tokens += response.usage.prompt_tokens
        total_output_tokens += response.usage.completion_tokens

    # USD per 1M tokens as (input, output). Prices change, so check the
    # provider's current price list before relying on these figures.
    pricing = {
        "gpt-4o": (2.50, 10.00),
        "gpt-4o-mini": (0.15, 0.60),
        "gpt-3.5-turbo": (0.50, 1.50)
    }

    # Unknown models fall back to zero cost
    input_price, output_price = pricing.get(model_name, (0, 0))
    cost = (
        total_input_tokens / 1_000_000 * input_price +
        total_output_tokens / 1_000_000 * output_price
    )

    return {
        "model": model_name,
        "avg_latency": sum(latencies) / len(latencies),
        "p50_latency": statistics.median(latencies),
        "total_cost": round(cost, 6),
        "total_tokens": total_input_tokens + total_output_tokens
    }

test_prompts = [
    "Explain quantum computing in 3 sentences." for _ in range(5)
]
# results = benchmark_model("gpt-4o-mini", test_prompts)
# print(json.dumps(results, indent=2))
print("Benchmark function ready. Uncomment to run (costs ~$0.001)")

Expected output:

Benchmark function ready. Uncomment to run (costs ~$0.001)
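
Latency numbers are easier to compare when each measurement is repeated and preceded by an untimed warm-up call. The harness below is a minimal sketch of that idea. `measure_latency` is a hypothetical helper that times any zero-argument callable, so the example runs without an API key by using a stand-in workload.

```python
import statistics
import time
from typing import Callable, Dict


def measure_latency(
    call: Callable[[], object],
    runs: int = 5,
    warmup: int = 1,
) -> Dict[str, float]:
    """Time `call` several times and summarize the samples in seconds."""
    if runs < 1:
        raise ValueError("runs must be at least 1")

    # Warm-up calls are executed but not timed (connection setup, caches)
    for _ in range(warmup):
        call()

    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append(time.perf_counter() - start)

    return {
        "median_s": statistics.median(samples),
        "mean_s": statistics.fmean(samples),
        "min_s": min(samples),
        "max_s": max(samples),
    }


# Stand-in workload; replace with a function that sends one model request
stats = measure_latency(lambda: sum(range(100_000)), runs=5, warmup=1)
print({name: round(value, 6) for name, value in stats.items()})
```

To time a model, pass a lambda that wraps a single `client.chat.completions.create(...)` call. Reporting the median matters when one run is an outlier: for samples of 0.8, 0.9, 1.0, 1.1 and 6.0 seconds the median is 1.0 while the mean is 1.96.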

Common Errors

Error	Cause	Fix
Benchmark score does not match production performance	Benchmark distribution differs from real queries	Build a domain-specific evaluation set from production logs
LLM-as-judge metric is biased toward its own outputs	Evaluator model favors its own style	Use a different model as evaluator or human raters
Hallucination metric flags correct answers	Ground truth context is incomplete	Expand retrieval context to cover all relevant information
Latency benchmark varies wildly between runs	Network jitter and server load	Run each prompt 5 times, report median, warm up with dummy calls
Human evaluation is inconsistent across raters	No rubrics or calibration	Define clear scoring criteria and run rater calibration sessions

Practice Questions

1. Why is MMLU considered a general knowledge benchmark rather than a task-specific one?
   Answer: MMLU tests across 57 subjects from STEM to humanities, measuring broad knowledge rather than proficiency in any single domain.
2. What is the trade-off between using LLM-as-judge and human evaluation?
   Answer: LLM-as-judge is fast and cheap but can exhibit biases; human evaluation is more reliable but slow and expensive.
3. How does the faithfulness metric differ from the hallucination metric?
   Answer: Faithfulness measures whether the output is supported by the context; hallucination specifically detects fabricated information not present in the context.
4. Why should latency benchmarks use median rather than mean?
   Answer: Mean is skewed by outliers (cold starts, network spikes); median (p50) better represents typical user experience.

Challenge: Build an evaluation dashboard that runs MMLU, HumanEval, and a custom domain-specific benchmark weekly across 5 different models, stores results in SQLite, and generates a regression report comparing scores against the previous week.
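
As a possible starting point for the challenge, the sketch below covers only the storage and week-over-week comparison steps. The table name, column layout, tolerance value, model names, and scores are all made up for illustration, and an in-memory database is used so the example leaves no file behind.

```python
import sqlite3
from typing import List, Tuple


def init_db(conn: sqlite3.Connection) -> None:
    # One row per (run date, model, benchmark); hypothetical schema
    conn.execute(
        """CREATE TABLE IF NOT EXISTS eval_results (
               run_date  TEXT NOT NULL,
               model     TEXT NOT NULL,
               benchmark TEXT NOT NULL,
               score     REAL NOT NULL,
               PRIMARY KEY (run_date, model, benchmark)
           )"""
    )


def store_score(conn, run_date: str, model: str, benchmark: str, score: float) -> None:
    # Re-running a benchmark on the same date overwrites the earlier score
    conn.execute(
        "INSERT OR REPLACE INTO eval_results VALUES (?, ?, ?, ?)",
        (run_date, model, benchmark, score),
    )


def find_regressions(
    conn, previous_date: str, current_date: str, tolerance: float = 0.01
) -> List[Tuple[str, str, float, float]]:
    """Return (model, benchmark, previous, current) rows that dropped by more than tolerance."""
    return conn.execute(
        """SELECT cur.model, cur.benchmark, prev.score, cur.score
           FROM eval_results AS cur
           JOIN eval_results AS prev
             ON prev.model = cur.model AND prev.benchmark = cur.benchmark
           WHERE prev.run_date = ? AND cur.run_date = ?
             AND cur.score < prev.score - ?
           ORDER BY cur.model, cur.benchmark""",
        (previous_date, current_date, tolerance),
    ).fetchall()


conn = sqlite3.connect(":memory:")
init_db(conn)
# Made-up scores for two weekly runs
store_score(conn, "2026-06-15", "model-a", "mmlu", 0.85)
store_score(conn, "2026-06-22", "model-a", "mmlu", 0.80)
store_score(conn, "2026-06-15", "model-a", "humaneval", 0.70)
store_score(conn, "2026-06-22", "model-a", "humaneval", 0.72)

regressions = find_regressions(conn, "2026-06-15", "2026-06-22")
for model, benchmark, previous, current in regressions:
    print(f"REGRESSION {model} / {benchmark}: {previous:.2f} -> {current:.2f}")
```

The tolerance keeps small run-to-run fluctuations from being reported as regressions. A fuller solution would pass a file path to `sqlite3.connect` and call `conn.commit()` after each weekly run.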

Mini Project

Build a regression testing suite for your AI chatbot. Create 50 test cases covering common user intents (FAQ, troubleshooting, feature requests), run them against every new model version, compute relevancy, faithfulness, and latency scores, and fail the CI/CD pipeline if any metric crosses its configurable threshold.

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
