ML Model Evaluation Metrics — Complete Guide

DodaTech

This tutorial covers the metrics used to evaluate ML models: what each one measures, how to compute it with scikit-learn, and when to prefer one over another.

Model evaluation metrics are quantitative measures of how well a machine learning model performs on unseen data. They help you choose the best model for your specific use case.

What You'll Learn

How to evaluate classification and regression models using the right metrics, avoid common pitfalls like the accuracy paradox, and interpret confusion matrices and ROC curves.

Why It Matters

A model with 95% accuracy can be completely useless for a fraud detection system where only 1% of transactions are fraudulent: a model that never flags anything already reaches 99% accuracy while catching zero fraud. Choosing the wrong metric leads to deploying models that fail in production.
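The arithmetic behind this pitfall can be checked in a few lines. The following is a minimal sketch on made-up data (1,000 hypothetical transactions, 10 of them fraudulent), with a "model" that simply predicts "not fraud" every time:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical data: 1,000 transactions, 10 of them fraudulent (1%).
y_true = np.zeros(1000, dtype=int)
y_true[:10] = 1

# A trivial "model" that never flags any transaction.
y_pred = np.zeros(1000, dtype=int)

accuracy = accuracy_score(y_true, y_pred)  # share of correct predictions
recall = recall_score(y_true, y_pred)      # share of actual fraud that was caught

print(f"Accuracy: {accuracy:.3f}")  # Accuracy: 0.990
print(f"Recall:   {recall:.3f}")    # Recall:   0.000
```

Accuracy looks excellent, yet recall shows that not a single fraudulent transaction was caught.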

Real-World Use

Durga Antivirus Pro uses precision as its primary metric for malware detection because a false positive (flagging a safe file) frustrates users more than a false negative (missing a threat that other layers catch).

Evaluation Metrics Overview

flowchart TD
    A[Model Evaluation] --> B[Classification]
    A --> C[Regression]
    B --> D[Accuracy]
    B --> E["Precision / Recall"]
    B --> F[F1-Score]
    B --> G[ROC-AUC]
    B --> H[Confusion Matrix]
    C --> I["MAE / MSE / RMSE"]
    C --> J[R-Squared]
    C --> K[Adjusted R-Squared]

Confusion Matrix & Classification Metrics

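from sklearn.metrics import confusion_matrix, classification_report
import numpy as np

# Ground-truth labels and model predictions (1 = fraud, 0 = not fraud)
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:")
print(cm)
print("\nClassification Report:")
# target_names follow the sorted label order: 0 -> 'Not Fraud', 1 -> 'Fraud'
print(classification_report(y_true, y_pred, target_names=['Not Fraud', 'Fraud']))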

Expected output:

Confusion Matrix:
[[4 1]
 [1 4]]

Classification Report:
              precision    recall  f1-score   support
   Not Fraud       0.80      0.80      0.80         5
       Fraud       0.80      0.80      0.80         5
    accuracy                           0.80        10
   macro avg       0.80      0.80      0.80        10
weighted avg       0.80      0.80      0.80        10

Precision answers "how many predicted positives are correct?" Recall answers "how many actual positives did we catch?"
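To make these definitions concrete, the metrics can be derived by hand from the four cells of the confusion matrix. This is a small sketch that reuses the labels from the example above; the variable names are chosen here for illustration:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

# For binary 0/1 labels, ravel() yields the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # of everything predicted positive, how much was right
recall = tp / (tp + fn)     # of everything actually positive, how much was found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"Precision: {precision:.2f}")  # Precision: 0.80
print(f"Recall:    {recall:.2f}")     # Recall:    0.80
print(f"F1-score:  {f1:.2f}")         # F1-score:  0.80
print(f"Accuracy:  {accuracy:.2f}")   # Accuracy:  0.80
```

The values agree with the "Fraud" row of the classification report, since class 1 is treated as the positive class.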

ROC-AUC Curve

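from sklearn.metrics import roc_curve, auc
import numpy as np
import matplotlib.pyplot as plt  # only needed if you plot fpr against tpr

# True labels and predicted scores (higher score = more likely positive)
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 0, 1])
y_scores = np.array([0.1, 0.2, 0.8, 0.7, 0.3, 0.9, 0.4, 0.6, 0.2, 0.85])

# roc_curve sweeps the decision threshold; auc integrates the resulting curve
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
roc_auc = auc(fpr, tpr)
print(f"ROC-AUC Score: {roc_auc:.3f}")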

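Expected output:

ROC-AUC Score: 1.000

An AUC of 1.0 here means every positive in this small sample scores higher than every negative (the lowest positive score is 0.6, the highest negative score is 0.4). More generally, AUC is the probability that the model ranks a random positive higher than a random negative, so an AUC of 0.92 would mean a 92% chance. AUC of 0.5 is random, 1.0 is perfect.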
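The ranking interpretation can be verified directly by comparing every positive with every negative. The sketch below uses hypothetical scores (not the ones from the example above), chosen so that one negative, scored 0.75, outranks two of the positives:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and scores; the negative scored 0.75 beats two positives.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 0, 1])
y_scores = np.array([0.1, 0.2, 0.8, 0.7, 0.3, 0.9, 0.75, 0.6, 0.2, 0.85])

pos = y_scores[y_true == 1]
neg = y_scores[y_true == 0]

# Compare each positive with each negative (5 x 5 = 25 pairs); ties count as half.
diff = pos[:, None] - neg[None, :]
pairwise_auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

sklearn_auc = roc_auc_score(y_true, y_scores)

print(f"Pairwise estimate: {pairwise_auc:.3f}")  # Pairwise estimate: 0.920
print(f"roc_auc_score:     {sklearn_auc:.3f}")   # roc_auc_score:     0.920
```

Of the 25 positive/negative pairs, 23 are ranked correctly, which gives 23/25 = 0.92.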

Regression Metrics

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0, 8.0])
y_pred = np.array([2.8, 5.2, 2.8, 6.5, 8.3])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_pred)

print(f"MAE:  {mae:.3f}")
print(f"MSE:  {mse:.3f}")
print(f"RMSE: {rmse:.3f}")
print(f"R2:   {r2:.3f}")

Expected output:

MAE:  0.300
MSE:  0.102
RMSE: 0.319
R2:   0.978

R-squared of 0.978 means 97.8% of the variance in the target is explained by the model. RMSE penalizes large errors more than MAE because errors are squared before averaging.
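It can help to see where R-squared comes from. The sketch below recomputes it as 1 minus the ratio of residual to total sum of squares, using the same arrays as the regression example, and also shows the adjusted R-squared from the overview diagram. The feature count `n_features = 1` is a hypothetical value picked only for illustration, since the example does not say how many features the model used:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0, 8.0])
y_pred = np.array([2.8, 5.2, 2.8, 6.5, 8.3])

ss_res = np.sum((y_true - y_pred) ** 2)         # variation the model fails to explain
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

# Adjusted R-squared penalizes extra features.
# n_features is an assumption made for this sketch, not taken from the example.
n_samples = len(y_true)
n_features = 1
adj_r2 = 1 - (1 - r2_manual) * (n_samples - 1) / (n_samples - n_features - 1)

print(f"R2 (manual):  {r2_manual:.3f}")
print(f"R2 (sklearn): {r2_score(y_true, y_pred):.3f}")
print(f"Adjusted R2:  {adj_r2:.3f}")
```

The manual value matches `r2_score`, and the adjusted value is always a little lower whenever the fit is not perfect.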

When to Use Which Metric

Imbalanced classes: Use precision, recall, and F1-score instead of accuracy
Equal cost of false positives and false negatives: Use accuracy (if classes are balanced) or F1-score
Ranking quality: Use ROC-AUC
Fraud detection: Prioritize high recall (catch as much fraud as possible), even at some cost to precision
Regression with outliers: Use MAE (less sensitive to outliers than MSE)
Regression without outliers: Use RMSE (penalizes large errors more)
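The last two guidelines are easier to judge with numbers. Here is a minimal sketch on made-up predictions, where the only difference between the two sets is a single large miss; the helper `mae_and_rmse` is defined here for illustration:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical targets and two sets of predictions.
y_true = np.array([10.0, 12.0, 11.0, 13.0, 12.0])
clean_pred = np.array([11.0, 11.0, 12.0, 12.0, 13.0])    # every error has size 1
outlier_pred = np.array([11.0, 11.0, 12.0, 12.0, 22.0])  # last error has size 10

def mae_and_rmse(y, y_hat):
    """Return (MAE, RMSE) for one set of predictions."""
    mae = mean_absolute_error(y, y_hat)
    rmse = np.sqrt(mean_squared_error(y, y_hat))
    return mae, rmse

clean_mae, clean_rmse = mae_and_rmse(y_true, clean_pred)
out_mae, out_rmse = mae_and_rmse(y_true, outlier_pred)

print(f"Clean:        MAE={clean_mae:.3f}  RMSE={clean_rmse:.3f}")  # MAE=1.000  RMSE=1.000
print(f"With outlier: MAE={out_mae:.3f}  RMSE={out_rmse:.3f}")      # MAE=2.800  RMSE=4.561
```

One bad prediction raises MAE from 1.0 to 2.8 but pushes RMSE to about 4.56, which is why MAE is the steadier choice when outliers are present.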

Practice Questions

Why can accuracy be misleading for imbalanced datasets?
What is the difference between precision and recall?
When would you choose MAE over RMSE for regression evaluation?

Frequently Asked Questions

What is a good ROC-AUC score?

A score of 0.5 is random, 0.7-0.8 is acceptable, 0.8-0.9 is excellent, and above 0.9 is outstanding. These bands are rules of thumb: what counts as good depends on your domain, and high-stakes applications such as medical diagnosis typically demand scores much closer to 1.0.

Should I optimize for precision or recall?

Optimize for precision when false positives are costly (spam filtering). Optimize for recall when false negatives are costly (cancer detection). Use F1-score when you need a balanced trade-off.
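In practice this trade-off is often tuned through the decision threshold applied to the model's scores. The sketch below uses hypothetical labels and scores, and the two thresholds (0.5 and 0.78) are arbitrary values picked to show the effect:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels and predicted probabilities.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 0, 1])
y_scores = np.array([0.1, 0.2, 0.8, 0.7, 0.3, 0.9, 0.75, 0.6, 0.2, 0.85])

results = {}
for threshold in (0.5, 0.78):
    # Everything scored at or above the threshold is predicted positive.
    y_pred = (y_scores >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_true, y_pred),
        recall_score(y_true, y_pred),
    )

for threshold, (precision, recall) in results.items():
    print(f"threshold={threshold}: precision={precision:.3f}, recall={recall:.3f}")
# threshold=0.5: precision=0.833, recall=1.000
# threshold=0.78: precision=1.000, recall=0.600
```

Raising the threshold removes the one false positive (precision goes to 1.0) but misses two true positives (recall drops to 0.6); lowering it does the opposite.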

Related Topics

Python: running the evaluation code
scikit-learn Guide: provides all these metrics
What is Machine Learning: foundational concepts

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
