ML Security — Adversarial Attacks & Prevention Strategies
This tutorial introduces ML security: how attackers target machine learning models and how to defend against them, with runnable Python examples and a list of common mistakes.
ML security protects machine learning systems from adversaries who manipulate inputs, poison training data, steal model parameters, or extract sensitive training information. These threats differ fundamentally from those addressed by traditional software security.
What You'll Learn
You'll learn the four main categories of adversarial ML attacks — evasion, poisoning, extraction, and inversion — along with defense techniques including adversarial training, input sanitization, differential privacy, and model monitoring.
Why It Matters
ML models are now embedded in security-critical systems: fraud detection, malware classification, autonomous vehicles, and medical diagnosis. An adversary who can manipulate these models can bypass fraud detection, make malware appear benign, or cause cars to misread stop signs. DodaTech's Durga Antivirus Pro team runs dedicated ML security testing to check that adversarial samples crafted by malware authors do not evade detection.
Real-World Use
A major bank's fraud detection model was attacked through adversarial evasion: criminals added imperceptible perturbations to transaction features to make fraudulent transactions appear legitimate. The bank deployed adversarial training (retraining on perturbed samples) and ensemble detection (multiple models voting), reducing successful evasion from 30% to under 2%.
ML Attack Taxonomy
flowchart TD
    A[ML Security Attacks] --> B[Evasion]
    A --> C[Poisoning]
    A --> D[Extraction]
    A --> E[Inversion]
    B --> F[Adversarial perturbations]
    B --> G[Physical attacks]
    C --> H[Data poisoning]
    C --> I[Backdoor attacks]
    D --> J[Model stealing]
    D --> K[Hyperparameter extraction]
    E --> L[Membership inference]
    E --> M[Training data reconstruction]
    style A fill:#e74c3c,color:#fff
Evasion Attacks — Fooling Trained Models
Evasion attacks modify input data slightly to cause misclassification while appearing unchanged to humans. The Fast Gradient Sign Method (FGSM) computes the direction that maximizes the loss and adds a small perturbation in that direction.
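In symbols, the FGSM step described above is usually written as follows. The notation is an assumption added here for illustration: x is the input, y its label, θ the model parameters, J the loss, and ε the perturbation budget.

```latex
x_{\text{adv}} = x + \epsilon \cdot \operatorname{sign}\!\left(\nabla_{x} J(\theta, x, y)\right),
\qquad
\lVert x_{\text{adv}} - x \rVert_{\infty} \le \epsilon
```

Because only the sign of the gradient is used, every input feature moves by at most ε, which is why ε acts as a bound on the largest per-feature change.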
import torch
import torch.nn as nn
import torch.optim as optim  # used by the adversarial training example later on


class SimpleClassifier(nn.Module):
    """Small fully connected classifier for flattened 28x28 inputs."""

    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 256)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, 10)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)


model = SimpleClassifier()  # untrained: the weights are random
model.eval()

# Stand-in for a normalized image with pixel values in [-1, 1].
sample = torch.empty(1, 784).uniform_(-1, 1)
true_label = torch.tensor([3])
epsilon = 0.1


def fgsm_attack(model, x, y, epsilon=0.1):
    # Work on a copy so the caller's tensor is left untouched.
    x = x.clone().detach().requires_grad_(True)
    output = model(x)
    loss = nn.CrossEntropyLoss()(output, y)
    model.zero_grad()
    loss.backward()
    # Step in the direction that increases the loss, then clip to the valid input range.
    x_adv = x + epsilon * x.grad.sign()
    x_adv = torch.clamp(x_adv, -1, 1).detach()
    orig_pred = output.argmax(1).item()
    adv_pred = model(x_adv).argmax(1).item()
    perturbation_magnitude = (x_adv - x.detach()).abs().mean().item()
    return x_adv, orig_pred, adv_pred, perturbation_magnitude


x_adv, orig, adv, pert = fgsm_attack(model, sample, true_label, epsilon)
print(f"Original prediction: {orig}")
print(f"Adversarial prediction: {adv}")
print(f"Original label: {true_label.item()}")
print(f"Perturbation magnitude: {pert:.3f}")
print(f"Attack success: {orig != adv}")
if orig != adv:
    print("\nFGSM evasion attack succeeded:")
    print(f"  Epsilon: {epsilon}")
    print(f"  Added noise: {pert:.3f} per pixel")
    print(f"  Model fooled: class {orig} -> class {adv}")
Example output (illustrative: the model is untrained and no seed is set, so the predicted classes change between runs and the attack does not flip the prediction on every run; the mean perturbation lands just below epsilon because values near the [-1, 1] bounds are clipped):
Original prediction: 3
Adversarial prediction: 7
Original label: 3
Perturbation magnitude: 0.097
Attack success: True
FGSM evasion attack succeeded:
  Epsilon: 0.1
  Added noise: 0.097 per pixel
  Model fooled: class 3 -> class 7
Data Poisoning — Corrupting Training
Poisoning attacks inject malicious samples into the training data to control model behavior. A backdoor attack inserts a trigger pattern; the model learns to associate the trigger with the target label.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score


def simulate_poisoning(normal_samples=200, poison_samples=10):
    np.random.seed(42)

    # Clean data: the label depends only on the first two features.
    X_clean = np.random.randn(normal_samples, 5)
    y_clean = (X_clean[:, 0] + X_clean[:, 1] > 0).astype(int)

    # Poisoned data: the trigger is an extreme value in the last feature, always labeled 1.
    X_poison = np.random.randn(poison_samples, 5)
    X_poison[:, -1] = 10.0
    y_poison = np.ones(poison_samples)

    X_train = np.vstack([X_clean, X_poison])
    y_train = np.hstack([y_clean, y_poison])

    # Baseline trained on clean data only.
    model_no_poison = SVC(kernel='rbf')
    model_no_poison.fit(X_clean, y_clean)
    clean_acc = accuracy_score(y_clean, model_no_poison.predict(X_clean))

    # Model trained on clean plus poisoned data.
    model_poisoned = SVC(kernel='rbf')
    model_poisoned.fit(X_train, y_train)
    # Accuracy on clean data should stay close to the baseline, which keeps the backdoor hidden.
    clean_success = accuracy_score(y_clean, model_poisoned.predict(X_clean))

    # Fresh samples carrying the trigger: measure how many are pushed to class 1.
    X_test_backdoor = np.random.randn(50, 5)
    X_test_backdoor[:, -1] = 10.0
    backdoor_preds = model_poisoned.predict(X_test_backdoor)
    backdoor_success_rate = backdoor_preds.mean()

    print(f"Clean model accuracy: {clean_acc:.3f}")
    print(f"Poisoned model accuracy on clean data: {clean_success:.3f}")
    print(f"Backdoor trigger success: {backdoor_success_rate:.0%}")
    print(f"Poison ratio: {poison_samples}/{normal_samples + poison_samples}")
    print(f"Attack effectiveness: model predicts 1 for {backdoor_success_rate:.0%} of triggered samples")


simulate_poisoning()
Expected output:
Clean model accuracy: 0.870
Poisoned model accuracy on clean data: 0.865
Backdoor trigger success: 96%
Poison ratio: 10/210
Attack effectiveness: model predicts 1 for 96% of triggered samples
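A common countermeasure is to screen the training set for statistical outliers before fitting. The following is a minimal sketch under stated assumptions, not a complete defense: `flag_outlier_rows`, the 3.5 threshold, and the synthetic data are all hypothetical choices made for illustration. It uses a median-based (robust) z-score, since the poisoned rows themselves inflate the ordinary mean and standard deviation.

```python
import numpy as np


def flag_outlier_rows(X, threshold=3.5):
    """Return a boolean mask of rows where any feature has a robust z-score above threshold."""
    median = np.median(X, axis=0)
    mad = np.median(np.abs(X - median), axis=0)
    mad = np.where(mad == 0, 1e-12, mad)  # avoid division by zero on constant columns
    robust_z = 0.6745 * (X - median) / mad
    return (np.abs(robust_z) > threshold).any(axis=1)


# Synthetic data in the same style as the poisoning example (assumed, not the same seed).
rng = np.random.default_rng(0)
X_clean = rng.standard_normal((200, 5))
X_poison = rng.standard_normal((10, 5))
X_poison[:, -1] = 10.0  # extreme trigger value in the last feature
X_train = np.vstack([X_clean, X_poison])

suspicious = flag_outlier_rows(X_train)
poison_flagged = int(suspicious[200:].sum())
clean_flagged = int(suspicious[:200].sum())
X_filtered = X_train[~suspicious]  # rows kept for training

print(f"Poisoned rows flagged: {poison_flagged}/10")
print(f"Clean rows flagged (false positives): {clean_flagged}/200")
```

A check like this only catches triggers that sit far outside the normal feature range. A trigger built from ordinary-looking values would pass, so outlier screening is best treated as one layer among several.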
Model Extraction — Stealing the Model
Model extraction attacks query a deployed model API to reproduce its functionality, enabling competitors to steal proprietary models or find vulnerabilities offline.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier


def model_extraction_attack(victim_model, X_public, query_budget=100):
    np.random.seed(42)
    # Query the victim on a random subset of the public data, within the budget.
    n_samples = min(query_budget, len(X_public))
    query_indices = np.random.choice(len(X_public), n_samples, replace=False)
    X_query = X_public[query_indices]
    victim_preds = victim_model.predict(X_query)
    # Train a surrogate on the victim's answers, not on the true labels.
    stolen_model = KNeighborsClassifier(n_neighbors=3)
    stolen_model.fit(X_query, victim_preds)
    return stolen_model


iris = load_iris()
X, y = iris.data, iris.target

victim = RandomForestClassifier(
    n_estimators=200, random_state=42
)
victim.fit(X, y)
victim_acc = victim.score(X, y)

query_budget = 100
stolen = model_extraction_attack(victim, X, query_budget=query_budget)
stolen_predictions = stolen.predict(X)
# Fidelity: how often the surrogate agrees with the victim.
extraction_accuracy = (stolen_predictions == victim.predict(X)).mean()

print(f"Victim model accuracy: {victim_acc:.3f}")
print(f"Extraction queries: {query_budget} out of {len(X)}")
print(f"Extraction accuracy: {extraction_accuracy:.3f}")
print(f"Stolen model fidelity: {extraction_accuracy:.1%} of predictions match victim")
Expected output:
Victim model accuracy: 1.000
Extraction queries: 100 out of 150
Extraction accuracy: 0.953
Stolen model fidelity: 95.3% of predictions match victim
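One way to raise the cost of the attack above is to reveal less per query and to cap how many queries each client can make. The sketch below is a hypothetical wrapper, not a real library API: `GuardedModelAPI`, `QueryBudgetExceeded`, and `toy_predict_proba` are names invented for illustration, and the stand-in model is deliberately trivial.

```python
from collections import defaultdict


class QueryBudgetExceeded(Exception):
    """Raised when a client has used up its query allowance."""


class GuardedModelAPI:
    """Hypothetical wrapper: returns only the top-1 label and enforces a per-client budget."""

    def __init__(self, predict_proba, max_queries_per_client=100):
        self._predict_proba = predict_proba
        self._max_queries = max_queries_per_client
        self._query_counts = defaultdict(int)

    def predict(self, client_id, features):
        if self._query_counts[client_id] >= self._max_queries:
            raise QueryBudgetExceeded(
                f"client {client_id!r} exceeded {self._max_queries} queries"
            )
        self._query_counts[client_id] += 1
        probabilities = self._predict_proba(features)
        # Return the label only; full confidence scores give an attacker extra signal.
        return max(range(len(probabilities)), key=probabilities.__getitem__)


def toy_predict_proba(features):
    """Stand-in model: two classes split on the sign of the feature sum."""
    return [0.2, 0.8] if sum(features) > 0 else [0.9, 0.1]


api = GuardedModelAPI(toy_predict_proba, max_queries_per_client=3)
labels = [api.predict("client-a", [x, 1.0]) for x in (-2.0, 0.5, 3.0)]
print(labels)  # [0, 1, 1]

try:
    api.predict("client-a", [0.0, 0.0])
    blocked = False
except QueryBudgetExceeded:
    blocked = True
print(f"Fourth query blocked: {blocked}")
```

A budget like this slows extraction rather than preventing it, since an attacker can spread queries across accounts. In practice it would be paired with monitoring for unusual query patterns.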
Defense Strategies
| Defense | Attack Mitigated | How It Works |
|---|---|---|
| Adversarial training | Evasion | Train on adversarially perturbed samples |
| Input sanitization | Evasion | Detect and filter adversarial inputs |
| Differential privacy | Inversion | Add noise to gradients during training |
| Output restriction and rate limiting | Extraction | Return only top-1 labels or rounded confidences; cap and monitor queries per client |
| Data validation | Poisoning | Detect statistical outliers in training data |
| Ensemble methods | Evasion | Multiple models vote; harder to fool all |
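The differential privacy row can be made concrete with a single noisy gradient step. This is a rough sketch in the style of DP-SGD, assuming a logistic regression model and hypothetical values for `clip_norm` and `noise_multiplier`; it shows the clip-then-add-noise mechanics only and does no privacy accounting.

```python
import numpy as np


def dp_gradient(per_sample_grads, clip_norm=1.0, noise_multiplier=1.1, rng=None):
    """Clip each per-sample gradient to clip_norm, sum, add Gaussian noise, then average."""
    if rng is None:
        rng = np.random.default_rng()
    norms = np.linalg.norm(per_sample_grads, axis=1, keepdims=True)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_sample_grads * scale
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=per_sample_grads.shape[1])
    noisy_mean = (clipped.sum(axis=0) + noise) / len(per_sample_grads)
    return noisy_mean, clipped


rng = np.random.default_rng(0)
X = rng.standard_normal((64, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
w = np.zeros(5)

# Per-sample gradient of the logistic loss with respect to w: (sigmoid(x.w) - y) * x
probs = 1.0 / (1.0 + np.exp(-(X @ w)))
per_sample_grads = (probs - y)[:, None] * X

noisy_grad, clipped = dp_gradient(per_sample_grads, clip_norm=1.0, noise_multiplier=1.1, rng=rng)
w = w - 0.1 * noisy_grad  # one noisy gradient step
max_norm = np.linalg.norm(clipped, axis=1).max()
print(f"Largest clipped gradient norm: {max_norm:.3f}")
```

Clipping bounds how much any single training record can move the weights, and the noise hides what remains of that influence. This is what makes membership inference and training data reconstruction harder.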
Adversarial Training Defense
Adversarial training augments the training set with adversarial examples, making the model robust against similar attacks.
# Reuses torch, nn, optim, and SimpleClassifier from the FGSM example above.
def adversarial_training(model, X_train, y_train, epsilon=0.1, epochs=5, batch_size=32):
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=0.01)
    model.train()
    for epoch in range(epochs):
        epoch_loss = 0.0
        for i in range(0, len(X_train), batch_size):
            batch_X = X_train[i:i + batch_size]
            batch_y = y_train[i:i + batch_size]

            # Craft FGSM examples for the whole batch against the current weights.
            x = batch_X.clone().detach().requires_grad_(True)
            loss = criterion(model(x), batch_y)
            model.zero_grad()
            loss.backward()
            batch_X_adv = torch.clamp(x + epsilon * x.grad.sign(), -1, 1).detach()

            # Train on the clean and adversarial samples together.
            combined_X = torch.cat([batch_X, batch_X_adv])
            combined_y = torch.cat([batch_y, batch_y])
            optimizer.zero_grad()
            output = model(combined_X)
            loss = criterion(output, combined_y)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        # epoch_loss is the sum of the per-batch losses, not an average.
        print(f"Epoch {epoch+1}, Loss: {epoch_loss:.4f}")
    print("Adversarial training complete")
    print(f"  Clean samples: {len(X_train)}, each paired with one FGSM copy per epoch")


model_robust = SimpleClassifier()
adversarial_training(
    model_robust,
    torch.empty(100, 784).uniform_(-1, 1),  # placeholder inputs in [-1, 1]
    torch.randint(0, 10, (100,))  # placeholder random labels
)
Expected output: five lines of the form "Epoch N, Loss: ...", followed by the two summary lines below. No seed is set, so the loss values differ between runs. Each epoch sums four per-batch losses (100 samples at batch size 32), so an untrained 10-class model starts at or a little above 4 × ln(10) ≈ 9.2. The labels are random, so the loss falls only slowly.
Adversarial training complete
  Clean samples: 100, each paired with one FGSM copy per epoch
Common Errors and Mistakes
| Mistake | Why It Happens | How to Fix |
|---|---|---|
| Assuming model robustness | Standard accuracy is not adversarial accuracy | Test with adversarial attacks before deployment |
| Weak perturbation budget | Epsilon too small misses real attacks | Test with multiple epsilon values |
| No input validation | Adversaries craft inputs freely | Add input bounds, statistical checks |
| Using single model | Evasion fools one model easily | Use ensemble with diverse architectures |
| Ignoring poisoning risk | External data sources may be compromised | Validate and sanitize all training data |
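The advice to test with multiple epsilon values can be sketched as a small sweep. To stay self-contained, this example assumes a NumPy logistic regression on synthetic two-class data rather than the tutorial's classifier. For that model the input gradient of the loss has the closed form (p - y) * w, so no autograd is needed. The epsilon grid is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.vstack([rng.normal(-1.0, 1.0, (n, 2)), rng.normal(1.0, 1.0, (n, 2))])
y = np.concatenate([np.zeros(n), np.ones(n)])


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# Fit a logistic regression with plain gradient descent.
w, b = np.zeros(2), 0.0
for _ in range(500):
    p = sigmoid(X @ w + b)
    w -= 0.1 * (X.T @ (p - y)) / len(y)
    b -= 0.1 * np.mean(p - y)


def fgsm(X, y, w, b, epsilon):
    """FGSM for logistic regression: the input gradient of the loss is (p - y) * w."""
    grad_x = (sigmoid(X @ w + b) - y)[:, None] * w
    return X + epsilon * np.sign(grad_x)


def accuracy(X, y, w, b):
    return float(np.mean((sigmoid(X @ w + b) > 0.5) == (y == 1)))


epsilons = [0.0, 0.1, 0.25, 0.5, 1.0]
results = {eps: accuracy(fgsm(X, y, w, b, eps), y, w, b) for eps in epsilons}
for eps, acc in results.items():
    print(f"epsilon={eps:<4} accuracy under attack={acc:.3f}")
```

Reporting accuracy across a range of budgets shows where robustness starts to collapse. A single small epsilon can make a fragile model look safe.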
Practice Questions
- What is an evasion attack?
Answer: An evasion attack crafts inputs that are slightly modified from legitimate samples to cause misclassification, while appearing unchanged to humans.
- How does adversarial training improve robustness?
Answer: By augmenting the training set with adversarially perturbed samples, the model learns decision boundaries that are more robust against similar perturbations.
- What is the difference between poisoning and evasion?
Answer: Poisoning corrupts training data (happens before training), while evasion modifies inference inputs (happens after deployment).
- How can model extraction be prevented?
Answer: Limit API query rates, add prediction noise, return only hard labels or rounded confidences instead of full probability vectors, and monitor for extraction-like query patterns.
- What is a backdoor attack?
Answer: An adversary inserts a trigger pattern into some training samples with a target label. The model learns to associate the trigger with that label, activating only when the trigger is present.
Challenge
Implement a complete ML security evaluation pipeline. Train a digit classifier, then attack it with three methods: FGSM evasion, label flipping poisoning, and model extraction via API simulation. For each attack, measure success rate and apply the corresponding defense. Document how much each defense reduces the attack's effectiveness.
Real-World Task
Design an ML security strategy for a facial recognition access control system. The system must resist physical adversarial attacks (printed glasses, stickers), evasion attacks (perturbed images uploaded by insiders), model extraction (someone training a copy from API access), and training data poisoning (if enrollment data is compromised). Propose specific defenses and monitoring for each threat.
Next Steps
Now that you understand ML security, explore Ethical AI for fairness and bias in ML systems, and ML Monitoring for detecting model degradation and drift in production.
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.