ML Model Deployment — Batch, Real-time, and Edge Strategies Explained
ML model deployment is the process of making a trained model available for inference in production, choosing between batch, real-time, and edge strategies based on latency, throughput, and infrastructure constraints. In this guide, you will learn how to deploy models using batch scoring jobs with pandas and MLflow, real-time REST APIs with FastAPI, and edge inference with ONNX Runtime on resource-constrained devices. The Doda Browser uses a hybrid deployment strategy: real-time models for page recommendations and edge-deployed models for on-device ad blocking.
Learning Path
flowchart LR
A[MLflow Model Registry] --> B["Model Deployment<br/>You are here"]
B --> C[Batch Inference]
B --> D[Real-Time API]
B --> E[Edge Deployment]
C --> F[Scheduled Pipelines]
D --> G[REST Endpoints]
E --> H[ONNX Runtime]
style B fill:#f90,color:#fff
Batch Inference
Batch inference processes large volumes of data at scheduled intervals. It is ideal for use cases where predictions do not need to be immediate — nightly fraud scoring, weekly churn predictions, or daily inventory forecasting. Batch jobs read data from a data warehouse, apply the model to every row, and write predictions back.
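import pandas as pd
import mlflow.pyfunc
from datetime import datetime, timezone
# Load the production model once from the MLflow Model Registry
model = mlflow.pyfunc.load_model("models:/FraudModel/Production")
# Read one day of transactions
data = pd.read_parquet("s3://data-bucket/transactions/2026-06-24.parquet")
print(f"Loaded {len(data)} records for batch scoring")
# Score every row; predictions are assumed to be 0/1 fraud labels
predictions = model.predict(data)
data["prediction"] = predictions
# Timezone-aware timestamp (datetime.utcnow() is deprecated since Python 3.12)
data["score_timestamp"] = datetime.now(timezone.utc)
# Write the scored rows back for downstream consumers
data.to_parquet("s3://predictions-batch/fraud_scores_2026-06-24.parquet")
print(f"Scored {len(data)} records with {predictions.sum()} fraud alerts")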
Expected output:
Loaded 125000 records for batch scoring
Scored 125000 records with 342 fraud alerts
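When a day's data is too large to score in one pass, the same read, predict, write loop can run over fixed-size chunks so memory stays bounded. The sketch below illustrates the idea in plain Python; `StubModel`, `iter_chunks`, and `score_in_chunks` are hypothetical names, and the stub only stands in for the MLflow model loaded above.

```python
from itertools import islice
from typing import Iterable, Iterator


class StubModel:
    """Hypothetical stand-in for a loaded model: flags amounts above 5000."""

    def predict(self, rows: list[dict]) -> list[int]:
        return [1 if row["amount"] > 5000 else 0 for row in rows]


def iter_chunks(rows: Iterable, size: int) -> Iterator[list]:
    """Yield successive lists of at most `size` rows."""
    iterator = iter(rows)
    while chunk := list(islice(iterator, size)):
        yield chunk


def score_in_chunks(model, rows: Iterable[dict], size: int = 1000) -> tuple[int, int]:
    """Score rows chunk by chunk; return (rows scored, fraud alerts)."""
    scored = alerts = 0
    for chunk in iter_chunks(rows, size):
        predictions = model.predict(chunk)
        scored += len(chunk)
        alerts += sum(predictions)
        # A real job would append this chunk's predictions to the output store here.
    return scored, alerts


# Synthetic data: amounts 0, 100, 200, ..., 9900 (a lazy generator, never fully in memory).
rows = ({"amount": 100.0 * i} for i in range(100))
scored, alerts = score_in_chunks(StubModel(), rows, size=32)
print(f"Scored {scored} records with {alerts} fraud alerts")
```

Only one chunk is held in memory at a time, so the chunk size becomes the knob that trades memory for per-call overhead.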
Real-Time API Serving
Real-time serving exposes the model as a REST endpoint that returns predictions within milliseconds. This is necessary for interactive applications like search ranking, loan approval, or chatbot responses. FastAPI provides automatic request validation, async support, and OpenAPI documentation.
from fastapi import FastAPI
from pydantic import BaseModel
import mlflow.pyfunc

app = FastAPI(title="Fraud Detection API")
# Load the model once at import time so every request reuses it
model = mlflow.pyfunc.load_model("models:/FraudModel/Production")

class Transaction(BaseModel):
    amount: float
    merchant_id: int
    hour_of_day: int
    distance_from_home: float

class Prediction(BaseModel):
    fraud_score: float
    is_fraud: bool
    model_version: str

@app.post("/predict", response_model=Prediction)
def predict(tx: Transaction):
    # Feature order must match the order used at training time
    features = [[tx.amount, tx.merchant_id, tx.hour_of_day, tx.distance_from_home]]
    score = model.predict(features)[0]
    return Prediction(
        fraud_score=float(score),
        is_fraud=bool(score > 0.5),
        model_version="v2.1.0"
    )

# Exercise the endpoint in-process with FastAPI's test client
from fastapi.testclient import TestClient
client = TestClient(app)
response = client.post("/predict", json={
    "amount": 8500.00, "merchant_id": 443,
    "hour_of_day": 3, "distance_from_home": 120.5
})
print(f"Status: {response.status_code}")
print(f"Response: {response.json()}")
Expected output:
Status: 200
Response: {'fraud_score': 0.89, 'is_fraud': True, 'model_version': 'v2.1.0'}
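Because a real-time endpoint is judged against a latency budget, it helps to look at percentiles rather than a single average. The following sketch times repeated calls to a prediction function; `fake_predict` and `latency_percentiles` are hypothetical names, and the numbers it prints depend entirely on the machine running it.

```python
import statistics
import time


def fake_predict(features: list[float]) -> float:
    """Hypothetical stand-in for model.predict: a cheap weighted sum."""
    return sum(0.1 * value for value in features)


def latency_percentiles(predict, features, runs: int = 200) -> dict[str, float]:
    """Time `runs` calls and return p50/p95/p99 latency in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(features)
        samples.append((time.perf_counter() - start) * 1000.0)
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(samples, n=100)
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


stats = latency_percentiles(fake_predict, [8500.0, 443.0, 3.0, 120.5])
print({name: round(value, 4) for name, value in stats.items()})
```

Tail percentiles such as p99 are usually what a latency target is written against, since a good median can hide slow outliers.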
Edge Deployment
Edge deployment runs models directly on devices — phones, IoT gateways, or browsers — eliminating network latency and enabling offline inference. ONNX Runtime and TensorFlow Lite convert trained models into lightweight formats optimized for CPU or NPU execution.
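import onnxruntime as ort
import numpy as np
# Load the exported model; input and output names are read from the graph
session = ort.InferenceSession("fraud_model.onnx")
input_name = session.get_inputs()[0].name
output_name = session.get_outputs()[0].name
# One transaction as a float32 row: amount, merchant_id, hour_of_day, distance_from_home
sample = np.array([[8500.0, 443, 3, 120.5]], dtype=np.float32)
result = session.run([output_name], {input_name: sample})
# Assumes the first output has shape (1, 1): one row, one score
score = result[0][0][0]
print(f"Edge inference result: {score:.4f}")
print(f"Classification: {'fraud' if score > 0.5 else 'legit'}")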
Expected output:
Edge inference result: 0.8871
Classification: fraud
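A converted model that runs on float32 inputs will not necessarily reproduce the original model's scores bit for bit, so a parity check with an explicit tolerance is a useful gate before shipping to devices. The sketch below compares two lists of scores; `check_parity` and the sample values are hypothetical, and the tolerance is an assumption to tune per model.

```python
import math


def check_parity(reference: list[float], candidate: list[float],
                 abs_tol: float = 1e-4) -> list[int]:
    """Return indices where the two score lists differ by more than abs_tol."""
    if len(reference) != len(candidate):
        raise ValueError("score lists must have the same length")
    return [
        i for i, (ref, cand) in enumerate(zip(reference, candidate))
        if not math.isclose(ref, cand, rel_tol=0.0, abs_tol=abs_tol)
    ]


# Hypothetical scores from the original model and from the converted one.
reference_scores = [0.887100, 0.120000, 0.503000]
edge_scores = [0.887102, 0.120001, 0.498000]

mismatches = check_parity(reference_scores, edge_scores)
print(f"Mismatched rows: {mismatches}")  # prints "Mismatched rows: [2]"
```

Row 2 also crosses the 0.5 decision threshold, which is the kind of drift this check is meant to catch before it changes a classification on the device.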
Deployment Strategy Comparison
| Strategy | Latency | Throughput | Infrastructure | Best For |
|---|---|---|---|---|
| Batch | Minutes to hours | Very high | Spark, Airflow, S3 | Offline scoring, analytics |
| Real-Time API | < 100 ms | 100 to 10,000 req/s | FastAPI, Kubernetes, GPU | Interactive apps, search |
| Edge | < 10 ms | Device-dependent | ONNX, TFLite, CoreML | Mobile, IoT, offline |
| Streaming | < 1 s | High | Kafka, Flink, Spark | Event-driven predictions |
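The latency column can be read as a simple decision rule: pick the least demanding strategy whose typical latency still fits the budget. The sketch below encodes that rule; `choose_strategy` is a hypothetical helper, the one-hour figure for batch is an assumption standing in for "minutes to hours", and each table value is treated conservatively as a worst case.

```python
# Typical worst-case latency per strategy in milliseconds, taken from the table.
# The 1-hour figure for batch is an illustrative assumption.
STRATEGY_LATENCY_MS = [
    ("batch", 3_600_000),
    ("streaming", 1_000),
    ("real-time", 100),
    ("edge", 10),
]


def choose_strategy(budget_ms: float, needs_offline: bool = False) -> str:
    """Return the loosest strategy whose typical latency fits the budget."""
    if needs_offline:
        return "edge"  # only on-device inference works without a network
    for name, typical_ms in STRATEGY_LATENCY_MS:
        if typical_ms <= budget_ms:
            return name
    return "edge"  # tightest option; budgets under 10 ms may still need tuning


for budget in (86_400_000, 5_000, 150, 20):
    print(f"{budget:>10} ms -> {choose_strategy(budget)}")
```

A real decision would also weigh throughput, cost, and model size, so this is a starting filter rather than a complete policy.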
MLOps Deployment Pipeline
flowchart TD
A[Trained Model] --> B[Model Registry]
B --> C{Deployment Type}
C -->|Batch| D[Spark Job]
C -->|Real-Time| E[FastAPI Service]
C -->|Edge| F[ONNX Conversion]
D --> G[Scheduled Airflow DAG]
E --> H[Kubernetes Service]
F --> I[Device Bundle]
G --> J[Data Warehouse]
H --> K[REST Clients]
I --> L[Mobile Devices]
J --> M[Predictions Table]
K --> N[Real-Time Responses]
Common Deployment Mistakes
| Mistake | Why It Happens | How to Fix |
|---|---|---|
| Wrong strategy | Batch used where real-time needed | Map latency requirements before choosing |
| No model versioning | Cannot identify which model served | Always tag deployments with model version |
| Missing preprocessing | Pipeline mismatch between training and serving | Wrap preprocessing in the model artifact |
| No health checks | Silent failures in production | Add /health and /ready endpoints |
| Ignoring cold starts | First request is slow | Pre-warm model in startup event |
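The last two rows of the table fit together: a service can report itself ready only after the model has been loaded and pre-warmed. The framework-agnostic sketch below shows one way to track that state; `ModelService` and `StubModel` are hypothetical names, and in a FastAPI app the `health` and `readiness` methods would back the `/health` and `/ready` routes while `startup` would run from a startup hook.

```python
class StubModel:
    """Hypothetical stand-in for a loaded model."""

    def predict(self, rows):
        return [0.5 for _ in rows]


class ModelService:
    """Hypothetical wrapper that tracks liveness and readiness of one model."""

    def __init__(self, loader):
        self._loader = loader  # callable returning an object with .predict()
        self._model = None
        self.ready = False

    def startup(self, warmup_input) -> None:
        """Load the model once and run a throwaway prediction to pre-warm it."""
        self._model = self._loader()
        self._model.predict(warmup_input)  # pays lazy-initialization cost up front
        self.ready = True

    def health(self) -> dict:
        """Liveness: the process is up, whether or not the model is loaded."""
        return {"status": "ok"}

    def readiness(self) -> tuple[int, dict]:
        """Readiness: 200 only after warm-up has finished, 503 before."""
        if self.ready:
            return 200, {"status": "ready"}
        return 503, {"status": "loading"}

    def predict(self, features):
        if not self.ready:
            raise RuntimeError("model is not ready")
        return self._model.predict(features)


service = ModelService(loader=StubModel)
print(service.readiness())  # (503, {'status': 'loading'})
service.startup(warmup_input=[[0.0, 0, 0, 0.0]])
print(service.readiness())  # (200, {'status': 'ready'})
```

Keeping liveness and readiness separate lets an orchestrator hold traffic back during warm-up without restarting a process that is still loading.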
Practice Questions
- When should you choose batch inference over real-time serving?
Answer: Batch inference is appropriate when predictions do not need to be immediate (e.g., nightly reports, daily scoring). It handles large volumes efficiently but introduces latency from minutes to hours. Real-time serving is required for interactive applications where users expect sub-second responses.
- How does edge deployment differ from cloud deployment?
Answer: Edge deployment runs the model directly on the device, eliminating network latency and enabling offline operation. It requires model optimization (quantization, pruning) to fit constrained hardware. Cloud deployment centralizes computation but adds network overhead and requires constant connectivity.
- What is the role of ONNX in model deployment?
Answer: ONNX (Open Neural Network Exchange) provides a standardized format for representing models across frameworks. It enables converting models from PyTorch, TensorFlow, or Scikit-Learn into a portable format that runs efficiently on diverse hardware through ONNX Runtime.
- Why should preprocessing logic be part of the deployed artifact?
Answer: If preprocessing is separate, the serving pipeline can diverge from training preprocessing, leading to silent prediction degradation. Bundling preprocessing with the model (e.g., in an MLflow pyfunc wrapper) guarantees consistency across environments.
- How do you handle model versioning in production?
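Answer: Use a model registry (MLflow Model Registry) to track versions and promote models through stages (Staging, Production, Archived). Recent MLflow releases deprecate stages in favor of aliases, referenced with a URI such as models:/FraudModel@champion, where "champion" is an example alias. Tag every deployment with the exact model version and log inference metadata for auditability.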
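Logging inference metadata, as the last answer suggests, can be as simple as emitting one structured record per prediction. The sketch below builds such a record; `build_inference_record` and its field names are hypothetical, and hashing the inputs instead of storing them is a design assumption rather than a requirement.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_inference_record(model_name: str, model_version: str,
                           features: dict, score: float) -> dict:
    """Assemble one audit record tying a prediction to the model version."""
    # Hash the canonical JSON of the inputs so the record is traceable
    # without storing raw feature values.
    payload = json.dumps(features, sort_keys=True).encode("utf-8")
    return {
        "model_name": model_name,
        "model_version": model_version,
        "input_sha256": hashlib.sha256(payload).hexdigest(),
        "score": score,
        "scored_at": datetime.now(timezone.utc).isoformat(),
    }


record = build_inference_record(
    "FraudModel", "v2.1.0",
    {"amount": 8500.0, "merchant_id": 443, "hour_of_day": 3, "distance_from_home": 120.5},
    0.89,
)
print(json.dumps(record, indent=2))
```

With the version in every record, a disputed prediction can be traced back to the exact model that produced it.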
Challenge
Build a deployment pipeline that serves the same model via batch, real-time, and edge. Use MLflow to register a classification model. Create a FastAPI endpoint for real-time serving. Write a Spark batch job that scores 100K records. Convert the model to ONNX and verify edge inference produces identical results.
Real-World Task
Design a deployment strategy for a real-time recommendation system in a news aggregator like Doda Browser that must serve 50M daily users with < 50ms latency. Compare batch updates (hourly model refresh), real-time inference (per-user predictions), and edge caching (pre-computed top-10 on device). Document the trade-offs in cost, complexity, and user experience.
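A back-of-the-envelope load estimate is a reasonable first step for this task. The arithmetic below is only a sketch: the user count and latency target come from the task statement, while the requests per user and the peak factor are assumptions chosen for illustration.

```python
DAILY_USERS = 50_000_000   # from the task statement
LATENCY_BUDGET_S = 0.050   # 50 ms, from the task statement
REQUESTS_PER_USER = 20     # assumption: recommendation calls per user per day
PEAK_FACTOR = 3.0          # assumption: peak traffic relative to the daily average

SECONDS_PER_DAY = 24 * 60 * 60

average_rps = DAILY_USERS * REQUESTS_PER_USER / SECONDS_PER_DAY
peak_rps = average_rps * PEAK_FACTOR
# Little's law: requests in flight = arrival rate x time spent in the system.
in_flight = peak_rps * LATENCY_BUDGET_S

print(f"Average load: {average_rps:,.0f} req/s")         # 11,574 req/s
print(f"Peak load: {peak_rps:,.0f} req/s")               # 34,722 req/s
print(f"In flight at peak: {in_flight:,.0f} requests")   # 1,736 requests
```

Under these assumptions the peak load exceeds the upper end of the real-time throughput range in the comparison table, which is one argument for weighing pre-computed or on-device results in the design.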
Common Errors
Model Not Found at Path
If the model URI is incorrect, loading fails. Verify that the URI matches the registered model name and stage. Use mlflow.pyfunc.load_model("models:/MyModel/Production"), not the artifact path.
Preprocessing Mismatch
When features differ between training and serving, predictions are silently wrong. Always include the preprocessing pipeline as part of the serialized model object.
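One way to picture this is a single artifact that carries both the scaling statistics and the model parameters, so serving never recomputes them. The sketch below uses a toy linear model and JSON purely for illustration; `fit_bundle` and `predict_from_bundle` are hypothetical names, and in MLflow this role would be played by a pyfunc wrapper.

```python
import json


def fit_bundle(amounts: list[float], weight: float, bias: float) -> str:
    """Serialize scaling statistics and model parameters into one artifact."""
    mean = sum(amounts) / len(amounts)
    variance = sum((a - mean) ** 2 for a in amounts) / len(amounts)
    return json.dumps({
        "preprocessing": {"mean": mean, "std": variance ** 0.5},
        "model": {"weight": weight, "bias": bias},
    })


def predict_from_bundle(artifact: str, amount: float) -> float:
    """Apply the stored preprocessing, then the stored model, to one value."""
    bundle = json.loads(artifact)
    prep, model = bundle["preprocessing"], bundle["model"]
    scaled = (amount - prep["mean"]) / prep["std"]
    return model["weight"] * scaled + model["bias"]


# Training side: statistics come from the training data.
artifact = fit_bundle([100.0, 200.0, 300.0], weight=0.5, bias=0.1)

# Serving side: only the artifact is needed, so the scaling cannot drift.
print(round(predict_from_bundle(artifact, 200.0), 4))  # 0.1
```

If the serving code computed its own mean and standard deviation from live traffic, the same input would produce a different score, with no error raised.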
Memory Leaks in Long-Running Services
Loading a new model version on every request exhausts memory. Load the model once at startup and reuse it across requests.
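A small cache around the loader is one way to enforce the load-once rule even when request handlers ask for the model by URI. The sketch below uses `functools.lru_cache`; `StubModel`, `get_model`, and `handle_request` are hypothetical names, and the stub stands in for an expensive `mlflow.pyfunc.load_model` call.

```python
from functools import lru_cache


class StubModel:
    """Hypothetical stand-in for a large model object."""

    def __init__(self, uri: str):
        self.uri = uri

    def predict(self, rows):
        return [0.0 for _ in rows]


@lru_cache(maxsize=2)
def get_model(uri: str) -> StubModel:
    """Load a model at most once per URI; later calls reuse the cached object."""
    return StubModel(uri)


def handle_request(rows):
    # The anti-pattern would be constructing the model here on every call.
    return get_model("models:/FraudModel/Production").predict(rows)


for _ in range(1000):
    handle_request([[8500.0, 443, 3, 120.5]])

info = get_model.cache_info()
print(f"Loads performed: {info.misses}, cache hits: {info.hits}")
```

The bounded `maxsize` matters as much as the caching itself: it caps how many model versions can be resident at once when new versions are rolled out.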
Next Steps
Deepen your deployment knowledge with Docker for containerization and Kubernetes for orchestration. Explore Apache Airflow for batch pipeline scheduling and MLOps for end-to-end lifecycle management.
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.