ML Model Deployment — Batch, Real-time, and Edge Strategies Explained

DodaTech Updated 2026-06-24 6 min read

ML model deployment is the process of making a trained model available for inference in production. The main decision is whether to use a batch, real-time, or edge strategy, based on latency, throughput, and infrastructure constraints. In this guide, you will learn how to deploy models using batch scoring jobs, real-time REST APIs with FastAPI, and edge inference with ONNX Runtime on resource-constrained devices. The Doda Browser uses a hybrid deployment strategy: real-time models for page recommendations and edge-deployed models for on-device ad blocking.

Learning Path

flowchart LR
  A[MLflow Model Registry] --> B["Model Deployment<br/>You are here"]
  B --> C[Batch Inference]
  B --> D[Real-Time API]
  B --> E[Edge Deployment]
  C --> F[Scheduled Pipelines]
  D --> G[REST Endpoints]
  E --> H[ONNX Runtime]
  style B fill:#f90,color:#fff

Batch Inference

Batch inference processes large volumes of data at scheduled intervals. It is ideal for use cases where predictions do not need to be immediate — nightly fraud scoring, weekly churn predictions, or daily inventory forecasting. Batch jobs read data from a data warehouse, apply the model to every row, and write predictions back.

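import pandas as pd
import mlflow.pyfunc
from datetime import datetime, timezone

# Load the registered model once for the whole job
model = mlflow.pyfunc.load_model("models:/FraudModel/Production")

# Read one day of transactions (reading s3:// paths with pandas requires s3fs)
data = pd.read_parquet("s3://data-bucket/transactions/2026-06-24.parquet")
print(f"Loaded {len(data)} records for batch scoring")

# Score before adding output columns so the model sees only input features
predictions = model.predict(data)
data["prediction"] = predictions
# datetime.utcnow() is deprecated since Python 3.12; use a timezone-aware timestamp
data["score_timestamp"] = datetime.now(timezone.utc)

data.to_parquet("s3://predictions-batch/fraud_scores_2026-06-24.parquet")
# sum() counts alerts only if the model returns 0/1 labels rather than probabilities
print(f"Scored {len(data)} records with {int(predictions.sum())} fraud alerts")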

Expected output:

Loaded 125000 records for batch scoring
Scored 125000 records with 342 fraud alerts
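When a day's file is too large to score in one call, the same read, score, and count loop can be applied chunk by chunk. The sketch below is a minimal illustration using only the standard library; `toy_predict` is a hypothetical stand-in for `model.predict`, and the chunk size is arbitrary.

```python
from itertools import islice


def chunked(rows, size):
    """Yield successive lists of at most `size` rows from any iterable."""
    it = iter(rows)
    while chunk := list(islice(it, size)):
        yield chunk


def score_in_chunks(rows, predict_batch, chunk_size=10_000):
    """Score rows chunk by chunk; return (records scored, alerts raised)."""
    scored = alerts = 0
    for chunk in chunked(rows, chunk_size):
        labels = predict_batch(chunk)  # expected to return one 0/1 label per row
        scored += len(chunk)
        alerts += sum(labels)
    return scored, alerts


# Hypothetical stand-in for model.predict: flags amounts above 5000.
def toy_predict(chunk):
    return [1 if row["amount"] > 5000 else 0 for row in chunk]


# A generator keeps only one chunk in memory at a time.
rows = ({"amount": float(i)} for i in range(0, 10_000, 100))
print(score_in_chunks(rows, toy_predict, chunk_size=32))  # (100, 49)
```

Because `chunked` accepts any iterable, the input could equally be a streaming file reader, so memory use is bounded by the chunk size rather than the file size.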

Real-Time API Serving

Real-time serving exposes the model as a REST endpoint that returns predictions within milliseconds. This is necessary for interactive applications like search ranking, loan approval, or chatbot responses. FastAPI provides automatic request validation, async support, and OpenAPI documentation.

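from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import mlflow.pyfunc
import numpy as np

app = FastAPI(title="Fraud Detection API")
# Load the model once at import time, not inside the request handler
model = mlflow.pyfunc.load_model("models:/FraudModel/Production")
MODEL_VERSION = "v2.1.0"  # keep in sync with the version registered in MLflow

class Transaction(BaseModel):
    amount: float
    merchant_id: int
    hour_of_day: int
    distance_from_home: float

# Note: some Pydantic v2 releases warn about field names starting with "model_"
class Prediction(BaseModel):
    fraud_score: float
    is_fraud: bool
    model_version: str

@app.post("/predict", response_model=Prediction)
def predict(tx: Transaction):
    # Feature order must match the order used at training time. If the model was
    # logged with a column-based signature, pass a pandas DataFrame instead.
    features = np.array([[tx.amount, tx.merchant_id, tx.hour_of_day, tx.distance_from_home]])
    try:
        score = float(model.predict(features)[0])
    except Exception as exc:
        # Return an explicit 500 instead of leaking an unhandled model error
        raise HTTPException(status_code=500, detail="Prediction failed") from exc
    return Prediction(
        fraud_score=score,
        is_fraud=score > 0.5,
        model_version=MODEL_VERSION
    )

# Quick in-process check (TestClient requires the httpx package)
from fastapi.testclient import TestClient
client = TestClient(app)
response = client.post("/predict", json={
    "amount": 8500.00, "merchant_id": 443,
    "hour_of_day": 3, "distance_from_home": 120.5
})
print(f"Status: {response.status_code}")
print(f"Response: {response.json()}")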

Expected output:

Status: 200
Response: {'fraud_score': 0.89, 'is_fraud': True, 'model_version': 'v2.1.0'}
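A "within milliseconds" claim is worth measuring rather than assuming. The sketch below times repeated calls and reports median and tail latency; `toy_predict` is a hypothetical stand-in for the model call, and the numbers cover in-process time only, not network or serialization overhead.

```python
import statistics
import time


def measure_latency_ms(predict, payload, n_requests=200):
    """Call predict(payload) repeatedly and return p50/p99/max latency in ms."""
    samples = []
    for _ in range(n_requests):
        start = time.perf_counter()
        predict(payload)
        samples.append((time.perf_counter() - start) * 1000.0)
    # quantiles(n=100) returns the 99 cut points p1..p99.
    cuts = statistics.quantiles(samples, n=100)
    return {"p50": cuts[49], "p99": cuts[98], "max": max(samples)}


# Hypothetical stand-in for the model call inside the endpoint.
def toy_predict(features):
    return sum(features) / len(features)


stats = measure_latency_ms(toy_predict, [8500.0, 443, 3, 120.5])
print({name: round(value, 4) for name, value in stats.items()})
```

Tail latency (p99) is usually the more useful number to compare against a latency budget, since averages hide slow requests.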

Edge Deployment

Edge deployment runs models directly on devices — phones, IoT gateways, or browsers — eliminating network latency and enabling offline inference. ONNX Runtime and TensorFlow Lite convert trained models into lightweight formats optimized for CPU or NPU execution.

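import onnxruntime as ort
import numpy as np

# The CPU provider is always available; swap in a device-specific provider if present
session = ort.InferenceSession("fraud_model.onnx", providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name
output_name = session.get_outputs()[0].name

# Input shape is (batch, features); dtype must match the exported model (float32 here)
sample = np.array([[8500.0, 443, 3, 120.5]], dtype=np.float32)
result = session.run([output_name], {input_name: sample})

# Assumes the first output holds a single score. Classifiers converted from
# scikit-learn usually return a label first and probabilities second.
score = float(np.asarray(result[0]).ravel()[0])
print(f"Edge inference result: {score:.4f}")
print(f"Classification: {'fraud' if score > 0.5 else 'legit'}")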

Expected output:

Edge inference result: 0.8871
Classification: fraud
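A converted model rarely matches the original bit for bit, because conversion often changes numeric precision. A parity check against reference scores is one way to catch drift before shipping. The sketch below is illustrative: the scores are made-up values, and the tolerance and threshold are assumptions to tune per model.

```python
def parity_report(reference, candidate, abs_tol=1e-4, threshold=0.5):
    """Compare reference scores with scores from the converted model."""
    if len(reference) != len(candidate):
        raise ValueError("score lists must have the same length")
    pairs = list(zip(reference, candidate))
    max_diff = max((abs(r - c) for r, c in pairs), default=0.0)
    # A label flip is a case where the two scores fall on opposite sides of the threshold.
    label_flips = sum((r > threshold) != (c > threshold) for r, c in pairs)
    return {
        "max_abs_diff": max_diff,
        "label_flips": label_flips,
        "within_tolerance": max_diff <= abs_tol,
    }


# Made-up scores: the third pair straddles the 0.5 threshold.
reference_scores = [0.8871, 0.1203, 0.5001]
edge_scores = [0.88712, 0.12031, 0.49995]
report = parity_report(reference_scores, edge_scores)
print(report)
```

In this example the largest difference is only about 0.00015, yet one classification flips, which is why checking labels as well as raw scores is useful.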

Deployment Strategy Comparison

Strategy	Latency	Throughput	Infrastructure	Best For
Batch	Minutes to hours	Very high	Spark, Airflow, S3	Offline scoring, analytics
Real-Time API	< 100ms	100-10000 req/s	FastAPI, Kubernetes, GPU	Interactive apps, search
Edge	< 10ms	Device-dependent	ONNX, TFLite, CoreML	Mobile, IoT, offline
Streaming	< 1s	High	Kafka, Flink, Spark	Event-driven predictions
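The table can be read as a simple decision rule. The function below is a rough sketch of that rule; the thresholds are taken from the latency column as rules of thumb, and the parameter names are hypothetical.

```python
def choose_strategy(max_latency_ms, needs_offline=False, event_driven=False):
    """Map requirements onto the strategies in the comparison table."""
    # Offline operation, or a budget tighter than a network round trip, points to edge.
    if needs_offline or max_latency_ms < 10:
        return "edge"
    # Sub-second request/response budgets call for a real-time API.
    if max_latency_ms < 1_000:
        return "real-time"
    # Looser budgets on continuously arriving events suit streaming.
    if event_driven:
        return "streaming"
    # Everything else can wait for a scheduled job.
    return "batch"


print(choose_strategy(50))                         # real-time
print(choose_strategy(5_000, event_driven=True))   # streaming
print(choose_strategy(3_600_000))                  # batch
print(choose_strategy(3_600_000, needs_offline=True))  # edge
```

Real decisions also weigh cost, throughput, and team experience, so treat the output as a starting point rather than an answer.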

MLOps Deployment Pipeline

flowchart TD
  A[Trained Model] --> B[Model Registry]
  B --> C{Deployment Type}
  C -->|Batch| D[Spark Job]
  C -->|Real-Time| E[FastAPI Service]
  C -->|Edge| F[ONNX Conversion]
  D --> G[Scheduled Airflow DAG]
  E --> H[Kubernetes Service]
  F --> I[Device Bundle]
  G --> J[Data Warehouse]
  H --> K[REST Clients]
  I --> L[Mobile Devices]
  J --> M[Predictions Table]
  K --> N[Real-Time Responses]

Common Deployment Mistakes

Mistake	Why It Happens	How to Fix
Wrong strategy	Batch used where real-time needed	Map latency requirements before choosing
No model versioning	Cannot identify which model served	Always tag deployments with model version
Missing preprocessing	Pipeline mismatch between training and serving	Wrap preprocessing in the model artifact
No health checks	Silent failures in production	Add /health and /ready endpoints
Ignoring cold starts	First request is slow	Pre-warm model in startup event
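The last two rows can be illustrated together: a service that distinguishes "process is alive" from "model is loaded and warmed up". The sketch below is framework-agnostic; `ModelService` and `toy_loader` are hypothetical names, and in a FastAPI app `startup()` would typically run in a startup handler with `health()` and `ready()` exposed as /health and /ready.

```python
class ModelService:
    """Tracks liveness and readiness separately, as /health and /ready would."""

    def __init__(self, loader, warmup_input):
        self._loader = loader
        self._warmup_input = warmup_input
        self._model = None
        self._ready = False

    def startup(self):
        """Load the model once and run one throwaway prediction to absorb the cold start."""
        self._model = self._loader()
        self._model(self._warmup_input)
        self._ready = True

    def health(self):
        # Liveness: the process is up, even if the model is not loaded yet.
        return {"status": "ok"}

    def ready(self):
        # Readiness: report ready only after warm-up has succeeded.
        return {"ready": self._ready}

    def predict(self, features):
        if not self._ready:
            raise RuntimeError("model not ready")
        return self._model(features)


# Hypothetical loader returning a toy scoring function.
def toy_loader():
    return lambda features: min(1.0, features[0] / 10_000)


service = ModelService(toy_loader, warmup_input=[0.0, 0, 0, 0.0])
print(service.health(), service.ready())
service.startup()
print(service.ready(), service.predict([8500.0, 443, 3, 120.5]))
```

Keeping the two checks separate lets an orchestrator hold traffic back until warm-up finishes without restarting a process that is merely still loading.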

Practice Questions

When should you choose batch inference over real-time serving?

Answer: Batch inference is appropriate when predictions do not need to be immediate (e.g., nightly reports, daily scoring). It handles large volumes efficiently but introduces latency from minutes to hours. Real-time serving is required for interactive applications where users expect sub-second responses.

How does edge deployment differ from cloud deployment?

Answer: Edge deployment runs the model directly on the device, eliminating network latency and enabling offline operation. It requires model optimization (quantization, pruning) to fit constrained hardware. Cloud deployment centralizes computation but adds network overhead and requires constant connectivity.

What is the role of ONNX in model deployment?

Answer: ONNX (Open Neural Network Exchange) provides a standardized format for representing models across frameworks. It enables converting models from PyTorch, TensorFlow, or Scikit-Learn into a portable format that runs efficiently on diverse hardware through ONNX Runtime.

Why should preprocessing logic be part of the deployed artifact?

Answer: If preprocessing is separate, the serving pipeline can diverge from training preprocessing, leading to silent prediction degradation. Bundling preprocessing with the model (e.g., in an MLflow pyfunc wrapper) guarantees consistency across environments.

How do you handle model versioning in production?

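Answer: Use a model registry (MLflow Model Registry) to track versions and promote models through stages (Staging, Production, Archived). Recent MLflow releases deprecate stages in favor of model version aliases and tags, so check which mechanism your MLflow version supports. Tag every deployment with the exact model version and log inference metadata for auditability.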
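Logging inference metadata can be as simple as writing one structured record per prediction. The sketch below is a minimal illustration: the record fields and the `request_id` value are hypothetical, and a plain list stands in for a real log sink.

```python
import json
import time
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class InferenceRecord:
    model_name: str
    model_version: str
    request_id: str
    score: float
    scored_at: float  # Unix timestamp in seconds


def log_inference(model_name, model_version, request_id, score, sink):
    """Append one JSON line describing which model version produced which score."""
    record = InferenceRecord(model_name, model_version, request_id, float(score), time.time())
    sink.append(json.dumps(asdict(record), sort_keys=True))
    return record


audit_log = []  # stand-in for a file, queue, or logging service
log_inference("FraudModel", "v2.1.0", "req-001", 0.89, audit_log)
print(audit_log[0])
```

With the version stored next to every score, a disputed prediction can be traced back to the exact model that produced it.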

Challenge

Build a deployment pipeline that serves the same model via batch, real-time, and edge. Use MLflow to register a classification model. Create a FastAPI endpoint for real-time serving. Write a Spark batch job that scores 100K records. Convert the model to ONNX and verify that edge inference matches the original model's predictions within a small numerical tolerance.

Real-World Task

Design a deployment strategy for a real-time recommendation system in a news aggregator like Doda Browser that must serve 50M daily users with < 50ms latency. Compare batch updates (hourly model refresh), real-time inference (per-user predictions), and edge caching (pre-computed top-10 on device). Document the trade-offs in cost, complexity, and user experience.
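A back-of-envelope capacity estimate is a reasonable first step for this task. In the sketch below only the 50M daily users figure comes from the task; the requests per user, peak factor, and per-replica throughput are assumptions chosen purely for illustration.

```python
import math


def capacity_estimate(daily_users, requests_per_user, peak_factor, replica_rps):
    """Estimate average and peak request rates and the replicas needed at peak."""
    avg_rps = daily_users * requests_per_user / 86_400  # seconds per day
    peak_rps = avg_rps * peak_factor
    return {
        "avg_rps": round(avg_rps),
        "peak_rps": round(peak_rps),
        "replicas": math.ceil(peak_rps / replica_rps),
    }


# Assumed values: 20 requests per user per day, 3x peak, 500 req/s per replica.
estimate = capacity_estimate(50_000_000, requests_per_user=20, peak_factor=3, replica_rps=500)
print(estimate)  # {'avg_rps': 11574, 'peak_rps': 34722, 'replicas': 70}
```

Rerunning the estimate with pre-computed recommendations cached on the device (fewer requests per user) shows how edge caching changes the serving cost.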

Common Errors

Model Not Found at Path

If the model registry URI is incorrect, loading fails. Verify that the URI matches the registered model name and stage. Load from the registry with a URI such as mlflow.pyfunc.load_model("models:/MyModel/Production") rather than pointing at the raw artifact path.

Preprocessing Mismatch

When features differ between training and serving, predictions are silently wrong. Always include the preprocessing pipeline as part of the serialized model object.
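One way to picture bundling is a single object that carries both the training-time statistics and the scoring logic, so callers always pass raw features. The sketch below is a toy stand-in for a real wrapper such as an MLflow pyfunc model; the class name, the linear scoring, and all numbers are hypothetical.

```python
import json


class ScaledModel:
    """Bundles training-time scaling statistics with the scoring function."""

    def __init__(self, means, stds, weights, bias):
        self.means, self.stds = means, stds
        self.weights, self.bias = weights, bias

    def _preprocess(self, row):
        # Standardize with the statistics captured at training time.
        return [(x - m) / s for x, m, s in zip(row, self.means, self.stds)]

    def predict(self, rows):
        """Accept raw feature rows; preprocessing happens inside the artifact."""
        scores = []
        for row in rows:
            z = self._preprocess(row)
            scores.append(sum(w * v for w, v in zip(self.weights, z)) + self.bias)
        return scores

    def dumps(self):
        """Serialize parameters and preprocessing statistics together."""
        return json.dumps({"means": self.means, "stds": self.stds,
                           "weights": self.weights, "bias": self.bias})

    @classmethod
    def loads(cls, payload):
        return cls(**json.loads(payload))


model = ScaledModel(means=[100.0, 12.0], stds=[50.0, 6.0], weights=[0.5, -0.25], bias=0.1)
restored = ScaledModel.loads(model.dumps())  # what the serving side would load
raw_rows = [[200.0, 6.0]]
print(model.predict(raw_rows), restored.predict(raw_rows))
```

Because the serving side never re-implements the scaling step, there is no second copy of the preprocessing logic to drift out of sync.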

Memory Leaks in Long-Running Services

Loading the model on every request adds latency and can exhaust memory as copies accumulate. Load the model once at startup and reuse it across requests.
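A cached loader is one simple way to guarantee a single load per process. The sketch below uses `functools.lru_cache`; the lambda is a hypothetical stand-in for a call such as `mlflow.pyfunc.load_model(...)`.

```python
from functools import lru_cache


@lru_cache(maxsize=1)
def get_model():
    """Load the model on first use and return the cached instance afterwards."""
    # Hypothetical stand-in for an expensive model load.
    return lambda features: min(1.0, features[0] / 10_000)


def handle_request(features):
    # Every request reuses the same cached model object.
    return get_model()(features)


for _ in range(1000):
    handle_request([8500.0, 443, 3, 120.5])

info = get_model.cache_info()
print(f"requests served: 1000, model loads: {info.misses}, cache hits: {info.hits}")
```

The cache statistics confirm the behavior: one miss for the initial load and a hit for every later request.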

FAQ

What is the difference between batch and real-time deployment?

Batch deployment processes records in scheduled jobs (minutes to hours latency). Real-time deployment serves individual predictions on demand (milliseconds latency). The choice depends on whether the use case requires immediate responses.

Can you deploy the same model with multiple strategies?

Yes. A fraud model might run real-time for card transactions, batch for daily account scoring, and edge for offline mobile transactions. Each strategy starts from the same trained model (converted to a portable format such as ONNX for edge) but runs on different infrastructure.

What is shadow deployment?

Shadow deployment runs a new model alongside the current production model without serving its predictions to users. Both models receive the same requests, and their outputs are logged and compared to validate the new model before traffic is cut over.
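The core of a shadow setup fits in one function: always return the production score, and record the shadow score only for comparison. The sketch below is illustrative; both toy models and the disagreement log are hypothetical stand-ins.

```python
def shadow_predict(features, production_model, shadow_model, disagreements, threshold=0.5):
    """Serve the production score; use the shadow score only for comparison."""
    prod_score = production_model(features)
    try:
        shadow_score = shadow_model(features)
    except Exception as exc:
        # A failing shadow model must never affect the user-facing response.
        disagreements.append({"features": features, "error": repr(exc)})
        return prod_score
    if (prod_score > threshold) != (shadow_score > threshold):
        disagreements.append(
            {"features": features, "production": prod_score, "shadow": shadow_score}
        )
    return prod_score


# Hypothetical toy models: the shadow model scores more conservatively.
def production_model(features):
    return min(1.0, features[0] / 10_000)


def shadow_model(features):
    return min(1.0, features[0] / 20_000)


log = []
served = [
    shadow_predict([amount], production_model, shadow_model, log)
    for amount in (1000.0, 6000.0, 12000.0)
]
print(served, len(log))  # [0.1, 0.6, 1.0] 1
```

In a latency-sensitive service the shadow call would usually run asynchronously or on mirrored traffic, so that it cannot slow down the production response.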

Next Steps

Deepen your deployment knowledge with Docker for Containerization and Kubernetes for orchestration. Explore Apache Airflow for batch pipeline scheduling and MLOps for end-to-end lifecycle management.

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
