Designing AI API Endpoints — Best Practices for LLM-Powered Services

DodaTech Updated 2026-06-22 6 min read

In this tutorial, you'll learn about Designing AI API Endpoints. We cover key concepts, practical examples, and best practices to help you understand and apply this topic effectively.

Designing AI API endpoints requires different patterns than traditional REST APIs — streaming responses, prompt validation, context management, and cost-aware Rate Limiting are unique to LLM-powered services.

What You'll Learn

You'll learn patterns for building AI API endpoints including streaming responses, request caching, Prompt Injection detection, structured JSON output, and usage-based Rate Limiting with FastAPI.

Why It Matters

Standard REST patterns break under AI workloads. LLM calls are slow, expensive, and nondeterministic. Proper Api Design reduces latency by 60%, cuts costs by half, and prevents abuse through Prompt Injection and excessive usage.

Real-World Use

Doda Browser's AI features — smart search, page summarization, and code completion — are all served through a unified AI API Gateway that handles streaming, caching, and Rate Limiting across multiple LLM providers.

AI API Architecture

flowchart LR
    A[Client] --> B[API Gateway]
    B --> C["Auth / Rate Limit"]
    C --> D[Prompt Guard]
    D --> E[Cache Check]
    E --> F[LLM Provider]
    F --> G[Streaming Response]
    E --> H[Cache Store]
    G --> A

Streaming Responses

LLM responses take seconds to generate. Streaming returns tokens as they arrive, improving perceived latency.

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from openai import OpenAI
from fastapi.responses import StreamingResponse
import json

app = FastAPI(title="AI API Gateway")
client = OpenAI()

class ChatRequest(BaseModel):
    model: str = "gpt-4o-mini"
    messages: list[dict]
    stream: bool = True
    max_tokens: int = 1024

def stream_generator(response):
    for chunk in response:
        if chunk.choices[0].delta.content:
            yield f"data: {json.dumps({
                'content': chunk.choices[0].delta.content
            })}\n\n"
    yield "data: [DONE]\n\n"

@app.post("/v1/chat/completions")
async def chat_completion(request: ChatRequest):
    try:
        response = client.chat.completions.create(
            model=request.model,
            messages=request.messages,
            stream=request.stream,
            max_tokens=request.max_tokens,
        )
        return StreamingResponse(
            stream_generator(response),
            media_type="text/event-stream"
        )
    except Exception as e:
        raise HTTPException(status_code=502, detail=str(e))

# Test with curl
print("Endpoint: POST /v1/chat/completions")
print("Streams SSE-formatted tokens as they arrive")

Expected output:

Endpoint: POST /v1/chat/completions
Streams SSE-formatted tokens as they arrive

Structured Output with Pydantic

Force LLMs to return valid JSON with a predefined schema.

from pydantic import BaseModel, Field
from typing import List, Optional
from openai import OpenAI

client = OpenAI()

class AnalysisResult(BaseModel):
    sentiment: str = Field(
        description="One of: positive, negative, neutral"
    )
    confidence: float = Field(
        ge=0.0, le=1.0, description="Confidence score"
    )
    key_points: List[str] = Field(
        max_length=5, description="Up to 5 key points"
    )
    summary: str = Field(
        max_length=200, description="One-sentence summary"
    )

def analyze_text(text: str) -> AnalysisResult:
    response = client.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06",
        messages=[
            {"role": "system", "content": "Analyze the text and return structured output."},
            {"role": "user", "content": text}
        ],
        response_format=AnalysisResult,
    )
    return response.choices[0].message.parsed

result = analyze_text(
    "The new update is fantastic! The speed improvements are remarkable, "
    "though the UI changes will take some getting used to."
)
print(f"Sentiment: {result.sentiment}")
print(f"Confidence: {result.confidence:.2f}")
print(f"Key Points: {result.key_points}")
print(f"Summary: {result.summary}")

Expected output:

Sentiment: positive
Confidence: 0.92
Key Points: ['Speed improvements are remarkable', 'UI changes may need adjustment']
Summary: Users are very positive about performance gains but have mixed feelings about the interface changes.

Request Caching

Cache identical requests to reduce costs and latency for repeated queries.

import hashlib
import json
import redis.asyncio as redis
from fastapi import Depends

redis_client = redis.Redis(host="localhost", port=6379, db=0)
CACHE_TTL = 3600  # 1 hour

def make_cache_key(request: ChatRequest) -> str:
    raw = json.dumps(request.model_dump(), sort_keys=True)
    return f"ai:cache:{hashlib.sha256(raw.encode()).hexdigest()}"

@app.post("/v1/chat/completions/cached")
async def cached_chat(request: ChatRequest):
    cache_key = make_cache_key(request)

    # Check cache
    cached = await redis_client.get(cache_key)
    if cached:
        return json.loads(cached)

    # No cache hit — call LLM
    response = client.chat.completions.create(
        model=request.model,
        messages=request.messages,
        max_tokens=request.max_tokens,
    )
    result = {
        "content": response.choices[0].message.content,
        "model": request.model,
        "cached": False
    }

    # Store in cache
    await redis_client.setex(
        cache_key, CACHE_TTL, json.dumps(result)
    )
    return result

# Test cache behavior
print("First call: hits LLM, caches result")
print("Second call with same input: returns cached result")
print(f"Cache TTL: {CACHE_TTL}s")

Expected output:

First call: hits LLM, caches result
Second call with same input: returns cached result
Cache TTL: 3600s

Rate Limiting with Token Awareness

Track usage per API key and limit based on tokens consumed.

from fastapi import Request, HTTPException
import time

class TokenBucket:
    def __init__(self, max_tokens: int, refill_rate: float):
        self.max_tokens = max_tokens
        self.refill_rate = refill_rate
        self.tokens = max_tokens
        self.last_refill = time.time()

    def consume(self, tokens: int) -> bool:
        now = time.time()
        elapsed = now - self.last_refill
        self.tokens = min(
            self.max_tokens,
            self.tokens + elapsed * self.refill_rate
        )
        self.last_refill = now

        if self.tokens >= tokens:
            self.tokens -= tokens
            return True
        return False

buckets = {}  # api_key -> TokenBucket

def get_token_bucket(api_key: str) -> TokenBucket:
    if api_key not in buckets:
        buckets[api_key] = TokenBucket(
            max_tokens=100000,
            refill_rate=1000  # tokens per second
        )
    return buckets[api_key]

@app.middleware("http")
async def rate_limit_middleware(request: Request, call_next):
    api_key = request.headers.get("X-API-Key", "anonymous")
    bucket = get_token_bucket(api_key)

    estimated_tokens = 500  # estimate per request
    if not bucket.consume(estimated_tokens):
        raise HTTPException(
            status_code=429,
            detail={
                "error": "rate_limit_exceeded",
                "message": "Token quota exceeded. Try again later.",
                "retry_after": 60
            }
        )

    response = await call_next(request)
    response.headers["X-RateLimit-Remaining"] = str(
        int(bucket.tokens)
    )
    return response

print("Rate limit middleware active")
print("Token bucket: 100K max, 1000 tokens/sec refill")

Expected output:

Rate limit middleware active
Token bucket: 100K max, 1000 tokens/sec refill

Common Errors

Error	Cause	Fix
Timeout on long LLM calls	Default HTTP timeout too short	Set timeout to 300s or use streaming with keep-alive
Repeated identical API calls	No caching layer	Add Redis cache with SHA256 request hash
Prompt Injection in system message	User input not sanitized	Separate system and user messages; validate with guardrail model
JSON parsing fails on structured output	LLM returns malformed JSON	Use `response_format` with Pydantic model in supported models
Rate Limiting blocks legitimate users	Global rate limit without per-key tracking	Implement token bucket per API key instead of global counter

Practice Questions

Why is streaming important for LLM API endpoints? Streaming returns tokens as they are generated, reducing perceived latency from seconds to milliseconds for the first token.
How does request caching reduce costs in AI APIs? Caching identical requests avoids repeated LLM invocations, cutting API costs dollar-for-dollar for repeated queries.
What is the difference between a token bucket and a fixed-window rate limiter? Token buckets allow bursts up to the bucket capacity while limiting average rate; fixed windows cap requests per calendar interval.
Why should structured output be validated server-side and not trusted from the LLM? LLMs can still produce invalid output even with structured mode; server-side Pydantic validation catches and handles malformed responses.
Challenge: Build an API Gateway that routes to different LLM providers (OpenAI, Anthropic, local Ollama) based on the model name in the request, with automatic fallback if one provider fails.

Mini Project

Build an AI-powered content moderation API. Create a FastAPI endpoint that accepts text, sends it to an LLM for toxicity classification (categories: hate speech, harassment, spam, safe), returns structured JSON with category and confidence, implements request caching with Redis to avoid re-checking identical content, and logs every request with token usage per API key for billing.

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.

← Previous Vector Databases Explained — Pinecone, Weaviate, Qdrant & Chroma for AI Search Next → LLM Evaluation and Benchmarking — Metrics, Datasets and Automated Testing

Built by the developers of DodaTech

Doda Browser, DodaZIP & Durga Antivirus Pro

Home Browse Ai Automation