Integrating LLM APIs: OpenAI, Anthropic and Open-Source Models

DodaTech Updated 2026-06-22 7 min read

This tutorial covers integrating LLM APIs from OpenAI, Anthropic, and open-source models, with key concepts, practical examples, and best practices you can apply in your own projects.

LLM APIs let you add powerful language AI capabilities to applications without training or hosting models, enabling chat, summarization, code generation, and reasoning features with simple HTTP calls.

What You'll Learn

In this tutorial, you'll learn to integrate LLM APIs including OpenAI, Anthropic Claude, and open-source models via Ollama, covering chat completions, streaming responses, function calling, and building AI-powered features in your Python applications.

Why It Matters

LLMs have transformed from research curiosities to essential infrastructure. Most applications can benefit from AI features such as answering questions, summarizing content, generating code, and extracting structured data. API-based LLMs eliminate the need for expensive GPU infrastructure, and open-source models running locally with Ollama provide privacy and offline capability.

Real-World Use

Doda Browser integrates multiple LLM APIs for its AI assistant. OpenAI handles complex reasoning, Anthropic Claude processes long documents with its 200K context window, and a local Ollama model provides offline summarization for privacy-sensitive browsing sessions.

OpenAI API

The OpenAI API provides access to GPT models including GPT-4 and GPT-3.5. The chat completions endpoint accepts a list of messages, each with a role (system, user, or assistant), and returns a generated response. You can control temperature (randomness of sampling), max_tokens (an upper bound on response length), and top_p (nucleus sampling). The API also supports streaming for token-by-token delivery.

import os

from openai import OpenAI

# Read the key from the environment instead of hardcoding it in source.
client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"]
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful Python tutor."
        },
        {
            "role": "user",
            "content": "Explain list comprehensions in Python with an example."
        }
    ],
    temperature=0.3,
    max_tokens=200
)

answer = response.choices[0].message.content
print(f"Model: {response.model}")
print(f"Tokens used: {response.usage.total_tokens}")
print(f"\nAnswer:\n{answer[:200]}...")

Expected output:

Model: gpt-3.5-turbo-0613
Tokens used: 85

Answer:
A list comprehension is a concise way to create lists in Python. It consists of brackets containing an expression followed by a for clause. For example:
[x**2 for x in range(5)] produces [0, 1, 4, 9, 16]...
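
The role-based message list extends naturally to multi-turn chats: each model reply is appended as an assistant message before the next user turn. The following is a minimal sketch, assuming a hypothetical Conversation helper (not part of the OpenAI SDK) that keeps the system prompt and only the most recent messages so the list does not grow without bound. The class name and the max_messages parameter are illustrative choices.

```python
class Conversation:
    """Hypothetical helper: stores chat history in the role/content format."""

    def __init__(self, system_prompt, max_messages=6):
        self.system = {"role": "system", "content": system_prompt}
        self.max_messages = max_messages  # cap on non-system messages kept
        self.history = []

    def add(self, role, content):
        if role not in ("user", "assistant"):
            raise ValueError(f"unsupported role: {role}")
        self.history.append({"role": role, "content": content})
        # Drop the oldest messages once the cap is exceeded.
        self.history = self.history[-self.max_messages:]

    def messages(self):
        # The system message always goes first, followed by recent history.
        return [self.system] + self.history


chat = Conversation("You are a helpful Python tutor.", max_messages=4)
chat.add("user", "What is a list comprehension?")
chat.add("assistant", "A concise way to build a list from an iterable.")
chat.add("user", "Show me an example.")
print(len(chat.messages()))  # system message plus three history messages
```

The list returned by chat.messages() has the same shape as the messages argument in the request above, so it could be passed to the chat completions call directly.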

Streaming Responses

Streaming delivers tokens one at a time as they are generated, reducing perceived latency. The user sees the response appear incrementally rather than waiting for the full response. Streaming is essential for chat applications where users expect immediate feedback. With OpenAI, set stream=True and iterate over response chunks.

def stream_response(messages):
    collected = []
    for chunk in client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=messages,
        stream=True,
        temperature=0
    ):
        if chunk.choices[0].delta.content is not None:
            token = chunk.choices[0].delta.content
            collected.append(token)
    return ''.join(collected)

full_response = stream_response([
    {"role": "user", "content": "Say 'Hello, world!'"}
])

print(f"Streamed response: {full_response}")
print(f"Character count: {len(full_response)}")

Expected output:

Streamed response: Hello, world!
Character count: 13
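
The latency benefit of streaming comes from showing each piece of text as soon as it arrives, so the useful number to watch is time to first token. Here is a minimal sketch of that idea, assuming a hypothetical consume_stream helper and a fake_stream generator that stands in for a real API stream so the example runs without an API key.

```python
import time


def consume_stream(tokens, on_token=print):
    """Pass each text piece to on_token as it arrives; return text and timing."""
    start = time.perf_counter()
    first_token_at = None
    parts = []
    for token in tokens:
        if first_token_at is None:
            # Seconds elapsed before the first piece of text arrived.
            first_token_at = time.perf_counter() - start
        on_token(token)
        parts.append(token)
    return "".join(parts), first_token_at


def fake_stream():
    # Stand-in for an API stream (an assumption for this sketch).
    for piece in ["Hello", ", ", "world!"]:
        time.sleep(0.01)
        yield piece


text, ttft = consume_stream(
    fake_stream(),
    on_token=lambda t: print(t, end="", flush=True),
)
print()
print(f"Full text: {text}")
print(f"Time to first token: {ttft:.3f}s")
```

With a real OpenAI stream, the tokens argument could be a generator that yields chunk.choices[0].delta.content for each chunk whose content is not None.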

Anthropic Claude API

Anthropic's Claude excels at long-context reasoning, code generation, and safe AI interactions. The API supports up to 200K tokens of context (enough for large codebases or long documents). Claude's Messages API resembles OpenAI's chat completions but is accessed through the Anthropic SDK, takes the system prompt as a separate system parameter, and requires max_tokens. Claude Haiku is optimized for speed, Sonnet balances speed and capability, and Opus provides maximum capability.

import os

import anthropic

# Read the key from the environment instead of hardcoding it in source.
claude_client = anthropic.Anthropic(
    api_key=os.environ["ANTHROPIC_API_KEY"]
)

response = claude_client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=150,
    system="You are a code review assistant.",
    messages=[
        {
            "role": "user",
            "content": "Review this Python function:\n\ndef add(a,b):\n    return a+b"
        }
    ]
)

text = response.content[0].text
print(f"Model: {response.model}")
print(f"Stop reason: {response.stop_reason}")
print(f"\nReview:\n{text[:200]}...")

Expected output:

Model: claude-3-haiku-20240307
Stop reason: end_turn

Review:
This function looks correct and concise. It takes two parameters and returns their sum. Here are a few suggestions:
1. Consider adding type hints: def add(a: int, b: int) -> int:
2. Add a docstring for clarity...
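
One practical difference between the two APIs is where the system prompt goes: OpenAI expects it as a message with the system role, while the Anthropic request above passes it as a separate system argument. The sketch below converts an OpenAI-style message list into the two pieces the Anthropic call needs. The function name to_anthropic_format is hypothetical, and the sketch assumes every message has plain string content.

```python
def to_anthropic_format(messages):
    """Split OpenAI-style messages into (system_text, chat_messages)."""
    system_parts = []
    chat_messages = []
    for msg in messages:
        if msg["role"] == "system":
            # Collect system prompts; they are sent outside the message list.
            system_parts.append(msg["content"])
        else:
            chat_messages.append({"role": msg["role"], "content": msg["content"]})
    return "\n".join(system_parts), chat_messages


openai_style = [
    {"role": "system", "content": "You are a code review assistant."},
    {"role": "user", "content": "Review this function."},
]
system_text, chat_messages = to_anthropic_format(openai_style)
print(system_text)
print(chat_messages)
```

The two results map onto the request shown earlier as system=system_text and messages=chat_messages.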

Open-Source Models with Ollama

Ollama runs open-source LLMs locally on your machine. Once downloaded, models like Llama 3, Mistral, Qwen, and Gemma run without an internet connection and with no API costs. Ollama provides an OpenAI-compatible API endpoint, so the same Python code can switch between cloud and local models by changing the base URL and model name. This is ideal for privacy-sensitive applications, offline use, and development.

from openai import OpenAI

ollama_client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama'
)

response = ollama_client.chat.completions.create(
    model="llama3",
    messages=[
        {"role": "user", "content": "Write a one-sentence explanation of embeddings."}
    ],
    temperature=0.7,
    max_tokens=100
)

result = response.choices[0].message.content
print(f"Local model: {response.model}")
print(f"Response:\n{result}")

Expected output:

Local model: llama3
Response:
Embeddings are numerical vector representations of text that capture semantic meaning, allowing similar pieces of content to be positioned close together in vector space for tasks like search and clustering.
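
Because both backends speak the same protocol, switching between them can come down to a small settings lookup. The following is a sketch under stated assumptions: the PROVIDERS table and the client_settings function are hypothetical, the Ollama URL and placeholder key are taken from the example above, and OPENAI_API_KEY is assumed to be set in the environment for the cloud case.

```python
import os

# Hypothetical provider table for this sketch.
PROVIDERS = {
    "openai": {
        "base_url": None,  # use the SDK's default endpoint
        "api_key_env": "OPENAI_API_KEY",
        "model": "gpt-3.5-turbo",
    },
    "ollama": {
        "base_url": "http://localhost:11434/v1",
        "api_key_env": None,
        "model": "llama3",
    },
}


def client_settings(provider):
    """Return (constructor kwargs, model name) for an OpenAI-compatible client."""
    cfg = PROVIDERS[provider]
    kwargs = {}
    if cfg["base_url"]:
        kwargs["base_url"] = cfg["base_url"]
    if cfg["api_key_env"]:
        kwargs["api_key"] = os.environ.get(cfg["api_key_env"], "")
    else:
        # Placeholder value, as in the Ollama example above.
        kwargs["api_key"] = "ollama"
    return kwargs, cfg["model"]


kwargs, model = client_settings("ollama")
print(kwargs, model)
```

The returned kwargs could then be unpacked into the client constructor, as in OpenAI(**kwargs), with model passed to the chat completions call.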

Function Calling

Function calling lets LLMs extract structured data from natural language by returning JSON that matches a defined schema. You define functions with parameters, and the model returns a function_call whose arguments arrive as a JSON string. This enables extracting entities, classifying text, and triggering actions based on user input, all without hand-crafted parsing logic. The functions and function_call parameters used here are the original form of this feature; newer OpenAI SDK versions deprecate them in favor of tools and tool_choice.

import json

functions = [
    {
        "name": "extract_person_info",
        "description": "Extract person details from text",
        "parameters": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "age": {"type": "integer", "description": "Age in years"},
                "occupation": {"type": "string"}
            },
            "required": ["name"]
        }
    }
]

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "user", "content": "John is a 35-year-old software engineer from Boston."}
    ],
    functions=functions,
    function_call="auto"
)

msg = response.choices[0].message
if msg.function_call:
    args = json.loads(msg.function_call.arguments)
    print(f"Function called: {msg.function_call.name}")
    print(f"Extracted data:")
    for k, v in args.items():
        print(f"  {k}: {v}")

Expected output:

Function called: extract_person_info
Extracted data:
  name: John
  age: 35
  occupation: software engineer
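
Two small helpers often sit around a function-calling request, sketched here with hypothetical names. The first, to_tools, wraps function definitions in the shape expected by the newer tools parameter. The second, parse_arguments, checks the model's JSON string before the application trusts it, since the model is not guaranteed to return valid JSON or to include every required field. The argument string below is hand-written to stand in for a real model response.

```python
import json


def to_tools(functions):
    """Wrap function definitions in the tools format."""
    return [{"type": "function", "function": fn} for fn in functions]


def parse_arguments(raw_arguments, schema):
    """Parse model-produced JSON and check the schema's required keys."""
    args = json.loads(raw_arguments)  # raises json.JSONDecodeError on bad JSON
    missing = [key for key in schema.get("required", []) if key not in args]
    if missing:
        raise ValueError(f"missing required arguments: {missing}")
    return args


person_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name"],
}
functions = [
    {
        "name": "extract_person_info",
        "description": "Extract person details from text",
        "parameters": person_schema,
    }
]

tools = to_tools(functions)
print(tools[0]["type"], tools[0]["function"]["name"])

# Hand-written stand-in for the arguments string returned by the model.
args = parse_arguments('{"name": "John", "age": 35}', person_schema)
print(args)
```

This check covers only required keys. Validating types as well would need a JSON Schema validator or explicit checks.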

LLM Integration Architecture

flowchart TD
  A[User Request] --> B[Application]
  B --> C{Model Choice}
  C --> D[OpenAI GPT-4]
  C --> E[Anthropic Claude]
  C --> F[Ollama Local]
  D --> G[Response]
  E --> G
  F --> G
  G --> B
  B --> H[Structured Output]
  H --> I[Function Call]
  H --> J[Text Response]
  B --> K[Logging & Monitoring]
  K --> L[Usage Tracking]
  K --> M[Error Handling]
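
The Model Choice step in the diagram can also act as a fallback chain: try one backend and move to the next if it fails. Below is a minimal sketch, assuming each backend is wrapped in a plain callable that takes a prompt and returns text. The call_with_fallback function and the two stand-in backends are hypothetical and make no network calls.

```python
def call_with_fallback(providers, prompt):
    """Try each (name, callable) in order; return the first successful result."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # real code should catch the SDK's error types
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Stand-in backends for this sketch.
def failing_cloud(prompt):
    raise ConnectionError("service unavailable")


def local_model(prompt):
    return f"(local) answer to: {prompt}"


used, answer = call_with_fallback(
    [("openai", failing_cloud), ("ollama", local_model)],
    "What are embeddings?",
)
print(used)
print(answer)
```

Returning the provider name alongside the answer makes it easy to log which backend actually served each request.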

Common Errors and Mistakes

Mistake	Consequence	How to Fix
No error handling	API outages break the app	Implement retry with exponential backoff
Hardcoded API keys	Security risk in code	Use environment variables or secret manager
Too many tokens	Exceeding context window	Set max_tokens, truncate long inputs
No streaming	Poor user experience	Use streaming for chat applications
Ignoring token costs	Unexpected bills	Track usage, set limits, cache responses
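
The first fix in the table, retry with exponential backoff, can be sketched in a few lines. This is an illustrative helper, not an SDK feature: the name retry_with_backoff, its parameters, and the flaky stand-in function are all assumptions. The sleep function is injectable so the example runs instantly.

```python
import random
import time


def retry_with_backoff(call, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call call() and retry on exception, doubling the wait each attempt."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt)
            # Add up to 10% jitter so many clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))


# Stand-in for an API call that fails twice, then succeeds.
attempts = {"count": 0}


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("temporary outage")
    return "ok"


waits = []
result = retry_with_backoff(flaky, base_delay=0.5, sleep=waits.append)
print(result, attempts["count"], len(waits))
```

In production code it is usually better to retry only on errors that are likely to be transient, such as timeouts and rate-limit responses, and not on authentication or validation failures.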

Practice Questions

What is the difference between the system, user, and assistant roles in chat messages?

Answer: System sets the model behavior and persona. User provides the input/prompt. Assistant contains previous model responses for conversation context. A single-turn query needs only a user message; the system message is optional, and assistant messages appear once there is prior conversation to carry forward.

How does streaming improve the user experience with LLM APIs?

Answer: Streaming sends tokens as they are generated, reducing perceived latency to the first token. Users see the response build incrementally rather than waiting for the complete response, creating a more interactive experience.

What is function calling and when would you use it?

Answer: Function calling extracts structured data from natural language by having the model return a JSON object matching a defined schema. Use it for entity extraction, classification, triggering API calls, or any task requiring structured output from unstructured text.

Why might you choose an open-source model via Ollama over OpenAI?

Answer: Ollama provides privacy (data stays local), no API costs, offline operation, and no rate limits. Choose it for sensitive data, development, or applications that cannot tolerate network latency or dependence on an external service.

How do you handle API rate limits with LLM services?

Answer: Implement exponential backoff with retry, queue requests, use multiple API keys with rotation, cache common responses, and monitor usage to stay within tier limits.
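
Caching is the simplest of these techniques to illustrate, because a repeated request never reaches the API at all. The sketch below assumes a hypothetical in-memory ResponseCache keyed by model name and message list, with a fake_model function standing in for a real API call.

```python
import hashlib
import json


class ResponseCache:
    """Hypothetical in-memory cache keyed by model and messages."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def _key(self, model, messages):
        # Serialize deterministically so identical requests share a key.
        payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(self, model, messages, call):
        key = self._key(model, messages)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = call(model, messages)
        self._store[key] = result
        return result


# Stand-in for a real API call; records how often it is invoked.
calls = []


def fake_model(model, messages):
    calls.append(model)
    return "Embeddings are vector representations of text."


cache = ResponseCache()
question = [{"role": "user", "content": "What are embeddings?"}]
first = cache.get_or_call("llama3", question, fake_model)
second = cache.get_or_call("llama3", question, fake_model)
print(len(calls), cache.hits, cache.misses)
```

A cache like this suits requests where the same answer is acceptable every time, for example with temperature set to 0. A real deployment would also want an expiry policy and a size limit.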

Challenge

Build a multi-LLM chat application that supports OpenAI, Anthropic, and Ollama backends. Implement streaming responses, conversation history management, and a fallback chain (try OpenAI first, fall back to Anthropic if unavailable, then Ollama). Add token counting and cost estimation per conversation. Allow the user to switch providers mid-conversation.

Real-World Task

Design an AI customer support system that uses LLM APIs to answer product questions. Use function calling to extract the user's issue category, account ID, and urgency. Route simple questions to a faster/cheaper model (GPT-3.5 or Haiku), escalate complex issues to GPT-4 or Claude Opus. Implement streaming for the chat interface, caching for common questions, and usage tracking per customer.

Next Steps

Build complete applications with LangChain for LLM orchestration. Deploy with Docker and scale with Kubernetes. Monitor costs and latency with MLflow.

{{< faq "What is the difference between OpenAI and Anthropic APIs?">}} Both provide chat completion APIs with streaming and function calling. Anthropic's Claude offers a larger context window (200K tokens), while OpenAI has a broader ecosystem and more models. Pricing, speed, and safety approaches differ. Choose based on your specific use case requirements. {{< /faq >}}

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
