Integrating LLM APIs: OpenAI, Anthropic and Open-Source Models
In this tutorial, you'll learn about integrating LLM APIs: OpenAI, Anthropic, and open-source models. We cover key concepts, practical examples, and best practices to help you understand and apply this topic effectively.
LLM APIs let you add powerful language AI capabilities to applications without training or hosting models, enabling chat, summarization, code generation, and reasoning features with simple HTTP calls.
What You'll Learn
In this tutorial, you'll learn to integrate LLM APIs including OpenAI, Anthropic Claude, and open-source models via Ollama, covering chat completions, streaming responses, function calling, and building AI-powered features in your Python applications.
Why It Matters
LLMs have gone from research curiosities to essential infrastructure. Many applications can benefit from AI features such as answering questions, summarizing content, generating code, and extracting structured data. API-based LLMs remove the need to run expensive GPU infrastructure yourself, while open-source models running locally with Ollama provide privacy and offline capability.
Real-World Use
Doda Browser integrates multiple LLM APIs for its AI assistant. OpenAI handles complex reasoning, Anthropic Claude processes long documents with its 200K context window, and a local Ollama model provides offline summarization for privacy-sensitive browsing sessions.
OpenAI API
The OpenAI API provides access to GPT models including GPT-4 and GPT-3.5. The chat completions endpoint accepts a list of messages with roles (system, user, assistant) and returns a generated response. You can control temperature (how random the sampling is), max_tokens (an upper bound on response length), and top_p (nucleus sampling). The API also supports streaming for incremental delivery of the response as it is generated.
import os

from openai import OpenAI

# Read the key from the environment rather than hardcoding it in source.
client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"]
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful Python tutor."
        },
        {
            "role": "user",
            "content": "Explain list comprehensions in Python with an example."
        }
    ],
    temperature=0.3,  # Low temperature for more predictable answers
    max_tokens=200    # Cap on the length of the generated reply
)

# The generated text lives on the first choice's message.
answer = response.choices[0].message.content
print(f"Model: {response.model}")
print(f"Tokens used: {response.usage.total_tokens}")
print(f"\nAnswer:\n{answer[:200]}...")
Expected output:
Model: gpt-3.5-turbo-0613
Tokens used: 85
Answer:
A list comprehension is a concise way to create lists in Python. It consists of brackets containing an expression followed by a for clause. For example:
[x**2 for x in range(5)] produces [0, 1, 4, 9, 16]...
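The role-based message list described above also carries conversation history: each request resends earlier user and assistant turns so the model has context. The sketch below only builds that list and makes no API call. The helper names `new_conversation` and `add_turn` are hypothetical, chosen for illustration.

```python
def new_conversation(system_prompt):
    """Start a history with a single system message."""
    return [{"role": "system", "content": system_prompt}]


def add_turn(history, user_text, assistant_text):
    """Append one completed user/assistant exchange to the history."""
    history.append({"role": "user", "content": user_text})
    history.append({"role": "assistant", "content": assistant_text})
    return history


history = new_conversation("You are a helpful Python tutor.")
add_turn(
    history,
    "What is a list comprehension?",
    "A concise way to build a list from an iterable.",
)
# The next request would send the whole history plus the new user message.
history.append({"role": "user", "content": "Show me one with a condition."})

roles = [message["role"] for message in history]
print(roles)  # ['system', 'user', 'assistant', 'user']
```

Passing `history` as the `messages` argument of the call shown earlier would give the model the full exchange so far.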
Streaming Responses
Streaming delivers the response in small chunks as tokens are generated, reducing perceived latency. The user sees the response appear incrementally rather than waiting for the full response. Streaming is essential for chat applications where users expect immediate feedback. With OpenAI, set stream=True and iterate over the response chunks.
# Reuses the `client` created in the OpenAI API example.
def stream_response(messages):
    """Collect streamed chunks from the API and return the full text."""
    collected = []
    for chunk in client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=messages,
        stream=True,
        temperature=0
    ):
        # Some chunks carry no choices or no text (for example, the final one).
        if chunk.choices and chunk.choices[0].delta.content is not None:
            token = chunk.choices[0].delta.content
            collected.append(token)
    return ''.join(collected)

full_response = stream_response([
    {"role": "user", "content": "Say 'Hello, world!'"}
])
print(f"Streamed response: {full_response}")
print(f"Character count: {len(full_response)}")
Expected output:
Streamed response: Hello, world!
Character count: 13
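To see why streaming reduces perceived latency, it helps to separate producing chunks from displaying them. The sketch below uses a local generator, `fake_stream`, as a stand-in for an API stream so it runs without a network connection. Both `fake_stream` and `consume` are hypothetical names, and splitting on spaces is only an approximation of real tokenization.

```python
import time


def fake_stream(text, delay=0.0):
    """Stand-in for an API stream: yields the text one word at a time."""
    for word in text.split(" "):
        time.sleep(delay)  # Simulates generation time per chunk
        yield word + " "


def consume(stream, on_token):
    """Forward each piece to on_token as it arrives and return the full text."""
    parts = []
    for piece in stream:
        on_token(piece)  # A chat UI would render the piece here
        parts.append(piece)
    return "".join(parts).strip()


shown = []
full_text = consume(fake_stream("Hello, world! Streaming works."), shown.append)
print(len(shown), repr(full_text))
```

In a terminal application, `on_token` could be `lambda t: print(t, end="", flush=True)` so that each piece appears as soon as it is received.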
Anthropic Claude API
Anthropic's Claude is well suited to long-context reasoning and code generation, and is designed with an emphasis on safe AI interactions. The API supports up to 200K tokens of context, enough for long documents or sizable codebases. Claude uses a messages API similar to OpenAI's but is accessed through the Anthropic SDK, and the system prompt is passed as a separate system argument rather than as a message. Claude Haiku is optimized for speed, Sonnet balances speed and capability, and Opus provides maximum capability.
import os

import anthropic

# Read the key from the environment rather than hardcoding it in source.
claude_client = anthropic.Anthropic(
    api_key=os.environ["ANTHROPIC_API_KEY"]
)

response = claude_client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=150,
    system="You are a code review assistant.",  # System prompt is a top-level argument
    messages=[
        {
            "role": "user",
            "content": "Review this Python function:\n\ndef add(a,b):\n    return a+b"
        }
    ]
)

# The reply is a list of content blocks; the first one holds the text.
text = response.content[0].text
print(f"Model: {response.model}")
print(f"Stop reason: {response.stop_reason}")
print(f"\nReview:\n{text[:200]}...")
Expected output:
Model: claude-3-haiku-20240307
Stop reason: end_turn
Review:
This function looks correct and concise. It takes two parameters and returns their sum. Here are a few suggestions:
1. Consider adding type hints: def add(a: int, b: int) -> int:
2. Add a docstring for clarity...
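One practical difference between the two message formats is where the system prompt goes: OpenAI takes it as a message with the system role, while the Anthropic call above takes it as a separate `system` argument. The sketch below converts an OpenAI-style list into keyword arguments for the Anthropic call. The function name `to_anthropic_kwargs` is hypothetical, and the sketch assumes plain-text message content only.

```python
def to_anthropic_kwargs(messages):
    """Split OpenAI-style messages into `system` and `messages` arguments."""
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    chat = [m for m in messages if m["role"] != "system"]
    kwargs = {"messages": chat}
    if system_parts:
        # Join multiple system messages into one prompt string.
        kwargs["system"] = "\n\n".join(system_parts)
    return kwargs


openai_style = [
    {"role": "system", "content": "You are a code review assistant."},
    {"role": "user", "content": "Review this function."},
]
kwargs = to_anthropic_kwargs(openai_style)
print(kwargs["system"])
print(kwargs["messages"])
```

The result could then be passed along as `claude_client.messages.create(model=..., max_tokens=150, **kwargs)`.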
Open-Source Models with Ollama
Ollama runs open-source LLMs locally on your machine. Once downloaded, models like Llama 3, Mistral, Qwen, and Gemma run without an internet connection and without API costs. Ollama provides an OpenAI-compatible API endpoint, so you can reuse the same Python code and switch between cloud and local models by changing the base URL and model name. This is ideal for privacy-sensitive applications, offline use, and development.
from openai import OpenAI

# Point the OpenAI client at the local Ollama server.
ollama_client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama'  # Placeholder value; the client requires a non-empty key
)

response = ollama_client.chat.completions.create(
    model="llama3",
    messages=[
        {"role": "user", "content": "Write a one-sentence explanation of embeddings."}
    ],
    temperature=0.7,
    max_tokens=100
)

result = response.choices[0].message.content
print(f"Local model: {response.model}")
print(f"Response:\n{result}")
Expected output:
Local model: llama3
Response:
Embeddings are numerical vector representations of text that capture semantic meaning, allowing similar pieces of content to be positioned close together in vector space for tasks like search and clustering.
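Because the local endpoint is OpenAI-compatible, switching between cloud and local models can come down to a small table of settings. The sketch below builds the client arguments for each provider without importing any SDK. The `PROVIDERS` table and `client_settings` function are hypothetical names, and the environment variable name `OPENAI_API_KEY` is an assumption.

```python
import os

# Hypothetical provider table; the Ollama URL and placeholder key
# mirror the example above.
PROVIDERS = {
    "openai": {
        "api_key_env": "OPENAI_API_KEY",
        "base_url": None,
        "model": "gpt-3.5-turbo",
    },
    "ollama": {
        "api_key_env": None,
        "base_url": "http://localhost:11434/v1",
        "model": "llama3",
    },
}


def client_settings(provider, env=os.environ):
    """Return (client keyword arguments, model name) for a provider."""
    cfg = PROVIDERS[provider]
    settings = {}
    if cfg["base_url"]:
        settings["base_url"] = cfg["base_url"]
    if cfg["api_key_env"]:
        settings["api_key"] = env.get(cfg["api_key_env"], "")
    else:
        settings["api_key"] = "ollama"  # Placeholder, as in the example above
    return settings, cfg["model"]


settings, model = client_settings("ollama")
print(settings, model)
```

The returned dictionary is meant to be unpacked into the client constructor, for example `OpenAI(**settings)`, with `model` passed to the completion call.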
Function Calling
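Function calling lets LLMs extract structured data from natural language by returning JSON that matches a defined schema. You define functions with a JSON Schema description of their parameters, and the model returns a function_call containing the function name and a JSON string of populated arguments. The model never executes anything itself; your code parses the arguments and decides what to do with them. This enables extracting entities, classifying text, and triggering actions based on user input, all without hand-crafted parsing logic. The functions and function_call parameters used here are the older form of this feature; newer versions of the API express the same idea through tools and tool_choice.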
import json

# Reuses the `client` created in the OpenAI API example.
# Schema describing the structured data the model should return.
functions = [
    {
        "name": "extract_person_info",
        "description": "Extract person details from text",
        "parameters": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "age": {"type": "integer", "description": "Age in years"},
                "occupation": {"type": "string"}
            },
            "required": ["name"]
        }
    }
]

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "user", "content": "John is a 35-year-old software engineer from Boston."}
    ],
    functions=functions,
    function_call="auto"  # Let the model decide whether to call the function
)

msg = response.choices[0].message
if msg.function_call:
    # Arguments arrive as a JSON string and must be parsed.
    args = json.loads(msg.function_call.arguments)
    print(f"Function called: {msg.function_call.name}")
    print("Extracted data:")
    for k, v in args.items():
        print(f"  {k}: {v}")
Expected output:
Function called: extract_person_info
Extracted data:
  name: John
  age: 35
  occupation: software engineer
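Once the model returns a function name and a JSON string of arguments, the application still has to check them and run the matching code. The sketch below shows one way to do that with a small dispatch table. The names `HANDLERS`, `REQUIRED`, and `dispatch` are hypothetical, and the required-field check is a simplification of full schema validation.

```python
import json


def extract_person_info(name, age=None, occupation=None):
    """Local handler that receives the arguments the model extracted."""
    return {"name": name, "age": age, "occupation": occupation}


# Hypothetical registry mapping function names to handlers and required fields.
HANDLERS = {"extract_person_info": extract_person_info}
REQUIRED = {"extract_person_info": ["name"]}


def dispatch(function_name, arguments_json):
    """Parse the JSON arguments, check required fields, and call the handler."""
    if function_name not in HANDLERS:
        raise ValueError(f"Unknown function: {function_name}")
    try:
        args = json.loads(arguments_json)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Arguments are not valid JSON: {exc}") from exc
    if not isinstance(args, dict):
        raise ValueError("Arguments must be a JSON object")
    missing = [key for key in REQUIRED[function_name] if key not in args]
    if missing:
        raise ValueError(f"Missing required arguments: {missing}")
    return HANDLERS[function_name](**args)


record = dispatch(
    "extract_person_info",
    '{"name": "John", "age": 35, "occupation": "software engineer"}',
)
print(record)
```

Validating before dispatch matters because the model's output is untrusted input: the JSON can be malformed or omit fields even when the schema marks them as required.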
LLM Integration Architecture
flowchart TD
A[User Request] --> B[Application]
B --> C{Model Choice}
C --> D[OpenAI GPT-4]
C --> E[Anthropic Claude]
C --> F[Ollama Local]
D --> G[Response]
E --> G
F --> G
G --> B
B --> H[Structured Output]
H --> I[Function Call]
H --> J[Text Response]
B --> K[Logging & Monitoring]
K --> L[Usage Tracking]
K --> M[Error Handling]
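The model-choice step in the diagram can be sketched as a fallback chain that tries providers in order and returns the first successful answer. The example below uses plain functions in place of real API clients so it runs offline. `ask_with_fallback`, `broken_cloud`, and `local_model` are hypothetical names.

```python
def ask_with_fallback(prompt, providers):
    """Try each (name, call) pair in order; return the first successful answer."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # Sketch only; real code should catch narrower errors
            errors.append(f"{name}: {exc}")
    raise RuntimeError("All providers failed: " + "; ".join(errors))


def broken_cloud(prompt):
    """Stand-in for a cloud provider that is currently unavailable."""
    raise ConnectionError("service unavailable")


def local_model(prompt):
    """Stand-in for a local model that always answers."""
    return f"(local) answer to: {prompt}"


used, answer = ask_with_fallback(
    "What is an embedding?",
    [("openai", broken_cloud), ("ollama", local_model)],
)
print(used, answer)
```

In a real application each callable would wrap one of the client calls shown earlier, and the collected errors would feed the logging and monitoring branch of the diagram.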
Common Errors and Mistakes
| Mistake | Why It Hurts | How to Fix |
|---|---|---|
| No error handling | API outages and rate-limit errors break the app | Implement retries with exponential backoff |
| Hardcoded API keys | Keys leak through source control | Use environment variables or a secret manager |
| Too many tokens | Requests exceed the model's context window | Truncate or summarize long inputs and set max_tokens |
| No streaming | Users wait for the full response before seeing anything | Use streaming for chat applications |
| Ignoring token costs | Unexpected bills | Track usage, set limits, cache responses |
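The first fix in the table, retry with exponential backoff, can be sketched in a few lines. The example below doubles the wait after each failed attempt and adds a little random jitter. `with_retries` is a hypothetical helper, the delay values are arbitrary, and the `sleep` parameter is injectable so the example runs instantly.

```python
import random
import time


def with_retries(call, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Run call(); on failure wait base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:  # Sketch only; retry just rate-limit or timeout errors in real code
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            sleep(delay)


attempts = {"count": 0}


def flaky():
    """Fails twice, then succeeds, to simulate a temporary outage."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("temporary failure")
    return "ok"


waits = []
result = with_retries(flaky, sleep=waits.append)  # Record delays instead of sleeping
print(result, attempts["count"], len(waits))  # ok 3 2
```

Wrapping an API call is then a matter of passing a zero-argument function, for example `with_retries(lambda: client.chat.completions.create(...))`.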
Practice Questions
- What is the difference between the system, user, and assistant roles in chat messages?
Answer: System sets the model behavior and persona. User provides the input/prompt. Assistant contains previous model responses for conversation context. Only a user message is strictly required for a single-turn query; a system message is optional, and assistant messages appear once there is history to replay.
- How does streaming improve the user experience with LLM APIs?
Answer: Streaming sends tokens as they are generated, reducing perceived latency to the first token. Users see the response build incrementally rather than waiting for the complete response, creating a more interactive experience.
- What is function calling and when would you use it?
Answer: Function calling extracts structured data from natural language by having the model return a JSON object matching a defined schema. Use it for entity extraction, classification, triggering API calls, or any task requiring structured output from unstructured text.
- Why might you choose an open-source model via Ollama over OpenAI?
Answer: Ollama provides privacy (data stays local), no API costs, offline operation, and no provider-imposed rate limits. Choose it for sensitive data, development, or applications where a network round-trip to a hosted API is unacceptable.
- How do you handle API rate limits with LLM services?
Answer: Implement exponential backoff with retry, queue requests, use multiple API keys with rotation, cache common responses, and monitor usage to stay within tier limits.
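Caching common responses, mentioned in the last answer, can be sketched as a dictionary keyed on a hash of the model name and messages. The example below uses a fake model function so it runs offline. `ResponseCache` and `fake_llm` are hypothetical names, and the sketch assumes repeated identical prompts should return the same answer, which fits best with a temperature of 0.

```python
import hashlib
import json


class ResponseCache:
    """In-memory cache keyed on the model name and the exact message list."""

    def __init__(self):
        self._store = {}
        self.hits = 0

    @staticmethod
    def _key(model, messages):
        # Serialize deterministically so equal requests hash to the same key.
        payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(self, model, messages, call):
        """Return a cached answer, or call the model and store the result."""
        key = self._key(model, messages)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        answer = call(model, messages)
        self._store[key] = answer
        return answer


calls = []


def fake_llm(model, messages):
    """Stand-in for an API call; records each invocation."""
    calls.append(model)
    return "cached answer"


cache = ResponseCache()
question = [{"role": "user", "content": "What are your opening hours?"}]
first = cache.get_or_call("gpt-3.5-turbo", question, fake_llm)
second = cache.get_or_call("gpt-3.5-turbo", question, fake_llm)
print(first == second, len(calls), cache.hits)  # True 1 1
```

The second request never reaches the model, which saves both tokens and a slot in the rate limit.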
Challenge
Build a multi-LLM chat application that supports OpenAI, Anthropic, and Ollama backends. Implement streaming responses, conversation history management, and a fallback chain (try OpenAI first, fall back to Anthropic if unavailable, then Ollama). Add token counting and cost estimation per conversation. Allow the user to switch providers mid-conversation.
Real-World Task
Design an AI customer support system that uses LLM APIs to answer product questions. Use function calling to extract the user's issue category, account ID, and urgency. Route simple questions to a faster/cheaper model (GPT-3.5 or Haiku), escalate complex issues to GPT-4 or Claude Opus. Implement streaming for the chat interface, caching for common questions, and usage tracking per customer.
Next Steps
Build complete applications with LangChain for LLM orchestration. Deploy with Docker and scale with Kubernetes. Monitor costs and latency with MLflow.
{{< faq "What is the difference between OpenAI and Anthropic APIs?">}} Both provide chat completion APIs with streaming and function calling. Anthropic's Claude offers a larger context window (200K tokens), while OpenAI has a broader ecosystem and more models. Pricing, speed, and safety approaches differ. Choose based on your specific use case requirements. {{< /faq >}}
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.