OpenAI API Guide — Chat Completions, Embeddings & Function Calling
In this tutorial, you'll learn how to use the OpenAI API. We cover key concepts, practical examples, and best practices to help you apply it effectively.
The OpenAI API provides programmatic access to GPT-4, GPT-3.5, embedding models, and DALL-E image generation, enabling developers to integrate AI into their applications with simple HTTP calls and official client libraries.
What You'll Learn
In this tutorial, you'll build real-world applications with the OpenAI API using Python, covering chat completions, streaming responses, function calling (tool use), embeddings for semantic search, and DALL-E image generation.
Why It Matters
The OpenAI API is one of the most widely used LLM APIs in production. Understanding its capabilities, from simple chat to structured data extraction to vector embeddings, is essential for building modern AI features. Whether you are building a chatbot, a search system, or an automation tool, the OpenAI API provides the foundation.
Real-World Use
Doda Browser uses the OpenAI API for its AI assistant: GPT-4 handles complex reasoning and tool use, embeddings power semantic search over browser history and bookmarks, and DALL-E generates custom thumbnails for visual content.
Architecture Overview
The OpenAI API follows a request-response pattern. Your application sends a request to the appropriate endpoint, the API processes it using the selected model, and returns the result.
graph LR
A["Your Application"] -->|"HTTP Request"| B["OpenAI API Gateway"]
B -->|"Route"| C["Model Endpoint"]
C --> D["GPT-4 / GPT-3.5"]
C --> E["text-embedding-3-small"]
C --> F["DALL-E 3"]
D -->|"Response"| B
E -->|"Vector Response"| B
F -->|"Image URL"| B
B -->|"JSON Response"| A
G["API Key"] -.->|"Authentication"| B
The API key authenticates every request. The gateway routes to the correct model endpoint based on the model name in your request body. All responses return as structured JSON.
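To make the request-response flow concrete, here is a minimal sketch of the kind of HTTP request the client library sends on your behalf, using only the Python standard library. It builds the request but does not send it. The helper name `build_chat_request` and the placeholder key are assumptions for illustration; the endpoint URL and the Bearer-token header follow OpenAI's public REST documentation.

```python
import json
import urllib.request

API_URL = "https://api.openai.com/v1/chat/completions"


def build_chat_request(api_key, model, messages):
    """Build (without sending) an HTTP POST for the Chat Completions endpoint."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # The API key travels as a Bearer token on every request.
            "Authorization": f"Bearer {api_key}",
        },
    )


req = build_chat_request(
    "sk-placeholder",  # placeholder, not a real key
    "gpt-4",
    [{"role": "user", "content": "Hello"}],
)
print(req.get_method(), req.full_url)
print(json.loads(req.data)["model"])
```

Passing `req` to `urllib.request.urlopen` would actually send it. The `model` field in the JSON body is what the gateway uses to pick the model.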
Setup and Authentication
Before making any API calls, you need an OpenAI account and an API key. Install the official Python client library, which wraps the REST API with convenient methods, automatic retries, and type hints.
pip install openai
Create a client instance with your API key. Store the key in an environment variable rather than hard-coding it into your source code.
from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY")
)
print("Client created successfully")
Expected output:
Client created successfully
The client reads the key from the OPENAI_API_KEY environment variable. If the key is missing, the OpenAI() constructor raises an OpenAIError before any request is made. Always use environment variables or a secret manager, and never commit API keys to version control.
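If you would rather fail with your own error message at startup, a small helper can check the environment first. This is a minimal sketch; `load_api_key` is a hypothetical name and not part of the openai library.

```python
import os


def load_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key from the environment, or raise with a clear message."""
    env = os.environ if env is None else env
    key = (env.get(name) or "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it or use a secret manager.")
    return key


# Usage with a stand-in mapping instead of the real environment:
print(load_api_key({"OPENAI_API_KEY": "sk-placeholder"}))
```

Accepting the environment as a parameter keeps the helper easy to test without touching real secrets.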
Chat Completions API
The Chat Completions endpoint is the core of the OpenAI API. You send a list of messages with roles (system, user, assistant) and the model generates a response. GPT-4 and GPT-3.5 are the primary models, with GPT-4 offering higher reasoning capability at higher cost.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {
            "role": "system",
            "content": "You are a senior Python developer. Provide concise, correct answers."
        },
        {
            "role": "user",
            "content": "What is a decorator in Python and how do I write one?"
        }
    ],
    temperature=0.3,
    max_tokens=300
)
print(response.choices[0].message.content)
Expected output (abbreviated):
A decorator in Python is a function that takes another function and extends its behavior without explicitly modifying it. You write one using the @ syntax:
```python
def my_decorator(func):
    def wrapper(*args, **kwargs):
        print("Before the function call")
        result = func(*args, **kwargs)
        print("After the function call")
        return result
    return wrapper

@my_decorator
def say_hello():
    print("Hello!")

say_hello()
```
This prints "Before the function call", then "Hello!", then "After the function call".
Key parameters:
- **model**: The model ID (`"gpt-4"`, `"gpt-3.5-turbo"`). Newer models like `"gpt-4o"` offer improved performance.
- **messages**: An array of message objects. The `system` message sets the assistant behavior. The `user` message contains the user input.
- **temperature**: Controls randomness (0 = most deterministic, 2 = very random). Lower values are better for factual tasks.
- **max_tokens**: The maximum number of tokens in the response. Limits cost and response length.
Message Roles
The messages array supports these roles:
- **system**: Sets the behavior and personality of the assistant. This is where you provide instructions.
- **user**: The input from the end user. Can include text or images (with a vision-capable model such as GPT-4 Vision).
- **assistant**: Previous responses from the model, including any tool calls it requested. Required for multi-turn conversations to provide context.
- **tool**: The result of a function your code executed, linked to the model's request by its tool_call_id.
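Because the API is stateless, a multi-turn conversation means resending the history on every request. The sketch below shows one way to maintain that history and trim it so it does not grow without bound. The helper names are hypothetical, and the trimming assumes each turn is exactly one user message followed by one assistant message.

```python
def add_turn(history, user_text, assistant_text):
    """Append one user/assistant exchange to the message history."""
    history.append({"role": "user", "content": user_text})
    history.append({"role": "assistant", "content": assistant_text})


def trim_history(history, max_turns):
    """Keep system messages plus the most recent max_turns exchanges."""
    system = [m for m in history if m["role"] == "system"]
    rest = [m for m in history if m["role"] != "system"]
    # Guard against max_turns == 0, because rest[-0:] would keep everything.
    keep = rest[-2 * max_turns:] if max_turns > 0 else []
    return system + keep


history = [{"role": "system", "content": "You are a helpful assistant."}]
add_turn(history, "What is a list?", "An ordered, mutable sequence.")
add_turn(history, "And a tuple?", "An ordered, immutable sequence.")
trimmed = trim_history(history, max_turns=1)
print([m["role"] for m in trimmed])
```

The trimmed list is what you would pass as `messages` on the next request. Keeping the system message ensures the instructions survive even when older turns are dropped.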
Streaming Responses
For interactive applications, streaming delivers tokens as they are generated rather than waiting for the complete response. This reduces perceived latency and enables real-time display of text as it appears.
```python
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "user", "content": "Write a haiku about streaming APIs."}
    ],
    stream=True
)

# Print each piece of text as soon as it arrives
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
```
Expected output:
Data flows in chunks,
Tokens arrive one by one,
Response feels alive.
Enable streaming by passing stream=True to the create method. Instead of a single response object, the API yields chunk objects, each containing a delta with the next piece of content. The end="" argument stops print from appending a newline after every chunk, so the pieces join into continuous text as tokens arrive.
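Often you want the full text as well as the live display, for example to store the reply in the conversation history. The following sketch accumulates the deltas into one string. It uses stand-in objects with the same attribute shape as streamed chunks so that it runs without an API call; `collect_stream` and `fake_chunk` are hypothetical helpers.

```python
from types import SimpleNamespace


def collect_stream(stream):
    """Join the text deltas of a chat completion stream into one string."""
    parts = []
    for chunk in stream:
        if not chunk.choices:  # some chunks may carry no choices
            continue
        text = chunk.choices[0].delta.content
        if text is not None:
            parts.append(text)
    return "".join(parts)


def fake_chunk(text):
    """Stand-in with the attribute shape of a streamed chunk (illustration only)."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


fake_stream = [
    fake_chunk("Data "),
    fake_chunk(None),
    SimpleNamespace(choices=[]),
    fake_chunk("flows"),
]
print(collect_stream(fake_stream))
```

With the real client you would pass the object returned by `create(..., stream=True)` in place of `fake_stream`.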
Handling Streaming in Web Applications
For web apps, you stream tokens to the browser via Server-Sent Events (SSE) or WebSockets. The backend (Python in this tutorial) reads the stream from OpenAI and forwards each chunk to the frontend without buffering.
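As a rough illustration of the SSE side, the sketch below formats tokens as Server-Sent Events messages: each payload line is prefixed with `data: ` and each message ends with a blank line. The function names and the `[DONE]` end marker are conventions chosen for this sketch; your frontend would need to agree on the same marker.

```python
def sse_event(data):
    """Format one Server-Sent Events message; each payload line gets a 'data: ' prefix."""
    lines = data.split("\n")
    return "".join(f"data: {line}\n" for line in lines) + "\n"


def to_sse(tokens, done_marker="[DONE]"):
    """Yield one SSE message per token, then a final end-of-stream marker."""
    for token in tokens:
        yield sse_event(token)
    yield sse_event(done_marker)


events = list(to_sse(["Hello", " world"]))
print(events)
```

A web framework would return the `to_sse(...)` generator as a streaming response with the `text/event-stream` content type, so each message is flushed to the browser as soon as it is produced.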
Function Calling (Tool Use)
Function calling lets the model intelligently choose to call external functions. The model outputs structured JSON arguments for functions you define, and your code executes the function and returns the result. This enables the model to perform actions, query databases, or call other API endpoints.
from openai import OpenAI

client = OpenAI()

# Describe the function the model is allowed to call
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name, e.g. San Francisco"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"]
                    }
                },
                "required": ["location"]
            }
        }
    }
]

messages = [
    {"role": "user", "content": "What is the weather in Tokyo?"}
]

response = client.chat.completions.create(
    model="gpt-4",
    messages=messages,
    tools=tools,
    tool_choice="auto"
)

choice = response.choices[0]
if choice.finish_reason == "tool_calls":
    tool_call = choice.message.tool_calls[0]
    print(f"Function: {tool_call.function.name}")
    print(f"Arguments: {tool_call.function.arguments}")
Expected output:
Function: get_weather
Arguments: {"location": "Tokyo"}
The model recognized that the user is asking about weather and returned a tool_calls request with the get_weather function and the argument {"location": "Tokyo"}. Your code then executes the actual weather API call and sends the result back to the model.
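The arguments arrive as a JSON string produced by the model, so it is prudent to check them before executing anything. Below is a minimal sketch that validates arguments against a small subset of the schema (required fields, unknown fields, and enum values). `validate_arguments` is a hypothetical helper, not a library function, and it does not implement full JSON Schema.

```python
import json


def validate_arguments(arguments_json, schema):
    """Parse tool-call arguments and check them against a small subset of JSON Schema."""
    args = json.loads(arguments_json)
    errors = []
    for name in schema.get("required", []):
        if name not in args:
            errors.append(f"missing required argument: {name}")
    for name, value in args.items():
        spec = schema.get("properties", {}).get(name)
        if spec is None:
            errors.append(f"unexpected argument: {name}")
        elif "enum" in spec and value not in spec["enum"]:
            errors.append(f"invalid value for {name}: {value!r}")
    return args, errors


weather_schema = {
    "type": "object",
    "properties": {
        "location": {"type": "string"},
        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
    },
    "required": ["location"],
}

print(validate_arguments('{"location": "Tokyo"}', weather_schema))
print(validate_arguments('{"unit": "kelvin"}', weather_schema))
```

When the error list is not empty, one option is to send the errors back to the model as the tool result so it can correct the call.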
Multi-Turn Tool Use
After the model requests a function call, your code must execute the function and append both the assistant message (with tool_calls) and a tool response message to continue the conversation:
import json

def get_weather(location, unit="celsius"):
    weather_data = {
        "Tokyo": {"temperature": 22, "condition": "Sunny"},
        "San Francisco": {"temperature": 18, "condition": "Foggy"},
    }
    data = weather_data.get(location, {"temperature": 20, "condition": "Unknown"})
    return json.dumps(data)

# Append the assistant's tool call message
messages.append(choice.message)

# Execute the function and append the result
for tool_call in choice.message.tool_calls:
    result = get_weather(**json.loads(tool_call.function.arguments))
    messages.append({
        "role": "tool",
        "tool_call_id": tool_call.id,
        "content": result
    })

# Get the final response
final = client.chat.completions.create(
    model="gpt-4",
    messages=messages,
    tools=tools
)
print(final.choices[0].message.content)
Expected output:
The weather in Tokyo is currently sunny with a temperature of 22 degrees Celsius.
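Once you define more than one tool, calling `get_weather` directly no longer works. A common pattern is a registry that maps tool names to Python functions. The sketch below is one possible shape for it; `TOOL_REGISTRY` and `run_tool` are hypothetical names, and the weather function is a stand-in with fixed data.

```python
import json


def get_weather(location, unit="celsius"):
    """Hypothetical stand-in for a real weather lookup."""
    return json.dumps({"location": location, "temperature": 22, "unit": unit})


TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool(name, arguments_json, registry=TOOL_REGISTRY):
    """Execute a requested tool and always return a JSON string for the tool message."""
    func = registry.get(name)
    if func is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        return func(**json.loads(arguments_json))
    except (TypeError, ValueError) as exc:
        # Bad JSON or wrong argument names: report it instead of crashing.
        return json.dumps({"error": str(exc)})


print(run_tool("get_weather", '{"location": "Tokyo"}'))
print(run_tool("get_stock_price", '{"symbol": "X"}'))
```

Returning errors as JSON keeps the conversation going: the string can be sent back as the `content` of the tool message, and the model can decide how to recover.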
Embeddings API
The Embeddings API converts text into a vector of floating-point numbers. These vectors capture semantic meaning, enabling similarity search, clustering, and classification. OpenAI's text-embedding-3-small and text-embedding-3-large models offer high quality at low cost.
from openai import OpenAI

client = OpenAI()

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="The cat sat on the mat."
)

embedding = response.data[0].embedding
print(f"Vector length: {len(embedding)}")
print(f"First 5 dimensions: {embedding[:5]}")
Expected output:
Vector length: 1536
First 5 dimensions: [-0.009858479790389538, -0.009932905435562134, 0.029549377038121224, -0.030516650527715683, -0.0032712684888392687]
Each text input produces a fixed-size vector. The text-embedding-3-small model returns 1536 dimensions by default, while text-embedding-3-large returns 3072. You can request fewer dimensions with the dimensions parameter to save storage and computation.
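If you have already stored full-size vectors, you can also shorten them on the client instead of re-embedding. The sketch below keeps the first values and rescales the result to unit length. It assumes the text-embedding-3 models tolerate this kind of truncation, which OpenAI's embeddings guide suggests; verify that against the current documentation before relying on it. `shorten_embedding` is a hypothetical helper and the input is a toy vector.

```python
import math


def shorten_embedding(vector, dims):
    """Keep the first dims values and rescale the result to unit length."""
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        return list(head)  # nothing to rescale
    return [x / norm for x in head]


full = [3.0, 4.0, 12.0]  # toy vector standing in for a real embedding
short = shorten_embedding(full, 2)
print(short)
print(math.sqrt(sum(x * x for x in short)))
```

Rescaling matters because cosine similarity and dot-product search both assume comparable vector lengths.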
Semantic Search Example
Use embeddings to find the most similar document to a query:
import numpy as np
from openai import OpenAI

client = OpenAI()

documents = [
    "Python is a programming language used for web development and data science.",
    "OpenAI provides APIs for language models and embeddings.",
    "Doda Browser is a privacy-focused web browser.",
]
query = "What programming language should I learn for AI?"

# Generate embeddings
doc_embeddings = client.embeddings.create(
    model="text-embedding-3-small",
    input=documents
).data
query_embedding = client.embeddings.create(
    model="text-embedding-3-small",
    input=query
).data[0].embedding

# Compute cosine similarity
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

scores = [
    cosine_similarity(query_embedding, de.embedding)
    for de in doc_embeddings
]
best = np.argmax(scores)
print(f"Most relevant document: {documents[best]}")
print(f"Similarity score: {scores[best]:.4f}")
Expected output:
Most relevant document: Python is a programming language used for web development and data science.
Similarity score: 0.7842
The query embedding is compared against each document embedding using cosine similarity. The document with the highest score is the most semantically similar. This approach scales to millions of documents using vector databases like Pinecone or Weaviate.
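Search systems usually return several candidates rather than a single best match. The sketch below ranks documents and returns the top k using only the standard library. The vectors are toy 2-dimensional stand-ins for real embeddings, and `top_k` is a hypothetical helper.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=3):
    """Return (index, score) pairs for the k most similar documents, best first."""
    scored = ((i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs))
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy 2-dimensional vectors standing in for real embeddings.
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
for index, score in top_k([1.0, 0.2], docs, k=2):
    print(index, round(score, 4))
```

A linear scan like this is fine for a few thousand documents. Beyond that, an approximate nearest-neighbor index or a vector database avoids comparing the query against every stored vector.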
Image Generation (DALL-E)
DALL-E 3 generates images from text descriptions. The API returns either a URL to the generated image or base64-encoded image data.
from openai import OpenAI

client = OpenAI()

response = client.images.generate(
    model="dall-e-3",
    prompt="A vector illustration of a cat typing on a laptop in a coffee shop, digital art style",
    n=1,
    size="1024x1024",
    quality="standard"
)

image_url = response.data[0].url
print(f"Generated image URL: {image_url}")
Expected output:
Generated image URL: https://oaidalleapiprodscus.blob.core.windows.net/private/org-...
The URL is temporary and expires after about an hour. For permanent storage, download the image immediately and save it to your own storage or CDN. For DALL-E 3, the size parameter supports "1024x1024", "1792x1024", and "1024x1792". The quality parameter accepts "standard" or "hd".
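To avoid the expiring URL altogether, you can request base64 data by passing response_format="b64_json" to images.generate and reading response.data[0].b64_json. The sketch below shows only the decoding and saving step, using a stand-in payload so it runs without an API call. `save_b64_image` is a hypothetical helper.

```python
import base64
import tempfile
from pathlib import Path


def save_b64_image(b64_data, path):
    """Decode base64 image data and write it to path; returns the byte count."""
    raw = base64.b64decode(b64_data)
    Path(path).write_bytes(raw)
    return len(raw)


# Stand-in payload; a real call would pass response.data[0].b64_json.
sample = base64.b64encode(b"\x89PNG fake image bytes").decode("ascii")
target = Path(tempfile.gettempdir()) / "generated_image.png"
print(save_b64_image(sample, target), "bytes written to", target)
```

Base64 responses are larger than a URL, so this trades bandwidth for not having to make a second download request.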
Best Practices
Cost Optimization
- Use GPT-3.5-turbo for simple tasks and GPT-4 only for complex reasoning. GPT-3.5 costs roughly 10x less than GPT-4; check the current pricing page for exact figures.
- Set `max_tokens` to the minimum reasonable value. Every token costs money, but a response that is cut off too early wastes the tokens already spent on it.
- Cache responses for identical or similar queries using a Redis or database cache with TTL.
- Reduce embedding dimensions from 3072 to 256 or 512 for large-scale retrieval. Lower dimensions reduce storage and latency with minimal accuracy loss.
- Use `text-embedding-3-small` instead of `text-embedding-3-large` when possible. The small model costs several times less per token.
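The caching idea can be sketched with a small in-memory cache keyed by a hash of the model and messages. This is a minimal illustration that only matches identical requests; the class name `TTLCache` is hypothetical, and a production system would more likely use Redis or a database as the text suggests.

```python
import hashlib
import json
import time


class TTLCache:
    """Tiny in-memory response cache keyed by model and messages."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}

    @staticmethod
    def make_key(model, messages):
        """Hash the request so identical requests map to the same key."""
        payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[key]  # entry has expired
            return None
        return value

    def set(self, key, value):
        self._store[key] = (self.clock() + self.ttl, value)


cache = TTLCache(ttl_seconds=300)
key = cache.make_key("gpt-3.5-turbo", [{"role": "user", "content": "Hello"}])
if cache.get(key) is None:
    cache.set(key, "Hi there!")  # placeholder for a real API response
print(cache.get(key))
```

Caching only makes sense for requests where a repeated answer is acceptable, for example with a low temperature and no user-specific context.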
Rate Limiting
OpenAI enforces rate limits per organization and project, with ceilings that depend on your usage tier (tokens per minute, requests per minute). Handle rate limits with exponential backoff:
import time
from openai import OpenAI, RateLimitError

client = OpenAI()

for attempt in range(5):
    try:
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": "Hello"}]
        )
        break
    except RateLimitError:
        # Wait 1s, 2s, 4s, ... before the next attempt
        wait = 2 ** attempt
        print(f"Rate limited. Retrying in {wait}s...")
        time.sleep(wait)
Expected output (when the first attempt is rate limited):
Rate limited. Retrying in 1s...
The exponential backoff starts at 1 second and doubles each attempt, giving the rate limiter time to reset. For production, use the OpenAI Python library's built-in retry mechanism or a queue-based approach with backpressure.
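The same idea can be wrapped in a reusable helper that adds random jitter, so that many clients do not all retry at the same instant. This is a sketch under stated assumptions: `retry_with_backoff` and `flaky` are hypothetical names, and the simulated failure stands in for a RateLimitError. With the real client you would pass `retryable=(RateLimitError,)`, or simply set `max_retries` when constructing the client.

```python
import random
import time


def retry_with_backoff(func, retryable=(Exception,), max_attempts=5,
                       base_delay=1.0, sleep=time.sleep):
    """Call func, retrying on retryable errors with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return func()
        except retryable:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            delay = base_delay * (2 ** attempt)
            # Add up to 10% random jitter on top of the exponential delay.
            sleep(delay + random.uniform(0, delay * 0.1))


calls = {"count": 0}


def flaky():
    """Hypothetical stand-in that fails twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated transient failure")
    return "ok"


delays = []  # record the delays instead of actually sleeping
print(retry_with_backoff(flaky, retryable=(TimeoutError,), sleep=delays.append))
```

Injecting the `sleep` function keeps the helper fast to test; in real use, leave it at the default `time.sleep`.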
Error Handling
Common errors and their resolutions:
| HTTP Code | Error Type | Cause | Resolution |
|---|---|---|---|
| 401 | AuthenticationError | Invalid or missing API key | Check your OPENAI_API_KEY environment variable |
| 429 | RateLimitError | Exceeded rate limit | Implement exponential backoff or upgrade tier |
| 400 | BadRequestError | Invalid parameters | Validate model, messages, and parameter values |
| 500 | APIError | Server-side issue | Retry with backoff; check status.openai.com |
Always wrap API calls in try-except blocks to handle errors gracefully in production:
from openai import OpenAI, APIError, RateLimitError, APIConnectionError

client = OpenAI()

try:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Hello"}]
    )
except RateLimitError:
    print("Too many requests. Please slow down.")
except APIConnectionError:
    print("Network error. Check your internet connection.")
except APIError as e:
    # Catch-all for other API errors; keep this clause last
    print(f"OpenAI API error: {e}")
Expected output:
(No error — response returned successfully)
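Not every error is worth retrying. The table above can be reduced to a simple rule of thumb, sketched below: retry on 429 and on 5xx responses, and fix the request for other 4xx codes. `is_retryable` is a hypothetical helper, and the rule is a simplification (for example, a 429 caused by an exhausted quota will not recover on its own).

```python
def is_retryable(status_code):
    """Return True when retrying the same request might succeed."""
    if status_code == 429:  # rate limited: back off, then retry
        return True
    if 500 <= status_code < 600:  # server-side problem
        return True
    return False  # 400, 401, and other client errors need a fix, not a retry


for code in (400, 401, 429, 500):
    print(code, is_retryable(code))
```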
Practice Questions
1. What is the difference between `gpt-4` and `gpt-3.5-turbo` in terms of capability and cost?
2. How does the streaming response differ from a standard Chat Completions response?
3. What is the purpose of the `tools` parameter in a Chat Completions request?
4. Why would you use `text-embedding-3-small` over `text-embedding-3-large`?
5. What HTTP status code indicates a rate limit has been exceeded, and how should you handle it?
Answers:
1. GPT-4 has higher reasoning capability but costs approximately 10x more per token than GPT-3.5-turbo. Use GPT-3.5 for simple tasks and GPT-4 for complex reasoning.
2. Streaming delivers tokens incrementally as they are generated, reducing perceived latency. Standard responses return the complete text after the model finishes generating.
3. The `tools` parameter defines functions the model can call. When the model determines a function should be invoked, it returns a `tool_calls` response with the function name and arguments in JSON.
4. `text-embedding-3-small` costs several times less and uses 1536 dimensions versus 3072. For many use cases, the smaller model provides sufficient quality with lower storage and latency.
5. HTTP 429 (RateLimitError). Handle it with exponential backoff: wait `2^attempt` seconds before retrying, up to a maximum number of retries.
Challenge
Build a multi-turn chatbot that can answer questions about a dataset by generating and executing SQL queries. The chatbot should:
- Accept a natural language question from the user.
- Use function calling to generate a SQL query based on the question.
- Execute the SQL query against a SQLite database (provide a sample database with a `products` table).
- Return the query results to the model to generate a natural language answer.
- Handle follow-up questions that reference the previous context.
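As a possible starting point for the execution step, here is a sketch of a tool function that runs a query against an in-memory SQLite database and returns the rows as JSON. The table layout, sample rows, and function names are all hypothetical. The SELECT-only check is a naive guard for illustration, not real protection against malicious SQL.

```python
import json
import sqlite3


def make_sample_db():
    """Create an in-memory database with a small hypothetical products table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL)")
    conn.executemany(
        "INSERT INTO products (name, price) VALUES (?, ?)",
        [("Keyboard", 49.0), ("Mouse", 19.5), ("Monitor", 199.0)],
    )
    return conn


def run_sql(conn, query):
    """Run a single SELECT statement and return the rows as a JSON string."""
    if not query.lstrip().lower().startswith("select"):
        return json.dumps({"error": "only SELECT statements are allowed"})
    try:
        cursor = conn.execute(query)
    except sqlite3.Error as exc:
        return json.dumps({"error": str(exc)})
    columns = [col[0] for col in cursor.description]
    return json.dumps([dict(zip(columns, row)) for row in cursor.fetchall()])


conn = make_sample_db()
print(run_sql(conn, "SELECT name FROM products WHERE price < 50 ORDER BY price"))
print(run_sql(conn, "DROP TABLE products"))
```

The JSON string returned by `run_sql` can be sent back to the model as the content of a tool message, in the same way as the weather result earlier in this tutorial.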
Real-World Task
You are building an FAQ system for a customer support portal. Use the OpenAI Embeddings API to create a semantic search system that retrieves the most relevant FAQ article for a given user query. Store the embeddings in a vector database or a numpy array file (faq_embeddings.npy). When a user types a question, find the top 3 most similar FAQs and display them ranked by similarity score.
Next Steps
- [LangChain Guide](/machine-learning/LangChain-guide/): Build LLM-powered applications with LangChain
- RAG Systems: Implement retrieval-augmented generation pipelines
- Prompt Engineering: Master advanced prompting techniques
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.