OpenAI API Guide — Chat Completions, Embeddings & Function Calling

DodaTech 13 min read

This tutorial covers the key concepts of the OpenAI API, with practical examples and best practices to help you apply them effectively.

The OpenAI API provides programmatic access to GPT-4, GPT-3.5, embedding models, and DALL-E image generation, enabling developers to integrate AI into their applications with simple HTTP calls and official client libraries.

What You'll Learn

In this tutorial, you'll build real-world applications with the OpenAI API using Python, covering chat completions, streaming responses, function calling (tool use), embeddings for semantic search, and DALL-E image generation.

Why It Matters

The OpenAI API is one of the most widely used LLM APIs in production. Understanding its capabilities, from simple chat to structured data extraction to vector embeddings, is essential for building modern AI features. Whether you are building a chatbot, a search system, or an automation tool, the OpenAI API provides the foundation.

Real-World Use

Doda Browser uses the OpenAI API for its AI assistant: GPT-4 handles complex reasoning and tool use, embeddings power semantic search over browser history and bookmarks, and DALL-E generates custom thumbnails for visual content.

Architecture Overview

The OpenAI API follows a request-response pattern. Your application sends a request to the appropriate endpoint, the API processes it using the selected model, and returns the result.

graph LR
    A["Your Application"] -->|"HTTP Request"| B["OpenAI API Gateway"]
    B -->|"Route"| C["Model Endpoint"]
    C --> D["GPT-4 / GPT-3.5"]
    C --> E["text-embedding-3-small"]
    C --> F["DALL-E 3"]
    D -->|"Response"| B
    E -->|"Vector Response"| B
    F -->|"Image URL"| B
    B -->|"JSON Response"| A
    G["API Key"] -.->|"Authentication"| B

The API key authenticates every request. The URL path selects the operation (chat completions, embeddings, or image generation), and the model name in your request body selects the model that serves it. All responses return as structured JSON.
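To make the request-response flow concrete, the sketch below assembles the parts of a raw Chat Completions call without sending anything. The helper name build_chat_request and the placeholder key are illustrative assumptions; the endpoint URL, the Bearer authorization header, and the model and messages body fields are the ones the rest of this guide relies on.

```python
import json

API_URL = "https://api.openai.com/v1/chat/completions"


def build_chat_request(api_key, model, messages):
    """Return (url, headers, body) for a Chat Completions call. Nothing is sent."""
    headers = {
        "Authorization": f"Bearer {api_key}",  # the API key authenticates the request
        "Content-Type": "application/json",
    }
    # The model name travels in the JSON body, next to the messages
    body = json.dumps({"model": model, "messages": messages})
    return API_URL, headers, body


url, headers, body = build_chat_request(
    "sk-placeholder",  # placeholder, not a real key
    "gpt-4",
    [{"role": "user", "content": "Hello"}],
)
print(url)
print(json.loads(body)["model"])
```

The official client library builds an equivalent request for you, so this is only useful for understanding what goes over the wire.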

Setup and Authentication

Before making any API calls, you need an OpenAI account and an API key. Install the official Python client library, which wraps the REST API with convenient methods, automatic retries, and type hints.

pip install openai

Create a client instance with your API key. Store the key in an environment variable rather than hard-coding it into your source code.

from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY")
)

print("Client created successfully")

Expected output:

Client created successfully

The client uses the key from the environment variable OPENAI_API_KEY. If no key is found, the client raises an OpenAIError as soon as it is constructed, before any request is made. Always use environment variables or a secret manager, and never commit API keys to version control.
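If you prefer to check the key yourself at startup, with an error message tailored to your application, a small helper along these lines may be enough. The function name load_api_key and the error wording are assumptions for illustration; only the OPENAI_API_KEY variable name comes from the text above.

```python
import os


def load_api_key(env=None):
    """Return the API key from the environment, or fail with a clear message."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; export it before starting the app")
    return key


# Usage with a stand-in mapping instead of the real environment
print(load_api_key({"OPENAI_API_KEY": "sk-placeholder"}))
```

Accepting the environment as a parameter keeps the helper easy to test without touching real secrets.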

Chat Completions API

The Chat Completions endpoint is the core of the OpenAI API. You send a list of messages with roles (system, user, assistant) and the model generates a response. GPT-4 and GPT-3.5 are the primary models, with GPT-4 offering higher reasoning capability at higher cost.

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {
            "role": "system",
            "content": "You are a senior Python developer. Provide concise, correct answers."
        },
        {
            "role": "user",
            "content": "What is a decorator in Python and how do I write one?"
        }
    ],
    temperature=0.3,
    max_tokens=300
)

print(response.choices[0].message.content)

Expected output (abbreviated):

A decorator in Python is a function that takes another function and extends its behavior without explicitly modifying it. You write one using the @ syntax:

```python
def my_decorator(func):
    def wrapper(*args, **kwargs):
        print("Before the function call")
        result = func(*args, **kwargs)
        print("After the function call")
        return result
    return wrapper

@my_decorator
def say_hello():
    print("Hello!")

say_hello()
```

This prints "Before the function call", then "Hello!", then "After the function call".


Key parameters:

  • model: The model ID ("gpt-4", "gpt-3.5-turbo"). Newer models like "gpt-4o" offer improved performance.
  • messages: An array of message objects. The system message sets the assistant behavior. The user message contains the user input.
  • temperature: Controls randomness on a scale from 0 to 2 (0 = most deterministic, 2 = very random). Lower values are better for factual tasks.
  • max_tokens: The maximum number of tokens in the response. Limits cost and response length.

Message Roles

The examples in this section use three roles:

  • system: Sets the behavior and personality of the assistant. This is where you provide instructions.
  • user: The input from the end user. Can include text or images (with a vision-capable model such as GPT-4 Vision).
  • assistant: Previous responses from the model. Required for multi-turn conversations to provide context.

A fourth role, tool, carries function results back to the model and is covered under Function Calling.
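Because the API is stateless, a multi-turn conversation means resending the full history on every request. The sketch below shows one way to keep that history; the Conversation class and its method names are hypothetical, and the assistant reply is a stand-in string rather than real model output.

```python
class Conversation:
    """Keeps the message history that is sent with every request."""

    def __init__(self, system_prompt):
        self.messages = [{"role": "system", "content": system_prompt}]

    def add_user(self, text):
        self.messages.append({"role": "user", "content": text})

    def add_assistant(self, text):
        # Store the model's reply so the next request carries the context
        self.messages.append({"role": "assistant", "content": text})


chat = Conversation("You are a senior Python developer.")
chat.add_user("What is a decorator?")
chat.add_assistant("A function that wraps another function.")  # stand-in reply
chat.add_user("Show me an example.")
print([m["role"] for m in chat.messages])
```

In a real application you would pass chat.messages as the messages argument and call add_assistant with the content of each response.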

Streaming Responses

For interactive applications, streaming delivers tokens as they are generated rather than waiting for the complete response. This reduces perceived latency and enables real-time display of text as it appears.

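```python
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "user", "content": "Write a haiku about streaming APIs."}
    ],
    stream=True
)

# Print each piece of text as soon as it arrives
for chunk in stream:
    # Skip chunks that carry no choices or no text (for example, the final chunk)
    if chunk.choices and chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()  # finish the output with a newline
```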

Expected output:

Data flows in chunks,
Tokens arrive one by one,
Response feels alive.

Enable streaming by passing stream=True to the create method. Instead of a single response object, the API yields chunk objects, each containing a delta with the next piece of content. Passing end="" stops print from appending a newline after every chunk, so the pieces join into continuous text as they arrive.
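Most applications also need the complete text once the stream ends, for logging or for appending to the conversation history. The sketch below accumulates the pieces while still reporting each one through a callback; collect_stream and fake_chunk are hypothetical names, and the fake chunks only mimic the choices[0].delta.content attribute path used above.

```python
from types import SimpleNamespace


def collect_stream(stream, on_token=None):
    """Concatenate the delta.content pieces from a stream of chunk objects."""
    parts = []
    for chunk in stream:
        if not chunk.choices:  # some chunks may carry no choices
            continue
        piece = chunk.choices[0].delta.content
        if piece is None:  # for example, the final chunk
            continue
        parts.append(piece)
        if on_token is not None:
            on_token(piece)  # e.g. print it or forward it to a client
    return "".join(parts)


def fake_chunk(text):
    # Stand-in with the same attribute shape as a streamed chunk
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


fake_stream = [fake_chunk("Data "), fake_chunk("flows"), fake_chunk(None)]
full_text = collect_stream(fake_stream)
print(full_text)
```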

Handling Streaming in Web Applications

For web apps, you typically stream tokens to the browser via Server-Sent Events (SSE) or WebSockets. The backend (Python in this tutorial) reads the stream and forwards each chunk to the frontend without buffering.
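As a rough illustration of the SSE side, the generator below turns text pieces into SSE frames that a web framework could write to the response. The function name sse_events, the JSON payload shape, and the closing "done" event are assumptions made for this sketch; only the frame layout (a data: line followed by a blank line) comes from the SSE format itself.

```python
import json


def sse_events(pieces):
    """Yield one SSE frame per text piece, then a final 'done' event."""
    for piece in pieces:
        # JSON-encode the piece so newlines inside it cannot break the frame
        payload = json.dumps({"text": piece})
        yield f"data: {payload}\n\n"
    yield "event: done\ndata: {}\n\n"  # hypothetical end-of-stream marker


frames = list(sse_events(["Hello", " world\n"]))
for frame in frames:
    print(repr(frame))
```

Encoding each piece as JSON is a design choice: a raw newline inside a data: line would otherwise split the event.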

Function Calling (Tool Use)

Function calling lets the model intelligently choose to call external functions. The model outputs structured JSON arguments for functions you define, and your code executes the function and returns the result. This enables the model to perform actions, query databases, or call other API endpoints.

from openai import OpenAI

client = OpenAI()

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name, e.g. San Francisco"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"]
                    }
                },
                "required": ["location"]
            }
        }
    }
]

messages = [
    {"role": "user", "content": "What is the weather in Tokyo?"}
]

response = client.chat.completions.create(
    model="gpt-4",
    messages=messages,
    tools=tools,
    tool_choice="auto"
)

choice = response.choices[0]
if choice.finish_reason == "tool_calls":
    tool_call = choice.message.tool_calls[0]
    print(f"Function: {tool_call.function.name}")
    print(f"Arguments: {tool_call.function.arguments}")

Expected output:

Function: get_weather
Arguments: {"location": "Tokyo"}

The model recognized that the user is asking about weather and returned a tool_calls request with the get_weather function and the argument {"location": "Tokyo"}. Your code then executes the actual weather API call and sends the result back to the model.
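The arguments arrive as a JSON string produced by the model, so it is worth checking them before calling real code. The sketch below parses the string and applies the required and enum rules from the schema; parse_tool_arguments is a hypothetical helper and covers only those two rules, not full JSON Schema validation.

```python
import json


def parse_tool_arguments(raw_arguments, parameters):
    """Parse a tool call's JSON argument string and check it against the schema."""
    args = json.loads(raw_arguments)  # raises ValueError (JSONDecodeError) if malformed
    if not isinstance(args, dict):
        raise ValueError("tool arguments must be a JSON object")
    properties = parameters.get("properties", {})
    for name in parameters.get("required", []):
        if name not in args:
            raise ValueError(f"missing required argument: {name}")
    for name, value in args.items():
        if name not in properties:
            raise ValueError(f"unexpected argument: {name}")
        allowed = properties[name].get("enum")
        if allowed is not None and value not in allowed:
            raise ValueError(f"invalid value for {name}: {value!r}")
    return args


# Same shape as the "parameters" object in the tools definition above
weather_parameters = {
    "type": "object",
    "properties": {
        "location": {"type": "string"},
        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
    },
    "required": ["location"],
}

print(parse_tool_arguments('{"location": "Tokyo"}', weather_parameters))
```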

Multi-Turn Tool Use

After the model requests a function call, your code must execute the function and append both the assistant message (with tool_calls) and a tool response message to continue the conversation:

import json

def get_weather(location, unit="celsius"):
    weather_data = {
        "Tokyo": {"temperature": 22, "condition": "Sunny"},
        "San Francisco": {"temperature": 18, "condition": "Foggy"},
    }
    data = weather_data.get(location, {"temperature": 20, "condition": "Unknown"})
    return json.dumps(data)

# Append the assistant's tool call message
messages.append(choice.message)

# Execute the function and append the result
for tool_call in choice.message.tool_calls:
    result = get_weather(**json.loads(tool_call.function.arguments))
    messages.append({
        "role": "tool",
        "tool_call_id": tool_call.id,
        "content": result
    })

# Get the final response
final = client.chat.completions.create(
    model="gpt-4",
    messages=messages,
    tools=tools
)
print(final.choices[0].message.content)

Expected output:

The weather in Tokyo is currently sunny with a temperature of 22 degrees Celsius.
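Once there is more than one tool, a lookup table from tool name to Python function keeps the loop above generic. The sketch below is one possible shape; TOOL_REGISTRY and run_tool_calls are hypothetical names, the weather data is canned, and the fake tool call only mimics the id, function.name, and function.arguments attributes used earlier.

```python
import json
from types import SimpleNamespace


def get_weather(location, unit="celsius"):
    # Canned data for illustration only
    return json.dumps({"location": location, "temperature": 22, "unit": unit})


TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_calls(tool_calls, registry):
    """Execute each requested tool and return the matching tool messages."""
    messages = []
    for call in tool_calls:
        func = registry.get(call.function.name)
        if func is None:
            # Report unknown tools back to the model instead of crashing
            content = json.dumps({"error": f"unknown tool: {call.function.name}"})
        else:
            content = func(**json.loads(call.function.arguments))
        messages.append({"role": "tool", "tool_call_id": call.id, "content": content})
    return messages


fake_call = SimpleNamespace(
    id="call_1",
    function=SimpleNamespace(name="get_weather", arguments='{"location": "Tokyo"}'),
)
tool_messages = run_tool_calls([fake_call], TOOL_REGISTRY)
print(tool_messages[0]["content"])
```

Each returned message reuses the tool_call_id of the request it answers, which is how the model matches results to calls.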

Embeddings API

The Embeddings API converts text into a vector of floating-point numbers. These vectors capture semantic meaning, enabling similarity search, clustering, and classification. OpenAI's text-embedding-3-small and text-embedding-3-large models offer high quality at low cost.

from openai import OpenAI

client = OpenAI()

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="The cat sat on the mat."
)

embedding = response.data[0].embedding
print(f"Vector length: {len(embedding)}")
print(f"First 5 dimensions: {embedding[:5]}")

Expected output:

Vector length: 1536
First 5 dimensions: [-0.009858479790389538, -0.009932905435562134, 0.029549377038121224, -0.030516650527715683, -0.0032712684888392687]

Each text input produces a fixed-size vector. The text-embedding-3-small model returns 1536 dimensions, while text-embedding-3-large returns 3072. You can reduce dimensions via the dimensions parameter to save storage and computation.
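To give a feel for what shortening a vector involves, the sketch below truncates an already-fetched vector on the client and rescales it to unit length so that cosine and dot-product comparisons stay meaningful. This is an illustrative stand-in and not a description of what the dimensions parameter does server-side; shorten_embedding is a hypothetical helper and the input is a made-up toy vector.

```python
import math


def shorten_embedding(vector, dims):
    """Keep the first `dims` values and rescale the result to unit length."""
    if not 0 < dims <= len(vector):
        raise ValueError("dims must be between 1 and len(vector)")
    head = list(vector[:dims])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        return head  # nothing to rescale
    return [x / norm for x in head]


toy_vector = [3.0, 4.0, 12.0]  # made-up values, not a real embedding
short = shorten_embedding(toy_vector, 2)
print(short)
```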

Semantic Search Example

Use embeddings to find the most similar document to a query:

import numpy as np
from openai import OpenAI

client = OpenAI()

documents = [
    "Python is a programming language used for web development and data science.",
    "OpenAI provides APIs for language models and embeddings.",
    "Doda Browser is a privacy-focused web browser.",
]

query = "What programming language should I learn for AI?"

# Generate embeddings
doc_embeddings = client.embeddings.create(
    model="text-embedding-3-small",
    input=documents
).data

query_embedding = client.embeddings.create(
    model="text-embedding-3-small",
    input=query
).data[0].embedding

# Compute cosine similarity
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

scores = [
    cosine_similarity(query_embedding, de.embedding)
    for de in doc_embeddings
]

best = np.argmax(scores)
print(f"Most relevant document: {documents[best]}")
print(f"Similarity score: {scores[best]:.4f}")

Expected output (the exact score will vary):

Most relevant document: Python is a programming language used for web development and data science.
Similarity score: 0.7842

The query embedding is compared against each document embedding using cosine similarity. The document with the highest score is the most semantically similar. This approach scales to millions of documents using vector databases like Pinecone or Weaviate.
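Returning several ranked results instead of a single best match is a small extension of the same idea. The sketch below ranks documents by cosine similarity and keeps the top k; it uses plain Python lists and made-up two-dimensional vectors so it runs without any API call, and top_k is a hypothetical helper name.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=3):
    """Return (index, score) pairs for the k most similar documents, best first."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


toy_docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # made-up vectors
ranking = top_k([1.0, 0.1], toy_docs, k=2)
print([index for index, _ in ranking])
```

A linear scan like this is fine for a few thousand documents; beyond that, a vector database takes over the nearest-neighbor search.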

Image Generation (DALL-E)

DALL-E 3 generates images from text descriptions. The API returns either a URL to the generated image or base64-encoded image data.

from openai import OpenAI

client = OpenAI()

response = client.images.generate(
    model="dall-e-3",
    prompt="A vector illustration of a cat typing on a laptop in a coffee shop, digital art style",
    n=1,
    size="1024x1024",
    quality="standard"
)

image_url = response.data[0].url
print(f"Generated image URL: {image_url}")

Expected output:

Generated image URL: https://oaidalleapiprodscus.blob.core.windows.net/private/org-...

The URL is temporary and expires about an hour after generation. For permanent storage, download the image immediately and save it to your own storage or CDN. The size parameter supports "1024x1024", "1792x1024", and "1024x1792". The quality parameter accepts "standard" or "hd".
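If you request base64 data instead of a URL, there is no expiry to worry about, but you have to decode and store the bytes yourself. The sketch below assumes the request was made with response_format="b64_json", so that the encoded image is available as response.data[0].b64_json; save_b64_image is a hypothetical helper, and the demo uses stand-in bytes, not a real image.

```python
import base64
import os
import tempfile
from pathlib import Path


def save_b64_image(b64_data, path):
    """Decode base64 image data, write it to disk, and return the byte count."""
    raw = base64.b64decode(b64_data)
    Path(path).write_bytes(raw)
    return len(raw)


# Stand-in payload; a real call would pass response.data[0].b64_json
encoded = base64.b64encode(b"not a real image").decode("ascii")
target = os.path.join(tempfile.gettempdir(), "thumbnail_demo.png")
written = save_b64_image(encoded, target)
print(written)
```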

Best Practices

Cost Optimization

  • Use GPT-3.5-turbo for simple tasks and GPT-4 only for complex reasoning. GPT-3.5-turbo costs at least 10x less per token than GPT-4; check the pricing page for current rates.
  • Set max_tokens to the lowest value that still fits a complete answer. Every generated token is billed, and a response cut off at the limit (finish_reason "length") wastes the tokens already spent.
  • Cache responses for identical or similar queries using a Redis or database cache with TTL.
  • Reduce embedding dimensions from 3072 to 256 or 512 for large-scale retrieval. Lower dimensions reduce storage and latency with minimal accuracy loss.
  • Use text-embedding-3-small instead of text-embedding-3-large when possible. The small model costs roughly 6x less per token; check the pricing page for current rates.
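The caching advice can be sketched with an in-memory store. The class below keys entries by a hash of the model and messages and expires them after a time-to-live; ResponseCache is a hypothetical name, it only matches identical requests (matching similar queries would need embeddings), and a production setup would more likely use Redis as suggested above.

```python
import hashlib
import json
import time


class ResponseCache:
    """In-memory cache keyed by model and messages, with a time-to-live."""

    def __init__(self, ttl_seconds=300, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without waiting
        self._store = {}

    @staticmethod
    def _key(model, messages):
        payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, model, messages):
        entry = self._store.get(self._key(model, messages))
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            return None  # entry has expired
        return value

    def put(self, model, messages, value):
        self._store[self._key(model, messages)] = (self.clock() + self.ttl, value)


cache = ResponseCache(ttl_seconds=60)
question = [{"role": "user", "content": "Hello"}]
if cache.get("gpt-3.5-turbo", question) is None:
    cache.put("gpt-3.5-turbo", question, "Hi there!")  # stand-in for a real API response
print(cache.get("gpt-3.5-turbo", question))
```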

Rate Limiting

OpenAI enforces rate limits at the organization and project level rather than per individual API key, with ceilings (requests per minute, tokens per minute) determined by your usage tier. Handle rate limits with exponential backoff:

import time
from openai import OpenAI, RateLimitError

client = OpenAI()

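for attempt in range(5):
    try:
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": "Hello"}]
        )
        break
    except RateLimitError:
        wait = 2 ** attempt
        print(f"Rate limited. Retrying in {wait}s...")
        time.sleep(wait)
else:
    # Every attempt was rate limited: fail loudly instead of continuing without a response
    raise RuntimeError("Still rate limited after 5 attempts")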

Expected output (printed only if a rate limit is hit):

Rate limited. Retrying in 1s...

The exponential backoff starts at 1 second and doubles each attempt, giving the rate limiter time to reset. For production, use the OpenAI Python library's built-in retry mechanism or a queue-based approach with backpressure.
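A common refinement is to add random jitter so that many clients do not retry at the same instant. The sketch below wraps any callable in such a retry loop; call_with_backoff, TransientError, and the flaky function are hypothetical stand-ins, and with the OpenAI library you would pass RateLimitError as retry_on.

```python
import random
import time


class TransientError(Exception):
    """Stand-in for a retryable error such as a rate limit."""


def call_with_backoff(func, retry_on, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func(); on a retryable error wait base_delay * 2**attempt plus jitter."""
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            sleep(delay)


calls = {"count": 0}


def flaky():
    # Fails twice, then succeeds
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("try again")
    return "ok"


delays = []
result = call_with_backoff(flaky, TransientError, sleep=delays.append)  # record delays instead of sleeping
print(result, len(delays))
```

Passing sleep as a parameter is a testing convenience; in production the default time.sleep applies.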

Error Handling

Common errors and their resolutions:

| HTTP Code | Error Type | Cause | Resolution |
| --- | --- | --- | --- |
| 401 | AuthenticationError | Invalid or missing API key | Check your OPENAI_API_KEY environment variable |
| 429 | RateLimitError | Exceeded rate limit | Implement exponential backoff or upgrade tier |
| 400 | BadRequestError | Invalid parameters | Validate model, messages, and parameter values |
| 500 | APIError | Server-side issue | Retry with backoff; check status.openai.com |

Always wrap API calls in try-except blocks to handle errors gracefully in production:

from openai import OpenAI, APIError, RateLimitError, APIConnectionError

client = OpenAI()

try:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Hello"}]
    )
except RateLimitError:
    print("Too many requests. Please slow down.")
except APIConnectionError:
    print("Network error. Check your internet connection.")
except APIError as e:
    print(f"OpenAI API error: {e}")

Expected output:

(No error — response returned successfully)

Practice Questions

  1. What is the difference between gpt-4 and gpt-3.5-turbo in terms of capability and cost?

  2. How does the streaming response differ from a standard Chat Completions response?

  3. What is the purpose of the tools parameter in a Chat Completions request?

  4. Why would you use text-embedding-3-small over text-embedding-3-large?

  5. What HTTP status code indicates a rate limit has been exceeded, and how should you handle it?

Answers:

  1. GPT-4 has higher reasoning capability but costs at least 10x more per token than GPT-3.5-turbo. Use GPT-3.5-turbo for simple tasks and GPT-4 for complex reasoning.

  2. Streaming delivers tokens one at a time as they are generated, reducing perceived latency. Standard responses return the complete text after the model finishes generating.

  3. The tools parameter defines functions the model can call. When the model determines a function should be invoked, it returns a tool_calls response with the function name and arguments in JSON.

  4. text-embedding-3-small costs roughly 6x less and returns 1536 dimensions versus 3072. For many use cases, the smaller model provides sufficient quality with lower storage and latency.

  5. HTTP 429 (RateLimitError). Handle it with exponential backoff: wait 2^attempt seconds before retrying, up to a maximum number of retries.

Challenge

Build a multi-turn chatbot that can answer questions about a dataset by generating and executing SQL queries. The chatbot should:

  1. Accept a natural language question from the user.
  2. Use function calling to generate a SQL query based on the question.
  3. Execute the SQL query against a SQLite database (provide a sample database with a products table).
  4. Return the query results to the model to generate a natural language answer.
  5. Handle follow-up questions that reference the previous context.

Real-World Task

You are building an FAQ system for a customer support portal. Use the OpenAI Embeddings API to create a semantic search system that retrieves the most relevant FAQ article for a given user query. Store the embeddings in a vector database or a numpy array file (faq_embeddings.npy). When a user types a question, find the top 3 most similar FAQs and display them ranked by similarity score.

Frequently Asked Questions

How do I get an OpenAI API key?

Sign up at platform.openai.com, navigate to the API keys section, and create a new secret key. Store it securely in an environment variable. New accounts may receive free trial credits, depending on OpenAI's current policy; check the billing page for availability and expiry.

What is the difference between gpt-4 and gpt-4o?

GPT-4o ("omni") is OpenAI's multimodal model that accepts text and image inputs and produces text outputs. It is faster and cheaper than GPT-4 while maintaining similar quality for most tasks. GPT-4o also has a larger context window (128K tokens) compared to GPT-4 (8K or 32K).

Can I use the OpenAI API for production applications?

Yes. The OpenAI API is designed for production use; check OpenAI's current terms for the service-level commitments that apply to your plan. Monitor usage in the OpenAI dashboard, set spending limits to control costs, and implement retry logic for rate limits and transient errors.

How do embeddings work for semantic search?

Embeddings convert text into numerical vectors. Similar texts produce vectors close together in vector space. To search, compute the embedding of the query, then find the nearest neighbor vectors using cosine similarity or dot product. The documents corresponding to those vectors are the most semantically relevant results.

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
