ChatGPT API Pricing: How Much Do Tokens Cost? (2026 Guide)

Published March 2026 • 9 min read

How AI API Pricing Works

Every major AI provider — OpenAI, Anthropic, and Google — charges for API usage based on tokens, not characters, words, or requests. A token is the fundamental unit of text that language models process. In English, 1 token is roughly 4 characters or 0.75 words. The phrase "Hello, world!" is approximately 4 tokens. A 1,000-word blog post is around 1,300 tokens.
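The 4-characters-per-token rule of thumb can be sketched in a few lines of Python. This is a heuristic only; exact counts require the model's own tokenizer (e.g. OpenAI's tiktoken library), and the `estimate_tokens` helper below is an illustrative name, not a library function:

```python
def estimate_tokens(text: str) -> int:
    """Rough English-text estimate: ~4 characters per token.

    A heuristic only -- exact counts require the model's own
    tokenizer (e.g. tiktoken for OpenAI models).
    """
    return max(1, round(len(text) / 4))

# A 1,000-word post at ~5 characters per word (including spaces)
# lands in the low thousands of tokens, consistent with the
# ~1,300-token figure above.
print(estimate_tokens("Hello, world!"))
```

Heuristics like this drift for code, non-English text, and unusual formatting, so always verify with a real tokenizer before budgeting.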

Pricing is always expressed as dollars per million tokens ($/1M tokens), split into two separate rates:

  • Input tokens (prompt tokens): Every token in your request — system prompt, conversation history, user message, and any context you inject.
  • Output tokens (completion tokens): Every token the model generates in its response. These are billed separately and are typically 3 to 5 times more expensive per token than input tokens.

Understanding this distinction is crucial for cost estimation. A chatbot that generates long responses, a code generator with verbose output, or a summarization tool with detailed summaries will spend far more on output tokens than input tokens.

Rule of thumb: For most production applications, assume your costs split roughly 30% input / 70% output by dollar amount, because output tokens cost so much more per token.

OpenAI API Pricing 2026

OpenAI offers a range of models across the GPT-4 and reasoning (o-series) families. All prices below are for standard API access. Batch API pricing is approximately 50% cheaper with 24-hour turnaround.

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Context Window |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | 128K tokens |
| GPT-4o mini | $0.15 | $0.60 | 128K tokens |
| o1 | $15.00 | $60.00 | 200K tokens |
| o3-mini | $1.10 | $4.40 | 200K tokens |
| GPT-4o (cached input) | $1.25 | $10.00 | 128K tokens |

GPT-4o mini delivers about 80-85% of GPT-4o's quality on most tasks at 94% lower cost. For high-volume applications where absolute top performance is not required — classification, extraction, simple Q&A — GPT-4o mini is the default choice for cost-conscious teams. The o1 and o3-mini reasoning models are purpose-built for complex math, code, and multi-step logic tasks, and their premium pricing reflects the internal chain-of-thought computation they perform.

Anthropic Claude API Pricing

Anthropic prices its Claude 3.5 and Claude 3 model families similarly to OpenAI, with fast/cheap and capable/expensive tiers. Claude models are particularly competitive for long-context tasks, since every tier ships a 200K-token context window.

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Context Window |
|---|---|---|---|
| Claude 3.5 Sonnet | $3.00 | $15.00 | 200K tokens |
| Claude 3.5 Haiku | $0.80 | $4.00 | 200K tokens |
| Claude 3 Opus | $15.00 | $75.00 | 200K tokens |
| Claude 3.5 Sonnet (cached) | $0.30 | $15.00 | 200K tokens |

Anthropic's prompt caching feature is one of the most powerful cost-reduction tools available: repeated prompt prefixes — like long system prompts or injected documents — are cached at approximately 10% of the standard input rate. Applications that send the same large context repeatedly (RAG pipelines, document analysis) can see 70-90% cost reductions with prompt caching enabled.
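The savings arithmetic is easy to sketch with the Claude 3.5 Sonnet rates from the table above. This is an illustrative calculation only, and it ignores the one-time cache-write cost on the first request:

```python
# Savings from prompt caching, using the Claude 3.5 Sonnet rates
# above: $3.00/1M standard input, $0.30/1M cached input.
STANDARD_INPUT = 3.00 / 1_000_000   # $ per input token
CACHED_INPUT = 0.30 / 1_000_000     # $ per cached input token

def input_cost(prefix_tokens: int, fresh_tokens: int, cached: bool) -> float:
    """Input-side cost of one request with a large shared prefix."""
    prefix_rate = CACHED_INPUT if cached else STANDARD_INPUT
    return prefix_tokens * prefix_rate + fresh_tokens * STANDARD_INPUT

# 50K-token document prefix + 200-token question, over 1,000 requests:
uncached = 1_000 * input_cost(50_000, 200, cached=False)
cached = 1_000 * input_cost(50_000, 200, cached=True)
print(f"uncached ${uncached:.2f}, cached ${cached:.2f}")
```

For this workload the input bill drops from about $150.60 to about $15.60, roughly a 90% reduction, in line with the range quoted above.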

Google Gemini API Pricing

Google Gemini stands out with a generous free tier and the largest context windows in the industry (up to 2 million tokens for Gemini 1.5 Pro). Pricing scales with context window usage for Gemini 1.5 models: requests under 128K tokens are billed at the standard rate; larger contexts are billed at a higher rate.

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Context Window |
|---|---|---|---|
| Gemini 1.5 Pro (≤128K) | $1.25 | $5.00 | 2M tokens |
| Gemini 1.5 Pro (>128K) | $2.50 | $10.00 | 2M tokens |
| Gemini 1.5 Flash (≤128K) | $0.075 | $0.30 | 1M tokens |
| Gemini 2.0 Flash | $0.10 | $0.40 | 1M tokens |
| Gemini 1.5 Flash (free tier) | $0.00 | $0.00 | Rate-limited |

Gemini 1.5 Flash is the most affordable production-ready model among all three providers for standard context sizes, at $0.075/1M input tokens. For extremely cost-sensitive workloads, it is worth benchmarking Flash against GPT-4o mini and Claude 3.5 Haiku for your specific task.

Input vs Output Token Costs

The price asymmetry between input and output tokens is one of the most important — and most often overlooked — factors in AI API cost modeling. Across all three major providers, output tokens cost 3 to 5 times more per token than input tokens:

  • GPT-4o: Input $2.50 / Output $10.00 — 4x output premium
  • Claude 3.5 Sonnet: Input $3.00 / Output $15.00 — 5x output premium
  • Gemini 1.5 Pro: Input $1.25 / Output $5.00 — 4x output premium

This premium exists because generating output tokens is computationally more expensive than processing input tokens — the model must run the full autoregressive sampling process for every token it generates, while input processing is parallelized.

The practical implication: if your application generates long responses (detailed summaries, verbose code, multi-step explanations), your output token costs will dominate. Reducing average response length — through explicit length instructions, structured output formats, or shorter summaries — directly reduces your largest cost component.

How to Estimate Your API Costs

Use this formula to estimate the cost of a single API call:

Cost = (input_tokens / 1,000,000 × input_price)
     + (output_tokens / 1,000,000 × output_price)

Worked example — a GPT-4o call with a 500-token system prompt, 200-token user message, and a 1,500-token response:

Input tokens:  500 (system) + 200 (user) = 700 tokens
Output tokens: 1,500 tokens

Cost = (700 / 1,000,000 × $2.50) + (1,500 / 1,000,000 × $10.00)
     = $0.00175 + $0.01500
     = $0.01675 per call

At 10,000 calls/month: $0.01675 × 10,000 = $167.50/month

The same workload on GPT-4o mini:

Cost = (700 / 1,000,000 × $0.15) + (1,500 / 1,000,000 × $0.60)
     = $0.000105 + $0.000900
     = $0.001005 per call

At 10,000 calls/month: $0.001005 × 10,000 = $10.05/month

The model choice alone produces a 16x cost difference. Always benchmark cheaper models first — many production use cases run acceptably on GPT-4o mini or Gemini 1.5 Flash.
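The formula and worked examples above can be wrapped in a small helper. Prices are hard-coded from the tables in this guide; `call_cost` is an illustrative name, not a library function:

```python
# Per-call and monthly cost from the formula above.
# Prices are (input $/1M, output $/1M) as listed in the tables.
PRICES = {
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single API call at the listed rates."""
    input_price, output_price = PRICES[model]
    return (input_tokens / 1_000_000 * input_price
            + output_tokens / 1_000_000 * output_price)

# The worked example: 700 input tokens, 1,500 output tokens.
for model in PRICES:
    per_call = call_cost(model, 700, 1_500)
    print(f"{model}: ${per_call:.6f}/call, ${per_call * 10_000:.2f}/month")
```

Running this reproduces the figures above: $0.016750/call ($167.50/month) for GPT-4o versus $0.001005/call ($10.05/month) for GPT-4o mini.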

7 Ways to Reduce API Costs

1. Choose the right model for the task

Use frontier models (GPT-4o, Claude 3.5 Sonnet) only for tasks that genuinely require their capability. Route classification, extraction, and simple Q&A to GPT-4o mini, Gemini Flash, or Claude Haiku. A routing layer that classifies request complexity first can reduce costs by 60-80%.
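A minimal sketch of such a routing layer, assuming a crude length-and-keyword heuristic in place of a real complexity classifier (the model names are from the tables above; everything else is hypothetical):

```python
# Hypothetical routing layer: send simple requests to a cheap model
# and escalate complex ones to a frontier model. The heuristic below
# is illustrative -- production routers often use a small classifier.
CHEAP_MODEL = "gpt-4o-mini"
FRONTIER_MODEL = "gpt-4o"

COMPLEX_HINTS = ("prove", "refactor", "multi-step", "analyze", "debug")

def route(prompt: str) -> str:
    """Pick a model by a crude complexity heuristic."""
    looks_complex = len(prompt) > 2_000 or any(
        hint in prompt.lower() for hint in COMPLEX_HINTS
    )
    return FRONTIER_MODEL if looks_complex else CHEAP_MODEL

print(route("Classify this ticket as billing or technical."))
print(route("Debug this race condition across three services."))
```

The first request routes to the cheap model, the second to the frontier model. How aggressive the heuristic should be depends entirely on your quality tolerance, so benchmark both paths on real traffic.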

2. Enable prompt caching

Both OpenAI and Anthropic offer prompt caching for repeated prefixes. If your application sends the same system prompt or document context on every request, cached tokens cost 50-90% less than uncached ones. Because caching applies to repeated prefixes, place the stable content at the start of your prompt and the variable content at the end.

3. Compress and optimize system prompts

System prompts are billed on every turn of a conversation. A 1,000-token system prompt in a 20-turn conversation costs 20,000 input tokens for the prompt alone. Audit your system prompts: remove redundant instructions, use bullet points instead of prose, and cut any examples that can instead be supplied as few-shot examples in the first user turn.
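The arithmetic is worth making explicit. A quick sketch using GPT-4o's input rate from the pricing table (`system_prompt_cost` is an illustrative helper, not a library function):

```python
# Cumulative cost of resending a system prompt on every turn,
# at GPT-4o's $2.50/1M input rate from the table above.
INPUT_PRICE = 2.50 / 1_000_000  # $ per input token

def system_prompt_cost(prompt_tokens: int, turns: int) -> float:
    """Input-token cost of the system prompt alone over a conversation."""
    return prompt_tokens * turns * INPUT_PRICE

# 1,000-token prompt x 20 turns = 20,000 tokens -> $0.05 per conversation,
# or $500/month at 10,000 conversations -- before any other input tokens.
print(f"${system_prompt_cost(1_000, 20):.2f}")
```

Halving the system prompt halves that figure directly, which is why prompt audits tend to pay for themselves quickly at volume.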

4. Limit output length explicitly

Instruct the model to be concise. Add phrases like "Reply in 2-3 sentences" or "Respond in under 100 words" to your prompts for tasks where verbosity is not needed. Use the max_tokens parameter to hard-cap output length and prevent runaway generation costs.

5. Use the Batch API

OpenAI's Batch API and similar asynchronous processing endpoints offer 50% discounts for workloads that can tolerate 24-hour turnaround. Document processing, data extraction, classification pipelines, and nightly report generation are excellent candidates for batching.

6. Implement semantic caching

For applications where users ask similar questions repeatedly (customer support, FAQs, documentation search), caching previous responses by semantic similarity can serve 20-40% of requests from cache at zero API cost. Tools like GPTCache or a simple vector store can implement this.
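A toy sketch of the idea, using bag-of-words cosine similarity as a stand-in for a real embedding model (a production system would use an embedding API and a vector store; the class and threshold below are illustrative):

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' -- stand-in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

class SemanticCache:
    """Serve a stored answer when a new query is similar enough."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.entries = []  # (embedding, answer) pairs

    def get(self, query: str):
        q = embed(query)
        for vec, answer in self.entries:
            if cosine(q, vec) >= self.threshold:
                return answer  # cache hit: zero API cost
        return None  # cache miss: fall through to the API

    def put(self, query: str, answer: str) -> None:
        self.entries.append((embed(query), answer))

cache = SemanticCache()
cache.put("how do i reset my password", "Use the 'Forgot password' link.")
print(cache.get("how do I reset my password?"))  # near-duplicate -> hit
```

The threshold controls the precision/recall trade-off: too low and users get stale or mismatched answers, too high and the cache rarely fires.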

7. Chunk and filter documents before injection

Sending entire documents as context is expensive and often counterproductive — models struggle with very long contexts. Use a retrieval-augmented generation (RAG) pipeline to extract only the 2-5 most relevant chunks for each query using a vector database and a cheap embedding model. This reduces input tokens dramatically while often improving answer quality.
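A minimal sketch of the retrieval step, using word overlap as a stand-in for embedding similarity (a real pipeline would score chunks with an embedding model and a vector database; `top_k_chunks` is an illustrative helper):

```python
import re

def words(text: str) -> set:
    """Lowercased word set, punctuation stripped."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def top_k_chunks(query: str, chunks: list, k: int = 3) -> list:
    """Rank chunks by word overlap with the query and keep the top k --
    a toy stand-in for vector-database retrieval."""
    q = words(query)
    scored = sorted(chunks, key=lambda c: len(q & words(c)), reverse=True)
    return scored[:k]

chunks = [
    "Refund policy: refunds are issued within 14 days.",
    "Shipping times vary by region.",
    "Our refund form requires the original order number.",
]
print(top_k_chunks("how do I get a refund?", chunks, k=2))
```

Only the two refund-related chunks are injected into the prompt; the irrelevant shipping chunk never costs an input token.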

Calculate Your Token Count

Before you can estimate API costs accurately, you need to know how many tokens your prompts and documents actually contain. Paste any text — system prompts, documents, code, or conversation examples — into the devbit.dev AI Token Counter to instantly see token counts across all major AI models and get per-call cost estimates.

Count Tokens & Estimate API Costs

Paste your prompt or document to see exact token counts for GPT-4o, Claude, Gemini, and 10+ models. Compare context window usage and estimate costs — 100% free, no API key needed.

Open AI Token Counter →

Related Developer Tools