How Much Does AI Inference Actually Cost?

Real numbers from 2025: token prices, image prices, and the hidden costs most comparisons leave out.

June 2025·8 min read

Why pricing is confusing

AI inference pricing varies by an order of magnitude across providers, and the comparisons are rarely apples-to-apples. Some providers bill input and output tokens separately. Some quote prices per million tokens, others per thousand. Image generators charge per image, per step, or per second of compute. And hourly GPU rental prices say nothing about the real total cost once you account for idle time.

This article cuts through the noise with comparable numbers for the same workloads.
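As a rough sketch, the unit mismatches described above can be normalized before comparing providers. The helper names and the 25% output share are illustrative assumptions, not part of any provider's API; the GPT-4o mini prices are the ones quoted in the table in the next section.

```python
def per_million(price, unit_tokens):
    """Convert a price quoted per `unit_tokens` tokens into a price per 1M tokens."""
    if unit_tokens <= 0:
        raise ValueError("unit_tokens must be positive")
    return price * 1_000_000 / unit_tokens


def blended_price(input_price, output_price, output_share):
    """Weighted per-1M price when `output_share` of all tokens are output tokens."""
    if not 0.0 <= output_share <= 1.0:
        raise ValueError("output_share must be between 0 and 1")
    return input_price * (1.0 - output_share) + output_price * output_share


# A price quoted as $0.00015 per 1K tokens is the same as $0.15 per 1M tokens.
normalized = per_million(0.00015, 1_000)

# GPT-4o mini at $0.15 (in) / $0.60 (out), assuming 25% of tokens are output.
# The 25% share is a made-up workload mix; measure your own traffic.
blended = blended_price(0.15, 0.60, 0.25)

print(f"normalized: ${normalized:.2f}/1M, blended: ${blended:.4f}/1M")
```

The blended figure is what makes a split input/output price comparable to a single flat per-token price, but it is only as good as the assumed output share.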

LLM pricing: Llama 3-class models

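For small models in or near the 7–8B parameter range (Llama 3.1 8B, Mistral 7B, etc.), here are 2025 prices per million tokens. GPT-4o mini is a closed model included as a reference point, and the Lexora row is a smaller 3B model, so those two rows are not strictly like-for-like: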

| Provider | Model | $/1M tokens | Billing model |
| --- | --- | --- | --- |
| OpenAI | GPT-4o mini | $0.15 (in) / $0.60 (out) | Per token |
| Together AI | Llama 3.1 8B | $0.18 | Per token |
| Groq | Llama 3.1 8B | $0.05 | Per token |
| Lexora | Llama 3.2 3B | $0.04 | Per token |
| RunPod (A100) | Self-hosted | ~$2.10 effective | $/hr (24/7) |

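The RunPod number assumes 20% utilization (a realistic average for a startup). A rented GPU bills by the hour whether or not it is serving requests, so the effective per-token cost scales inversely with utilization. At 100% utilization it would be ~$0.42/1M ($0.42 / 0.20 = $2.10), but no one sustains 100% utilization.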
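The utilization arithmetic can be sketched in a few lines. This assumes, as the figures above imply, that cost per token is simply the full-utilization cost divided by the fraction of time the GPU is busy; the function name is illustrative.

```python
def effective_cost_per_million(cost_at_full_utilization, utilization):
    """Effective $/1M tokens for an hourly-billed GPU that is busy only part of the time.

    cost_at_full_utilization: $/1M tokens if the GPU were busy 100% of the time.
    utilization: fraction of billed hours spent serving requests (0 < u <= 1).
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return cost_at_full_utilization / utilization


# The article's self-hosted figure: ~$0.42/1M at full utilization.
for u in (1.0, 0.5, 0.2, 0.1):
    cost = effective_cost_per_million(0.42, u)
    print(f"{u:>4.0%} utilization -> ~${cost:.2f} per 1M tokens")
```

At 20% utilization this reproduces the ~$2.10 figure in the table; halving utilization again doubles the effective price.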

LLM pricing: larger models

For 70B-class models (Llama 3.1 70B, Mixtral 8x22B, etc.), again in dollars per million tokens, with GPT-4o as a closed-model reference point:

| Provider | Model | $/1M tokens |
| --- | --- | --- |
| OpenAI | GPT-4o | $2.50 (in) / $10.00 (out) |
| Together AI | Llama 3.1 70B | $0.88 |
| Groq | Llama 3.1 70B | $0.59 |
| Fireworks AI | Llama 3.1 70B | $0.90 |

Image generation pricing

Image generation pricing is harder to compare because quality, resolution, and steps all vary. Here are approximate costs for a standard 1024px image:

| Provider | Model | $/image |
| --- | --- | --- |
| OpenAI | DALL-E 3 (1024×1024) | $0.040 |
| Stability AI | SD3 Medium | $0.035 |
| Together AI | FLUX.1 schnell | $0.003 |
| Lexora | FLUX.1 schnell | $0.002 |
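To see what these per-image prices mean at volume, here is a minimal sketch that multiplies them out for a monthly workload. The prices are the ones in the table above; the 100,000 images/month volume is a hypothetical workload, and quality differences between models are ignored.

```python
# Per-image prices from the table above (USD).
PRICE_PER_IMAGE = {
    "OpenAI DALL-E 3 (1024x1024)": 0.040,
    "Stability AI SD3 Medium": 0.035,
    "Together AI FLUX.1 schnell": 0.003,
    "Lexora FLUX.1 schnell": 0.002,
}


def monthly_image_cost(images_per_month):
    """Return the monthly bill per provider for a given image volume."""
    if images_per_month < 0:
        raise ValueError("images_per_month must be non-negative")
    return {
        name: round(price * images_per_month, 2)
        for name, price in PRICE_PER_IMAGE.items()
    }


# Hypothetical volume: 100,000 images per month.
for name, cost in monthly_image_cost(100_000).items():
    print(f"{name:<30} ${cost:>10,.2f}")
```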

What drives the price differences?

Hardware tier

Premium providers such as OpenAI and Anthropic run on the latest datacenter-grade hardware (H100-class clusters) and back it with SLA guarantees, and their pricing reflects that. Providers running A100s or A10Gs have lower hardware costs, which flow through to lower prices.

Utilization model

Providers that share hardware across many customers keep it busier, so they can amortize the hardware cost over more inference work, which drives the effective per-token price down. This is the structural advantage of shared serverless infrastructure over dedicated instances.

Infrastructure overhead

Premium providers include SLA guarantees, fine-tuning capabilities, enterprise compliance features, and 24/7 support. These services have real costs. Inference-only providers with leaner operations can pass the savings on as lower prices.

The total cost of inference

Headline prices undercount the true cost of inference for teams that self-host:

  • Engineering time: Managing a GPU fleet takes 20–40 engineering hours per month
  • Idle compute: At typical utilization (10–30%), 70–90% of the rental cost pays for idle GPUs
  • Reliability overhead: Out-of-memory (OOM) handling, auto-restart scripts, monitoring
  • Opportunity cost: Engineer hours spent on GPU ops are hours not spent on product
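The items above can be rolled into a rough monthly estimate. This is a sketch under stated assumptions: the $2.00/hr GPU rate and the $100/hr engineering rate are hypothetical placeholders, not quoted prices, and reliability and opportunity costs are left out because the article gives no figures for them.

```python
def self_hosting_monthly_cost(gpu_hourly_rate, utilization,
                              eng_hours, eng_hourly_rate,
                              hours_in_month=730):
    """Estimate the monthly cost of one always-on rented GPU plus the people running it."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    rental = gpu_hourly_rate * hours_in_month      # billed 24/7
    idle_waste = rental * (1.0 - utilization)      # share of rental spent on idle time
    engineering = eng_hours * eng_hourly_rate      # GPU ops labor
    return {
        "rental": rental,
        "idle_waste": idle_waste,
        "engineering": engineering,
        "total": rental + engineering,
    }


# Hypothetical inputs: $2.00/hr GPU, 20% utilization,
# 30 engineering hours/month (midpoint of 20-40) at $100/hr.
estimate = self_hosting_monthly_cost(2.00, 0.20, 30, 100.0)
for item, dollars in estimate.items():
    print(f"{item:<12} ${dollars:,.2f}")
```

Under these placeholder inputs the engineering line is larger than the GPU rental itself, which is why comparing hourly rates alone can be misleading.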

Choosing the right provider

The right choice depends on your stage:

  • Prototyping / MVP: Serverless (Lexora, Together AI, Groq), for the lowest cost and zero ops
  • Growing product, bursty traffic: Serverless, which scales automatically so you pay only for actual usage
  • High sustained load (>70% utilization): Start evaluating dedicated instances
  • Enterprise / compliance requirements: Premium providers with SLAs
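One way to sanity-check the utilization threshold for your own workload is to compute the break-even point against a serverless price. This sketch assumes the simple model used earlier (effective cost = full-utilization cost / utilization) and ignores engineering time; the function name is illustrative.

```python
def break_even_utilization(dedicated_cost_at_full_util, serverless_price):
    """Utilization above which a dedicated GPU is cheaper per 1M tokens.

    Both arguments are in $/1M tokens. Returns a fraction in (0, 1], or
    None when the dedicated GPU is more expensive even at 100% utilization.
    """
    if dedicated_cost_at_full_util <= 0 or serverless_price <= 0:
        raise ValueError("prices must be positive")
    ratio = dedicated_cost_at_full_util / serverless_price
    return ratio if ratio <= 1.0 else None


# Purely as arithmetic on figures quoted earlier in the article:
# $0.42 full-utilization cost vs a $0.60/1M per-token price.
print(break_even_utilization(0.42, 0.60))

# Against a $0.18/1M per-token price the dedicated GPU never breaks even.
print(break_even_utilization(0.42, 0.18))
```

A result near or above your realistic sustained utilization suggests staying on serverless; a result well below it is the signal to start pricing dedicated instances, with the engineering overhead added back in.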

Ready to cut your inference costs?

Get started with Lexora — no idle GPU costs, pay only for what you generate.
