Why pricing is confusing

AI inference pricing varies by an order of magnitude across providers, and the comparisons are rarely apples-to-apples. Some providers bill input and output tokens separately. Some quote prices per million tokens, others per thousand. Image generators charge per image, per step, or per second of compute. And none of them advertise what you actually pay once idle time is counted.

This article cuts through the noise with comparable numbers for the same workloads.

LLM pricing: Llama 3-class models

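For small models, mostly open-weight in the 7–8B parameter range (Llama 3.1 8B, Mistral 7B, etc.), here are 2025 prices per million tokens. GPT-4o mini is a closed-weight model included as a reference point, and the Lexora row is a smaller 3B model: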

Provider	Model	$/1M tokens	Billing model
OpenAI	GPT-4o mini	$0.15 (in) / $0.60 (out)	Per token
Together AI	Llama 3.1 8B	$0.18	Per token
Groq	Llama 3.1 8B	$0.05	Per token
Lexora	Llama 3.2 3B	$0.04	Per token
RunPod (A100)	Self-hosted	~$2.10 effective	$/hr (24/7)

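The RunPod number assumes 20% utilization (a realistic average for a startup). A rented GPU is billed by the hour whether or not it is serving requests, so the effective price is the full-utilization price divided by utilization: ~$0.42/1M at 100% becomes $0.42 / 0.20 = ~$2.10/1M at 20%. No one sustains 100% utilization.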
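The utilization adjustment can be sketched in a few lines of Python. This is a minimal illustration, assuming the ~$0.42/1M figure is the cost at full utilization and that idle hours are billed at the same hourly rate; the function name is made up for this example.

```python
def effective_cost_per_million(full_utilization_cost: float, utilization: float) -> float:
    """Cost per 1M tokens once idle, billed-but-unused hours are paid for.

    full_utilization_cost: $/1M tokens if the GPU were busy 100% of the time.
    utilization: fraction of billed hours spent serving requests, in (0, 1].
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return full_utilization_cost / utilization


# ~$0.42/1M at 100% utilization is the figure quoted for the RunPod A100 row.
for u in (1.0, 0.5, 0.2):
    cost = effective_cost_per_million(0.42, u)
    print(f"{u:.0%} utilization: ${cost:.2f} per 1M tokens")
```

At 20% utilization this reproduces the ~$2.10 shown in the table.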

LLM pricing: larger models

For 70B-class models (Llama 3.1 70B, Mixtral 8x22B, etc.), with GPT-4o as a closed-weight reference point:

Provider	Model	$/1M tokens
OpenAI	GPT-4o	$2.50 (in) / $10.00 (out)
Together AI	Llama 3.1 70B	$0.88
Groq	Llama 3.1 70B	$0.59
Fireworks AI	Llama 3.1 70B	$0.90
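To compare a provider that bills input and output separately against one with a single flat rate, the split price needs to be blended into one number. The sketch below does that with a weighted average. The 25% output share is an assumption for illustration, not a measured figure, and the function name is hypothetical.

```python
def blended_price(input_price: float, output_price: float, output_share: float) -> float:
    """Average $/1M tokens when input and output tokens are billed separately.

    output_share: fraction of all tokens in the workload that are output tokens.
    """
    if not 0 <= output_share <= 1:
        raise ValueError("output_share must be in [0, 1]")
    return input_price * (1 - output_share) + output_price * output_share


# GPT-4o list prices from the table above; the 25% output share is assumed.
gpt4o = blended_price(2.50, 10.00, output_share=0.25)
print(f"GPT-4o blended: ${gpt4o:.3f} per 1M tokens")
```

The blended figure moves a lot with the input/output mix, so it is worth computing from your own traffic rather than from a guess.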

Image generation pricing

Image generation pricing is harder to compare because quality, resolution, and steps all vary. Here are approximate costs for a standard 1024px image:

Provider	Model	$/image
OpenAI	DALL-E 3 (1024×1024)	$0.040
Stability AI	SD3 Medium	$0.035
Together AI	FLUX.1 schnell	$0.003
Lexora	FLUX.1 schnell	$0.002
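Per-image prices are easier to judge at a realistic volume. The following sketch multiplies the prices from the table by a monthly image count. The 10,000 images/month volume is an assumption chosen for illustration.

```python
# $/image, copied from the table above.
PRICE_PER_IMAGE = {
    "OpenAI DALL-E 3 (1024x1024)": 0.040,
    "Stability AI SD3 Medium": 0.035,
    "Together AI FLUX.1 schnell": 0.003,
    "Lexora FLUX.1 schnell": 0.002,
}


def monthly_image_cost(images_per_month: int) -> dict[str, float]:
    """Monthly spend per provider for a given image volume."""
    return {name: price * images_per_month for name, price in PRICE_PER_IMAGE.items()}


# Assumed volume: 10,000 images per month.
for name, cost in monthly_image_cost(10_000).items():
    print(f"{name}: ${cost:,.2f}")
```

This compares price only. As noted above, quality, resolution, and step count differ across these models.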

What drives the price differences?

Hardware tier

OpenAI and Anthropic run on the latest datacenter-grade hardware (H100 clusters) with full SLA guarantees, and their pricing reflects it. Providers using A100s or A10Gs have lower hardware costs, and those savings flow through to pricing.

Utilization model

Providers with higher shared utilization across many customers can amortize their hardware cost over more inference work — driving the effective per-token price down. This is the structural advantage of shared serverless infrastructure over dedicated instances.

Infrastructure overhead

Premium providers include SLA guarantees, fine-tuning capabilities, enterprise compliance features, and 24/7 support. These services have real costs. Inference-only providers with leaner operations pass on lower prices.

The total cost of inference

Raw token prices understate the true cost of inference for teams that self-host:

Engineering time: Managing a GPU fleet costs 20–40 eng-hours/month
Idle compute: At typical utilization (10–30%), 70–90% of rental cost pays for idle hours
Reliability overhead: OOM handling, auto-restart scripts, monitoring
Opportunity cost: Engineer hours on GPU ops = hours not on product
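The items above can be rolled into a rough monthly figure. This sketch is a back-of-the-envelope model: the $1.50/hr GPU rate and the $100/hr engineering rate are hypothetical placeholders, 730 hours approximates one month of 24/7 rental, and the 30 eng-hours and 20% utilization come from the ranges quoted in this article.

```python
def self_hosting_monthly_cost(
    gpu_hourly_rate: float,
    eng_hours: float,
    eng_hourly_rate: float,
    utilization: float,
    hours_per_month: float = 730.0,  # assumed: one GPU rented 24/7
) -> dict[str, float]:
    """Break a month of self-hosting into rental, idle waste, and engineering."""
    if not 0 <= utilization <= 1:
        raise ValueError("utilization must be in [0, 1]")
    rental = gpu_hourly_rate * hours_per_month
    engineering = eng_hours * eng_hourly_rate
    return {
        "rental": rental,
        "idle_waste": rental * (1 - utilization),  # part of rental, not added to total
        "engineering": engineering,
        "total": rental + engineering,
    }


# Hypothetical rates: $1.50/hr GPU, $100/hr engineer. 30 eng-hours and 20%
# utilization sit inside the ranges quoted above.
for item, dollars in self_hosting_monthly_cost(1.50, 30, 100.0, 0.20).items():
    print(f"{item}: ${dollars:,.2f}")
```

Under these placeholder rates, engineering time is larger than the GPU bill itself, which is why raw rental prices alone can mislead.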

Choosing the right provider

The right choice depends on your stage:

Prototyping / MVP: Serverless (Lexora, Together AI, Groq) — lowest cost, zero ops
Growing product, bursty traffic: Serverless — scale automatically, pay for actual usage
High sustained load (>70% util): Start evaluating dedicated instances
Enterprise / compliance requirements: Premium providers with SLAs
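One way to sanity-check the utilization threshold for your own workload is to compute the break-even point directly. Dedicated hardware costs (full-utilization price / utilization) per token, so it beats a serverless price only when utilization exceeds the ratio of the two. The sketch below is an illustration with a hypothetical function name; the $0.60 serverless price in the second call is a made-up input, not a quoted price.

```python
from typing import Optional


def break_even_utilization(dedicated_full_cost: float, serverless_price: float) -> Optional[float]:
    """Utilization above which dedicated hardware is cheaper per token.

    dedicated_full_cost: $/1M tokens on dedicated hardware at 100% utilization.
    serverless_price: $/1M tokens on the serverless alternative.
    Returns None when dedicated is more expensive even at 100% utilization.
    """
    if dedicated_full_cost <= 0 or serverless_price <= 0:
        raise ValueError("prices must be positive")
    ratio = dedicated_full_cost / serverless_price
    return ratio if ratio <= 1 else None


# 8B-class figures from this article: ~$0.42 self-hosted at full utilization, $0.18 serverless.
print(break_even_utilization(0.42, 0.18))

# Hypothetical serverless price of $0.60/1M against the same self-hosted cost.
print(break_even_utilization(0.42, 0.60))
```

A result of None means utilization alone cannot close the price gap, so any case for dedicated hardware would have to rest on other factors such as latency, control, or compliance.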

How Much Does AI Inference Actually Cost?