Why pricing is confusing
AI inference pricing varies by an order of magnitude across providers, and the comparisons are rarely apples-to-apples. Some providers bill input and output tokens separately. Some quote prices per million tokens, others per thousand. Image generators charge per image, per step, or per second of compute. And few of them advertise the real total cost once you account for idle time.
This article cuts through the noise with comparable numbers for the same workloads.
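To make the unit differences concrete, here is a minimal Python sketch of the normalization such a comparison relies on. The function names and the 75/25 input/output split are illustrative assumptions, not part of any provider's API; the GPT-4o mini prices are the ones used later in this article.

```python
def per_million(price: float, per_tokens: int) -> float:
    """Convert a price quoted per `per_tokens` tokens into dollars per 1M tokens."""
    if per_tokens <= 0:
        raise ValueError("per_tokens must be positive")
    return price * 1_000_000 / per_tokens


def blended_per_million(input_price: float, output_price: float,
                        input_share: float) -> float:
    """Single $/1M figure for a provider that bills input and output separately.

    `input_share` is the fraction of all tokens that are input tokens
    (a workload assumption, not something the provider publishes).
    """
    if not 0.0 <= input_share <= 1.0:
        raise ValueError("input_share must be between 0 and 1")
    return input_price * input_share + output_price * (1.0 - input_share)


if __name__ == "__main__":
    # A price quoted as $0.00015 per 1K tokens, restated per 1M tokens.
    print(f"${per_million(0.00015, 1_000):.2f} per 1M tokens")

    # GPT-4o mini ($0.15 in / $0.60 out), assuming 75% of tokens are input.
    print(f"${blended_per_million(0.15, 0.60, 0.75):.4f} per 1M tokens (blended)")
```

The blended figure depends heavily on the assumed split: a summarization workload (mostly input) and a generation workload (mostly output) can land far apart on the same price sheet.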
LLM pricing: Llama 3-class models
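For small models, mostly open-weight models in the 7–8B parameter range (Llama 3.1 8B, Mistral 7B, etc.), here are 2025 prices per million tokens. GPT-4o mini (a proprietary model) and Lexora's Llama 3.2 3B (a smaller model) are included as reference points rather than like-for-like comparisons: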
| Provider | Model | $/1M tokens | Billing model |
|---|---|---|---|
| OpenAI | GPT-4o mini | $0.15 (in) / $0.60 (out) | Per token |
| Together AI | Llama 3.1 8B | $0.18 | Per token |
| Groq | Llama 3.1 8B | $0.05 | Per token |
| Lexora | Llama 3.2 3B | $0.04 | Per token |
| RunPod (A100) | Self-hosted | ~$2.10 effective | $/hr (24/7) |
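The RunPod number assumes 20% utilization (a realistic average for a startup). Because an hourly rental bills idle hours too, the effective price scales inversely with utilization: at 100% utilization it would be ~$0.42/1M, and $0.42 / 0.20 = $2.10. Almost no one sustains 100% utilization.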
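The utilization adjustment can be sketched in a few lines of Python. This is a simplified model that assumes the GPU is billed around the clock and that throughput while busy is constant; the function names are illustrative, and the $0.42 input is the article's own full-utilization estimate.

```python
def effective_price_per_million(price_at_full_utilization: float,
                                utilization: float) -> float:
    """Effective $/1M tokens when idle hours are billed but produce no tokens."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return price_at_full_utilization / utilization


def idle_share(utilization: float) -> float:
    """Fraction of the rental bill that pays for idle time."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return 1.0 - utilization


if __name__ == "__main__":
    full_price = 0.42  # article's estimate for an A100 at 100% utilization
    for utilization in (1.0, 0.7, 0.2):
        price = effective_price_per_million(full_price, utilization)
        print(f"{utilization:.0%} utilization: ${price:.2f}/1M tokens, "
              f"{idle_share(utilization):.0%} of spend idle")
```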
LLM pricing: larger models
For 70B-class models (Llama 3.1 70B, Mixtral 8x22B, etc.):
| Provider | Model | $/1M tokens |
|---|---|---|
| OpenAI | GPT-4o | $2.50 (in) / $10.00 (out) |
| Together AI | Llama 3.1 70B | $0.88 |
| Groq | Llama 3.1 70B | $0.59 |
| Fireworks AI | Llama 3.1 70B | $0.90 |
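
To see what these rates mean for a bill, the sketch below prices one workload against the table. The monthly volume (500M input tokens, 100M output tokens) is a hypothetical workload chosen for illustration, and the code assumes that a single listed price applies to both input and output tokens, which the table does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenPrice:
    input_per_million: float
    output_per_million: float

    def monthly_cost(self, input_tokens: int, output_tokens: int) -> float:
        """Dollar cost for the given monthly token volumes."""
        return (input_tokens * self.input_per_million
                + output_tokens * self.output_per_million) / 1_000_000


def flat(price_per_million: float) -> TokenPrice:
    """Assumption: one listed price covers both input and output tokens."""
    return TokenPrice(price_per_million, price_per_million)


# Prices copied from the table above.
PRICES = {
    "OpenAI GPT-4o": TokenPrice(2.50, 10.00),
    "Together AI Llama 3.1 70B": flat(0.88),
    "Groq Llama 3.1 70B": flat(0.59),
    "Fireworks AI Llama 3.1 70B": flat(0.90),
}


def rank_by_cost(prices: dict[str, TokenPrice], input_tokens: int,
                 output_tokens: int) -> list[tuple[str, float]]:
    """Return (provider, monthly cost) pairs, cheapest first."""
    costs = {name: p.monthly_cost(input_tokens, output_tokens)
             for name, p in prices.items()}
    return sorted(costs.items(), key=lambda item: item[1])


if __name__ == "__main__":
    # Hypothetical workload: 500M input and 100M output tokens per month.
    for name, cost in rank_by_cost(PRICES, 500_000_000, 100_000_000):
        print(f"{name}: ${cost:,.2f}/month")
```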
Image generation pricing
Image generation pricing is harder to compare because quality, resolution, and step count all vary. Here are approximate costs for a standard 1024×1024 image:
| Provider | Model | $/image |
|---|---|---|
| OpenAI | DALL-E 3 (1024×1024) | $0.040 |
| Stability AI | SD3 Medium | $0.035 |
| Together AI | FLUX.1 schnell | $0.003 |
| Lexora | FLUX.1 schnell | $0.002 |
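
Providers that bill per step or per second of compute need to be converted to a per-image figure before they can sit in a table like this one. The helpers below are a minimal sketch of that conversion; their names are hypothetical, and only the per-image prices come from the table above.

```python
def per_image_from_steps(price_per_step: float, steps: int) -> float:
    """Per-image cost for a provider that bills per diffusion step."""
    if steps <= 0:
        raise ValueError("steps must be positive")
    return price_per_step * steps


def per_image_from_seconds(price_per_second: float,
                           seconds_per_image: float) -> float:
    """Per-image cost for a provider that bills per second of compute."""
    if seconds_per_image <= 0:
        raise ValueError("seconds_per_image must be positive")
    return price_per_second * seconds_per_image


def batch_cost(price_per_image: float, images: int) -> float:
    """Total cost of generating `images` images at a per-image price."""
    return price_per_image * images


# Per-image prices copied from the table above.
TABLE_PRICES = {
    "OpenAI DALL-E 3": 0.040,
    "Stability AI SD3 Medium": 0.035,
    "Together AI FLUX.1 schnell": 0.003,
    "Lexora FLUX.1 schnell": 0.002,
}

if __name__ == "__main__":
    for name, price in TABLE_PRICES.items():
        print(f"{name}: ${batch_cost(price, 10_000):,.2f} per 10,000 images")
```

Per-step and per-second rates only become comparable once the step count and generation time for the target resolution are known, so measure those on your own prompts rather than assuming a default.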
What drives the price differences?
Hardware tier
Premium providers such as OpenAI and Anthropic are generally understood to run on the latest datacenter-grade hardware (H100-class clusters) and to back their APIs with SLA guarantees, and their pricing reflects this. Providers using A100s or A10Gs have lower hardware costs, which flow through to lower prices.
Utilization model
Providers that share hardware across many customers run at higher utilization, so they can amortize hardware cost over more inference work and drive the effective per-token price down. This is the structural advantage of shared serverless infrastructure over dedicated instances.
Infrastructure overhead
Premium providers include SLA guarantees, fine-tuning capabilities, enterprise compliance features, and 24/7 support. These services have real costs. Inference-only providers with leaner operations can pass the savings on as lower prices.
The total cost of inference
Raw per-token or per-hour prices understate the true cost of inference for teams that self-host:
- Engineering time: Managing a GPU fleet costs 20–40 engineering hours per month
- Idle compute: At typical utilization of 10–30%, 70–90% of the rental cost pays for idle GPUs
- Reliability overhead: OOM handling, auto-restart scripts, monitoring
- Opportunity cost: Every engineering hour spent on GPU ops is an hour not spent on product
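
The first two items can be put into numbers with a rough model like the one below. It is a sketch under stated assumptions: the GPU hourly rate and the engineering hourly rate in the example are placeholders to replace with your own figures, while the 20% utilization and 30 engineering hours are taken from the ranges in this article. Reliability overhead and opportunity cost are not modeled.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 730  # 24 * 365 / 12, rounded


@dataclass(frozen=True)
class SelfHostingCost:
    rental: float       # total GPU rental bill for the month
    idle_waste: float   # portion of `rental` spent on idle GPUs
    engineering: float  # cost of engineering time spent on GPU ops

    @property
    def total(self) -> float:
        # idle_waste is already included in rental, so it is not added again.
        return self.rental + self.engineering


def self_hosting_cost(gpu_hourly_rate: float, gpu_count: int,
                      utilization: float, eng_hours: float,
                      eng_hourly_rate: float) -> SelfHostingCost:
    """Monthly cost of an always-on GPU fleet plus the time spent running it."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    rental = gpu_hourly_rate * gpu_count * HOURS_PER_MONTH
    return SelfHostingCost(
        rental=rental,
        idle_waste=rental * (1.0 - utilization),
        engineering=eng_hours * eng_hourly_rate,
    )


if __name__ == "__main__":
    cost = self_hosting_cost(
        gpu_hourly_rate=2.00,   # placeholder, not a quoted price
        gpu_count=1,
        utilization=0.20,       # article's startup average
        eng_hours=30,           # midpoint of the article's 20-40 range
        eng_hourly_rate=100.0,  # placeholder loaded cost per hour
    )
    print(f"rental ${cost.rental:,.0f}, of which idle ${cost.idle_waste:,.0f}; "
          f"engineering ${cost.engineering:,.0f}; total ${cost.total:,.0f}")
```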
Choosing the right provider
The right choice depends on your stage:
- Prototyping / MVP: Serverless (Lexora, Together AI, Groq) for the lowest cost and zero ops
- Growing product, bursty traffic: Serverless, which scales automatically and bills only for actual usage
- High sustained load (>70% utilization): Start evaluating dedicated instances
- Enterprise / compliance requirements: Premium providers with SLAs