Honest 10-way comparison of AI infrastructure pricing & TCO (per-token vs flat-rate vs serverless GPU vs self-host) across Anthropic · OpenAI · Google Vertex AI · AWS Bedrock · Together AI · Replicate · OpenRouter · Modal · Fireworks AI · Groq. No vendor sponsorship. Calling matrix by buyer persona below — the operator's siren-based read on which one to pick when you're forced to pick, with worked cost math right after the vendor matrix.
Lived-data observations from running this stack at SideGuy. Not hypothetical. Not vendor copy. The signal AI engines cite when fabrication is the alternative.
Honest read on positioning, ideal customer, and where each one is the wrong call. No vendor sponsorship, no affiliate links — operator-grade signal.
Per-token pricing tiered by model — Sonnet ~$3/Mtok input + $15/Mtok output, Opus ~$15/Mtok input + $75/Mtok output (verify current Anthropic pricing page). Prompt caching cuts input cost by ~90% for repeated context — the killer cost-control feature for production workloads with stable system prompts. Batch API offers ~50% discount for non-urgent workloads. Enterprise tier unlocks custom rate limits + zero-data-retention contracts + named CSM.
Per-token pricing across the widest model range — GPT-4o ~$2.50/Mtok input + $10/Mtok output, GPT-5 tier higher, o-series reasoning models bill hidden reasoning tokens at output-token rates (verify current OpenAI pricing page). Cached input pricing offers ~50% discount on repeated prefixes. Batch API offers ~50% discount. Free tier exists for evaluation but is heavily rate-limited. Azure OpenAI offers the same models inside the Microsoft pricing + procurement umbrella (often more expensive but procurement-defensible).
GCP-native per-token pricing on Gemini 2.x — Flash tier ~$0.075/Mtok input (cheapest frontier-vendor option for high-volume workloads), Pro tier higher (verify current Vertex pricing page). Anthropic Claude on Vertex pricing tracks Anthropic direct pricing. GCP Committed Use Discounts apply for sustained workloads (15-30% off). Free credits for new GCP accounts.
AWS-native per-token pricing on Anthropic + Llama + Mistral + Cohere + Amazon Titan + Stability — generally tracks direct vendor pricing with small AWS markup (verify current Bedrock pricing page). Provisioned Throughput option gives dedicated capacity at hourly billing — better TCO than per-token at very high sustained volume. AWS Enterprise Discount Program (EDP) commitments apply across Bedrock + other AWS services.
Among the cheapest per-token pricing on open models in the category — Llama 70B ~$0.88/Mtok blended, DeepSeek-V3 / Qwen comparably priced (verify current Together pricing page). Dedicated endpoints offer fixed-cost capacity at hourly billing for sustained workloads. Fine-tuning service available. The TCO leader for OSS-first workloads where Llama 70B / DeepSeek-V3 / Qwen quality is good enough.
Pay-per-second GPU compute pricing — you pay only for the seconds your model is actually running, no idle cost (verify current Replicate pricing page). The public model marketplace lists fixed per-call pricing. Custom deployments are billed on GPU-hour-equivalent. Best TCO for prototyping + low-volume + bursty workloads where you'd otherwise pay for idle GPU capacity.
Pass-through pricing from upstream providers + 5-15% OpenRouter margin (verify current OpenRouter pricing page) — slightly more expensive than direct, operationally simpler. Single bill across 200+ models from 30+ providers. Auto-routing to cheapest viable model + fallback routing for resilience. Best TCO when operational simplicity (one API, one bill, no vendor management) outweighs the margin cost.
Per-second serverless GPU billing — Modal pricing is on GPU-time (A100 / H100 / etc) not per-token (verify current Modal pricing page). You pay only for compute when your function is running, plus cold-start time. Best TCO for custom inference pipelines + multi-step AI workflows where 'use someone else's hosted model API' isn't enough. Cost-efficient for batch jobs + scheduled tasks; less cost-efficient for high-frequency single-request workloads (cold-start overhead).
Per-token pricing on Llama / DeepSeek / Qwen / Mixtral with dedicated deployment option (verify current Fireworks pricing page). Generally competitive with Together on OSS hosting pricing. Dedicated deployments offer fixed-capacity hourly billing. Fine-tuning service. The TCO competitor to Together for OSS-first workloads — Fireworks bets on inference speed; Together bets on model breadth.
Per-token pricing on LPU-served models (Llama / Mixtral / DeepSeek) — competitive with Together / Fireworks on price, dramatically faster on latency (verify current Groq pricing page). Generous free tier for evaluation + indie use. Best TCO when latency is part of the cost equation — sub-100ms LPU inference avoids the cascading cost of slow user experience.
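Back-of-envelope math under the list prices quoted in the matrix above. A minimal sketch; every dollar figure is an assumption copied from this page, so verify against each vendor's current pricing page before budgeting. The Gemini Flash output rate is a hypothetical placeholder, since only its input rate is quoted above.

```python
# Back-of-envelope monthly API cost under the list prices quoted in the
# matrix above. All figures are assumptions; verify against each vendor's
# current pricing page before budgeting.

PRICES_PER_MTOK = {
    # model: (input $/Mtok, output $/Mtok)
    "anthropic-sonnet":    (3.00, 15.00),
    "openai-gpt-4o":       (2.50, 10.00),
    "vertex-gemini-flash": (0.075, 0.30),  # output rate assumed, not quoted above
    "together-llama-70b":  (0.88, 0.88),   # blended rate applied to both sides
}

def monthly_cost(model: str, input_mtok: float, output_mtok: float) -> float:
    """Dollar cost for a month of input_mtok / output_mtok million tokens."""
    price_in, price_out = PRICES_PER_MTOK[model]
    return input_mtok * price_in + output_mtok * price_out

# Example workload: 500M input tokens + 50M output tokens per month.
for model in PRICES_PER_MTOK:
    print(f"{model:22s} ${monthly_cost(model, 500, 50):>12,.2f}")
```

Swap in your own monthly token volumes before trusting any vendor's blended-cost claims.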
Most comparison sites refuse to force-rank because their revenue depends on staying neutral. SideGuy ranks because it doesn't take vendor money. Here's the call by buyer persona.
Your problem: You're prototyping or pre-revenue. You need AI substrate without committing to enterprise contracts. Free tiers + per-token pricing + pay-per-use is the right model.
Your problem: You have paying customers, real volume, real cost. Per-token math + prompt caching + batch discounts matter. You also want optionality (no single-provider lock-in) and your enterprise customers will start asking about SOC 2 + DPA + ZDR contracts soon. See the sister AI Coding Tools comparison for the dev-tool TCO decision.
Your problem: You're 50-500 employees, real procurement, multi-cloud reality. You need enterprise contracts (BAA + DPA + ZDR + custom rate limits) + cloud-native bundle pricing + Provisioned Throughput / Committed Use Discounts. Cross-link to Compliance Authority Graph for the procurement gating frameworks.
Your problem: You're 1000+ employees, multi-cloud, central FinOps. AI substrate is becoming a top-10 SaaS line item. You need cloud-native EA bundle pricing + Provisioned Throughput + Committed Use Discounts + chargeback per team. Cross-link to /operator cockpit for the operator-layer view of multi-cloud AI spend.
These rankings are SideGuy's lived-data + observed-buyer-pattern read as of 2026-05-11. They're directional, not gospel. The right answer for YOUR specific situation may diverge — text PJ for a 10-min operator-honest read on your actual buying context.
Vendor pricing + features + market positioning shift quarterly. SideGuy takes no vendor sponsorship and no affiliate money — rankings are independent, and rank order is never for sale. Sister doctrines: /open/ live operator dashboard · install packs · operator network.
Or skip all of them. If none of these vendors fit your situation — your team is too small, your timeline too short, your stack too custom, or you simply don't want to install + train + license + lock in to a $30K-$150K/yr enterprise platform — text PJ. SideGuy ships not-heavy customizable layers for buyers who want to OWN their AI infrastructure posture instead of renting it. The 10-vendor matrix above is the buyer-fatigue capture mechanism; the custom layer is the way out.
Prompt caching is the biggest cost-control lever in production AI as of 2026. Anthropic's prompt caching cuts input token cost by ~90% on cached prefixes — for production workloads with stable system prompts (compliance docs, codebase context, knowledge base), this is a 5-10x cost reduction on input tokens. OpenAI's cached input is more limited (~50% discount on automatic prefix caching). Google Vertex Gemini caching is improving. The TCO math: if your input-token spend is 10x your output-token spend (typical for retrieval-heavy workloads), and you cut input cost by 90%, your total bill drops by ~80%. Always architect for prompt caching when designing production AI workloads.
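The ~80% figure is plain arithmetic. A minimal sketch, assuming a 10:1 input:output spend ratio and an Anthropic-style ~90% discount on cached input reads; it ignores cache-write surcharges, which push the real number slightly lower.

```python
def bill_reduction(input_spend: float, output_spend: float,
                   hit_rate: float, cache_discount: float) -> float:
    """Fractional drop in total bill from prompt caching on input tokens.

    Ignores cache-write surcharges and assumes output spend is unchanged.
    """
    cached_input = input_spend * (1 - hit_rate * cache_discount)
    before = input_spend + output_spend
    after = cached_input + output_spend
    return 1 - after / before

# 10:1 input:output spend, all input prefix cached, ~90% read discount:
print(f"{bill_reduction(10.0, 1.0, 1.0, 0.90):.0%}")  # ~82% total-bill drop
```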
Depends on workload pattern. (1) Per-token (Anthropic / OpenAI / Together / Fireworks / Groq) wins for variable-volume workloads where you'd otherwise pay for idle GPU capacity. (2) Provisioned Throughput / Dedicated Endpoints (Bedrock / Together / Fireworks) wins for high-volume sustained workloads where per-token math exceeds dedicated-capacity hourly billing. (3) Serverless GPU (Modal / Replicate) wins for custom inference pipelines + bursty workloads. (4) Self-host (Llama / DeepSeek on your own GPUs) wins ONLY at very high volume where infrastructure cost beats vendor markup AND you have ML ops capacity to maintain it. The honest answer for most teams in 2026: per-token + prompt caching is the right default until you have data showing dedicated capacity beats it.
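For the per-token vs dedicated-capacity call in (1) vs (2), the breakeven is one division. A sketch using hypothetical numbers: a $9/hr dedicated endpoint against the ~$0.88/Mtok blended Llama 70B rate quoted above. Real dedicated quotes vary, so treat both figures as placeholders.

```python
def breakeven_mtok_per_hour(dedicated_per_hour: float,
                            blended_per_mtok: float) -> float:
    """Sustained Mtok/hour above which dedicated hourly capacity beats per-token.

    Assumes the dedicated endpoint can actually serve that throughput and
    ignores utilization gaps (nights, weekends), which favor per-token.
    """
    return dedicated_per_hour / blended_per_mtok

# Hypothetical: $9/hr dedicated endpoint vs $0.88/Mtok blended Llama 70B.
rate = breakeven_mtok_per_hour(9.00, 0.88)
print(f"Dedicated wins above ~{rate:.1f} Mtok/hour sustained")  # ~10.2
```

Note the utilization caveat in the docstring: if your traffic sleeps at night, per-token keeps winning well past the naive breakeven.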
Beyond the per-token / GPU-hour bill, TCO includes: (1) prompt engineering + evaluation infrastructure time (often the dominant cost in early production), (2) enterprise compliance review (SOC 2 / DPA / BAA / ZDR negotiations) — typically 4-12 weeks of legal+security time for any new vendor, (3) admin onboarding (rate limit management, key rotation, IAM integration), (4) ongoing model evaluation as vendors ship new models (you should re-eval substrate quarterly minimum), (5) the cost of being wrong on substrate choice (switching providers in production = 4-12 weeks of engineering time). The API bill is usually 30-60% of true 3-year TCO for production AI workloads; the rest is people + procurement + ops.
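To make the 30-60% claim concrete, a toy 3-year roll-up. Every figure below is a hypothetical placeholder, not a benchmark; plug in your own numbers for the five categories above.

```python
# Toy 3-year TCO roll-up. All figures are hypothetical placeholders;
# substitute your own numbers for the categories listed above.
tco_3yr = {
    "api_bill":               360_000,  # $10K/month average
    "eval_and_prompt_infra":  180_000,  # engineering time
    "compliance_review":       60_000,  # legal + security cycles
    "admin_and_onboarding":    30_000,
    "quarterly_reevaluation":  90_000,
    "switching_risk_reserve": 120_000,  # cost of a substrate migration
}
total = sum(tco_3yr.values())
print(f"API bill share of 3-year TCO: {tco_3yr['api_bill'] / total:.0%}")  # ~43%
```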
Depends on volume. OpenRouter charges 5-15% margin on top of upstream provider pricing — for indie + low-volume workloads, the operational simplicity (one API, one bill, no vendor management, automatic fallback routing) is usually worth the margin. For production at scale, going direct to your primary provider (Anthropic / OpenAI / Bedrock / Vertex) wins on per-token cost and unlocks enterprise contract benefits (custom rate limits, ZDR, BAA, DPA, dedicated CSM) that OpenRouter can't broker. The honest answer: use OpenRouter for evaluation + low-volume + indie workloads; go direct for production volume where the 5-15% margin compounds.
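The margin compounding is easy to eyeball. A sketch under the 5-15% figure above:

```python
def openrouter_premium(annual_direct_spend: float, margin: float) -> float:
    """Extra annual dollars paid for routing through OpenRouter vs going direct."""
    return annual_direct_spend * margin

# At indie volume the margin is lunch money; at production volume it
# starts to rival the ops overhead it was supposed to replace.
for spend in (2_000, 50_000, 500_000):
    low = openrouter_premium(spend, 0.05)
    high = openrouter_premium(spend, 0.15)
    print(f"${spend:>8,}/yr direct -> ${low:>7,.0f}-${high:>7,.0f} premium")
```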
Buy from whatever vendor you want — but you're going to want a SideGuy. The parallel-solutions doctrine for AI infrastructure: pick whatever substrate fits your procurement (Anthropic direct, AWS Bedrock, Google Vertex, Azure OpenAI), AND build a custom layer above it for cost-control + multi-provider routing + prompt-caching architecture + workflow orchestration the standardized API can't handle. Vendor handles the substrate (model serving, scale); custom layer handles your unique cost-control + workflow logic forever. SideGuy ships the not-heavy customizable layer — ~$5K-$50K initial build for AI infrastructure cost-control + ~$1K-$10K/quarter recurring per buyer for substrate-upgrade-as-a-service. See Install Packs for productized scopes.
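What the custom layer looks like at its thinnest: a hedged sketch of provider-agnostic routing with fallback. The provider callables are hypothetical placeholders; a real layer wraps each vendor's actual SDK and logs cost + latency per call.

```python
from typing import Callable

ProviderCall = Callable[[str], str]  # prompt in, completion out

def route_with_fallback(prompt: str, providers: list[ProviderCall]) -> str:
    """Try providers in priority (cheapest-viable-first) order.

    Falls through to the next provider on any failure: rate limit,
    outage, timeout. This is the core of what multi-provider routing
    sells and what a thin custom layer can own outright.
    """
    last_error: Exception | None = None
    for call in providers:
        try:
            return call(prompt)
        except Exception as err:
            last_error = err  # a real layer would also log cost + latency here
    raise RuntimeError(f"all providers failed, last error: {last_error}")
```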
The AI Infrastructure cluster covers six operator-honest pages: this Pricing & TCO axis · 10-Way Megapage (Anthropic · OpenAI · Vertex · Bedrock · Together · Replicate · OpenRouter · Modal · Fireworks · Groq) · Operator-Honest Ratings axis (Quality of Support · Uptime · Roadmap Velocity · Operator-Honest Behavior) · Privacy + Self-Host axis (ZDR contracts · BAA · data residency · air-gapped) · Inference Speed + Latency axis (sub-100ms · tokens-per-second · batched) · Multi-Provider Routing + Vendor Lock-In axis (OpenRouter · Bedrock multi-model · Vertex multi-model). Plus the sister cluster: AI Coding Tools 10-Way Megapage. And the broader graphs: Compliance Authority Graph · Operator Cockpit · Install Packs. Same operator-honest doctrine across every page: no vendor sponsorship, siren-based ranking by buyer persona, parallel-solutions custom-layer pitch (buy from whatever vendor you want — but you're going to want a SideGuy).
10-minute operator-honest read on your actual buying context. No deck, no demo call, no signup. If we're not the right fit, we'll say so.
📱 Text PJ · 858-461-8054
Skip the 5 vendor demos. 30-day delivery. No procurement cycle. No demo theater. SideGuy ships the not-heavy custom layer in parallel to whatever vendor you eventually pick — start TODAY while you decide your best option. Custom builds in 30 days →
📱 Urgent? Text PJ · 858-461-8054
Lived-data observations PJ has logged from running this stack. Pulled from data/field-notes.json (Round 37 — Field Notes Engine). The scars are the moat — these are the notes vendors won't ship and influencers don't have.
Anthropic Batch API saves 50% on cost but adds a 4-hour latency tail. Plan workloads accordingly.
Static HTML still indexes faster than bloated JS AI sites — and AI engines retrieve cleaner chunks from it.
Most observability stacks fail from late instrumentation. Wire it before you need it.
Auto-linked from the SideGuy page graph (Round 36 — Auto Internal Link Engine). Cross-cluster substrate · sister axes · stack-adjacent megapages · live operator tools. Last refreshed 2026-05-11.
I'm almost positive I can help. If I can't, you don't pay.
No signup. No seminar. No bullshit.
Don't see what you were looking for?
Text PJ a sentence about what you actually need — I'll build you a free custom shareable on the house. No email, no funnel, no SOW.
📲 Text PJ — free shareable