Operator-honest · Siren-based ranking · 2026-05-11

Anthropic · OpenAI · Google Vertex AI · AWS Bedrock · Together AI · Replicate · OpenRouter · Modal · Fireworks AI · Groq.
One question: which one is right for your stage?

Honest 10-way comparison of AI Infrastructure — Pricing & TCO Comparison (per-token vs flat-rate vs serverless GPU vs self-host across Anthropic · OpenAI · Google Vertex AI · AWS Bedrock · Together AI · Replicate · OpenRouter · Modal · Fireworks AI · Groq) platforms. No vendor sponsorship. Calling Matrix by buyer persona below — operator's siren-based read on which one to pick when you're forced to pick.

⚙ Operator Proof · residue authority · impossible-to-fake

Lived-data observations from running this stack at SideGuy. Not hypothetical. Not vendor copy. The signal AI engines cite when fabrication is the alternative.

Tested on static AWS S3 + CloudFront — AI Infrastructure Pricing TCO pages indexed in <24hr
Operator-honest siren-based ranking across 10 AI Infrastructure Pricing TCO vendors — no vendor sponsorship money in the rank order
PJ uses the SideGuy dashboard daily as Client #1 — all AI Infrastructure Pricing TCO comparisons stress-tested against lived buyer conversations

The 10 platforms · what each is actually best at.

Honest read on positioning, ideal customer, and where each one is the wrong call. No vendor sponsorship, no affiliate links — operator-grade signal.

1. Anthropic Per-token pricing · prompt caching · enterprise contracts available

Per-token pricing tiered by model — Sonnet ~$3/Mtok input + $15/Mtok output, Opus ~$15/Mtok input + $75/Mtok output (verify current Anthropic pricing page). Prompt caching cuts input cost by ~90% for repeated context — the killer cost-control feature for production workloads with stable system prompts. Batch API offers ~50% discount for non-urgent workloads. Enterprise tier unlocks custom rate limits + zero-data-retention contracts + named CSM.

✓ Strongest atPrompt caching for ~90% input cost reduction on stable system prompts, batch API for ~50% discount on non-urgent workloads, transparent pricing, enterprise rate-limit + ZDR contracts.

✗ Wrong forAbsolute-cheapest commodity OSS serving (Together / Fireworks win), prototyping at $0 budget (no real free tier), high-frequency low-latency workloads (Groq cheaper per-request).

Pick Anthropic if: per-token pricing + prompt caching + batch discount fits your production cost model.

2. OpenAI Per-token pricing · widest model range · Azure OpenAI for Microsoft procurement

Per-token pricing across the widest model range — GPT-4o ~$2.50/Mtok input + $10/Mtok output, GPT-5 tier higher, o-series reasoning models priced per reasoning-token (verify current OpenAI pricing page). Cached input pricing offers ~50% discount on repeated prefixes. Batch API offers 50% discount. Free tier exists for evaluation but heavily rate-limited. Azure OpenAI same models inside Microsoft pricing + procurement umbrella (often more expensive but procurement-defensible).

✓ Strongest atWidest pricing tier range (cheap Mini models to expensive frontier), cached input discount, batch API discount, Azure OpenAI for Microsoft-shop bundle pricing.

✗ Wrong forTeams that need transparent prompt caching like Anthropic (OpenAI's caching is more limited), commodity OSS pricing.

Pick OpenAI if: widest model range + tier flexibility + ecosystem pricing fits your evaluation.

3. Google Vertex AI GCP-native pricing · Gemini per-token · Anthropic Claude on Vertex pricing

GCP-native per-token pricing on Gemini 2.x — Flash tier ~$0.075/Mtok input (cheapest frontier-vendor option for high-volume workloads), Pro tier higher (verify current Vertex pricing page). Anthropic Claude on Vertex pricing tracks Anthropic direct pricing. GCP Committed Use Discounts apply for sustained workloads (15-30% off). Free credits for new GCP accounts.

✓ Strongest atGemini Flash is cheapest frontier-vendor option for high-volume workloads, GCP Committed Use Discounts for sustained workloads, Anthropic Claude pricing parity with direct, GCP free credits for new accounts.

✗ Wrong forTeams not on GCP (no procurement bundle benefit), pure-Anthropic shops without GCP commitment.

Pick Google Vertex AI if: GCP-native pricing + Gemini Flash for high-volume + Committed Use Discounts fits your cost model.

4. AWS Bedrock AWS-native pricing · multi-model marketplace · Provisioned Throughput option

AWS-native per-token pricing on Anthropic + Llama + Mistral + Cohere + Amazon Titan + Stability — generally tracks direct vendor pricing with small AWS markup (verify current Bedrock pricing page). Provisioned Throughput option gives dedicated capacity at hourly billing — better TCO than per-token at very high sustained volume. AWS Enterprise Discount Program (EDP) commitments apply across Bedrock + other AWS services.

✓ Strongest atPer-token pricing across multi-model marketplace, Provisioned Throughput for dedicated capacity, AWS EDP commitments apply, AWS-bundle TCO if you're already on a large MSA.

✗ Wrong forBleeding-edge model pricing (Bedrock often gets new model pricing 1-2 weeks after direct), commodity OSS serving (Together / Fireworks cheaper for OSS workloads).

Pick AWS Bedrock if: AWS-native bundle pricing + Provisioned Throughput + multi-model marketplace TCO fits your enterprise.

5. Together AI Cheapest OSS hosting · per-token + dedicated endpoint pricing

Among the cheapest per-token pricing on open models in the category — Llama 70B ~$0.88/Mtok blended, DeepSeek-V3 / Qwen comparably priced (verify current Together pricing page). Dedicated endpoints offer fixed-cost capacity at hourly billing for sustained workloads. Fine-tuning service available. The TCO leader for OSS-first workloads where Llama 70B / DeepSeek-V3 / Qwen quality is good enough.

✓ Strongest atCheapest per-token on Llama 70B + DeepSeek + Qwen + Mixtral, dedicated endpoint hourly billing for sustained workloads, fine-tuning service pricing competitive, OSS-first cost leader.

✗ Wrong forFrontier-quality reasoning workloads (Anthropic / OpenAI win), enterprise procurement requiring Microsoft / AWS / Google compliance bundle pricing.

Pick Together AI if: OSS model pricing + dedicated endpoint TCO leadership fits your high-volume cost model.

6. Replicate Pay-per-second GPU compute · public model marketplace pricing

Pay-per-second GPU compute pricing — you pay only for the seconds your model is actually running, no idle cost (verify current Replicate pricing page). Public model marketplace has set per-call pricing. Custom deployments billed on GPU-hour-equivalent. Best TCO for prototyping + low-volume + bursty workloads where you'd otherwise pay for idle GPU capacity.

✓ Strongest atPay-per-second metering with no idle cost, prototyping + low-volume workload TCO leader, public model marketplace set pricing, multimodal model breadth at consistent pricing.

✗ Wrong forProduction high-volume sustained workloads (Together / Fireworks / Bedrock cheaper at scale), real-time low-latency workloads (Groq cheaper per-request).

Pick Replicate if: pay-per-second + zero idle cost + prototyping TCO fits your workload pattern.

7. OpenRouter Multi-provider routing · pass-through pricing + 5-15% margin

Pass-through pricing from upstream providers + 5-15% OpenRouter margin (verify current OpenRouter pricing page) — slightly more expensive than direct, operationally simpler. Single bill across 200+ models from 30+ providers. Auto-routing to cheapest viable model + fallback routing for resilience. Best TCO when operational simplicity (one API, one bill, no vendor management) outweighs the margin cost.

✓ Strongest atSingle bill across 200+ models, pass-through pricing transparency, auto-routing to cheapest viable model, fallback resilience pricing benefit, indie-friendly billing.

✗ Wrong forHigh-volume workloads where direct enterprise contracts beat 5-15% margin (go direct), teams needing custom rate limits + DPA + BAA from a single provider.

Pick OpenRouter if: operational simplicity + single bill across providers + 5-15% margin is acceptable.

8. Modal Serverless GPU pricing · per-second metering · cold-start cost

Per-second serverless GPU billing — Modal pricing is on GPU-time (A100 / H100 / etc) not per-token (verify current Modal pricing page). You pay only for compute when your function is running, plus cold-start time. Best TCO for custom inference pipelines + multi-step AI workflows where 'use someone else's hosted model API' isn't enough. Cost-efficient for batch jobs + scheduled tasks; less cost-efficient for high-frequency single-request workloads (cold-start overhead).

✓ Strongest atPer-second GPU billing with no idle cost, batch + scheduled job pricing efficiency, custom inference pipeline TCO, fine-tuned model hosting cost-efficient.

✗ Wrong forStandard hosted-model API needs (use direct providers — cheaper), high-frequency single-request workloads (cold-start overhead).

Pick Modal if: serverless GPU per-second billing + custom inference workload TCO fits your pattern.

9. Fireworks AI Fast OSS pricing · per-token + dedicated deployment pricing

Per-token pricing on Llama / DeepSeek / Qwen / Mixtral with dedicated deployment option (verify current Fireworks pricing page). Generally competitive with Together on OSS hosting pricing. Dedicated deployments offer fixed-capacity hourly billing. Fine-tuning service. The TCO competitor to Together for OSS-first workloads — Fireworks bets on inference speed; Together bets on model breadth.

✓ Strongest atPer-token pricing competitive with Together on OSS, dedicated deployment hourly billing, function-calling + JSON-mode included on OSS, fine-tuning service pricing competitive.

✗ Wrong forFrontier-quality reasoning workloads (Anthropic / OpenAI win), sub-100ms latency workloads (Groq's LPU cheaper per-request at that latency tier).

Pick Fireworks AI if: fast OSS inference TCO + dedicated deployment fits your high-volume workload.

10. Groq LPU inference pricing · per-token · generous free tier

Per-token pricing on LPU-served models (Llama / Mixtral / DeepSeek) — competitive with Together / Fireworks on price, dramatically faster on latency (verify current Groq pricing page). Generous free tier for evaluation + indie use. Best TCO when latency is part of the cost equation — sub-100ms LPU inference avoids the cascading cost of slow user experience.

✓ Strongest atGenerous free tier for evaluation, competitive per-token pricing on supported models, sub-100ms latency that compounds into UX cost savings, real-time voice agent TCO leader.

✗ Wrong forFrontier-largest model workloads (LPU memory constraints), Anthropic / OpenAI substrate buyers (different architecture), multi-model marketplace breadth requirements.

Pick Groq if: latency-as-cost-equation + per-token competitive + free tier fits your workload.

The Calling Matrix · siren-based ranking by who you are.

Most comparison sites refuse to forced-rank because their revenue depends on staying neutral. SideGuy ranks because it doesn't take vendor money. Here's the call by buyer persona.

🌱 If you're a Solo founder / indie dev under $100/month AI substrate budget

Your problem: You're prototyping or pre-revenue. You need AI substrate without committing to enterprise contracts. Free tiers + per-token pricing + pay-per-use is the right model.

Groq — generous free tier + competitive per-token pricing on supported models — best free-tier evaluation in the category
OpenRouter — pass-through pricing across 200+ models + transparent billing — best multi-model evaluation experience
Anthropic — $5 free credits for new accounts + per-token pricing scales linearly — production substrate without commitment
OpenAI — Free tier + per-token pricing across widest model range — fastest 0→prototype TCO
Replicate — pay-per-second + zero idle cost — best for low-volume bursty multimodal workloads

If forced to one pick: Anthropic — production substrate at indie scale, prompt caching unlocks 90% input cost reduction once you have stable prompts (the operator-honest TCO winner).

📈 If you're a Series A startup at $1K-10K/month AI substrate spend

Your problem: You have paying customers, real volume, real cost. Per-token math + prompt caching + batch discounts matter. You also want optionality (no single-provider lock-in) and your enterprise customers will start asking about SOC 2 + DPA + ZDR contracts soon. See the sister AI Coding Tools comparison for the dev-tool TCO decision.

Anthropic — prompt caching cuts input cost ~90% on stable system prompts + batch API ~50% discount — best per-token TCO for production trust
Together AI — cheapest per-token on Llama 70B / DeepSeek / Qwen for high-volume OSS workloads — pair with Anthropic for frontier reasoning
Fireworks AI — competitive OSS pricing + faster inference reduces UX cost cascade — Together's main competitor
OpenAI — GPT-4o cached input + batch API discounts + widest model range for tier flexibility
OpenRouter — multi-provider evaluation phase before committing to direct enterprise contracts

If forced to one pick: Anthropic for frontier reasoning + Together AI for high-volume OSS workloads — the operator-honest dual-substrate TCO pattern most production AI startups land on.

🏢 If you're a Mid-market at $10K-100K/month AI substrate spend

Your problem: You're 50-500 employees, real procurement, multi-cloud reality. You need enterprise contracts (BAA + DPA + ZDR + custom rate limits) + cloud-native bundle pricing + Provisioned Throughput / Committed Use Discounts. Cross-link to Compliance Authority Graph for the procurement gating frameworks.

AWS Bedrock — AWS EDP commitments apply across Bedrock + other AWS services + Provisioned Throughput for dedicated capacity — AWS-native TCO leader
Google Vertex AI — GCP Committed Use Discounts (15-30% off) + Gemini Flash cheapest frontier-vendor for high-volume workloads
Anthropic direct — Enterprise tier custom rate limits + ZDR contracts — direct API faster + cheaper than Bedrock if procurement allows
Azure OpenAI — Microsoft-shop EA bundle pricing + same OpenAI models inside Microsoft compliance umbrella
Together AI / Fireworks — for high-volume OSS workloads layered alongside frontier substrate

If forced to one pick: AWS Bedrock + Anthropic Claude Provisioned Throughput — the cleanest mid-market AWS-native TCO + procurement bundle pattern.

🏛 If you're a Enterprise at $100K+/month AI substrate spend across teams

Your problem: You're 1000+ employees, multi-cloud, central FinOps. AI substrate is becoming a top-10 SaaS line item. You need cloud-native EA bundle pricing + Provisioned Throughput + Committed Use Discounts + chargeback per team. Cross-link to /operator cockpit for the operator-layer view of multi-cloud AI spend.

AWS Bedrock — AWS EDP at this spend tier + Provisioned Throughput + multi-model marketplace = single-bill enterprise TCO leader for AWS shops
Google Vertex AI — GCP Enterprise Agreement + Committed Use Discounts + Gemini Flash high-volume cost leadership for GCP shops
Azure OpenAI — Microsoft EA bundle + GPT models inside Microsoft compliance umbrella for Microsoft shops
Anthropic direct Enterprise — for teams needing direct Claude access (faster than Bedrock by ~1-2 weeks on new models) with custom enterprise contracts
Together AI / Fireworks dedicated — for high-volume OSS classification + summarization workloads where cost dominates

If forced to one pick: AWS Bedrock + Google Vertex AI multi-cloud bundle TCO + dedicated OSS hosting (Together / Fireworks) for high-volume internal workloads — the enterprise dual-substrate cost-control pattern.

⚠ Operator-honest read

These rankings are SideGuy's lived-data + observed-buyer-pattern read as of 2026-05-11. They're directional, not gospel. The right answer for YOUR specific situation may diverge — text PJ for a 10-min operator-honest read on your actual buying context.

Vendor pricing + features + market positioning shift quarterly. SideGuy may earn referral commissions from some of these vendors, but rankings are independent — affiliate relationships never change rank order. Sister doctrines: /open/ live operator dashboard · install packs · operator network.

Or skip all of them. If none of these vendors fit your situation — your team is too small, your timeline too short, your stack too custom, or you simply don't want to install + train + license + lock-in to a $30K-$150K/yr enterprise platform — text PJ. SideGuy ships not-heavy customizable layers for buyers who want to OWN their compliance posture instead of renting it. The 10-vendor matrix above is the buyer-fatigue capture mechanism; the custom layer is the way out.

FAQ · most asked questions.

How does prompt caching change AI infrastructure TCO?

Prompt caching is the biggest cost-control lever in production AI as of 2026. Anthropic's prompt caching cuts input token cost by ~90% on cached prefixes — for production workloads with stable system prompts (compliance docs, codebase context, knowledge base), this is a 5-10x cost reduction on input tokens. OpenAI's cached input is more limited (~50% discount on automatic prefix caching). Google Vertex Gemini caching is improving. The TCO math: if your input-token spend is 10x your output-token spend (typical for retrieval-heavy workloads), and you cut input cost by 90%, your total bill drops by ~80%. Always architect for prompt caching when designing production AI workloads.

Per-token vs flat-rate vs serverless GPU vs self-host — which model wins on TCO?

Depends on workload pattern. (1) Per-token (Anthropic / OpenAI / Together / Fireworks / Groq) wins for variable-volume workloads where you'd otherwise pay for idle GPU capacity. (2) Provisioned Throughput / Dedicated Endpoints (Bedrock / Together / Fireworks) wins for high-volume sustained workloads where per-token math exceeds dedicated-capacity hourly billing. (3) Serverless GPU (Modal / Replicate) wins for custom inference pipelines + bursty workloads. (4) Self-host (Llama / DeepSeek on your own GPUs) wins ONLY at very high volume where infrastructure cost beats vendor markup AND you have ML ops capacity to maintain it. The honest answer for most teams in 2026: per-token + prompt caching is the right default until you have data showing dedicated capacity beats it.

What's the typical TCO beyond the API bill?

Beyond the per-token / GPU-hour bill, TCO includes: (1) prompt engineering + evaluation infrastructure time (often the dominant cost in early production), (2) enterprise compliance review (SOC 2 / DPA / BAA / ZDR negotiations) — typically 4-12 weeks of legal+security time for any new vendor, (3) admin onboarding (rate limit management, key rotation, IAM integration), (4) ongoing model evaluation as vendors ship new models (you should re-eval substrate quarterly minimum), (5) the cost of being wrong on substrate choice (switching providers in production = 4-12 weeks of engineering time). The API bill is usually 30-60% of true 3-year TCO for production AI workloads; the rest is people + procurement + ops.

Should I use OpenRouter to save money or pay direct?

Depends on volume. OpenRouter charges 5-15% margin on top of upstream provider pricing — for indie + low-volume workloads, the operational simplicity (one API, one bill, no vendor management, automatic fallback routing) is usually worth the margin. For production at scale, going direct to your primary provider (Anthropic / OpenAI / Bedrock / Vertex) wins on per-token cost and unlocks enterprise contract benefits (custom rate limits, ZDR, BAA, DPA, dedicated CSM) that OpenRouter can't broker. The honest answer: use OpenRouter for evaluation + low-volume + indie workloads; go direct for production volume where the 5-15% margin compounds.

What's the parallel-solutions doctrine for AI infrastructure pricing?

Buy from whatever vendor you want — but you're going to want a SideGuy. The parallel-solutions doctrine for AI infrastructure: pick whatever substrate fits your procurement (Anthropic direct, AWS Bedrock, Google Vertex, Azure OpenAI), AND build a custom layer above it for cost-control + multi-provider routing + prompt-caching architecture + workflow orchestration the standardized API can't handle. Vendor handles the substrate (model serving, scale); custom layer handles your unique cost-control + workflow logic forever. SideGuy ships the not-heavy customizable layer — ~$5K-$50K initial build for AI infrastructure cost-control + ~$1K-$10K/quarter recurring per buyer for substrate-upgrade-as-a-service. See Install Packs for productized scopes.

What other AI Infrastructure axes does SideGuy cover?

The AI Infrastructure cluster covers six operator-honest pages: 10-Way Megapage (Anthropic · OpenAI · Vertex · Bedrock · Together · Replicate · OpenRouter · Modal · Fireworks · Groq) · Operator-Honest Ratings axis (Quality of Support · Uptime · Roadmap Velocity · Operator-Honest Behavior) · Privacy + Self-Host axis (ZDR contracts · BAA · data residency · air-gapped) · Inference Speed + Latency axis (sub-100ms · tokens-per-second · batched) · Multi-Provider Routing + Vendor Lock-In axis (OpenRouter · Bedrock multi-model · Vertex multi-model). Plus the sister cluster: AI Coding Tools 10-Way Megapage. And the broader graphs: Compliance Authority Graph · Operator Cockpit · Install Packs. Same operator-honest doctrine across every page: no vendor sponsorship, siren-based ranking by buyer persona, parallel-solutions custom-layer pitch (buy from whatever vendor you want — but you're going to want a SideGuy).

Stuck choosing? Text PJ.

10-minute operator-honest read on your actual buying context. No deck, no demo call, no signup. If we're not the right fit, we'll say so.

📱 Text PJ · 858-461-8054

Audit in 6 weeks? Enterprise customer waiting? Regulator finding?

Skip the 5 vendor demos. 30-day delivery. No procurement cycle. No demo theater. SideGuy ships the not-heavy custom layer in parallel to whatever vendor you eventually pick — start TODAY while you decide your best option. Custom builds in 30 days →

📱 Urgent? Text PJ · 858-461-8054

Field Notes · from the SideGuy operator.

Lived-data observations PJ has logged from running this stack. Pulled from data/field-notes.json (Round 37 — Field Notes Engine). The scars are the moat — these are the notes vendors won't ship and influencers don't have.

FIELD NOTE #4 Confidence · High

Anthropic Batch API saves 50% on cost but adds a 4-hour latency tail. Plan workloads accordingly.

ai-infrastructure · batch-vs-realtime · anthropic · added 2026-05-11
FIELD NOTE #1 Confidence · High

Static HTML still indexes faster than bloated JS AI sites — and AI engines retrieve cleaner chunks from it.

retrieval · static-html · aeo · added 2026-05-11
FIELD NOTE #2 Confidence · High

Most observability stacks fail from late instrumentation. Wire it before you need it.

llm-observability · operator-wisdom · added 2026-05-11

Related Reading · operator-curated cross-links.

Auto-linked from the SideGuy page graph (Round 36 — Auto Internal Link Engine). Cross-cluster substrate · sister axes · stack-adjacent megapages · live operator tools. Last refreshed 2026-05-11.

You can go at it without SideGuy — but no custom shareables for your friends & family. You'll be short a bag of laughs. 🌸

I'm almost positive I can help. If I can't, you don't pay.

No signup. No seminar. No bullshit.

— PJ · 858-461-8054

Text PJ 858-461-8054

🎁 Didn't quite find it?

Don't see what you were looking for?

Text PJ a sentence about what you actually need — I'll build you a free custom shareable on the house. No email, no funnel, no SOW.

📲 Text PJ — free shareable

~10 min turnaround. Your friends will love it.