LLM Pricing Operations
Practical advice for running LLM pricing in production without surprises.
Keep model tables current
For `type: token-based` routes, maintain a `models` map for the models you care about. New model IDs appear frequently, and their upstream costs may not fit the margin you priced for.

    routes:
      "POST /v1/chat/completions":
        upstream: openai
        type: token-based
        models:
          gpt-4o: "$0.05"
          gpt-4o-mini: "$0.005"
          claude-sonnet-4-5: "$0.02"
        fallback: "$0.02" # safe default for unknown models

A conservative fallback prevents silent undercharging when new models appear.
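As a rough illustration of how a fallback behaves, the sketch below resolves a price for a model ID against a map shaped like the YAML above. `TokenRoute`, `parseDollars`, and `resolvePrice` are hypothetical names used only for this example, not the gateway's real API, and the unit of each price is whatever the route config defines.

```typescript
// Hypothetical shape mirroring the YAML example; not the gateway's real types.
type TokenRoute = {
  models: Record<string, string>;
  fallback?: string;
};

// Parse a "$0.05"-style string into a number of dollars.
function parseDollars(price: string): number {
  const match = /^\$(\d+(?:\.\d+)?)$/.exec(price);
  if (!match) {
    throw new Error(`Invalid price: ${price}`);
  }
  return Number(match[1]);
}

// Return the listed price for a model, or the fallback when the model is unlisted.
function resolvePrice(route: TokenRoute, model: string): number {
  // hasOwnProperty guards against inherited keys such as "toString".
  const listed = Object.prototype.hasOwnProperty.call(route.models, model)
    ? route.models[model]
    : undefined;
  const price = listed ?? route.fallback;
  if (price === undefined) {
    throw new Error(`No price or fallback for model: ${model}`);
  }
  return parseDollars(price);
}

const route: TokenRoute = {
  models: {
    "gpt-4o": "$0.05",
    "gpt-4o-mini": "$0.005",
    "claude-sonnet-4-5": "$0.02",
  },
  fallback: "$0.02",
};

const knownPrice = resolvePrice(route, "gpt-4o");
const unknownPrice = resolvePrice(route, "some-new-model"); // falls back
```

In this sketch an unlisted model is never free: it either gets the fallback price or raises an error when no fallback is set.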
Handle fine-tunes and custom model IDs
Fine-tunes often use custom IDs (`ft:*`, provider suffixes, deployment names). Map known IDs explicitly:

    routes:
      "POST /v1/chat/completions":
        upstream: openai
        type: token-based
        models:
          gpt-4o: "$0.05"
          ft:gpt-4o:acme-support-v2: "$0.08"
          ft:gpt-4o-mini:triage: "$0.015"
        fallback: "$0.03"

For fail-closed behavior on unknown models, reject them in a hook:

    const allowed = new Set(["gpt-4o", "gpt-4o-mini", "ft:gpt-4o:acme-support-v2"]);

    export default async (ctx) => {
      const body = (ctx.req.body ?? {}) as { model?: string };
      const model = body.model;
      if (!model || !allowed.has(model)) {
        return { reject: true, status: 400, body: "Unsupported model" };
      }
    };

Flat vs token-based pricing
| Basis | Best for | Tradeoff |
|---|---|---|
| Token-based (`type: token-based`) | Chat/completion APIs with variable output | Price varies per request |
| Flat per request (`price` or `match`) | Short, bounded requests | Heavy requests can erode margin |
Start with token-based for general LLM APIs. Use flat pricing only when output size is tightly bounded.
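To make the margin tradeoff concrete, here is a simplified comparison. Every number and name in it (`upstreamCostPer1kTokens`, `flatPrice`, `markup`) is an assumption chosen for illustration, and modelling token-based pricing as a fixed markup over upstream cost is a simplification, not a description of how the gateway computes prices.

```typescript
// All values are hypothetical and exist only to illustrate the tradeoff.
const upstreamCostPer1kTokens = 0.01; // assumed provider cost in dollars
const flatPrice = 0.02;               // assumed flat per-request price in dollars
const markup = 1.5;                   // assumed token-based price = 1.5x upstream cost

// Upstream cost of a request that consumes `tokens` tokens.
function upstreamCost(tokens: number): number {
  return (tokens / 1000) * upstreamCostPer1kTokens;
}

// Margin under flat pricing: fixed revenue minus variable cost.
function flatMargin(tokens: number): number {
  return flatPrice - upstreamCost(tokens);
}

// Margin under the simplified token-based model: revenue scales with cost.
function tokenBasedMargin(tokens: number): number {
  return upstreamCost(tokens) * (markup - 1);
}

// Request size at which a flat price stops covering upstream cost.
const breakEvenTokens = (flatPrice / upstreamCostPer1kTokens) * 1000;

for (const tokens of [500, 2000, 4000]) {
  console.log(
    `${tokens} tokens: flat margin ${flatMargin(tokens).toFixed(4)}, ` +
      `token-based margin ${tokenBasedMargin(tokens).toFixed(4)}`,
  );
}
```

Under these assumptions the flat margin goes negative once a request exceeds `breakEvenTokens`, while the token-based margin stays positive at any size. That is why flat pricing suits only routes whose output is tightly bounded.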
Guardrails
- Always set a conservative `fallback` for unknown models.
- Keep an allowlist for approved model IDs on high-risk routes.
- Prefer `settlement: after-response` on expensive upstreams where failures are common. See Refund Protection.
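The first two guardrails can be checked mechanically before deploying a config. The sketch below is one possible approach, assuming the parsed config has the shape of the YAML examples on this page. `RouteConfig`, `checkGuardrails`, and the per-route `allowlists` argument are hypothetical and not part of the gateway.

```typescript
// Hypothetical config shape based on the YAML examples; not the gateway's real schema.
type RouteConfig = {
  type?: string;
  models?: Record<string, string>;
  fallback?: string;
};

// Return a warning for each token-based route that lacks a fallback, and for
// each allowlisted model that has no explicit price in that route's models map.
function checkGuardrails(
  routes: Record<string, RouteConfig>,
  allowlists: Record<string, string[]> = {},
): string[] {
  const warnings: string[] = [];
  for (const [name, route] of Object.entries(routes)) {
    if (route.type !== "token-based") continue;
    if (!route.fallback) {
      warnings.push(`${name}: no fallback price set`);
    }
    const priced = new Set(Object.keys(route.models ?? {}));
    for (const model of allowlists[name] ?? []) {
      if (!priced.has(model)) {
        warnings.push(`${name}: allowlisted model "${model}" has no explicit price`);
      }
    }
  }
  return warnings;
}

const routeName = "POST /v1/chat/completions";
const warnings = checkGuardrails(
  { [routeName]: { type: "token-based", models: { "gpt-4o": "$0.05" } } },
  { [routeName]: ["gpt-4o", "gpt-4o-mini"] },
);
for (const warning of warnings) {
  console.log(warning);
}
```

Running a check like this in CI is one way to catch an allowlist and a price table drifting apart. It does not replace the runtime hook that rejects unknown models.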
Related: